Sleeping Beauty: an Accuracy-based Approach

post by glauberdebona · 2025-02-10T15:40:29.619Z

Contents

  Summary
  Introduction
  Formal Setting
  Single Experiment
  Multiple Experiments 
    Minimizing the Expected Total Inaccuracy
    Minimizing the Expected Sum of Experiments' Mean Inaccuracies
    Minimizing the Expected Average Inaccuracy across Awakenings
  Concluding Remarks

This post does not propose a solution to the Sleeping Beauty problem. Instead, it presents accuracy-based arguments for thirders, halfers and double-halfers. A more detailed draft paper can be found here.

Summary

Accuracy-based arguments claim that one should plan to adopt posterior credences that maximize expected accuracy, where the expectation is taken according to prior credences. When this approach is applied to the Sleeping Beauty problem, the solution depends on how we aggregate the accuracy of the credences held in the two indistinguishable awakenings when the coin lands Tails. Kierland and Monton (2005) have shown that, with the Brier score, averaging accuracy yields halving (credence $1/2$ in Heads) and summing leads to thirding (credence $1/3$). We generalize that result to any strictly proper scoring rule. With multiple, repeated experiments, we show that accuracy can also be averaged in a different way. Under this method the solution moves from $1/2$ toward $1/3$ as the number of experiments increases indefinitely.

Introduction

The Sleeping Beauty problem[1] is about updating credences and can be approached via arguments based on accuracy (epistemic utility). When an agent is about to learn a proposition from a partition, it is well known that they should plan to update via conditionalization in order to minimize the expected inaccuracy of their posterior credences. Similarly, before being put to sleep on Sunday, Sleeping Beauty (SB) can plan to update her credence in the coin landing Heads upon waking on Monday in a way that minimizes the expected inaccuracy of that credence.

In the SB setup, we are interested in her credence that the coin landed Heads upon waking on Monday. However, if the coin lands Tails, her awakening on Tuesday is indistinguishable from the one on Monday, and presumably she would assign the same credences as on Monday. In the Tails-world, there are therefore two identical credences to take into account when computing inaccuracy, one from Monday and one from Tuesday. The question is then how SB should aggregate the inaccuracies of her two indistinguishable temporal parts (awakenings) in the possible world where the coin lands Tails. It has been shown that, when the Brier (quadratic) score is employed, averaging the inaccuracies yields the halfers' solution, while summing them leads to the thirders'. These results can actually be extended to any strictly proper scoring rule.

More generally, we explore a repeated version[2] of the SB problem where the experiment is conducted $n$ times, over $n$ weeks, with $n$ coin tosses. SB is awakened only on Mondays and (possibly) on Tuesdays, and does not remember previous awakenings. For instance, if the coin lands Heads in the first experiment, she is awakened and put back to sleep on the first Monday. She is then awakened on the next Monday to start the second experiment, with another coin toss. All awakenings in the series of experiments are thus indistinguishable to her. Upon each awakening, we seek the credence she should assign to the coin landing Heads for that experiment. In this setting, we can again sum the inaccuracies in a possible world, but there are two ways to average. We can take the mean inaccuracy per experiment and then average those means. Alternatively, the average can be taken across all awakenings in a possible world, regardless of the experiments. We present the credences SB should assign to minimize the expected inaccuracy under each of these three aggregation methods.

Formal Setting

With $n$ experiments, we have a set $W$ with $2^n$ possible worlds, since each coin toss can result in either Heads or Tails. A credence function is a mapping $c$ from propositions (subsets of $W$) to real numbers. We use $H_i \subseteq W$ to denote the set of worlds (proposition) where the coin toss for the $i$-th experiment landed Heads.

The prior credences assigned by SB just before she is first put to sleep are denoted by the credence function $c_0$. As the coin is fair and the coin tosses are independent, we assume that all possible worlds are a priori equiprobable to SB. Formally, $c_0(\{w\}) = 1/2^n$ for all $w \in W$. Since each $H_i$ contains $2^{n-1}$ worlds, this implies $c_0(H_i) = 1/2$ for all $1 \le i \le n$.
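To make the setting concrete, here is a minimal Python sketch that enumerates the $2^n$ worlds and builds the uniform prior $c_0$ with exact fractions. Encoding worlds as tuples of `'H'`/`'T'` outcomes is an illustrative assumption. The helper names `worlds`, `prior` and `credence_heads` are hypothetical, and experiments are indexed from 0 in the code rather than from 1.

```python
from fractions import Fraction
from itertools import product


def worlds(n):
    """All 2**n possible worlds: tuples of 'H'/'T', one coin outcome per experiment."""
    return list(product("HT", repeat=n))


def prior(n):
    """Uniform prior c_0: every world gets credence 1/2**n."""
    ws = worlds(n)
    return {w: Fraction(1, len(ws)) for w in ws}


def credence_heads(c0, i):
    """c_0(H_i): prior mass of the worlds where experiment i (0-based) landed Heads."""
    return sum(p for w, p in c0.items() if w[i] == "H")


c0 = prior(3)
print(len(c0))  # 8
print(all(credence_heads(c0, i) == Fraction(1, 2) for i in range(3)))  # True
```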

The posterior credences assigned upon each awakening during the $i$-th experiment are denoted by the credence function $c_i$[3]. In the $i$-th experiment, the credence of interest, which answers the SB problem's question, is $c_i(H_i)$[4]. Since all awakenings are indistinguishable to SB, we assume that, for some $x \in [0, 1]$, $c_i(H_i) = x$ for all $1 \le i \le n$. The (repeated) Sleeping Beauty problem is to determine the rational value for $x$.

To measure inaccuracy, we use a scoring rule $s(x, v)$ that maps the numeric credence $x$ in a proposition and its truth value $v$ (0 = false, 1 = true) to a non-negative real number. We assume $s$ is strictly proper[5], as are the Brier (quadratic) score[6] and the logarithmic score. Each awakening corresponds to a credence of interest (in the $i$-th experiment, $c_i(H_i) = x$). The random variable $I_x(w)$ denotes some aggregation of their inaccuracies in a possible world $w \in W$.
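The following Python sketch gives a quick numerical illustration of strict propriety. It assumes the standard forms of the two rules mentioned above: $s(x, v) = (x - v)^2$ for Brier, and $s(x,1) = -\ln x$, $s(x,0) = -\ln(1 - x)$ for the logarithmic score. For a few values of the probability $p$, it minimizes the expected score over a grid of credences. The grid search and the function names are illustrative assumptions, and a grid check illustrates the property but does not prove it.

```python
import math


def brier(x, v):
    """Brier (quadratic) score of credence x in a proposition with truth value v (0 or 1)."""
    return (x - v) ** 2


def log_score(x, v):
    """Logarithmic score: minus the log of the credence given to the actual truth value."""
    return -math.log(x if v == 1 else 1 - x)


def expected_score(s, x, p):
    """Expected inaccuracy of credence x if the proposition is true with probability p."""
    return p * s(x, 1) + (1 - p) * s(x, 0)


def argmin_on_grid(f, steps=1000):
    """Grid minimizer over {1/steps, ..., (steps-1)/steps}; skips 0 and 1 to avoid log(0)."""
    return min((i / steps for i in range(1, steps)), key=f)


# Strict propriety: for each p, the expected score should be minimized at x = p.
for s in (brier, log_score):
    for p in (0.1, 0.3, 0.5, 0.8):
        best = argmin_on_grid(lambda x: expected_score(s, x, p))
        print(f"{s.__name__}: p = {p}, minimizer = {best}")
```

For both rules, the printed minimizer coincides with $p$ for each value tried.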

Single Experiment

Considering only one experiment, the set $W$ of possible worlds contains only two worlds, which are equiprobable to SB on Sunday: $w_H$ has one awakening (Heads) and $w_T$ has two (Tails). In $w_H$, the inaccuracy of $x$ (the credence of interest) is simply $s(x, 1)$. In $w_T$, there are two awakenings, each yielding inaccuracy $s(x, 0)$ for the credence of interest. To pick an optimal $x$, one has to aggregate the inaccuracies of the two awakenings in $w_T$ somehow. We assume this aggregation is done either via the sum, giving an aggregate inaccuracy of $2\,s(x,0)$ in $w_T$, or via the arithmetic mean, giving $s(x,0)$. These are probably the two simplest choices among many[7].

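When the aggregate inaccuracy $I_x$ is computed by summing the inaccuracies across indistinguishable awakenings, the expected inaccuracy is:

$$\mathbb{E}_{c_0}[I_x] = \frac{1}{2}\,s(x,1) + \frac{1}{2}\cdot 2\,s(x,0) = \frac{3}{2}\left[\frac{1}{3}\,s(x,1) + \frac{2}{3}\,s(x,0)\right].$$

For a strictly proper $s$, only $x = 1/3$ minimizes this expression.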
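The following Python sketch checks this result numerically for the Brier score and the logarithmic score, assuming the standard logarithmic form (minus the log of the credence assigned to the true value). It evaluates the expected summed inaccuracy on a grid whose spacing makes $1/3$ a grid point. The helper names are hypothetical.

```python
import math


def brier(x, v):
    """Brier score: (x - v)^2."""
    return (x - v) ** 2


def log_score(x, v):
    """Logarithmic score: -ln of the credence given to the actual truth value."""
    return -math.log(x if v == 1 else 1 - x)


def expected_summed_inaccuracy(s, x):
    """E_{c_0}[I_x] for one experiment when the two Tails awakenings are summed."""
    heads_world = s(x, 1)      # one awakening, H_1 true
    tails_world = 2 * s(x, 0)  # two awakenings, H_1 false, inaccuracies added
    return 0.5 * heads_world + 0.5 * tails_world


def argmin_on_grid(f, steps=999):
    """Minimize f over {1/steps, ..., (steps-1)/steps}; 999 is a multiple of 3, so 1/3 is on the grid."""
    return min((i / steps for i in range(1, steps)), key=f)


for s in (brier, log_score):
    print(s.__name__, argmin_on_grid(lambda x: expected_summed_inaccuracy(s, x)))
```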

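Alternatively, if $I_x$ is the arithmetic mean of the inaccuracies of $x$ held in different, indistinguishable awakenings in the same possible world, the expected inaccuracy of $x$ is given by:

$$\mathbb{E}_{c_0}[I_x] = \frac{1}{2}\,s(x,1) + \frac{1}{2}\cdot\frac{2\,s(x,0)}{2} = \frac{1}{2}\,s(x,1) + \frac{1}{2}\,s(x,0).$$

Since $s$ is strictly proper, only $x = 1/2$ minimizes the expression above.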
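Both single-experiment derivations use the same step. If an expected inaccuracy has the form $a\,s(x,1) + b\,s(x,0)$ with $a, b > 0$, it equals $(a+b)\left[p\,s(x,1) + (1-p)\,s(x,0)\right]$ with $p = a/(a+b)$. Strict propriety therefore makes $x = a/(a+b)$ its unique minimizer. The following Python sketch does this bookkeeping with exact fractions. The helper name `optimal_credence` is hypothetical.

```python
from fractions import Fraction


def optimal_credence(a, b):
    """Unique minimizer of a*s(x,1) + b*s(x,0) over x, for any strictly proper s and a, b > 0."""
    a, b = Fraction(a), Fraction(b)
    return a / (a + b)


half = Fraction(1, 2)  # c_0 probability of each of the two worlds

# Summing: E[I_x] = 1/2 * s(x,1) + 1/2 * (2 * s(x,0))
print(optimal_credence(half, half * 2))      # 1/3

# Averaging: E[I_x] = 1/2 * s(x,1) + 1/2 * (2 * s(x,0)) / 2
print(optimal_credence(half, half * 2 / 2))  # 1/2
```

For the Brier score, this can also be checked directly. Setting the derivative of $a(1-x)^2 + b\,x^2$ to zero gives $x = a/(a+b)$.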

These results generalize the findings of Kierland and Monton, who considered only the Brier score. They see both aggregation methods as permissible, but as serving different goals. Averaging minimizes the expected inaccuracy on Monday, while summing weighs all possible awakenings (Monday-Heads, Monday-Tails and Tuesday-Tails) equally.

Multiple Experiments 

When the experiment is repeated $n$ times, each possible world contains multiple awakenings from different experiments, all of which must be considered when measuring inaccuracy. In each awakening, we have a credence of interest $c_i(H_i) = x$ whose inaccuracy we want to minimize. Thus, to measure inaccuracy in a possible world, we need to aggregate the inaccuracies corresponding to different awakenings, possibly from different experiments and different credences of interest.

To illustrate three possible aggregation approaches, we take as an example the world $w_{HT}$ (with $n = 2$), where the first experiment has one awakening (Heads) and the second has two (Tails). In the first experiment's awakening, the credence $c_1(H_1) = x$ has inaccuracy $s(x, 1)$. In the second experiment, there are two awakenings, and in each the credence $c_2(H_2) = x$ has inaccuracy $s(x, 0)$.
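Every aggregation considered in this section is a weighted sum of $s(x,1)$ and $s(x,0)$ terms, so an aggregate inaccuracy can be recorded by its two coefficients. The following Python sketch computes the total, the sum of per-experiment means and the average across awakenings for the example world $w_{HT}$. The `'H'`/`'T'` world encoding and all helper names are assumptions made for illustration.

```python
from fractions import Fraction


def truth_values_per_experiment(world):
    """Truth value of H_i scored at each awakening: Heads -> [1] (one awakening), Tails -> [0, 0]."""
    return [[1] if outcome == "H" else [0, 0] for outcome in world]


def as_coefficients(truth_values, weight=Fraction(1)):
    """weight * sum of s(x, v), stored as (coefficient of s(x,1), coefficient of s(x,0))."""
    a = sum(weight for v in truth_values if v == 1)
    b = sum(weight for v in truth_values if v == 0)
    return a, b


def total(world):
    flat = [v for exp in truth_values_per_experiment(world) for v in exp]
    return as_coefficients(flat)


def sum_of_means(world):
    means = [as_coefficients(exp, Fraction(1, len(exp))) for exp in truth_values_per_experiment(world)]
    return sum(m[0] for m in means), sum(m[1] for m in means)


def average_over_awakenings(world):
    flat = [v for exp in truth_values_per_experiment(world) for v in exp]
    return as_coefficients(flat, Fraction(1, len(flat)))


w_HT = ("H", "T")
for name, aggregate in (("total", total), ("sum of means", sum_of_means), ("average", average_over_awakenings)):
    a, b = aggregate(w_HT)
    print(f"{name}: {a} * s(x,1) + {b} * s(x,0)")
```

For $w_{HT}$, the coefficient pairs are $(1, 2)$ for the total, $(1, 1)$ for the sum of means and $(1/3, 2/3)$ for the average across awakenings.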

Minimizing the Expected Total Inaccuracy

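One way to measure how inaccurate Sleeping Beauty is in the world $w_{HT}$ is to sum the three inaccuracy measurements for $c_1(H_1)$ and $c_2(H_2)$, which yields $s(x,1) + 2\,s(x,0)$. In general, the total inaccuracy in a possible world is the sum of the total inaccuracies per experiment. Let $T_i$ be a random variable denoting the total inaccuracy of $x$ in the $i$-th experiment. The expected total inaccuracy can then be written as $\mathbb{E}_{c_0}\left[\sum_{i=1}^{n} T_i\right]$. By the linearity of expectation, this becomes $\sum_{i=1}^{n} \mathbb{E}_{c_0}[T_i]$. Any experiment has one awakening in half of the possible worlds, where $T_i = s(x,1)$, and two awakenings in the other half, where $T_i = 2\,s(x,0)$. As possible worlds are equiprobable according to $c_0$, we have $\mathbb{E}_{c_0}[T_i] = \frac{1}{2}\,s(x,1) + s(x,0)$ for all $i$, yielding:

$$\mathbb{E}_{c_0}\left[\sum_{i=1}^{n} T_i\right] = n\left(\frac{1}{2}\,s(x,1) + s(x,0)\right).$$

For any positive $n$, only $x = 1/3$ minimizes this expression, since $s$ is strictly proper.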
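The following Python code is a brute-force sanity check, not a proof. For small $n$, it enumerates all $2^n$ equiprobable worlds and accumulates the expected total inaccuracy exactly as a pair of coefficients $(a, b)$ on $s(x,1)$ and $s(x,0)$. The expectation has the form $a\,s(x,1) + b\,s(x,0) = (a+b)\left[p\,s(x,1) + (1-p)\,s(x,0)\right]$ with $p = a/(a+b)$, so strict propriety places its unique minimizer at $x = a/(a+b)$. The `'H'`/`'T'` world encoding and the function name are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product


def expected_total_coefficients(n):
    """Exact E_{c_0}[total inaccuracy] as (coef of s(x,1), coef of s(x,0)), enumerating 2**n worlds."""
    a = b = Fraction(0)
    weight = Fraction(1, 2 ** n)   # uniform prior c_0
    for world in product("HT", repeat=n):
        heads = world.count("H")   # experiments with one awakening, each adds s(x,1)
        tails = n - heads          # experiments with two awakenings, each adds 2*s(x,0)
        a += weight * heads
        b += weight * 2 * tails
    return a, b


for n in range(1, 7):
    a, b = expected_total_coefficients(n)
    print(n, a, b, a / (a + b))    # a/(a+b) is the minimizer for any strictly proper s
```

For every $n$ tried, this gives $a = n/2$ and $b = n$, matching $n\left(\frac{1}{2}\,s(x,1) + s(x,0)\right)$.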

Minimizing the Expected Sum of Experiments' Mean Inaccuracies

A second aggregation approach is to sum (or average[8]) the mean inaccuracies of the experiments. In $w_{HT}$, the second experiment's mean inaccuracy is $\frac{2\,s(x,0)}{2} = s(x,0)$. Adding the first experiment's awakening inaccuracy gives $s(x,1) + s(x,0)$. In general, this sum can again be decomposed per experiment, and linearity of expectation can be applied as before. If $M_i$ is the random variable denoting the mean inaccuracy of $x$ in the $i$-th experiment, the sum of the experiments' mean inaccuracies is $\sum_{i=1}^{n} M_i$. Since $M_i$ is $s(x,1)$ in exactly half of the worlds and $s(x,0)$ in the other half, we have $\mathbb{E}_{c_0}[M_i] = \frac{1}{2}\,s(x,1) + \frac{1}{2}\,s(x,0)$. Summing over all experiments, the expected sum of the experiments' mean inaccuracies is:

$$\mathbb{E}_{c_0}\left[\sum_{i=1}^{n} M_i\right] = \frac{n}{2}\left(s(x,1) + s(x,0)\right).$$

For any positive $n$, this expression is minimized only at $x = 1/2$, since $s$ is strictly proper.
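The following Python code is a brute-force check under stated assumptions: worlds are encoded as `'H'`/`'T'` tuples, and the function name is hypothetical. It enumerates all $2^n$ equiprobable worlds and tracks the expected sum of the experiments' mean inaccuracies exactly as coefficients $(a, b)$ on $s(x,1)$ and $s(x,0)$. For a strictly proper $s$, the minimizer of $a\,s(x,1) + b\,s(x,0)$ is $x = a/(a+b)$.

```python
from fractions import Fraction
from itertools import product


def expected_sum_of_means_coefficients(n):
    """Exact E_{c_0}[sum of per-experiment mean inaccuracies] as (coef of s(x,1), coef of s(x,0))."""
    a = b = Fraction(0)
    weight = Fraction(1, 2 ** n)   # uniform prior c_0
    for world in product("HT", repeat=n):
        heads = world.count("H")   # each Heads experiment has mean inaccuracy s(x,1)
        tails = n - heads          # each Tails experiment: mean of two s(x,0) terms is s(x,0)
        a += weight * heads
        b += weight * tails
    return a, b


for n in range(1, 7):
    a, b = expected_sum_of_means_coefficients(n)
    print(n, a, b, a / (a + b))    # a/(a+b) is the minimizer for any strictly proper s
```

Dividing both coefficients by $n$ (averaging the means instead of summing them) leaves $a/(a+b)$ unchanged.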

Minimizing the Expected Average Inaccuracy across Awakenings

A third aggregation option is to take the arithmetic mean of the inaccuracies of all awakenings' credences of interest ($c_i(H_i) = x$ in the $i$-th experiment) in a possible world. For $w_{HT}$, this approach yields an average inaccuracy of $\frac{s(x,1) + 2\,s(x,0)}{3}$. Minimizing the expected value of this average is not equivalent to minimizing the expected total inaccuracy, because the denominator (the number of awakenings) varies across worlds. As a result, the expected inaccuracy can no longer be decomposed across experiments, and we need a different approach.

Given $n$ and $x$, the average inaccuracy of the credences of interest in a possible world depends only on how many experiments have one awakening and how many have two. In any world where exactly $k$ experiments have two awakenings (Tails), there are $2k$ such awakenings, occurring in pairs within the same experiment, each yielding inaccuracy $s(x,0)$ for the credence of interest. Each of the remaining $n - k$ experiments has a single awakening, where the inaccuracy of the credence of interest is $s(x,1)$. The aggregate inaccuracy $I_x$ can therefore be written using the random variable $K$ that maps worlds to the number of experiments with two awakenings:

$$I_x = \frac{(n - K)\,s(x,1) + 2K\,s(x,0)}{n + K}.$$

Note that $K$ follows a binomial distribution with $n$ trials and probability $1/2$. That is, according to $c_0$, the probability that exactly $k$ experiments have two awakenings (coin tosses landing Tails) is $\binom{n}{k}/2^n$. Therefore, the expected inaccuracy is given by:

$$\mathbb{E}_{c_0}[I_x] = \sum_{k=0}^{n} \frac{\binom{n}{k}}{2^n}\cdot\frac{(n - k)\,s(x,1) + 2k\,s(x,0)}{n + k}.$$
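As a cross-check of the binomial form, the following Python sketch computes the coefficients of $s(x,1)$ and $s(x,0)$ in $\mathbb{E}_{c_0}[I_x]$ in two ways with exact fractions. The first method enumerates every world, and the second uses the sum over $k$. The function names and the `'H'`/`'T'` world encoding are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb


def brute_force_coefficients(n):
    """E_{c_0}[average inaccuracy across awakenings], enumerating every world."""
    a = b = Fraction(0)
    for world in product("HT", repeat=n):
        k = world.count("T")           # experiments with two awakenings
        awakenings = n + k
        a += Fraction(n - k, awakenings) / 2 ** n
        b += Fraction(2 * k, awakenings) / 2 ** n
    return a, b


def binomial_coefficients(n):
    """Same expectation via K ~ Binomial(n, 1/2), with P(K = k) = C(n, k) / 2**n."""
    a = sum(Fraction(comb(n, k), 2 ** n) * Fraction(n - k, n + k) for k in range(n + 1))
    b = sum(Fraction(comb(n, k), 2 ** n) * Fraction(2 * k, n + k) for k in range(n + 1))
    return a, b


for n in range(1, 8):
    assert brute_force_coefficients(n) == binomial_coefficients(n)
print("brute force and binomial formula agree for n = 1..7")
```

The two coefficients always add up to 1, because $(n - k) + 2k = n + k$.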

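Setting $n = 1$, we obtain an expression minimized only at $x = 1/2$:

$$\mathbb{E}_{c_0}[I_x] = \frac{1}{2}\,s(x,1) + \frac{1}{2}\cdot\frac{2\,s(x,0)}{2} = \frac{1}{2}\,s(x,1) + \frac{1}{2}\,s(x,0).$$

This is no surprise, since averaging across awakenings amounts to taking the mean inaccuracy per experiment when $n = 1$. When we set $n = 2$, the expected inaccuracy is:

$$\mathbb{E}_{c_0}[I_x] = \frac{1}{4}\cdot\frac{2\,s(x,1)}{2} + \frac{2}{4}\cdot\frac{s(x,1) + 2\,s(x,0)}{3} + \frac{1}{4}\cdot\frac{4\,s(x,0)}{4} = \frac{5}{12}\,s(x,1) + \frac{7}{12}\,s(x,0).$$

That is, only $x = 5/12$ minimizes the expected inaccuracy. This aligns with the result from Bostrom's hybrid model.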
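The coefficients of $s(x,1)$ and $s(x,0)$ in $\mathbb{E}_{c_0}[I_x]$ add up to 1, because $(n - k) + 2k = n + k$. Strict propriety then puts the unique minimizer at the coefficient of $s(x,1)$, that is, at $x = \mathbb{E}_{c_0}\left[\frac{n-K}{n+K}\right]$. The following Python sketch computes this value exactly for small $n$. The function name is hypothetical.

```python
from fractions import Fraction
from math import comb


def optimal_x_average(n):
    """Minimizer of E[average inaccuracy across awakenings] for any strictly proper s:
    E[(n - K) / (n + K)] with K ~ Binomial(n, 1/2), computed exactly."""
    return sum(Fraction(comb(n, k), 2 ** n) * Fraction(n - k, n + k) for k in range(n + 1))


print(optimal_x_average(1))  # 1/2
print(optimal_x_average(2))  # 5/12
```

For $n = 3$, the same function gives $31/80$.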

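One can see that the optimal $x$ changes with $n$ (from $1/2$ for $n = 1$ to $5/12$ for $n = 2$) and ask what happens in the limit. Informally, we can argue that the binomial random variable $K$ concentrates around $n/2$ as $n$ tends to infinity[9]. Replacing $K$ by $n/2$ in the expression for $I_x$, we obtain:

$$\frac{(n - n/2)\,s(x,1) + 2(n/2)\,s(x,0)}{n + n/2} = \frac{1}{3}\,s(x,1) + \frac{2}{3}\,s(x,0).$$

That is, when exactly half of the experiments have two awakenings, $x = 1/3$ is optimal. The limit as $n$ tends to infinity can indeed be formally proven[10], so we can write:

$$\lim_{n \to \infty} \mathbb{E}_{c_0}[I_x] = \frac{1}{3}\,s(x,1) + \frac{2}{3}\,s(x,0).$$

Consequently, as $n \to \infty$, only $x = 1/3$ minimizes the expected average inaccuracy across awakenings. Again, this result aligns with Bostrom's hybrid model.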
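For finite $n$, the coefficients of $s(x,1)$ and $s(x,0)$ in $\mathbb{E}_{c_0}[I_x]$ sum to 1. The optimal credence is therefore $x^*_n = \mathbb{E}_{c_0}\left[\frac{n-K}{n+K}\right]$. The following Python sketch evaluates it in floating point for growing $n$ to illustrate the convergence to $1/3$. The function name is hypothetical.

```python
from math import comb


def optimal_x_average(n):
    """E[(n - K) / (n + K)] for K ~ Binomial(n, 1/2), in floating point."""
    return sum(comb(n, k) / 2 ** n * (n - k) / (n + k) for k in range(n + 1))


for n in (1, 2, 10, 100, 1000):
    print(n, optimal_x_average(n))
```

The map $k \mapsto (n-k)/(n+k)$ is strictly convex, so Jensen's inequality gives $x^*_n > 1/3$ for every finite $n$. The printed values therefore approach $1/3$ from above.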

Concluding Remarks

For the repeated Sleeping Beauty problem, we presented three ways of aggregating the inaccuracies of credences assigned in different, indistinguishable awakenings in a given possible world. Minimizing the total inaccuracy supports the thirders' argument. Taking first the mean inaccuracy per experiment and then minimizing the sum or average of these means leads to the halfers' solution. In that solution, the credences are not updated at all, as when conditionalizing on a tautology. Taking the average across awakenings agrees with Bostrom's hybrid model, which is a double-halfer solution[11]. Although these results by themselves do not point to a solution, they provide another battleground for the dispute, one in which maximizing accuracy (epistemic utility) is the agent's goal.

  1. Interesting discussions of the problem can be found, for instance, here, here or here.

  2. Our repeated version is equivalent to Bostrom's N-fold version, and differs from the Repeated (and improved) Sleeping Beauty problem.

  3. Note that each $c_i$, for $1 \le i \le n$, does not necessarily refer to a unique point in time. It may refer to a pair of instants at which SB is awakened during the $i$-th experiment. As these instants are indistinguishable to her, we assume the corresponding credence functions are equal.

  4. As SB does not know which experiment she is in under our setup, upon each awakening in the $i$-th experiment she can think of $H_i$ as "The coin toss landed Heads for this experiment".

  5. A scoring rule $s$ is strictly proper if, for every $p \in [0, 1]$, the expected inaccuracy $p\,s(x,1) + (1-p)\,s(x,0)$ is minimized only at $x = p$.

  6. Defined via $s(x, v) = (x - v)^2$. For instance, if Sleeping Beauty assigns $c_i(H_i) = x$ upon awakening in the $i$-th experiment, but the coin lands Tails (the proposition $H_i$ is false), the Brier score yields $s(x, 0) = x^2$.

  7. There are infinitely many choices for this aggregation. For instance, one can treat the inaccuracies on Monday and Tuesday as the components of a vector $(s(x,0), s(x,0))$ and then pick a norm (e.g. the Euclidean norm) on the corresponding vector space.

  8. We could equivalently consider the average of the means. As $n$ is fixed across all worlds, this does not change the $x$ minimizing the aggregate inaccuracy.

  9. Bostrom employs this informal argument to show that his hybrid model yields an optimal $x = 1/3$ for large $n$.

  10. A proof sketch is given in the draft paper.

  11. Another double-halfer solution can be found here.

2 comments


comment by SMK (Sylvester Kollin) · 2025-02-11T12:45:19.963Z

I think it would be good if you made clear in the abstract what your contributions to the literature are, and how your results relate to those of e.g. Kierland and Monton (2005).

comment by glauberdebona · 2025-02-11T14:09:46.182Z

Thanks for the feedback! The summary here and the abstract in the draft paper have been updated; I hope it is clearer now.