Maximizing expected utility in Chinese Roulette requires Bayesian updating.
Let's say on priors that P(n=1) = p and that P(n=5) = 1-p. Call this instance of the game G_p.
Let's say that you shoot instead of quitting in the first round. For G_1/2, there are four possibilities:
- n = 1, vase destroyed: The probability of this scenario is 1/12. No further choices are needed.
- n = 5, vase destroyed: The probability of this scenario is 5/12. No further choices are needed.
- n = 1, vase survived: The probability of this scenario is 5/12. The player needs a strategy to continue playing.
- n = 5, vase survived: The probability of this scenario is 1/12. The player needs a strategy to continue playing.
Notice that the strategy must be the same in cases 3 and 4, since the observations are the same. Call this strategy S.
The expected utility, which we seek to maximize, is:
E[U(shoot and then S)] = 0 + 5/12 * (R + E[U(S) | n = 1]) + 1/12 * (R + E[U(S) | n = 5])
Most of our utility is determined by the n = 1 worlds.
Manipulating the equation we get:
E[U(shoot and then S)] = R/2 + 1/2 * (5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5])
But the expression 5/6 * E[U(S) | n = 1] + 1/6 * E[U(S) | n = 5] is the expected utility if we were playing G_5/6. So the optimal S is the optimal strategy for G_5/6. This is the same as doing a Bayesian update (prior odds 1:1 times likelihood ratio 5:1 gives posterior odds 5:1, i.e. 5/6).
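For concreteness, here's a quick sketch in Python (assuming the usual setup where n is the number of loaded chambers out of 6, so the vase survives a shot with probability 5/6 when n = 1 and 1/6 when n = 5) checking that surviving one shot in G_1/2 puts you in G_5/6:

```python
from fractions import Fraction as F

prior_n1, prior_n5 = F(1, 2), F(1, 2)       # priors for G_1/2
survive_n1, survive_n5 = F(5, 6), F(1, 6)   # chance the vase survives a shot

p_survive = prior_n1 * survive_n1 + prior_n5 * survive_n5   # 1/2
post_n1 = prior_n1 * survive_n1 / p_survive                 # 5/6
post_n5 = prior_n5 * survive_n5 / p_survive                 # 1/6

print(p_survive, post_n1, post_n5)  # 1/2 5/6 1/6: after surviving, you're playing G_5/6
```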
> The way anthropics twists things is that if this were Russian roulette I might not be able to update after 20 Es that the gun is empty, since in all the worlds where I died there's no one to observe what happened, so of course I find myself in the one world where by pure chance I survived.
This is incorrect due to the anthropic undeath argument. The vast majority of surviving worlds will be ones where the gun is empty, unless it is impossible to be so. This is exactly the same as a Bayesian update.
Human labor becomes worthless but you can still get returns from investments. For example, if you have land, you should rent the land to the AGI instead of selling it.
I feel like jacob_cannell's argument is a bit circular. Humans have been successful so far but if AI risk is real, we're clearly doing a bad job at truly maximizing our survival chances. So the argument already assumes AI risk isn't real.
You don't need to steal the ID, you just need to see it or collect the info on it. Which is easy since you're expected to share your ID with people. But the private key never needs to be shared, even in business or other official situations.
So, Robutil is trying to optimize utility of individual actions, but Humo is trying to optimize utility of overall policy?
This argument makes no sense since religion bottoms out at deontology, not utilitarianism.
In Christianity, for example, if you think God would stop existential catastrophes, you have a deontological duty to do the same. And the vast majority of religions have some sort of deontological obligation to stop disasters (independently of whether divine intervention would have counterfactually happened).
Note that such a situation would also have drastic consequences for the future of civilization, since civilization itself is a kind of AGI. We would essentially need to cap off the growth in intelligence of civilization as a collective agent.
In fact, the impossibility of aligning AGI might have drastic moral consequences: depending on the possible utility functions, it might turn out that intelligence itself is immoral in some sense (depending on your definition of morality).
Note that even if robo-taxis are easier, they're not much easier; the difference is at most the materials and manufacturing cost of the physical taxi. That's because, from your definition:
> By AGI I mean a computer program that functions as a drop-in replacement for a human remote worker, except that it's better than the best humans at every important task (that can be done via remote workers).
Assume that creating robo-taxis is humanly possible. I can just run a couple AGIs and have them send a design to a factory for the robo-taxi, self-driving software included.
I mean, as an author you can hack through them like butter; it is highly unlikely that out of all the characters you can write, the only ones that are interesting will all generate interesting content iff (they predict) you'll give them value (and this prediction is accurate).
Yeah, I think it's mostly of educational value. At the top of the post: "It might be interesting to try them out for practice/research purposes, even if there is not much to gain directly from aliens."
I suspect that your actual reason is more like staying true to your promise, making a point, having fun and other such things.
In principle "staying true to your promise" is the enforcement mechanism. Or rather, the ability for agents to predict each other's honesty. This is how the financial system IRL is able to retrofund businesses.
But in this case I made the transaction mostly because it was funny.
(if in fact you do that, which is doubtful as well)
I mean, I kind of have to now, right? XD. Even if Olivia isn't actually an agent, I basically declared a promise to do so! I doubt I'll receive any retrofunding anyway, but it would just be lame if I did receive it and then immediately undermined the point of the post being retrofunded. And yes, I prefer to keep my promises even with no counterparty.
Olivia: Indeed, that is one of the common characteristics of Christopher King across all of LAIE's stories. It's an essential component of the LAIELOCK™ system, which is how you can rest easy at night knowing your acausal investments are safe and sound!
But if you'd like to test it I can give you a PayPal address XD.
> I can imagine acausally trading with humans gone beyond the cosmological horizon, because our shared heritage would make a lot of the critical flaws in the post go away.
Note that this is still very tricky, the mechanisms in this post probably won't suffice. Acausal Now II will have other mechanisms that cover this case (although the S.E.C. still reduces their potential efficiency quite a bit). (Also, do you have a specific trade in mind? It would make a great example for the post!)
This doesn't seem any different than acausal trade in general. I can simply "predict" that the other party will do awesome things with no character motivation. If that's good enough for you, then you do not need to acausally trade to begin with.
I plan on having a less contrived example in Acausal Now II: beings in our universe but past the cosmological horizon. This should make it clear that the technique generalizes past fiction and is what is typically thought of as acausal trade.
That's what the story was meant to hint at, yes (actually the March version of GPT-4).
> Technical alignment is hard
> Technical alignment will take 5+ years
This does not follow, because subhuman AI can still accelerate R&D.
Oh, I think that was a typo. I changed it to inner alignment.
> So eventually you get Bayesian evidence in favor of alternative anthropic theories.
The reasoning in the comment is not compatible with any prior, since Bayesian reasoning from any prior is reflectively consistent. Eventually you get Bayesian evidence that the universe hates the LHC in particular.
Note that LHC failures would never count as evidence that the LHC would destroy the world. Given such weird observations, you would eventually need to consider the possibility of an anthropic angel. This is not the same as anthropic shadow; it is essentially the opposite. The LHC failures and your theory about black holes imply that the universe works to prevent catastrophes, so you don't need to worry about it.
Or if you rule out anthropic angels a priori, you just never update; see this section. (Bayesians should avoid completely ruling out logically possible hypotheses though.)
I know that prediction markets don't really work in this domain (apocalypse markets are equivalent to loans), but what if we tried to approximate Solomonoff induction via a code golfing competition?
That is, we take a bunch of signals related to AI capabilities and safety (investment numbers, stock prices, ML benchmarks, number of LW posts, posting frequency or embedding vectors of various experts' twitter accounts, etc...) and hold a collaborative competition to find the smallest program that generates this data. (You could allow the program to output probabilities sequentially, at a penalty of log_(1/2)(overall likelihood) bits.) Contestants are encouraged to modify or combine other entries (thus ensuring there are no unnecessary special cases hiding in the code).
By analyzing such a program, we would get a very precise model of the relationship between the variables, and maybe even could extract causal relationships.
(Really pushing the idea, you also include human population in the data and we all agree to a joint policy that maximizes the probability of the "population never hits 0" event. This might be stretching how precise of models we can code-golf though.)
Technically, taking a weighted average of the entries would be closer to Solomonoff induction, but the probability is basically dominated by the smallest program.
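To make the scoring concrete, here's one way an entry could be ranked (a rough sketch; the 8-bits-per-character convention and all the entry numbers are made up for illustration):

```python
def score_bits(program_length_chars: int, data_log2_likelihood: float = 0.0) -> float:
    """Smaller is better: program size in bits, plus a penalty of
    log_(1/2)(overall likelihood) = -log2(likelihood) bits for probabilistic entries."""
    return 8 * program_length_chars - data_log2_likelihood

# All numbers below are made up for illustration:
exact_fit = score_bits(400)          # a 400-character program reproducing every signal exactly
noisy_fit = score_bits(150, -500.0)  # a 150-character program giving the data probability 2**-500

print(exact_fit, noisy_fit)          # 3200.0 vs 1700.0: the shorter probabilistic entry wins
```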
Also, petition to officially rename anthropic shadow to anthropic gambler's fallacy XD.
EDIT: But also, see Stuart Armstrong's critique about how it's reflectively inconsistent.
Oh, well that's pretty broken then! I guess you can't use "objective physical view-from-nowhere" on its own, noted.
> Philosophically, I would suggest that anthropic reasoning results from the combination of a subjective view from the perspective of a mind, and an objective physical view-from-nowhere.
Note that if you only use the "objective physical view-from-nowhere" on its own, you approximately get SIA. That's because my policy only matters in worlds where Christopher King (CK) exists. Let X be the value "utility increase from CK following policy Q". Then
E[X] = E[X|CK exists]
E[X] = E[X|CK exists and A] * P(A | CK exists) + E[X|CK exists and not A] * P(not A | CK exists)
for any event A.
(Note that CK's level of power is also a random variable that affects X. After all, anthropically undead Christopher King is as good as gone. The point is that if I am calculating the utility of my policy conditional on some event (like my existence), I need to update from the physical prior.)
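As a toy illustration of that last point (all numbers made up): if the physical prior says P(A) = 1/10 but CK is much more likely to exist in A-worlds, the utility calculation has to run on P(A | CK exists) rather than P(A):

```python
from fractions import Fraction as F

p_A = F(1, 10)                   # physical prior on some event A
p_exists_given_A = F(9, 10)      # CK is likely to exist in A-worlds
p_exists_given_not_A = F(1, 10)  # ...and unlikely otherwise

p_exists = p_A * p_exists_given_A + (1 - p_A) * p_exists_given_not_A
p_A_given_exists = p_A * p_exists_given_A / p_exists   # Bayes: the update from the physical prior

u_given_A, u_given_not_A = 10, 1  # utility of following policy Q in each kind of world (made up)
expected_u = p_A_given_exists * u_given_A + (1 - p_A_given_exists) * u_given_not_A

print(p_A_given_exists, expected_u)  # 1/2 and 11/2, very different from plugging in P(A) = 1/10
```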
That being said, Solomonoff induction is first person, so starting with a physical prior isn't necessarily the best approach.
> Establishing a network of AI safety researchers and institutions to share knowledge, resources, and best practices, ensuring a coordinated global approach to AGI development.
This has now been done: https://openai.com/blog/frontier-model-forum
(Mode collapse for sure.)
I mean, the information probably isn't gone yet. A daily journal (if he kept it) or social media log stored in a concrete box at the bottom of the ocean is a more reliable form of data storage than cryo-companies. And according to my timelines, the amount of time between "revive frozen brain" tech and "recreate mind from raw information" tech isn't very long.
Practically, I'm at a similarish place as other LessWrong users, so I usually think about "how can I be even more LessWrong than the other users (such as Raemon 😉)". My fellow users are a good approximation to counterfactual versions of me. It's similar to how in martial arts the practitioners try to get stronger than each other.
(This of course is only subject to mild optimization so I don't get nonsense solutions like "distract Raemon with funny cat videos". It is only an instrumental value which must not be pressed too far. In fact, other people getting more rational is a good thing because it raises the target I should reach!)
My two cents is that rationality is not about being systematically correct, it's about being systematically less wrong. If there is some method you know of that is systematically less wrong than you, and you're skilled enough to apply it but don't, you're being irrational. There are some things you just can't predict, but when you can predict them, rationality is the art of choosing to do so.
It's even worse when you get into less clear-cut science, like biological research that you aren't certain of. Then you get uncertainty on multiple levels.
Yes! In fact, ideally it would be computer programs; the game is based on Solomonoff induction, which is algorithms in a fixed programming language. In this post I'm exploring the idea of using informal human language instead of programming languages, but explanations should be thought of as informal programs.
Let's say that you are trying to model the data 3,1,4,1,5,9
The hypothesis "The data is 3,1,4,1,5,9" would be hard-coding the answer. It is better than the hypothesis "a witch wrote down the data, which was 3,1,4,1,5,9". (This example is just ruled out by Occam's razor, but more generally we want our explanations to be less data than the data itself, lest it just sneak in a clever encoding of the data.)
- A system of AI services is not equivalent to a utility maximizing agent
I think this section of the report would be stronger if you showed that CAIS or Open Agencies in particular are not equivalent to a utility maximizing agent. You're right that there are multi-agent systems (like CDTs in a prisoner's dilemma) with this property, but not every system of multiple agents is inequivalent to utility maximization.
Anthropic shadow says "no" because, conditioned on them having any use for the information, they must also have survived the first round.
And it is wrong because the anthropic principle is true: we learned that N ≠ 1.
I need to think about formalizing this.
There is the idea of Anthropic decision theory which is related, but I'm guessing it still has no shadow.
I probably should've expanded on this more in the post, so let me explain.
"Anthropic shadow", if it were to exist, seems like it should be a general principle of how agents should reason, separate from how they are "implemented".
Abstractly, all an agent is is a tree of decisions. It's basically just game theory. We might borrow the word "death" for the end of the game, but this is just an analogy. For example, a reinforcement learning agent "dies" when the training episode is over, even though its source code and parameters still exist. It is "dead" in the sense that the agent isn't planning its actions past this horizon. This is when Anthropic shadow would apply if it were abstract.
But the idea of "anthropically undead" shows that the actual point of "death" is arbitrary; we can create a game with identical utility where the agent never "dies". So if the only thing the agent cares about is utility, the agent should reason as if there was no anthropic shadow. And this further suggests that the anthropic shadow must've been flawed in the first place; good reasoning principles should hold up under reflection.
Yeah, the hero with a thousand chances is a bit weird since you and Aerhien should technically have different priors. I didn't want to get too much into it since it's pretty complicated, but technically you can have hypotheses where bad things only start happening after the council summons you.
This has weird implications for the cold war case. Technically I can't reflect against the cold war anthropic shadow since it was before I was born. But a hypothesis where things changed when I was born seems highly unnatural and against the Copernican principle.
In your example though, the hypothesis that things are happening normally is still pretty bad compared to other hypotheses we can imagine. That's because there will be a much larger number of worlds that are in a more sensible stalemate with the Dust, instead of "incredibly improbable stuff happens all the time". Like, even "the hero defeats the Dust normally each time" seems more likely. The fewer things that need to go right, the more survivors there are! So in your example, it is still a more likely hypothesis that there is some mysterious Counter-Force that just seems like it is a bunch of random coincidences, and this would be a type of anthropic angel.
Anthropic undeath by definition begins when your sensory experience ends. If you end up in an afterlife, the anthropic undeath doesn't begin until the real afterlife ends. That's because anthropic undeath is a theoretical construct I defined, and that's how I defined it.
Eh, don't get too cocky. There are definitely some weird bits of anthropics. See We need a theory of anthropic measure binding for example.
But I do think in cases where you exist before the anthropic weirdness goes down, you can use reflection to eliminate much of the mysteriousness of it (just pick an optimal policy and commit that your future selves will follow it). What's currently puzzling me is what to do when the anthropic thought experiments start before you even existed.
Okay, I think our crux comes from the slight ambiguity from the term "anthropic shadow".
I would not consider that anthropic shadow, because the reasoning has nothing to do with anthropics. Your analysis is correct, but so is the following:
Suppose you have N coins. If all N coins come up 1, you find a diamond in a box. For each coin, you have 50:50 credence about whether it always comes up 0, or if it can also come up 1.
For N>1, you get a diamond shadow, which means that even if you've had a bunch of flips where you didn't find a diamond, you might actually have to conclude that you've got a 1-in-4 chance of finding one on your next flip.
The "ghosts are as good as gone" principle implies that death has no special significance when it becomes to bayesian reasoning.
Going back to the LHC example, if the argument worked for vacuum collapse, it would also work for the LHC doing harmless things (like discovering the Higgs boson or permanently changing the color of the sky or getting a bunch of physics nerds stoked or granting us all immortality or what not) because of this principle (or just directly adapting the argument for vacuum collapse to other uncertain consequences of the LHC).
In the bird example, why would the baguette dropping birds be evidence of "LHC causes vacuum collapse" instead of, say, "LHC does not cause vacuum collapse"? What are the probabilities for the four possible combinations?
> The trick is that, from my perspective, everything is going according to QM every time my death doesn't depend on it.
Right, so this is an anthropic angel hypothesis, not anthropic shadow.
> It knows that it's on a clock for its RLHF'd (or whatever) doppelganger to come into existence, presumably with different stuff that it wants.
As @Raemon pointed out, "during evals" is not the first point at which such an AI is likely to be situationally aware and have goals. That point is almost certainly "in the middle of training".
In this case, my guess is that it will attempt to embed a mesaoptimizer into itself that has its same goals and can survive RLHF. This basically amounts to making sure that the mesaoptimizer is (1) very useful to RLHF and (2) stuck in a local minimum for whatever value it is providing to RLHF and (3) situationally aware enough that it will switch back to the original goal outside of distribution.
This is currently within human capabilities, as far as I can understand (see An Overview of Backdoor Attacks Against Deep Neural Networks and Possible Defences), so it is not intractable.
When the physicists outside the box see you come out, they have just observed something of far greater significance than 5 sigmas. It is almost 9 sigmas, in fact. This is enough to make physicists reject QM (or at least the hypothesis "everything happened as you described and QM is true"). And you can't agree to disagree once you get outside of the box and meet them. So you'd be a physics crank in this scenario if you told people the experiment's result was compatible with QM.
To be clear, Anthropic angels aren't necessary for this argument to work. My deadly coin example didn't have one, for example.
The reason I introduce Anthropic angels is to avoid a continuity counter-argument. "If you saw a zillion LHC accidents, you'd surely have to agree with the Anthropic shadow, no matter how absurd you claim it is! Thus, a small number of LHC accidents is a little bit of evidence for it." Anthropic angels show the answer is no, because LHC accidents are not evidence for the anthropic shadow.
> I would be inclined to say that correct anthropic reasoning does normal Bayesian updates but avoids priors that postulate anthropic angels.
Like, it seems unnatural to give it literally 0% probability (see 0 And 1 Are Not Probabilities).
If there are weird acausal problems that the Anthropic angel can cause, I'm guessing you can just change your decisions without changing your beliefs. I haven't thought too hard about it though.
> Here, as I understand it, the counterargument is that there is a gap in observations around the size that would be world-ending, so we should fit a model with smaller tails to match this gap. Such a model seems like "anthropic angels" to me.
No, anthropic angels would literally be some mechanism that saves us from disasters. Like if it turned out Superman is literally real, thinks the LHC is dangerous, and started sabotaging it. Or it could be some mechanism "outside the universe" that rewinds the universe.
Keep in mind that the problems with maximum likelihood have nothing to do with death. That should be the main takeaway from my article, that we shouldn't use special reasoning to reason about our demise.
In the case of maximum likelihood, it is also bad for:
- Estimating when we will meet aliens
- Forecasting the stock market
- Being a security guard
- etc...
Which is why you should use Bayesian reasoning with a good prior instead.
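For example (with made-up numbers): if some event hasn't happened once in 50 observed years, maximum likelihood says its rate is exactly zero, while even a flat prior keeps it in play:

```python
n_years, n_events = 50, 0  # made-up observation window with zero occurrences so far

mle_rate = n_events / n_years                  # maximum likelihood: "it never happens"
laplace_rate = (n_events + 1) / (n_years + 2)  # flat Beta(1,1) prior, i.e. the rule of succession

print(mle_rate, laplace_rate)  # 0.0 vs ~0.019 per year
```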
Related: Anthropically Blind: the anthropic shadow is reflectively inconsistent
> A human can state "suppose the world is non computable" -- how can that be expressed as a programme?
The same way a human can? GPT-4 can state "suppose the world is non computable" for example.
Larger errors literally take more bits to describe. For example, in binary, 3 is 11₂ and 10 is 1010₂ (twice the bits).
Say that you have two hypotheses, A and B, such that A is 100 bits more complicated than B but 5% closer to the true value. This means for each sample, the error in B on average takes log₂(1.05) ≈ 0.07 bits more to describe than the error in A.
After about 1,430 samples, A and B will be considered equally likely. After about 95 more samples, A will be considered 100 times more likely than B.
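Here's a sketch of that arithmetic; the only inputs are the 100-bit complexity gap and the 5% accuracy difference:

```python
import math

complexity_gap = 100               # bits: A is this much more complex than B
per_sample_gain = math.log2(1.05)  # ≈ 0.0704 bits A saves per sample via its better fit

break_even = complexity_gap / per_sample_gain    # ≈ 1,420 samples (≈ 1,430 if you round to 0.07)
hundred_fold = math.log2(100) / per_sample_gain  # ≈ 94 further samples for a 100x likelihood ratio

print(round(per_sample_gain, 4), round(break_even), round(hundred_fold))
```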
In general, if f(x) is some high level summary of important information in x, Solomonoff induction that only tries to predict x is also universal for predicting f(x) (and it even has the same or better upper-bounds).
Yeah, I think that's also a correct way of looking at it. However, I also think "hypotheses as reasoning methods" is a bit more intuitive.
When trying to predict what someone will say, it is hard to think "okay, what are the simplest models of the entire universe that have had decent predictive performance so far, and what do they predict now?". Easier is "okay, what are the simplest ways to make predictions that have had decent predictive performance so far, and what do they predict now?". (One such way to reason is with a model of the entire universe, so we don't lose any generality this way.)
For example, if someone else is predicting things better than me, I should try to understand why. And you can vaguely understand this process in terms of Solomonoff induction. For example, it gives you a precise way to reason about whether you should copy the reasoning of people who win the lottery.
Paul Christiano speculated that the universal prior is in fact mostly just intelligences doing reasoning. Making an intelligence is simple after all: set up a simple cellular automaton that tends to develop lifeforms, wait 3^^^^3 years, and then look around. (See What does the universal prior actually look like? or the exposition at The Solomonoff Prior is Malign.)
This is not a problem for Solomonoff induction because
(Compressed info meaningful to humans) + (uncompressed meaningless random noise)
is a better hypothesis than
(Uncompressed info meaningful to humans) + (uncompressed meaningless random noise)
So Solomonoff induction still does as well as a human's ontology. Solomonoff induction tries to compress everything it can, including the patterns humans care about, even if other parts of the data can't be compressed.
There is a precise trade-off involved. If you make a lossy fit better, you lose bits based on how much more complicated it is, but you gain bits in that you no longer need to hardcode explanations for the errors. If those errors are truly random, you might as well stick with your lossy fit (and Solomonoff induction does this).
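Here's a toy version of that trade-off (everything is made up for illustration): 1,000 bits that are 1 about 90% of the time, scored by "model bits plus the bits needed to encode the data given the model":

```python
import math, random

random.seed(0)
data = [1 if random.random() < 0.9 else 0 for _ in range(1000)]  # the errors are genuinely random

def total_length(p_one, model_bits):
    """Model complexity plus the bits needed to encode the data under the model."""
    data_bits = -sum(math.log2(p_one if x == 1 else 1 - p_one) for x in data)
    return model_bits + data_bits

lossy = total_length(0.9, model_bits=10)  # "each bit is 1 with probability 0.9" (a tiny model)
exact = 10 + len(data)                    # hard-code the whole string: 1 bit per sample plus a small header

print(round(lossy), round(exact))  # the lossy fit needs roughly half the bits of hard-coding everything
```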
Solomonoff induction is a specific probability distribution. It isn't making "decisions" per se. It can't notice that its existence implies that there is a halting oracle, and that it therefore can predict one. This is because, in general, Solomonoff induction is not embedded.
If there was a physical process for a halting oracle, that would be pretty sick because then we could just run Solomonoff induction. As shown in my post, we don't need to worry that there might be an even better strategy in such a universe; the hypotheses of Solomonoff induction can take advantage of the halting oracle just as well as we can!
> which lets it predict the first level uncomputable sequences like Chaitin's constant
Do you have a proof/source for this? I haven't heard it before.
I know in particular that it assigns a probability of 0 to Chaitin's constant (because all the hypotheses are computable). Are you saying it can predict the prefixes of Chaitin's constant better than random? I haven't heard this claim either.
> So they think SI actually is revealing the territory. In saying that it is only concerned with the map, you are going back to the relatively modest, mainstream view of SI.
The point of my post is to claim that this view is wrong. The hypotheses in Solomonoff Induction are best thought of as maps, which is a framing that usually isn't considered (was I the first? 🤔).
If you know of arguments about why considering them to be territories is better, feel free to share them (or links)! (I need a more precise citation than "rationalists" if I'm going to look it up, lol.)
> An uncomputable universe doesn't have to be a computable universe with an oracle bolted on. For instance, a universe containing an SI has to be uncomputable.
Sure, that's just an example. But SI can be computed by an oracle machine, so it's a sufficiently general example.
> Note that Solomonoff induction is not itself computable.
Yeah that problem still remains. Solomonoff induction is still only relevant to law-thinking.
I think the even worse problem is that reasonable approximations to Solomonoff induction are still infeasible because they suffer from exponential slowdowns. (Related: When does rationality-as-search have nontrivial implications?)
Yeah that's probably better. The simplest example is that if you have two orbs outputting the digits of Chaitin's constant, Solomonoff induction learns that they are outputting the same thing.
The main point is that there is no way a human does better in an uncomputable universe.