Changing my mind about Christiano's malign prior argument

post by Cole Wyeth (Amyr) · 2025-04-04T00:54:44.199Z · LW · GW · 29 comments

Contents

  Overview
  The Argument
  Obstacles for Adversaria
  An Analogy to Intelligent Design
29 comments

Overview

In the past I've been skeptical of Paul Christiano's argument that the universal distribution is "malign" in the sense that it contains adversarial subagents who might attempt acausal attacks. My underlying intuition was that the universal distribution is doing epistemics properly, so that its credences should track reality and not be inappropriately vulnerable to any attacks. In particular, if the universal distribution takes seriously that it may be in a simulation, and expects the "simulation lords" to mess with it (this is the overly-compressed essence of Christiano's hypothetical), then we should also take this seriously - so it's not really an attack at all. I ended up changing my mind somewhat at the CMU agent foundations conference, after helpful conversations with Abram Demski, Vanessa Kosoy, Scott Garrabrant, and particularly Sam Eisenstat (who finally managed to get through to me with a slightly different explanation of the argument). The short version is that (as I was aware) Solomonoff induction does not have an inductive bias towards being computed by an agent/predictor, because it does not model embeddedness - it does not expect to be "running on a computer." This seems unreasonable (since we know anything performing inference about our universe is a part of our universe - right?), so security mindset suggests anticipating that some kind of exploit/attack is possible - which is how I could have taken the argument seriously faster. In fact, I think that this "mistake" in the universal distribution lends some credence to the malign prior argument, opening a potential opportunity for acausal attack. However, I still have many reservations about whether the argument works, which I'll discuss in the latter part of this post.

The Argument

Following my last post engaging with the malign prior argument [LW · GW], I'll call the universe where an agent of interest is trying to use the universal distribution for prediction Predictoria. The idea of the argument is that there may be another simple computable universe called Adversaria which evolves life and ultimately a civilization/singleton which wishes to acausally influence Predictoria - perhaps because it cares about all computable or mathematical universes. Adversaria could run many simulations of simple computable universes from the perspective of important decision makers, and then intervene on simulations right after important decisions. Why would Adversaria want to do this? Well, imagine that we in Predictoria worry that we might be in a simulation run by Adversaria. This could certainly affect our actions, if we have any guess about what actions the simulation lords might disapprove of and punish. More generally, Adversaria could push our probabilities around pivotal moments however they saw fit by re-engineering the future (by pausing and modifying the simulation), so various subtle and malign forms of influence seem possible.

Why would the universal distribution take this possibility seriously? Hypotheses are, roughly speaking, weighted by Kolmogorov complexity (that is, "computable description complexity"). Since we assumed Adversaria is computable and simple, it's a likely candidate hypothesis a priori. If it runs a massive number of simulations, we might expect to be in one of them (assuming our universe is computable and simple, which is required for realizability - though it is not the only possible justification for using the universal distribution). In that case, the combined hypothesis that we are in an Adversaria simulation of our universe requires only the simple "laws of physics" for Adversaria and a pointer of some sort to the specific simulation that has Predictoria's laws of physics.
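The 2^-complexity weighting can be illustrated with a tiny toy model (this is NOT real Solomonoff induction - the "program lengths" below are invented, and the hypothesis class is hand-picked rather than enumerated over a universal machine):

```python
from fractions import Fraction

# Toy illustration of universal-distribution-style weighting: each hypothesis
# is a generator of an infinite bit sequence, weighted by 2^-(program length).
# All program lengths here are invented for illustration.
hypotheses = {
    "all_zeros":       (3, lambda t: 0),                  # very short program
    "alternating":     (5, lambda t: t % 2),              # slightly longer
    "zeros_then_flip": (9, lambda t: 0 if t < 6 else 1),  # "simulation intervenes at t=6"
}

def posterior(observed):
    # Keep hypotheses consistent with the data; weight by 2^-length; normalize.
    weights = {
        name: Fraction(1, 2 ** length)
        for name, (length, gen) in hypotheses.items()
        if all(gen(t) == bit for t, bit in enumerate(observed))
    }
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

# After four zeros, the "intervention" hypothesis is not refuted, so it
# survives with a small but nonzero weight (all_zeros 64/65, zeros_then_flip 1/65).
print(posterior([0, 0, 0, 0]))
```

The point of the toy: a hypothesis that agrees with all observations so far but predicts a future departure never gets eliminated, only discounted by its extra description length.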

As @Wei Dai [LW · GW] put it in his response [LW(p) · GW(p)] to my original rebuttal to the argument:

Let's assume for simplicity that both Predictoria and Adversaria are deterministic and nonbranching universes with the same laws of physics but potentially different starting conditions. Adversaria has colonized its universe and can run a trillion simulations of Predictoria in parallel. Again for simplicity let's assume that each of these simulations is done as something like a full-scale physical reconstruction of Predictoria but with hidden nanobots capable of influencing crucial events. Then each of these simulations should carry roughly the same weight in M as the real Predictoria and does not carry a significant complexity penalty over it. That's because the complexity / length of the shortest program for the real Predictoria, which consists of its laws of physics (P) and starting conditions (ICs_P) plus a pointer to Predictoria the planet (Ptr_P), is K(P) + K(ICs_P|P) + K(Ptr_P|...). The shortest program for one of the simulations consists of the same laws of physics (P), Adversaria's starting conditions (ICs_A), plus a pointer to the simulation within its universe (Ptr_Sim), with length K(P) + K(ICs_A|P) + K(Ptr_Sim|...). Crucially, this near-equal complexity relies on the idea that the intricate setup of Adversaria (including its simulation technology and intervention capabilities) arises naturally from evolving ICs_A forward using P, rather than needing explicit description.

(To address a potential objection, we also need that the combined weights (algorithmic probability) of Adversaria-like civilizations is not much less than the combined weights of Predictoria-like civilizations, which requires assuming that the phenomenon of advanced civilizations running such simulations is a convergent outcome. That is, it assumes that once a civilization reaches a Predictoria-like stage of development, it is fairly likely to subsequently become Adversaria-like in developing such simulation technology and wanting to use it in this way. There can be a complexity penalty from some civilizations choosing, or being forced, not to go down this path, but that would be more than made up for by the sheer number of simulations each Adversaria-like civilization can produce.)

If you agree with the above, then at any given moment, simulations of Predictoria overwhelm the actual Predictoria as far as their relative weights for making predictions based on M. Predictoria should be predicting constant departures from its baseline physics, perhaps in many different directions due to different simulators, but Predictoria would be highly motivated to reason about the distribution of these vectors of change instead of assuming that they cancel each other out. One important (perhaps novel?) consideration here is that Adversaria and other simulators can stop each simulation after the point of departure/intervention has passed for a while, and reuse the computational resources on a new simulation rebased on the actual Predictoria that has observed no intervention (or rather rebased on an untouched simulation of it), so the combined weight of simulations does not decrease relative to actual Predictoria in M even as time goes on and Predictoria makes more and more observations that do not depart from baseline physics.  

In fact, as Wei Dai seems to be hinting at, a complete theory of "Simulated Predictoria" (from within Predictoria) also needs a pointer to the specific moment of Adversaria's intervention. Another way of framing this is that Predictoria expects Adversaria to possibly intervene after each "pivotal moment." Note that this perspective views a mixture of deterministic hypotheses about which simulation Predictoria is in as a single stochastic hypothesis that Predictoria is in a simulation with fixed physics but an unknown moment of intervention. Pivotal moments have low K-complexity (in fact, that is one possible definition of a pivotal moment, though I'll reserve the term for moments that Adversaria also cares about), which means they eventually become infrequent by a counting argument. So Predictoria should eventually learn it is not in the type of simulation which is intervened on. I think this is pretty cut and dried, but it doesn't really answer the malign prior argument, because K-complexity grows very slowly, and fear of Adversaria may influence Predictoria to some degree for many years, particularly around important moments.
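The counting argument can be sketched numerically. There are fewer than 2^k binary programs of length below k, so at most 2^k - 1 moments in all of history can have K-complexity below k (the parameters below are arbitrary):

```python
# Sketch of the counting argument: there are fewer than 2^k binary programs of
# length < k, so at most 2^k - 1 moments t can have K(t) < k. As the horizon
# grows, candidate "pivotal" (simple) moments thin out.
def max_fraction_simple_moments(k, horizon):
    # Upper bound on the fraction of times t in [0, horizon) with K(t) < k.
    return min(1.0, (2 ** k - 1) / horizon)

for horizon in (10 ** 3, 10 ** 6, 10 ** 9):
    print(horizon, max_fraction_simple_moments(10, horizon))
```

The bound is vacuous early on (which is why the fear of intervention can persist "for many years") but shrinks linearly in the horizon thereafter.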

With that said, we have more or less reduced the problem to deciding whether the laws of physics of Adversaria are more complicated than a pointer to our specific planet / predictor of Predictoria. I initially rejected the malign prior argument here, because I did not see it this way; Adversaria's pointer to Predictoria also has to specify the planet! However, the Adversaria hypothesis does have an advantage here, because Adversaria would only care to influence decision makers! That means it gets "most of" the planet predictor for free, by restricting only to planets with life. From the perspective of the universal distribution, observations might come from the inside of a star or a random dust cloud (Vanessa's usual example). Perhaps an inhabited planet has lower K-complexity than an arbitrary planet, but we do have to specify we're putting our sensors on a robot / attached to a brain, and that costs us some bits.

Does it cost us more bits than the physics of Adversaria? It's not clear to me, but it can't be much more, since Adversaria evolves intelligent life which pervades its entire universe. That means that the conditional complexity of "intelligence" should be pretty low given the physics of Adversaria - and "intelligence" makes the pointer to our planet of Predictoria simple.
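The bit-accounting in Wei Dai's comparison can be made concrete with invented numbers (every complexity below is made up purely for illustration - the interesting part is the cancellation, not the values):

```python
import math

# Toy arithmetic for the Adversaria-vs-real-Predictoria weight comparison.
K_P = 100                  # shared laws of physics
K_ICs_P, K_Ptr_P = 50, 40  # real Predictoria: starting conditions + pointer to the planet
K_ICs_A = 55               # Adversaria's starting conditions (slightly more complex)
n_sims = 10 ** 12          # simulations Adversaria runs in parallel
K_Ptr_Sim = 40 + math.log2(n_sims)  # pointer to one specific simulation

weight_real = 2.0 ** -(K_P + K_ICs_P + K_Ptr_P)
weight_one_sim = 2.0 ** -(K_P + K_ICs_A + K_Ptr_Sim)
weight_all_sims = n_sims * weight_one_sim  # the log2(n_sims) pointer cost cancels out

ratio = weight_all_sims / weight_real
print(ratio)  # = 2^-(K_ICs_A - K_ICs_P) = 1/32: no drastic penalty for the simulation hypothesis
```

Summed over all simulations, the extra pointer cost to a specific simulation cancels against their number, so the combined simulation hypothesis only pays for the difference in starting-condition complexity - exactly the "more or less reduced" comparison described above.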

Still, I think the argument stands in the sense that it isn't obviously logically flawed.

Obstacles for Adversaria

The argument places a lot of demands on the hypothetical Adversaria:

  1. It has to be computationally simple
  2. It has to evolve intelligent life (if the specification of intelligence is just hardcoded, pointing to the predictor in Predictoria would be simpler than invoking Adversaria)
  3. This evolution probably has to take place relatively quickly - pointers to very late times in Adversaria may be too complicated, but required to specify simulations (noticed by Abram Demski). Simple resolutions for this issue seem to fall prey to the preceding obstacle
  4. That intelligent life has to care very much about acausally influencing the mathematical multiverse (I think this may be a philosophical mistake, but will have to argue that point elsewhere)
  5. Adversaria has to have a pretty large amount of accessible compute to run simulations
  6. Adversaria may have to coordinate with any other Adversaria-like influencers, but this is potentially nontrivial because the closure properties of computable universes (which do NOT have access to an oracle for the universal distribution) may make it difficult for them to reason about each other

It also places some demands on Predictoria:

  1. Predictoria has to be at least computable, preferably computationally simple (otherwise Adversaria is required to run far more simulations to include Predictoria)
  2. Predictoria has to be able to reason about Adversaria, which is presumably hard / necessarily approximate (?) by the previous demand [LW · GW]

Even if all of these requirements are fulfilled, Adversaria doesn't drastically beat out alternatives a priori and loses probability mass as time passes. 

An Analogy to Intelligent Design

The malign priors argument is analogous to the following argument that a Cartesian dualist should believe in God:

"Isn't it unlikely that you would be a conscious, thinking being instead of a rock? Perhaps the universe (as we know it) was created by such a conscious, thinking being who is personally invested in you existing because you are like Him. You should try to figure out what He would want and do that - otherwise, He might overturn the natural order as punishment."

Or very briefly:

"I think, therefore God exists."

which in a sense Descartes actually tried to argue!  

Initially, my intuition was that this argument is just wrong, God / Adversaria is disfavored by Occam's razor, and Solomonoff induction would not fall for it. Now I am convinced that Solomonoff induction takes this argument more seriously than we might, because it is not a naturalized inductor [? · GW] - it would be surprised to be "running inside of a robot," and that (apparently inappropriate) surprise might lead it to radical places. In a way, that makes perfect sense; Solomonoff induction really can't run in our universe! Any robot we could build to "use Solomonoff induction" would have to use some approximation, which the malign prior argument may or may not apply to.

Either way, if I had to guess, the universal distribution probably isn't malign "in practice." I'm not sure if this question is purely academic or has any alignment-relevant content.

29 comments

Comments sorted by top scores.

comment by Lucius Bushnaq (Lblack) · 2025-04-04T07:28:27.606Z · LW(p) · GW(p)

Thank you for this summary. 

I still find myself unconvinced by all the arguments against the Solomonoff prior I have encountered. For this particular argument, as you say, there are still many ways the conjectured counterexample of Adversaria could fail if you actually tried to sit down and formalise it. Since the counterexample is designed to break a formalism that looks and feels really natural and robust to me, my guess is that the formalisation will indeed fall to one of these obstacles, or a different one.

In a way, that makes perfect sense; Solomonoff induction really can't run in our universe! Any robot we could build to "use Solomonoff induction" would have to use some approximation, which the malign prior argument may or may not apply to.

You can just reason about Solomonoff induction with cut-offs [LW · GW] instead. If you render the induction computable by giving it a uniform prior over all programs of some finite length l [1] with runtime < t, it still seems to behave sanely. As in, you can derive analogs of the key properties of normal Solomonoff induction for this cut-off induction. E.g. the induction will not make more than l bits worth of prediction mistakes compared to any 'efficient predictor' program p with runtime < t and K-complexity ≤ l, it's got a rough invariance to what Universal Turing Machine you run it on, etc.
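This kind of mixture bound can be sanity-checked numerically in a toy setting. The three hand-picked predictors below are stand-ins for the cut-off program space (they are invented for illustration, not part of the original comment):

```python
import math

# Toy check of the bound: a Bayes mixture with a uniform prior over N
# predictors has total log-loss at most log2(N) bits worse than the best
# predictor in the class.
predictors = [
    lambda hist: 0.9,  # "bits are mostly 1"
    lambda hist: 0.1,  # "bits are mostly 0"
    lambda hist: 0.5,  # "fair coin"
]

def log_loss(p_one, bit):
    # Code length (bits) a predictor assigns to `bit` given P(next = 1) = p_one.
    return -math.log2(p_one if bit == 1 else 1.0 - p_one)

def run(data):
    n = len(predictors)
    weights = [1.0 / n] * n  # uniform prior, analogous to 2^-l over length-l programs
    mix_loss, per_loss = 0.0, [0.0] * n
    hist = []
    for bit in data:
        probs = [p(hist) for p in predictors]
        p_mix = sum(w * q for w, q in zip(weights, probs))
        mix_loss += log_loss(p_mix, bit)
        for i, q in enumerate(probs):
            per_loss[i] += log_loss(q, bit)
            weights[i] *= q if bit == 1 else 1.0 - q  # Bayesian posterior update
        total = sum(weights)
        weights = [w / total for w in weights]
        hist.append(bit)
    return mix_loss, per_loss

mix_loss, per_loss = run([1] * 50)
# The mixture's total loss exceeds the best predictor's by less than log2(3) bits.
print(mix_loss - min(per_loss), math.log2(len(predictors)))
```

The same argument with a uniform prior over the 2^l programs of length l gives the l-bit regret bound quoted above; the bound is on total loss summed over all datapoints, which is why it says nothing about *when* the mistakes happen.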

Since finite, computable things are easier for me to reason about, I mostly use this cut-off induction in my mental toy models of AGI these days. 

EDIT: Apparently, this exists in the literature under the name AIXI-tl. I didn't know that. Neat.

  1. ^

    So, no prefix-free requirement. 

Replies from: D0TheMath, Amyr, jeremy-gillen
comment by Garrett Baker (D0TheMath) · 2025-04-04T08:09:59.530Z · LW(p) · GW(p)

I think I mostly agree with this, I think things possibly get more complicated when you throw decision theory into the mix. I think it unlikely I'm being adversarially simulated in part. I could believe that such malign prior problems are actually decision theory problems much more than epistemic problems. Eg "no, I am not going to do what the evil super-simple-simulators want me to do because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior)".

Replies from: jeremy-gillen, Lblack
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T15:16:17.297Z · LW(p) · GW(p)

In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2025-04-04T16:55:42.961Z · LW(p) · GW(p)

Oh my point wasn't against solomonoff in general. Maybe more crisply: my claim is that different decision theories will find different "pathologies" in the solomonoff prior, and in particular for causal and evidential decision theorists, I could totally buy the misaligned prior bit, and I could totally buy, if formalized, that the whole thing rests on the interaction between bad decision theory and solomonoff.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T17:08:11.630Z · LW(p) · GW(p)

But why would you ever be able to solve the problem with a different decision theory? If the beliefs are manipulating it, it doesn't matter what the decision theory is.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2025-04-04T17:15:53.045Z · LW(p) · GW(p)

My world model would have a loose model of myself in it, and this will change which worlds I'm more or less likely to be found in. For example, a logical decision theorist, trying to model omega, will have very low probability that omega has predicted it will two box.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T17:18:35.771Z · LW(p) · GW(p)

How does this connect to malign prior problems?

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2025-04-04T17:25:16.771Z · LW(p) · GW(p)

no, I am not going to do what the evil super-simple-simulators want me to do because they will try to invade my prior iff (I would act like they have invaded my prior iff they invade my prior)

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T17:44:06.636Z · LW(p) · GW(p)

Well my response to this was:

In order for a decision theory to choose actions, it has to have a model of the decision problem. The way it gets a model of this decision problem is...?

But I'll expand: An agent doing that kind of game-theory reasoning needs to model the situation it's in. And to do that modelling it needs a prior. Which might be malign.

Malign agents in the prior don't feel like malign agents in the prior, from the perspective of the agent with the prior. They're just beliefs about the way the world is. You need beliefs in order to choose actions. You can't just decide to act in a way that is independent of your beliefs, because you've decided your beliefs are out to get you. 

On top of this, how would you even decide that your beliefs are out to get you? Isn't this also a belief?

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2025-04-04T18:37:54.058Z · LW(p) · GW(p)

Let A be an agent which can be instantiated in a much simpler world and has different goals from our limited Bayesian agent B. We say A is malign with respect to B if P_B(w_sim) >> P_B(w_real), where w_real is the "real" world and w_sim is the world where A has decided to simulate all of B's observations for the purpose of trying to invade their prior.

Now what influences P_B(w_sim)? Well A will only simulate all of B's observations if it expects this will give it some influence over B. Let C be an unformalized logical counterfactual operation that B could make.

Then P_B(w_sim) is maximal when B takes into account A's simulation, and minimal when B doesn't take into account A's simulation. In particular, if C is a logical counterfactual which doesn't take A's simulation into account, then P_B(w_sim | C) is small.

So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operations, same as in causal decision theory, and same as in evidential decision theory.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T20:05:08.334Z · LW(p) · GW(p)

I'm not sure what the type signature of your logical counterfactual operation is, or what it means to "not take into account the simulator's simulation". When the agent makes decisions about which actions to take, it doesn't have the option of ignoring the predictions of its own world model. It has to trust its own world model, right? So what does it mean to "not take it into account"?

So the way in which the agent "gets its beliefs" about the structure of the decision theory problem is via these logical-counterfactual-conditional operations

I think you've misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it's in a particular situation (e.g. with omega and knowing how omega will react to different actions). But in reality, an agent has to learn which kind of world its in using an inductor. That's all I meant by "get its beliefs".

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2025-04-04T20:17:24.460Z · LW(p) · GW(p)

I'm not sure what the type signature of your logical counterfactual operation is, or what it means to "not take into account the simulator's simulation"

I know you know about logical decision theory, and I know you know it's not formalized, and I'm not going to be able to formalize it in a LessWrong comment, so I'm not sure what you want me to say here. Do you reject the idea of logical counterfactuals? Do you not see how they could be used here?

I think you've misunderstood me entirely. Usually in a decision problem, we assume the agent has a perfectly true world model, and we assume that it's in a particular situation (e.g. with omega and knowing how omega will react to different actions). But in reality, an agent has to learn which kind of world its in using an inductor. That's all I meant by "get its beliefs".

Because we're talking about priors and their influence, all of this is happening inside the agent's brain. The agent is going about daily life, and thinks "hm, maybe there is an evil demon simulating me who will give me -10^10^10 utility if I don't do what they want for my next action". I don't see why this is obviously ill-defined without further specification of the training setup.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T20:44:21.831Z · LW(p) · GW(p)

Do you not see how they could be used here?

This one. I'm confused about what the intuitive intended meaning of the symbol is. Sorry, I see why "type signature" was the wrong way to express that confusion. In my mind a logical counterfactual is a model of the world, with some fact changed, and the consequences of that fact propagated to the rest of the model. Maybe the symbol denotes a boolean fact that is edited? But if so I don't know which fact it is, and I'm confused by the way you described it.

Because we're talking about priors and their influence, all of this is happening inside the agent's brain. The agent is going about daily life, and thinks "hm, maybe there is an evil demon simulating me who will give me -101010^10 utility if I don't do what they want for my next action". I don't see why this is obviously ill-defined without further specification of the training setup.

Can we replace this with: "The agent is going about daily life, and its (black box) world model suddenly starts predicting that most available actions lead to -10^10 utility."? This is what it's like to be an agent with malign hypotheses in the world model. I think we can remove the additional complication of believing it's in a simulation.

comment by Lucius Bushnaq (Lblack) · 2025-04-04T14:05:20.595Z · LW(p) · GW(p)

If you make an agent by sticking together cut-off Solomonoff induction and e.g. causal decision theory, I do indeed buy that this agent will have problems. Because causal decision theory has problems.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-04-04T14:13:45.914Z · LW(p) · GW(p)

But how serious will these problems be? What if you encrypt the agent's thoughts, add pain sensors, and make a few other simple patches to deal with embeddedness?

I wouldn't be comfortable handing the lightcone over to such a thing, but I don't really expect it to fall over anytime soon. 

comment by Cole Wyeth (Amyr) · 2025-04-04T15:45:40.710Z · LW(p) · GW(p)

What you describe is not actually equivalent to AIXI-tl, which conducts a proof search to justify policies. Your idea has more in common with Schmidhuber’s speed prior. 

comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T15:12:18.222Z · LW(p) · GW(p)

One thing to keep in mind is that time cut-offs will usually rule out our own universe as a hypothesis. Our universe is insanely compute inefficient.

So the "hypotheses" inside your inductor won't actually end up corresponding to what we mean by a scientific hypothesis. The only reason this inductor will work at all is that it's done a brute force search over a huge space of programs until it finds one that works. Plausibly it'll just find a better efficient induction algorithm, with a sane prior.

Replies from: Lblack
comment by Lucius Bushnaq (Lblack) · 2025-04-04T15:21:11.961Z · LW(p) · GW(p)

That’s fine. I just want a computable predictor that works well. This one does.

Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is about finding more effective approximations for stuff.

Edit: Actually, I don’t think this would yield you a different general predictor as the program dominating the posterior. A general inductor program M running program p is pretty much never going to be the shortest implementation of p.

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T15:51:23.141Z · LW(p) · GW(p)

You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?

Also, scientific hypotheses in practice aren’t actually simple code for a costly simulation we run. We use approximations and abstractions to make things cheap. Most of our science outside particle physics is actually about finding more effective approximate models for things in different regimes.

When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself. 

E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.

Edit to respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.

Replies from: Lblack
comment by Lucius Bushnaq (Lblack) · 2025-04-04T15:54:33.072Z · LW(p) · GW(p)

You also want one that generalises well, and doesn't do performative predictions, and doesn't have goals of its own. If your hypotheses aren't even intended to be reflections of reality, how do we know these properties hold?

Because we have the prediction error bounds.

When we compare theories, we don't consider the complexity of all the associated approximations and abstractions. We just consider the complexity of the theory itself. 

E.g. the theory of evolution isn't quite code for a costly simulation. But it can be viewed as set of statements about such a simulation. And the way we compare the theory of evolution to alternatives doesn't involve comparing the complexity of the set of approximations we used to work out the consequences of each theory.

Yes. 

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T16:25:02.669Z · LW(p) · GW(p)

To respond to your edit: I don't see your reasoning, and that isn't my intuition. For moderately complex worlds, it's easy for the description length of the world to be longer than the description length of many kinds of inductor.

Because we have the prediction error bounds.

Not ones that can rule out any of those things. My understanding is that the bounds are asymptotic or average-case in a way that makes them useless for this purpose. So if a mesa-inductor is found first that has a better prior, it'll stick with the mesa-inductor. And if it has goals, it can wait as long as it wants to make a false prediction that helps achieve its goals. (Or just make false predictions about counterfactuals that are unlikely to be chosen).

If I'm wrong then I'd be extremely interested in seeing your reasoning. I'd maybe pay $400 for a post explaining the reasoning behind why prediction error bounds rule out mesa-optimisers in the prior.

Replies from: Lblack
comment by Lucius Bushnaq (Lblack) · 2025-04-04T16:44:24.253Z · LW(p) · GW(p)

The bound is the same one you get for normal Solomonoff induction, except restricted to the set of programs the cut-off induction runs over. It’s a bound on the total expected error in terms of CE loss that the predictor will ever make, summed over all datapoints.


Look at the bound for cut-off induction in that post I linked, maybe? Hutter might also have something on it.

Can also discuss on a call if you like.

Note that this doesn’t work in real life, where the programs are not in fact restricted to outputting bit string predictions and can e.g. try to trick the hardware they’re running on. 

Replies from: jeremy-gillen
comment by Jeremy Gillen (jeremy-gillen) · 2025-04-04T17:35:55.529Z · LW(p) · GW(p)

Yeah I know that bound, I've seen a very similar one. The problem is that mesa-optimisers also get very good prediction error when averaged over all predictions. So they exist well below the bound. And they can time their deliberately-incorrect predictions carefully, if they want to survive for a long time.

comment by Noosphere89 (sharmake-farah) · 2025-04-04T17:01:07.168Z · LW(p) · GW(p)

I have said something on this, and the short form is I don't really believe in Christiano's argument that the Solomonoff Prior is malign, because I think there's an invalid step in the argument.

The invalid step is where it is assumed that we can gain information about other potential civilizations' values solely from the fact that we are in a simulation. The key issue is that since the simulation/mathematical multiverse hypotheses predict everything, we can gain no new information in a Bayesian sense.

(This is in fact the general problem with the simulation/mathematical multiverse hypotheses: since they predict everything, you can predict nothing specific, and thus you need specialized theories to explain any specific thing.)

The other problem is that the argument assumes that there is a cost to computation, but there is no cost to computation in the Solomonoff Prior:

https://www.lesswrong.com/posts/tDkYdyJSqe3DddtK4/alexander-gietelink-oldenziel-s-shortform#w2M3rjm6NdNY9WDez [LW(p) · GW(p)]

Link below on how the argument for Solomonoff induction can be made simpler, which was the inspiration for my counterargument:

https://www.lesswrong.com/posts/KSdqxrrEootGSpKKE/the-solomonoff-prior-is-malign-is-a-special-case-of-a [LW · GW]

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-04-04T17:21:36.775Z · LW(p) · GW(p)

I’ve also considered that objection (that no specific value predictions can be made) and addressed it implicitly in my list of demands on Adversaria, particularly “coordination” with any other Adversaria-like universes. If there is only one Adversaria-like universe then Solomonoff induction will predict its values, though in practice they may still be difficult to predict. Also, even if coordination fails, there may be some regularities to the values of Adversaria-like universes which cause them to “push in a common direction.”

comment by dil-leik-og (samuel-buteau) · 2025-04-04T18:28:27.959Z · LW(p) · GW(p)

thank you for writing this post!
 

  • the data you feed to SI exists in 3 places: 1) the moment where you feed it into SI, 2) where it originated in Predictoria, 3) where it was faked in Adversaria
  • what if in the future of Predictoria there is an Adversaria? can't it just reuse the records of what you fed into SI?
  • some Predictorias might be much easier to attack. For instance, some universes are simpler than others, and some universes build UTMs that make their SI think they are more complex than they are. Some Adversaria can foresee some Predictorias that will be disadvantaged relative to Adversaria in their SI.
  • the way it goes is that the universe's laws support computation in an obvious way, and so you build some "UTM" that looks simple in your universe, and that is what you use for your SI. Within the simple hypotheses of your SI may be some universes simpler than yours, such that they could know that about you maybe. this is sort of separate from how simple a given universe is in absolute terms (I am unsure how much the absolute complexity idea makes sense)
  • Adversaria can choose which Predictorias they attack, and they can know what their choice implies about Predictoria. For instance, maybe the Predictorias that try to extract a "blueprint for Safe ASI" very directly are not very smart (or maybe the way the internet looks is more direct evidence about how smart Predictoria is), and maybe the Adversaria can spend way less resources overtaking them because they can time the sharp left turn very well. 
     

many more such thoughts, let me know if you would like to google meet and discuss or if you would like me to keep going

Replies from: samuel-buteau
comment by dil-leik-og (samuel-buteau) · 2025-04-04T18:31:51.820Z · LW(p) · GW(p)

also, thank you for the words Adversaria and Predictoria; I shall use them henceforth!

comment by Knight Lee (Max Lee) · 2025-04-04T09:53:31.262Z · LW(p) · GW(p)

Wait a minute! I think this post demonstrates that the Solomonoff universal prior is malign because it leads to unfavorable acausal trades!

Basically, if the Solomonoff universal prior were the "correct" prior, then the malign agent simulating the observations of the Solomonoff induction agent would simply be offering it an acausal trade deal.

The Solomonoff induction agent "rationally" takes the deal because the malign agent has a far higher prior probability of coming into existence, so the malign agent's universe matters far more.

However, if the Solomonoff universal prior is not correct, and assigns far too much weight to universes which are merely a few bits simpler, then the malign agent, who uses a more sensible prior, would scam Solomonoff induction agents by convincing them their observations are almost certainly a simulation by the malign agent (since the malign agent's universe is just a few bits simpler), and that they should therefore trade their entire universes for a small share of the malign agent's universe.

The scary part is that even if you don't use the Solomonoff universal prior, your prior will still be malign if it is "incorrect," because you will be easily scammed by any civilization/agent in a universe to which your prior gives too much weight. They'll convince you that you are probably a simulation by them, and that you should trade your entire universe for a small share of their universe.

Priors are scary, maybe.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-04-04T17:26:13.736Z · LW(p) · GW(p)

It’s not as bad as you’re imagining, at least not for CDT - a causal decision theorist will only “pay rent” to the universe they actually believe they’re in, and won’t make acausal trades with “more real” universes. If the prior is mistaken about which universe you are in, experience should correct it.