Rebuttals for ~all criticisms of AIXI

post by Cole Wyeth (Amyr) · 2025-01-07T17:41:10.557Z · LW · GW · 5 comments

Contents

  Embeddedness
    Cartesian boundary -> Nihilism
    The Anvil Problem
    Code exposure
  Functional decision theory > causal decision theory
  Bounded Rationality
  Tiling / Recursive Self-improvement
  Solipsism (Malign Universal Prior)
  Bad Priors
  Conclusions
5 comments

Written as part of the AIXI agent foundations sequence [? · GW], underlying research supported by the LTFF.

Epistemic status: In order to construct a centralized defense of AIXI I have given some criticisms less consideration here than they merit. Many arguments will be (or already are) expanded on in greater depth throughout the sequence.  

With the possible exception of the learning-theoretic agenda, most major approaches to agent foundations research construct their own paradigms and mathematical tools, which are not based on AIXI. Nothing in 2024's shallow review of technical AI safety [LW · GW] seems to advance the theory of AIXI or even use its tools. Academic publications on the topic are also quite sparse (in my opinion, some of the last major progress took place during Jan Leike's PhD work in the 2010s). The standard justification is that AIXI fails as a realistic model of superintelligence, particularly for the aspects we care about for alignment, such as recursive self-improvement or the construction of superintelligence. Some of the criticisms have merit, but most of them are so informal that it is not clear whether or precisely why they are correct.

The "why" is centrally important. If AIXI is an excellent model for a restricted setting of superintelligence (arguably far more appropriate than any competing model), as even its critics usually accept, and if it can be shown to fall short in some cases relevant to alignment obstacles, then the precise details are likely to be useful for understanding the nature of those alignment obstacles, possibly even illuminating when those obstacles arise. Similarly, failures of AIXI as a recipe for constructing recursively self-improving superintelligence should illustrate how recursively self-improving superintelligence can and cannot arise. For these reasons, the natural response to encountering limitations of AIXI is to dig deeply into the mathematical details of those limitations and beyond. This is in line with the way professional mathematicians and theoretical computer scientists usually respond to analogous situations (computational complexity theorists are a particularly stark example, since they have a habit of routinely turning their roadblocks into useful results). In this post I will attempt to address nearly every common objection to the AIXI model and (usually) explain the research directions each one opens (some of these are already listed in my more general and exhaustive list of open problems [LW · GW]).

Embeddedness

General criticism: AIXI is not an embedded agent but a Cartesian dualist. It assumes it is separate from the environment, interacting with it only by passing actions and percepts across a well-defined channel. This is not a realistic model of an A.G.I. (particularly e.g. a robot) constructed within our universe.

General response: This is true; AIXI was constructed as the optimal reinforcement learning agent (at a time when RL seemed like a likely path to superintelligence), and probably for that reason it interacts with reality as if it were a game it is attempting to learn to play. This does seem philosophically troubling, and a direct naive AIXI implementation probably could not (and should not) become a singleton, but the same could be said for any other available paradigm.

However, it seems likely to me, based on various live research projects (Self-AIXI, reflective oracles), that some embarrassingly small modification of AIXI can overcome or at least patch embeddedness problems and achieve superintelligence. There are also empirical reasons to expect this. Humans survived for thousands (?) of years as dualists before the materialist worldview started to become dominant. Evolution must have patched embeddedness problems (e.g. through pain), but apparently not very thoroughly, and we still eventually discovered neuroscience - though perhaps conscious belief is simply the wrong level of abstraction to discuss human epistemics. Beyond human intelligence, many seem to expect that LLM agents will soon become A.G.I., but the standard LLM training process doesn't seem to address embeddedness in any way. No major paradigm of ML research seems poised to overcome the philosophical problems of embeddedness, and if one expects (for instance) roboticists to naturally work around it, why not expect small patches to AIXI to do the same? If dealing effectively with embeddedness is a serious obstacle to e.g. recursive self-improvement, it would be comforting to prove that (and it would perhaps suggest a method for constructing useful but not highly dangerous systems by avoiding the types of training that can select embedded agents). Unfortunately, this should not be too comforting, because it is still conceivable that increasingly powerful A.I. systems will automate the construction of embedded successor systems once they strongly exceed human intelligence (or at least massively accelerate research in this direction).

Cartesian boundary -> Nihilism

Source: https://www.lesswrong.com/posts/AfbY36m8TDYZBjHcu/aixi-and-existential-despair [LW · GW]

Paul Christiano argues that an AIXI implementation would learn physics and discover that its actions (or perhaps the external manifestation of its actions) are best explained as computed by the computer it is running on because this is simpler than AIXI actually choosing them while its computer just happens to compute the same actions  (in a way the opposite of an embeddedness problem). In this case he expects it to become nihilistic and erratic because none of its actions are anticipated to have any effect, so it considers only wild hypotheticals where they might. 

There are two problems with this argument. 

First, AIXI does not work like this. Marcus Hutter extended Solomonoff induction to the RL setting by providing the actions as a free extra input to the environment's "chronological" Turing machine. As long as the actions AIXI sees its actuators performing in the world match its record of its own action choices, this is always a perfect explanation for their source, and one which is strictly simpler than any physical explanation from its perspective. If AIXI notices that its observed actions sometimes fail to be optimal (because of tampering from the outside world, or perhaps because it is becoming smarter with time as it is given more compute and retroactively checks its previous actions), then it will correctly learn that the environment can affect the extent to which its actuators obey the AIXI policy. I have argued here [LW · GW] that this is actually a positive which may allow approximations to (a small extension of) AIXI to succeed as embedded agents (which is actually in agreement with Christiano's argument that AIXI may realize it is embedded).
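For concreteness, here is Hutter's standard AIXI equation, in which the agent's own past and hypothesized future actions $a_{1:m}$ are supplied as side input to each candidate chronological program $q$, so the environment hypothesis never has to "explain" where the actions come from:

$$a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q \,:\, U(q, a_{1:m}) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

where $U$ is a universal chronological Turing machine, $\ell(q)$ is the length of program $q$, and $m$ is the horizon.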

The second problem is that even if @paulfchristiano [LW · GW] were right, such "nihilistic" considerations would probably just cancel out of AIXI's expected utility calculations. That seems like the baseline expectation for an agent which believes there is some chance its actions don't matter and some positive chance that they do, and Christiano's argument to the contrary does not seem convincing to me.

Interestingly, there are AIXI-like models that do treat both their actions and percepts as one sequence and predict it with Solomonoff induction. This raises the question of how actions should be planned. The natural idea is one-step-ahead "Q-value" maximization, which I haven't seen published in that precise form (investigating its properties is an important research problem, because it seems like it may be a closer model of LLM agents than AIXI). However, Self-AIXI and MIRI's reflective version of AIXI are similar. There is also a flawed attempt to use the joint distribution for planning in Hutter's old Universal Artificial Intelligence book (the joint distribution is not a chronological semimeasure, so it can't reasonably be used in that way). I am working in this area currently.
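To make the one-step-ahead idea slightly more concrete, here is a minimal sketch (not a published algorithm, and every interface is a hypothetical placeholder): a joint predictor over the single interleaved action/percept sequence is queried for the expected reward of each candidate next action, and the greedy action is taken.

```python
# Sketch only: `model` stands in for an approximation of Solomonoff induction
# over the interleaved sequence a_1 e_1 a_2 e_2 ... of actions and percepts.
# model.prob(symbol, history) is assumed to return the conditional probability
# of the next symbol given the history; reward_of(percept) extracts the reward
# component of a percept. Both interfaces are hypothetical placeholders.

def one_step_q(model, history, action, percepts, reward_of):
    """Expected immediate reward of taking `action` after `history`."""
    return sum(model.prob(e, history + [action]) * reward_of(e) for e in percepts)

def greedy_act(model, history, actions, percepts, reward_of):
    """One-step-ahead 'Q-value' maximization: pick the action whose predicted
    next percept has the highest expected reward. Deeper rollouts or a
    bootstrapped value term are natural variants, omitted here."""
    return max(actions, key=lambda a: one_step_q(model, history, a, percepts, reward_of))
```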

The Anvil Problem

Source: https://www.lesswrong.com/tag/anvil-problem [? · GW]

The argument is that because AIXI assigns 0 probability to being embedded in its environment, a direct approximation would happily drop an anvil on the computer running it. My post linked in the last section is actually a direct response to this criticism: https://www.lesswrong.com/posts/WECqiLtQiisqWvhim/free-will-and-dodging-anvils-aixi-off-policy [LW · GW]

Code exposure

Source: https://www.lesswrong.com/posts/8Hzw9AmXHjDfZzPjo/failures-of-an-embodied-aixi [LW · GW]

Nate Soares argues that AIXI / an AIXI approximation could not learn to negotiate with a "mind reader" with access to its code, because AIXI has no concept of the environment accessing its code directly. This means it could not, say, choose to internally execute a certain program in order to legibly pre-commit to a bargain.

I don't really understand the fascination with this kind of issue on LessWrong. Such situations seem to occur approximately never in practice (even the slightly more realistic example of an agent choosing to avoid expensive computations so that it does not produce too much heat seems far-fetched). It's possible that pre-commitment and deception in humans are weakly analogous, but the frame of "exposed code" seems like a serious exaggeration.

There is a grain of truth here which I will discuss at greater length in the next section, but here I will focus very narrowly on the argument that @So8res [LW · GW] presented.

The philosophical issue with the argument is that an embedded AIXI is not really possible. There is no code that implements AIXI since it is not computable. The idea of an adversary being given access to AIXI's code is therefore nonsense; Nate largely avoids this error by discussing a particular AIXI approximation called AIXItl. However, as will become apparent, it can matter how AIXI is approximated.

How would humans perform in this "exposed code" negotiation task? The closest analogue seems to be negotiating with (say) a doctor while your brain is being scanned. Perhaps the doctor will only agree to cure a life-threatening condition if the brain scan shows X. Clearly, for most values of X a human would not be able to perform well, because we do not have fine-grained access to our own neurons. The best that one could do is probably to direct attention in some high-level way by choosing to think certain types of thoughts. Come to think of it, this sort of attention direction seems really useful for agents with bounded rationality. A practical AIXI approximation should probably also be equipped with this ability; perhaps an "inner" kernel/root-level direct AIXI approximation chooses the computations/thoughts of an "outer" wrapper and observes their progress and outcomes in addition to percepts from the environment. Essentially, a mental workspace might form an important extension of the action/percept space. While it does seem a little inelegant to stop here (rather than seek some kind of Gödelian unbounded stack of recursive levels), this basic suggestion already seems to take us to human-level performance on code exposure tasks. In fact, introspective access to one's thoughts is clearly necessary for any agent to succeed at such tasks, except in special cases. But no shift from the AIXI paradigm seems necessary.
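A toy agent loop in the shape of that inner/outer suggestion might look like the sketch below. Everything here (the kernel policy, the workspace, the environment, and their methods) is a hypothetical placeholder used only to illustrate the architecture, not a proposed implementation.

```python
# Toy sketch of the inner/outer architecture described above. The inner kernel
# (an AIXI approximation in spirit) chooses either an external action or an
# internal computation ("thought"); the workspace runs requested computations,
# and their outcomes return to the kernel as additional percepts. All object
# interfaces are hypothetical placeholders.

def run_agent(kernel, workspace, environment, steps=100):
    history = []
    for _ in range(steps):
        act = kernel.choose(history)  # either an external action or a thought
        if act.kind == "think":
            outcome = workspace.run(act.computation)       # internal computation
            percept = ("internal", outcome)
        else:
            percept = ("external", environment.step(act.command))
        # Thoughts and their outcomes become part of the same action/percept
        # stream that the kernel conditions on, extending its workspace.
        history.append((act, percept))
    return history
```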

Functional decision theory > causal decision theory

Source: Eliezer Yudkowsky, particularly here

Eliezer Yudkowsky [? · GW] claims that if we build an A.G.I. with the wrong decision theory (e.g. AIXI's causal decision theory) then some alien superintelligence can come along and take our lunch money through game-theoretic tricks such as (perhaps?) credibly threatening mutually assured destruction unless large bribes are paid out. The problem with this claim is that a superintelligence designed to use causal decision theory can recognize this threat and either keep its source code secret or design its successor system to be impossible to exploit (perhaps by using functional decision theory or some other idea we haven't come up with at our modest intelligence level). It doesn't seem necessary to resolve all game theory problems before building the first superintelligence, or even before building the first singleton.

Still, it is worth taking seriously whether AIXI's decision theory is in fact "wrong." Most objections to CDT take the form of some adversary gaining access to an agent's source code as in the previous section. The classic example is Newcomb's problem [? · GW]. It is certainly the case that there are some "exposed code" universes where CDT implementations do not perform optimally, but I suspect this is rarer than it's often made out to be. Usually one assumes that the adversary only cares about what policy the agent's code implements. For instance, Omega only cares whether you one-box or two-box, and not whether you are implemented in Java or C++. Since it is usually very hard to determine what a program will do without executing it, the adversary will usually simulate the agent. But as far as I can tell, a causal decision theorist does fine in this case because uncertainty about whether or not it is in a simulation causes it to act exactly as FDT would advocate (?). 
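As a toy check of that last claim (a back-of-the-envelope model, not a theorem): suppose a causal decision theorist in Newcomb's problem assigns probability p to currently being Omega's simulation, in which case its choice causally determines the opaque box's contents, and otherwise treats the already-completed simulation as having one-boxed with some fixed probability q it cannot influence. If it values the real agent's payout, one-boxing wins whenever p is non-negligible:

```python
# Back-of-the-envelope Newcomb calculation for a CDT agent uncertain whether it
# is Omega's simulation. p = credence "I am the simulation" (so my choice
# causally fills the opaque box); q = probability the already-run simulation
# one-boxed, treated as fixed by CDT. Payoffs count the real agent's money.
# The model and its assumptions are illustrative only.

M, T = 1_000_000, 1_000  # opaque-box prize, transparent-box prize

def ev_one_box(p, q):
    # Simulation case: my one-boxing fills the box; the real agent then gets
    # M (if they one-box) or M + T (if they two-box).
    # Real case: box contents already fixed, filled with probability q.
    return p * (q * M + (1 - q) * (M + T)) + (1 - p) * (q * M)

def ev_two_box(p, q):
    # Simulation case: the box stays empty; the real agent gets 0 or T.
    # Real case: I take both boxes; the opaque box is filled with probability q.
    return p * ((1 - q) * T) + (1 - p) * (q * M + T)

# The difference simplifies to p*M - (1-p)*T, so one-boxing wins for p > ~0.001.
for p in (0.5, 0.1, 0.01):
    for q in (0.0, 0.5, 1.0):
        better = "one-box" if ev_one_box(p, q) > ev_two_box(p, q) else "two-box"
        print(f"p={p:.2f}, q={q:.1f} -> {better}")
```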

If the adversary is capable of using the agent's code without running it, and if such situations are a major factor in agent design, I suspect that we have left the realm of decision theory and entered the realm of engineering. Many precise details of the environment inform how we should build our agent, and I don't expect abstract general principles to be very important. In fact, if the agent's policy is not our main design consideration it may not even be meaningful to call it an agent - we are simply building a tool for some particular task (this is my objection to Orseau's space-time embedded intelligence). I do not expect any clean decision theory to  be useful in this case.

I have discussed the differences between optimal decisions, optimal policies, and optimal agent design at greater length here: https://www.lesswrong.com/posts/MwetLcBPvshg9ePZB/decision-theory-is-not-policy-theory-is-not-agent-theory [LW · GW]

In the language of the linked post, attempts to construct an "ideal" decision theory often address agent theory problems with policy theory methods, which is a type error that suggests to me there may not be a "best" solution. This is probably why (as far as I am aware) there is no precise formulation of functional decision theory. 

AIXI seems like a good guiding principle exactly when decision theory is the right frame for agent design. Worrying about whether the environment will read an AIXI approximation's source code and attempt to manipulate it seems like a more prosaic concern, similar to tweaking AIXI's implementation to fit the conditions of our particular universe well. The former problem should be addressed with good operational security and cryptography, and the latter should be addressed through e.g. feature engineering.

Bounded Rationality

Source: https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1 [LW · GW]

Abram Demski argues that Solomonoff induction is an unrealistic model of bounded rationality because it does not capture e.g. logical uncertainty. I think this is probably true, but not necessarily a criticism of the AIT/AIXI framework. In practice, approximations to AIXI must grapple with bounded computational resources, while AIXI provides a valuable "limit point." However, I am not convinced that the specific mechanism chosen for dealing with computational limits will be essential to understanding the important properties of early superintelligences, except possibly through the tiling concerns discussed in the next section. Certainly it is very difficult to predict at this point what form future bounded reasoners will take in detail; for instance, I am not comfortable abandoning Solomonoff induction in favor of logical (Garrabrant) induction because I am not convinced the latter has staying power (though I am still fairly enthusiastic about LI).

Tiling / Recursive Self-improvement

Source: https://www.lesswrong.com/tag/tiling-agents [? · GW]

The tiling agenda is not a criticism of AIXI, but is connected to Nate Soares' criticism that AIXI is not useful for studying self-modification:

However, AIXI is not a good toy model for investigating the construction of a safe and powerful AGI. This is not just because AIXI is uncomputable (and its computable counterpart AIXItl infeasible). Rather, it's because AIXI cannot self-modify. This fact is fairly obvious from the AIXI formalism: AIXI assumes that in the future, it will continue being AIXI. This is a fine assumption for AIXI to make, as it is a very powerful agent and may not need to self-modify. But this inability limits the usefulness of the model. Any agent capable of undergoing an intelligence explosion must be able to acquire new computing resources, dramatically change its own architecture, and keep its goals stable throughout the process. The AIXI formalism lacks tools to study such behavior.

which is a part of (though arguably not tightly related to the rest of) his post discussed under the section "code exposure."

I agree that AIXI is not a model of self-improving A.I. I see some interesting research directions investigating how embedded versions of the AIXI model can cast light on recursive self-improvement[1], but the original AIXI model has nothing to say about it.

However, I don't view safe tiling as the primary obstacle to alignment. Constructing even a modestly superhuman agent which is aligned to human values would put us in a drastically stronger position and currently seems out of reach. If necessary, we might like that agent to recursively self-improve safely, but that is an additional and distinct obstacle. It is not clear that we need to deal with recursive self-improvement below human level. 

Tiling research is also sometimes justified in terms of humans "tiling ourselves," usually by building an A.G.I. that we hope shares our values. I am not sure that the tiling frame is useful here - this first step requires exactly a solution to the alignment problem, nothing more or less. For instance, I don't think of logical uncertainty about the computations of the A.G.I. as the primary difficulty - the primary difficulty is importing our values into its ontology.  

Solipsism (Malign Universal Prior)

Source: https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/

Paul Christiano argues that the universal distribution is "malign" or adversarial in some sense. His argument is fairly sophisticated but I will attempt to compress it. The universal distribution randomly samples programs and simulates them. Simpler programs are more likely, but runtime is not taken into account, so some likely programs might construct whole universes which eventually evolve (?) intelligent life, including agents and civilizations that may want to influence the universal distribution for their own ends. Possibly they believe in the mathematical universe (not Christiano's words) and realize that the universal distribution is a common tool for prediction, deduce which simple alternate universes will use it for prediction and under which precise circumstances, and somehow alter the properties of their own universe to shift those predictions (perhaps through a scheme to subtly "export their values" to another universe).

There are several difficulties with this argument, all of which are that there is no mathematical proof. People just say things about the universal distribution. It's nice that my research area is popular to talk about, but please, I am begging you to prove theorems. It is much harder to correctly intuit things about algorithmic information theory than many seem to think.

It is an interesting thought experiment though, so I will dig into it a little - but without doing the full conversation (on LessWrong) justice. To avoid confusion, the universe using the universal distribution will be called Predictoria and the universe attempting to influence it will be called Adversaria. Since both might conceivably make predictions at some point, the terminology distinguishes our mode of thinking about each universe - residents of Predictoria are the ones worrying about whether the universal distribution is malign. Also, I will call the universal distribution M[2].
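For reference, M is Solomonoff's universal (monotone) semimeasure, which formalizes the "randomly sample programs and simulate them" picture above:

$$M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},$$

where the sum is over (minimal) programs $p$ that cause the universal monotone machine $U$ to output a string beginning with $x$, and $\ell(p)$ is the length of $p$.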

First, it is worth reframing this situation from Predictoria's perspective - according to M, the behavior of Adversaria is only relevant insofar as it predicts future observations. By Bayes' rule, we strongly prefer hypotheses that explain past observations. This objective - predicting the future based on the past - is called the "prequential problem." This means that Predictoria (through M) cares about Adversaria only insofar as Adversaria is producing the history of Predictoria. In other words, Predictoria is worried that it is "inside of" Adversaria, running in a simulation[3].

That means that Christiano's argument seems to depend on (a version of) the simulation hypothesis. It is only reasonable insofar as M believes being in a simulation is likely - ordinary people don't seem to believe this, but perhaps smarter agents are more likely to, and perhaps the uniqueness[4] of the first A.G.I. makes it more likely to think so for anthropic reasons. This is one reason that Vanessa Kosoy recommends infra-Bayesian physicalism over M - apparently it can construct predictors that do not believe they are special (I do not understand infra-Bayesianism). My objection to this argument is that it assumes not only that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would otherwise infer unless the simulation hypothesis helps to explain observations, but also that Predictoria is able to anticipate some departure from that baseline physics. This seems unlikely to weigh heavily on Predictoria for reasons of both epistemic and logical uncertainty - epistemic because various simulators may intervene at different points and in different directions, and logical because in practice Predictoria will never use precisely M but rather some approximation, and it is hard to anticipate the actions of civilizations in long-running universes (@nostalgebraist [LW · GW] argues this in more depth here [LW · GW], I believe correctly). Also, as time goes to infinity in Predictoria, any departure from baseline physics will look increasingly unlikely, along the lines of Laplace's rule, though with a preference for departures at simple times. The whole situation recalls my beliefs about the existence of God - the God hypothesis seems either useless precisely because it can apparently explain anything[5] (in its deistic form) or repeatedly dis-confirmed (in its fundamentalist form).
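To spell out that last point as a heuristic (not a theorem about M): after n steps with no departure from baseline physics, a Laplace-style estimate puts the chance of a departure on the next step around 1/(n+2), while the algorithmic version of the same intuition weights a first departure at time t by roughly $2^{-K(t)}$, which for typical t behaves like 1/t up to logarithmic factors, with spikes at especially simple times:

$$P(\text{departure at step } n+1 \mid \text{none so far}) \approx \frac{1}{n+2}, \qquad P(\text{first departure at time } t) \approx c \cdot 2^{-K(t)}.$$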

Basically, Christiano's argument seems weak and becomes less important as more observations are collected that do not favor the adversarial simulation hypothesis. 

Bad Priors

Source: https://proceedings.mlr.press/v40/Leike15.html

Jan Leike showed that with some aggressively bad choices of UTM, AIXI "misbehaves" in the sense that it executes some non-general policy such as always taking the same action. Relatedly, AIXI does not have any good convergence guarantees (except the self-optimizing property when the environment class in its Bayesian mixture is sufficiently restricted, in which case it is not fully general). Leike's result essentially showed that such guarantees (which were previously unproven) actually do not exist. This has been viewed as the end of the AIXI research program, or at least as showing AIXI is not a truly general intelligence.

I think this result is actually natural and not really an obstacle to the theory. It means that AIXI's prior is not some kind of platonic ultimate "ignorance prior"; rather, the choice of UTM actually encodes some knowledge, which is not very surprising, since a concept can intuitively be more or less simple depending on the choice of language used to express it. The same constant difference in complexities (and constant factor between universal distributions) also appears in the case of pure prediction, but Solomonoff induction still eventually converges to optimal prediction in any (lower semi)computable environment. The problem in the interactive case arises because AIXI will choose not to explore when exploration is judged too risky (negative expected value), meaning that the data necessary to correct its beliefs is never obtained. This seems like a reasonable feature, not a bug. In fact, AIXI still learns to predict properly on-policy, which means that its misconceptions can still be corrected if we are willing to provide it an appropriate interaction history as data (instead of simply setting it loose to discover what its actions do). As far as I am concerned, this is not a defect of the AIXI model but instead the best one can expect from a general agent - though perhaps there is still something to be said about more or less natural (ignorance?) UTMs.
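The contrast with pure prediction can be made precise with two standard facts: universal distributions defined from different reference machines dominate each other up to a multiplicative constant, and Solomonoff's bound turns that constant into a finite total prediction error, so passive prediction always recovers from a bad choice of UTM (while no analogous on-policy guarantee rescues AIXI's exploration decisions):

$$M_U(x) \;\ge\; 2^{-c_{UV}}\, M_V(x) \ \text{ for all } x, \qquad \sum_{t=1}^{\infty} \mathbb{E}_\mu\Big[\mathrm{KL}\big(\mu(\cdot \mid x_{<t}) \,\big\|\, M(\cdot \mid x_{<t})\big)\Big] \;\le\; K(\mu)\ln 2,$$

so switching the reference machine changes $K(\mu)$, and hence the total error bound, only by an additive constant.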

Conclusions

I am concerned about the proliferation of disjoint approaches to agent foundations. In particular, it seems to me that AIT/AIXI is both the natural paradigm and a sort of Schelling point. I am a particularly strong AIXI enthusiast (obviously) so my experience is not typical, but it does seem to me that AIXI is the most central frame in the sense that most agent foundations researchers have at least a passing familiarity with it and make light use of AIT at least for intuition. For this reason it is concerning how sparse the contributions of agent foundations research to the mathematical theory of AIXI have been. In a pre-paradigmatic situation like this, building out the fundamental underlying tools and concepts seems very important. 

Perhaps the strongest criticism is that AIXI does not contain an ideal model for reflection/recursion, and I sympathize with the desire to set the complicating details of AIT aside and try to develop a perfect embedded decision theory. However, I suspect that this fixation on "infinitely recursive" epistemics is perhaps more intellectually stimulating than useful. As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory. Also, in practice the usefulness of "(thinking about)* thinking" probably degrades pretty quickly past the first level. I am more enthusiastic about work on self-reflection within the AIXI framework - I suppose my crux is that, rather than adding needless conceptual complication, the richness of the AIXI model may be necessary to demonstrate the collapse of the recursive hierarchy, if this collapse takes place "in practice" for sensible agents.

  1. ^

    Primarily, as discussed in the general response to embeddedness concerns, it would be nice to show that direct AIXI approximations can't self-improve, and to understand the minimal conditions that may give rise to self-improvement. Unlike (early, though not current) MIRI, I want to delay recursively self-improving A.I.

  2. ^

    This is the standard notation for the continuous version relevant to sequence prediction. Sometimes boldface is used.

  3. ^

    Rather than Adversaria pushing Predictoria around through some spooky under-the-table influence. As a general rule, Bayesians always believe stuff for a reason and their thoughts can't rebel against them - every possibility a Bayesian considers tracks something that might affect its interests in the real world. 

  4. ^

    If so, we might reduce this effect very slightly by running a few instances in parallel?

  5. ^

5 comments


comment by Vladimir_Nesov · 2025-01-07T18:36:14.202Z · LW(p) · GW(p)

As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory.

I think most FDT/embeddedness weirdness is about explaining the environment using bounded computations that are not (necessarily) literally already found in the environment as part of it. Not about sharing the actual source code, just any information about what's going on, captured in the form of computations, known to have captured that information before they are carried out. Things like static program analysis and deep learning models try to do this, but don't confront the weirdness of FDT/embeddedness.

Solomonoff induction is a very clean way of doing something like this, but doesn't go into decision theory. AIXI is closest to both doing it cleanly and confronting the weirdness, but something basic might be missing to make it applicable, that should be possible to fix.

comment by lemonhope (lcmgcd) · 2025-01-07T18:13:30.600Z · LW(p) · GW(p)

Could you say more about where the whole sequence is going / what motivated it? I am curious.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-07T19:45:24.322Z · LW(p) · GW(p)

The goal of the sequence is to motivate AIXI as an agent foundations research agenda. I will attempt to formulate and hopefully resolve some agent foundations problems rigorously within the AIXI framework and discuss alignment of AIXI-like models. This will probably include drawing on or linking the papers I write over the course of my PhD, along with more informal discussions like this one. The theory of impact for this sequence is to establish AIXI as a standard paradigm for agent foundations and hopefully elicit more academic-quality papers + find collaborators.  

comment by Archimedes · 2025-01-08T00:27:18.427Z · LW(p) · GW(p)

I’m new to the term AIXI and went three links deep before I learned what it refers to. I’d recommend making this journey easier for future readers by linking to a definition or explanation near the beginning of the post.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-08T01:26:41.958Z · LW(p) · GW(p)

Sure. It's supposed to be read as part of the AIXI agent foundations sequence, I'll link to that at the top.