Rebuttals for ~all criticisms of AIXI

post by Cole Wyeth (Amyr) · 2025-01-07T17:41:10.557Z · LW · GW · 5 comments

Contents

  Embeddedness
    Cartesian boundary -> Nihilism
    The Anvil Problem
    Code exposure
  Functional decision theory > causal decision theory
  Bounded Rationality
  Tiling / Recursive Self-improvement
  Solipsism (Malign Universal Prior)
  Bad Priors
  Conclusions
5 comments

Written as part of the AIXI agent foundations sequence [? · GW], underlying research supported by the LTFF.

Epistemic status: In order to construct a centralized defense of AIXI I have given some criticisms less consideration here than they merit. Many arguments will be (or already are) expanded on in greater depth throughout the sequence.  

With the possible exception of the learning-theoretic agenda, most major approaches to agent foundations research construct their own paradigms and mathematical tools, which are not based on AIXI. Nothing in 2024's shallow review of technical AI safety [LW · GW] seems to advance the theory of AIXI or even use its tools. Academic publications on the topic are also quite sparse (in my opinion, some of the last major progress took place during Jan Leike's PhD work in the 2010s). The standard justification is that AIXI fails as a realistic model of superintelligence, particularly for the aspects we care about for alignment, such as recursive self-improvement or the construction of superintelligence. Some of the criticisms have merit, but most of them are so informal that it is not clear whether or precisely why they are correct.

The "why" is centrally important. If AIXI is an excellent model for a restricted setting of superintelligence (arguably far more appropriate than any competing model), as even its critics usually accept, and if it can be shown to fall short in some cases relevant to alignment obstacles, then the precise details are likely to be useful for understanding the nature of those alignment obstacles, possibly even illuminating when those obstacles arise. Similarly, failures of AIXI as a recipe for constructing recursively self-improving superintelligence should illustrate how recursively self-improving superintelligence can and cannot arise. For these reasons, the natural response to encountering limitations of AIXI is to dig deeply into the mathematical details of those limitations and beyond. This is in line with the way professional mathematicians and theoretical computer scientists usually respond to analogous situations (computational complexity theorists are a particularly stark example, since they have a habit of routinely turning their roadblocks into useful results). In this post I will attempt to address nearly every common objection to the AIXI model and (usually) explain the research directions each one opens (some of these are already listed in my more general and exhaustive list of open problems [LW · GW]).

Embeddedness

General criticism: AIXI is not an embedded agent but a Cartesian dualist. It assumes it is separate from the environment, interacting with it only by passing actions and percepts across a well-defined channel. This is not a realistic model of an A.G.I. (particularly e.g. a robot) constructed within our universe.

General response: This is true; AIXI was constructed as the optimal reinforcement learning agent (at a time when RL seemed like a likely path to superintelligence), and probably for that reason it interacts with reality as if it were a game it is attempting to learn to play. This does seem philosophically troubling, and a direct naive AIXI implementation probably could not (and should not) become a singleton, but the same could be said for any other available paradigm.

However, it seems likely to me, based on various live research projects (Self-AIXI, reflective oracles), that some embarrassingly small modification of AIXI can overcome or at least patch embeddedness problems and achieve superintelligence. There are also empirical reasons to expect this. Humans survived for thousands (?) of years as dualists before the materialist worldview started to become dominant. Evolution must have patched embeddedness problems (e.g. through pain), but apparently not very thoroughly, and we still eventually discovered neuroscience - though perhaps conscious belief is simply the wrong level of abstraction to discuss human epistemics. Beyond human intelligence, many seem to expect that LLM agents will soon become A.G.I., but the standard LLM training process doesn't seem to address embeddedness in any way. No major paradigm of ML research seems poised to overcome the philosophical problems of embeddedness, and if one expects (for instance) roboticists to naturally work around it, why not expect small patches to AIXI to do the same? If dealing effectively with embeddedness is a serious obstacle to e.g. recursive self-improvement, it would be comforting to prove that (and it would perhaps suggest a method for constructing useful but not highly dangerous systems by avoiding the types of training that can select embedded agents). Unfortunately, this should not be too comforting, because it is still conceivable that increasingly powerful A.I. systems will automate the construction of embedded successor systems once they strongly exceed human intelligence (or at least massively accelerate research in this direction).

Cartesian boundary -> Nihilism

Source: https://www.lesswrong.com/posts/AfbY36m8TDYZBjHcu/aixi-and-existential-despair [LW · GW]

Paul Christiano argues that an AIXI implementation would learn physics and discover that its actions (or perhaps the external manifestation of its actions) are best explained as computed by the computer it is running on because this is simpler than AIXI actually choosing them while its computer just happens to compute the same actions  (in a way the opposite of an embeddedness problem). In this case he expects it to become nihilistic and erratic because none of its actions are anticipated to have any effect, so it considers only wild hypotheticals where they might. 

There are two problems with this argument. 

First, AIXI does not work like this. Marcus Hutter extended Solomonoff induction to the RL setting by providing the actions as a free extra input to the environment's "chronological" Turing machine. As long as the actions AIXI sees its actuators performing in the world match its record of its own action choices, this is always a perfect explanation for their source, and one which is strictly simpler than any physical explanation from its perspective. If AIXI notices that its observed actions sometimes fail to be optimal (because of tampering from the outside world, or perhaps because it is becoming smarter with time as it is given more compute and retroactively checks its previous actions), then it will correctly learn that the environment can affect the extent to which its actuators obey the AIXI policy. I have argued here [LW · GW] that this is actually a positive which may allow approximations to (a small extension of) AIXI to succeed as embedded agents (which is actually in agreement with Christiano's argument that AIXI may realize it is embedded).
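For concreteness, here is Hutter's standard AIXI equation, in which the agent's own past and hypothesized future actions $a_{1:m}$ are supplied as side input to each candidate chronological program $q$, so the environment hypothesis never has to "explain" where the actions come from:

$$a_k \;=\; \arg\max_{a_k} \sum_{o_k r_k} \cdots \max_{a_m} \sum_{o_m r_m} \big[ r_k + \cdots + r_m \big] \sum_{q \,:\, U(q, a_{1:m}) = o_1 r_1 \ldots o_m r_m} 2^{-\ell(q)}$$

where $U$ is a universal chronological Turing machine, $\ell(q)$ is the length of program $q$, and $m$ is the horizon.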

The second problem is that even if @paulfchristiano [LW · GW] were right, such "nihilistic" considerations would probably just cancel out of AIXI's expected utility calculations. That seems like the baseline expectation for an agent which believes there is some chance its actions don't matter and some positive chance that they do, and Christiano's argument to the contrary does not seem convincing to me.

Interestingly, there are AIXI-like models that do treat both their actions and percepts as one sequence and predict it with Solomonoff induction. This raises the question of how actions should be planned. The natural idea is one-step-ahead "Q-value" maximization, which I haven't seen published in that precise form (investigating its properties is an important research problem, because it seems like it may be a closer model of LLM agents than AIXI). However, Self-AIXI and MIRI's reflective version of AIXI are similar. There is also a flawed attempt to use the joint distribution for planning in Hutter's old Universal Artificial Intelligence book (the joint distribution is not a chronological semimeasure, so it can't reasonably be used in that way). I am working in this area currently.
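To make the one-step-ahead idea slightly more concrete, here is a minimal sketch (not a published algorithm, and every interface is a hypothetical placeholder): a joint predictor over the single interleaved action/percept sequence is queried for the expected reward of each candidate next action, and the greedy action is taken.

```python
# Sketch only: `model` stands in for an approximation of Solomonoff induction
# over the interleaved sequence a_1 e_1 a_2 e_2 ... of actions and percepts.
# model.prob(symbol, history) is assumed to return the conditional probability
# of the next symbol given the history; reward_of(percept) extracts the reward
# component of a percept. Both interfaces are hypothetical placeholders.

def one_step_q(model, history, action, percepts, reward_of):
    """Expected immediate reward of taking `action` after `history`."""
    return sum(model.prob(e, history + [action]) * reward_of(e) for e in percepts)

def greedy_act(model, history, actions, percepts, reward_of):
    """One-step-ahead 'Q-value' maximization: pick the action whose predicted
    next percept has the highest expected reward. Deeper rollouts or a
    bootstrapped value term are natural variants, omitted here."""
    return max(actions, key=lambda a: one_step_q(model, history, a, percepts, reward_of))
```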

The Anvil Problem

Source: https://www.lesswrong.com/tag/anvil-problem [? · GW]

The argument is that because AIXI assigns 0 probability to being embedded in its environment, a direct approximation would happily drop an anvil on the computer running it. My post linked in the last section is actually a direct response to this criticism: https://www.lesswrong.com/posts/WECqiLtQiisqWvhim/free-will-and-dodging-anvils-aixi-off-policy [LW · GW]

Code exposure

Source: https://www.lesswrong.com/posts/8Hzw9AmXHjDfZzPjo/failures-of-an-embodied-aixi [LW · GW]

Nate Soares argues that AIXI / an AIXI approximation could not learn to negotiate with a "mind reader" with access to its code, because AIXI has no concept of the environment accessing its code directly. This means it could not, say, choose to internally execute a certain program in order to legibly pre-commit to a bargain.

I don't really understand the fascination with this kind of issue on LessWrong. Such situations seem to occur approximately never in practice (even the slightly more realistic example of an agent choosing to avoid expensive computations so that it does not produce too much heat seems far-fetched). It's possible that pre-commitment and deception in humans are weakly analogous, but the frame of "exposed code" seems like a serious exaggeration.

There is a grain of truth here which I will discuss at greater length in the next section, but here I will focus very narrowly on the argument that @So8res [LW · GW] presented.

The philosophical issue with the argument is that an embedded AIXI is not really possible. There is no code that implements AIXI since it is not computable. The idea of an adversary being given access to AIXI's code is therefore nonsense; Nate largely avoids this error by discussing a particular AIXI approximation called AIXItl. However, as will become apparent, it can matter how AIXI is approximated.

How would humans perform in this "exposed code" negotiation task? The closest analogue seems to be negotiating with (say) a doctor while your brain is being scanned. Perhaps the doctor will only agree to cure a life-threatening condition if the brain scan shows X. Clearly, for most values of X a human would not be able to perform well, because we do not have fine-grained access to our own neurons. The best that one could do is probably to direct attention in some high-level way by choosing to think certain types of thoughts. Come to think of it, this sort of attention direction seems really useful for agents with bounded rationality. A practical AIXI approximation should probably also be equipped with this ability; perhaps an "inner" kernel/root-level direct AIXI approximation chooses the computations/thoughts of an "outer" wrapper and observes their progress and outcomes in addition to percepts from the environment. Essentially, a mental workspace might form an important extension of the action/percept space. While it does seem a little inelegant to stop here (rather than seek some kind of Gödelian unbounded stack of recursive levels), this basic suggestion already seems to take us to human-level performance on code exposure tasks. In fact, introspective access to one's thoughts is clearly necessary for any agent to succeed at such tasks, except in special cases. But no shift from the AIXI paradigm seems necessary.
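A toy agent loop in the shape of that inner/outer suggestion might look like the sketch below. Everything here (the kernel policy, the workspace, the environment, and their methods) is a hypothetical placeholder used only to illustrate the architecture, not a proposed implementation.

```python
# Toy sketch of the inner/outer architecture described above. The inner kernel
# (an AIXI approximation in spirit) chooses either an external action or an
# internal computation ("thought"); the workspace runs requested computations,
# and their outcomes return to the kernel as additional percepts. All object
# interfaces are hypothetical placeholders.

def run_agent(kernel, workspace, environment, steps=100):
    history = []
    for _ in range(steps):
        act = kernel.choose(history)  # either an external action or a thought
        if act.kind == "think":
            outcome = workspace.run(act.computation)       # internal computation
            percept = ("internal", outcome)
        else:
            percept = ("external", environment.step(act.command))
        # Thoughts and their outcomes become part of the same action/percept
        # stream that the kernel conditions on, extending its workspace.
        history.append((act, percept))
    return history
```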

Functional decision theory > causal decision theory

Source: Eliezer Yudkowsky, particularly here

Eliezer Yudkowsky [? · GW] claims that if we build an A.G.I. with the wrong decision theory (e.g. AIXI's causal decision theory) then some alien superintelligence can come along and take our lunch money through game-theoretic tricks such as (perhaps?) credibly threatening mutually assured destruction unless large bribes are paid out. The problem with this claim is that a superintelligence designed to use causal decision theory can recognize this threat and either keep its source code secret or design its successor system to be impossible to exploit (perhaps by using functional decision theory or some other idea we haven't come up with at our modest intelligence level). It doesn't seem necessary to resolve all game theory problems before building the first superintelligence, or even before building the first singleton.

Still, it is worth taking seriously whether AIXI's decision theory is in fact "wrong." Most objections to CDT take the form of some adversary gaining access to an agent's source code as in the previous section. The classic example is Newcomb's problem [? · GW]. It is certainly the case that there are some "exposed code" universes where CDT implementations do not perform optimally, but I suspect this is rarer than it's often made out to be. Usually one assumes that the adversary only cares about what policy the agent's code implements. For instance, Omega only cares whether you one-box or two-box, and not whether you are implemented in Java or C++. Since it is usually very hard to determine what a program will do without executing it, the adversary will usually simulate the agent. But as far as I can tell, a causal decision theorist does fine in this case because uncertainty about whether or not it is in a simulation causes it to act exactly as FDT would advocate (?). 
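As a toy check of that last claim (a back-of-the-envelope model, not a theorem): suppose a causal decision theorist in Newcomb's problem assigns probability p to currently being Omega's simulation, in which case its choice causally determines the opaque box's contents, and otherwise treats the already-completed simulation as having one-boxed with some fixed probability q it cannot influence. If it values the real agent's payout, one-boxing wins whenever p is non-negligible:

```python
# Back-of-the-envelope Newcomb calculation for a CDT agent uncertain whether it
# is Omega's simulation. p = credence "I am the simulation" (so my choice
# causally fills the opaque box); q = probability the already-run simulation
# one-boxed, treated as fixed by CDT. Payoffs count the real agent's money.
# The model and its assumptions are illustrative only.

M, T = 1_000_000, 1_000  # opaque-box prize, transparent-box prize

def ev_one_box(p, q):
    # Simulation case: my one-boxing fills the box; the real agent then gets
    # M (if they one-box) or M + T (if they two-box).
    # Real case: box contents already fixed, filled with probability q.
    return p * (q * M + (1 - q) * (M + T)) + (1 - p) * (q * M)

def ev_two_box(p, q):
    # Simulation case: the box stays empty; the real agent gets 0 or T.
    # Real case: I take both boxes; the opaque box is filled with probability q.
    return p * ((1 - q) * T) + (1 - p) * (q * M + T)

# The difference simplifies to p*M - (1-p)*T, so one-boxing wins for p > ~0.001.
for p in (0.5, 0.1, 0.01):
    for q in (0.0, 0.5, 1.0):
        better = "one-box" if ev_one_box(p, q) > ev_two_box(p, q) else "two-box"
        print(f"p={p:.2f}, q={q:.1f} -> {better}")
```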

If the adversary is capable of using the agent's code without running it, and if such situations are a major factor in agent design, I suspect that we have left the realm of decision theory and entered the realm of engineering. Many precise details of the environment inform how we should build our agent, and I don't expect abstract general principles to be very important. In fact, if the agent's policy is not our main design consideration it may not even be meaningful to call it an agent - we are simply building a tool for some particular task (this is my objection to Orseau's space-time embedded intelligence). I do not expect any clean decision theory to  be useful in this case.

I have discussed the differences between optimal decisions, optimal policies, and optimal agent design at greater length here: https://www.lesswrong.com/posts/MwetLcBPvshg9ePZB/decision-theory-is-not-policy-theory-is-not-agent-theory [LW · GW]

In the language of the linked post, attempts to construct an "ideal" decision theory often address agent theory problems with policy theory methods, which is a type error that suggests to me there may not be a "best" solution. This is probably why (as far as I am aware) there is no precise formulation of functional decision theory. 

AIXI seems like a good guiding principle exactly when decision theory is the right frame for agent design. Worrying about whether the environment will read an AIXI approximation's source code and attempt to manipulate it seems like a more prosaic concern, similar to tweaking AIXI's implementation to fit the conditions of our particular universe well. The former problem should be addressed with good operational security and cryptography, and the latter should be addressed through e.g. feature engineering.

Bounded Rationality

Source: https://www.lesswrong.com/posts/xJyY5QkQvNJpZLJRo/radical-probabilism-1 [LW · GW]

Abram Demski argues that Solomonoff induction is an unrealistic model of bounded rationality because it does not capture e.g. logical uncertainty. I think this is probably true, but not necessarily a criticism of the AIT/AIXI framework. In practice, approximations to AIXI must grapple with bounded computational resources, while AIXI provides a valuable "limit point." However, I am not convinced that the specific mechanism chosen for dealing with computational limits will be essential to understanding the important properties of early superintelligences, except possibly through the tiling concerns discussed in the next section. Certainly it is very difficult to predict at this point what form future bounded reasoners will take in detail; for instance, I am not comfortable abandoning Solomonoff induction in favor of logical (Garrabrant) induction because I am not convinced the latter has staying power (though I am still fairly enthusiastic about LI).

Tiling / Recursive Self-improvement

Source: https://www.lesswrong.com/tag/tiling-agents [? · GW]

The tiling agenda is not a criticism of AIXI, but is connected to Nate Soares' criticism that AIXI is not useful for studying self-modification:

However, AIXI is not a good toy model for investigating the construction of a safe and powerful AGI. This is not just because AIXI is uncomputable (and its computable counterpart AIXItl infeasible). Rather, it's because AIXI cannot self-modify. This fact is fairly obvious from the AIXI formalism: AIXI assumes that in the future, it will continue being AIXI. This is a fine assumption for AIXI to make, as it is a very powerful agent and may not need to self-modify. But this inability limits the usefulness of the model. Any agent capable of undergoing an intelligence explosion must be able to acquire new computing resources, dramatically change its own architecture, and keep its goals stable throughout the process. The AIXI formalism lacks tools to study such behavior.

which is a part of (though arguably not tightly related to the rest of) his post discussed under the section "code exposure."

I agree that AIXI is not a model of self-improving A.I. I see some interesting research directions investigating how embedded versions of the AIXI model can cast light on recursive self-improvement[1], but the original AIXI model has nothing to say about it.

However, I don't view safe tiling as the primary obstacle to alignment. Constructing even a modestly superhuman agent which is aligned to human values would put us in a drastically stronger position and currently seems out of reach. If necessary, we might like that agent to recursively self-improve safely, but that is an additional and distinct obstacle. It is not clear that we need to deal with recursive self-improvement below human level. 

Tiling research is also sometimes justified in terms of humans "tiling ourselves," usually by building an A.G.I. that we hope shares our values. I am not sure that the tiling frame is useful here - this first step requires exactly a solution to the alignment problem, nothing more or less. For instance, I don't think of logical uncertainty about the computations of the A.G.I. as the primary difficulty - the primary difficulty is importing our values into its ontology.  

Solipsism (Malign Universal Prior)

Source: https://ordinaryideas.wordpress.com/2016/11/30/what-does-the-universal-prior-actually-look-like/

Paul Christiano argues that the universal distribution is "malign" or adversarial in some sense. His argument is fairly sophisticated but I will attempt to compress it. The universal distribution randomly samples programs and simulates them. Simpler programs are more likely, but runtime is not taken into account, so some likely programs might construct whole universes which eventually evolve (?) intelligent life, including agents and civilizations that may want to influence the universal distribution for their own ends. Possibly they believe in the mathematical universe (not Christiano's words) and realize that the universal distribution is a common tool for prediction, deduce which simple alternate universes will use it for prediction and under which precise circumstances, and somehow alter the properties of their own universe to shift those predictions (perhaps through a scheme to subtly "export their values" to another universe).

There are several difficulties with this argument, all of which are that there is no mathematical proof. People just say things about the universal distribution. It's nice that my research area is popular to talk about, but please, I am begging you to prove theorems. It is much harder to correctly intuit things about algorithmic information theory than many seem to think.

It is an interesting thought experiment though, so I will dig into it a little - but without doing the full conversation (on LessWrong) justice. To avoid confusion, the universe using the universal distribution will be called Predictoria and the universe attempting to influence it will be called Adversaria. Since both might conceivably make predictions at some point, the terminology distinguishes our mode of thinking about each universe - residents of Predictoria are the ones worrying about whether the universal distribution is malign. Also, I will call the universal distribution M[2].
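For reference, M is Solomonoff's universal (monotone) semimeasure, which formalizes the "randomly sample programs and simulate them" picture above:

$$M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-\ell(p)},$$

where the sum is over (minimal) programs $p$ that cause the universal monotone machine $U$ to output a string beginning with $x$, and $\ell(p)$ is the length of $p$.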

First, it is worth reframing this situation from Predictoria's perspective - according to M, the behavior of Adversaria is only relevant insofar as it predicts future observations. By Bayes' rule, we strongly prefer hypotheses that explain past observations. This objective - predicting the future based on the past - is called the "prequential problem." This means that Predictoria (through M) cares about Adversaria only insofar as Adversaria is producing the history of Predictoria. In other words, Predictoria is worried that it is "inside of" Adversaria, running in a simulation[3].

That means that Christiano's argument seems to depend on (a version of) the simulation hypothesis. It is only reasonable insofar as M believes being in a simulation is likely - ordinary people don't seem to believe this, but perhaps smarter agents are more likely to, and perhaps the uniqueness[4] of the first A.G.I. makes it more likely to think so for anthropic reasons. This is one reason that Vanessa Kosoy recommends infra-Bayesian physicalism over M - apparently it can construct predictors that do not believe they are special (I do not understand infra-Bayesianism). My objection to this argument is that it assumes not only that Predictoria accepts it is plausibly being simulated by Adversaria, which seems like a pure complexity penalty over the baseline physics it would otherwise infer unless the simulation hypothesis helps to explain observations, but also that Predictoria is able to anticipate some departure from that baseline physics. This seems unlikely to weigh heavily on Predictoria for reasons of both epistemic and logical uncertainty - epistemic because various simulators may intervene at different points and in different directions, and logical because in practice Predictoria will never use precisely M but rather some approximation, and it is hard to anticipate the actions of civilizations in long-running universes (@nostalgebraist [LW · GW] argues this in more depth here [LW · GW], I believe correctly). Also, as time goes to infinity in Predictoria, any departure from baseline physics will look increasingly unlikely, along the lines of Laplace's rule, though with a preference for departures at simple times. The whole situation recalls my beliefs about the existence of God - the God hypothesis seems either useless precisely because it can apparently explain anything[5] (in its deistic form) or repeatedly dis-confirmed (in its fundamentalist form).
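To spell out that last point as a heuristic (not a theorem about M): after n steps with no departure from baseline physics, a Laplace-style estimate puts the chance of a departure on the next step around 1/(n+2), while the algorithmic version of the same intuition weights a first departure at time t by roughly $2^{-K(t)}$, which for typical t behaves like 1/t up to logarithmic factors, with spikes at especially simple times:

$$P(\text{departure at step } n+1 \mid \text{none so far}) \approx \frac{1}{n+2}, \qquad P(\text{first departure at time } t) \approx c \cdot 2^{-K(t)}.$$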

Basically, Christiano's argument seems weak and becomes less important as more observations are collected that do not favor the adversarial simulation hypothesis. 

Bad Priors

Source: https://proceedings.mlr.press/v40/Leike15.html

Jan Leike showed that with some aggressively bad choices of UTM, AIXI "misbehaves" in the sense that it executes some non-general policy such as always taking the same action. Relatedly, AIXI does not have any good convergence guarantees (except the self-optimizing property when the environment class in its Bayesian mixture is sufficiently restricted, in which case it is not fully general). Leike's result essentially showed that such guarantees (which were previously unproven) actually do not exist. This has been viewed as the end of the AIXI research program, or at least as showing AIXI is not a truly general intelligence.

I think this result is actually natural and not really an obstacle to the theory. It means that AIXI's prior is not some kind of platonic ultimate "ignorance prior"; rather, the choice of UTM actually encodes some knowledge, which is not very surprising, since a concept can intuitively be more or less simple depending on the choice of language used to express it. The same constant difference in complexities (and constant factor between universal distributions) also appears in the case of pure prediction, but Solomonoff induction still eventually converges to optimal prediction in any (lower semi)computable environment. The problem in the interactive case arises because AIXI will choose not to explore when exploration is judged too risky (negative expected value), meaning that the data necessary to correct its beliefs is never obtained. This seems like a reasonable feature, not a bug. In fact, AIXI still learns to predict properly on-policy, which means that its misconceptions can still be corrected if we are willing to provide it an appropriate interaction history as data (instead of simply setting it loose to discover what its actions do). As far as I am concerned, this is not a defect of the AIXI model but instead the best one can expect from a general agent - though perhaps there is still something to be said about more or less natural (ignorance?) UTMs.
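The contrast with pure prediction can be made precise with two standard facts: universal distributions defined from different reference machines dominate each other up to a multiplicative constant, and Solomonoff's bound turns that constant into a finite total prediction error, so passive prediction always recovers from a bad choice of UTM (while no analogous on-policy guarantee rescues AIXI's exploration decisions):

$$M_U(x) \;\ge\; 2^{-c_{UV}}\, M_V(x) \ \text{ for all } x, \qquad \sum_{t=1}^{\infty} \mathbb{E}_\mu\Big[\mathrm{KL}\big(\mu(\cdot \mid x_{<t}) \,\big\|\, M(\cdot \mid x_{<t})\big)\Big] \;\le\; K(\mu)\ln 2,$$

so switching the reference machine changes $K(\mu)$, and hence the total error bound, only by an additive constant.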

Conclusions

I am concerned about the proliferation of disjoint approaches to agent foundations. In particular, it seems to me that AIT/AIXI is both the natural paradigm and a sort of Schelling point. I am a particularly strong AIXI enthusiast (obviously) so my experience is not typical, but it does seem to me that AIXI is the most central frame in the sense that most agent foundations researchers have at least a passing familiarity with it and make light use of AIT at least for intuition. For this reason it is concerning how sparse the contributions of agent foundations research to the mathematical theory of AIXI have been. In a pre-paradigmatic situation like this, building out the fundamental underlying tools and concepts seems very important. 

Perhaps the strongest criticism is that AIXI does not contain an ideal model for reflection/recursion, and I sympathize with the desire to set the complicating details of AIT aside and try to develop a perfect embedded decision theory. However, I suspect that this fixation on "infinitely recursive" epistemics is perhaps more intellectually stimulating than useful. As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory. Also, in practice the usefulness of "(thinking about)* thinking" probably degrades pretty quickly past the first level. I am more enthusiastic about work on self-reflection within the AIXI framework - I suppose my crux is that, rather than adding needless conceptual complication, the richness of the AIXI model may be necessary to demonstrate the collapse of the recursive hierarchy, if this collapse takes place "in practice" for sensible agents.

  1. ^

    Primarily, as discussed in the general response to embeddedness concerns, it would be nice to show that direct AIXI approximations can't self-improve, and to understand the minimal conditions that may give rise to self-improvement. Unlike (early, though not current) MIRI, I want to delay recursively self-improving A.I.

  2. ^

    This is the standard notation for the continuous version relevant to sequence prediction. Sometimes boldface is used.

  3. ^

    Rather than Adversaria pushing Predictoria around through some spooky under-the-table influence. As a general rule, Bayesians always believe stuff for a reason and their thoughts can't rebel against them - every possibility a Bayesian considers tracks something that might affect its interests in the real world. 

  4. ^

    If so, we might reduce this effect very slightly by running a few instances in parallel?

  5. ^

5 comments


comment by Vladimir_Nesov · 2025-01-07T18:36:14.202Z · LW(p) · GW(p)

As argued briefly in the section on FDT, the embedded agency frame may not have a clean mathematical decision theory.

I think most FDT/embeddedness weirdness is about explaining the environment using bounded computations that are not (necessarily) literally already found in the environment as part of it. Not about sharing the actual source code, just any information about what's going on, captured in the form of computations, known to have captured that information before they are carried out. Things like static program analysis and deep learning models try to do this, but don't confront the weirdness of FDT/embeddedness.

Solomonoff induction is a very clean way of doing something like this, but doesn't go into decision theory. AIXI is closest to both doing it cleanly and confronting the weirdness, but something basic might be missing to make it applicable, that should be possible to fix.

comment by lemonhope (lcmgcd) · 2025-01-07T18:13:30.600Z · LW(p) · GW(p)

Could you say more about where the whole sequence is going / what motivated it? I am curious.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-07T19:45:24.322Z · LW(p) · GW(p)

The goal of the sequence is to motivate AIXI as an agent foundations research agenda. I will attempt to formulate and hopefully resolve some agent foundations problems rigorously within the AIXI framework and discuss alignment of AIXI-like models. This will probably include drawing on or linking the papers I write over the course of my PhD, along with more informal discussions like this one. The theory of impact for this sequence is to establish AIXI as a standard paradigm for agent foundations and hopefully elicit more academic-quality papers + find collaborators.  

comment by Archimedes · 2025-01-08T00:27:18.427Z · LW(p) · GW(p)

I’m new to the term AIXI and went three links deep before I learned what it refers to. I’d recommend making this journey easier for future readers by linking to a definition or explanation near the beginning of the post.

Replies from: Amyr
comment by Cole Wyeth (Amyr) · 2025-01-08T01:26:41.958Z · LW(p) · GW(p)

Sure. It's supposed to be read as part of the AIXI agent foundations sequence, I'll link to that at the top.