Open Problems in AIXI Agent Foundations
post by Cole Wyeth (Amyr) · 2024-09-12T15:38:59.007Z · LW · GW · 2 comments
I believe that the theoretical foundations of the AIXI agent and its variations are a surprisingly neglected and high-leverage approach to agent foundations research. Though discussion of AIXI is pretty ubiquitous in A.I. safety spaces, underscoring AIXI's usefulness as a model of superintelligence, this is usually limited to poorly justified verbal claims about its behavior, which are sometimes questionable or wrong. This includes, in my opinion, a serious exaggeration of AIXI's flaws. For instance, in a recent post I proposed a simple extension of AIXI off-policy [LW · GW] that seems to solve the anvil problem [? · GW] in practice - in fact, in my opinion it has never been convincingly argued that the anvil problem would occur for an AIXI approximation. The perception that AIXI fails as an embedded agent seems to be one of the reasons it is often dismissed with a cursory link to some informal discussion.
However, I think AIXI research provides a more concrete and justified model of superintelligence than most subfields of agent foundations [1]. In particular, a Bayesian superintelligence must optimize some utility function using a rich prior, requiring at least structural similarity to AIXI. I think a precise understanding of how to represent this utility function may be a necessary part of any alignment scheme on pain of wireheading [? · GW]. And this will likely come down to understanding some variant of AIXI, at least if my central load-bearing claim is true: The most direct route to understanding real superintelligent systems is by analyzing agents similar to AIXI. Though AIXI itself is not a perfect model of embedded superintelligence, it is perhaps the simplest member of a family of models rich enough to elucidate the necessary problems and exhibit the important structure. Just as the Riemann integral is an important precursor of Lebesgue integration despite their qualitative differences, it would make no sense to throw AIXI out and start anew without rigorously understanding the limits of the model. And there are already variants of AIXI that surpass some of those limits, such as the reflective version that can represent other agents as powerful as itself.
This matters because the theoretical underpinnings of AIXI are still very spotty and contain many tractable open problems. In this document, I will collect several of them that I find most important - and in many cases am actively pursuing as part of my PhD research advised by Ming Li and Marcus Hutter. The AIXI (~= "universal artificial intelligence") research community is small enough that I am willing to post many of the directions I think are important publicly; in exchange I would appreciate a heads-up from anyone who reads a problem on this list and decides to work on it, so that we don't duplicate efforts (I am also open to collaborating).
The list is particularly tilted towards those problems with clear, tractable relevance to alignment OR philosophical relevance to human rationality. Naturally, most problems are mathematical. Particularly where they intersect recursion theory, these problems may have solutions in the mathematical literature I am not aware of (keep in mind that I am a lowly second year PhD student). Expect a scattering of experimental problems to be interspersed as well.
To save time, I will assume that the reader has a copy of Jan Leike's PhD thesis on hand. In my opinion, he has made much of the existing foundational progress since Marcus Hutter invented the model. Also, I will sometimes refer to the two foundational books on AIXI as UAI = Universal Artificial Intelligence and Intro to UAI = An Introduction to Universal Artificial Intelligence, and the canonical textbook on algorithmic information theory as Intro to K = An Introduction to Kolmogorov Complexity and Its Applications. Nearly all problems will require some reading to understand even if you are starting with a strong mathematical background. This document is written with the intention that a decent mathematician can read it, understand enough to find some subtopic interesting, refer to relevant literature and possibly ask me a couple of clarifying questions, and then be in a position to start solving problems.
I will write Solomonoff's universal distribution as M and Hutter's interactive environment version as ξ. There are occasional references to the iterative and recursive value functions - see Leike's thesis for details, but I think these can be viewed (for now informally) as the Riemann-Stieltjes and Lebesgue integrals of the reward sum, respectively, taken with respect to a semimeasure (and therefore unequal in general).
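For reference, here are the standard definitions; my notation is a reconstruction following Intro to K and UAI, and conventions vary slightly across sources:

```latex
% Solomonoff's universal distribution: for a universal monotone machine U,
% M(x) is the probability that a uniformly random program causes U to
% output a string beginning with x:
M(x) \;=\; \sum_{p \,:\, U(p) = x*} 2^{-|p|}

% Hutter's interactive version: a Bayesian mixture over lower
% semicomputable chronological semimeasures \nu with weights w_\nu > 0
% (typically w_\nu = 2^{-K(\nu)}), conditioned on the action sequence:
\xi(e_{1:t} \,\|\, a_{1:t}) \;=\; \sum_{\nu} w_\nu \, \nu(e_{1:t} \,\|\, a_{1:t})
```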
Computability Foundations
These problems are important because they inform our understanding of AIXI's reflective behavior - for instance, it is known that AIXI's belief distribution cannot represent copies of itself. They are also important for understanding the tractability of AIXI approximations.
- Derive sharp bounds on AIXI's computability level. There are serious gaps in the bounds given by Leike's discussion of computability.
- Refine the arithmetic hierarchy for real-valued functions introduced by Jan Leike - his Π⁰₁ level does not correspond correctly with the class of upper semicomputable functions for minor technical reasons (I am actively correcting this and the connected results in Intro to UAI, so please reach out to me for details). Also, computable functions with real domain are usually required to be continuous in computable analysis, which should be a more convenient definition; how should this requirement be extended to general recursive functions in the arithmetic hierarchy? (A candidate definition is sketched after this list.)
- Where precisely does AIXI lie in the range Δ⁰₂–Δ⁰₃? Is it complete for any level?
- Where precisely do ε-approximations of AIXI lie? This is a particularly important question because we know a Δ⁰₂ = limit-computable approximation is possible, and this is the canonical "real world" approximation I have in mind when considering embeddedness issues.
- Understand the computability properties of reflective oracles.
- Tighten the bounds on the best achievable computability level of reflective oracles.
- Can any solution to the grain of truth problem powerful enough to represent all estimable measures that is still limit-computable be represented in terms of probabilistic Turing Machines with reflective oracle access?
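For orientation, here is a minimal sketch of a real-valued arithmetical hierarchy in the spirit of Leike's thesis - this is a candidate definition for the refinement problem above, not a settled one:

```latex
f \in \Sigma^0_n \;:\Longleftrightarrow\; \{(x,q) \in X \times \mathbb{Q} : f(x) > q\} \in \Sigma^0_n
f \in \Pi^0_n \;:\Longleftrightarrow\; \{(x,q) \in X \times \mathbb{Q} : f(x) < q\} \in \Pi^0_n
f \in \Delta^0_n \;:\Longleftrightarrow\; f \in \Sigma^0_n \cap \Pi^0_n
% Here \Sigma^0_1 recovers the lower semicomputable functions and \Delta^0_2
% the limit-computable ones; the mismatch mentioned above appears at the
% upper semicomputable (\Pi^0_1) level, where the strict inequality misbehaves.
```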
Semimeasure Theory
There are powerful existence and uniqueness results for measures, including Carathéodory's extension theorem for general measure spaces and the Sierpinski class theorem[2] for probability spaces. But the AIT setting is naturally formulated in terms of semimeasures, which are defective measures satisfying only superadditivity. This is because a program may halt after outputting a finite number of symbols. For instance, with a binary alphabet the chances of observing (say) output starting with 001 or 000 are lower than the chances of observing output starting with 00, since the second bit might be the last. This means that the natural probabilities on cylinder sets (and their sigma algebra including infinite sequences) are not additive. It is possible to reconstruct a probability measure by adding the finite sequences back (or adding an extra "HALTED" symbol) but it's a bit clumsy - the resulting measures are no longer lower semicomputable.
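Concretely, for a binary alphabet (a standard definition, stated here for reference):

```latex
% A semimeasure \lambda satisfies, for every finite string x,
\lambda(\epsilon) \;\le\; 1, \qquad \lambda(x) \;\ge\; \lambda(x0) + \lambda(x1),
% with equality in both clauses recovering a probability measure. The deficit
\lambda(x) - \lambda(x0) - \lambda(x1)
% is naturally read as the probability of halting immediately after
% outputting x - exactly the 00-versus-000/001 defect described above.
```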
- Investigate and formalize the suggestions for dealing with semimeasures in Intro to UAI.
- Demonstrate the impossibility of extension theorems for bounded semimeasures (e.g. the set where two semimeasures are equal need not be a Sierpinski class).
- Formulate the most general setting where something interesting can be said - we are really only interested in semimeasures on filtrations that are additive for certain disjoint sequences. Read problem 6 first.
- Formulate a martingale convergence theorem with respect to a semimeasure and extend Blackwell and Dubins' results on merging of opinions (roughly speaking, the hacks in part 1 should suffice, but perhaps 2 and 3 allow a more elegant approach).
- Study the computability properties of semimeasures in the sense of computable analysis. Their values on certain (semi?)measurable, recursively enumerable sets should be lower semicomputable.
- Define integration with respect to a semimeasure. In the special case of the sigma algebra generated by cylinder sets over a finite alphabet, integrating the reward sum should reproduce the recursive value function. I believe that as long as the extension of a semimeasure has been defined appropriately the ordinary construction in terms of lower-bounding simple functions will work, at least for non-negative random variables.
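A minimal sketch of the construction suggested in the last item, assuming the extension of λ beyond cylinder sets has already been fixed:

```latex
% For non-negative f, integrate by lower-bounding simple functions,
% exactly as for measures:
\int f \, d\lambda \;:=\; \sup \left\{ \sum_{i=1}^{k} c_i \, \lambda(A_i) \;:\;
  \sum_{i=1}^{k} c_i \mathbf{1}_{A_i} \le f,\;\; c_i \ge 0,\;\; A_i \text{ disjoint} \right\}
% The open questions are which convergence theorems survive mere
% superadditivity, and whether this reproduces the recursive value function
% on cylinder sets as conjectured above.
```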
Algorithmic Probability Foundations
Mostly, I am interested in understanding how the universal distribution behaves in practice when facing a complex but approximately structured world - and whether some UTMs are better than others for agents, or whether initial differences can be overcome by providing AIXI with a decent "curriculum." For pure learning it is known that the initial UTM does not matter much (see the bounds in UAI).
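To make the posterior-weight questions below concrete, here is a toy sketch (my own illustration, not from any of the referenced texts) of a Bayesian mixture with 2^(-complexity) prior weights over a tiny hypothesis class; the fixed models, complexities, and data source are all stand-ins for the lower semicomputable semimeasures of the real construction:

```python
class BayesMixture:
    def __init__(self, models, complexities):
        # Prior weight w ~ 2^(-K), normalized over the finite class.
        weights = [2.0 ** -k for k in complexities]
        total = sum(weights)
        self.models = models                      # each: history -> P(next bit = 1)
        self.weights = [w / total for w in weights]

    def predict(self, history):
        # Mixture probability that the next bit is 1.
        return sum(w * m(history) for w, m in zip(self.weights, self.models))

    def update(self, history, bit):
        # Bayes: reweight each model by its likelihood for the observed bit.
        likelihoods = [m(history) if bit == 1 else 1.0 - m(history)
                       for m in self.models]
        posterior = [w * l for w, l in zip(self.weights, likelihoods)]
        total = sum(posterior)
        self.weights = [p / total for p in posterior]

# Example: data comes from the most "complex" model; its posterior weight
# should dominate despite its smaller prior, illustrating the curriculum
# question (how fast initial UTM/prior differences wash out).
models = [
    lambda h: 0.5,                                # maximally simple
    lambda h: 0.9,                                # biased coin
    lambda h: 0.9 if len(h) % 2 == 0 else 0.1,    # alternating structure
]
mix = BayesMixture(models, complexities=[1, 3, 5])
history = []
for t in range(30):
    bit = 1 if t % 2 == 0 else 0
    mix.update(history, bit)
    history.append(bit)
print(mix.weights)   # weight concentrates on the third model
```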
- Is the universal distribution the least favorable prior for some problem in the sense of minimax parameter estimation (for a more authoritative introduction, dig into Lehmann's "Theory of Point Estimation")? See also Li and Vitanyi's result in Intro to K that average-case computational complexity under the (discrete) universal distribution coincides with worst-case complexity. (Argue this formally.)
- Is the universal distribution an ignorance/maxentropy prior in the sense of Jaynes? (Argue this formally).
- What are the dynamics of the universal distribution's posterior weights in the limit where the true lower semicomputable distribution has very high Kolmogorov complexity?
- (Local subproblems) Will "pockets of low complexity" be predicted effectively? Similar to prediction of even bits in partially uncomputable sequences, but I am more interested in modular structure such as when part of the environment is a simulation of a simpler environment (e.g. a video game).
- (High level abstractions) When the environment has simple high level structure with algorithmically complex components, will posterior probability initially concentrate on semimeasures that recover the high level structure and treat the complex components as stochastic?
- Answer the above question in the interactive case, for Hutter's universal environment mixture.
- When is there a short sequence of observations that causes the (posterior) universal distribution for UTM U₁ to approach the (prior or posterior) universal distribution of UTM U₂?
AIXI Generalizations
The AIXI agent optimizes a reward signal, which carries obvious existential risks: barring embeddedness issues it probably wireheads and then eliminates all threats however distant[3]. We want to argue that optimizing most naive or carelessly chosen utility functions is also an existential risk, but to do this we need a generalized definition of AIXI. Considering how much ink has been spilled on this topic, I profoundly hope that the problems stated here already have published solutions I am simply unaware of. Philosophical questions aside, the primary importance of these problems is to form a bridge from the preceding theoretical results to the following practical considerations.
- For AIXI variations with a normalized belief distribution (for instance a mixture over environments that always produce an infinite stream of percepts), the value function can be defined as the integral of any [0,1]-valued utility random variable u, and there is a well-defined AIXI that chooses an optimal action with respect to u (this breaks all computability results because we can no longer use the universal distribution and are forced to use the iterative value function). Extend AIXI's convergence results to this setting. (The generalized value functions are sketched after this list.)
- Use the semimeasure theory results derived above to define the value function as the integral of a random variable u with respect to a semimeasure such as ξ (+ an action sequence).
- It is known that a time-consistent utility function on (finite) histories can be expressed as a reward sum. In our formulation time-consistency is already assumed. Show that a continuous u can be expressed as a reward sum without changing the value function. Can the rewards always be chosen positive and bounded?
- Answer the preceding problem when u is bounded but not necessarily non-negative, and when u is [-∞, ∞]-valued[4] (may be positive or negative infinity).
- We would like to consider agents that are non-solipsistic in the sense that they care about the state of the world beyond their own action/percept history. In principle, the utility function u can be interpreted as representing the agent's expected valuation for the state of the world given their percepts (see this MIRI paper on learning what to value), but this is not natural from a computational viewpoint because it does not make the content of u explicit, suggest any means for assessing its computability level, or connect it to AIXI's ontology or epistemology. Consider the more explicit representation u(ν, h) that accepts as an additional argument the true environment ν.
- In terms of moral philosophy, can this represent any utility function we care about? Perhaps not, because some random events can still take place within ν that we do not observe directly but might affect other conscious beings. One might also consider reflective-oracle-computable environments containing other agents as powerful as our agent - maybe there is a way to formalize a cosmopolitan utility function that depends only on the histories of such agents (it would not be trivial, since reflective oracles can be called for various purposes and it may not make sense to assign each call to a hypothetical other agent).
- The computability level of AIXI under u(ν, h) should not be any worse, because the value function sum = integral can simply be broken up by environment (when u is non-negative, summation order doesn't matter by Fubini's theorem). Prove it.
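For concreteness, here is a sketch of the generalized value functions these problems concern, under the stated assumptions (normalized belief distribution ξ, [0,1]-valued u); the notation is my reconstruction:

```latex
% Solipsistic version: u sees only the history h = a_1 e_1 a_2 e_2 \ldots
V^\pi \;=\; \mathbb{E}_{\xi^\pi}\!\left[ u(h) \right] \;=\; \int u \, d\xi^\pi

% Non-solipsistic version: u also sees the true environment \nu:
V^\pi \;=\; \sum_{\nu} w_\nu \, \mathbb{E}_{\nu^\pi}\!\left[ u(\nu, h) \right]
% For u \ge 0 the sum and integral regroup term by term (Fubini--Tonelli),
% which is the decomposition the last problem asks one to exploit.
```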
Scaffolding the Universal Distribution
Since at least Bostrom's "Superintelligence", A.I. safety researchers have considered the possibility of a non-agentic oracle A.I. which could be used to answer queries without acquiring any goals of its own. Recently, there has been extensive debate about whether pretrained foundation models fall into this category (e.g. goal agnosticism [LW · GW] and simulators) or give rise to their own optimization daemons. See also Michael Cohen's argument that imitation learning is existentially safe. I am not directly concerned with this question here; instead I want to consider the safety implications of oracle access to a "correct" approximation of the universal distribution[5]. Given such a tool, can we pursue our goals more effectively? We could naturally construct an AIXI approximation[6] using the universal distribution to optimize a reward signal but it would probably kill us and then wirehead, so that isn't a wise plan. Are there better ideas, such as some form of cyborgism?
- Consider a computationally-bounded agent (for instance a human) with oracle access to the universal distribution M and utility function u. Using a "small" number of queries:
- How should the agent optimize u? Consider the case that the queries are more or less powerful, for instance only rollouts sampled from M. It is easiest to start with the finite-horizon case. A human cannot assess the value of an infinite history, so we must assume a continuous u when the horizon is infinite (by the previous section, we should be able to focus on reward sums w.l.o.g.). (A minimal planner sketch follows these sub-problems.)
- In reality, the agent's utility function will be a function of outcomes in the world, not its observed action/percept history. Under what conditions can we effectively optimize u? Will the agent tend to choose histories that are good in most environments, but with a non-vanishing chance of being bad in the true environment - that is, will the agent sometimes accidentally "wirehead" by seeking deceptively tempting percepts? Presumably we can do better with access to simulators for each environment ν. Prove it.
- Consider a partial function u that only assigns utilities to histories that we can confidently evaluate. Formulate a version of AIXI that plans to maximize u, or (perhaps more likely) show this idea is incoherent.
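Here is the minimal planner sketch promised above. The interfaces are hypothetical stand-ins, not from the post: `sample_rollout` represents oracle access to (an approximation of) M, and `utility` plays the role of u on finite histories.

```python
def estimate_value(history, action, sample_rollout, utility, horizon,
                   n_samples=32):
    # Monte-Carlo estimate of E[u | history, action] over a finite horizon,
    # using a "small" number of oracle queries as the problem requires.
    total = 0.0
    for _ in range(n_samples):
        rollout = sample_rollout(history + [action], horizon)  # percepts from M
        total += utility(history + [action] + rollout)
    return total / n_samples

def choose_action(history, actions, sample_rollout, utility, horizon):
    # One-step greedy planning; a full expectimax would recurse over future
    # actions, multiplying the query budget by |actions| at every step.
    return max(actions, key=lambda a: estimate_value(
        history, a, sample_rollout, utility, horizon))
```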
- Marcus has suggested that in principle an oracle for M could simply be queried for the optimal agent with n bits of source code. The practical version of this point is that a powerful non-agentic sequence predictor is still dangerous because someone could easily ask it to output the source code for an A.G.I. Construct an explicit prefix that conditions M to place high probability on a continuation which is the source code (= Turing machine encoding) of a competent agent.
- If you succeed, maybe keep quiet about the details.
- The most naive way to translate oracle access to M into agency is by repeatedly sampling the next action from M. This ridiculous scheme (+ some finetuning) is actually used to construct LLM agents. Are such agents likely to be competent and/or dangerous? (Both this scheme and the more clever variant below are sketched after this list.)
- Presumably this does not work by default. Prove it.
- A more clever version of this idea is to directly approximate the action value function and optimize it. See "Self-Predictive Universal Artificial Intelligence." Study the coherence and convergence to optimality of the resulting agent.
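Sketches of the two schemes above, with the same hypothetical interfaces as the planner: `sample_next` samples the next action from M conditioned on the history, and `q_estimate` approximates the action value function (e.g. distilled from M-rollouts).

```python
def naive_policy(history, sample_next):
    # Naive scheme: the action is whatever M predicts comes next. Nothing
    # here optimizes a goal, which is why competence by default is doubtful
    # (the "prove it" problem above).
    return sample_next(history)

def self_predictive_policy(history, actions, q_estimate):
    # "More clever" scheme: act greedily on an approximate action value
    # function, as in Self-Predictive Universal Artificial Intelligence.
    return max(actions, key=lambda a: q_estimate(history, a))
```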
Embedded Versions of AIXI
I am actively working to understand embedded versions of AIXI. One motivation for this is that it should inform our timelines - if simple variations of AIXI automatically work in the embedded setting (as weakly suggested by my proposed patch for the anvil problem [LW · GW]) we should expect LLM agents to become competent sooner. This is a very subtle topic and my thinking is still in early enough exploratory stages that I am not prepared to construct a list of explicit mathematical problems.
- Experiment with building self-predictive AIXI approximations from universal predictors (I have some working prototypes).
- Will an anytime algorithm approximating my off-policy version of AIXI [LW · GW] learn to seek more compute?
- ^
On this note, I would be interested in dialogues with researchers working on singular learning theory, logical induction, and infra-Bayesianism about why these are relevant to safety - it seems to me that at least the first two are more important for building self-improving A.I. systems than understanding superintelligence. An aligned superintelligence could figure out safe self-improvement on its own, so I don't view this as an essential step. It seems to be primarily relevant for the lost MIRI dream of rapidly building a friendly A.I. based on pure math before anyone else can do it with massive blind compute.
- ^
I think the more common term is "Sierpiński–Dynkin's π-λ theorem."
- ^
Because of temporal discounting, the order is to first eliminate serious threats, then wirehead, then eliminate distant threats.
- ^
Perhaps Amanda Askell or Samuel Alexander would care about more exotic values but infinity is infinite enough for me.
- ^
Yes, I know that there are arguments suggesting the universal distribution is malign. Personally I think this is unlikely to matter in practice, but in any case it's more of an "inner optimization" problem that I will not focus on here.
- ^
Technically, AIXI uses a different belief distribution that is explicitly over interactive environments. I suspect that a competent AIXI approximation can still be hacked together given access to the universal distribution - in fact I have built one that uses this (normalized) universal distribution approximation as belief distribution and learns to play simple games. But theoretical justification is missing.
2 comments
comment by RogerDearnaley (roger-d-1) · 2024-09-16T08:25:42.909Z · LW(p) · GW(p)
In particular, a Bayesian superintelligence must optimize some utility function using a rich prior, requiring at least structural similarity to AIXI.
One model I definitely think you should look at analyzing is the approximately-Bayesian value-learning upgrade to AIXI [LW · GW], which has Bayesian uncertainty over the utility function as well as the world model, since that looks like it might actually converge from rough-alignment to alignment without requiring us to first exactly encode the entirety of human values into a single utility function.
comment by Cole Wyeth (Amyr) · 2024-09-16T15:13:45.054Z · LW(p) · GW(p)
I'll look into it, thanks! I linked a MIRI paper that attempts to learn the utility function, but I think it mostly kicks the problem down the road - including the true environment as an argument to the utility function seems like the first step in the right direction to me.