# The Learning-Theoretic Agenda: Status 2023

post by Vanessa Kosoy (vanessa-kosoy) · 2023-04-19T05:21:29.177Z · LW · GW · 13 comments## Contents

Preamble Philosophy Key Problems Problem 1: Computational Resource Constraints Problem 2: Frequentist Guarantees Subproblem 2.1: Traps Subproblem 2.2: Password guessing games Subproblem 2.3: Nonrealizability Problem 3: Value Ontology Problem 4: Cartesian Privilege Problem 5: Descriptive Agent Theory Nonproblem: Expected Utility Research Directions Subproblem 1.1: Frugal Universal (Infra-)Prior Direction 1: Frugal Compositional Languages Direction 1.1: Compositional Polytope-MDPs Direction 1.2: String Machines (for RL) Direction 1.3: Infra-Bayesian Logic Direction 2: Deep Learning Theory Subproblem 2.3: Nonrealizability Direction 3: Infra-Bayesian RL Direction 4: Exploiting properties of simplicity priors Direction 5: Agnostic RL Subproblem 1.2/2.1: Traps Direction 6: Metacognitive Agents Direction 6.1: Bayesian planning via query-based learning of computations Direction 6.2: Expressiveness of string machines Direction 6.3: Query learning of string machines Direction 7: Approximation algorithms for Bayesian planning Direction 8: Expanding Safety Envelope Better regret bounds for simple priors Direction 9: Unifilar MDPs Direction 10: Foreground MDPs Direction 11: Canonical RL Dimension Direction 12: Epistemic Regret Bounds Subproblem 2.4: Multi-Agent Guarantees Direction 13: Population Games Direction 14: Logit equilibria of finite-state repeated games Direction 15: Infra-Bayesian Veil of Ignorance Direction 15.1: IBRL with exact clones Direction 15.2: IBRL with inexact clones Direction 16: Hidden Rewards Direction 16.1: Agency with partial monitoring Direction 16.2: Specifying semi-instrumental reward functions Direction 16.3: Undogmatic Ontologies Direction 16.3: Bayes-optimal planning for communicating RUMDPs Direction 17: Algorithmic Descriptive Agency Measure (ADAM) Direction 17.1: Comparing ADAM variants Direction 17.2: ADAM of random policies Direction 17.3: ADAM Hierarchy Theorem Direction 17.4: ADAM for finite-state policies Direction 17.5: Inferring the utility function Direction 18: Infra-Bayesian Physicalism Physicalist Superimitation Motivation Superimitation Applicable to humans Agent Detection User Identification Full Specification Alignment Guarantee Formal Alignment Criterion Axiological alignment of PSI Epistemic alignment of PSI A Plan Summary None 13 comments

**TLDR:** I give an overview of (i) the key problems that the learning-theoretic AI alignment research agenda is trying to solve, (ii) the insights we have so far about these problems and (iii) the research directions I know for attacking these problems. I also describe "Physicalist Superimitation" (previously knows as "PreDCA"): a hypothesized alignment protocol based on infra-Bayesian physicalism that is designed to reliably learn and act on the user's values.

*I wish to thank Steven Byrnes, Abram Demski, Daniel Filan, Felix Harder, Alexander Gietelink Oldenziel, John S. Wentworth*^{[1]}* and my spouse Marcus Ogren for reviewing a draft of this article, finding errors and making helpful comments and suggestions. Any remaining flaws are entirely my fault. *

# Preamble

I already described [LW · GW] the learning-theoretic agenda (henceforth LTA) in 2018. While the overall spirit stayed similar, naturally LTA has evolved since then, with new results and research directions, and slightly different priorities. This calls for an updated exposition.

There are several reasons why I decided that writing this article is especially important at this point:

- Most of the written output of LTA focuses on the technical, mathy side. This leaves people confused about the motivation for those inquiries, and how they fit into the big picture. I find myself having to explain it again and again in private conversations, and it seems more efficient to have a written reference.
- In particular, LTA is often conflated with infra-Bayesianism. While infra-Bayesianism is a notable part, it is not the only part, and without context it's not even clear why it is relevant to alignment.
- Soares has recently argued [LW · GW] that alignment researchers don't "stack", meaning that it doesn't help much to add more researchers to the same programme. I think that for LTA, this is not at all the case. On the contrary, there is a variety of shovel-ready problems that can be attacked in parallel. With this in mind, it's especially important to explain LTA in order to get more people on board.
- In particular, ALTER has announced a prize [LW · GW] designed to encourage work on LTA. I expect this article to be a useful resource for researchers who decide to compete.
- It seemed important to explain Physicalist Superimitation better, and this is hard to do without the broader context.

I am fairly optimistic that LTA leads to a solution to alignment. On the other hand, it's much harder to say whether the solution will arrive in time. I think that the rate of progress relative to person-hours invested was pretty good so far, and we didn't hit any obstacle that cast doubt on the entire endeavor. At the same time, in *absolute* terms we still have most of the work in front of us. In the coming years, I am planning to put considerable effort into scaling up: getting more researchers on board, and hopefully accelerating progress by a significant factor.

# Philosophy

*The philosophy of LTA was **covered* [LW(p) · GW(p)]* in the original article, I mostly stand behind what I wrote there and don't want to repeat it. The goal of this section is merely to recap and add a couple of points.*

The goal of LTA is creating a general mathematical theory of intelligent agents. There are a number of reasons why this is necessary for AI alignment:

- Having a mathematical theory enables constructing models in which we can prove that a particular AI design is aligned (or unaligned), or at least form rigorous and strongly supported conjectures (similar to the role in cryptography). While such a proof doesn't imply absolute confidence, it does allow us to reduce the problem to becoming confident in the
*assumptions*of the model. This is a much better situation than dealing with a sequence of heuristic steps, each of which might be erroneous and/or hiding assumptions that are not stated clearly. Of course, similarly to how in cybersecurity having a provably secure protocol doesn't protect you from e.g. someone using their birthday for a password, here too we will need to meticulously verify that the assumptions hold in the actual implementation. This might require knowledge from domains outside of theoretical computer science, e.g. physics, cognitive science, evolutionary biology etc. - Empirical data is insufficient in itself, since without an underlying theory it's very hard to be confident about how the results extrapolate to new domains, scales, architectures and algorithms. On the other hand, the combination of theory and experiment can be extremely powerful for such extrapolation, even if the theory cannot produce quantitative predictions ab initio on its own.
- Even if we don't do any explicit calculations using the theory about the AI we are designing, merely knowing the theory leads to much better intuitions, and equips us with the right vocabulary to reason about the problem. To give an analogy, it is much easier to reason about designing engines if you're familiar with concepts such as "heat", "work", "temperature", "entropy", even if you're just reasoning informally rather than actually calculating.

After I wrote the original article about LTA, more was written about the feasibility and importance of mathematics for alignment, both by [LW(p) · GW(p)] me [LW · GW] and by others [LW · GW].

Why is LTA concerned specifically with agents? Aren't there AI systems which are not agents? The reason is: the sort of risks I want to address are risks that arise from AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue unaligned goals, with catastrophic consequences), and the sort of solution I envision also relies on AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue an aligned goal, s.t. the result is protecting humanity from unaligned AI).

Finally, I want to address a common point of confusion. Sometimes I tell someone about subproblems in LTA and they respond by trying to understand why each individual subproblem (e.g. nonrealizability, or cartesian privilege, or value ontology) is relevant to alignment. Often there *are* some specific ways in which the subproblem can be connected to alignment. But the more important point is that *any* fundamental question about intelligent agency is relevant to alignment, just because without answering this question we cannot understand intelligent agency. For any aspect of intelligent agency that we don't understand and any AI that we design, one of the two will hold: either the AI lacks this aspect, in which case it is probably insufficient to protect humanity, or the AI has this aspect but we don't understand why or how, in which case it is probably unaligned (because alignment is a small target, and significant unaccounted for phenomena are likely to make you miss).

# Key Problems

The starting point of LTA is Marcus Hutter's AIXI: a simplistic model of ideal agency.

Intuitively, an "intelligent agent" is a system that builds sophisticated models of the world and uses them to construct and execute plans that lead to particular goals (see also Yudkwosky [LW · GW]). Building models implies starting from a state of ignorance and updating on observations, which can be formalized by Bayesian probability theory. Building sophisticated models requires using Occam's razor, which can be formalized by the Solomonoff prior. Constructing and executing plans can be formalized as maximizing expected utility. Putting all these ingredients together gives us AIXI.

However, there are both important gaps both in our understanding of AIXI-like agents, and significant weaknesses in the AIXI framework itself. Solving these problems is the natural framing for the entire foundational^{[2]} part of LTA.

## Problem 1: Computational Resource Constraints

AIXI is uncomputable. Obviously, real-world agents are not only computable but operate under strong computational resource constraints. This doesn't mean AIXI is not a useful *toy model* for studying the interaction of Occam's razor with Bayesian optimization or reinforcement learning. However, it *is* a major limitation.

One obvious attempted solution is to impose some computational resource constraints on the programs that appear in Solomonoff's prior, for example as in Schmiduber's speed prior or in some other way. This typically leads to agents that are computable but still computationally intractable. Relatedly, a recent line of research connects the hardness of time-bounded Kolmogorov complexity with the existence of one-way functions.

On the other hand, we know some priors for which asymptotic Bayes-optimality is possible in polynomial time, for example priors supported on Markov decision processes (MDPs) with a small (i.e. polynomial in the security parameter) number of states. Some stronger feasible models are also known, e.g. MDPs with linear features or kernel-MDPs. However, all such priors have significant shortcomings:

- They require the MDP to be communicating, i.e. contain no irreversible transitions (a.k.a. "traps"). This is obviously not true in the real-world, where e.g. jumping off a cliff is often irreversible. Indeed, without this assumption approximating the Bayes-optimal policy is NP-hard, even for a small number of deterministic hypotheses.
- They usually don't embody any sort of Occam's razor.
- They are not rich enough to capture sophisticated models of the real world.

## Problem 2: Frequentist Guarantees

AIXI is Bayes-optimal (by definition) but is not necessarily optimal in any sense for any particular environment. In contrast, in statistical learning theory we usually demand a *frequentist* guarantee, i.e. that a learning algorithm converges to optimality (in some sense) for any^{[3]} data source (subject to some assumptions which depend on the setting). Such guarantees are important for several reasons:

*Learning*is a key ability of intelligent agents, and played a fundamental role in AI progress in recent decades. A natural way to operationalize learning is: an agent learned a fact when its behavior is optimal conditional on this fact. In other words, learning which of a set of hypotheses is true means converging to optimal behavior for the true hypothesis, i.e. precisely a frequentist guarantee. This means that a theory of frequentist guarantees tell us both*which*facts agents can learn and how*fast*they learn them (or how much data they require to learn them). These are important questions that a theory of intelligent agents has to answer. In particular, it is required to analyze the feasibility of alignment protocols: e.g. if we expect the AI to learn human values, we need to make sure it has sufficient information to learn them within a practical time frame.- A Bayes-optimal algorithm does well on average across some large ensemble of possible universes. But in reality, we only observe one universe (which is not even in the ensemble: see Subproblem 2.3 below). Without frequentist guarantees, it's not clear why would a Bayes-optimal algorithm do especially well in our specific universe, or why should algorithms that do well in our specific universe be Bayes-optimal.
- Evolution selected humans by their performance in the ancestral environment, which involved e.g. gathering berries and running away from lions. But they turned out to perform reasonably well in very different environments that require e.g. writing code or designing rockets. This hints at the existence of some underlying frequentist guarantee as a key property of intelligent agents.

Moreover, in addition to single-agent frequentist guarantees it is desirable to derive *multi-agent* frequentist guarantees. Interactions between agents are quite important in the real-world, and have significance specifically for AI alignment as well (AI-human, AI-other AI, AI-acausal trade partner, to some extent human-human too). A multi-agent frequentist guarantee can take the form of convergence to a particular game-theoretic solution concept, or an asymptotic lower bound on expected utilities corresponding to some notion of game value.

There are multiple difficulties in deriving frequentist guarantees for AIXI-like agents. The first two difficulties ("traps" and "password guessing games" below) are not problems with the agent, but rather with the way conventional frequentist guarantees are defined (which are nonetheless non-trivial to solve). The third difficulty (nonrealizability) might be a problem with the way the agent is defined.

### Subproblem 2.1: Traps

The most common type of frequentist guarantee in reinforcement learning (RL) is regret bounds^{[4]}. However, AIXI doesn't satisfy any regret bound w.r.t. the underlying hypothesis class (i.e. the class of all computable environments) because these hypotheses involve irreversible transitions (traps). For example, suppose that in environment 1 taking action A sends you to a sink state with reward , whereas taking action B sends you to a sink state with reward , and in environment 2 the opposite is true. Obviously a hypothesis class which contains both environments doesn't admit a regret bound.

This problem is especially acute in the multi-agent setting, because other agents have memory. Either it's impossible to erase memory in which case the environment is irreversible by design, or it's possible to erase memory which breaks conventional learning theory in other ways.

### Subproblem 2.2: Password guessing games

What if we arbitrarily rule out traps? Consider an agent with action set and observation set . Suppose that the reward is a function of the last observation only . We can specify an environment^{[5]} by providing a communicating MDP with state set and action set augmented by representation mappings

Here, is required to be an onto mapping from to for any . Also, we require that for any and

However, this only enables a fairly weak regret bound. Naively, it seems reasonable to expect a regret bound of the form where is the Kolmogorov complexity of the MDP+representation, is the diameter^{[6]} and is the time horizon^{[7]}, and are constants s.t. . Indeed, can be regarded as the amount of information that needs to be learned and e.g. Russo and Van Roy showed that in a bandit setting regret scales as the square root of the entropy of the prior (which also expresses the amount of information to be learned). However, this is impossible, as can be seen from the following example.

Fix and (the "password"), let the action set be and the state space be . We think of a state as "the agent entered the string of bits into the terminal". The initial state is (the empty string). The transition kernel works as follows:

In state , taking action produces the state (i.e. the agent enters the digit ). In state , taking action produces the state if (the agent entered the correct password) or state (the agent entered an incorrect password and has to start from the beginning). In state taking action stays in state and taking action produces state (restarts the game).

All states have reward except for the state which has reward .

Obviously the agent needs time to learn the correct password, in expectation w.r.t. the uniform distribution over passwords. At the same time, and . This is incompatible with having a regret bound of the desired form.

### Subproblem 2.3: Nonrealizability

As we discussed in Problem 1, a realistic agent cannot be Bayes-optimal for a prior over all computable environments. Typically, each hypothesis in the prior has to admit a computationally feasible approximately optimal policy in order for it to be feasible to approximate Bayes-optimality (in particular this has to be the case if the prior is learnable), which usually implies the hypotheses are computationally feasible to simulate^{[8]}. However, in this situation we can no longer assume the true environment is within the hypothesis class. Even for AIXI itself, it is not really fair to assume the environment is computable since that would exclude environments that contain e.g. other AIXIs.

A setting in which the true environment is not in the hypothesis class is known in learning theory as "nonrealizable". For offline (classification/regression) and online learning, there is a rich theory of nonrealizable learning^{[9]}. However, for RL (which, among classical learning theory frameworks, is the most relevant for agents), the nonrealizable setting is much less understood (although there are some results, e.g. Zanette et al).

Therefore, even if we arbitrarily assume away traps, and are willing to accept a factor of in the regret bound, a satisfactory theory of frequentist guarantees would require good understanding of nonrealizable RL.

This problem is also especially acute in the multi-agent case, because it's rarely the case that each agent is a realizable environment from the perspective of the other agent (roughly speaking, if Alice would simulate Bob simulating Alice, Alice would enter an infinite loop). This is known as the "grain of truth" problem^{[10]}. One attempt to solve the problem is by Leike, Taylor and Fallenstein, using agents that are equipped with "reflective oracles". However, there are many different reflective oracles, and the solution relies on all agents in the system to use the same one, i.e. that the design of the agents has been essentially centrally coordinated to make them compatible.

## Problem 3: Value Ontology

AIXI's utility function depends on its direct observations. This is also assumed in RL theory. While this might be a legitimate type of agent, it is insufficiently general. We can easily imagine agents that place value on parameters they don't directly observe (e.g. the number of paperclips in the observable universe, or the number of people who suffer from malaria). Arguably, humans are such agents.

The obvious workaround is to translate the utility function from its original domain to observations by taking a conditional expectation. That is, suppose the utility function is for some space , let be the space of everything that is directly observed (e.g. action-observation time sequences) and suppose we have some prior over . Then, we can define by

However, there are problems with this:

- In order to have reasonable computational complexity and frequentist guarantees, we will need to make assumptions about the utility function. While such assumptions can be natural for , they make much less sense for unless we can somehow justify them via the prior over . Without a theory that explicitly talks about and it is impossible to know whether such justification is possible.
- Even if satisfies the required assumptions, a frequentist guarantee about does not necessarily imply an analogous frequentist guarantee about . That's because the transformation above involves taking expected value which implicitly mixes different "possible universes", while a frequentist guarantee is supposed to refer to a particular universe.
- Different agents have different observation spaces and there is no natural joint prior on all of them. This means it's impossible to talk about different agents having "aligned" preferences.

Therefore, it is desirable to have a theory in which it is possible to explicitly talk about utility functions with domains other than observations. (See also de Blanc.)

## Problem 4: Cartesian Privilege

The formalization of Occam's razor in AIXI relates to the description complexity of hypotheses represented in terms of the agent's actions and observations (as functions ). However, this is not how Occam's razor is used in science. When we talk about a "simple" or "elegant" scientific theory, we refer to the equations governing the fundamental degrees of freedom on that theory (e.g. particles, or fields or strings), *not* to the equations that would be needed to describe the RGB values of points on a person's retina. Indeed, expressing physics via the latter equations would be horrifically complicated.

In other words, AIXI-like agents believe they occupy a privileged position in the universe, which has various pathological consequences. See "infra-Bayesian physicalism [LW · GW]" for more discussion and Demski and Garrabrant for related prior work.

## Problem 5: Descriptive Agent Theory

The framing of AIXI is: given certain subjective choices (e.g. universal Turing machine and utility function), what is an ideal agent? However, in the real-world agents are not ideal. Moreover, we want to be able to understand which systems are agents, and what kind of agents they are, without already assuming e.g. a specific utility function. In particular, such a theory would have applications to:

- Value learning, by regarding humans as agents.
- Studying "accidental" formation of agents, e.g. via evolution or as mesa-optimizers (related: Wentworth's selection theorems [LW · GW] programme).

Legg and Hutter defined an intelligence measure that for every policy produces a number determining its intelligence, s.t. AIXI has maximal intelligence. The measure is essentially the policy's expected utility w.r.t. the Solomonoff prior. This allows us to talk how close any given system is to an ideal agent. However, it suffers from two major limitations (in addition to the other problems of AIXI discussed before):

- It depends on the choice of universal Turning machine (UTM). While the same is true of everything in algorithmic information theory, usually we have theorems that limit this dependence. For example, Kolmogorov complexity only changes by when we switch to a different UTM. On the other hand, the Legg-Hutter measure doesn't have any such property (except in the asymptotic in which it approaches 0, which is pretty meaningless).
- It depends on the choice of utility function. Technically, they don't make an explicit choice but instead assume the reward is one of the observation channels. In practice, this
*is*a choice (and a very restrictive one).

Moreover, naive attempts to ascribe a utility function to a policy run into difficulties. Specifically, any policy can arguably be ascribed a utility function which rewards following precisely this policy. With respect to such a utility function, the policy is Bayes-optimal for any prior. Armstrong and Mindermann have argued on the basis of this type of reasoning that ascribing a utility function is essentially impossible in a purely behaviorist framework (but I argue the opposite, see Direction 17.5 [LW · GW] below).

## Nonproblem: Expected Utility

Finally, I want to briefly discuss one issue which I *don't* consider a key problem. Namely, the reliance of AIXI on expected utility maximization.

Expected utility is often justified by coherence theorems such as von Neumann–Morgenstern (VNM) and Savage. These theorems are indeed important: they show that a small number of natural assumptions produce a fairly narrow mathematical object (expected utility). This should indeed increase our credence in expected utility as a useful model. Often, objections are put forward that argue with individual assumptions. IMO these objections miss the point: a coherence theorem is a piece of evidence, not a completely watertight proof that expected utility is philosophically correct. And, if expected utility is philosophically correct, such a watertight proof is still not something we should expect to exist (what would it even look like?)

The more important justification of expected utility is the large body of work it supports: control theory, game theory, reinforcement learning theory etc. The methodology I believe in is, start with *plausible* assumptions and try to build a theory. If the theory that results is rich, has synergy with other bodies of knowledge, has explanatory power, is useful in practice, each of these is evidence that it's on the right track. (And if we failed to build such a theory, we probably also learned something.)

Another objection that is often raised is "humans have no utility function". This is ostensibly supported by research in behavioral economics which shows that humans behave irrationally. I consider this deeply unconvincing. For one thing, I suspect that a large part of that research doesn't replicate. But even putting that aside, the interesting thing about irrational behavior is that we *recognize* it as irrational: i.e. learning about it makes you behave differently (unless it's so minor that it isn't worth the effort). This already indicates that this irrationality is better viewed not as a fundamental fact about human preferences, but as some combination of:

- Computational or informational resource constraints
- Error terms that vanish in the relevant asymptotic regime
- Random noise or some other effect that's not part of the cognitive algorithm

All of this is not to say that expected utility cannot be questioned. Rather that a *productive* objection would be to either come up with a demonstrably superior alternative, or at least a good argument why some extremely *concrete* problem cannot be solved without an alternative to expected utility. Lacking that, the best strategy is to press forward unless and until we encounter such a problem.

Now, we actually *do* have examples where deviating from purist VNM in specific ways is useful, e.g.:

- Infra-Bayesianism [LW · GW], where the conventional notion of "expectation" is replaced by some non-linear concave functional.
- Taylor's quantilization, where instead of
*maximizing*expected utility we sample a random choice out of some top fraction of choices. - Nash bargaining, where we maximize the product of several expected utilities (see also Garrabrant's "geometric rationality [? · GW]" sequence).

However, these examples came about from studying various concrete problems, not from arguing with the VNM model per se. Moreover, all of these examples can still be mathematically recast in VNM form: infra-Bayesianism can be regarded as a VNM zero-sum game (against "Murphy"), quantilization and Nash bargaining can be regarded as special cases of infra-Bayesianism (as will be discussed in an ~~upcoming~~ article [LW · GW] by Appel). Therefore, even here the theory built around the VNM model remains useful.

# Research Directions

This section focuses entirely on the foundational part of the programme. I will mention some of the applied part in the next section (although different applications are also possible, e.g. studying quantilization or IDA), but for the most part I believe the foundational part to be the top priority.

In the following, I roughly grouped the research directions by the problems they are trying to address. However, the real relationship between directions and problems is many-to-many: a single direction can have implications on many problems. Moreover, there are many connections between the different directions. And, even for directions that are not especially connected at present, if they are successful, we will be faced with the task of merging them into a unified theory.

## Subproblem 1.1: Frugal Universal (Infra-)Prior

One part of solving Problem 1 (computational resource constraints) is finding a prior (more precisely a family of priors) with the following properties:

- Formalizes Occam's razor, i.e. assigns higher probability to hypotheses that are simpler to describe.
- Sufficiently rich to contain sophisticated models applicable to the real-world.
- The (approximately) optimal policy for each hypothesis is feasible to find. Possible operationalization of "feasible": polynomial time in history length and description length of hypothesis.
- If we assume away traps, there is a feasible learning algorithm with a good regret bound. Possible operationalization: polynomial time in history length, regret bound similar to what was described in section "Problem 2.2" above. (This desideratum is probably strictly stronger than the previous bullet point.)

More generally, we might want an infra-prior (see Direction 3 [LW · GW] below), or a metacognitive (infra-)prior (see Direction 6 [LW · GW] below), or a physicalist infra-prior (see Direction 18 [LW · GW] below) with analogous properties.

### Direction 1: Frugal Compositional Languages

The time complexity of finding the optimal policy for a generic (tabular) MDP scales with its number of states. The same is true of the sample complexity of learning an unknown generic MDP. However, the number of states in a sophisticated model of the real world has to be enormous. For example, if I reason about the world as it's comprised of objects with possible states each, the over number of states is already .

Therefore, any realistic RL algorithm has to exploit some *structure* in its environment. For example, the environment might be comprised of spatially separated parts, or different processes happening on different spatial scales, or different processing happening on different temporal scales. We want to find an appropriate *compositional language* for describing such structure.

The compositional language used by AIXI is that of programs for some fixed UTM. This language is very powerful, but also unfeasible to work with, because programs can consume arbitrarily large amounts of computational resources or even fail to halt altogether. Imposing ad hoc computational resource bounds also doesn't work well, perhaps because such bounds are extraneous to the compositional properties of the language.

Arguably we need some new "frugal" compositional language. Here are the candidate starting points I currently know. It is possible that the right answer will involve combining several of them.

**Direction 1.1: Compositional Polytope-MDPs**

Compositional polytope-MDPs [LW(p) · GW(p)] is a language with two operations: union of asynchronous noninteracting parts and time hierarchy. Importantly, the optimal policy is computable in time polynomial in description length. The time hierarchy operation also offers a natural formalization for the notion of "instrumental values".

Further research might involve:

- Incorporating the state/action representation into the semantics.
- Proving regret bounds for hypotheses classes based on this language, first with arbitrary and then with computationally efficient algorithms.
- Extending the language with more operations.
- Creating an infra-Bayesian version.

**Direction 1.2: String Machines (for RL)**

String machines [LW · GW] is a category-theoretic framework that allows constructing automata and transducers from string diagrams^{[11]} (that can be probabilistic). In particular, it can be used to describe MDPs. In fact, there is a version of it that allows constructing arbitrary Turing machines, but there are natural ways to "throttle" to expressiveness of the language.

Further research might involve:

- Understanding the expressiveness of the language better, in terms of complexity classes and maybe parameterized complexity classes. [
**EDIT:**See also initial foray using filtered string machines [LW · GW].] - In particular, understanding how it depends on the categories used.
- Studying control/learning algorithms for hypotheses based on the language.

**Direction 1.3: Infra-Bayesian Logic**

Infra-Bayesian logic [LW(p) · GW(p)] is a sort of higher-order logic where the semantics has infradistributions instead of sets corresponding to predicates, and can be viewed as a synthesis of logic and probability theory. In particular, it is fairly natural for describing infra-(PO)MDPs because the main part of the latter is the transition infrakernel.

However, the semantics are not fully defined because of some technical difficulties. There is also a first-order version which *is* fully defined but seems less natural for describing RL hypotheses. Finally, there are hints that propositional infra-Bayesian logic might be more computationally tractable than classical propositional logic, but it's as yet unclear [LW · GW].

### Direction 2: Deep Learning Theory

It seems plausible that deep learning algorithms (e.g. transformers) are already learning algorithms for the frugal universal prior. Therefore, building a rigorous theory of deep learning (in particular, its inductive biases and generalization properties) is another potential path to understanding this prior and its properties.

While this is a relevant research direction, I don't prioritize it for two reasons:

- It already receives a lot of attention in mainstream academia (some examples [LW(p) · GW(p)], another interesting approach is singular learning theory [? · GW]).
- Substantial breakthroughs here might have major capability implications. (Although this is a consideration that might be relevant to some other directions in LTA as well.)

Nevertheless, it is a type of research that has implications on LTA and alignment researchers can reasonably consider to participate in it, if they feel especially well positioned for it and have the discipline and judgement required to self-censor if necessary.

## Subproblem 2.3: Nonrealizability

The research directions below are different ways to get frequentist guarantees in the nonrealizable setting, but they are not necessarily mutually incompatible. In addition to exploring each approach separately, it can be interesting to study various combinations.

### Direction 3: Infra-Bayesian RL

Infra-Bayesian RL [LW · GW] (IBRL) is a way to combine reinforcement learning theory with imprecise probabilities which produces a specific kind of frequentist guarantee in the nonrealizable setting (infra-Bayesian regret). A rough summary of the results we got in this framework so far:

- There is a dynamically consistent update rule [? · GW] for a notion of imprecise beliefs that generalizes credal sets (infradistributions). This rule is equivalent to the one described by Gilboa and Schmeidler but we represented it using different mathematical language (concave functional or convex set, related by Legendre-Fenchel duality, instead of an order on functions).
- Newcombian environments (where naive causality is violated) can be represented [? · GW] as causal infra-Bayesian laws.
- Infra-Bayesian laws can express [? · GW] a very large variety of possible frequentist desiderata.
- Miscellaneous [? · GW] results [? · GW] about [? · GW] the [? · GW] mathematical properties of infradistributions, e.g. topological, information-theoretic and category-theoretic
- An analysis of regret for an infra-Bayesian generalization of stochastic linear bandits. This is an
~~upcoming~~paper: "Imprecise Multi-Armed Bandits [LW · GW]" (IMAB).

In particular, the result about Newcombian environments and also infra-Bayesian physicalism (see Direction 18 [LW · GW] below) are two products of infra-Bayesianism that I don't see, at present, how to reproduce with any of the alternative approaches to nonrealizability.

Directions for further research include:

- (Direction 3.1) Extending the results of IMAB to full-fledged IBRL. In particular, it might be interesting to get a result that subsumes Tian et al as a special case.
- (Direction 3.2) Finding an infra-Bayesian analogue for Russo-Van Roy ("eluder") dimension.
- (Direction 3.3) Analyzing the computational complexity of learning in the IMAB setting.
- (Direction 3.4) Studying versions of the IMAB setting with non-crisp infradistributions.

### Direction 4: Exploiting properties of simplicity priors

In general, Bayesian inference has only fairly weak guarantees in the nonrealizable setting. However, simplicity priors might produce stronger guarantees. In particular, Lattimore, Hutter and Gavane (LHG) showed^{[12]} that (a specific version of) Solomonoff induction converges to predicting a deterministic computable subsequence even when the full sequence is uncomputable.

Directions for further research include:

- (Direction 4.1) Generalizing LHG to stochastic subsequences (this is stated as an open problem in the paper).
- (Direction 4.2) Nonrealizable guarantees for RL with a Solomonoff prior.
- (Direction 4.3) Generalizing LHG to bounded versions of the Solomonoff prior, e.g. a version with bounded space complexity (this might be easier to study than bounded time complexity, since PSPACE is in some ways a simpler class than P, e.g. APSPACE = NPSPACE = PSPACE).

### Direction 5: Agnostic RL

The conventional approach to nonrealizability in offline and online learning is to replace convergence to the true hypothesis with convergence to the hypothesis closest to the truth. Similarly, in RL, given some natural metric or divergence on the space of environments, we can try to prove regret bounds of the form

Here, is the time horizon, is the hypothesis class and is the true environment.

That is, we allow a sublinear term (as in the realizable case) and a linear term which scales with the distance of from . Indeed, a result of this form was proven in Zanette et al, although in a model-free setting (i.e. instead of the environments themselves they consider their corresponding state-action value functions, although their divergence also depends directly on the environment).

At least superficially, a potential drawback of this approach is that Bayes-optimality doesn't imply this "agnostic" regret bound (AFAICT), as opposed to ordinary regret bound of the realizable setting. There is also no immediately obvious agnostic analogue of Bayesian regret. However, we can rectify this by recasting the setting in IBRL language. Indeed, for any it is possible to construct a (non-crisp) infra-Bayesianian law^{[13]} s.t., for any policy and :

Notice that when is total variation distance, we can take .

Then, learning the hypothesis class is equivalent to agnostically learning . Moreover, IBRL does have Bayesian regret, and a notion of Bayes-optimality that implies infra-Bayesian regret bounds, which thereby carries over to agnostic RL.

A few examples of directions for further research:

- (Direction 5.1) Finding an agnostic version of the regret bound in Jin, Liu and Miryoosefi (their bound uses the so called Bellman-Eluder dimension of the hypothesis class, as opposed to the much more restrictive linear dimension of Zanette et al.)
- (Direction 5.2) Studying model-based agnostic regret bounds
^{[14]}. - (Direction 5.3) Combining agnostic results with other IBRL results, e.g. with IMAB.

## Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

- (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
- (Subproblem 2.1) It's not clear how to to talk about learnability and learning rates.

It makes some sense to consider these problems together, but different direction emphasize different sides.

### Direction 6: Metacognitive Agents

This research direction is motivated not only by traps, but also by producing especially strong frequentist guarantees via metacognition (a form of self-improvement). More speculatively, this model might be also applicable to humans (and related specifically to conscious reasoning), which makes it relevant to superimitation (see relevant section [LW(p) · GW(p)] below). It was originally called Turing RL [LW(p) · GW(p)], but that name seems incorrect now because "RL" assumes learnability (no traps).

In a metacognitive agent, in addition to the usual "external" interfaces (actions and observations), there is also an "internal" interface that allows the agent to execute arbitrary programs and observe their outputs. This allows us to use a learning algorithm to form subjective beliefs about computations. The beliefs can be entangled with the agent's beliefs about the external world.

For example, let be the set of all lambda terms. Then our agent might have a belief over that refers to the relation of -equivalence on . In a Bayesian framework is a distribution, and the agent might also have a belief about the external world that takes the form of an RL environment that depends on a point in the support of . In an infra-Bayesian framework, is an infradistribution and there is an appropriate way to define a notion of "law" s.t. is its marginal^{[15]}.

However, it might be more interesting to put all beliefs about the world "inside quotation". Indeed, let be a prior about the external world that a program can sample from: for example the unnormalized Solomonoff prior. Given a -equivalence oracle, we can sample this prior using some oracle machine (since such an oracle allows to instantly know whether a given program produces given output). Since any point in has the right type signature to serve as such an oracle, we can construct a kernel from into the external world s.t. is defined by plugging the oracle into . We can then combine any with (i.e. form the semidirect product ) to obtain the overall belief.

Notice that this setting is inherently nonrealizable, since the ground truth about is uncomputable. On the other hand, the nonrealizability of the external world can in principle be circumvented, because we have computationally simple hypotheses of the form "the external world behaves according to program ". Moreover, even "the external world is sampled from the unnormalized Solomonoff prior" is a computationally simple hypothesis. Hence, as long as the world is computable, we can describe it exactly modulo uncertainty about computations.

While the external world can contain irreversible transitions, it seems reasonable to employ an approximation in which executing most^{[16]} programs doesn't have irreversible harmful side-effects. As a consequence, if the agent survives long enough, it can learn most facts about computations. In the "quoted prior" setting, it will then use what it learned about computations to approximate Bayes-optimal behavior in the external world, which addresses subproblem 1.2. Moreover, since some hypotheses assert quoted priors about *computations*, it can arguably use what it learned to improve its own learning algorithm.

This framework is closely related to MIRI's notion of "logical uncertainty". Hence, it makes sense to compare it Garrabrant's logical induction (LI). While LI might turn out to have a useful role in this framework, in itself, LI doesn't address some key relevant questions:

- LI is focused entirely on epistemology, and the guarantee it satisfies reflects that. On the other hand, we are interested in the agent's performance as measured by achieving goals (expected utility) in the external world, and want guarantees that reflect it.
- The guarantees of LI are purely asymptotic, without especially useful rates on convergence. Indeed, LI assumes learning from a fixed sequence of theorems that are revealed to the algorithm, which is bound to be astronomically inefficient. On the other hand, a metacognitive agent should decide which programs to run online in some manner that guarantees good sample complexity.
- LI in its present form is not concerned with computational complexity, while we are ultimately interested in models that are computationally feasible (although it might be useful to consider more or less realistic models of computational resource constraints).
- LI doesn't use what it has learned to improve its own learning process.

There are several potential starting points for research into metacognitive agents:

**Direction 6.1: Bayesian planning via query-based learning of computations**

Arguably tree automata and more generally string machines [LW · GW] are a natural class of hypotheses about computations. Therefore, we can try building on existing research in automata inference^{[17]}. This field is infamous for its many impossibility results, but there is one model of learning that has been quite successful. Namely, learning automata assuming two types of allowed queries:

- Running the unknown automaton on specific input.
- Stating a hypothesis automaton, and receiving a counterexample if the hypothesis is false. (By "counterexample", we mean an input on which the hypothesis automaton produces a different answer from the true automaton.)

That is, algorithms exist that guarantee convergence to a correct automaton with a number of queries that is polynomial in minimal automaton size.

In our setting, the first type of query can be implemented by running a specific program. The second type of query cannot be implemented directly, but is effectively available in various natural setups. For example:

- If we are interested in PAC-learning w.r.t. a known distribution over programs, we can implement the counterexample query by sampling from the distribution and testing the hypothesis on the sample.
- If we want the algorithm to answer a sequence of questions of the form "predict the output of this program", we can implement the counterexample query by answering according to the current hypothesis, which leads to a polynomial mistake bound.

Therefore, it is a natural question whether we can implement Bayesian planning by some protocol of this type, assuming (for starters) that it is possible to efficiently sample from the prior. For example, assume that the hypothesis about computations translates to a hypothetical infra-Bayesian coarsening (or Bayesian approximation?) of the prior for which optimal planning is feasible. Assume also that there if resulting policy performs less well than promised on the true prior, there is some way to produce a counterexample to the hypothesis about computations. Then, counterexample queries are effectively available.

Automata learning might also be feasible in a setting where the automaton generates a sequence that we need to predict. (At least for a deterministic automaton this is trivial because the sequence is periodic.) Speculatively, the amount of available computational resources can be treated as defining a "sequence" (possibly related to "logical time [LW · GW]") and this can be exploited to get guarantees even when the prior cannot be sampled efficiently.

**Direction 6.2: Expressiveness of string machines**

String machines are a potential starting point for finding a frugal universal prior for metacognitive agents. In order to strike the right balance between "frugality" and "universality" we need to understand the expressiveness of the string language and the computational complexity of evaluating string machines (or answering more complicated questions about string machines) under various assumptions. In particular, parameterized complexity theory might be relevant here, since there seem to be more relevant parameters than just description size (e.g. the number of meta-vertex levels we allow).

**Direction 6.3: Query learning of string machines**

We can also study the sample complexity of learning string machines in the two-type query model above. While existing results guarantee learning with a number of queries polynomial in the *number of states* of an automaton, we would like query complexity to scale well with the *description size* of an automaton in string language. More general string machines are not even automata.

### Direction 7: Approximation algorithms for Bayesian planning

Subproblem 1.2 comes from an NP-hardness result, but that still leaves room for studying approximation algorithms. There are many NP-hard optimization problems for which we have conjectured optimal approximation ratios (e.g. based on the PCP theorem or the unique games conjecture). It is conceivable that we can prove a result of this form for Bayesian planning. Some approaches:

- (Direction 7.1) Fix a prior over set of MDPs with state set and action set , and an arbitrary stationary policy . Let be the (unknown) Bayes-optimal policy. For any policy we can consider the approximation ratio

- (Direction 7.2) Finding the Bayes-optimal policy is NP-hard, but is it NP-easy? If it is e.g. PSPACE-hard then it seems significantly less likely that we can get optimal approximation results.

### Direction 8: Expanding Safety Envelope

The Expanding Safety Envelope [LW(p) · GW(p)] (XSE) is an approach to traps which can be summarized as follows:

- Assume there is some a priori known safe baseline policy, or more generally a non-empty set of known safe actions at every state reachable by these actions.
- Exploring the known safe actions might be sufficient to discover new safe actions. Exploring the new safe actions might be sufficient to discover even more safe actions, etc.
- Therefore, we can study guarantees that compare asymptotic performance to the policy which is optimal among those restricted to discoverable safe actions.

One significant limitation of this approach is, when using a rich prior, discovery of new safe actions can at best be approximate. While we can study variants where we sometimes try risky actions, guaranteeing any sort of optimality there is likely to be intractable. However, maybe it is feasible to require that, whenever the posterior admits an approximation (e.g. in the sense of total variation distance) by a prior with iteratively discoverable safe actions, we get a corresponding lower bound on the performance of our algorithm starting from this point.

## Better regret bounds for simple priors

This section is a grab bag of research directions about improving regret bound theory for simple hypothesis classes based on MDPs. They are not directly related to AIXI, but rather fit perfectly within classical RL theory^{[18]}. However, it seems plausible that they are necessary pieces of the bigger puzzle (i.e. for a full solution of Problem 2).

### Direction 9: Unifilar MDPs

[**EDIT:** What I called "unifilar MDP" is already known in the literature under the name "regular decision process" (RDP): see Ronca and De Giacomo. Notably, Subproblem 2.2 (password games) already emerges in learning RDPs with small number of states. Ronca and De Giacomo address it by assuming that the uniformly random policy has a high probability to reach every RDP state: a condition which is clearly insufficiently general, but does serve as an interesting starting point.]

Most of RL theory literature focuses on learning an MDP with a known state space, implicitly assume a known representation. Ortner et al prove a regret bound in a setting with unknown representation, but it scales with the *cardinality* of the set of possible representations. For real-world agents, I expect this cardinality to be something like doubly-exponentially large, so their bound is extremely weak.

As a first step towards a better theory of representation learning, I propose the following setting. Fix action set and observation set . Then, a *unifilar MDP* (UMDP: combination of MDP and unifilar hidden Markov model) consists of the following data: the state set , the initial observation distribution , the state initialization function , the response kernel and the transition function . The interaction of a policy with a UMDP works as follows:

- First, the first observation is sampled from .
- Second, the state is initialized to .
- After every observation :
- selects an action .
- The next observation is sampled from .
- The next state is set to .

Notice that the state at any moment of time can be recovered from the observation-action history, i.e. the above data defines some . Hence this is indeed an MDP rather than a POMDP.

Fix a reward function that can itself be represented by a deterministic finite state machine. We then constrain by requiring that there exists some s.t. for any and we have

The problem is then proving the existence of an algorithm for interacting with an unknown communicating UMDP compatible with which satisfies a regret bound that scales with 's number of states, number of actions and diameter.

### Direction 10: Foreground MDPs

Regret bounds in (lifelong/non-episodic) RL scale with MDP diameter because of the need to rule out traps. On the other hand, sequence prediction / online learning requires no such constraint and even allows environments with infinitely many states. However, sequence prediction / online learning can also be regarded as a degenerate special case of RL. This suggests that there should be a natural theory that combines both.

Motivated by the above, I propose to study MDPs of the following form. Assume that the state set of the MDP factors as (background and foreground). Moreover, there is some s.t. the -marginal distribution of is for any , and , where is the transition kernel of the MDP. That is, the background dynamics depends neither on the foreground state nor on the agent's actions. On the other hand, the foreground dynamics can depend both on the background state and the agent's actions.

We can then define a notion of diameter for the foreground only, and derive regret bounds that scale with it. Such a regret bound can scale with the total number of states, or with some dimension parameter (see Direction 11 below). In the latter case, can be infinite.

### Direction 11: Canonical RL Dimension

[**EDIT:** Foster et al achieved significant progress on this, which I didn't know at the time of writing.]

This direction already receives some attention in mainstream academia, so it might be of relatively lower priority.

In multiple areas of learning theory, we have a fairly complete characterization of how sample complexity depends on the hypothesis class. In many cases, the key parameter of the hypothesis class is some type of "dimension". For example:

- For binary classification, Vapnik–Chervonenkis dimension
- For multi-class learning, Natarajan dimension
- For online learning, Littlestone dimension

In RL, our understanding is less good. We do have some results of similar nature, most of them derived from Russo and Van Roy's "eluder" dimension. For example Jin, Liu and Miryoosefi demonstrate a model-free regret bound that scales with "Bellman-Eluder" dimension and the logarithm of a covering number (the later can be said to correspond to Minkowski-Bouligand dimension). However, we don't have matching *lower bounds* for these results. This means that we don't know whether the results we have cannot be subsumed by some better theory (indeed, Du et al show results that are neither strictly weaker nor strictly stronger than Jin, Liu and Miryoosefi). In contrast, the flavors of learning theory mentioned above come with proofs that their notions of dimension are optimal (albeit, only under the assumption of a worst-case distribution).

## Direction 12: Epistemic Regret Bounds

This direction attempts to address Subproblem 2.2.

The motivating intuition is: formulating the correct frequentist guarantee requires correctly distinguishing between *epistemic* and *aleatoric* uncertainty. The problem with password games is that the password is completely random information that is impossible to infer by any method other than brute force. Hence, we should treat this information as part of "aleatoric" uncertainty.

Let's recall the definition of Bayesian regret. Let be a set representing hypotheses and the prior. Since each hypothesis defines an environment, we have a mapping

The Bayesian regret of a policy is given by

Here is the time horizon, and is the reward at time .

For any environment and history , denote

That is, is the probability that produces assuming a deterministic policy compatible with the actions in .

Notice that minimizing Bayesian regret is equivalent to maximizing expected utility. However, expected utility only depends on and via defined by

Here, the expectation is in the sense of the Bayes' rule for environments, not pointwise. That is, first is sampled out of and then the agent faces the environment . Explicitly:

On the other hand, Bayesian regret depends directly on and . Indeed, the minimal Bayesian regret is typically non-zero even for learnable . On the other hand, if we took to be a single element set and defined by , we would get but the corresponding Bayesian regret of the Bayes-optimal policy would vanish.

This means that the definition of Bayesian regret treats the uncertainty coming from and the uncertainty coming from the stochasticity of differently. We can call the former "epistemic" and the latter "aleatoric". Intuitively, the -dependence of Bayesian regret measures the rate at which the agent reduces its goal-relevant epistemic uncertainty, and it is sublinear iff the agent asymptotically reduces it to 0 i.e. learns all the "epistemic information" it needs. On the other hand, we don't expect the agent to learn all "aleatoric information" but only to maximize expected utility w.r.t. to aleatoric uncertainty.

In the case of the Solomonoff prior, the naive approach regards the uncertainty about the program for the UTM as epistemic uncertainty and the uncertainty corresponding to the probabilities computed by that program as aleatoric uncertainty. However, in algorithmic statistics^{[19]} there are more nuanced tools to separate "structured" and "random" information, like the notion of "sophistication".

Given any string , we can consider ways to represent it as running program (which we assume to halt on *all* inputs) on input . Let's constrain this representation to be nearly-optimal compression, i.e. require that for some some "small" number

The minimal length of over all such representations is called the -*sophistication* of and can be interpreted as the amount of "structured" information in . In our case, the minimal of an MDP might be considered as the "epistemic" information it contains, which leads to a novel decomposition of the Solomonoff prior into some and . Care might be required for choosing . For example, maybe we can set .

In purely frequentist terms, we can hope for a regret bound of e.g. the form

Here is the -sophistication of the MDP+representation (s.t. is the amount of "random" information) and is another constant.

More generally, instead of a single MDP we can consider a computable distribution over a -dimensional family of MDPs of diameter at most , using some notion of dimension from RL theory (see Direction 11 above). Then, we can expect that the average (-Bayesian) regret of the family can be bounded by

Here is yet another constant, whereas and refer to the sophistication and Kolmogorov complexity of .

Given an appropriate notion of "computable distribution over MDPs", this would encompass learning of MDPs with arbitrary (uncomputable) transition probabilities.

## Subproblem 2.4: Multi-Agent Guarantees

One special case is completely straightforward in IBRL, namely two-player zero-sum games: IBRL essentially *is* a zero-sum game, so it's enough that each player views the other as "Murphy". So, for example two IBRL agents equipped with sources of randomness (s.t. effectively their actions are lotteries over game moves) playing a stochastic game necessarily converge to a Nash equilibrium, assuming that the corresponding infra-MDPs are in their respective hypothesis classes.

Beyond that, there are roughly two approaches to multi-agent guarantees:

- (Subproblem 2.4.1) Trying to derive additional parts of classical game theory from learning theory.
- (Subproblem 2.4.2) Trying to derive
*superrational*behavior from learning theory.

AFAIK either of the following possibilities is conceivable:

- (World 2.4.1) Subproblem 2.4.2 requires theory built on top of the theory required for Subproblem 2.4.1.
- (World 2.4.2) Subproblems 2.4.1 and 2.4.2 are partially or even largely independent.
- (World 2.4.3) Subproblem 2.4.1 is not a reasonable desideratum at all.

Different directions below have different relationships with these worlds.

[**EDIT:** See also Infra-Bayesian Haggling [LW · GW], a new direction for Worlds 2.4.2/3]

### Direction 13: Population Games

This direction is motivated by Worlds 2.4.1/2.

A population game [LW · GW] is a setting where large populations of players play against each other by being sorted into random groups on each turn, s.t. the probability to meet the same opponent twice is negligible. This setting is designed to reduce incentives for superrationality, although it's still unclear whether this is sufficient to justify classical game theory.

An appealing feature of this setting is that it can be naturally interpreted as a dynamical system. For learning agents satisfying very mild assumptions, every fixed point corresponds to an -Nash equilibrium (where as the geometric time discount constant ). Moreover, a fixed point exists for any tuple of player policies.

It is therefore natural to ask whether any of the fixed points have to be attractors (under some assumptions about the learning algorithms), study their attraction basins, study other fixed submanifolds etc.

### Direction 14: Logit equilibria of finite-state repeated games

This direction is motivated by World 2.4.1. (See this [LW · GW] for early thinking on the topic.)

A logit equilibrium is defined similarly to a Nash equilibrium with "optimal response" replaced by "soft-optimal response" (i.e. the players behave according to the softmax distribution associated with the expected payoffs of different plays for fixed opponent distributions). Consider the iterated prisoner's dilemma where each player's pure strategies are constrained to be of the form:

- Choose some play ( or ) on round number 0.
- Choose some mapping . On round , play according to where is the opponent's play on the previous round.

In particular, tit-for-tat corresponds to and .

For geometric time discount close to 1, both always-defect/always-defect and tit-for-tat/tit-for-tat are Nash equilibria. Moreover, tit-for-tat/tit-for-tat is also an approximate logit equilibrium for softmax parameters^{[20]}

However, always-defect/always-defect is *not* an approximate logit equilibrium as long as

This happens for the following reason. Tit-for-tat is almost as good a response to always-defect as always-defect: it only loses utility because of round 0. Therefore, the soft-optimal response to always-defect is a mixture of at least always-defect *and* tit-for-tat. Moreover, the soft-optimal response to a mixture of always-defect and tit-for-tat places little probability on always-defect, because tit-for-tat does much better.

As a result, in this asymptotic regime logit equilibrium entails a probability approaching 1 of cooperating forever. More generally, I propose the following conjecture: in any repeated (or even stochastic?) game in which we constrain the strategies to be finite state machines (with/without a uniform bound on the number of states?), while allowing a sufficient number of states, any logit equilibrium converges to Pareto efficiency in the asymptotic regime above.

This seems like a potential method of obtaining superrational behavior as a corollary of classical game theory in some situations.

### Direction 15: Infra-Bayesian Veil of Ignorance

This direction is motivated by Worlds 2.4.2/3.

As a warm-up, consider a repeated symmetric game between two agents who are copies of each other, either without randomization or with both using the same random generator. In this setting, each agent learns that the other parrots its actions precisely. Therefore, even a simple learning algorithm will converge to the symmetric Pareto efficient outcome.

**Direction 15.1: IBRL with exact clones**

What if the game is not symmetric, or there is randomization which breaks the symmetry? If we assume all the agents are exact clones (i.e. they execute the exact same code / follow the exact same policy), each agent can regard this as a Newcombian environment (the clones are implemented by "Omega"). And, we know that Newcombian environments can be represented as infra-Bayesian laws. We will call it the *clone law* in this case.

Notice that, even though exact clones must have the same utility function, this doesn't rule out asymmetric games. Indeed, the utility functions are the same *in terms of the particular agent's actions and observations*, but this is consistent with each agent identifying their own role ("which clone/player am I") and receiving payoffs accordingly. As a simple example, we can imagine the payoff being a part of the observation channel.

Notice also that the clone law is *not* a convex combination of laws corresponding to the different roles the agent can play. Rather, the clone law requires Murphy to predict the behavior of the agent in each possible role *in advance* of the random role selection. This way pseudocausality is ensured (i.e. Murphy can't "cheat" by only predicting the behavior of one role correctly.)

Moreover, two clone laws which are identical except for a different probability distribution over roles typically cannot coexist in the same learnable hypothesis class. This is because there is no way to infer the distribution over roles from observations. Instead, we should think of this distribution as determined by the prior. Indeed, mixing two clone laws is equivalent to mixing the role distributions in the limit, as long as the role distributions have full support. Hence, we can imagine all clone laws for the same game inside the prior effectively aggregating into a single clone law with the average role distribution.

The optimal policy for the clone law is maximizing the linear combination of utilities of different roles according to the prior distribution over roles. Hence, the prior determines which point on the Pareto frontier is selected, and can be regarded as defining a "notion of fairness".

One issue with clone laws is that they cannot be represented as e.g. infra-POMDPs with a finite number of states (the state has to encode Murphy's prediction i.e. the entire relevant part of the policy). This leads to a concrete research problem: *find natural hypotheses classes containing clone laws that admit efficient learning algorithms*.

**Direction 15.2: IBRL with inexact clones**

What if we want to allow agents with different utility functions? One trick to achieve this is postulate that the agent receives a description of its utility function upon waking up (e.g. in the form of a program, if it's a metacognitive agent). This way, the policy has to account for all admissible utility functions. It is possible that we can apply similar methods without this trick, using instead the observation that most utility functions can appear as *instrumental* *goals* in some situations (e.g. using the formalization of compositional polytope-MDPs).

What if the agents have different priors (in particular over utility functions but also in general)? Now they are not exact clones anymore. However, it is conceivable that their policies are still *similar* in some sense that can be used to represent the setting as Newcombian. For example, we can equip the space of utility functions with some metric. We can then define a metric on *policies *via the Hausdorff distance between their *graphs*.

As a toy model, we can consider a fixed game in normal form, with the agent having some prior over:

- Role (player)
- Payoff matrix
- Distance of opponents from itself

We can then try to prove that, under some assumptions, priors that are "close" (according to some divergence) lead to policies that are close. Moreover, when agents with close priors are facing each other (and know their mutual distances are small), they should play close to Pareto efficiency. It would be fascinating if we could recover something like Yudkowky's proposal [LW · GW].

In particular, consider a game with a *unique* Pareto efficient outcome. Assume that all admissible payoff matrices are monotonically increasing transformations of each other. Then, the agents will always play Pareto efficiently, because doing so unconditionally guarantees that all opponents will do the same thing, for any finite distance from opponents. (In particular, the actual distance is 0).

## Direction 16: Hidden Rewards

This direction attempts to address Problem 3 (value ontology).

In order to define a utility function, we need some ontology to specify its domain. One natural way to represent an ontology is using an infra-POMDP. That is, we have the state set , the initial infradistribution , the observation mapping and the transition infrakernel . Here, stands for credal sets over , i.e. non-empty closed convex subsets of (we can also allow more general infradistributions). The reason we're using an infra-POMDP instead of a POMDP is, we don't want to fully specify the environment.

To specify an environment over (in ontology ), we consider a *maximal refinement* of the infra-POMDP (we call this a *hidden environment*^{[21]}). That is, a state set , an initial distribution , an observation mapping , a transition kernel and a translation mapping s.t. the following conditions hold:

- For all and :

If we want to do IBRL rather than Bayesian RL, we can consider general (non-maximal) refinements, i.e. refinements that are infra-POMDPs rather than POMDPs.

To specify a reward function over ("hidden reward function"), we consider a function . More generally, we can allow bounded functions . Given a reward function, the corresponding utility function is defined as usual i.e. as a time-discount sum of rewards.

An interesting special case of a hidden environment is a UMDP equipped with a translation mapping. We will call it a *relative *UMPD (RUMDP). Assuming a hidden reward function of the form , it admits efficiently computing the optimal policy.

Given hidden environments/rewards instead of observable (i.e. ordinary) environments/rewards, it is straightforward to define analogues to concepts from RL theory such as prior, Bayes-optimality, regret, learnability etc.

### Direction 16.1: Agency with partial monitoring

Learnability is more difficult to establish with hidden reward functions than with observable reward functions, since ruling out traps is insufficient (see Example 1 here [LW · GW]). For stateless environments, we already have the closely related theory of partial monitoring, which provides a comprehensive classification of possible regret bounds^{[22]}. Therefore, we can try building an extension of this theory to our RL-like setting.

### Direction 16.2: Specifying semi-instrumental reward functions

A related setting in which learnability is established is instrumental reward functions [LW · GW] (and very likely we can extend that to semi-instrumental reward functions as well). Its drawback is that (as opposed to hidden rewards) it doesn't come with a simple natural way to specify reward function. A formal relationship between the two frameworks could allow us enjoying both worlds.

Given any hidden reward function , we can ask whether there exists a semi-instrumental reward function s.t. for any hidden environment and history compatible with

Here, stands for the instrumental state corresponding to , and we abuse notation on the left hand side by implicitly projecting to using .

The problem, then, is to characterize hidden reward functions with this property. Also, we can try to characterize for which all hidden reward functions have this property.

### Direction 16.3: Undogmatic Ontologies

Every hidden environment defines an observable environment. However, for an arbitrary , not every observable environment can be obtained this way. In other words, constraints what the agent can expect to see. Arguably, this is undesirable: why would the real world agree with an ontology which is just an arbitrary/subjective feature of the agent's axiology?

Hence, it is natural to consider *undogmatic* ontologies: that admit all observable environments. However, it seems tricky to find pairs of and a hidden reward function that simultaneously satisfy all of the following conditions:

- is undogmatic.
- The class of all communicating RUMDPs over is learnable for .
- is not reducible to an observable reward function. That is, there is no s.t. for any it holds that .

As an existence proof, we can consider the ontology whose states are instrumental states () and initialization is completely uncertain (), and corresponding to any instrumental reward function. It would be interesting to find a good general characterization. ~~In particular, I don't even know whether ~~~~ can be finite.~~ [**EDIT:** Here [LW(p) · GW(p)] is a way to construct many examples, including finite ones.]

### Direction 16.3: Bayes-optimal planning for communicating RUMDPs

We know that approximating Bayes-optimal planning is intractable for MDPs with traps. However, maybe for communicating RUMDPs it is tractable even when they form an unlearnable class. This seems like an interesting question to investigate.

## Direction 17: Algorithmic Descriptive Agency Measure (ADAM)

This direction attempts to address Problem 5 (descriptive agent theory).

ADAM is a formal way of quantifying the extent to which some arbitrary policy is "agentic".

Fix some UTM .

We will consider computable utility functions of the form . Notice that such a function is automatically continuous. Since it is computable, we can talk about its Kolmogorov complexity (w.r.t. ).

Also, for any UTM we denote the corresponding Solomonoff (semi-)environment (prior). We can talk about : the Kolmogorov complexity of w.r.t. . Also, given as above we can consider the *joint* Kolmogorov complexity .

We can now tentatively define ADAM by

Here the maximum is over all computable utility functions and UTMs .

Intuitively, we are looking for a way to interpret as maximizing w.r.t. prior s.t., on the one hand, does a good job at maximizing (enforced by the first term), and on the other hand, this interpretation is not contrived (enforced by the second term).

Notice that inherits all the other problems of AIXI. Hence, we will ultimately need to modify it according to the respective solutions (e.g. use frugal universal priors and their associated complexity measure instead of Solomonoff priors and Kolmogorov complexity).

has a few easy to check nice properties:

- The AIXI policy has .
- Any computable policy has . This allows us to intuitively interpret as something like "how many of the bits needed to describe are contributing to its agency".
- For every there is a computable policy s.t. .
- Changing the UTM only changes by .

However, at present I don't know much else about its properties. This leaves a lot of questions to explore:

### Direction 17.1: Comparing ADAM variants

Here are some superficially similar definitions:

Here, is a Solomonoff prior over policies and is a Solomonoff prior over utility functions and UTMs^{[23]}.

At present, I slightly prefer over and because for the latter two, I don't know property 3 above (I know that e.g. but not ).

It is easy to see that

**Open problem:** Are the same as up to ?

### Direction 17.2: ADAM of random policies

Deterministically generating an object of high Kolmogorov complexity is hard: for any program that outputs a bitstring we have , so generating with very high requires a very long program. Since , this means that deterministically generating an agentic policy is also hard. But, *randomly* generating an object of high Kolomogorov complexity is easy: most bitstrings of length have . On the other hand, I expect most policies to be unagentic.

**Conjecture:** For any computable , we have

Here, is computable in the sense that it's a Turing machine that receives the first argument on a separate input tape, and is the IID fair coin.

### Direction 17.3: ADAM Hierarchy Theorem

It would be interesting to understand how dense the possible values of are. That is, find a function , as slow growing as we can, s.t. for every there is a policy for which

### Direction 17.4: ADAM for finite-state policies

Consider policies that can be implemented by a deterministic finite state machine. That is, such a machine has a finite state set , an initial state and a transition function . Suppose that . Clearly, the policy implemented by the machine has

**Open problem:** Is this bound on in terms of close to tight? Are there infinitely many s.t. for some -state policy it holds that ?

### Direction 17.5: Inferring the utility function

Given an agentic policy, can we infer its utility function and prior? For simplicity, let's start with the case when the prior is known^{[24]}. For this purpose, we can consider the "semidescriptive" agency measure

Here, .

The obvious candidates for the utility function associated with are all that attain the supremum on the right hand side, or at least come close to attaining the supremum. Notice that the contrived utility function which just rewards behavior according to will typically fail this criterion: and hence the expression inside the maximum is when .

As opposed to , is not approximately UTM invariant, strictly speaking. However, it is still approximately invariant w.r.t. changing the UTM used for defining and while holding fixed. That is, we imagine the Kolmogorov complexities defined w.r.t. some UTM and w.r.t. a different UTM , and we have approximate invariance w.r.t. changing .

This raises the question of whether the inferred utility function is at least somewhat stable w.r.t. changing . Intuitively, it is unlikely to be the case for arbitrary : there is no canonical way to ascribe a utility function to a rock. However, we can expect it to be the case for agentic , i.e. in the limit .

To simply matters even further, let's focus on the case . Moreover, let's assume that there is exists a computable s.t. is the associated AIXI^{[25]}, i.e.

We probably still don't have full uniqueness since small changes in rewards that happen a bounded number of times can be insufficient to affect the optimal policy. However, the *asymptotic* rewards might be unique. Formally, I propose the following uniqueness conjecture.

**Definition:** Given a topological space and bounded function , and are said to be *locally equivalent utility functions* when for any there is an open neighborhood of , and s.t. for any

**Conjecture:** Let be a policy and computable utility functions s.t. is AIXI for both and . Then and are locally equivalent in the sense of the product topology on .

[**EDIT:** The Conjecture as stated is false, but other versions might be true, see discussion here [LW · GW].]

## Direction 18: Infra-Bayesian Physicalism

This research direction attempts to address Problem 4.

Infra-Bayesian Physicalism [LW · GW] (IBP) is a framework that builds on metacognitve agents, introducing a way to talk about hypotheses, priors and counterfactuals that does away with cartesian privilege.

IBP comes with an agent-independent ontology for values, which provides another possible answer to Problem 3. There might be a natural way to translate hidden rewards to the IBP framework (see original article [LW · GW]), so these two answers are not mutually exclusive. However, the price tag is a bizarre constraint on values which we call the "monotonicity principle", for which I only have wildly speculative explanations at present. Hopefully, further research can shed some light on the question.

IBP also has an interesting relationship with ADAM: it seems that in that framework it's impossible to define an "ideal" agent, but we have to be content with defining an agency measure [LW · GW]. At the very least, "ideal agent" is a less natural object in that context.

See the original article [LW · GW] for concrete directions for further research.

# Physicalist Superimitation

## Motivation

Physicalist Superimitation (PSI) is a (currently informal) approach to creating a rigorous alignment protocol, i.e a formal abstract model of a (hopefully feasible) AGI design that will be provably aligned under some (hopefully realistic) assumptions.

PSI is not the* *main motivation for LTA. The case for creating a mathematical theory of intelligent agents, both in general and even in particular via LTA, is much more robust than the case for PSI specifically. Moreover, I expect the theory produced by LTA to be applicable to analyzing many different alignment protocols (e.g. IDA, debate or AQRL [LW(p) · GW(p)]), not just to PSI.

However, there are several reasons why discussing PSI here seems important:

- PSI is a demonstration of one relatively concrete way how LTA might produce a full solution for alignment. Even if it's ultimately not the right approach, it gives us some intuition about how theory can cache out into novel solutions.
- PSI offers unique advantages that all competing approaches lack AFAICT. Namely, I think that PSI is the only protocol that has a story for robust inner alignment. Certainly if we also require a plan for formalizing the story. At the same time it also doesn't handicap the AI's capability much
^{[26]}: the AI is optimizing directly towards goals in the world, without e.g. routing through prediction or through human understanding of plans, and also without any constraints on the AI's knowledge, methods or magnitude of impact. - Thinking about alignment protocols and attack vectors is a useful way of noticing flaws in our models, or understanding their implications better. For example, thinking about Christiano's acausal attack made me formulate Problem 4, even though the problem can be justified without explicitly referring to alignment.

## Superimitation

The first component of PSI is *superimitation*: an agent (henceforth: the "imitator") that receives the policy of another agent (henceforth: the "original"), and produces behavior which pursues the same goals but *significantly* *better*. (Later, we will apply the framework in a way in which the role of "policy" is played by something broader than externally visible behavior: the required "behaviorist" assumptions about values are a lot weaker than they might seem.) This is different from mainstream research in inverse reinforcement learning where the goal is usually producing behavior which is merely equally good, or better only by the virtue of being less noisy. For now, we will put aside the question of *how* to obtain the original policy (but, see Agent Detection below).

While PSI heavily leans on IBP, basic superimitation can be formulated and studied in a cartesian setting as well. In fact, starting the research from a cartesian setting seems like the reasonable approach. More concretely, the starting point can be ADAM and in particular Direction 17.5.

However, the successful completion of Direction 17.5 is would still fall short of a model of superimitation. Indeed, presumably the utility function can only be inferred (more or less) perfectly in the limit , but in this limit the original is already optimal and cannot be improved upon. And, for any finite , the inaccuracy of the inference might negate any improved optimization. Therefore we need to add to our model some mechanism by which the imitator gains an *advantage* over the original.

One advantage mechanism is allowing the imitator more actions and/or observations. Examples of operationalizations:

- The original action set is , the original observation set is , the imitator action set is , the imitator observation set is and we are given surjections and . The imitator optimizes the utility function which is obtained by
*pulling back*the original utility function using and . (One problem with this is: the AI having e.g. the apparent experience of a person enjoying themself is very different from the person having that experience.) - The original and the imitator act in parallel, like in CIRL (but, unlike in CIRL, the imitator knows the original's full policy in advance), the imitator can observe the original's observations and actions (as well as its own), and optimizes the original's utility function. (Some problems with this are: it is an experience machine, not to mention that it's unclear how to practically implement an imitator that can fully observe the original.)

The parenthetical issues related to the domain of the utility function can hopefully be addressed given a solution of Problem 3. But, even putting these aside, this mechanisms seems somewhat disappointing: we want the AI to be superhuman not only thanks to better input/output channels, but also thanks to some deeper notion of "cognitive power". In other words, the resulting system might be very uncompetitive.

An improved advantage mechanism is suggested by the framework of metacognitive agents. The imitator can have superior computing hardware which allows it run computations faster than the original (via the internal interface), or even run computations that the original cannot run^{[27]}. It has some formal similarity to the first mechanism, since the computing hardware can be regarded as a type of input/output channel, but seems like a better model of "cognitive power".

### Applicable to humans

I often hear objections along the lines of "but humans don't have utility functions / are not rational / are not coherent / are not agents". I think that that reasoning is confused.

I already touched on this topic in the section Nonproblem [LW · GW] above. But, in the context of alignment, there is a different important argument: alignment is fundamentally a *normative* problem. Whenever we talk about whether an AI is "aligned" or "good" or "safe" we're using value-laden concepts that only makes sense within a *model* (approximation) in which we actually possess (more or less) well-defined values. Whenever we discuss what AI design we should *choose*, we're working on a problem that only makes sense within a *model* in which we are rational agents that make choices according to our values. Let's call this model "the anthroposinepic^{[28]} view".

This is not to say that the anthroposinepic view is a perfect description of reality. The map is not the territory, all models are wrong - but some are useful. For example, "the center of mass of the sun" is also not a well-defined concept: it's not clear which particles are part of the sun, in QFT particle positions are inherently fuzzy within the Compton wavelength, and in quantum gravity *any* notion of position is probably approximate at best. Nevertheless there are many contexts in astrophysics in which speaking of the center of mass of the sun is perfectly sensible. Similarly, there are contexts in which the anthroposinepic view is perfectly sensible: in particular normative questions, since they assume this view from the onset!

Another objection of similar flavor is: what if humans are actually made of multiple coexisting agents (see e.g. Sotala [? · GW]), or have dynamically inconsistent preferences (i.e. a utility function that varies over time, such as hyperbolic discounting)? My answer is: maybe it means that we will have to extend the framework to deal with those possibilities. But it's also possible that it's unnecessary or that it works out "automatically": multiple agents (coexisting or distributed along the subjective timeline) behave effectively as a single agent (thanks to superrationality). My methodology it to always start with the simplest model (in this case, humans as unitary agents) and only go beyond it when:

- The simple model is well-understood.
- It is clear that the simple model is inadequate.

Most importantly, I don't think this debate can be definitely resolved purely by philosophical arguments. Instead, we should study the emerging theory, check whether it satisfies intuitive desiderata, cross-reference it with cognitive science, and see what it says about the edge cases in which philosophical confusion arises. If the theory is correct, it will ultimately serve as a source of intuition which will lay all philosophical confusion to rest. And if it isn't, well, then we will see where it breaks and grow wiser from that too.

## Agent Detection

Superimitation assumes we can access the original's policy. In the context of an alignment protocol, this assumption has major issues:

- Maybe we can observe human behavior over time, but how do we know what the human would do in counterfactual scenarios?
- Where to draw the boundary around the original agent? (related: Critch's "boundaries" sequence [? · GW]) What constitutes human input/outputs? Retina illuminance and muscle contraction? Neuron membrane potentials on the physical boundary of the brain? Of the cortex? Is inner monologue part of input/output?
- Even if we know where to draw the boundary, how do we point the AI to this boundary? If we e.g. connect the human's muscles to sensors that send signals to the AI, will the AI learn a "pointer" to muscles, or a pointer to the sensors?

PSI addresses this by combining IBP and ADAM. IBP provides us with a rigorous notion of "which computations is the universe running". ADAM provides us with a rigorous way to decide which sequences of computations are agents. Putting the two together should allow specifying an "agent detector": a mathematical operator that transforms any given hypotheses about the universe into the collections of agents that inhabit this universe.

This can solve the problems above modulo the question of *which* agent should we superimitate (for this, see User Identification below). That said, I don't know how the solution would cache out in practice for humans. It would be fascinating to combine this theory with brain science when the former is much more mature. Speculatively, I suspect that a metacognitive version of ADAM would interpret some of human inner mental life as input/output through the internal interface.

Also, I think that there might be some formal relation between this "physicalist agent detector" and some version of Garrabrant's Cartesian Frames, but I won't try to articulate it here.

## User Identification

We don't want the AI to superimitate any and all agents. Some of those agents might be descendants of the AI. Some of those agents might be other AIs, aliens and animals. We don't necessarily want the AI to superimitate humans who lived in the past either.

For now, let's make the simplifying assumption that there is a single human we want the AI to imitate: the "user". Generalizing the method to groups of humans is left as an exercise to the reader^{[29]}.

It seems [LW(p) · GW(p)] that IBP has a natural notion of causality. Since IBP is a computationalist framework, this notion talks about causal relationships between *instantiated computations*. Specifically, the results of some computations control the instantiation or input^{[30]} of other computations. Since an agent is a type of computation, this gives us a notion of causality between agents: the action Alice takes in a particular Alice-counterfactual can affect the observation Bob makes in a particular Bob-counterfactual.

We can use causality to design a protocol for user identification. For example, imagine that the AI wakes up in a room with the user. The user can observe all of the AI's outputs (e.g. characters on a terminal) and the AI can observe many of the user's outputs (e.g. through a camera pointed at the user). This creates a short and tight causal loop between the user and the AI during the AI's early history that doesn't exist between the AI and any other agent^{[31]}.

Moreover, we can use causality to specify the point on the user's subjective timeline at which the AI begins to influence them. The AI is designed to superimitate only the user's *past*: this way there is no perverse incentive to modify the user. Since agents in this framework are *programs*, this still allows access to the user's entire policy. (Although we still need a to disambiguate between equivalent problems, probably based on description complexity.)

## Full Specification

We can now put all the pieces of PSI together:

- For each hypothesis in the prior:
- Identify all agents.
- Out of all agents, identify the user.
- If there are multiple users or no user, discard this hypothesis.
- Infer the user's utility function.

- Choose policy that maximizes prior expected utility (where in each hypothesis, the utility of its respective user is counted).

At least, this is the simplest version. The above is probably not the best way to aggregate preferences across hypothesis, since, for example, it doesn't facilitate acausal trade. Here are additional steps that can fix it

- (Step 3) For each hypothesis, infer the user's prior.
- (Step 4) Perform Nash bargaining between all possible users (with their respective priors and utility functions), using the previously selected policy as a disagreement point.

Notice that all this is just an (informal) specification of the function we're optimizing, it's *not* the algorithm used for optimization. Similarly to how asymptotic Bayes-optimality can be achieved (for certain priors, e.g. small MDPs) by a feasible learning algorithm, without actually tracking every possible hypothesis and calculating the true Bayesian update, this specification can (hopefully) be approximated by some feasible algorithm the details of which we currently don't know (and which we will only begin to describe after substantial additional progress on the foundational part of LTA).

## Alignment Guarantee

### Formal Alignment Criterion

How to formally define alignment? An appealing approach is, formalizing the following criterion:

**Criterion A:*** An AI is (approximately) aligned when it (approximately) optimizes the expectation of the user's utility function ** with respect to the user's beliefs **.*

Indeed, if I am the user, and I am choosing which AI to run, and I am an agent with a utility function and beliefs , clearly it is rational for me (given my subjective epistemic vantage point) to choose an AI satisfying Criterion A.

This might seem confusing at first: shouldn't I want the AI to have more accurate beliefs than the beliefs I hold myself? But, this isn't a contradiction. If we imagine the AI as having access to more *new evidence* than the user, then starting from the same beliefs it arrives at a better outcome. This is closely related to the "advantage mechanisms" we discussed in the section on superimitation.

We will call the part of Criterion A about *axiological alignment*. It is more or less the same as "outer alignment". We will call the part of criterion A about *epistemic alignment*. It is closely related to (and implies) "inner alignment". The last point might be confusing so let's dwell on it a little.

Inner misalignment is usually defined as the formation of malign mesa-optimizers: models produced by the learning algorithm which are in themselves malign agents (see Hubinger et al). In learning theory, possible models correspond to *hypotheses* in the hypothesis class of algorithm. If the AI's starting beliefs are sufficiently different from (it is epistemically misaligned), it might converge to a hypothesis that the user would not have endorsed. Such a hypothesis may lead to actions that are catastrophic from the user's point of view: for example, because the hypothesis encodes a malign agent. Two (not mutually exclusive) examples:

- Christiano's acausal attack, where the malign hypothesis is a certain
*simulation hypothesis*. - If the AI is a metacognitive agent that dogmatically believes running computations through the internal interface
*never*has side effects, it might end up running a simulation of a malign agent which will escape the "box". (This is a non-Cartesian daemon [LW · GW].)

Finally, notice that Criterion A is an approach to formalizing *alignment*, not formalizing corrigibility. Corrigibility seems to me a not especially natural concept, which is henceforth especially difficult to formalize and therefore also especially difficult to implement. For this reason, I am relatively pessimistic about approaches based on corrigibility (which PSI is not). That said, some weak versions of corrigibility might be feasible, e.g. the Hippocractic principle [LW(p) · GW(p)].

In the following subsections, I will give a rough outline of why I think that it might be possible to prove a formal alignment guarantee about PSI, along the lines of Criterion A.

### Axiological alignment of PSI

The axiological alignment of PSI relies on the ultimate success of the ADAM research direction (in conjunction with the necessary parts of the rest of the foundational programme). However, the fact PSI relies on physicalism makes the following problem especially acute: *should PSI use cartesian-ADAM (i.e. model the user as a cartesian agent) or physicalist-ADAM (i.e. model the user as a physicalist agent)?*

Currently I see the following possibilities:

- Humans should be regarded as physicalist agents, PSI should be based on physicalist-ADAM and the AI should be an IBP agent. This would require explaining why the monotonicity principle can be applicable to the "human reflective equilibrium". While I can imagine
*some*ways this can happen, they all require biting bullets. Alternatively, it might be possible to extend or overhaul the IBP framework, in a way that gets rid of monotonicity. - Humans should be regarded as cartesian agents and PSI should be based on cartesian-ADAM. Instead of an IBP agent, the AI should be a
*transcartesian*[LW(p) · GW(p)] agent: i.e. based on some framework for agents which assign cartesian privilege to a*different*agent (in this case the user). I can see a sketch of how this would be formalized (using similar mathematical building blocks as IBP), but this is very early stage at the moment. - Humans should be regarded as cartesian agents, PSI should be based on cartesian-ADAM but the AI should still be an IBP agent. The problem is, this requires translating the user's cartesian utility function into a physicalist utility function that PSI can optimize (e.g. using the techniques in section 3 [LW · GW] of the article about IBP). Because of the monotonicity principle, this would mean PSI is not really axiologically aligned: the user might place negative value on running certain computations which PSI will ignore. However,
*maa...aaybe*the difference is not significant, since PSI doesn't have strong incentives to instantiate those computations.

### Epistemic alignment of PSI

The case for the epistemic alignment of PSI relies on the following (informal) conjecture:

**Physicalist Agreement Conjecture (PhyAC):** *Two IBP agents in the same universe that start with similar priors, and share most of their evidence, will converge to similar posteriors*^{[32]}*.*

Notice that this is *not* the case for *cartesian* agents. This is because cartesian agents with *syntactically* similar priors have very different priors *semantically*: each prior refers to the respective agent's actions and observation. Each agent assumes cartesian privilege for *itself* but not for the other agent.

A strong demonstration for how it fails with cartesian agents is through simulation hypotheses: if Alice is much more influential than Bob, then Alice will assign a simulation hypothesis much more credence than Bob. Even if both Alice and Bob end up believing simulation hypotheses, they will often be *different* simulation hypothesis: each will postulate the kind of simulators that would have incentives to simulate that person's respective subjective viewpoint.

On the other hand, for IBP agents this argument doesn't work anymore. For example, suppose that universe A has simulators with incentives to simulate Alice's viewpoint. Since Alice observes Bob fairly closely (by assumption of PhyAC), simulating Alice's viewpoint requires also running a simulation of Bob. Moreover, since Bob is an IBP agent, and IBP is computationalist, this implies that Bob expects to experience this simulation (created for Alice's sake) as well. Hence, both of them assign similar credence to being in universe A.

In the case of PSI, a possible objection is, maybe the simulators only need a low fidelity simulation of the user to fool the AI, which the user wouldn't identify as themself. This is unclear (since this low-fidelity simulation has to be good enough to be indistinguishable for the AI from the null hypothesis that we're not in a simulation). However, if this *is* a concern, we can imagine a version of the agent detector that would also accept low fidelity simulations.

If the user is an IBP agent, PhyAC can be applicable to prove epistemic alignment of PSI. Is the user an IBP agent? This is unclear, but it's essentially the same debate as in the previous subsection.

Another concern is, even assuming that the user is an IBP agent, how similar is its prior to whatever prior we put into PSI? Presumably, the difference should be comparable to using Solomonoff induction with different UTMs, but hopefully not *too* different (i.e. the associated bisimulation is not too complex). As long as we are in a learnable setting, this doesn't matter. In reality, there are traps, so it *does* matter: but perhaps not critically. Alternatively, we might need to modify the specification of PSI to make stronger use of the inferred user prior in order to close this gap.

## A Plan

Assuming PSI *is* the correct approach to aligning AI, how would we get from here to a practical implementation? I imagine the plan roughly as follows:

- Develop the foundational programme of LTA, until we have fairly good solutions to Problems 1-5 and maybe to other problems that will crystallize in the process.
- Continue fleshing out and improving PSI. I see some of it happening in parallel with step 1, but for the most part this is at best second priority until step 1 is much more advanced. In the process, I expect our model of PSI to evolve through roughly the following phases:
**Informal model:**The model is*inspired*by mathematical ideas but is not fully rigorous yet. This where we are now. It gradually acquires more details and precision, until we are ready to formalize at least a part of it.**Toy model:**The model is rigorous but makes simplifying assumptions that are completely unrealistic. We study the toy model's formal properties (including alignment), while gradually moving to more realistic assumptions. One axis along which this evolution can happen is computational resource constraints. For example, it can be the following progression:- Require only computability or not even that.
- Require polynomial space in hypothesis description length. Alternatively, require polynomial time in number of hypothesis-states.
- Require polynomial time in hypothesis description length.

**(Semi-)Realistic model:**The assumptions are sufficiently realistic that we could conceivably implement the model at least in some utopian scenario (e.g. if all capability research stopped and we had 100 years to bootstrap the AI). We continue moving to more realistic assumptions, while also gaining semiformal understanding about which properties of the model are crucial to alignment and which aren't.

- Study those questions outside of theoretical computer science the answers to which are needed as inputs to the AI's design or as parameters to the alignment guarantees. This might be possible in parallel to step 2 or at least to the late phases of step 2.
- Build a real-life implementation of PSI. This might be anywhere on the spectrum from (on one end) implementing an algorithm with formal guarantees and formally verifying the guarantees on the production code to (on the other end) implementing an algorithm which is only loosely based on the formal model, while using the understanding we gained from the theory to convincingly argue that the differences between the model and the implementation are immaterial. The algorithm might or might not bear strong resemblance to deep learning, depending primarily on the results of research into Problem 1. In any case, it will involve much engineering work and security mindset to make sure that the assumptions of the alignment guarantee actually apply.

# Summary

In this article, I tried to convey the following points:

- LTA sets out to create a general mathematical theory of intelligent agents. It has a concrete, actionable plan for achieving this goal.
- LTA also has an end-to-end proposal for solving alignment (PSI), although at this stage it is still fairly speculative. Nevertheless, this proposal brings some unique advantages.
- LTA has a substantial number of fairly concrete, shovel-ready research directions. There is substantial room for accelerating LTA by recruiting researchers to attack many of those directions in parallel.

There is a lot of work and perhaps not that much time. I hope that this article will inspire more researchers to joint the effort. See you all in 2028^{[33]}!

^{^}listed alphabetically by last name

^{^}By "foundational part" I mean the theory of intelligent agents qua intelligent agents, as opposed to the "applied part" where we use this theory to study alignment per se.

^{^}Sometimes we are content with "any except for a set of prior probability 0", such as when we only bound the Bayesian regret in a Bayesian online/bandit/reinforcement learning setting.

^{^}See e.g. Lattimore and Szepesvari for an introduction to regret bounds, in particular section 38 talks about reinforcement learning.

^{^}Here, we assume that the first observation comes before of the first action, in contrast to my usual convention, because this time it's more convenient.

^{^}The diameter is the maximal expected time to reach one state from another state. The bound has to scale with since as , a trap can develop. Alternative parameterizations are also possible, such as using mixing time or bias span. The latter might be advantageous since bounding the diameter also caps the number of states at , whereas bounding bias span allows an infinite number of states. On the other hand, the bias span depends on the reward function.

^{^}Alternatively, we can consider geometric time discount , in which case is replaced by .

^{^}The existence of a computationally feasible optimal policy is usually a much stronger condition than the computational feasibility of simulation: for example finding an approximately optimal policy for a POMDP is PSPACE-hard while simulating it in P. Of course there are degenerate cases in which the optimal policy is easy even though simulation is hard.

^{^}See e.g. Shalev-Shwartz and Ben-David.

^{^}The name comes from Kalai and Lehrer.

^{^}I haven't properly studied the prior work relating automata to category theory (which definitely exists) so I make no strong claims to originality here.

^{^}The paper uses a theorem stated without proof, for which the citation given is "Lempp, Miller, Ng and Turetsky, 2010, unpublished, private communication". However, Lattimore kindly provided me with this unpulished communication upon request, and, to the best of my judgement, the proof is valid.

^{^}Infra-Bayesian laws were originally called "belief functions" in the infra-Bayesianism sequenece.

^{^}I haven't done a sufficiently thorough literature survey, so there might already be such results.

^{^}^{^}While running

*most*programs doesn't have irreversible harmful side-effects, but it probably doesn't hold for*all*programs: for example we can imagine code that uses some hardware exploit. In particular, this is related to what I previous called non-cartesian daemons [LW · GW]. It should be possible to get guarantees that take this into account by e.g. somewhat randomizing the precise code we run each time (it might be related to quantilization).^{^}See e.g. Kearns and Vazirani chapter 8, Higuera 2004 and Higuera 2010.

^{^}As a consequence, this section carries an especially high risk of missing some pertinent known results that I'm unfamiliar with.

^{^}See e.g. Vereshchagin and Shen.

^{^}The convention I'm using here doesn't normalize the sum of payoffs over time, i.e.

and the probability of playing a strategy is proportional to .

^{^}It would be more natural to say that the hidden environment is an equivalence class of such objects, where two are considered equivalent if it is not possible to distinguish them in terms of actions and states in . However, the distinction is not critical for the exposition.

^{^}See e.g. chapter 37 in Lattimore and Szepesvari.

^{^}Or maybe just TMs, that would still allow making sense of as the lower semicomputable environment computed by .

^{^}If we want to infer the utility function and the prior simultaneously, we probably need to take into account that some ways to redefine both of them together produce an equivalent decision problem. Hence, it makes sense to focus on recovering their

*product*. Formally, any environment corresponds to a measure on s.t. for any , the distribution induced by and is equal to , where is the function that's 1 on sequences compatible with and 0 otherwise. We can then define the*utility measure*to be the product .^{^}This doesn't seem to automatically follow from the assumption . However, it might be possible to show that there is always at least an uncomputable utility function that can be computed with a slow-growing amount of advice.

^{^}There might still be alignment overhead. Specifically, (i) the unusual structure of the loss function, (ii) prior shaping to deal with potential traps and (iii) prior shaping to deal with potential side-effects of computations, might incur overhead. We need to return to this question when we have better models. Maybe the overhead is small, or maybe we can prove that significant overhead is unavoidable.

^{^}Technically, the mathematical analysis will probably need to focus on the asymptotic in which the agents can run all computations, but the original and the imitator can converge to this limit with different speeds.

^{^}from Greek: anthropos (human) + sinepis (coherent, according to ChatGPT)

^{^}Speculatively, we might not even have to do that: the AI itself can facilitate superrational cooperation between all agents who could affect the creation of this or similar AI.

^{^}Input is a special case of instantiation: saying that program runs with input is equivalent to saying that the computation is instantiated.

^{^}We would have to be careful to set the ADAM threshold correctly and make sure the room indeed doesn't contain other agents. Otherwise, we are risking the AI version of "The Fly".

^{^}In some "behaviorist" sense. IBP doesn't really have an update rule, and therefore there is no well-defined "posterior" in general.

^{^}I hope...

## 13 comments

Comments sorted by top scores.

## comment by Marcus Ogren · 2023-04-19T08:22:09.179Z · LW(p) · GW(p)

(Disclosure: Vanessa is my wife.)

I want to share my thoughts on how the LTA can have a large impact. I think the main plan - to understand agency and intelligence fully enough to construct a provably aligned AI (perhaps modulo a few reasonable assumptions about the real world) - is a good plan. It’s how a competent civilization would go about solving the alignment problem, and a non-negligible chunk of the expected impact of the LTA comes from it working about as planned. But there are also plenty of other, less glamorous ways for it to make a big difference as well.

The LTA gives us tools to think about AI better. Even without the greater edifice of LTA and without a concentrated effort to complete the LTA and build an aligned AI in accordance with it, it can yield insights that help other alignment researchers. The LTA can identify possible problems that need to be solved and currently-unknown pitfalls that could make an AI unsafe (along the lines of the problem of privilege and acausal attack). It can also produce “tools” for solving certain aspects of the alignment problem that could be applied in an ad-hoc manner (such as the individual components of PSI). While this is decidedly inferior to creating a provably aligned AI, it is also far more likely to happen.

As for PSI, I think it’s a promising plan for creating an aligned AI in and of itself; it doesn’t appear to require greatly reduced capabilities and gives the AI an unhackable pointer to human values. But its main significance is as a proof of concept: the LTA has delivered this alignment proposal, and the LTA isn’t even close to being finished. My best guess is that, given enough time, some variant of PSI could be created as a provably aligned AI (modulo assumptions about human psychology, etc.). But I also expect better ideas in the future. PSI demonstrates that considering the fundamental questions of agency can lead to novel and elegant solutions to the alignment problem. Before Vanessa came up with PSI, I thought the main value of her research lay in solving weird-sounding problems (like acausal attack) that originally sounded more like an AI being stupid than like the AI being misaligned. PSI shows that the LTA is much, much more than this.

## comment by harfe · 2024-01-09T16:21:45.645Z · LW(p) · GW(p)

Regarding direction 17: There might be some potential drawbacks to ADAM. I think its possible that some very agentic programs have relatively low score. This is due to explicit optimization algorithms being low complexity.

(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix for these considerations too.) Consider an utility function . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for ) and has an approximation parameter n (this could be AIXI-tl, plus some approximation of ; higher is better approximation). We will call this approximation of the optimal policy . This approximation algorithm has complexity , where is a constant needed to describe the general algorithm (this should not be too large).

We can get better approximation by using a quickly growing function, such as the Ackermann function with . Then we have .

What is the score of this policy? We have . Let be maximal in this expression. If , then .

For the other case, let us assume that if , the policy is at least as good at maximizing than . Then, we have .

I don't think that the assumption ( maximizes better than ) is true for all and , but plausibly we can select such that this is the case (exceptions, if they exist, would be a bit weird, and if ADAM working well due to these weird exceptions feels a bit disappointing to me). A thing that is not captured by approximations such as AIXI-tl are programs that halt but have insane runtime (longer than ). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.

To summarize, maybe there exist policies which strongly optimize a non-trivial utility function with approximation parameter , but where is relatively small.

Replies from: vanessa-kosoy## ↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2024-01-10T10:48:20.248Z · LW(p) · GW(p)

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes *computational complexity* into account instead of . For example, you can look at the measure I defined here [LW(p) · GW(p)]. More realistically, this measure should be based on the frugal universal prior.

## comment by Frank_R · 2023-04-23T07:07:42.380Z · LW(p) · GW(p)

I have a question about the conjecture at the end of Direction 17.5. Let be a utility function with values in and let be a strictly monotonous function. Then and have the same maxima. can be non-linear, e.g. . Therefore, I wonder if the condition should be weaker.

Moreover, I ask myself if it is possible to modify by a small amount at a place far away from the optimal policy such that is still optimal for the modified utility function. This would weaken the statement about the uniqueness of the utility function even more. Think of an AI playing Go. If a weird position on the board has the utility -1.01 instead of -1, this should not change the winning strategy. I have to go through all of the definitions to see if I can actually produce a more mathematical example. Nevertheless, you may have a quick opinion if this could happen.

Replies from: vanessa-kosoy## ↑ comment by Vanessa Kosoy (vanessa-kosoy) · 2023-04-23T09:12:23.313Z · LW(p) · GW(p)

I have a question about the conjecture at the end of Direction 17.5. Let be a utility function with values in and let be a strictly monotonous function. Then and have the same maxima. can be non-linear, e.g. . Therefore, I wonder if the condition should be weaker.

No, because it changes the *expected value* of the utility function under various distributions.

Moreover, I ask myself if it is possible to modify by a small amount at a place far away from the optimal policy such that is still optimal for the modified utility function.

Good catch, the conjecture as stated is obviously false. Because, we can e.g. take to be the same as everywhere except after some action which doesn't actually take, in which case make it identically 0. Some possible ways to fix it:

- Require the utility function to be of the form (i.e. not depend on actions).
- Use (strictly) instrumental reward functions.
- Weaken the conclusion so that we're only comparing and on-policy (but this might be insufficient for superimitation).
- Require to be optimal off-policy (but it's unclear how can this generalize to finite ).

## ↑ comment by harfe · 2023-07-10T15:08:04.140Z · LW(p) · GW(p)

I think the conjecture is also false in the case that utility functions map from to .

Let us consider the case of and . We use , where is the largest integer such that starts with (and ). As for , we use , where is the largest integer such that starts with (and ). Both and are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy that always picks action is the optimal policy for both and (see proof below).

Note that since the policy is computable and very simple, is not true, and we have instead. I suspect that the issues are still present even with an additional condition, but finding a concrete example with an uncomputable policy is challenging.

**proof**:
Suppose that and are locally equivalent. Let be an open neighborhood of the point and , be such that for all .

Since , we have . Because is an open neighborhood of , there is an integer such that for all . For such , we have This implies . However, this is not possible for all . Thus, our assumption that and are locally equivalent was wrong.

*Assumptions about the solomonoff prior*:
For all , the sequence of actions that produces the sequence of with the highest probability is (recall that we start with observations in this setting).
With this assumption, it can be seen that the policy that always picks action is among the best policies for both and .

*I think this is actually a natural behaviour for a reasonable Solomonoff prior*:
It is natural to expect that is more likely than . It is natural to expect that the sequence of actions that leads to over has low complexity. Always picking is low complexity.

*It is possible to construct an artificial UTM that ensures that
"always take " is the best policy for , *:
An UTM can be constructed such that the corresponding Solomonoff prior assigns 3/4 probability to the program/environment "start with o_1. after action a_i, output o_i". The rest of the probability mass gets distributed according to some other more natural UTM.

Then, for , in each situation with history the optimal policy has to pick (the actions outside of this history have no impact on the utility): With 3/4 probability it will get utility of at least . And with probability at least . Whereas, for the choice of , with probability it will have utility of , and with probability it can get at most . We calculate , ie. taking action is the better choice.

Similarly, for , the optimal policy has to pick too in each situation with history . Here, the calculation looks like .

## comment by David Matolcsi (matolcsid) · 2023-04-20T12:08:26.733Z · LW(p) · GW(p)

Thanks for Vanessa for writing this, I find it a useful summary of the goals and directions of LTA, which was sorely missing until now. Readers might also be interested in my write-up A mostly critical review of infra-Bayesianism [LW · GW] that tries to give a more detailed explanation about a subset of the questions above, and how much progress there was towards their solutions so far. I also give my thoughts and criticism of Infra-Bayesian Physicalism, the theory on which PSI rests.

I will also edit the post to include a link to this post. So far, I advised people to read Embedded agency [LW · GW] for the motivating questions, but now I can recommend this post too.

## comment by RogerDearnaley (roger-d-1) · 2023-05-29T04:27:14.497Z · LW(p) · GW(p)

## Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

- (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
- (Subproblem 2.1) It's not clear how to to talk about learnability and learning rates.
It makes some sense to consider these problems together, but different direction emphasize different sides.

Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it's hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.

As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap ("nuclear war could well wipe out the human race, let's not do that"). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory's support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.

An evolved primate heuristic for this is "if nervous of an unidentified object, poke it with a stick and see what happens". This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.

The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don't require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn't would be uninhabitable and life wouldn't evolve in them.

## comment by Frank_R · 2023-05-17T08:05:21.460Z · LW(p) · GW(p)

I have discovered another minor point. You have written at the beginning of Direction 17 that any computable utility function is automatically continuous. This seems to be not always true.

I fix some definitions to make sure that we talk about the same stuff. For reasons of simplicity, I assume that and are finite. Let be the space of all infinite sequences with values in . The -th projection is given by

The product topology is defined as the coarsest topology such that all projection maps are continuous. A base of this topology consists of all sets such that there are finitely many indices and subsets with

In particular, any open set contains such an as a subset, which means that its image under is for all but finitely many . For my counterexample, let . Let be a sequence with values in . If is never 2, we define

Otherwise, we define