The Learning-Theoretic Agenda: Status 2023

post by Vanessa Kosoy (vanessa-kosoy) · 2023-04-19T05:21:29.177Z · LW · GW · 17 comments

Contents

  Preamble
  Philosophy
  Key Problems
    Problem 1: Computational Resource Constraints
    Problem 2: Frequentist Guarantees
      Subproblem 2.1: Traps
      Subproblem 2.2: Password guessing games
      Subproblem 2.3: Nonrealizability
    Problem 3: Value Ontology
    Problem 4: Cartesian Privilege
    Problem 5: Descriptive Agent Theory
    Nonproblem: Expected Utility
  Research Directions
    Subproblem 1.1: Frugal Universal (Infra-)Prior
      Direction 1: Frugal Compositional Languages
        Direction 1.1: Compositional Polytope-MDPs
        Direction 1.2: String Machines (for RL)
        Direction 1.3: Infra-Bayesian Logic
      Direction 2: Deep Learning Theory
    Subproblem 2.3: Nonrealizability
      Direction 3: Infra-Bayesian RL
      Direction 4: Exploiting properties of simplicity priors
      Direction 5: Agnostic RL
    Subproblem 1.2/2.1: Traps
      Direction 6: Metacognitive Agents
        Direction 6.1: Bayesian planning via query-based learning of computations
        Direction 6.2: Expressiveness of string machines
        Direction 6.3: Query learning of string machines
      Direction 7: Approximation algorithms for Bayesian planning
      Direction 8: Expanding Safety Envelope
    Better regret bounds for simple priors
      Direction 9: Unifilar MDPs
      Direction 10: Foreground MDPs
      Direction 11: Canonical RL Dimension
    Direction 12: Epistemic Regret Bounds
    Subproblem 2.4: Multi-Agent Guarantees
      Direction 13: Population Games
      Direction 14: Logit equilibria of finite-state repeated games
      Direction 15: Infra-Bayesian Veil of Ignorance
        Direction 15.1: IBRL with exact clones
        Direction 15.2: IBRL with inexact clones
    Direction 16: Hidden Rewards
      Direction 16.1: Agency with partial monitoring
      Direction 16.2: Specifying semi-instrumental reward functions
      Direction 16.3: Undogmatic Ontologies
      Direction 16.3: Bayes-optimal planning for communicating RUMDPs
    Direction 17: Algorithmic Descriptive Agency Measure (ADAM)
      Direction 17.1: Comparing ADAM variants
      Direction 17.2: ADAM of random policies
      Direction 17.3: ADAM Hierarchy Theorem
      Direction 17.4: ADAM for finite-state policies
      Direction 17.5: Inferring the utility function
    Direction 18: Infra-Bayesian Physicalism
  Physicalist Superimitation
    Motivation
    Superimitation
      Applicable to humans
    Agent Detection
    User Identification
    Full Specification
    Alignment Guarantee
      Formal Alignment Criterion
      Axiological alignment of PSI
      Epistemic alignment of PSI
    A Plan
  Summary
None
17 comments

TLDR: I give an overview of (i) the key problems that the learning-theoretic AI alignment research agenda is trying to solve, (ii) the insights we have so far about these problems and (iii) the research directions I know for attacking these problems. I also describe "Physicalist Superimitation" (previously knows as "PreDCA"): a hypothesized alignment protocol based on infra-Bayesian physicalism that is designed to reliably learn and act on the user's values.

I wish to thank Steven Byrnes, Abram Demski, Daniel Filan, Felix Harder, Alexander Gietelink Oldenziel, John S. Wentworth[1] and my spouse Marcus Ogren for reviewing a draft of this article, finding errors and making helpful comments and suggestions. Any remaining flaws are entirely my fault. 

Preamble

I already described [LW · GW] the learning-theoretic agenda (henceforth LTA) in 2018. While the overall spirit stayed similar, naturally LTA has evolved since then, with new results and research directions, and slightly different priorities. This calls for an updated exposition.

There are several reasons why I decided that writing this article is especially important at this point:

I am fairly optimistic that LTA leads to a solution to alignment. On the other hand, it's much harder to say whether the solution will arrive in time. I think that the rate of progress relative to person-hours invested was pretty good so far, and we didn't hit any obstacle that cast doubt on the entire endeavor. At the same time, in absolute terms we still have most of the work in front of us. In the coming years, I am planning to put considerable effort into scaling up: getting more researchers on board, and hopefully accelerating progress by a significant factor.

Philosophy

The philosophy of LTA was covered [LW(p) · GW(p)] in the original article, I mostly stand behind what I wrote there and don't want to repeat it. The goal of this section is merely to recap and add a couple of points.

The goal of LTA is creating a general mathematical theory of intelligent agents. There are a number of reasons why this is necessary for AI alignment:

After I wrote the original article about LTA, more was written about the feasibility and importance of mathematics for alignment, both by [LW(p) · GW(p)] me [LW · GW] and by others [LW · GW].

Why is LTA concerned specifically with agents? Aren't there AI systems which are not agents? The reason is: the sort of risks I want to address are risks that arise from AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue unaligned goals, with catastrophic consequences), and the sort of solution I envision also relies on AIs displaying agentic behavior (i.e. building sophisticated models of the world and using these models to construct plans that pursue an aligned goal, s.t. the result is protecting humanity from unaligned AI).

Finally, I want to address a common point of confusion. Sometimes I tell someone about subproblems in LTA and they respond by trying to understand why each individual subproblem (e.g. nonrealizability, or cartesian privilege, or value ontology) is relevant to alignment. Often there are some specific ways in which the subproblem can be connected to alignment. But the more important point is that any fundamental question about intelligent agency is relevant to alignment, just because without answering this question we cannot understand intelligent agency. For any aspect of intelligent agency that we don't understand and any AI that we design, one of the two will hold: either the AI lacks this aspect, in which case it is probably insufficient to protect humanity, or the AI has this aspect but we don't understand why or how, in which case it is probably unaligned (because alignment is a small target, and significant unaccounted for phenomena are likely to make you miss).

Key Problems

The starting point of LTA is Marcus Hutter's AIXI: a simplistic model of ideal agency.

Intuitively, an "intelligent agent" is a system that builds sophisticated models of the world and uses them to construct and execute plans that lead to particular goals (see also Yudkwosky [LW · GW]). Building models implies starting from a state of ignorance and updating on observations, which can be formalized by Bayesian probability theory. Building sophisticated models requires using Occam's razor, which can be formalized by the Solomonoff prior. Constructing and executing plans can be formalized as maximizing expected utility. Putting all these ingredients together gives us AIXI.

However, there are both important gaps both in our understanding of AIXI-like agents, and significant weaknesses in the AIXI framework itself. Solving these problems is the natural framing for the entire foundational[2] part of LTA.

Problem 1: Computational Resource Constraints

AIXI is uncomputable. Obviously, real-world agents are not only computable but operate under strong computational resource constraints. This doesn't mean AIXI is not a useful toy model for studying the interaction of Occam's razor with Bayesian optimization or reinforcement learning. However, it is a major limitation.

One obvious attempted solution is to impose some computational resource constraints on the programs that appear in Solomonoff's prior, for example as in Schmiduber's speed prior or in some other way. This typically leads to agents that are computable but still computationally intractable. Relatedly, a recent line of research connects the hardness of time-bounded Kolmogorov complexity with the existence of one-way functions.

On the other hand, we know some priors for which asymptotic Bayes-optimality is possible in polynomial time, for example priors supported on Markov decision processes (MDPs) with a small (i.e. polynomial in the security parameter) number of states. Some stronger feasible models are also known, e.g. MDPs with linear features or kernel-MDPs. However, all such priors have significant shortcomings:

Problem 2: Frequentist Guarantees

AIXI is Bayes-optimal (by definition) but is not necessarily optimal in any sense for any particular environment. In contrast, in statistical learning theory we usually demand a frequentist guarantee, i.e. that a learning algorithm converges to optimality (in some sense) for any[3] data source (subject to some assumptions which depend on the setting). Such guarantees are important for several reasons:

Moreover, in addition to single-agent frequentist guarantees it is desirable to derive multi-agent frequentist guarantees. Interactions between agents are quite important in the real-world, and have significance specifically for AI alignment as well (AI-human, AI-other AI, AI-acausal trade partner, to some extent human-human too). A multi-agent frequentist guarantee can take the form of convergence to a particular game-theoretic solution concept, or an asymptotic lower bound on expected utilities corresponding to some notion of game value.

There are multiple difficulties in deriving frequentist guarantees for AIXI-like agents. The first two difficulties ("traps" and "password guessing games" below) are not problems with the agent, but rather with the way conventional frequentist guarantees are defined (which are nonetheless non-trivial to solve). The third difficulty (nonrealizability) might be a problem with the way the agent is defined.

Subproblem 2.1: Traps

The most common type of frequentist guarantee in reinforcement learning (RL) is regret bounds[4]. However, AIXI doesn't satisfy any regret bound w.r.t. the underlying hypothesis class (i.e. the class of all computable environments) because these hypotheses involve irreversible transitions (traps). For example, suppose that in environment 1 taking action A sends you to a sink state with reward , whereas taking action B sends you to a sink state with reward , and in environment 2 the opposite is true. Obviously a hypothesis class which contains both environments doesn't admit a regret bound.

This problem is especially acute in the multi-agent setting, because other agents have memory. Either it's impossible to erase memory in which case the environment is irreversible by design, or it's possible to erase memory which breaks conventional learning theory in other ways.

Subproblem 2.2: Password guessing games

What if we arbitrarily rule out traps? Consider an agent with action set  and observation set . Suppose that the reward is a function of the last observation only . We can specify an environment[5]  by providing a communicating MDP with state set  and action set  augmented by representation mappings

Here,  is required to be an onto mapping from  to  for any . Also, we require that for any  and 

However, this only enables a fairly weak regret bound. Naively, it seems reasonable to expect a regret bound of the form  where  is the Kolmogorov complexity of the MDP+representation,  is the diameter[6] and  is the time horizon[7], and  are constants s.t. . Indeed,  can be regarded as the amount of information that needs to be learned and e.g. Russo and Van Roy showed that in a bandit setting regret scales as the square root of the entropy of the prior (which also expresses the amount of information to be learned). However, this is impossible, as can be seen from the following example.

Fix  and  (the "password"), let the action set be  and the state space be . We think of a state  as "the agent entered the string of bits  into the terminal". The initial state is  (the empty string). The transition kernel works as follows:

In state , taking action  produces the state  (i.e. the agent enters the digit ). In state , taking action  produces the state  if  (the agent entered the correct password) or state  (the agent entered an incorrect password and has to start from the beginning). In state  taking action  stays in state  and taking action  produces state  (restarts the game).

All states have reward  except for the state  which has reward .

Obviously the agent needs time  to learn the correct password, in expectation w.r.t. the uniform distribution over passwords. At the same time,  and . This is incompatible with having a regret bound of the desired form.

Subproblem 2.3: Nonrealizability

As we discussed in Problem 1, a realistic agent cannot be Bayes-optimal for a prior over all computable environments. Typically, each hypothesis in the prior has to admit a computationally feasible approximately optimal policy in order for it to be feasible to approximate Bayes-optimality (in particular this has to be the case if the prior is learnable), which usually implies the hypotheses are computationally feasible to simulate[8]. However, in this situation we can no longer assume the true environment is within the hypothesis class. Even for AIXI itself, it is not really fair to assume the environment is computable since that would exclude environments that contain e.g. other AIXIs.

A setting in which the true environment is not in the hypothesis class is known in learning theory as "nonrealizable". For offline (classification/regression) and online learning, there is a rich theory of nonrealizable learning[9]. However, for RL (which, among classical learning theory frameworks, is the most relevant for agents), the nonrealizable setting is much less understood (although there are some results, e.g. Zanette et al).

Therefore, even if we arbitrarily assume away traps, and are willing to accept a factor of  in the regret bound, a satisfactory theory of frequentist guarantees would require good understanding of nonrealizable RL.

This problem is also especially acute in the multi-agent case, because it's rarely the case that each agent is a realizable environment from the perspective of the other agent (roughly speaking, if Alice would simulate Bob simulating Alice, Alice would enter an infinite loop). This is known as the "grain of truth" problem[10]. One attempt to solve the problem is by Leike, Taylor and Fallenstein, using agents that are equipped with "reflective oracles". However, there are many different reflective oracles, and the solution relies on all agents in the system to use the same one, i.e. that the design of the agents has been essentially centrally coordinated to make them compatible.

Problem 3: Value Ontology

AIXI's utility function depends on its direct observations. This is also assumed in RL theory. While this might be a legitimate type of agent, it is insufficiently general. We can easily imagine agents that place value on parameters they don't directly observe (e.g. the number of paperclips in the observable universe, or the number of people who suffer from malaria). Arguably, humans are such agents.

The obvious workaround is to translate the utility function from its original domain to observations by taking a conditional expectation. That is, suppose the utility function is  for some space , let  be the space of everything that is directly observed (e.g. action-observation time sequences) and suppose we have some prior over . Then, we can define  by

However, there are problems with this:

Therefore, it is desirable to have a theory in which it is possible to explicitly talk about utility functions with domains other than observations. (See also de Blanc.)

Problem 4: Cartesian Privilege

The formalization of Occam's razor in AIXI relates to the description complexity of hypotheses represented in terms of the agent's actions  and observations  (as functions ). However, this is not how Occam's razor is used in science. When we talk about a "simple" or "elegant" scientific theory, we refer to the equations governing the fundamental degrees of freedom on that theory (e.g. particles, or fields or strings), not to the equations that would be needed to describe the RGB values of points on a person's retina. Indeed, expressing physics via the latter equations would be horrifically complicated.

In other words, AIXI-like agents believe they occupy a privileged position in the universe, which has various pathological consequences. See "infra-Bayesian physicalism [LW · GW]" for more discussion and Demski and Garrabrant for related prior work.

Problem 5: Descriptive Agent Theory

The framing of AIXI is: given certain subjective choices (e.g. universal Turing machine and utility function), what is an ideal agent? However, in the real-world agents are not ideal. Moreover, we want to be able to understand which systems are agents, and what kind of agents they are, without already assuming e.g. a specific utility function. In particular, such a theory would have applications to:

Legg and Hutter defined an intelligence measure that for every policy produces a number determining its intelligence, s.t. AIXI has maximal intelligence. The measure is essentially the policy's expected utility w.r.t. the Solomonoff prior. This allows us to talk how close any given system is to an ideal agent. However, it suffers from two major limitations (in addition to the other problems of AIXI discussed before):

Moreover, naive attempts to ascribe a utility function to a policy run into difficulties. Specifically, any policy can arguably be ascribed a utility function which rewards following precisely this policy. With respect to such a utility function, the policy is Bayes-optimal for any prior. Armstrong and Mindermann have argued on the basis of this type of reasoning that ascribing a utility function is essentially impossible in a purely behaviorist framework (but I argue the opposite, see Direction 17.5 [LW · GW] below).

Nonproblem: Expected Utility

Finally, I want to briefly discuss one issue which I don't consider a key problem. Namely, the reliance of AIXI on expected utility maximization.

Expected utility is often justified by coherence theorems such as von Neumann–Morgenstern (VNM) and Savage. These theorems are indeed important: they show that a small number of natural assumptions produce a fairly narrow mathematical object (expected utility). This should indeed increase our credence in expected utility as a useful model. Often, objections are put forward that argue with individual assumptions. IMO these objections miss the point: a coherence theorem is a piece of evidence, not a completely watertight proof that expected utility is philosophically correct. And, if expected utility is philosophically correct, such a watertight proof is still not something we should expect to exist (what would it even look like?)

The more important justification of expected utility is the large body of work it supports: control theory, game theory, reinforcement learning theory etc. The methodology I believe in is, start with plausible assumptions and try to build a theory. If the theory that results is rich, has synergy with other bodies of knowledge, has explanatory power, is useful in practice, each of these is evidence that it's on the right track. (And if we failed to build such a theory, we probably also learned something.)

Another objection that is often raised is "humans have no utility function". This is ostensibly supported by research in behavioral economics which shows that humans behave irrationally. I consider this deeply unconvincing. For one thing, I suspect that a large part of that research doesn't replicate. But even putting that aside, the interesting thing about irrational behavior is that we recognize it as irrational: i.e. learning about it makes you behave differently (unless it's so minor that it isn't worth the effort). This already indicates that this irrationality is better viewed not as a fundamental fact about human preferences, but as some combination of:

All of this is not to say that expected utility cannot be questioned. Rather that a productive objection would be to either come up with a demonstrably superior alternative, or at least a good argument why some extremely concrete problem cannot be solved without an alternative to expected utility. Lacking that, the best strategy is to press forward unless and until we encounter such a problem.

Now, we actually do have examples where deviating from purist VNM in specific ways is useful, e.g.:

However, these examples came about from studying various concrete problems, not from arguing with the VNM model per se. Moreover, all of these examples can still be mathematically recast in VNM form: infra-Bayesianism can be regarded as a VNM zero-sum game (against "Murphy"), quantilization and Nash bargaining can be regarded as special cases of infra-Bayesianism (as will be discussed in an upcoming article [LW · GW] by Appel). Therefore, even here the theory built around the VNM model remains useful.

Research Directions

This section focuses entirely on the foundational part of the programme. I will mention some of the applied part in the next section (although different applications are also possible, e.g. studying quantilization or IDA), but for the most part I believe the foundational part to be the top priority.

In the following, I roughly grouped the research directions by the problems they are trying to address. However, the real relationship between directions and problems is many-to-many: a single direction can have implications on many problems. Moreover, there are many connections between the different directions. And, even for directions that are not especially connected at present, if they are successful, we will be faced with the task of merging them into a unified theory.

Subproblem 1.1: Frugal Universal (Infra-)Prior

One part of solving Problem 1 (computational resource constraints) is finding a prior (more precisely a family of priors) with the following properties:

More generally, we might want an infra-prior (see Direction 3 [LW · GW] below), or a metacognitive (infra-)prior (see Direction 6 [LW · GW] below), or a physicalist infra-prior (see Direction 18 [LW · GW] below) with analogous properties.

Direction 1: Frugal Compositional Languages

The time complexity of finding the optimal policy for a generic (tabular) MDP scales with its number of states. The same is true of the sample complexity of learning an unknown generic MDP. However, the number of states in a sophisticated model of the real world has to be enormous. For example, if I reason about the world as it's comprised of  objects with  possible states each, the over number of states is already .

Therefore, any realistic RL algorithm has to exploit some structure in its environment. For example, the environment might be comprised of spatially separated parts, or different processes happening on different spatial scales, or different processing happening on different temporal scales. We want to find an appropriate compositional language for describing such structure.

The compositional language used by AIXI is that of programs for some fixed UTM. This language is very powerful, but also unfeasible to work with, because programs can consume arbitrarily large amounts of computational resources or even fail to halt altogether. Imposing ad hoc computational resource bounds also doesn't work well, perhaps because such bounds are extraneous to the compositional properties of the language.

Arguably we need some new "frugal" compositional language. Here are the candidate starting points I currently know. It is possible that the right answer will involve combining several of them.

Direction 1.1: Compositional Polytope-MDPs

Compositional polytope-MDPs [LW(p) · GW(p)] is a language with two operations: union of asynchronous noninteracting parts and time hierarchy. Importantly, the optimal policy is computable in time polynomial in description length. The time hierarchy operation also offers a natural formalization for the notion of "instrumental values".

Further research might involve:

Direction 1.2: String Machines (for RL)

String machines [LW · GW] is a category-theoretic framework that allows constructing automata and transducers from string diagrams[11] (that can be probabilistic). In particular, it can be used to describe MDPs. In fact, there is a version of it that allows constructing arbitrary Turing machines, but there are natural ways to "throttle" to expressiveness of the language.

Further research might involve:

Direction 1.3: Infra-Bayesian Logic

Infra-Bayesian logic [LW(p) · GW(p)] is a sort of higher-order logic where the semantics has infradistributions instead of sets corresponding to predicates, and can be viewed as a synthesis of logic and probability theory. In particular, it is fairly natural for describing infra-(PO)MDPs because the main part of the latter is the transition infrakernel.

However, the semantics are not fully defined because of some technical difficulties. There is also a first-order version which is fully defined but seems less natural for describing RL hypotheses. Finally, there are hints that propositional infra-Bayesian logic might be more computationally tractable than classical propositional logic, but it's as yet unclear [LW · GW].

Direction 2: Deep Learning Theory

It seems plausible that deep learning algorithms (e.g. transformers) are already learning algorithms for the frugal universal prior. Therefore, building a rigorous theory of deep learning (in particular, its inductive biases and generalization properties) is another potential path to understanding this prior and its properties.

While this is a relevant research direction, I don't prioritize it for two reasons:

Nevertheless, it is a type of research that has implications on LTA and alignment researchers can reasonably consider to participate in it, if they feel especially well positioned for it and have the discipline and judgement required to self-censor if necessary.

Subproblem 2.3: Nonrealizability

The research directions below are different ways to get frequentist guarantees in the nonrealizable setting, but they are not necessarily mutually incompatible. In addition to exploring each approach separately, it can be interesting to study various combinations.

Direction 3: Infra-Bayesian RL

Infra-Bayesian RL [LW · GW] (IBRL) is a way to combine reinforcement learning theory with imprecise probabilities which produces a specific kind of frequentist guarantee in the nonrealizable setting (infra-Bayesian regret). A rough summary of the results we got in this framework so far:

In particular, the result about Newcombian environments and also infra-Bayesian physicalism (see Direction 18 [LW · GW] below) are two products of infra-Bayesianism that I don't see, at present, how to reproduce with any of the alternative approaches to nonrealizability.

Directions for further research include:

Direction 4: Exploiting properties of simplicity priors

In general, Bayesian inference has only fairly weak guarantees in the nonrealizable setting. However, simplicity priors might produce stronger guarantees. In particular, Lattimore, Hutter and Gavane (LHG) showed[12] that (a specific version of) Solomonoff induction converges to predicting a deterministic computable subsequence even when the full sequence is uncomputable.

Directions for further research include:

Direction 5: Agnostic RL

The conventional approach to nonrealizability in offline and online learning is to replace convergence to the true hypothesis with convergence to the hypothesis closest to the truth. Similarly, in RL, given some natural metric or divergence  on the space of environments, we can try to prove regret bounds of the form

Here,  is the time horizon,  is the hypothesis class and  is the true environment.

That is, we allow a sublinear term (as in the realizable case) and a linear term which scales with the distance of  from . Indeed, a result of this form was proven in Zanette et al, although in a model-free setting (i.e. instead of the environments themselves they consider their corresponding state-action value functions, although their divergence also depends directly on the environment).

At least superficially, a potential drawback of this approach is that Bayes-optimality doesn't imply this "agnostic" regret bound (AFAICT), as opposed to ordinary regret bound of the realizable setting. There is also no immediately obvious agnostic analogue of Bayesian regret. However, we can rectify this by recasting the setting in IBRL language. Indeed, for any  it is possible to construct a (non-crisp) infra-Bayesianian law[13]  s.t., for any policy  and :

Notice that when  is total variation distance, we can take .  

Then, learning the hypothesis class  is equivalent to agnostically learning . Moreover, IBRL does have Bayesian regret, and a notion of Bayes-optimality that implies infra-Bayesian regret bounds, which thereby carries over to agnostic RL.

A few examples of directions for further research:

Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

It makes some sense to consider these problems together, but different direction emphasize different sides.

Direction 6: Metacognitive Agents

This research direction is motivated not only by traps, but also by producing especially strong frequentist guarantees via metacognition (a form of self-improvement). More speculatively, this model might be also applicable to humans (and related specifically to conscious reasoning), which makes it relevant to superimitation (see relevant section [LW(p) · GW(p)] below). It was originally called Turing RL [LW(p) · GW(p)], but that name seems incorrect now because "RL" assumes learnability (no traps).

In a metacognitive agent, in addition to the usual "external" interfaces (actions and observations), there is also an "internal" interface that allows the agent to execute arbitrary programs and observe their outputs. This allows us to use a learning algorithm to form subjective beliefs about computations. The beliefs can be entangled with the agent's beliefs about the external world.

For example, let  be the set of all lambda terms. Then our agent might have a belief  over  that refers to the relation of -equivalence on . In a Bayesian framework  is a distribution, and the agent might also have a belief about the external world that takes the form of an RL environment that depends on a point in the support of . In an infra-Bayesian framework,  is an infradistribution and there is an appropriate way to define a notion of "law" s.t.  is its marginal[15].

However, it might be more interesting to put all beliefs about the world "inside quotation". Indeed, let  be a prior about the external world that a program can sample from: for example the unnormalized Solomonoff prior. Given a -equivalence oracle, we can sample this prior using some oracle machine  (since such an oracle allows to instantly know whether a given program produces given output). Since any point in  has the right type signature to serve as such an oracle, we can construct a kernel  from  into the external world s.t.  is defined by plugging the oracle  into . We can then combine any  with  (i.e. form the semidirect product ) to obtain the overall belief.

Notice that this setting is inherently nonrealizable, since the ground truth about  is uncomputable. On the other hand, the nonrealizability of the external world can in principle be circumvented, because we have computationally simple hypotheses of the form "the external world behaves according to program ". Moreover, even "the external world is sampled from the unnormalized Solomonoff prior" is a computationally simple hypothesis. Hence, as long as the world is computable, we can describe it exactly modulo uncertainty about computations.

While the external world can contain irreversible transitions, it seems reasonable to employ an approximation in which executing most[16] programs doesn't have irreversible harmful side-effects. As a consequence, if the agent survives long enough, it can learn most facts about computations. In the "quoted prior" setting, it will then use what it learned about computations to approximate Bayes-optimal behavior in the external world, which addresses subproblem 1.2. Moreover, since some hypotheses assert quoted priors about computations, it can arguably use what it learned to improve its own learning algorithm.

This framework is closely related to MIRI's notion of "logical uncertainty". Hence, it makes sense to compare it Garrabrant's logical induction (LI). While LI might turn out to have a useful role in this framework, in itself, LI doesn't address some key relevant questions:

There are several potential starting points for research into metacognitive agents:

Direction 6.1: Bayesian planning via query-based learning of computations

Arguably tree automata and more generally string machines [LW · GW] are a natural class of hypotheses about computations. Therefore, we can try building on existing research in automata inference[17]. This field is infamous for its many impossibility results, but there is one model of learning that has been quite successful. Namely, learning automata assuming two types of allowed queries:

That is, algorithms exist that guarantee convergence to a correct automaton with a number of queries that is polynomial in minimal automaton size.

In our setting, the first type of query can be implemented by running a specific program. The second type of query cannot be implemented directly, but is effectively available in various natural setups. For example:

Therefore, it is a natural question whether we can implement Bayesian planning by some protocol of this type, assuming (for starters) that it is possible to efficiently sample from the prior. For example, assume that the hypothesis about computations translates to a hypothetical infra-Bayesian coarsening (or Bayesian approximation?) of the prior for which optimal planning is feasible. Assume also that there if resulting policy performs less well than promised on the true prior, there is some way to produce a counterexample to the hypothesis about computations. Then, counterexample queries are effectively available. 

Automata learning might also be feasible in a setting where the automaton generates a sequence that we need to predict. (At least for a deterministic automaton this is trivial because the sequence is periodic.) Speculatively, the amount of available computational resources can be treated as defining a "sequence" (possibly related to "logical time [LW · GW]") and this can be exploited to get guarantees even when the prior cannot be sampled efficiently.

Direction 6.2: Expressiveness of string machines

String machines are a potential starting point for finding a frugal universal prior for metacognitive agents. In order to strike the right balance between "frugality" and "universality" we need to understand the expressiveness of the string language and the computational complexity of evaluating string machines (or answering more complicated questions about string machines) under various assumptions. In particular, parameterized complexity theory might be relevant here, since there seem to be more relevant parameters than just description size (e.g. the number of meta-vertex levels we allow).

Direction 6.3: Query learning of string machines

We can also study the sample complexity of learning string machines in the two-type query model above. While existing results guarantee learning with a number of queries polynomial in the number of states of an automaton, we would like query complexity to scale well with the description size of an automaton in string language. More general string machines are not even automata.

Direction 7: Approximation algorithms for Bayesian planning

Subproblem 1.2 comes from an NP-hardness result, but that still leaves room for studying approximation algorithms. There are many NP-hard optimization problems for which we have conjectured optimal approximation ratios (e.g. based on the PCP theorem or the unique games conjecture). It is conceivable that we can prove a result of this form for Bayesian planning. Some approaches:

Direction 8: Expanding Safety Envelope

The Expanding Safety Envelope [LW(p) · GW(p)] (XSE) is an approach to traps which can be summarized as follows:

One significant limitation of this approach is, when using a rich prior, discovery of new safe actions can at best be approximate. While we can study variants where we sometimes try risky actions, guaranteeing any sort of optimality there is likely to be intractable. However, maybe it is feasible to require that, whenever the posterior admits an approximation (e.g. in the sense of total variation distance) by a prior with iteratively discoverable safe actions, we get a corresponding lower bound on the performance of our algorithm starting from this point.

Better regret bounds for simple priors

This section is a grab bag of research directions about improving regret bound theory for simple hypothesis classes based on MDPs. They are not directly related to AIXI, but rather fit perfectly within classical RL theory[18]. However, it seems plausible that they are necessary pieces of the bigger puzzle (i.e. for a full solution of Problem 2).

Direction 9: Unifilar MDPs

[EDIT: What I called "unifilar MDP" is already known in the literature under the name "regular decision process" (RDP): see Ronca and De Giacomo. Notably, Subproblem 2.2 (password games) already emerges in learning RDPs with small number of states. Ronca and De Giacomo address it by assuming that the uniformly random policy has a high probability to reach every RDP state: a condition which is clearly insufficiently general, but does serve as an interesting starting point.]

Most of RL theory literature focuses on learning an MDP with a known state space, implicitly assume a known representation. Ortner et al prove a regret bound in a setting with unknown representation, but it scales with the cardinality of the set of possible representations. For real-world agents, I expect this cardinality to be something like doubly-exponentially large, so their bound is extremely weak. 

As a first step towards a better theory of representation learning, I propose the following setting. Fix action set  and observation set . Then, a unifilar MDP (UMDP: combination of MDP and unifilar hidden Markov model)  consists of the following data: the state set , the initial observation distribution , the state initialization function , the response kernel  and the transition function . The interaction of a policy  with a UMDP works as follows:

Notice that the state at any moment of time can be recovered from the observation-action history, i.e. the above data defines some . Hence this is indeed an MDP rather than a POMDP.

Fix a reward function  that can itself be represented by a deterministic finite state machine. We then constrain  by requiring that there exists some  s.t. for any  and  we have

The problem is then proving the existence of an algorithm for interacting with an unknown communicating UMDP  compatible with  which satisfies a regret bound that scales with 's number of states, number of actions and diameter.

Direction 10: Foreground MDPs

Regret bounds in (lifelong/non-episodic) RL scale with MDP diameter because of the need to rule out traps. On the other hand, sequence prediction / online learning requires no such constraint and even allows environments with infinitely many states. However, sequence prediction / online learning can also be regarded as a degenerate special case of RL. This suggests that there should be a natural theory that combines both.

Motivated by the above, I propose to study MDPs of the following form. Assume that the state set of the MDP factors as  (background and foreground). Moreover, there is some  s.t. the -marginal distribution of  is  for any  and , where  is the transition kernel of the MDP. That is, the background dynamics depends neither on the foreground state nor on the agent's actions. On the other hand, the foreground dynamics can depend both on the background state and the agent's actions.

We can then define a notion of diameter for the foreground only, and derive regret bounds that scale with it. Such a regret bound can scale with the total number of states, or with some dimension parameter (see Direction 11 below). In the latter case,  can be infinite. 

Direction 11: Canonical RL Dimension

[EDIT: Foster et al achieved significant progress on this, which I didn't know at the time of writing.]

This direction already receives some attention in mainstream academia, so it might be of relatively lower priority.

In multiple areas of learning theory, we have a fairly complete characterization of how sample complexity depends on the hypothesis class. In many cases, the key parameter of the hypothesis class is some type of "dimension". For example:

In RL, our understanding is less good. We do have some results of similar nature, most of them derived from Russo and Van Roy's "eluder" dimension. For example Jin, Liu and Miryoosefi demonstrate a model-free regret bound that scales with "Bellman-Eluder" dimension and the logarithm of a covering number (the later can be said to correspond to Minkowski-Bouligand dimension). However, we don't have matching lower bounds for these results. This means that we don't know whether the results we have cannot be subsumed by some better theory (indeed, Du et al show results that are neither strictly weaker nor strictly stronger than Jin, Liu and Miryoosefi). In contrast, the flavors of learning theory mentioned above come with proofs that their notions of dimension are optimal (albeit, only under the assumption of a worst-case distribution).

Direction 12: Epistemic Regret Bounds

This direction attempts to address Subproblem 2.2.

The motivating intuition is: formulating the correct frequentist guarantee requires correctly distinguishing between epistemic and aleatoric uncertainty. The problem with password games is that the password is completely random information that is impossible to infer by any method other than brute force. Hence, we should treat this information as part of "aleatoric" uncertainty.

Let's recall the definition of Bayesian regret. Let  be a set representing hypotheses and  the prior. Since each hypothesis defines an environment, we have a mapping 

The Bayesian regret of a policy  is given by

Here  is the time horizon, and  is the reward at time .

For any environment  and history , denote

That is,  is the probability that  produces  assuming a deterministic policy compatible with the actions in .

Notice that minimizing Bayesian regret is equivalent to maximizing expected utility. However, expected utility only depends on  and  via  defined by

Here, the expectation is in the sense of the Bayes' rule for environments, not pointwise. That is, first  is sampled out of  and then the agent faces the environment . Explicitly:

On the other hand, Bayesian regret depends directly on  and . Indeed, the minimal Bayesian regret is typically non-zero even for learnable . On the other hand, if we took  to be a single element set  and defined  by , we would get  but the corresponding Bayesian regret of the Bayes-optimal policy would vanish.

This means that the definition of Bayesian regret treats the uncertainty coming from  and the uncertainty coming from the stochasticity of  differently. We can call the former "epistemic" and the latter "aleatoric". Intuitively, the -dependence of Bayesian regret measures the rate at which the agent reduces its goal-relevant epistemic uncertainty, and it is sublinear iff the agent asymptotically reduces it to 0 i.e. learns all the "epistemic information" it needs. On the other hand, we don't expect the agent to learn all "aleatoric information" but only to maximize expected utility w.r.t. to aleatoric uncertainty.

In the case of the Solomonoff prior, the naive approach regards the uncertainty about the program for the UTM as epistemic uncertainty and the uncertainty corresponding to the probabilities computed by that program as aleatoric uncertainty. However, in algorithmic statistics[19] there are more nuanced tools to separate "structured" and "random" information, like the notion of "sophistication".

Given any string , we can consider ways to represent it as running program  (which we assume to halt on all inputs) on input . Let's constrain this representation to be nearly-optimal compression, i.e. require that for some some "small" number 

The minimal length of  over all such representations is called the -sophistication of  and can be interpreted as the amount of "structured" information in . In our case, the minimal  of an MDP might be considered as the "epistemic" information it contains, which leads to a novel decomposition of the Solomonoff prior into some  and . Care might be required for choosing . For example, maybe we can set .

In purely frequentist terms, we can hope for a regret bound of e.g. the form 

Here  is the -sophistication of the MDP+representation (s.t.  is the amount of "random" information) and  is another constant.

More generally, instead of a single MDP we can consider a computable distribution  over a -dimensional family of MDPs of diameter at most , using some notion of dimension from RL theory (see Direction 11 above). Then, we can expect that the average (-Bayesian) regret of the family can be bounded by

Here  is yet another constant, whereas  and  refer to the sophistication and Kolmogorov complexity of .

Given an appropriate notion of "computable distribution over MDPs", this would encompass learning of MDPs with arbitrary (uncomputable) transition probabilities.

Subproblem 2.4: Multi-Agent Guarantees

One special case is completely straightforward in IBRL, namely two-player zero-sum games: IBRL essentially is a zero-sum game, so it's enough that each player views the other as "Murphy". So, for example two IBRL agents equipped with sources of randomness (s.t. effectively their actions are lotteries over game moves) playing a stochastic game necessarily converge to a Nash equilibrium, assuming that the corresponding infra-MDPs are in their respective hypothesis classes.

Beyond that, there are roughly two approaches to multi-agent guarantees:

AFAIK either of the following possibilities is conceivable:

Different directions below have different relationships with these worlds.

[EDIT: See also Infra-Bayesian Haggling [LW · GW], a new direction for Worlds 2.4.2/3]

Direction 13: Population Games

This direction is motivated by Worlds 2.4.1/2.

A population game [LW · GW] is a setting where large populations of players play against each other by being sorted into random groups on each turn, s.t. the probability to meet the same opponent twice is negligible. This setting is designed to reduce incentives for superrationality, although it's still unclear whether this is sufficient to justify classical game theory.

An appealing feature of this setting is that it can be naturally interpreted as a dynamical system. For learning agents satisfying very mild assumptions, every fixed point corresponds to an -Nash equilibrium (where  as the geometric time discount constant ). Moreover, a fixed point exists for any tuple of player policies.

It is therefore natural to ask whether any of the fixed points have to be attractors (under some assumptions about the learning algorithms), study their attraction basins, study other fixed submanifolds etc.

Direction 14: Logit equilibria of finite-state repeated games

This direction is motivated by World 2.4.1. (See this [LW · GW] for early thinking on the topic.)

A logit equilibrium is defined similarly to a Nash equilibrium with "optimal response" replaced by "soft-optimal response" (i.e. the players behave according to the softmax distribution associated with the expected payoffs of different plays for fixed opponent distributions). Consider the iterated prisoner's dilemma where each player's pure strategies are constrained to be of the form:

In particular, tit-for-tat corresponds to  and .

For geometric time discount  close to 1, both always-defect/always-defect and tit-for-tat/tit-for-tat are Nash equilibria. Moreover, tit-for-tat/tit-for-tat is also an approximate logit equilibrium for softmax parameters[20]

However, always-defect/always-defect is not an approximate logit equilibrium as long as

This happens for the following reason. Tit-for-tat is almost as good a response to always-defect as always-defect: it only loses  utility because of round 0. Therefore, the soft-optimal response to always-defect is a mixture of at least always-defect and tit-for-tat. Moreover, the soft-optimal response to a mixture of always-defect and tit-for-tat places little probability on always-defect, because tit-for-tat does much better.

As a result, in this asymptotic regime logit equilibrium entails a probability approaching 1 of cooperating forever. More generally, I propose the following conjecture: in any repeated (or even stochastic?) game in which we constrain the strategies to be finite state machines (with/without a uniform bound on the number of states?), while allowing a sufficient number of states, any logit equilibrium converges to Pareto efficiency in the asymptotic regime above.

This seems like a potential method of obtaining superrational behavior as a corollary of classical game theory in some situations.

Direction 15: Infra-Bayesian Veil of Ignorance

This direction is motivated by Worlds 2.4.2/3.

As a warm-up, consider a repeated symmetric game between two agents who are copies of each other, either without randomization or with both using the same random generator. In this setting, each agent learns that the other parrots its actions precisely. Therefore, even a simple learning algorithm will converge to the symmetric Pareto efficient outcome.

Direction 15.1: IBRL with exact clones

What if the game is not symmetric, or there is randomization which breaks the symmetry? If we assume all the agents are exact clones (i.e. they execute the exact same code / follow the exact same policy), each agent can regard this as a Newcombian environment (the clones are implemented by "Omega"). And, we know that Newcombian environments can be represented as infra-Bayesian laws. We will call it the clone law in this case.

Notice that, even though exact clones must have the same utility function, this doesn't rule out asymmetric games. Indeed, the utility functions are the same in terms of the particular agent's actions and observations, but this is consistent with each agent identifying their own role ("which clone/player am I") and receiving payoffs accordingly. As a simple example, we can imagine the payoff being a part of the observation channel.

Notice also that the clone law is not a convex combination of laws corresponding to the different roles the agent can play. Rather, the clone law requires Murphy to predict the behavior of the agent in each possible role in advance of the random role selection. This way pseudocausality is ensured (i.e. Murphy can't "cheat" by only predicting the behavior of one role correctly.)

Moreover, two clone laws which are identical except for a different probability distribution over roles typically cannot coexist in the same learnable hypothesis class. This is because there is no way to infer the distribution over roles from observations. Instead, we should think of this distribution as determined by the prior. Indeed, mixing two clone laws is equivalent to mixing the role distributions in the  limit, as long as the role distributions have full support. Hence, we can imagine all clone laws for the same game inside the prior effectively aggregating into a single clone law with the average role distribution. 

The optimal policy for the clone law is maximizing the linear combination of utilities of different roles according to the prior distribution over roles. Hence, the prior determines which point on the Pareto frontier is selected, and can be regarded as defining a "notion of fairness".

One issue with clone laws is that they cannot be represented as e.g. infra-POMDPs with a finite number of states (the state has to encode Murphy's prediction i.e. the entire relevant part of the policy). This leads to a concrete research problem: find natural hypotheses classes containing clone laws that admit efficient learning algorithms.

Direction 15.2: IBRL with inexact clones

What if we want to allow agents with different utility functions? One trick to achieve this is postulate that the agent receives a description of its utility function upon waking up (e.g. in the form of a program, if it's a metacognitive agent). This way, the policy has to account for all admissible utility functions. It is possible that we can apply similar methods without this trick, using instead the observation that most utility functions can appear as instrumental goals in some situations (e.g. using the formalization of compositional polytope-MDPs).

What if the agents have different priors (in particular over utility functions but also in general)? Now they are not exact clones anymore. However, it is conceivable that their policies are still similar in some sense that can be used to represent the setting as Newcombian. For example, we can equip the space of utility functions with some metric. We can then define a metric on policies via the Hausdorff distance between their graphs.

As a toy model, we can consider a fixed game in normal form, with the agent having some prior over:

We can then try to prove that, under some assumptions, priors that are "close" (according to some divergence) lead to policies that are close. Moreover, when agents with close priors are facing each other (and know their mutual distances are small), they should play close to Pareto efficiency. It would be fascinating if we could recover something like Yudkowky's proposal [LW · GW].

In particular, consider a game with a unique Pareto efficient outcome. Assume that all admissible payoff matrices are monotonically increasing transformations of each other. Then, the agents will always play Pareto efficiently, because doing so unconditionally guarantees that all opponents will do the same thing, for any finite distance from opponents. (In particular, the actual distance is 0).

Direction 16: Hidden Rewards

This direction attempts to address Problem 3 (value ontology).

In order to define a utility function, we need some ontology  to specify its domain. One natural way to represent an ontology is using an infra-POMDP. That is, we have the state set , the initial infradistribution , the observation mapping  and the transition infrakernel . Here,  stands for credal sets over , i.e. non-empty closed convex subsets of  (we can also allow more general infradistributions). The reason we're using an infra-POMDP instead of a POMDP is, we don't want  to fully specify the environment.

To specify an environment over  (in ontology ), we consider a maximal refinement of the infra-POMDP (we call this a hidden environment[21]). That is, a state set , an initial distribution , an observation mapping , a transition kernel  and a translation mapping  s.t. the following conditions hold:

If we want to do IBRL rather than Bayesian RL, we can consider general (non-maximal) refinements, i.e. refinements that are infra-POMDPs rather than POMDPs.

To specify a reward function over  ("hidden reward function"), we consider a function . More generally, we can allow bounded functions . Given a reward function, the corresponding utility function is defined as usual i.e. as a time-discount sum of rewards.

An interesting special case of a hidden environment is a UMDP equipped with a translation mapping. We will call it a relative UMPD (RUMDP). Assuming a hidden reward function of the form , it admits efficiently computing the optimal policy.

Given hidden environments/rewards instead of observable (i.e. ordinary) environments/rewards, it is straightforward to define analogues to concepts from RL theory such as prior, Bayes-optimality, regret, learnability etc.

Direction 16.1: Agency with partial monitoring

Learnability is more difficult to establish with hidden reward functions than with observable reward functions, since ruling out traps is insufficient (see Example 1 here [LW · GW]). For stateless environments, we already have the closely related theory of partial monitoring, which provides a comprehensive classification of possible regret bounds[22]. Therefore, we can try building an extension of this theory to our RL-like setting.

Direction 16.2: Specifying semi-instrumental reward functions

A related setting in which learnability is established is instrumental reward functions [LW · GW] (and very likely we can extend that to semi-instrumental reward functions as well). Its drawback is that (as opposed to hidden rewards) it doesn't come with a simple natural way to specify reward function. A formal relationship between the two frameworks could allow us enjoying both worlds.

Given any hidden reward function , we can ask whether there exists a semi-instrumental reward function  s.t. for any hidden environment  and history  compatible with 

Here,  stands for the instrumental state corresponding to , and we abuse notation on the left hand side by implicitly projecting  to  using .

The problem, then, is to characterize hidden reward functions with this property. Also, we can try to characterize  for which all hidden reward functions have this property.

Direction 16.3: Undogmatic Ontologies

Every hidden environment defines an observable environment. However, for an arbitrary , not every observable environment can be obtained this way. In other words,  constraints what the agent can expect to see. Arguably, this is undesirable: why would the real world agree with an ontology which is just an arbitrary/subjective feature of the agent's axiology?

Hence, it is natural to consider undogmatic ontologies:  that admit all observable environments. However, it seems tricky to find pairs of  and a hidden reward function  that simultaneously satisfy all of the following conditions:

As an existence proof, we can consider the ontology whose states are instrumental states () and initialization is completely uncertain (), and  corresponding to any instrumental reward function. It would be interesting to find a good general characterization. In particular, I don't even know whether  can be finite. [EDIT: Here [LW(p) · GW(p)] is a way to construct many examples, including finite ones.]

Direction 16.3: Bayes-optimal planning for communicating RUMDPs

We know that approximating Bayes-optimal planning is intractable for MDPs with traps. However, maybe for communicating RUMDPs it is tractable even when they form an unlearnable class. This seems like an interesting question to investigate.

Direction 17: Algorithmic Descriptive Agency Measure (ADAM)

This direction attempts to address Problem 5 (descriptive agent theory).

ADAM is a formal way of quantifying the extent to which some arbitrary policy  is "agentic".

Fix some UTM .

We will consider computable utility functions of the form . Notice that such a function is automatically continuous. Since it is computable, we can talk about its Kolmogorov complexity  (w.r.t. ).

Also, for any UTM  we denote  the corresponding Solomonoff (semi-)environment (prior). We can talk about : the Kolmogorov complexity of  w.r.t. . Also, given  as above we can consider the joint Kolmogorov complexity .

We can now tentatively define ADAM by

Here the maximum is over all computable utility functions  and UTMs .

Intuitively, we are looking for a way to interpret  as maximizing  w.r.t. prior  s.t., on the one hand,  does a good job at maximizing (enforced by the first term), and on the other hand, this interpretation is not contrived (enforced by the second term). 

Notice that  inherits all the other problems of AIXI. Hence, we will ultimately need to modify it according to the respective solutions (e.g. use frugal universal priors and their associated complexity measure instead of Solomonoff priors and Kolmogorov complexity).

 has a few easy to check nice properties:

  1. The AIXI policy  has .
  2. Any computable policy  has . This allows us to intuitively interpret  as something like "how many of the bits needed to describe  are contributing to its agency".
  3. For every  there is a computable policy  s.t. .
  4. Changing the UTM only changes  by .

However, at present I don't know much else about its properties. This leaves a lot of questions to explore:

Direction 17.1: Comparing ADAM variants

Here are some superficially similar definitions:

Here,  is a Solomonoff prior over policies and  is a Solomonoff prior over utility functions and UTMs[23].

At present, I slightly prefer  over  and  because for the latter two, I don't know property 3 above (I know that e.g.   but not ).

It is easy to see that

Open problem: Are  the same as  up to ?

Direction 17.2: ADAM of random policies

Deterministically generating an object of high Kolmogorov complexity is hard: for any program  that outputs a bitstring  we have , so generating  with very high  requires a very long program. Since , this means that deterministically generating an agentic policy is also hard. But, randomly generating an object of high Kolomogorov complexity is easy: most bitstrings  of length  have . On the other hand, I expect most policies to be unagentic.

Conjecture: For any computable , we have

Here,  is computable in the sense that it's a Turing machine that receives the first argument on a separate input tape, and  is the IID fair coin.

Direction 17.3: ADAM Hierarchy Theorem

It would be interesting to understand how dense the possible values of  are. That is, find a function , as slow growing as we can, s.t. for every  there is a policy  for which

Direction 17.4: ADAM for finite-state policies

Consider policies that can be implemented by a deterministic finite state machine. That is, such a machine has a finite state set , an initial state  and a transition function . Suppose that . Clearly, the policy  implemented by the machine has

Open problem: Is this bound on  in terms of  close to tight? Are there infinitely many  s.t. for some -state policy  it holds that ?

Direction 17.5: Inferring the utility function

Given an agentic policy, can we infer its utility function and prior? For simplicity, let's start with the case when the prior is known[24]. For this purpose, we can consider the "semidescriptive" agency measure

Here, .

The obvious candidates for the utility function associated with  are all  that attain the supremum on the right hand side, or at least come close to attaining the supremum. Notice that the contrived utility function  which just rewards behavior according to  will typically fail this criterion:  and hence the expression inside the maximum is  when .

As opposed to  is not approximately UTM invariant, strictly speaking. However, it is still approximately invariant w.r.t. changing the UTM used for defining  and  while holding  fixed. That is, we imagine the Kolmogorov complexities defined w.r.t. some UTM  and  w.r.t. a different UTM , and we have approximate invariance w.r.t. changing .

This raises the question of whether the inferred utility function is at least somewhat stable w.r.t. changing . Intuitively, it is unlikely to be the case for arbitrary : there is no canonical way to ascribe a utility function to a rock. However, we can expect it to be the case for agentic , i.e. in the limit .

To simply matters even further, let's focus on the case . Moreover, let's assume that there is exists a computable  s.t.  is the associated AIXI[25], i.e.

We probably still don't have full uniqueness since small changes in rewards that happen a bounded number of times can be insufficient to affect the optimal policy. However, the asymptotic rewards might be unique. Formally, I propose the following uniqueness conjecture.

Definition: Given a topological space  and bounded function  and  are said to be locally equivalent utility functions when for any  there is  an open neighborhood of  and  s.t. for any 

Conjecture: Let  be a policy and  computable utility functions s.t.  is AIXI for both  and . Then  and  are locally equivalent in the sense of the product topology on .

[EDIT: The Conjecture as stated is false, but other versions might be true, see discussion here [LW · GW].]

Direction 18: Infra-Bayesian Physicalism

This research direction attempts to address Problem 4.

Infra-Bayesian Physicalism [LW · GW] (IBP) is a framework that builds on metacognitve agents, introducing a way to talk about hypotheses, priors and counterfactuals that does away with cartesian privilege. 

IBP comes with an agent-independent ontology for values, which provides another possible answer to Problem 3. There might be a natural way to translate hidden rewards to the IBP framework (see original article [LW · GW]), so these two answers are not mutually exclusive. However, the price tag is a bizarre constraint on values which we call the "monotonicity principle", for which I only have wildly speculative explanations at present. Hopefully, further research can shed some light on the question.

IBP also has an interesting relationship with ADAM: it seems that in that framework it's impossible to define an "ideal" agent, but we have to be content with defining an agency measure [LW · GW]. At the very least, "ideal agent" is a less natural object in that context.

See the original article [LW · GW] for concrete directions for further research.

Physicalist Superimitation

Motivation

Physicalist Superimitation (PSI) is a (currently informal) approach to creating a rigorous alignment protocol, i.e a formal abstract model of a (hopefully feasible) AGI design that will be provably aligned under some (hopefully realistic) assumptions.

PSI is not the main motivation for LTA. The case for creating a mathematical theory of intelligent agents, both in general and even in particular via LTA, is much more robust than the case for PSI specifically. Moreover, I expect the theory produced by LTA to be applicable to analyzing many different alignment protocols (e.g. IDA, debate or AQRL [LW(p) · GW(p)]), not just to PSI. 

However, there are several reasons why discussing PSI here seems important:

Superimitation

The first component of PSI is superimitation: an agent (henceforth: the "imitator") that receives the policy of another agent (henceforth: the "original"), and produces behavior which pursues the same goals but significantly better. (Later, we will apply the framework in a way in which the role of "policy" is played by something broader than externally visible behavior: the required "behaviorist" assumptions about values are a lot weaker than they might seem.) This is different from mainstream research in inverse reinforcement learning where the goal is usually producing behavior which is merely equally good, or better only by the virtue of being less noisy. For now, we will put aside the question of how to obtain the original policy (but, see Agent Detection below). 

While PSI heavily leans on IBP, basic superimitation can be formulated and studied in a cartesian setting as well. In fact, starting the research from a cartesian setting seems like the reasonable approach. More concretely, the starting point can be ADAM and in particular Direction 17.5.

However, the successful completion of Direction 17.5 is would still fall short of a model of superimitation. Indeed, presumably the utility function can only be inferred (more or less) perfectly in the limit , but in this limit the original is already optimal and cannot be improved upon. And, for any finite , the inaccuracy of the inference might negate any improved optimization. Therefore we need to add to our model some mechanism by which the imitator gains an advantage over the original.

One advantage mechanism is allowing the imitator more actions and/or observations. Examples of operationalizations:

The parenthetical issues related to the domain of the utility function can hopefully be addressed given a solution of Problem 3. But, even putting these aside, this mechanisms seems somewhat disappointing: we want the AI to be superhuman not only thanks to better input/output channels, but also thanks to some deeper notion of "cognitive power". In other words, the resulting system might be very uncompetitive.

An improved advantage mechanism is suggested by the framework of metacognitive agents. The imitator can have superior computing hardware which allows it run computations faster than the original (via the internal interface), or even run computations that the original cannot run[27]. It has some formal similarity to the first mechanism, since the computing hardware can be regarded as a type of input/output channel, but seems like a better model of "cognitive power".

Applicable to humans

I often hear objections along the lines of "but humans don't have utility functions / are not rational / are not coherent / are not agents". I think that that reasoning is confused.

I already touched on this topic in the section Nonproblem [LW · GW] above. But, in the context of alignment, there is a different important argument: alignment is fundamentally a normative problem. Whenever we talk about whether an AI is "aligned" or "good" or "safe" we're using value-laden concepts that only makes sense within a model (approximation) in which we actually possess (more or less) well-defined values. Whenever we discuss what AI design we should choose, we're working on a problem that only makes sense within a model in which we are rational agents that make choices according to our values. Let's call this model "the anthroposinepic[28] view".

This is not to say that the anthroposinepic view is a perfect description of reality. The map is not the territory, all models are wrong - but some are useful. For example, "the center of mass of the sun" is also not a well-defined concept: it's not clear which particles are part of the sun, in QFT particle positions are inherently fuzzy within the Compton wavelength, and in quantum gravity any notion of position is probably approximate at best. Nevertheless there are many contexts in astrophysics in which speaking of the center of mass of the sun is perfectly sensible. Similarly, there are contexts in which the anthroposinepic view is perfectly sensible: in particular normative questions, since they assume this view from the onset!

Another objection of similar flavor is: what if humans are actually made of multiple coexisting agents (see e.g. Sotala [? · GW]), or have dynamically inconsistent preferences (i.e. a utility function that varies over time, such as hyperbolic discounting)? My answer is: maybe it means that we will have to extend the framework to deal with those possibilities. But it's also possible that it's unnecessary or that it works out "automatically": multiple agents (coexisting or distributed along the subjective timeline) behave effectively as a single agent (thanks to superrationality). My methodology it to always start with the simplest model (in this case, humans as unitary agents) and only go beyond it when:

Most importantly, I don't think this debate can be definitely resolved purely by philosophical arguments. Instead, we should study the emerging theory, check whether it satisfies intuitive desiderata, cross-reference it with cognitive science, and see what it says about the edge cases in which philosophical confusion arises. If the theory is correct, it will ultimately serve as a source of intuition which will lay all philosophical confusion to rest. And if it isn't, well, then we will see where it breaks and grow wiser from that too.

Agent Detection

Superimitation assumes we can access the original's policy. In the context of an alignment protocol, this assumption has major issues:

PSI addresses this by combining IBP and ADAM. IBP provides us with a rigorous notion of "which computations is the universe running". ADAM provides us with a rigorous way to decide which sequences of computations are agents. Putting the two together should allow specifying an "agent detector": a mathematical operator that transforms any given hypotheses about the universe into the collections of agents that inhabit this universe. 

This can solve the problems above modulo the question of which agent should we superimitate (for this, see User Identification below). That said, I don't know how the solution would cache out in practice for humans. It would be fascinating to combine this theory with brain science when the former is much more mature. Speculatively, I suspect that a metacognitive version of ADAM would interpret some of human inner mental life as input/output through the internal interface.

Also, I think that there might be some formal relation between this "physicalist agent detector" and some version of Garrabrant's Cartesian Frames, but I won't try to articulate it here.

User Identification

We don't want the AI to superimitate any and all agents. Some of those agents might be descendants of the AI. Some of those agents might be other AIs, aliens and animals. We don't necessarily want the AI to superimitate humans who lived in the past either.

For now, let's make the simplifying assumption that there is a single human we want the AI to imitate: the "user". Generalizing the method to groups of humans is left as an exercise to the reader[29].

It seems [LW(p) · GW(p)] that IBP has a natural notion of causality. Since IBP is a computationalist framework, this notion talks about causal relationships between instantiated computations. Specifically, the results of some computations control the instantiation or input[30] of other computations. Since an agent is a type of computation, this gives us a notion of causality between agents: the action Alice takes in a particular Alice-counterfactual can affect the observation Bob makes in a particular Bob-counterfactual.

We can use causality to design a protocol for user identification. For example, imagine that the AI wakes up in a room with the user. The user can observe all of the AI's outputs (e.g. characters on a terminal) and the AI can observe many of the user's outputs (e.g. through a camera pointed at the user). This creates a short and tight causal loop between the user and the AI during the AI's early history that doesn't exist between the AI and any other agent[31].

Moreover, we can use causality to specify the point on the user's subjective timeline at which the AI begins to influence them. The AI is designed to superimitate only the user's past: this way there is no perverse incentive to modify the user. Since agents in this framework are programs, this still allows access to the user's entire policy. (Although we still need a to disambiguate between equivalent problems, probably based on description complexity.)

Full Specification

We can now put all the pieces of PSI together:

  1. For each hypothesis in the prior:
    1. Identify all agents.
    2. Out of all agents, identify the user.
    3. If there are multiple users or no user, discard this hypothesis.
    4. Infer the user's utility function.
  2. Choose policy that maximizes prior expected utility (where in each hypothesis, the utility of its respective user is counted).

At least, this is the simplest version. The above is probably not the best way to aggregate preferences across hypothesis, since, for example, it doesn't facilitate acausal trade. Here are additional steps that can fix it

Notice that all this is just an (informal) specification of the function we're optimizing, it's not the algorithm used for optimization. Similarly to how asymptotic Bayes-optimality can be achieved (for certain priors, e.g. small MDPs) by a feasible learning algorithm, without actually tracking every possible hypothesis and calculating the true Bayesian update, this specification can (hopefully) be approximated by some feasible algorithm the details of which we currently don't know (and which we will only begin to describe after substantial additional progress on the foundational part of LTA).

Alignment Guarantee

Formal Alignment Criterion

How to formally define alignment? An appealing approach is, formalizing the following criterion:

Criterion A: An AI is (approximately) aligned when it (approximately) optimizes the expectation of the user's utility function  with respect to the user's beliefs .

Indeed, if I am the user, and I am choosing which AI to run, and I am an agent with a utility function  and beliefs , clearly it is rational for me (given my subjective epistemic vantage point) to choose an AI satisfying Criterion A.

This might seem confusing at first: shouldn't I want the AI to have more accurate beliefs than the beliefs I hold myself? But, this isn't a contradiction. If we imagine the AI as having access to more new evidence than the user, then starting from the same beliefs it arrives at a better outcome. This is closely related to the "advantage mechanisms" we discussed in the section on superimitation.

We will call the part of Criterion A about  axiological alignment. It is more or less the same as "outer alignment". We will call the part of criterion A about  epistemic alignment. It is closely related to (and implies) "inner alignment". The last point might be confusing so let's dwell on it a little.

Inner misalignment is usually defined as the formation of malign mesa-optimizers: models produced by the learning algorithm which are in themselves malign agents (see Hubinger et al). In learning theory, possible models correspond to hypotheses in the hypothesis class of algorithm. If the AI's starting beliefs are sufficiently different from  (it is epistemically misaligned), it might converge to a hypothesis that the user would not have endorsed. Such a hypothesis may lead to actions that are catastrophic from the user's point of view: for example, because the hypothesis encodes a malign agent. Two (not mutually exclusive) examples:

Finally, notice that Criterion A is an approach to formalizing alignment, not formalizing corrigibility. Corrigibility seems to me a not especially natural concept, which is henceforth especially difficult to formalize and therefore also especially difficult to implement. For this reason, I am relatively pessimistic about approaches based on corrigibility (which PSI is not). That said, some weak versions of corrigibility might be feasible, e.g. the Hippocractic principle [LW(p) · GW(p)].

In the following subsections, I will give a rough outline of why I think that it might be possible to prove a formal alignment guarantee about PSI, along the lines of Criterion A.

Axiological alignment of PSI

The axiological alignment of PSI relies on the ultimate success of the ADAM research direction (in conjunction with the necessary parts of the rest of the foundational programme). However, the fact PSI relies on physicalism makes the following problem especially acute: should PSI use cartesian-ADAM (i.e. model the user as a cartesian agent) or physicalist-ADAM (i.e. model the user as a physicalist agent)?

Currently I see the following possibilities:

Epistemic alignment of PSI

The case for the epistemic alignment of PSI relies on the following (informal) conjecture:

Physicalist Agreement Conjecture (PhyAC): Two IBP agents in the same universe that start with similar priors, and share most of their evidence, will converge to similar posteriors[32].

Notice that this is not the case for cartesian agents. This is because cartesian agents with syntactically similar priors have very different priors semantically: each prior refers to the respective agent's actions and observation. Each agent assumes cartesian privilege for itself but not for the other agent.

A strong demonstration for how it fails with cartesian agents is through simulation hypotheses: if Alice is much more influential than Bob, then Alice will assign a simulation hypothesis much more credence than Bob. Even if both Alice and Bob end up believing simulation hypotheses, they will often be different simulation hypothesis: each will postulate the kind of simulators that would have incentives to simulate that person's respective subjective viewpoint.

On the other hand, for IBP agents this argument doesn't work anymore. For example, suppose that universe A has simulators with incentives to simulate Alice's viewpoint. Since Alice observes Bob fairly closely (by assumption of PhyAC), simulating Alice's viewpoint requires also running a simulation of Bob. Moreover, since Bob is an IBP agent, and IBP is computationalist, this implies that Bob expects to experience this simulation (created for Alice's sake) as well. Hence, both of them assign similar credence to being in universe A.

In the case of PSI, a possible objection is, maybe the simulators only need a low fidelity simulation of the user to fool the AI, which the user wouldn't identify as themself. This is unclear (since this low-fidelity simulation has to be good enough to be indistinguishable for the AI from the null hypothesis that we're not in a simulation). However, if this is a concern, we can imagine a version of the agent detector that would also accept low fidelity simulations.

If the user is an IBP agent, PhyAC can be applicable to prove epistemic alignment of PSI. Is the user an IBP agent? This is unclear, but it's essentially the same debate as in the previous subsection.

Another concern is, even assuming that the user is an IBP agent, how similar is its prior to whatever prior we put into PSI? Presumably, the difference should be comparable to using Solomonoff induction with different UTMs, but hopefully not too different (i.e. the associated bisimulation is not too complex). As long as we are in a learnable setting, this doesn't matter. In reality, there are traps, so it does matter: but perhaps not critically. Alternatively, we might need to modify the specification of PSI to make stronger use of the inferred user prior in order to close this gap.

A Plan

Assuming PSI is the correct approach to aligning AI, how would we get from here to a practical implementation? I imagine the plan roughly as follows:

  1. Develop the foundational programme of LTA, until we have fairly good solutions to Problems 1-5 and maybe to other problems that will crystallize in the process.
  2. Continue fleshing out and improving PSI. I see some of it happening in parallel with step 1, but for the most part this is at best second priority until step 1 is much more advanced. In the process, I expect our model of PSI to evolve through roughly the following phases:
    1. Informal model: The model is inspired by mathematical ideas but is not fully rigorous yet. This where we are now. It gradually acquires more details and precision, until we are ready to formalize at least a part of it.
    2. Toy model: The model is rigorous but makes simplifying assumptions that are completely unrealistic. We study the toy model's formal properties (including alignment), while gradually moving to more realistic assumptions. One axis along which this evolution can happen is computational resource constraints. For example, it can be the following progression:
      1. Require only computability or not even that.
      2. Require polynomial space in hypothesis description length. Alternatively, require polynomial time in number of hypothesis-states.
      3. Require polynomial time in hypothesis description length.
    3. (Semi-)Realistic model: The assumptions are sufficiently realistic that we could conceivably implement the model at least in some utopian scenario (e.g. if all capability research stopped and we had 100 years to bootstrap the AI). We continue moving to more realistic assumptions, while also gaining semiformal understanding about which properties of the model are crucial to alignment and which aren't. 
  3. Study those questions outside of theoretical computer science the answers to which are needed as inputs to the AI's design or as parameters to the alignment guarantees. This might be possible in parallel to step 2 or at least to the late phases of step 2.
  4. Build a real-life implementation of PSI. This might be anywhere on the spectrum from (on one end) implementing an algorithm with formal guarantees and formally verifying the guarantees on the production code to (on the other end) implementing an algorithm which is only loosely based on the formal model, while using the understanding we gained from the theory to convincingly argue that the differences between the model and the implementation are immaterial. The algorithm might or might not bear strong resemblance to deep learning, depending primarily on the results of research into Problem 1. In any case, it will involve much engineering work and security mindset to make sure that the assumptions of the alignment guarantee actually apply.

Summary

In this article, I tried to convey the following points:

There is a lot of work and perhaps not that much time. I hope that this article will inspire more researchers to joint the effort. See you all in 2028[33]!


  1. ^

    listed alphabetically by last name

  2. ^

    By "foundational part" I mean the theory of intelligent agents qua intelligent agents, as opposed to the "applied part" where we use this theory to study alignment per se.

  3. ^

    Sometimes we are content with "any except for a set of prior probability 0", such as when we only bound the Bayesian regret in a Bayesian online/bandit/reinforcement learning setting.

  4. ^

    See e.g. Lattimore and Szepesvari for an introduction to regret bounds, in particular section 38 talks about reinforcement learning.

  5. ^

    Here, we assume that the first observation comes before of the first action, in contrast to my usual convention, because this time it's more convenient.

  6. ^

    The diameter is the maximal expected time to reach one state from another state. The bound has to scale with  since as , a trap can develop. Alternative parameterizations are also possible, such as using mixing time or bias span. The latter might be advantageous since bounding the diameter also caps the number of states at , whereas bounding bias span allows an infinite number of states. On the other hand, the bias span depends on the reward function.

  7. ^

    Alternatively, we can consider geometric time discount , in which case  is replaced by 

  8. ^

    The existence of a computationally feasible optimal policy is usually a much stronger condition than the computational feasibility of simulation: for example finding an approximately optimal policy for a POMDP is PSPACE-hard while simulating it in P. Of course there are degenerate cases in which the optimal policy is easy even though simulation is hard.

  9. ^
  10. ^

    The name comes from Kalai and Lehrer.

  11. ^

    I haven't properly studied the prior work relating automata to category theory (which definitely exists) so I make no strong claims to originality here.

  12. ^

    The paper uses a theorem stated without proof, for which the citation given is "Lempp, Miller, Ng and Turetsky, 2010, unpublished, private communication". However, Lattimore kindly provided me with this unpulished communication upon request, and, to the best of my judgement, the proof is valid.

  13. ^

    Infra-Bayesian laws were originally called "belief functions" in the infra-Bayesianism sequenece.

  14. ^

    I haven't done a sufficiently thorough literature survey, so there might already be such results.

  15. ^

    It's not just a law that depends on a point in the support of  because there is no disintegration theorem for infradistributions. See this [LW · GW].

  16. ^

    While running most programs doesn't have irreversible harmful side-effects, but it probably doesn't hold for all programs: for example we can imagine code that uses some hardware exploit. In particular, this is related to what I previous called non-cartesian daemons [LW · GW]. It should be possible to get guarantees that take this into account by e.g. somewhat randomizing the precise code we run each time (it might be related to quantilization).

  17. ^

    See e.g. Kearns and Vazirani chapter 8, Higuera 2004 and Higuera 2010

  18. ^

    As a consequence, this section carries an especially high risk of missing some pertinent known results that I'm unfamiliar with.

  19. ^
  20. ^

    The convention I'm using here doesn't normalize the sum of payoffs over time, i.e.

    and the probability of playing a strategy is proportional to .

  21. ^

    It would be more natural to say that the hidden environment is an equivalence class of such objects, where two are considered equivalent if it is not possible to distinguish them in terms of actions and states in . However, the distinction is not critical for the exposition.

  22. ^

    See e.g. chapter 37 in Lattimore and Szepesvari.

  23. ^

    Or maybe just TMs, that would still allow making sense of  as the lower semicomputable environment computed by .

  24. ^

    If we want to infer the utility function and the prior simultaneously, we probably need to take into account that some ways to redefine both of them together produce an equivalent decision problem. Hence, it makes sense to focus on recovering their product. Formally, any environment  corresponds to a measure  on  s.t. for any , the distribution induced by  and  is equal to , where  is the function that's 1 on sequences compatible with  and 0 otherwise. We can then define the utility measure to be the product .

  25. ^

    This doesn't seem to automatically follow from the assumption . However, it might be possible to show that there is always at least an uncomputable utility function that can be computed with a slow-growing amount of advice.

  26. ^

    There might still be alignment overhead. Specifically, (i) the unusual structure of the loss function, (ii) prior shaping to deal with potential traps and (iii) prior shaping to deal with potential side-effects of computations, might incur overhead. We need to return to this question when we have better models. Maybe the overhead is small, or maybe we can prove that significant overhead is unavoidable.

  27. ^

    Technically, the mathematical analysis will probably need to focus on the asymptotic in which the agents can run all computations, but the original and the imitator can converge to this limit with different speeds.

  28. ^

    from Greek: anthropos (human) + sinepis (coherent, according to ChatGPT)

  29. ^

    Speculatively, we might not even have to do that: the AI itself can facilitate superrational cooperation between all agents who could affect the creation of this or similar AI.

  30. ^

    Input is a special case of instantiation: saying that program  runs with input  is equivalent to saying that the computation  is instantiated.

  31. ^

    We would have to be careful to set the ADAM threshold correctly and make sure the room indeed doesn't contain other agents. Otherwise, we are risking the AI version of "The Fly".

  32. ^

    In some "behaviorist" sense. IBP doesn't really have an update rule, and therefore there is no well-defined "posterior" in general.

  33. ^

    I hope...

17 comments

Comments sorted by top scores.

comment by Vanessa Kosoy (vanessa-kosoy) · 2024-12-08T13:38:33.192Z · LW(p) · GW(p)

This remains the best overview of the learning-theoretic agenda to-date. As a complementary pedagogic resource, there is now also a series of video lectures [AF · GW].

Since the article was written, there were several new publications:

In addition, some new developments were briefly summarized in short-forms:

  • A proposed solution [AF(p) · GW(p)] for the monotonicity problem in infra-Bayesian physicalism. This is potentially very important since the monotonicity problem was by far the biggest issue with the framework (and as a consequence, with PSI).
  • Multiple developments concerning metacognitive agents [AF(p) · GW(p)] (see also recorded talk). This framework seems increasingly important, but an in-depth analysis is still pending.
  • A conjecture [AF(p) · GW(p)] about a possible axiomatic characterization of the maximin decision rule in infra-Bayesianism. If true, it would go a long way to allaying any concerns about whether maximin is the "correct" choice.
  • Ambidistributions [AF(p) · GW(p)]: a cute new mathematical gadget for formalizing the notion of "control" in infra-Bayesianism.

Meanwhile, active research proceeds along several parallel directions:

  • I'm working towards the realization of the "frugal compositional languages" dream. So far, the problem is still very much open, but I obtained some interesting preliminary results which will appear in an upcoming paper (codename: "ambiguous online learning"). I also realized this direction might have tight connections with categorical systems theory (the latter being a mathematical language for compositionality). An unpublished draft was written by my MATS scholars on the subject of compositional polytope MDPs, hopefully to be completed some time during '25.
  • Diffractor achieved substantial progress in the theory of infra-Bayesian regret bounds, producing an infra-Bayesian generalization of decision-estimation coefficients (the latter is a nearly universal theory of regret bounds in episodic RL). This generalization has important connections to Garrabrant induction (of the flavor studied here), finally sketching a unified picture of these two approaches to "computational uncertainty" (Garrabrant induction and infra-Bayesianism). Results will appear in upcoming paper.
  • Gergely Szucs is studying the theory of hidden rewards, starting from the realization in this [AF(p) · GW(p)] short-form (discovering some interesting combinatorial objects beyond what was described there).

It remains true that there are more shovel-ready open problems than researchers, and hence the number of (competent) researchers is still the bottleneck.

comment by Daniel Murfet (dmurfet) · 2025-01-04T10:44:25.880Z · LW(p) · GW(p)

This post and its precusor [LW(p) · GW(p)] from 2018 present a strong and well-written argument for the centrality of mathematical theory to AI alignment. I think the learning-theoretic agenda, as well as Hutter's work on ASI safety in the setting of AIXI, currently seems underrated and will rise in status. It is fashionable to talk about automating AI alignment research, but who is thinking hard about what those armies of researchers are supposed to do? Conceivably one of the main things they should do is solve the problems that Vanessa has articulated here. 
 

The idea of a frugal compositional language and infa-Bayesian logic seem very interesting. As Vanessa points out in Direction 2 it seems likely there are possibilities for interesting cross-fertilisation between LTA and SLT, especially in connection with Solomonoff-like ideas and inductive biases.


I have referred colleagues in mathematics interested in alignment to this post and have revisited it a few times myself.

Replies from: sharmake-farah
comment by Noosphere89 (sharmake-farah) · 2025-01-04T23:51:21.047Z · LW(p) · GW(p)

While the agenda is aesthetically nice, and I do think AI capabilities of the future are surprisingly well-tuned to Vanessa's agenda, I personally think ASI safety in the setting of AIXI is overrated as a way to reduce existential risk, compared to other agendas that rely on automatng research.

comment by Marcus Ogren · 2023-04-19T08:22:09.179Z · LW(p) · GW(p)

(Disclosure: Vanessa is my wife.)

I want to share my thoughts on how the LTA can have a large impact. I think the main plan - to understand agency and intelligence fully enough to construct a provably aligned AI (perhaps modulo a few reasonable assumptions about the real world) - is a good plan. It’s how a competent civilization would go about solving the alignment problem, and a non-negligible chunk of the expected impact of the LTA comes from it working about as planned. But there are also plenty of other, less glamorous ways for it to make a big difference as well.

The LTA gives us tools to think about AI better. Even without the greater edifice of LTA and without a concentrated effort to complete the LTA and build an aligned AI in accordance with it, it can yield insights that help other alignment researchers. The LTA can identify possible problems that need to be solved and currently-unknown pitfalls that could make an AI unsafe (along the lines of the problem of privilege and acausal attack). It can also produce “tools” for solving certain aspects of the alignment problem that could be applied in an ad-hoc manner (such as the individual components of PSI). While this is decidedly inferior to creating a provably aligned AI, it is also far more likely to happen.

As for PSI, I think it’s a promising plan for creating an aligned AI in and of itself; it doesn’t appear to require greatly reduced capabilities and gives the AI an unhackable pointer to human values. But its main significance is as a proof of concept: the LTA has delivered this alignment proposal, and the LTA isn’t even close to being finished. My best guess is that, given enough time, some variant of PSI could be created as a provably aligned AI (modulo assumptions about human psychology, etc.). But I also expect better ideas in the future. PSI demonstrates that considering the fundamental questions of agency can lead to novel and elegant solutions to the alignment problem. Before Vanessa came up with PSI, I thought the main value of her research lay in solving weird-sounding problems (like acausal attack) that originally sounded more like an AI being stupid than like the AI being misaligned. PSI shows that the LTA is much, much more than this.

comment by Satron · 2025-01-05T19:23:27.228Z · LW(p) · GW(p)

A very promising non-mainstream AI alignment agenda.

Learning-Theoretic Agenda (LTA) attempts to combine empirical and theoretical data, which is a step in the right direction as it avoids a lot of "we don't understand how this thing works, so no amount of empirical data can make it safe" concerns.

I'd like to see more work in the future integrating LTA with other alignment agendas, such as scalable oversight or Redwood's AI control [LW · GW].

comment by harfe · 2024-01-09T16:21:45.645Z · LW(p) · GW(p)

Regarding direction 17: There might be some potential drawbacks to ADAM. I think its possible that some very agentic programs have relatively low score. This is due to explicit optimization algorithms being low complexity.

(Disclaimer: the following argument is not a proof, and appeals to some heuristics/etc. We fix for these considerations too.) Consider an utility function . Further, consider a computable approximation of the optimal policy (AIXI that explicitly optimizes for ) and has an approximation parameter n (this could be AIXI-tl, plus some approximation of ; higher is better approximation). We will call this approximation of the optimal policy . This approximation algorithm has complexity , where is a constant needed to describe the general algorithm (this should not be too large).

We can get better approximation by using a quickly growing function, such as the Ackermann function with . Then we have .

What is the score of this policy? We have . Let be maximal in this expression. If , then .

For the other case, let us assume that if , the policy is at least as good at maximizing than . Then, we have .

I don't think that the assumption ( maximizes better than ) is true for all and , but plausibly we can select such that this is the case (exceptions, if they exist, would be a bit weird, and if ADAM working well due to these weird exceptions feels a bit disappointing to me). A thing that is not captured by approximations such as AIXI-tl are programs that halt but have insane runtime (longer than ). Again, it would feel weird to me if ADAM sort of works because of low-complexity extremely-long-running halting programs.

To summarize, maybe there exist policies which strongly optimize a non-trivial utility function with approximation parameter , but where is relatively small.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2024-01-10T10:48:20.248Z · LW(p) · GW(p)

Yes, this is an important point, of which I am well aware. This is why I expect unbounded-ADAM to only be a toy model. A more realistic ADAM would use a complexity measure that takes computational complexity into account instead of . For example, you can look at the measure  I defined here [LW(p) · GW(p)]. More realistically, this measure should be based on the frugal universal prior.

comment by Frank_R · 2023-04-23T07:07:42.380Z · LW(p) · GW(p)

I have a question about the conjecture at the end of Direction 17.5. Let  be a utility function with values in  and let  be a strictly monotonous function. Then  and  have the same maxima.  can be non-linear, e.g. . Therefore, I wonder if the condition  should be weaker.

Moreover, I ask myself if it is possible to modify  by a small amount at a place far away from the optimal policy such that  is still optimal for the modified utility function. This would weaken the statement about the uniqueness of the utility function even more. Think of an AI playing Go. If a weird position on the board has the utility -1.01 instead of -1, this should not change the winning strategy. I have to go through all of the definitions to see if I can actually produce a more mathematical example. Nevertheless, you may have a quick opinion if this could happen.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-04-23T09:12:23.313Z · LW(p) · GW(p)

I have a question about the conjecture at the end of Direction 17.5. Let  be a utility function with values in  and let  be a strictly monotonous function. Then  and  have the same maxima.  can be non-linear, e.g. . Therefore, I wonder if the condition  should be weaker.

No, because it changes the expected value of the utility function under various distributions.

Moreover, I ask myself if it is possible to modify  by a small amount at a place far away from the optimal policy such that  is still optimal for the modified utility function.

Good catch, the conjecture as stated is obviously false. Because, we can e.g. take  to be the same as  everywhere except after some action which  doesn't actually take, in which case make it identically 0. Some possible ways to fix it:

  • Require the utility function to be of the form  (i.e. not depend on actions).
  • Use (strictly) instrumental reward functions.
  • Weaken the conclusion so that we're only comparing  and  on-policy (but this might be insufficient for superimitation).
  • Require  to be optimal off-policy (but it's unclear how can this generalize to finite ).
Replies from: harfe
comment by harfe · 2023-07-10T15:08:04.140Z · LW(p) · GW(p)

I think the conjecture is also false in the case that utility functions map from to .

Let us consider the case of and . We use , where is the largest integer such that starts with (and ). As for , we use , where is the largest integer such that starts with (and ). Both and are computable, but they are not locally equivalent. Under reasonable assumptions on the Solomonoff prior, the policy that always picks action is the optimal policy for both and (see proof below).

Note that since the policy is computable and very simple, is not true, and we have instead. I suspect that the issues are still present even with an additional condition, but finding a concrete example with an uncomputable policy is challenging.

proof: Suppose that and are locally equivalent. Let be an open neighborhood of the point and , be such that for all .

Since , we have . Because is an open neighborhood of , there is an integer such that for all . For such , we have This implies . However, this is not possible for all . Thus, our assumption that and are locally equivalent was wrong.

Assumptions about the solomonoff prior: For all , the sequence of actions that produces the sequence of with the highest probability is (recall that we start with observations in this setting). With this assumption, it can be seen that the policy that always picks action is among the best policies for both and .

I think this is actually a natural behaviour for a reasonable Solomonoff prior: It is natural to expect that is more likely than . It is natural to expect that the sequence of actions that leads to over has low complexity. Always picking is low complexity.

It is possible to construct an artificial UTM that ensures that "always take " is the best policy for , : An UTM can be constructed such that the corresponding Solomonoff prior assigns 3/4 probability to the program/environment "start with o_1. after action a_i, output o_i". The rest of the probability mass gets distributed according to some other more natural UTM.

Then, for , in each situation with history the optimal policy has to pick (the actions outside of this history have no impact on the utility): With 3/4 probability it will get utility of at least . And with probability at least . Whereas, for the choice of , with probability it will have utility of , and with probability it can get at most . We calculate , ie. taking action is the better choice.

Similarly, for , the optimal policy has to pick too in each situation with history . Here, the calculation looks like .

comment by David Matolcsi (matolcsid) · 2023-04-20T12:08:26.733Z · LW(p) · GW(p)

Thanks for Vanessa for writing this, I find it a useful summary of the goals and directions of LTA, which was sorely missing until now. Readers might also be interested in my write-up A mostly critical review of infra-Bayesianism [LW · GW] that tries to give a more detailed explanation about a subset of the questions above, and how much progress there was towards their solutions so far. I also give my thoughts and criticism of Infra-Bayesian Physicalism, the theory on which PSI rests.

I will also edit the post to include a link to this post. So far, I advised people to read Embedded agency [LW · GW] for the motivating questions, but now I can recommend this post too.

comment by RogerDearnaley (roger-d-1) · 2023-05-29T04:27:14.497Z · LW(p) · GW(p)

Subproblem 1.2/2.1: Traps

Allowing traps in the environment creates two different problems:

  • (Subproblem 1.2) Bayes-optimality becomes intractable in a very strong sense (even for a small number of deterministic MDP hypotheses with small number of states).
  • (Subproblem 2.1) It's not clear how to to talk about learnability and learning rates.

It makes some sense to consider these problems together, but different direction emphasize different sides.

 

Evolved organisms (such as humans) are good at dealing with traps: getting eaten is always a possibility. At the simplest level they do this by having multiple members of the species die, and using an evolutionary learning mechanism to evolve detectors for potential trap situations and some trap-avoiding behavior for this to trigger. An example of this might be the human instinct of vertigo near cliff edges — it's hard not to step back. The cost of this is that some number of individuals die from the traps before the species evolves a way of avoiding the trap.

As a sapient species using the scientific method, we have more sophisticated ways to detect traps. Often we may have a well-supported model of the world that lets us predict and avoid a trap ("nuclear war could well wipe out the human race, let's not do that"). Or we may have an unproven theory that predicts a possible trap, but that also predicts some less dangerous phenomenon. So rather than treating the universe like a multi-armed bandit and jumping into the potential trap to find out what happens and test our theory, we perform the lowest risk/cost experiment that will get us a good Bayesian update on the support for our unproven theory, hopefully at no cost to life or limb. If that raises the theory's support, then we become more cautious about the predicted trap, or if it lowers it, we become less. Repeat until your Bayesian updates converge on either 100% or 0%.

An evolved primate heuristic for this is "if nervous of an unidentified object, poke it with a stick and see what happens". This of course works better on, say, live/dead snakes than on some other perils that modern technology has exposed us to.

The basic trick here is to have a world model sophisticated enough that it can predict traps in advance, and we can find hopefully non-fatal ways of testing them that don't require us to jump into the trap. This requires that the universe has some regularities strong enough to admit models like this, as ours does. Likely most universes that didn't would be uninhabitable and life wouldn't evolve in them. 

comment by Frank_R · 2023-05-17T08:05:21.460Z · LW(p) · GW(p)

I have discovered another minor point. You have written at the beginning of Direction 17 that any computable utility function  is automatically continuous. This seems to be not always true.

I fix some definitions to make sure that we talk about the same stuff. For reasons of simplicity, I assume that and  are finite. Let   be the space of all infinite sequences with values in .  The -th projection  is given by

The product topology is defined as the coarsest topology such that all projection maps are continuous. A base of this topology consists of all sets  such that there are finitely many indices  and subsets  with

In particular, any open set contains such an  as a subset, which means that its image under  is  for all but finitely many . For my counterexample, let . Let  be a sequence with values in . If  is never 2, we define

Otherwise, we define  is computable in the sense that for any  we find a finite program whose input is a sequence such that  and  uses only finitely many values of . The preimage of the open set  is

which is not open since its th projection is always . Therefore,  is not contiuous. However, we have the following lemma:

Let  be a reward function and let  be given by

where  is the time discount rate. Then  is continuous with respect to the product topology.

Proof: Since the open intervals are a base of the standard topology of , it suffices to prove that the preimage of any interval  or  with  is open in . For reasons of simplicity, we consider only . The other cases are analogous. Let  such that . Moreover, let  such that . Finally, we choose an  such that . We define the set

 is an open subset of . Since the reward is non-negative, we have

and for any , we have

too. Therefore, 

and furthermore . All in all, any  has an open neighborhood that is a subset of . Therefore,  is open.

That a utility function  is continuous roughly means that for any  there are only finitely many events that have an influence of more than  on the utility. This could be a problem for studying longtermist agents with zero time discount rate. However, studying such agents is hard anyway since there is no guarantee that the sum of rewards converges and we have to deal with infinity ethics. As far as I know, it is standard in learning theory to avoid such situations by assuming a non-zero time discount rate or a finite time horizon. Therefore, it should not be a big deal to add the condition that  is continuous to all theorems.

Replies from: vanessa-kosoy
comment by Vanessa Kosoy (vanessa-kosoy) · 2023-05-17T08:29:33.042Z · LW(p) · GW(p)

Your alleged counterexample is wrong, because the  you constructed is not computable. First, "computable" means that there is a program  which receives  and  as inputs s.t. for all  and , it halts and

Second, even your weaker definition fails here. Let . Then, there is no program that computes  within accuracy , because for every  while . Therefore, determining the value of  within  requires looking at infinitely many elements of the sequence. Any program would that outputs  on  has to halt after reading some finite  symbols, in which case it would output  on  as well.

Replies from: Frank_R
comment by Frank_R · 2023-05-17T12:03:35.856Z · LW(p) · GW(p)

You are right and now it is clear, why your original statement is correct, too. Let  be an arbitrary computable utility function. As above, let  and  with  and . Choose  as in your definition of "computable". Since  terminates, its output depends only on finitely many . Now

is open and a subset of , since .  

comment by Mikhail Samin (mikhail-samin) · 2023-04-19T16:18:51.607Z · LW(p) · GW(p)

Is it fair to say that you’re trying to:

  • make a theory of agency that at least somewhat describes what we’ll likely see in the real world and also precisely corresponds to some parts of the space of possible agents;
  • find a way to talk about alignment and say what it means for agents to be aligned;
  • find a mathematical structure that corresponds to agents aligned with humans;
  • produce desiderata that can be used to engineer a training setup that might lead to a coherent, aligned agent?
comment by Review Bot · 2024-04-08T09:11:34.097Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2024. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?