# Embedded World-Models

post by abramdemski, Scott Garrabrant · 2018-11-02T16:07:20.946Z · score: 78 · 10 comments

*(A longer text-based version of this post is also available on MIRI's blog here, and the bibliography for the whole sequence can be found here.)*

*(Edit: This post had 15 slides added on Saturday 10th November.)*

*The next post in this sequence, 'Embedded Agency', will come out on Sunday, November 4th.*

*Tomorrow’s AI Alignment Forum sequences post will be 'The easy goal inference problem is still hard' by Paul Christiano, in the sequence 'Value Learning'.*

## 10 comments

This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will *inevitably* make theories of this *general type* irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:

> Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situated agents which provides foundations that are similarly as strong as the foundations Bayesianism provides for dualistic agents.

Most "theories of rational belief" I have encountered -- including Bayesianism in the sense I think is meant here -- are framed at the level of an evaluator outside the universe, and have essentially no content when we try to transfer them to individual embedded agents. This is because these theories tend to be derived in the following way:

- We want a theory of the best possible behavior for agents.
- We have some class of "practically achievable" strategies $S$, which can actually be implemented by agents. We note that an agent's observations provide some information about the quality of different strategies $s \in S$. So if it were possible to follow a rule like $R \equiv$ "find the best $s \in S$ given your observations, and then follow that $s$," this rule would spit out very good agent behavior.
- Usually we soften this to a performance-weighted average rather than a hard argmax, but the principle is the same: if we could search over all of $S$, the rule $R$ that says "do the search and then follow what it says" can be competitive with the very best $s \in S$. (Trivially so, since it has access to the best strategies, along with all the others.)
- But usually $R \notin S$. That is, the strategy "search over all practical strategies and follow the best ones" is not a *practical* strategy. But we argue that this is fine, since we are constructing a theory of *ideal* behavior. It doesn't have to be practically implementable.
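
As a rough illustration of the derivation pattern above, here is a minimal sketch of a performance-weighted rule over a tiny, finite strategy class. Everything in it is an assumption made for illustration: the labels `S` and `rule_R`, the toy bit-prediction task, and the three constant predictors. With a finite class the rule is trivially implementable; the argument above concerns the case where the class is "everything practical," so that running the rule costs more than any member can afford.

```python
from typing import Callable, List, Sequence

# A "strategy" here is a hypothetical predictor: given the bits seen so far,
# it returns its probability that the next bit is 1.
Strategy = Callable[[Sequence[int]], float]

def always(p: float) -> Strategy:
    """Strategy that predicts a 1 with fixed probability p."""
    return lambda history: p

def rule_R(strategies: Sequence[Strategy], bits: Sequence[int]) -> List[float]:
    """Performance-weighted average over the whole class.

    Each strategy's weight is multiplied by the probability it assigned to
    the bit that actually occurred, then the weights are renormalized.
    Returns the final weights after all bits have been seen.
    """
    weights = [1.0 / len(strategies)] * len(strategies)
    for t, bit in enumerate(bits):
        history = bits[:t]
        for i, s in enumerate(strategies):
            p_one = s(history)
            weights[i] *= p_one if bit == 1 else 1.0 - p_one
        total = sum(weights)
        weights = [w / total for w in weights]
    return weights

# Toy strategy class and toy data (six 1s, two 0s).
S = [always(0.2), always(0.5), always(0.8)]
data = [1, 1, 0, 1, 1, 1, 0, 1]
final = rule_R(S, data)
best = max(range(len(S)), key=lambda i: final[i])  # index of the top-weighted strategy
```

Note that `rule_R` has to evaluate every member of `S` at every step, which is the sense in which the rule sits "above" the class it searches.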

For example, in Solomonoff, $S$ is defined by computability while $R$ is allowed to be uncomputable. In the LIA construction, $S$ is defined by polytime complexity while $R$ is allowed to run slower than polytime. In logically-omniscient Bayes, finite sets of hypotheses can be manipulated in a finite universe but the full Boolean algebra over hypotheses generally cannot.

I hope the framework I've just introduced helps clarify what I find unpromising about these theories. By construction, any agent you can actually design and run is a *single* element of $S$ (a "practical strategy"), so every fact about rationality that can be incorporated into agent design gets "hidden inside" the individual $s \in S$, and the only things you can learn from the "ideal theory" $R$ are things which can't fit into a practical strategy.

For example, suppose (reasonably) that model averaging and complexity penalties are broadly good ideas that lead to good results. But all of the model averaging and complexity penalization that can be done *computably* happens inside some Turing machine or other, at the level "below" Solomonoff. Thus Solomonoff *only* tells you about the extra advantage you can get by doing these things *uncomputably*. Any kind of nice Bayesian average over Turing machines that can happen computably is (of course) just another Turing machine.
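
The closure point in the last sentence can be sketched in code. This is a toy with assumed names (`Predictor`, `constant`, `mixture`) and a finite member list standing in for "Turing machines"; it only shows that a computable Bayesian average has the same type as its members, so it can be dropped into another average like any other member.

```python
from typing import Callable, List, Sequence

# history of bits -> probability that the next bit is 1
Predictor = Callable[[Sequence[int]], float]

def constant(p: float) -> Predictor:
    """Predictor that always assigns probability p to a 1."""
    return lambda history: p

def mixture(members: List[Predictor]) -> Predictor:
    """Bayesian average of `members` under a uniform prior, as a Predictor."""
    def predict(history: Sequence[int]) -> float:
        # Posterior weight of each member is the likelihood it gave the history.
        weights = []
        for m in members:
            w = 1.0
            for t, bit in enumerate(history):
                p = m(history[:t])
                w *= p if bit == 1 else 1.0 - p
            weights.append(w)
        total = sum(weights)
        return sum(w * m(history) for w, m in zip(weights, members)) / total
    return predict

inner = mixture([constant(0.2), constant(0.8)])
outer = mixture([inner, constant(0.5)])  # the average is itself just a member
p_next = outer([1, 1, 1, 1])
```

Nothing distinguishes `inner` from a hand-written predictor once it is built, which is the sense in which the averaging is "hidden inside" a single practical strategy.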

This also explains why I find it misleading to say that good practical strategies constitute "approximations to" an ideal theory of this type. Of course, since $R$ just says to follow the best strategies in $S$, if you are following a very good strategy $s \in S$ your behavior will tend to be close to that of $R$. But this cannot be attributed to *any* of the searching over $S$ that $R$ does, since you are not doing a search over $S$; you are executing a *single member* of $S$ and ignoring the others. Any searching that can be done practically collapses down to a single practical strategy, and any that doesn't is not practical. Concretely, this talk of approximations is like saying that a very successful chess player "approximates" the rule "consult all possible chess players, then weight their moves by past performance." Yes, the skilled player will *play similarly* to this rule, but they are not *following* it, not even approximately! They are only themselves, not any other player.

Any theory of ideal rationality that wants to be a guide for embedded agents will have to be constrained in the same ways the agents are. But theories of ideal rationality usually get *all of their content* by going to a level above the agents they judge. So this new theory would have to be a very different sort of thing.

Thanks, this is a very clear framework for understanding your objection. Here's the first counterargument that comes to mind: Minimax search is a theoretically optimal algorithm for playing chess, but is too computationally costly to implement in practice. One could therefore argue that all that matters is computationally feasible heuristics, and modeling an ideal chess player as executing a minimax search adds nothing to our knowledge of chess. OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.
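
For concreteness, here is a minimal sketch of the bounded-depth search described above, written in negamax form. Chess is replaced by a toy subtraction game (take 1 to 3 stones, whoever takes the last stone wins), and the leaf heuristic is a deliberately uninformative placeholder; both are assumptions chosen to keep the example short, not anything proposed in the thread.

```python
from typing import List

def legal_moves(stones: int) -> List[int]:
    """A player may take 1, 2, or 3 stones, but no more than remain."""
    return [k for k in (1, 2, 3) if k <= stones]

def heuristic(stones: int) -> float:
    """Placeholder leaf evaluation from the mover's point of view.

    Returns 0.0 ("no opinion"); a real engine would put domain knowledge here.
    """
    return 0.0

def negamax(stones: int, depth: int) -> float:
    """Value of the position for the player to move, searching `depth` plies."""
    if stones == 0:
        return -1.0  # the previous player took the last stone and won
    if depth == 0:
        return heuristic(stones)  # search budget exhausted: fall back on the heuristic
    return max(-negamax(stones - k, depth - 1) for k in legal_moves(stones))

def best_move(stones: int, depth: int) -> int:
    """Pick the move whose resulting position is worst for the opponent."""
    return max(legal_moves(stones),
               key=lambda k: -negamax(stones - k, depth - 1))

move = best_move(6, 4)  # with 4 plies of lookahead, takes 2 and leaves 4 stones
```

With only one ply of lookahead every candidate move bottoms out in the placeholder heuristic, so the choice is then determined entirely by the heuristic rather than by search.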

Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search. There are all kinds of narrow abilities we'd like an AGI to have that depend on the detailed structure of the physical world, but it's not obvious that any such structure, beyond hypotheses about what is feasibly computable, could be usefully exploited to solve the kinds of problem laid out in this sequence. So it may well be that the best approach turns out to involve some sort of bounded search over simpler strategies, plus lots and lots of compute.

> OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "it's a truncation of the optimum," since they all are; these truncation properties are facts about the different series, not about the optimum. I illustrate with different series expansions for $\pi$.
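
A small numerical sketch of the truncation point, taking $\pi$ as the illustrative target and using two standard series for it (Leibniz and Nilakantha). The choice of these two series and of N = 10 is an assumption made here, not taken from the linked post.

```python
import math

def leibniz(n_terms: int) -> float:
    """pi = 4 * (1 - 1/3 + 1/5 - 1/7 + ...), truncated to n_terms terms."""
    return 4.0 * sum((-1) ** k / (2 * k + 1) for k in range(n_terms))

def nilakantha(n_terms: int) -> float:
    """pi = 3 + 4/(2*3*4) - 4/(4*5*6) + 4/(6*7*8) - ..., truncated likewise."""
    return 3.0 + sum(
        (-1) ** (k + 1) * 4.0 / ((2 * k) * (2 * k + 1) * (2 * k + 2))
        for k in range(1, n_terms + 1)
    )

N = 10
err_leibniz = abs(leibniz(N) - math.pi)
err_nilakantha = abs(nilakantha(N) - math.pi)
# Both are truncations of a series with the same limit, yet their errors at
# N terms differ by orders of magnitude.
```

Both functions converge to the same value, so the gap between `err_leibniz` and `err_nilakantha` is a property of the two series rather than of the limit they share.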

> Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search.

You may be right, and there are interesting conversations to be had about when solutions will tend to look like search and when they won't. But this doesn't feel like it really addresses my argument, which is not about "what kind of algorithm should you use" but about the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the *optimization* procedures you could ever do. There is a logical / definitional weirdness here that can't be resolved by arguments about what sorts of (logically / definitionally unproblematic) algorithms are good or bad in what domains.

> ...the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the *optimization* procedures you could ever do.

My most recent preprint discusses multi-agent Goodhart ( https://arxiv.org/abs/1810.10862 ) and uses the example of poker, along with a different argument somewhat related to the embedded agent problem, to say why the optimization over strategies needs to include optimizing over the larger solution space.

To summarize and try to clarify how I think it relates, strategies for game-playing must at least implicitly include a model of the other player's actions, so that an agent can tell which strategies will work against them. We need uncertainty in that model, because if we do something silly like assume they are rational Bayesian agents, we are likely to act non-optimally against their actual strategy. But the model of the other agent itself needs to account for their model of our strategy, including uncertainty about our search procedure for strategies - otherwise the space is clearly much too large to optimize over.

Does this make sense? (I may need to expand on this and clarify my thinking...)

Abram has made a major update to the post above, adding material on self-reference and the grain of truth problem. The corresponding text on the MIRI Blog version has also been expanded, with some extra material on those topics plus logical uncertainty.

Epistemic Status: Attempting to bridge what I see as a missing inferential link in the post / sequence.

(This is a point which I picked up on because I am familiar with what Abram was thinking about 3 years ago, and I was surprised it didn't get mentioned. Maybe it was assumed to be obvious, maybe it's not as relevant as I assumed, but I think some others will find the point worth a bit more explaining.)

The reason we care about the relative size of the world and the model is that we have a deep reason to think that a model smaller than the world cannot perform optimally: the Conant-Ashby Theorem, which states that "every good regulator of a system must be a model of that system." For a great explanation of this idea, there is a paper that Abram pointed me to years ago, "Every good key must be a model of the lock it opens (The Conant & Ashby Theorem Revisited)." To quote from there:

"What all of this means, more or less, is that the pursuit of a goal by some dynamic agent (Regulator) in the face of a source of obstacles (System) places at least one particular and unavoidable demand on that agent, which is that the agent's behaviors must be executed in such a reliable and predictable way that they can serve as a representation (Model) of that source of obstacles."

To lay the connection out explicitly: if the agent's model of the world is not isomorphic to the world, the actions chosen will be sub-optimal. This is bad if we assume the world is not isomorphic to a simple model (and this sequence is laying out reasons why, for reflexive agents, there cannot be such a computational model).

Some of these issues (obviously) are not limited to AI. Specifically, the problem of how to deal with multi-level models and "composability" was the subject of an applied research project for military applications by my dissertation chair, Paul Davis, here: https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG101.pdf

"The appealing imagery of arbitrary plug-and-play is fatally flawed for complex models... The more-complex [lower level] model components have typically been developed for particular purposes and depend on context-sensitive assumptions, some of which are tacit."

This issue has formed the basis of a fair amount of his later work as well, but this work focuses on practical advice, rather than conceptual understanding of the limitations. Still, that type of work may be useful as inspiration.

I can follow most of this, but I'm confused about one part of the premise.

What if the agent created a low-resolution simulation of its behavior, called it Approximate Self, and used that in its predictions? Is the idea that this is doable, but represents an unacceptably large loss of accuracy? Are we in a 'no approximation' context where any loss of accuracy is to be avoided?

My perspective: It seems to me that humans also suffer from the problem of embedded self-reference. I suspect that humans deal with this by thinking about a highly approximate representation of their own behavior. For example, when I try to predict how a future conversation will go, I imagine myself saying things that a 'reasonable person' might say. Could a machine use an analogous form of non-self-referential approximation?

Great piece, thanks for posting.

What if the agent, which is a quantum mechanical intelligence, CAN temporarily tunnel out of the environment long enough to make certain key observations/measurements? It could be both in the embedded environment AND out at the same time, as a hyper wavefunction or in the form of its own pilot wave. Thinking as a human is a quantum mechanical process to a degree. "You cannot change a system from within" is a psychological norm; however, if the agent is quantum mechanical in nature, then it is likely neither particle nor wave but something undeterminable by other agents. The agent might be in quantum flux indefinitely, n'est-ce pas? Hence the incompleteness theorem in both physics and mathematics.

Not sure why the above comment was downvoted to -15. It's a fair question, even if the person asking seems to misinterpret both quantum mechanics and mathematical logic. Quantum mechanics seems to be an accurate description of the "lower levels" of the agent's model of the universe, and mathematical logic is a useful meta-model that helps us construct better-quality models of the universe. They are not, as far as I know, interrelated, and there is no "hence". Additionally, while quantum mechanics is a good description of the microscopic world, it is much less useful at the level of living organisms (though ion channel opening and closing reflects the underlying quantum-mechanical tunneling), so there is no indication that human thinking is inherently quantum mechanical and could not some day be implemented by a classical computer without a huge complexity penalty.