# Embedded World-Models

post by abramdemski, Scott Garrabrant · 2018-11-02T16:07:20.946Z · score: 79 (24 votes) · LW · GW · 11 comments*(A longer text-based version of this post is also available on MIRI's blog* *here, and the bibliography for the whole sequence can be found* *here)*

*(Edit: This post had 15 slides added on Saturday 10th November.)*

*The next post in this sequence, 'Embedded Agency', will come out on Sunday, November 4th.*

*Tomorrow’s AI Alignment Forum sequences post will be 'The easy goal inference problem is still hard' by Paul Christiano, in the sequence 'Value Learning'.*

## 11 comments

Comments sorted by top scores.

This post feels quite similar to things I have written in the past to justify my lack of enthusiasm about idealizations like AIXI and logically-omniscient Bayes. But I would go further: I think that grappling with embeddedness properly will *inevitably* make theories of this *general type* irrelevant or useless, so that "a theory like this, except for embedded agents" is not a thing that we can reasonably want. To specify what I mean, I'll use this paragraph as a jumping-off point:

Embedded agents don’t have the luxury of stepping outside of the universe to think about how to think. What we would like would be a theory of rational belief for situatedagents which provides foundations that are similarly as strong as the foundations Bayesianism provides for dualistic agents.

Most "theories of rational belief" I have encountered -- including Bayesianism in the sense I think is meant here -- are framed at the level of an evaluator outside the universe, and have essentially no content when we try to transfer them to individual embedded agents. This is because these theories tend to be derived in the following way:

- We want a theory of the best possible behavior for agents.
- We have some class of "practically achievable" strategies , which can actually be implemented by agents. We note that an agent's observations provide some information about the quality of different strategies . So if it were possible to follow a rule like "find the best given your observations, and then follow that ," this rule would spit out very good agent behavior.
- Usually we soften this to a performance-weighted average rather than a hard argmax, but the principle is the same: if we could search over all of , the rule that says "do the search and then follow what it says" can be competitive with the very best . (Trivially so, since it has access to the best strategies, along with all the others.)
- But usually . That is, the strategy "search over all practical strategies and follow the best ones" is not a
*practical*strategy. But we argue that this is fine, since we are constructing a theory of*ideal*behavior. It doesn't have to be practically implementable.

For example, in Solomonoff, is defined by computability while is allowed to be uncomputable. In the LIA construction, is defined by polytime complexity while is allowed to run slower than polytime. In logically-omniscient Bayes, finite sets of hypotheses can be manipulated in a finite universe but the full Boolean algebra over hypotheses generally cannot.

I hope the framework I've just introduced helps clarify what I find unpromising about these theories. By construction, any agent you can actually design and run is a *single* element of (a "practical strategy"), so every fact about rationality that can be incorporated into agent design gets "hidden inside" the individual , and the only things you can learn from the "ideal theory" are things which can't fit into a practical strategy.

For example, suppose (reasonably) that model averaging and complexity penalties are broadly good ideas that lead to good results. But all of the model averaging and complexity penalization that can be done *computably* happens inside some Turing machine or other, at the level "below" Solomonoff. Thus Solomonoff *only* tells you about the extra advantage you can get by doing these things *uncomputably*. Any kind of nice Bayesian average over Turing machines that can happen computably is (of course) just another Turing machine.

This also explains why I find it misleading to say that good practical strategies constitute "approximations to" an ideal theory of this type. Of course, since just says to follow the best strategies in , if you are following a very good strategy in your behavior will tend to be close to that of . But this cannot be attributed to *any* of the searching over that does, since you are not doing a search over ; you are executing a *single member* of and ignoring the others. Any searching that can be done practically collapses down to a single practical strategy, and any that doesn't is not practical. Concretely, this talk of approximations is like saying that a very successful chess player "approximates" the rule "consult all possible chess players, then weight their moves by past performance." Yes, the skilled player will *play similarly* to this rule, but they are not *following* it, not even approximately! They are only themselves, not any other player.

Any theory of ideal rationality that wants to be a guide for embedded agents will have to be constrained in the same ways the agents are. But theories of ideal rationality usually get *all of their content* by going to a level above the agents they judge. So this new theory would have to be a very different sort of thing.

Thanks, this is a very clear framework for understanding your objection. Here's the first counterargument that comes to mind: Minimax search is a theoretically optimal algorithm for playing chess, but is too computationally costly to implement in practice. One could therefore argue that all that matters is computationally feasible heuristics, and modeling an ideal chess player as executing a minimax search adds nothing to our knowledge of chess. OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search. There are all kinds of narrow abilities we'd like an AGI to have that depend on the detailed structure of the physical world, but it's not obvious that any such structure, beyond hypotheses about what is feasibly computable, could be usefully exploited to solve the kinds of problem laid out in this sequence. So it may well be that the best approach turns out to involve some sort of bounded search over simpler strategies, plus lots and lots of compute.

OTOH, doing a minimax search of the game tree for some bounded number of moves, then applying a simple board-evaluation heuristic at the leaf nodes, is a pretty decent algorithm in practice.

I've written previously about this kind of argument -- see here (scroll down to the non-blockquoted text). tl;dr we can often describe the same optimum in multiple ways, with each way giving us a different series that approximates the optimum in the limit. Whether any one series does well or poorly when truncated to N terms can't be explained by saying "it's a truncation of the optimum," since they all are; these truncations properties are facts about the different series, not about the optimum. I illustrate with different series expansions for .

Furthermore, it seems like there's a pattern where, the more general the algorithmic problem you want to solve is, the more your solution is compelled to resemble some sort of brute-force search.

You may be right, and there are interesting conversations to be had about when solutions will tend to look like search and when they won't. But this doesn't feel like it really addresses my argument, which is not about "what kind of algorithm should you use" but about the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of the *optimization* procedures you could ever do. There is a logical / definitional weirdness here that can't be resolved by arguments about what sorts of (logically / definitionally unproblematic) algorithms are good or bad in what domains.

...the weirdness of the injunction to optimize over a space containing every procedure you could ever do, including all of theoptimizationprocedures you could ever do.

My most recent preprint discusses multi-agent Goodhart ( https://arxiv.org/abs/1810.10862 ) and uses the example of poker, along with a different argument somewhat related to the embedded agent problem, to say why the optimization over strategies needs to include optimizing over the larger solution space.

To summarize and try to clarify how I think it relates, strategies for game-playing must at least implicitly include a model of the other player's actions, so that an agent can tell which strategies will work against them. We need uncertainty in that model, because if we do something silly like assume they are rational Bayesian agents, we are likely to act non-optimally against their actual strategy. But the model of the other agent itself needs to account for their model of our strategy, including uncertainty about our search procedure for strategies - otherwise the space is clearly much too large to optimize over.

Does this make sense? (I may need to expand on this and clarify my thinking...)

Abram has made a major update to the post above, adding material on self-reference and the grain of truth problem. The corresponding text on the MIRI Blog version has also been expanded, with some extra material on those topics plus logical uncertainty.

Epistemic Status: Attempting to bridge what I see as a missing inferential link in the post / sequence.

(This is a point which I picked up on because I am familiar with what Abram was thinking about 3 years ago, and I was surprised it didn't get mentioned. Maybe it was assumed to be obvious, maybe it's not as relevant as I assumed, but I think some others will find the point worth a bit more explaining.)

The reason we care about the relative size of the world and the model is that we have a deep reason to think that a model smaller than the world cannot perform optimally - it's the Conant-Ashby Theorem, which states "every good regulator of a system must be a model of that system." For a great explanation of this idea, there is a paper that Abram pointed me to years ago, "Every good key must be a model of the lock it opens (The Conant & Ashby Theorem Revisited)" To quote from there:

"What all of this means, more or less, is that the pursuit of a goal by some dynamic agent (Regulator) in the face of a source of obstacles (System) places at least one particular and unavoidable demand on that agent, which is that the agent's behaviors must be executed in such a reliable and predictable way that they can serve as a representation (Model) of that source of obstacles."

To lay the connection out explicitly, if the agent model of the world is not isomorphic to the world, the actions chosen will be sub-optimal. This is bad if we assume the world is not isomorphic to a simple model (and this sequence is laying out reasons that for reflexive agents, there cannot be such a computational model.)

Some of these issues (obviously) are not limited to AI. Specifically, the problem of how to deal with multi-level models and "composibility" was the subject of an applied research project for military applications by my dissertation chair, Paul Davis, here: https://www.rand.org/content/dam/rand/pubs/monographs/2004/RAND_MG101.pdf -

"The appealing imagery of arbitrary plug-and-play is fatally flawed for complex models... The more-complex [lower level] model components have typically been developed for particular purposes and depend on context-sensitive assumptions, some of which are tacit."

This issue has formed the basis of a fair amount of his later work as well, but this work focuses on practical advice, rather than conceptual understanding of the limitations. Still, that type of work may be useful as inspiration.

I think this article is too vague, because for almost almost claims in it I am not sure if I understand the author correctly. Below I am posting my notes. If you want to help me and others clarify understanding of this article, consider answering **questions in bold**, or, if you see a mistake in my notes, correcting it. Also I hope my notes help the author as a piece of feedback. I've only finished 2/3 of the article so far, but posting notes because I might become less interested in this later.

Also it's unfortunate that unlike in https://intelligence.org/2018/11/02/embedded-models/ version of this article we don't have hyperlinks to explanations of various concepts here. Perhaps you could add them under the corresponding images? Or have images themselves be hyperlinks or reference links (like in academic articles) to the bottom of the document where all relevant links would be stored grouped by image number.

The post says an embedded agent can't hold an exact model of the
environment in its head, can't think through the consequences of every
potential course of action, can't hold in its head every possible way the environment
could be. **I think this may not be necessarily true and I am not sure what assumptions the author used here.**

**It seems the whole article assumes countable probability spaces (even before the AIXI part). I wonder why and I wonder how realizability is defined for uncountable probability space.**

--

Regarding relative bounded loss and what this bound is for, my best guess is as follows. Here I use non-conditional probability notation instead of . Suppose some event e is actually true. Let be some "expert" event in the probability space. According to prior, probability of e equals , and its logarithm of probability has a lower bound . Now, according to the expert h, its probability equals just , and its logarithm of probability equals . I conclude that relative bounded loss is the difference between prior logarithm of probability and logarithm of probability of the expert h, which turns out to be at most .

Initially, is your initial trust in expert h, and in each case where it is even a little bit more correct than you, you increase your trust accordingly; the way you do this ensures you assign an expert probability 1 and hence copy it precisely before you lose more than compared to it.

Remember, . It follows
that probability of h increases given evidence e if and only if , i.e. h "is even a little bit more
correct than you". **But I don't understand the bit about copying the expert
h precisely before losing more than **, because losing more
than is logically impossible (assuming ), as was shown above.

Combining this with the previous idea about viewing Bayesian learning as a way of allocating “trust” to “experts” which meets a bounded loss condition, we can see the Solomonoff prior as a kind of ideal machine learning algorithm which can learn to act like any algorithm you might come up with, no matter how clever.

It is assuming all possible algorithms are computable, not that the world is.

I don't understand this. Our probability space is the cartesian product of
the set of all possible UTM programs and the set of all possible UTM
working tape initial configurations. Or, equivalently, the set of outputs
of UTM under these conditions. Hence our whole hypothesis space only
includes computable worlds. **What does "can learn to act like any algorithm"
mean here?** "It's getting bounded loss on its predictive accuracy as
compared with any computable predictor." Huh? **Does predictor here mean
expert h? If yes, what does it mean that h is computable and why? All in
all, is the author claiming it's impossible to have a better computable
predictor than AIXI with Solomonoff prior, even if it has non-computable
worlds in the probability space?**

probabilities may not be calibrated identification of causal structure may not work

**What do these mean?** I only know informally what calibration means
related to forecasting.

So, does AIXI perform well without a realizability assumption?

**How is AIXI even defined without realizability, i.e. when the actual world isn't in the probability space, or it has zero prior probability?**

This is fine if the world “holds still” for us; but because the map is in the world, it may implement some function.

**Is this about the world changing because of the agent just thinking? Or
something else?**

It should be noted, though, that there are additional barriers to getting this property in a game-theoretic setting; so in their common usage cases, "grain of truth" is technically demanding while "realizability" is a technical convenience.

...

In game theory, on the other hand, the assumption itself may be inconsistent. This is because games commonly yield paradoxes of self-reference.

From the former paragraph I don't understand anything except that (the author claims) game theory has more problems with grain of truth / realizability, than AIXI. After the latter paragraph, my best guess is: for any game, if there is no pure strategy equilibrium in it, then we say it has no grain of truth, because for every possible outcome rational agents wouldn't choose it.

If we put weight in both places until a proof rules one out, the beliefs just oscillate forever rather than doing anything useful.

Weights represent possible worlds, therefore they are on the scales right
from the beginning (the prior), we never put new weights on the scales. **My
probably incorrect guess of what the author is saying is**
some agent which acts like AIXI but
instead of updating on pieces of evidence as soon as he receives it, he
stockpiles it, and at some points he (boundedly) searches for proofs that
these pieces of evidence are in favor of some hypothesis and performs
update only when he finds them. **But still, why oscillation?**

Any computable beliefs about logic must have left out something, since the tree will grow larger than any container.

I interpret it as there are infinitely many theorems, hence an agent with finite amount of space or finite amount of computation steps can't process all of them.

Another consequence of the fact that the world is bigger than you is that you need to be able to use high-level world models: models which involve things like tables and chairs.

This is related to the classical symbol grounding problem; but since we want a formal analysis which increases our trust in some system, the kind of model which interests us is somewhat different. This also relates to transparency and informed oversight: world-models should be made out of understandable parts.

**No idea what the second quoted paragraph means.**

All in all, I doubt that high level world models are necessary. And it's very not clear what is meant by "high level" or "things" here. Perhaps embedded agents can (boundedly) reason about the world in other ways, e.g. by modeling only part of the world.

https://intelligence.org/files/OntologicalCrises.pdf explains the ontological crisis idea better. Suppose our AIXI-like agent thinks the world is an elementary outcome of some parameterized probability distribution with the parameter θ. θ is either 1 or 2. We call the set of elementary outcomes with θ=1 the first ontology (e.g. possible worlds running on classical mechanics), and the set of elementary outcomes with θ=2 the second ontology (e.g. possible worlds running on superstrings theory). The programmer has only programmed the agent's utility functiom for θ=1 part, i.e. a u function from ontology 1 to real numbers. The agent keeps count of which value of θ is more probable and chooses actions by considering only current ontology. If at some point he decides that the second ontology is more useful, he switches to it. The agent should extrapolate the utility function to θ=2 part. How can he do it?

I can follow most of this, but i'm confused about one part of the premise.

What if the agent created a low-resolution simulation of its behavior, called it Approximate Self, and used that in its predictions? Is the idea that this is doable, but represents a unacceptably large loss of accuracy? Are we in a 'no approximation' context where any loss of accuracy is to be avoided?

My perspective: It seems to me that humans also suffer from the problem of embedded self-reference. I suspect that humans deal with this by thinking about a highly approximate representation of their own behavior. For example, when i try to predict how a future conversation will go, i imagine myself saying things that a 'reasonable person' might say. Could a machine use a analogous form of non-self-referential approximation?

Great piece, thanks for posting.

What if the agent which is a quantum mechanical intelligence CAN temporarily tunnel out of the environment long enough to make certain key observations/measurements. It could be both in the embedded environment AND out at the same time as a hyper wavefunction or in the form of its own pilot wave? Thinking as a human is a quantum mechanical process to a degree. You cannot change a system from within it is a psychological norm, however if the agent is quantum mechanical in nature then it is likely neither particle nor wave but something undeterminable by other agents. The agent might be in quantum flux indefinately n'est pas? Hence incompleteness theorem in both physics and mathematics.

Not sure why the above comment was downvoted to -15. It's a fair question, even if the person asking seems to misinterpret both quantum mechanics and mathematical logic. Quantum mechanics seems to be an accurate description of the "lower levels" of the agent's model of the universe, and mathematical logic is a useful meta-model that helps us construct better quality models of the universe. They are not, as far, as I know, interrelated, and there is no "hence". Additionally, while quantum mechanics is a good description of the microscopic world, it is much less useful at the level of living organisms (though ion channel opening and closing reflects the underlying quantum-mechanical tunneling), so there is no indication that human thinking is inherently quantum mechanical and could not be some day implemented by a classical computer without a huge complexity penalty.