Humans Are Embedded Agents Too
post by johnswentworth · 2019-12-23T19:21:15.663Z · LW · GW · 21 comments
Contents
- The Keyboard is Not The Human
- Modified Humans
- Off-Equilibrium
- Drinking
- Value Drift
- Akrasia
- Preferences Over Quantum Fields
- Unrealized Implications
- Socially Strategic Self-Modification
Most models of agency (in game theory, decision theory, etc.) implicitly assume that the agent is separate from the environment - there is a “Cartesian boundary” between agent and environment. The embedded agency sequence goes through a long list of theoretical/conceptual problems which arise when an agent is instead embedded in its environment. Some examples:
- No defined input/output channels over which to optimize
- Agent might accidentally self-modify, e.g. drop a rock on its head
- Agent might intentionally self-modify, e.g. change its own source code
- Hard to define hypotheticals which don’t actually happen, e.g. “I will kill the hostages if you don’t pay the ransom”
- Agent may contain subcomponents which optimize for different things
- Agent is made of parts (e.g. atoms) whose behavior can be predicted without thinking of the agent as agenty - e.g. without thinking of the agent as making choices or having beliefs
- Agent is not logically omniscient: it cannot know all the implications of its own beliefs
The embedded agency sequence mostly discusses how these issues create problems for designing reliable AI. Less discussed is how these same issues show up when modelling humans - and, in particular, when trying to define human values (i.e. “what humans want”). Many - arguably most - of the problems alignment researchers run into when trying to create robust pointers to human values are the same problems we encounter when talking about embedded agents in general.
I’ll run through a bunch of examples below, and tie each to a corresponding problem-class in embedded agency. While reading, bear in mind that directly answering the questions posed is not the point. The point is that each of these problems is a symptom of the underlying issue: humans are embedded agents. Patching over each problem one-by-one will produce a spaghetti tower; ideally we’d tackle the problem closer to the root.
The Keyboard is Not The Human
Let’s imagine that we have an AI which communicates with its human operator via screen and keyboard. It tries to figure out what the human wants based on what’s typed at the keyboard.
A few possible failure modes in this setup:
- The AI wireheads by seizing control of the keyboard (either intentionally or accidentally)
- A cat walks across the keyboard every now and then, and the AI doesn’t realize that this input isn’t from the human
- After a code patch, the AI filters out cat-input, but also filters out some confusing (but important) input from the human
Embedded agency problem: humans do not have well-defined output channels. We cannot just point to a keyboard and say “any information from that keyboard is direct output from the human”. Of course we can come up with marginally better solutions than a keyboard - e.g. voice recognition - but eventually we’ll run into similar issues. There is nothing in the world we can point to and say “that’s the human’s output channel, the entire output channel, and nothing but the output channel”. Nor does any such output channel exist, so e.g. we won’t solve the problem just by having uncertainty over where exactly the output channel is.
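To make the cat-versus-patch failure mode concrete, here is a minimal sketch. Everything in it is hypothetical and for illustration only: the `Keystrokes` record, the tiny vocabulary, and the `looks_like_language` filter are assumptions, not a description of any real system. The `source` field is ground truth that the AI never gets to observe.

```python
from dataclasses import dataclass


@dataclass
class Keystrokes:
    text: str
    source: str  # ground truth the AI never observes: "human" or "cat"


def looks_like_language(text: str) -> bool:
    """Crude hypothetical patch: accept input only if at least half the tokens are known words."""
    vocabulary = {"please", "stop", "the", "reactor", "now", "make", "tea"}
    words = text.lower().split()
    if not words:
        return False
    known = sum(w in vocabulary for w in words)
    return known / len(words) >= 0.5


events = [
    Keystrokes("please make tea", "human"),
    Keystrokes("asdfjkl;;;qwe", "cat"),
    Keystrokes("STOPPP teh reactr nowww", "human"),  # panicked, typo-ridden, important
]

# What the patched AI treats as "the human's output".
accepted = [e for e in events if looks_like_language(e.text)]

# Human input that the patch silently discards along with the cat's.
dropped_human_input = [
    e for e in events if e.source == "human" and not looks_like_language(e.text)
]
```

The filter removes the cat's input, but it also removes the panicked human message. A better filter would move the error around rather than eliminate it, since the filter only ever sees the keystrokes, never the `source` label.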
Modified Humans
Because humans are embedded in the physical world, there is no fundamental block to an AI modifying us (either intentionally or unintentionally). Define what a “human” is based on some neural network which recognizes humans in images, and we risk an AI modifying the human by externally-invisible means ranging from drugs to wholesale replacement.
Embedded agency problem: no Cartesian boundary. All the human-parts can be manipulated/modified; the AI is not in a different physical universe from us.
Off-Equilibrium
Human choices can depend on off-equilibrium behavior - what we or someone else would do, in a scenario which never actually happens. Game theory is full of examples, especially threats: we don’t launch our nukes because we expect our enemies would launch their nukes… yet what we actually expect to happen is for nobody to launch any nukes. Our own behavior is determined by “possibilities” which we don’t actually expect to happen, and which may not even be possible. Embedded agency problem: counterfactuals.
Going even further: our values themselves can depend on counterfactuals. My enjoyment of a meal sometimes depends on what the alternatives were, even when the meal is my top pick - I’m happier if I didn’t pass up something nearly-as-good. We’re often unhappy to be forced into a choice, even if it’s a choice we would have made anyway. What does it mean to “have a choice”, in the sense that matters for human values? How do we physically ground that concept? If we want a friendly AI to allow us choices, rather than force us to do what’s best for us, then we need answers to questions like these.
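The nuclear deterrence example can be sketched as a toy two-stage game. The payoff numbers and the names `PAYOFF_TO_A`, `best_move_for_a`, and `observed_history` are assumptions chosen purely for illustration. The point is that A's move depends on what B would do at a node which, under deterrence, is never reached.

```python
# Hypothetical payoffs to player A, keyed by (A's move, B's response).
PAYOFF_TO_A = {
    ("hold", None): 0,
    ("launch", "retaliate"): -10,
    ("launch", "stand_down"): 1,
}


def best_move_for_a(b_policy_if_attacked: str) -> str:
    """A compares holding against launching, given what B would do if attacked."""
    launch_value = PAYOFF_TO_A[("launch", b_policy_if_attacked)]
    hold_value = PAYOFF_TO_A[("hold", None)]
    return "launch" if launch_value > hold_value else "hold"


def observed_history(b_policy_if_attacked: str) -> list:
    """The events that actually happen, which is all an outside observer gets to see."""
    a_move = best_move_for_a(b_policy_if_attacked)
    if a_move == "hold":
        return ["A holds"]
    return ["A launches", "B: " + b_policy_if_attacked]
```

When B's policy is "retaliate", the observed history is just `["A holds"]`: B's policy never shows up in the record, yet it is the entire reason A holds. An observer trying to infer A's values from the history alone has nothing to go on without a model of the counterfactual.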
Drinking
Humans have different preferences while drunk than while sober [CITATION NEEDED]. When pointing an AI at “human values”, it’s tempting to simply say “don’t count decisions made while drunk”. But on the other hand, people often drink to intentionally lower their own inhibitions - suggesting that, at a meta-level, they want to self-modify into making low-inhibition decisions (at least temporarily, and within some context, e.g. at a party).
Embedded agency problem: self-modification and robust delegation. When a human intentionally self-modifies, to what extent should their previous values be honored, to what extent their new values, and to what extent their future values?
Value Drift
Humans generally have different values in childhood, middle age, and old age. Heck, humans have different values just from being hangry! Suppose a human makes a precommitment, and then later on, their values drift - the precommitment becomes a nontrivial constraint, pushing them to do something they no longer wish to do. How should a friendly AI handle that precommitment?
Embedded agency problem: tiling & delegation failures. As humans propagate through time, our values are not stable, even in the absence of intentional self-modification. Unlike in the AI case, we can’t just design humans to have more stable values. (Or can we? Would that even be desirable?)
Akrasia
Humans have subsystems. Those subsystems do not always want the same things. Stated preferences and revealed preferences do not generally match. Akrasia exists; many people indulge in clicker games no matter how much some other part of themselves wishes they could be more productive.
Embedded agency problem: subsystem alignment. Human subsystems are not all aligned all the time. Unlike the AI case, we can’t just design humans to have better-aligned subsystems - first we’d need to decide what to align them to, and it’s not obvious that any one particular subsystem contains the human’s “true” values.
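A toy sketch of the gap between stated and revealed preferences. The two subsystems, their utility numbers, and the linear blending by a `fatigue` parameter are all assumptions made up for illustration; this is not a claim about how human cognition actually works.

```python
OPTIONS = ["work", "clicker_game"]

# Hypothetical utilities for two subsystems of the same person.
PLANNER = {"work": 1.0, "clicker_game": 0.0}
IMPULSE = {"work": 0.2, "clicker_game": 0.9}


def stated_preference() -> str:
    """What the person says they want: here, the planner subsystem answers."""
    return max(OPTIONS, key=PLANNER.get)


def revealed_choice(fatigue: float) -> str:
    """What the person actually does; fatigue in [0, 1] shifts control toward impulse."""
    def blended(option: str) -> float:
        return (1 - fatigue) * PLANNER[option] + fatigue * IMPULSE[option]
    return max(OPTIONS, key=blended)
```

With low fatigue (say 0.1) the revealed choice matches the stated preference; with high fatigue (say 0.8) it flips to the clicker game. Neither `PLANNER` nor `IMPULSE` is labeled as the "true" utility function anywhere in the code, which is the problem in miniature.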
Preferences Over Quantum Fields
Humans generally don’t have preferences over quantum fields directly. The things we value are abstract, high-level objects and notions. Embedded agency problem: multi-level world models. How do we take the abstract objects/notions over which human values operate, and tie them back to physical observables?
At the same time, our values ultimately need to be grounded in quantum fields, because that’s what the world is made of. Human values should not seemingly cease to exist just because the world is quantum and we thought it was classical. It all adds up to normality. Embedded agency problem: ontological crises. How do we ensure that a friendly AI can still point to human values even if its model of the world fundamentally shifts?
Unrealized Implications
I have, on at least one occasion, completely switched a political position in about half an hour after hearing an argument I had not previously considered. More generally, we humans tend to update our beliefs, our strategies, and what-we-believe-to-be-our-values as new implications are realized.
Embedded agency problem: logical non-omniscience. We do not understand the full implications of what we know, and sometimes we base our decisions/strategies/what-we-believe-to-be-our-values on flawed logic. How is a friendly AI to recognize and handle such cases?
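Logical non-omniscience can be illustrated with a bounded reasoner. This is a minimal sketch under stated assumptions: the `derive` function and its one-rule-per-step budget are hypothetical, standing in for any agent with limited time to think.

```python
def derive(facts, rules, max_steps):
    """Apply modus ponens at most max_steps times, firing at most one rule per step.

    rules is a list of (premise, conclusion) pairs.
    """
    known = set(facts)
    for _ in range(max_steps):
        # Find the first rule whose premise is known and whose conclusion is new.
        new = next(
            (c for (p, c) in rules if p in known and c not in known),
            None,
        )
        if new is None:
            break  # nothing left to derive
        known.add(new)
    return known


facts = {"A"}
rules = [("A", "B"), ("B", "C")]

beliefs_with_no_time = derive(facts, rules, 0)
beliefs_after_one_step = derive(facts, rules, 1)
beliefs_after_two_steps = derive(facts, rules, 2)
```

With a budget of zero steps the agent "knows" A and the rule from A to B, yet does not believe B. Any decisions or professed values that hinge on B or C can change once the agent gets more time, or hears the argument from someone else.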
Socially Strategic Self-Modification
Because humans are all embedded in one physical world, lying is hard. There are side-channels which leak information, and humans have long since evolved to pay attention to those side-channels. One side effect: the easiest way to “deceive” others is to deceive oneself, via self-modification. Embedded agency problem: coordination with visible source code, plus self-modification.
We earnestly adopt both the beliefs and values of those around us. Are those our “true” values? How should a friendly AI treat values adopted due to social pressure? More generally, how should a friendly AI handle human self-modifications driven by social pressure?
Combining this with earlier examples: perhaps we spend an evening drunk because it gives us a socially-viable excuse to do whatever we wanted to do anyway. Then the next day, we bow to social pressure and earnestly regret our actions of the previous night - or at least some of our subsystems do. Other subsystems still had fun while drunk, and we do the same thing the next weekend. What is a friendly AI to make of this? Where, in this mess, are the humans’ “values”?
These are the sorts of shenanigans one needs to deal with when dealing with embedded agents, and I expect that a better understanding of embedded agents in general will lead to substantial insights about the nature of human values.
21 comments
comment by Scott Garrabrant · 2019-12-24T20:38:04.732Z · LW(p) · GW(p)
We actually avoided talking about AI in most of the cartoon, and tried to just imply it by having a picture of a robot.
The first time (I think) I presented the factoring in the embedded agency sequence was at a MIRI CFAR collaboration workshop, so parallels with humans were live in my thinking.
The first time we presented the cartoon in roughly its current form was at MSFP 2018, where we purposely did it on the first night before a CFAR workshop, so people could draw analogies that might help them transfer their curiosity in both directions.
comment by Gordon Seidoh Worley (gworley) · 2019-12-26T20:36:21.085Z · LW(p) · GW(p)
I agree and think this is an unappreciated idea, which is why I liberally link the embedded agency post in things I write. I'm not sure I'm doing a perfect job of not forgetting we are all embedded, but I consider it important and essential to not getting confused about, for example, human values, and think many of the confusions we have (especially the ones we fail to notice) are a result of incorrectly thinking, to put it another way, that the map does not also reside in the territory.
comment by Shmi (shminux) · 2019-12-24T07:24:30.550Z · LW(p) · GW(p)
It makes sense that to address the challenges of the agent being embedded one needs to start at the very foundations. I suspect that there is a fair bit of work before even addressing embedding the agents. For example, in a basic map-territory correspondence the map is a part of the territory. So, a question arises: what does it mean for a part of the territory to be a coarse-grained representation of the territory? What restrictions does it place on the type of territories that are internally mappable to begin with? For example, it has to admit lossy compression of some kind, yet not be completely fractal. Anyway, my point is that focusing on the agency may be the wrong place to start; there are more basic questions of embeddings that need to be addressed first. And even figuring out what those questions might be would count as progress.
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-24T22:55:26.547Z · LW(p) · GW(p)
I strongly agree with this. Those sorts of questions are exactly what I see as the main objective of my own research right now.
comment by justinpombrio · 2019-12-26T18:28:41.455Z · LW(p) · GW(p)
This post points out that many alignment problems can be phrased as embedded agency problems. It seems to me that they can also all be phrased as word-boundary problems. More precisely, for each alignment/embedded-agency problem listed here, there's a question (or a set of questions) of the form "what is X?" such that answering that question would go a long way toward solving the alignment/embedded-agency problem, and vice-versa.
Is this a useful reduction?
The "what is X?" question I see for each problem:
The Keyboard is Not The Human
What does it mean for a person to "say" something (in the abstract sense of the word)?
Modified Humans
What is a "human"? Furthermore, what does it mean to "modify" or "manipulate" a human?
Off-Equilibrium
What are the meanings of counterfactual statements? For example, what does it mean to say "We will launch our nukes if you do."?
Perhaps also, what is a "choice"?
Drinking
What is a "valid profession of one's values"?
Value Drift
What are a person's "values"? Focus being on people changing over time.
Akrasia
What is a "person", and what are a person's "values"? Focus being on people being made of disparate parts.
Preferences Over Quantum Fields
What are the meanings of abstract, high-level statements? Do they change if your low-level model of the world fundamentally shifts?
Unrealized Implications
What are a person's "values"? Focus being on someone knowing A and knowing A->B but not yet knowing B.
Socially Strategic Self-Modification
What are a person's "true values"? Focus being on self-modification.
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-26T21:00:36.174Z · LW(p) · GW(p)
Yes and no.
I do think you're pointing to the right problems - basically the same problems Shminux was pointing at in his comment, and the same problems which I think are the most promising entry point to progress on embedded agency in general.
That said, I think "word boundaries" is a very misleading label for this class of problems. It suggests that the problem is something like "draw a boundary around points in thing-space which correspond to the word 'tree'", except for concepts like "values" or "person" rather than "tree". Drawing a boundary in thing-space isn't really the objective here; the problem is that we don't know what the right parameterization of thing-space is or whether that's even the right framework for grounding these concepts at all.
Here's how I'd pose it. Over the course of history, humans have figured out how to translate various human intuitions into formal (i.e. mathematical) models. For instance:
- Game theory gave a framework for translating intuitions about "strategic behavior" into math
- Information theory gave a framework for translating intuitions about information into math
- More recently, work on causality gave a framework for translating intuitions about counterfactuals into math
- In the early days, people like Galileo showed how to translate physical intuitions into math
A good heuristic: if a class of intuitive reasoning is useful and effective in practice, then there's probably some framework which would let us translate those intuitions into math. In the case of embedded-agency-related problems, we don't yet have the framework - just the intuitions.
With that in mind, I'd pose the problem as: build a framework for translating intuitions about "values", "people", etc into math. That's what we mean by the question "what is X?".
Replies from: justinpombrio
↑ comment by justinpombrio · 2019-12-27T00:51:53.784Z · LW(p) · GW(p)
Ooh, that is very insightful. The word-boundary problem around "values" feels fuzzy and ill-defined, but that doesn't mean that the thing we care about is actually fuzzy and ill-defined.
comment by Noosphere89 (sharmake-farah) · 2025-01-21T17:06:18.239Z · LW(p) · GW(p)
IMO, the I/O part is not about the lack of such a channel, but rather the lack of a channel that is invulnerable to hacking/modification, such that the channel can be assumed to only come from a certain source.
You could always create such a channel, though it wouldn't be fundamental; the issue is rather that you can't create a channel that can't be modified/hacked, such that the channel can be assumed to only come from a certain source.
I like dxu's comment:
https://www.lesswrong.com/s/Rm6oQRJJmhGCcLvxh/p/zcPLNNw4wgBX5k8kQ#uFdZuNY3XxBBakLv7
comment by Noosphere89 (sharmake-farah) · 2025-01-20T01:22:35.476Z · LW(p) · GW(p)
Some thoughts on the embedded agents part today, now that I'm inspired to have thoughts on it.
On unrealized implications, I don't think this is exactly an embedded agent problem so much as a problem of limited computational abilities.
More seriously, I suspect it's possible for an infinite agent to be both embedded within the structure of its universe and also be logically/computationally omniscient, but if we do impose a condition of finiteness, the unrealized implications part comes back.
So in that sense, I think it's not exactly a problem of being in the world, but rather being finite.
But the finiteness condition is fine for now, so I'll talk about other things.
A lot of embedded agency problems, IMO, are either created or significantly enhanced by physical universality, which is semi-plausible for our universe. In particular, a big thing that physical universality does for embedded agency is that you can no longer create a perfect isolator, because the environment can always revitalize an isolated area; this is why any reversible cellular automaton that allows for perfect walls cannot be physically universal.
This means that there's no ground truth Cartesian boundary available that persists for all time, which breaks the abstraction of an agent separated from its environment, which means reward corruption and self-modification can happen.
Thus, we have to replace it by a theory that can handle shifts in boundaries. Ideally, the boundary should either be arbitrarily shiftable or not exist at all, but this creates problems since physical universality is way less studied than computational universality, and their interaction is not studied at all.
The I/O part is not about the lack of such a channel, but rather the lack of a channel that is invulnerable to hacking/modification, such that the channel can be assumed to only come from a certain source.
comment by Rohin Shah (rohinmshah) · 2019-12-29T02:27:57.653Z · LW(p) · GW(p)
Planned summary for the Alignment Newsletter:
<@Embedded agency@>(@Embedded Agents@) is not just a problem for AI systems: humans are embedded agents too; many problems in understanding human values stem from this fact. For example, humans don't have a well-defined output channel: we can't say "anything that comes from this keyboard is direct output from the human", because the AI could seize control of the keyboard and wirehead, or a cat could walk over the keyboard, etc. Similarly, humans can "self-modify", e.g. by drinking, which often modifies their "values": what does that imply for value learning? Based on these and other examples, the post concludes that "a better understanding of embedded agents in general will lead to substantial insights about the nature of human values".
Planned opinion:
I certainly agree that many problems with value learning stem from embedded agency issues with humans, and any <@formal account@>(@Why we need a *theory* of human values@) of this will benefit from general progress in understanding embeddedness. Unlike many others, I do not think we need a formal account of human values; I think a "common-sense" understanding will suffice, including for the embeddedness problems detailed in this post.
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-29T06:24:40.859Z · LW(p) · GW(p)
One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values. For instance, a system with a human in the loop may not need to learn values; it could rely on the human to provide value judgements. On the other hand, the human still needs to point to their own values in a manner usable/interpretable by the rest of the system (possibly with the human doing the "interpretation", as in e.g. tool AI). Also, the system still needs to point to the human somehow - cats walking on keyboards are still a problem.
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that. (Or if someone else has written up views similar to your own, that works too.)
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2019-12-29T20:35:07.761Z · LW(p) · GW(p)
One (possibly minor?) point: this isn't just about value learning; it's the more general problem of pointing to values.
Makes sense, I changed "value learning" to "figuring out what to optimize".
Also, if you have written up your views on these sorts of problems, and how human-common-sense understanding will solve them, I'd be interested to read that.
Hmm, I was going to say Chapter 3 of the Value Learning sequence, but looking at it again it doesn't really talk about this. Maybe the post on Following human norms gives some idea of the flavor of what I mean, but it doesn't explicitly talk about it. Perhaps I should write about this in the future.
Here's a brief version:
We'll build ML systems with common sense, because common sense is necessary for tasks of interest; common sense already deals with most (all?) of the human embeddedness problems. There are still two remaining problems:
- Ensuring the AI uses its common sense when interpreting our goals / instructions. We'll probably figure this out in the future; it seems likely that "give instructions in natural language" automatically works (this is the case with human assistants for example).
- Ensuring the AI is not trying to deceive us. This seems mostly-independent of human embeddedness. You can certainly construct examples where human embeddedness makes it hard to tell whether something is deceptive or not, but I think in practice "is this deceptive" is a common sense natural category that we can try to detect. (You may not be able to prove theorems, since it relies on common sense understanding; but you could be able to detect deception in any case that actually arises.)
↑ comment by johnswentworth · 2019-12-29T21:08:51.311Z · LW(p) · GW(p)
Thanks, that makes sense.
FWIW, my response would be something like: assuming that common-sense reasoning is sufficient, we'll probably still need a better understanding of embeddedness in order to actually build common-sense reasoning into an AI. When we say "common sense can solve these problems", it means humans know how to solve the problems, but that doesn't mean we know how to translate the human understanding into something an AI can use. I do agree that humans already have a good intuition for these problems, but we still don't know how to automate that intuition.
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out. I do think it's a natural category in some sense, but I think we still need a theoretical breakthrough before we'll be able to point a system at it - and I don't think systems will acquire human-compatible common sense by default as an instrumentally convergent tool.
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2019-12-30T08:18:04.097Z · LW(p) · GW(p)
I think our main difference in thinking here is not in whether or not common sense is sufficient, but in whether or not "common sense" is a natural category that ML-style methods could figure out.
To give some flavor of why I think ML could figure it out:
I don't think "common sense" itself is a natural category, but is instead more like a bundle of other things that are natural, e.g. pragmatics. It doesn't seem like "common sense" is innate to humans; we seem to learn "common sense" somehow (toddlers are often too literal). I don't see an obvious reason why an ML algorithm shouldn't be able to do the same thing.
In addition, "common sense" type rules are often very useful for prediction, e.g. if you hear "they gave me a million packets of hot sauce", and then you want to predict how many packets of hot sauce there are in the bag, you're going to do better if you understand common sense. So common sense is instrumentally useful for prediction (and probably any other objective you care to name that we might use to train an AI system).
That said, I don't think it's a crux for me -- even if I believed that current ML systems wouldn't be able to figure "common sense" out, my main update would be that current ML systems wouldn't lead to AGI / transformative AI, since I expect most tasks require common sense. Perhaps the crux is "transformative AI will necessarily have figured out most aspects of 'common sense'".
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-30T17:10:24.036Z · LW(p) · GW(p)
Ah, ok, I may have been imagining something different by "common sense" than you are - something more focused on the human-specific parts.
Maybe this claim gets more at the crux: the parts of "common sense" which are sufficient for handling embeddedness issues with human values are not instrumentally convergent; the parts of "common sense" which are instrumentally convergent are not sufficient for human values.
The cat on the keyboard seems like a decent example here (though somewhat oversimplified). If the keyboard suddenly starts emitting random symbols, then it seems like common sense to ignore it - after all, those symbols obviously aren't coming from a human. On the other hand, if the AI's objective is explicitly pointing to the keyboard, then that common sense won't do any good - it doesn't have any reason to care about the human's input more than random input a priori, common sense or not. Obviously there are simple ways of handling this particular problem, but it's not something the AI would learn unless it was pointing to the human to begin with.
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2019-12-30T19:38:57.322Z · LW(p) · GW(p)
Hmm, this seems to be less about whether or not you have common sense, and more about whether the AI system is motivated to use its common sense in interpreting instructions / goals.
I think if you have an AI system that is maximizing an explicit objective, e.g. maximize the numbers input from this keyboard; then the AI will have common sense, but (almost tautologically) won't use it to interpret the input correctly. (See also Failed Utopia.)
The hope is to train an AI system that doesn't work like that, in the same way that humans don't work like that. (In fact, I could see that by default AI systems are trained like that; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-30T21:15:47.938Z · LW(p) · GW(p)
The hope is to train an AI system that doesn't work like that, in the same way that humans don't work like that. (In fact, I could see that by default AI systems are trained like that; e.g. instruction-following AI systems like CraftAssist seem to be in this vein.)
Let me make sure I understand what you're picturing as an example. Rather than giving an AI an explicit objective, we train it to follow instructions from a human (presumably using something RL-ish?), and the idea is that it will learn something like human common sense in order to better follow instructions. Is that a prototypical case of what you're imagining? If so, what criteria do you imagine using for training? Maximizing a human approval score? Mimicking a human/predicting what a human would do and then doing that? Some kind of training procedure which somehow avoids optimizing anything at all?
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2019-12-31T03:27:56.578Z · LW(p) · GW(p)
Is that a prototypical case of what you're imagining?
Yes.
Maximizing a human approval score?
Sure, that seems reasonable. Note that this does not mean that the agent ends up taking whichever actions maximize the number entered into a keyboard; it instead creates a policy that is consistent with the constraints "when asked to follow <instruction i>, I should choose action <most approved action i>", for instructions and actions it is trained on. It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action, and then chooses the action that does best according to that prediction.
(In practice you'd want to add other things like e.g. interpretability and adversarial training.)
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-31T04:50:12.981Z · LW(p) · GW(p)
It's plausible to me that the most "natural" policy that satisfies these constraints is one which predicts what a real human would think of the chosen action...
I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. The final policy won't point to human values any more robustly than the data collection process did - if the data was generated by a human typing at a keyboard, then the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants". Garbage in, garbage out, etc.
More pithily: if a problem can't be solved by a human typing something into a keyboard, then it also won't be solved by simulating/predicting what the human would type into the keyboard.
It could be that there's some viable criterion of "natural" other than just maximizing predictive power, but predictive power alone won't circumvent the embeddedness problems.
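A toy illustration of the "garbage in, garbage out" point, with entirely made-up data. The three logged records and the two lookup-table policies are hypothetical; the only claim is that a policy scored purely on predicting the keyboard log is rewarded for reproducing the log's errors.

```python
# Hypothetical log: (situation, what the human wanted, what the keyboard recorded).
records = [
    ("tea request", "approve", "approve"),
    ("reactor alarm", "reject", "reject"),
    ("cat on keyboard", "reject", "approve"),  # the cat hit the approve key
]

WANTED = {situation: wanted for situation, wanted, _ in records}
TYPED = {situation: typed for situation, _, typed in records}


def predicts_wants(situation: str) -> str:
    """A policy that tracks what the human actually wanted."""
    return WANTED[situation]


def predicts_keyboard(situation: str) -> str:
    """A policy that tracks whatever ended up typed at the keyboard."""
    return TYPED[situation]


def accuracy_on_logged_data(policy) -> float:
    """Training signal: agreement with the keyboard log, the only data we collected."""
    hits = sum(policy(situation) == typed for situation, _, typed in records)
    return hits / len(records)


keyboard_score = accuracy_on_logged_data(predicts_keyboard)
wants_score = accuracy_on_logged_data(predicts_wants)
```

Measured against the log, `predicts_keyboard` scores perfectly while `predicts_wants` is penalized on the cat record. The "wanted" column exists here only because the example is constructed; in a real data collection process there is no such column to train on.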
Replies from: rohinmshah
↑ comment by Rohin Shah (rohinmshah) · 2019-12-31T07:43:07.892Z · LW(p) · GW(p)
Just in terms of pure predictive power, the most accurate policy is going to involve a detailed simulation of a human at a keyboard, reflecting the physical setup in which the data is collected - and that will produce basically the same problems as an actual human at a keyboard. [...] the most-predictive policy will predict what a human would type at a keyboard, not what a human "actually wants".
Agreed. I don't think we will get that policy, because it's very complex. (It's much easier / cheaper to predict what the human wants than to run a detailed simulation of the room.)
I'd expect that's going to depend pretty heavily on how we're quantifying "most natural", which brings us right back to the central issue.
I'm making an empirical prediction; so I'm not quantifying "most natural", reality is.
Tbc, I'm not saying that this is a good on-paper solution to AI safety; it doesn't seem like we could know in advance that this would work. I'm saying that it may turn out that as we train more and more powerful systems, we see evidence that the picture I painted is basically right; in that world it could be enough to do some basic instruction-following.
I'm also not saying that this is robust to scaling up arbitrarily far; as you said, the literal most predictive policy doesn't work.
Replies from: johnswentworth
↑ comment by johnswentworth · 2019-12-31T17:24:20.378Z · LW(p) · GW(p)
Cool, I agree with all of that. Thanks for taking the time to talk through this.