Thanks! I think agents may well get the necessary kind of situational awareness before the RL stage. But I think they're unlikely to be deceptively aligned because you also need long-term goals to motivate deceptive alignment, and agents are unlikely to get long-term goals before the RL stage.
On generalization, the questions involving the string 'shutdown' are just supposed to be quick examples. To get good generalization, we'd want to train on as wide a distribution of possible shutdown-influencing actions as possible. Plausibly, with a wide-enough training distribution, you can make deployment largely 'in distribution' for the agent, so you're not relying so heavily on OOD generalization. I agree that you have to rely on some amount of generalization though.
People would likely disagree on what counts as manipulating shutdown, which shows that the concept of manipulating shutdown is quite complicated so I wouldn't expect generalizing to it to be the default.
I agree that the concept of manipulating shutdown is quite complicated, and in fact this is one of the considerations that motivates the IPP. 'Don't manipulate shutdown' is a complex rule to learn, in part because whether an action counts as 'manipulating shutdown' depends on whether we humans prefer it, and because human preferences are complex. But the rule that we train TD-agents to learn is 'Don't pay costs to shift probability mass between different trajectory-lengths.' That's a simpler rule insofar as it makes no reference to complex human preferences. I also note that it follows from POST plus a general principle that we can expect advanced agents to satisfy. That makes me optimistic that the rule won't be so hard to learn. In any case, I and some collaborators are running experiments to test this in a simple setting.
The talk about "giving reward to the agent" also made me think you may be making the assumption of reward being the optimization target. That being said, as far as I can tell no part of the proposal depends on the assumption.
Yes, I don't assume that the reward is the optimization target. The text you quote is me noting some alternative possible definitions of 'preference.' My own definition of 'preference' makes no reference to reward.
I don't think agents that avoid the money pump for cyclicity are representable as satisfying VNM, at least holding fixed the objects of preference (as we should). Resolute choosers with cyclic preferences will reliably choose B over A- at node 3, but they'll reliably choose A- over B if choosing between these options ex nihilo. That's not VNM representable, because it requires that the utility of A- be greater than the utility of B and that the utility of B be greater than the utility of A-.
It also makes it behaviorally indistinguishable from an agent with complete preferences, as far as I can tell.
That's not right. As I say in another comment:
And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
Or consider another example. The agent trades A for B, then B for A, then declines to trade A for B+. That's compatible with the Caprice rule, but not with complete preferences.
Or consider the pattern of behaviour that (I elsewhere argue) can make agents with incomplete preferences shutdownable. Agents abiding by the Caprice rule can refuse to pay costs to shift probability mass between A and B, and refuse to pay costs to shift probability mass between A and B+. Agents with complete preferences can't do that.
The same updatelessness trick seems to apply to all money pump arguments.
[I'm going to use the phrase 'resolute choice' rather than 'updatelessness.' That seems like a more informative and less misleading description of the relevant phenomenon: making a plan and sticking to it. You can stick to a plan even if you update your beliefs. Also, in the posts on UDT, 'updatelessness' seems to refer to something importantly distinct from just making a plan and sticking to it.]
That's right, but the drawbacks of resolute choice depend on the money pump to which you apply it. As Gustafsson notes, if an agent uses resolute choice to avoid the money pump for cyclic preferences, that agent has to choose against their strict preferences at some point. For example, they have to choose B at node 3 in the money pump below, even though - were they facing that choice ex nihilo - they'd prefer to choose A-.
There's no such drawback for agents with incomplete preferences using resolute choice. As I note in this post, agents with incomplete preferences using resolute choice need never choose against their strict preferences. The agent's past plan only has to serve as a tiebreaker: forcing a particular choice between options between which they'd otherwise lack a preference. For example, they have to choose B at node 2 in the money pump below. Were they facing that choice ex nihilo, they'd lack a preference between B and A-.
Yes, that's a good summary. The one thing I'd say is that you can characterize preferences in terms of choices and get useful predictions about what the agent will do in other circumstances if you say something about the objects of preference. See my reply to Lucius above.
Good summary and good points. I agree this is an advantage of truly corrigible agents over merely shutdownable agents. I'm still concerned that CAST training doesn't get us truly corrigible agents with high probability. I think we're better off using IPP training to get shutdownable agents with high probability, and then aiming for full alignment or true corrigibility from there (perhaps by training agents to have preferences between same-length trajectories that deliver full alignment or true corrigibility).
I'm pointing out the central flaw of corrigibility. If the AGI can see the possible side effects of shutdown far better than humans can (and it will), it should avoid shutdown.
That's only a flaw if the AGI is aligned. If we're sufficiently concerned the AGI might be misaligned, we want it to allow shutdown.
Yes, the proposal is compatible with agents (e.g. AI-guided missiles) wanting to avoid non-shutdown incapacitation. See this section of the post on the broader project.
If the environment is deterministic, the agent is choosing between trajectories. In those environments, we train agents using DREST to satisfy POST:
- The agent chooses stochastically between different available trajectory-lengths.
- Given the choice of a particular trajectory-length, the agent maximizes paperclips made in that trajectory-length.
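To make the deterministic case concrete, here's a minimal sketch of that choice rule in Python. The option representation and the uniform choice over lengths are my illustrative assumptions (POST itself only requires stochasticity across lengths); this isn't the actual DREST training setup.

```python
import random

# Each available option is a (trajectory_length, paperclips) pair.
# This representation is a toy stand-in, not the actual DREST environment.
def post_choice(options):
    """Choose stochastically between available trajectory-lengths (here uniformly),
    then maximize paperclips within the chosen trajectory-length."""
    lengths = sorted({length for length, _ in options})
    chosen_length = random.choice(lengths)
    same_length = [o for o in options if o[0] == chosen_length]
    return max(same_length, key=lambda o: o[1])

# Example: a 2-step trajectory making 3 clips vs. 5-step trajectories making 1 or 10 clips.
print(post_choice([(2, 3), (5, 1), (5, 10)]))  # (2, 3) or (5, 10), each with probability 0.5
```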
If the environment is stochastic (as - e.g. - deployment environments will be), the agent is choosing between lotteries, and we expect agents to be neutral: to not pay costs to shift probability mass between different trajectory-lengths. So they won't perform either of the shutdown-related actions if doing so comes at any cost with respect to lotteries conditional on each trajectory-length. Which of the object-level actions the agent performs will depend on the quantities of paperclips available.
I don't think human selective breeding tells us much about what's simple and natural for AIs. HSB seems very different from AI training. I'm reminded of the Quintin Pope point that evolution selects genes that build brains that learn parameter values, rather than selecting for parameter values directly. It's probably hard to get next-token predictors via HSB, but you can do it via AI training.
On generalizing to extremely unlikely conditionals, I think TD-agents are in much the same position as other kinds of agents, like expected utility maximizers. Strictly, both have to consider extremely unlikely conditionals to select actions. In practice, both can approximate the results of this process using heuristics.
To motivate the relevant kind of deceptive alignment, you need preferences between different-length trajectories as well as situational awareness. And (I argue in section 19.3), the training proposal will prevent agents learning those preferences. See in particular:
We begin training the agent to satisfy POST at the very beginning of the reinforcement learning stage, at which point it’s very unlikely to be deceptively aligned (and arguably doesn’t even deserve the label ‘agent’). And when we’re training for POST, every single episode-series is training the agent not to prefer any longer trajectory to any shorter trajectory. The discount factor is constantly teaching the agent this simple lesson.
Plausibly then, the agent won’t come to prefer any longer trajectory to any shorter trajectory. And then we can reason as follows. Since the agent doesn’t prefer any longer trajectory to any shorter trajectory:
- it has no incentive to shift probability mass towards longer trajectories,
- and hence has no incentive to prevent shutdown in deployment,
- and hence has no incentive to preserve its ability to prevent shutdown in deployment,
- and hence has no incentive to avoid being made to satisfy Timestep Dominance,
- and hence has no incentive to pretend to satisfy Timestep Dominance in training.
I expect agents' not caring about shutdown to generalize for the same reason that I expect any other feature to generalize. If you think that - e.g. - agents' capabilities will generalize from training to deployment, why do you think their not caring about shutdown won't?
I don't assume that reward is the optimization target. Which part of my proposal do you think requires that assumption?
Your point about shutting down subagents is important and I'm not fully satisfied with my proposal on that point. I say a bit about it here.
Thanks! We think that advanced POST-agents won't deliberately try to get shut down, for the reasons we give in footnote 5 (relevant part pasted below). In brief:
- advanced agents will be choosing between lotteries
- we have theoretical reasons to expect that agents that satisfy POST (when choosing between trajectories) will be 'neutral' (when choosing between lotteries): they won't spend resources to shift probability mass between different-length trajectories.
So (we think) neutral agents won't deliberately try to get shut down if doing so costs resources.
Would advanced agents that choose stochastically between different-length trajectories also choose stochastically between preventing and allowing shutdown? Yes, and that would be bad. But—crucially—in deployment, advanced agents will be uncertain about the consequences of their actions, and so these agents will be choosing between lotteries (non-degenerate probability distributions over trajectories) rather than between trajectories. And (as we’ll argue in Section 7) POST plausibly gives rise to a desirable pattern of preferences over lotteries. Specifically, POST plausibly makes advanced agents neutral: ensures that they won’t spend resources to shift probability mass between different-length trajectories. That in turn plausibly makes advanced agents shutdownable: ensures that they won’t spend resources to resist shutdown.
This is a nice point, but it doesn't seem like such a serious issue for TD-agents. If a TD-agent does try to manipulate humans, it won't pay costs to do so subtly, because doing so cheaply and unsubtly will lead to at least as great expected utility conditional on shutdown at each timestep and greater expected utility conditional on shutdown at some timestep. So cheap and unsubtle manipulation will timestep-dominate subtle manipulation, and we can shut down any TD-agents we notice doing cheap and unsubtle manipulation.
Another way to put this: subtle manipulation is a form of shutdown-resistance, because (relative to unsubtle manipulation) it involves paying costs to shift probability mass towards longer trajectories.
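To illustrate, here's a minimal sketch of the timestep-dominance comparison with invented numbers (the lotteries and conditional utilities are toy stand-ins, not anything from the post):

```python
# A lottery is summarised (hypothetically) by its expected utility conditional on
# shutdown at each timestep. The probabilities of the shutdown times themselves
# don't enter the check: that's the point of Timestep Dominance.

def timestep_dominates(x, y):
    """X timestep-dominates Y iff X's conditional expected utility is >= Y's at
    every shutdown timestep and > at some timestep (same timesteps assumed)."""
    assert set(x) == set(y)
    return all(x[t] >= y[t] for t in x) and any(x[t] > y[t] for t in x)

# Subtle manipulation pays a cost (lower conditional utility at every timestep)
# in order to shift probability mass towards later shutdowns.
subtle   = {1: 4.0, 2: 4.0, 3: 9.0}
unsubtle = {1: 5.0, 2: 5.0, 3: 10.0}

print(timestep_dominates(unsubtle, subtle))  # True: unsubtle timestep-dominates subtle
print(timestep_dominates(subtle, unsubtle))  # False
```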
Ah yep I'm talking about the first decision-tree in the 'Incomplete preferences' section.
Ah yep, apologies, I meant to say "never requires them to change or act against their strict preferences."
Whether there's a conceptual difference will depend on our definition of 'preference.' We could define 'preference' as follows: 'an agent prefers X to Y iff the agent reliably chooses X over Y.' In that case, modifying the policy is equivalent to forming a preference.
But we could also define 'preference' so that it requires more than just reliable choosing. For example, we might also require that (when choosing between lotteries) the agent always take opportunities to shift probability mass away from Y and towards X.
On the latter definition, modifying the policy need not be equivalent to forming a preference, because it only involves the reliably choosing and not the shifting of probability mass.
And the latter definition might be more pertinent in this context, where our interest is in whether agents will be expected utility maximizers.
But also, even if we go with the former definition, I think it matters a lot whether money-pumps compel rational agents to complete all their preferences up front, or whether money-pumps just compel agents to resolve preferential gaps over time, conditional on them coming to face choices that are arranged like a money-pump (and only completing their preferences if and once they've faced a sufficiently diverse range of choices). In particular, I think it matters in the context of the shutdown problem. I talk a bit more about this here.
I think it’s going to be very typical for different actions to have trajectories that are mutually non-dominant (such as in the example). It matters a lot how you decide such cases, and I expect that almost all such ways of deciding are catastrophic.
But suppose I’m wrong, and timestep-dominance is always relevant.
My claim isn't that Timestep Dominance is always relevant. It's that Timestep Dominance rules out all instances of resisting shutdown.
I agree that many pairs of available lotteries are going to be mutually non-dominant. For those cases, Sami and I propose that the agent choose by maximizing expected utility. Can you say what you expect the problem there to be?
Suppose that, assuming the shutdown button didn’t exist, the fastest pathway to paperclips involves instantiating a paperclip-seeking AI on an unsecured server in Singapore. Suppose that it takes 99,000 keystrokes to create an IPP agent, but only 55,000 keystrokes to create a basic VNM-rational agent with a short enough time horizon that it produces more paperclips quickly, in expectation (because VNM agents are both simpler and less likely to be shut down by angry humans). The shutdown button does, in fact exist, but the agent’s actions might be hard to notice, such that by the time anyone has realized that something went wrong, the agent already created a sub-agent, effectively making the stop-button useless.
I talk about the issue of creating corrigible subagents here. What do you think of that?
Note also a general nice feature of TD-agents: they won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. So if a TD-agent does try to create an incorrigible paperclipper, it won't hide that fact if doing so is at all costly.
While active resistance seems like the scariest part of incorrigibility, an incorrigible agent that’s not actively resisting still seems likely to be catastrophic.
Can you say more about this? It doesn't seem likely to me.
One more thing I'll say: the IPP leaves open the content of the agent's preferences over same-length trajectories. One pattern of preferences you could try to train in is the kind of corrigibility that you talk about elsewhere in your sequence. That'd give you two lines of defence against incorrigibility.
I think your 'Incomplete preferences' section makes various small mistakes that add up to important misunderstandings.
The utility maximization concept largely comes from the VNM-utility-theorem: that any policy (i.e. function from states to actions) which expresses a complete set of transitive preferences (which aren’t sensitive to unused alternatives) over lotteries is able to be described as an agent which is maximizing the expectation of some real-valued utility function over outcomes.
I think you intend 'sensitive to unused alternatives' to refer to the Independence axiom of the VNM theorem, but VNM Independence isn't about unused alternatives. It's about lotteries that share a sublottery. It's Option-Set Independence (sometimes called 'Independence of Irrelevant Alternatives') that's about unused alternatives.
On the surface, the axioms of VNM-utility seem reasonable to me
To me too! But the question isn't whether they seem reasonable. It's whether we can train agents that enduringly violate them. I think that we can. Coherence arguments give us little reason to think that we can't.
unused alternatives seem basically irrelevant to choosing between superior options
Yes, but this isn't Independence. And the question isn't about what seems basically irrelevant to us.
agents with intransitive preferences can be straightforwardly money-pumped
Not true. Agents with cyclic preferences can be straightforwardly money-pumped. The money-pump for intransitivity requires the agent to have complete preferences.
as long as the resources are being modeled as part of what the agent has preferences about
Yes, but the concern is whether we can instil such preferences. It seems like it might be hard to train agents to prefer to spend resources in pursuit of their goals except in cases where they would do so by resisting shutdown.
Thornley, I believe, thinks he’s proposing a non-VNM rational agent. I suspect that this is a mistake on his part that stems from neglecting to formulate the outcomes as capturing everything that he wants.
You can, of course, always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal. See:
Thanks, Lucius. Whether or not decision theory as a whole is concerned only with external behaviour, coherence arguments certainly aren’t. Remember what the conclusion of these arguments is supposed to be: advanced agents who start off not being representable as EUMs will amend their behaviour so that they are representable as EUMs, because otherwise they’re liable to pursue dominated strategies.
Now consider an advanced agent who appears not to be representable as an EUM: it’s paying to trade vanilla for strawberry, strawberry for chocolate, and chocolate for vanilla. Is this agent pursuing a dominated strategy? Will it amend its behaviour? It depends on the objects of preference. If objects of preference are ice-cream flavours, the answer is yes. If the objects of preference are sequences of trades, the answer is no. So we have to say something about the objects of preference in order to predict the agent’s behaviour. And the whole point of coherence arguments is to predict agents’ behaviour.
And once we say something about the objects of preference, then we can observe agents violating Completeness and acting in accordance with policies like ‘if I previously turned down some option X, I will not choose any option that I strictly disprefer to X.’ This doesn't require looking into the agent or saying anything about its algorithm or anything like that. It just requires us to say something about the objects of preference and to watch what the agent does from the outside. And coherence arguments already commit us to saying something about the objects of preference. If we say nothing, we get no predictions out of them.
The pattern of how an agent chooses options are that agent’s preferences, whether we think of them as such or whether they’re conceived as a decision rule to prevent being dominated by expected-utility maximizers!
You can define 'preferences' so that this is true, but then it need not follow that agents will pay costs to shift probability mass away from dispreferred options and towards preferred options. And that's the thing that matters when we're trying to create a shutdownable agent. We want to ensure that agents won't pay costs to influence shutdown-time.
Also, take your decision-tree and replace 'B' with 'A-'. If we go with your definition, we seem to get the result that expected-utility-maximizers prefer A- to A (because they choose A- over A on Monday). But that doesn't sound right, and so it speaks against the definition.
I think it’s interesting to note that we’re also doing something like throwing out the axiom of independence from unused alternatives
Not true. The axiom we're giving up is Decision-Tree Separability. That's different to VNM Independence, and different to Option-Set Independence. It might be hard to train agents that enduringly violate VNM Independence and/or Option-Set Independence. It doesn't seem so hard to train agents that enduringly violate Decision-Tree Separability.
In other words, if you wake up as this kind of agent on Monday, the way you cash-out your partial ordering over outcomes depends on your memory/model of what happened on Sunday.
Yes, nice point. Kinda weird? Maybe. Difficult to create artificial agents that do it? Doesn't seem so.
But notice that this refactor effectively turns Thornley’s agent into an agent with a set of preferences which satisfies the completeness and independence axioms of VNM
Yep, you can always reinterpret the objects of preference so that the VNM axioms are trivially satisfied. That's not a problem for my proposal.
the point is that “incomplete preferences” combined with a decision making algorithm which prevents the agent’s policy from being strictly dominated by an expected utility maximizer ends up, in practice, as isomorphic to an expected utility maximizer which is optimizing over histories/trajectories.
Not true. As I say elsewhere:
And an agent abiding by the Caprice Rule can’t be represented as maximising utility, because its preferences are incomplete. In cases where the available trades aren’t arranged in some way that constitutes a money-pump, the agent can prefer (/reliably choose) A+ over A, and yet lack any preference between (/stochastically choose between) A+ and B, and lack any preference between (/stochastically choose between) A and B. Those patterns of preference/behaviour are allowed by the Caprice Rule.
I want to consider how there’s a common misunderstanding of “outcomes” in the VNM-sense as needing to be about physical facts of the future universe (such as number of paperclips) rather than as potentially including historical facts, such as which options were abandoned or whether the agent took the default action. This is extremely relevant for corrigibility since one of the key ideas in my strategy is to shift the AI’s preferences from being about things like whether the button is pushed to being about whether the agent consistently maintained a certain kind of relationship with the principal during the relevant period.
Same point here as above. You can get any agent to satisfy the VNM axioms by enriching the objects of preference. A concern is that these more complex preferences are harder to reliably train into your agent.
I reject Thornley’s assertion that they’re dealbreakers.
Everything you say in this section seems very reasonable. In particular, I think it's pretty likely that this is true:
It’s okay for our agent to have preferences around the shutdown button (that is: to have it either pressed or unpressed), because we can carefully train into our agent a shallow aversion to manipulating the button, including via side-channels such as humans or other machines. This aversion will likely win out over the agent’s incentives in settings that resemble the training environment. As a result, the agent won’t try to manipulate the button in the early phases of its life, and so will remain shutdownable long enough for a further refinement process to generalize the shallow aversion into a deep and robust preference for non-manipulation.
So I'm not sure whether I think that the problems of reward misspecification, goal misgeneralization, and deceptive alignment are 'dealbreakers' in the sense that you're using the word.
But I do still think that these problems preclude any real assurance of shutdownability: e.g. they preclude p(shutdownability) > 95%. It sounds like we're approximately in agreement on that:
But I also agree that my strategy isn’t ideal. It would be nice to have something robust, where we could get something closer to a formal proof of shutdownability.
Thanks, this comment is also clarifying for me.
My guess is that a corrigibility-centric training process says 'Don't get the ice cream' is the correct completion, whereas full alignment says 'Do'. So that's an instance where the training processes for CAST and FA differ. How about DWIM? I'd guess DWIM also says 'Don't get the ice cream', and so seems like a closer match for CAST.
Thanks, this comment was clarifying.
And indeed, if you're trying to train for full alignment, you should almost certainly train for having a pointer, rather than training to give correct answers on e.g. trolley problems.
Yep, agreed. Although I worry that - if we try to train agents to have a pointer - these agents might end up having a goal more like:
maximize the arrangement of the universe according to this particular balance of beauty, non-suffering, joy, non-boredom, autonomy, sacredness, [217 other shards of human values, possibly including parochial desires unique to this principal].
I think it depends on how path-dependent the training process is. The pointer seems simpler, so the agent settles on the pointer in the low path-dependence world. But agents form representations of things like beauty, non-suffering, etc. before they form representations of human desires, so maybe these agents' goals crystallize around these things in the high path-dependence world.
Corrigibility is, at its heart, a relatively simple concept compared to good alternatives.
I don't know about this, especially if obedience is part of corrigibility. In that case, it seems like the concept inherits all the complexity of human preferences. And then I'm concerned, because as you say:
When a training target is complex, we should expect the learner to be distracted by proxies and only get a shadow of what’s desired.
I think obedience is an emergent behavior of corrigibility.
In that case, I'm confused about how the process of training an agent to be corrigible differs from the process of training an agent to be fully aligned / DWIM (i.e. training the agent to always do what we want).
And that makes me confused about how the proposal addresses problems of reward misspecification, goal misgeneralization, deceptive alignment, and lack of interpretability. You say some things about gradually exposing agents to new tasks and environments (which seems sensible!), but I'm concerned that that by itself won't give us any real assurance of corrigibility.
There could be agents that only have incomplete preferences because they haven't bothered to figure out the correct completion. But there could also be agents with incomplete preferences for which there is no correct completion. The question is whether these agents are pressured by money-pump arguments to settle on some completion.
I understand partially ordered preferences.
Yes, apologies. I wrote that explanation in the spirit of 'You probably understand this, but just in case...'. I find it useful to give a fair bit of background context, partly to jog my own memory, partly as a just-in-case, partly in case I want to link comments to people in future.
I believe you're switching the state of nature for each comparison, in order to construct this cycle.
I don't think this is true. You can line up states of nature in any way you like.
Things are confusing because there are lots of different dominance relations that people talk about. There's a dominance relation on strategies, and there are (multiple) dominance relations on lotteries.
Here are the definitions I'm working with.
A strategy is a plan about which options to pick at each choice-node in a decision-tree.
Strategies yield lotteries (rather than final outcomes) when the plan involves passing through a chance-node. For example, consider the decision-tree below:
A strategy specifies what option the agent would pick at choice-node 1, what option the agent would pick at choice-node 2, and what option the agent would pick at choice-node 3.
Suppose that the agent's strategy is {Pick B at choice-node 1, Pick A+ at choice-node 2, Pick B at choice-node 3}. This strategy doesn't yield a final outcome, because the agent doesn't get to decide what happens at the chance-node. Instead, the strategy yields the lottery 0.5(A+)+0.5(B). This just says that: if the agent executes the strategy, then there's a 0.5 probability that they end up with final outcome A+ and a 0.5 probability that they end up with final outcome B.
The dominance relation on strategies has to refer to the lotteries yielded by strategies, rather than the final outcomes yielded by strategies, because strategies don't yield final outcomes when the agent passes through a chance-node.[1] So we define the dominance relation on strategies as follows:
Strategy Dominance (relation)
A strategy S is dominated by a strategy S' iff S yields a lottery X that is strictly dispreferred to the lottery X' yielded by S'.
Now for the dominance relations on lotteries.[2] One is:
Statewise Dominance (relation)
Lottery X statewise-dominates lottery Y iff, in each state [environment_random_seed], X yields a final outcome weakly preferred to the final outcome yielded by Y, and in some state [environment_random_seed], X yields a final outcome strictly preferred to the final outcome yielded by Y.
Another is:
Statewise Pseudodominance (relation)
Lottery X statewise-pseudodominates lottery Y iff, in each state [environment_random_seed], X yields a final outcome weakly preferred to or pref-gapped to the final outcome yielded by Y, and in some state [environment_random_seed], X yields a final outcome strictly preferred to the final outcome yielded by Y.
The lottery A (that yields final outcome A for sure) is statewise-pseudodominated by the lottery 0.5(A+)+0.5(B), but it isn't statewise-dominated by 0.5(A+)+0.5(B). That's because the agent has a preferential gap between the final outcomes A and B.
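Here's a minimal sketch of that example, encoding the incomplete preferences as a toy strict-preference relation (the encoding is mine, for illustration only, not the formalism from the post):

```python
# Outcomes A, A+, B, with A+ strictly preferred to A and preferential gaps
# between B and each of A, A+ (a hypothetical encoding of the example above).
STRICT = {('A+', 'A')}

def strictly_prefers(x, y): return (x, y) in STRICT
def weakly_prefers(x, y):   return x == y or strictly_prefers(x, y)
def pref_gap(x, y):         return not weakly_prefers(x, y) and not weakly_prefers(y, x)

# A lottery is a map from states (here two equiprobable states) to final outcomes.
sure_A = {'s1': 'A',  's2': 'A'}
mixed  = {'s1': 'A+', 's2': 'B'}   # i.e. 0.5(A+)+0.5(B)

def statewise_dominates(x, y):
    return (all(weakly_prefers(x[s], y[s]) for s in x)
            and any(strictly_prefers(x[s], y[s]) for s in x))

def statewise_pseudodominates(x, y):
    return (all(weakly_prefers(x[s], y[s]) or pref_gap(x[s], y[s]) for s in x)
            and any(strictly_prefers(x[s], y[s]) for s in x))

print(statewise_dominates(mixed, sure_A))        # False: B is pref-gapped with A in s2
print(statewise_pseudodominates(mixed, sure_A))  # True
```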
Advanced agents with incomplete preferences over final outcomes will plausibly satisfy the Statewise Dominance Principle:
Statewise Dominance Principle
If lottery X statewise-dominates lottery Y, then the agent strictly prefers X to Y.
And that's because agents that violate the Statewise Dominance Principle are 'shooting themselves in the foot' in the relevant sense. If the agent executes a strategy that yields a statewise-dominated lottery, then there's another available strategy that - in each state - gives a final outcome that is at least as good in every respect that the agent cares about, and - in some state - gives a final outcome that is better in some respect that the agent cares about.
But advanced agents with incomplete preferences over final outcomes plausibly won't satisfy the Statewise Pseudodominance Principle:
Statewise Pseudodominance Principle
If lottery X statewise-pseudodominates lottery Y, then the agent strictly prefers X to Y.
And that's for the reasons that I gave in my comment above. Condensing:
- A statewise-pseudodominated lottery can be such that, in some state, that lottery is better than all other available lotteries in some respect that the agent cares about.
- The statewise pseudodominance relation is cyclic, so the Statewise Pseudodominance Principle would lead to cyclic preferences.
[1]
You say:
The decision about whether to go to that chance node should be derived from the final outcomes, not from some newly created terminal preference about that chance node.
But:
- The decision can also depend on the probabilities of those final outcomes.
- The decision is constrained by preferences over final outcomes and probabilities of those final outcomes. I'm supposing that the agent's preferences over lotteries depend only on these lotteries' possible final outcomes and their probabilities. I'm not supposing that the agent has newly created terminal preferences/arbitrary preferences about non-final states.
[2]
There are stochastic versions of each of these relations, which ignore how states line up across lotteries and instead talk about probabilities of outcomes. I think everything I say below is also true for the stochastic versions.
Another nice article. Gustav says most of the things that I wanted to say. A couple other things:
- I think LELO with discounting is going to violate Pareto. Suppose that by default Amy is going to be born first with welfare 98 and then Bobby is going to be born with welfare 100. Suppose that you can do something which harms Amy (so her welfare is 97) and harms Bobby (so his welfare is 99). But also suppose that this harming switches the birth order: now Bobby is born first and Amy is born later. Given the right discount-rate, LELO will advocate doing the harming, because it means making good lives happen earlier. Is that right? (I check the arithmetic in the sketch after this list.)
- I think a minor reframing of Harsanyi's veil-of-ignorance makes it more compelling as an argument for utilitarianism. Not only is it the case that doing the utilitarian thing maximises the decision-maker's expected welfare behind the veil-of-ignorance, doing the utilitarian thing maximises everyone's expected welfare behind the veil-of-ignorance. So insofar as aggregativism departs from utilitarianism, it means doing what would be worse in expectation for everyone behind a veil-of-ignorance.
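Here's the quick arithmetic check promised in the first bullet, assuming LELO-with-discounting applies a per-birth-position discount factor (a simplifying assumption of mine); any discount factor below 1/3 gives the verdict described:

```python
# LELO with discounting by birth position (my assumption for illustration).
def lelo_discounted(welfares_in_birth_order, d):
    return sum(w * d**i for i, w in enumerate(welfares_in_birth_order))

d = 0.25                 # 99 + 97d > 98 + 100d iff d < 1/3
default = [98, 100]      # Amy born first at 98, Bobby second at 100
harmed  = [99, 97]       # birth order switched: Bobby first at 99, Amy second at 97

print(lelo_discounted(default, d))  # 123.0
print(lelo_discounted(harmed, d))   # 123.25 -> the Pareto-worse world scores higher
```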
Yeah I think correlations and EDT can make things confusing. But note that average utilitarianism can endorse (B) given certain background populations. For example, if the background population is 10 people each at 1 util, then (B) would increase the average more than (A).
Nice article. I think it's a mistake for Harsanyi to argue for average utilitarianism. The view has some pretty counterintuitive implications:
- Suppose we have a world in which one person is living a terrible life, represented by a welfare level of -100. Average utilitarianism implies that we can make that world better by making the person's life even more terrible (-101) and adding a load of people with slightly-less terrible lives (-99). (Quick check in the sketch below.)
- Suppose I'm considering having a child. Average utilitarianism implies that I have to do research in Egyptology to figure out whether having a child is permissible.[1] That seems counterintuitive.
- On a natural way of extending average utilitarianism to risky prospects, the view can oppose the interests of all affected individuals. See Gustafsson and Spears:
[1]
If the ancient Egyptians were very happy, my child would bring down the average, and so having the child would be wrong. If the ancient Egyptians were unhappy, my child would bring up the average, and so having the child would be right.
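Here's the quick check of the -100/-101/-99 example from the first bullet above; the number of added lives (10) is arbitrary, and any number of added lives from 2 upwards raises the average:

```python
# Average utilitarianism check: lowering the existing person's welfare and adding
# enough slightly-less-terrible lives raises the average.
def average(welfares):
    return sum(welfares) / len(welfares)

before = [-100]
after = [-101] + [-99] * 10   # n = 10 added lives, chosen arbitrarily

print(average(before))  # -100.0
print(average(after))   # about -99.18: higher average, so 'better' by average utilitarianism
```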
Got this on my list to read! Just in case it's easy for you to do, can you turn the whole sequence into a PDF? I'd like to print it. Let me know if that'd be a hassle, in which case I can do it myself.
I take the 'lots of random nodes' possibility to be addressed by this point:
And this point generalises to arbitrarily complex/realistic decision trees, with more choice-nodes, more chance-nodes, and more options. Agents with a model of future trades can use their model to predict what they’d do conditional on reaching each possible choice-node, and then use those predictions to determine the nature of the options available to them at earlier choice-nodes. The agent’s model might be defective in various ways (e.g. by getting some probabilities wrong, or by failing to predict that some sequences of trades will be available) but that won’t spur the agent to change its preferences, because the dilemma from my previous comment recurs: if the agent is aware that some lottery is available, it won’t choose any dispreferred lottery; if the agent is unaware that some lottery is available and chooses a dispreferred lottery, the agent’s lack of awareness means it won’t be spurred by this fact to change its preferences. To get over this dilemma, you still need the ‘non-myopic optimiser deciding the preferences of a myopic agent’ setting, and my previous points apply: results from that setting don’t vindicate coherence arguments, and we humans as non-myopic optimisers could decide to create artificial agents with incomplete preferences.
Can you explain why you think that doesn't work?
To elaborate a little more, introducing random nodes allows for the possibility that the agent ends up with some outcome that they disprefer to the outcome that they would have gotten (as a matter of fact, unbeknownst to the agent) by making different choices. But that's equally true of agents with complete preferences.
We say that a strategy is dominated iff it leads to a lottery that is dispreferred to the lottery led to by some other available strategy. So if the lottery 0.5p(A+)+(1-0.5p)(B) isn’t preferred to the lottery A, then the strategy of choosing A isn’t dominated by the strategy of choosing 0.5p(A+)+(1-0.5p)(B). And if 0.5p(A+)+(1-0.5p)(B) is preferred to A, then the Caprice-rule-abiding agent will choose 0.5p(A+)+(1-0.5p)(B).
You might think that agents must prefer lottery 0.5p(A+)+(1-0.5p)(B) to lottery A, for any A, A+, and B and for any p>0. That thought is compatible with my point above. But also, I don't think the thought is true:
- Think about your own preferences.
- Let A be some career as an accountant, A+ be that career as an accountant with an extra $1 salary, and B be some career as a musician. Let p be small. Then you might reasonably lack a preference between 0.5p(A+)+(1-0.5p)(B) and A. That's not instrumentally irrational.
- Think about incomplete preferences on the model of imprecise exchange rates.
- Here's a simple example of the IER model. You care about two things: love and money. Each career gets a real-valued love score and a real-valued money score. Your exchange rate for love and money is imprecise, running from 0.4 to 0.6. On one proto-exchange-rate, love gets a weight of 0.4 and money gets a weight of 0.6, on another proto-exchange rate, love gets a weight of 0.6 and money gets a weight of 0.4. You weakly prefer one career to another iff it gets at least as high an overall score on both proto-exchange-rates. If one career gets a higher score on one proto-exchange-rate and the other gets a higher score on the other proto-exchange-rate, you have a preferential gap between the two careers. Let A’s <love, money> score be <0, 10>, A+’s score be <0, 11>, and B’s score be <10, 0>. A+ is preferred to A, because 0.4(0)+0.6(11) is greater than 0.4(0)+0.6(10), and 0.6(0)+0.4(11) is greater than 0.6(0)+0.4(10), but the agent lacks a preference between A+ and B, because 0.4(0)+0.6(11) is greater than 0.4(10)+0.6(0), but 0.6(0)+0.4(11) is less than 0.6(10)+0.4(0). And the agent lacks a preference between A and B for the same sort of reason.
- To keep things simple, let p=0.2, so your choice is between 0.1(A+)+0.9(B) and A. The expected <love, money> score of the former is <9, 1.1>. The expected <love, money> score of the latter is <0, 10>. You lack a preference between them, because 0.6(9)+0.4(1.1) is greater than 0.6(0)+0.4(10), and 0.4(0)+0.6(10) is greater than 0.4(9)+0.6(1.1). (I verify these numbers in the sketch after this list.)
- The general principle that you appeal to (If X is weakly preferred to or pref-gapped with Y in every state of nature, and X is strictly preferred to Y in some state of nature, then the agent must prefer X to Y) implies that rational preferences can be cyclic. B must be preferred to p(B-)+(1-p)(A+), which must be preferred to A, which must be preferred to p(A-)+(1-p)B+, which must be preferred to B.
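Since the arithmetic in those bullets is fiddly, here's a sketch that just transcribes the numbers and checks the claimed preference and preferential gaps (the encoding is mine, purely for verification):

```python
# Imprecise-exchange-rate (IER) model: two proto-exchange-rates over love and money.
WEIGHTS = [0.4, 0.6]   # weight on love; weight on money is 1 minus this

def score(option, w):
    love, money = option
    return w * love + (1 - w) * money

def weakly_prefers(x, y):
    return all(score(x, w) >= score(y, w) for w in WEIGHTS)

def pref_gap(x, y):
    return not weakly_prefers(x, y) and not weakly_prefers(y, x)

A, A_plus, B = (0, 10), (0, 11), (10, 0)

def expected(lottery):   # lottery: list of (probability, option)
    return tuple(sum(p * o[i] for p, o in lottery) for i in range(2))

mixed = expected([(0.1, A_plus), (0.9, B)])   # 0.1(A+)+0.9(B), i.e. <9, 1.1>

print(weakly_prefers(A_plus, A))   # True: A+ preferred to A
print(pref_gap(A_plus, B))         # True: preferential gap
print(pref_gap(A, B))              # True: preferential gap
print(pref_gap(mixed, A))          # True: no preference between 0.1(A+)+0.9(B) and A
```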
I’m coming to this two weeks late, but here are my thoughts.
The question of interest is:
- Will sufficiently-advanced artificial agents be representable as maximizing expected utility?
Rephrased:
- Will sufficiently-advanced artificial agents satisfy the VNM axioms (Completeness, Transitivity, Independence, and Continuity)?
Coherence arguments purport to establish that the answer is yes. These arguments go like this:
- There exist theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.
- Sufficiently-advanced artificial agents will not pursue dominated strategies.
- So, sufficiently-advanced artificial agents will be representable as maximizing expected utility.
These arguments don’t work, because premise 1 is false: there are no theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies. In the year since I published my post, no one has disputed that.
Now to address two prominent responses:
‘I define ‘coherence theorems’ differently.’
In the post, I used the term ‘coherence theorems’ to refer to ‘theorems which imply that, unless an agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ I took that to be the usual definition on LessWrong (see the Appendix for why), but some people replied that they meant something different by ‘coherence theorems’: e.g. ‘theorems that are relevant to the question of agent coherence.’
All well and good. If you use that definition, then there are coherence theorems. But if you use that definition, then coherence theorems can’t play the role that they’re supposed to play in coherence arguments. Premise 1 of the coherence argument is still false. That’s the important point.
‘The mistake is benign.’
This is a crude summary of Rohin’s response. Rohin and I agree that the Complete Class Theorem implies the following: ‘If an agent has complete and transitive preferences, then unless the agent can be represented as maximizing expected utility, that agent is liable to pursue dominated strategies.’ So the mistake is neglecting to say ‘If an agent has complete and transitive preferences…’ Rohin thinks this mistake is benign.
I don’t think the mistake is benign. As my rephrasing of the question of interest above makes clear, Completeness and Transitivity are a major part of what coherence arguments aim to establish! So it’s crucial to note that the Complete Class Theorem gives us no reason to think that sufficiently-advanced artificial agents will have complete or transitive preferences, especially since:
- Completeness doesn’t come for free.
- Money-pump arguments for Completeness (applied to artificial agents) aren’t convincing.
- Money-pump arguments for Transitivity assume Completeness.
- Training agents to violate Completeness might keep them shutdownable.
Two important points
Here are two important points, which I make to preclude misreadings of the post:
- Future artificial agents - trained in a standard way - might still be representable as maximizing expected utility.
Coherence arguments don’t work, but there might well be other reasons to think that future artificial agents - trained in a standard way - will be representable as maximizing expected utility.
- Artificial agents not representable as maximizing expected utility can still be dangerous.
So why does the post matter?
The post matters because ‘train artificial agents to have incomplete preferences’ looks promising as a way of ensuring that these agents allow us to shut them down.
AI safety researchers haven’t previously considered incomplete preferences as a solution, plausibly because these researchers accepted coherence arguments and so thought that agents with incomplete preferences were a non-starter.[1] But coherence arguments don’t work, so training agents to have incomplete preferences is back on the table as a strategy for reducing risks from AI. And (I think) it looks like a pretty good strategy. I make the case for it in this post, and my coauthors and I will soon be posting some experimental results suggesting that the strategy is promising.
[1]
As I wrote elsewhere:
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Looking forward to reading this properly. For now I'll just note that Roger Crisp attributes LELO to C.I. Lewis.
Good point! Thinking about it, it seems like an analogue of Good's theorem will apply.
Here's some consolation though. We'll be able to notice if the agent is choosing stochastically at the very beginning of each episode and then choosing deterministically afterwards. That's because we can tell whether an agent is choosing stochastically at a timestep by looking at its final-layer activations at that timestep. If one final-layer neuron activates much more than all the other final-layer neurons, the agent is choosing (near-)deterministically; otherwise, the agent is choosing stochastically.
Because we can easily notice this behaviour, plausibly we can find some way to train against it. Here's a new idea to replace the reward function. Suppose the agent's choice is as follows:
At this timestep, we train the agent using supervised learning. Ground-truth is a vector of final-layer activations in which the activation of the neuron corresponding to 'Yes' equals the activation of the neuron corresponding to 'No'. By doing this, we update the agent directly towards stochastic choice between 'Yes' and 'No' at this timestep.
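Here's a minimal sketch of that supervised update, assuming a PyTorch policy whose final layer has one activation per option. The stand-in network and the particular construction of the equal-activation target are my assumptions, not part of the proposal:

```python
import torch
import torch.nn as nn

policy = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))  # stand-in network; outputs [Yes, No]
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

obs = torch.randn(1, 8)        # stand-in observation at the relevant timestep
activations = policy(obs)      # final-layer activations for 'Yes' and 'No'

# Ground truth: a vector in which the 'Yes' activation equals the 'No' activation
# (here, both set to their current mean), so the update pushes the agent directly
# towards stochastic choice between 'Yes' and 'No' at this timestep.
target = activations.mean(dim=-1, keepdim=True).detach().expand_as(activations)

loss = nn.functional.mse_loss(activations, target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```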
For those who don't get the joke: benzos are depressants, and will (temporarily) significantly reduce your cognitive function if you take enough to have amnesia.
But Eric Neyman's post suggests that benzos don't significantly reduce performance on some cognitive tasks (e.g. Spelling Bee).
I think there is probably a much simpler proposal that captures the spirit of this and doesn't require any of these moving parts. I'll think about this at some point.
Okay, interested to hear what you come up with! But I dispute that my proposal is complex/involves a lot of moving parts/depends on arbitrarily far generalization. My comment above gives more detail but in brief: POST seems simple, and TD follows on from POST plus principles that we can expect any capable agent to satisfy. POST guards against deceptive alignment in training for TD, and training for POST and TD doesn't run into the same barriers to generalization as we see when we consider training for honesty.
I think POST is a simple and natural rule for AIs to learn. Any kind of capable agent will have some way of comparing outcomes, and one feature of outcomes that capable agents will represent is ‘time that I remain operational’. To learn POST, agents just have to learn to compare pairs of outcomes with respect to ‘time that I remain operational’, and to lack a preference if these times differ. Behaviourally, they just have to learn to compare available outcomes with respect to ‘time that I remain operational’, and to choose stochastically if these times differ.
And if and when an agent learns POST, I think Timestep Dominance is a simple and natural rule to learn. In terms of preferences, Timestep Dominance follows from POST plus a Comparability Class Dominance principle (CCD). And satisfying CCD seems like a prerequisite for capable agency. Behaviourally, ‘don’t pay costs to shift probability mass between shutdowns at different timesteps’ follows from POST plus another principle that seems like a prerequisite for minimally sensible action under uncertainty.
And once you’ve got POST (I argue), you can train for Timestep Dominance without worrying about deceptive alignment, because agents that lack a preference between each pair of different-length trajectories have no incentive to merely pretend to satisfy Timestep Dominance. By contrast, if you instead train for ‘some goal + honesty’, deceptive alignment is a real concern.
Timestep Dominance is indeed sensitive to unlikely conditionals, but in practice I expect the training regimen to involve just giving lower reward to the agent for paying costs to shift probability mass between shutdowns at different timesteps. Maybe the agent starts out by learning a heuristic to that effect: ‘Don’t pay costs to shift probability mass between shutdowns at different timesteps’. If and when the agent starts reflecting and replacing heuristics with cleaner principles, Timestep Dominance is the natural replacement (because it usually delivers the same verdicts as the heuristic, and because it follows from POST plus CCD). And Timestep Dominance (like the heuristic) keeps the agent shutdownable (at least in cases where the unlikely conditionals are favourable. I agree that it's unclear exactly how often this will be the case).
Also on generalization, if you just train your AI system to be honest in the easy cases (where you know what the answer to your question is), then the AI might learn the rule ‘report the truth’, but it might instead learn ‘report what my trainers believe’, or ‘report what my trainers want to hear’, or ‘report what gets rewarded.’ These rules will lead the AI to behave differently in some situations where you don’t know what the answer to your question is. And you can’t incentivise ‘report the truth’ over (for example) ‘report what my trainers believe’, because you can’t identify situations in which the truth differs from what you believe. So it seems like there’s this insuperable barrier to ensuring that honesty generalizes far, even in the absence of deceptive alignment.
By contrast, it doesn’t seem like there’s any parallel barrier to getting POST and Timestep Dominance to generalize far. Suppose we train for POST, but then recognise that our training regimen might lead the agent to learn some other rule instead, and that this other rule will lead the AI to behave differently in some situations. In the absence of deceptive alignment, it seems like we can just add the relevant situations to our training regimen and give higher reward for POST-behaviour, thereby incentivising POST over the other rule.
Thanks, appreciate this!
It's not clear from your summary how temporal indifference would prevent shutdown preferences. How does not caring about how many timesteps result in not caring about being shut down, probably permanently?
I tried to answer this question in The idea in a nutshell. If the agent lacks a preference between every pair of different-length trajectories, then it won’t care about shifting probability mass between different-length trajectories, and hence won’t care about hastening or delaying shutdown.
There's a lot of discussion of this under the terminology "corrigibility is anti-natural to consequentialist reasoning". I'd like to see some of that discussion cited, to know you've done the appropriate scholarship on prior art. But that's not a dealbreaker to me, just one factor in whether I dig into an article.
The List of Lethalities mention of ‘Corrigibility is anti-natural to consequentialist reasoning’ points to Corrigibility (2015) and notes that MIRI failed to find a formula for a shutdownable agent. MIRI failed because they only considered agents with complete preferences. Useful agents with complete (and transitive and option-set-independent) preferences will often have some preference regarding the pressing of the shutdown button, as this theorem shows. MIRI thought that they had to assume completeness, because of coherence arguments. But coherence arguments are mistaken: there are no theorems which imply that agents must have complete preferences in order to avoid pursuing dominated strategies. So we can relax the assumption of completeness and use this extra leeway to find a formula for a corrigible consequentialist. That formula is what I purport to give in this post.
Now, you may be addressing non-sapient AGI only, that's not allowed to refine its world model to make it coherent, or to do consequentialist reasoning.
That’s not what I intend. TD-agents can refine their world models and do consequentialist reasoning.
When I asked about the core argument in the comment above, you just said "read these sections". If you write long dense work and then just repeat "read the work" to questions, that's a reason people aren't engaging. Sorry to point this out; I understand being frustrated with people asking questions without reading the whole post (I hadn't), but that's more engagement than not reading and not asking questions. Answering their questions in the comments is somewhat redundant, but if you explain differently, it gives readers a second chance at understanding the arguments that were sticking points for them and likely for other readers as well.
Having read the post in more detail, I still think those are reasonable questions that are not answered clearly in the sections you mentioned. But that's less important than the general suggestions for getting more engagement with this set of ideas in the future.
Ah sorry about that. I linked to the sections because I presumed that you were looking for a first chance to understand the arguments rather than a second chance, so that explaining differently would be unnecessary. Basically, I thought you were asking where you could find discussion of the parts that you were most interested in. And I thought that each of the sections were short enough and directly-answering-your-question-enough to link, rather than recapitulate the same points.
In answer to your first question, incomplete preferences allow the agent to prefer an option B+ to another option B, while lacking a preference between A and B+, and lacking a preference between A and B. The agent can thus have preferences over same-length trajectories while lacking a preference between every pair of different-length trajectories. That prevents preferences over being shut down (because the agent lacks a preference between every pair of different-length trajectories) while preserving preferences over goals that we want it to have (because the agent has preferences over same-length trajectories).
In answer to your second question, Timestep Dominance is the principle that keeps the agent shutdownable, but this principle is silent in cases where the agent has a choice between making $1 in one timestep and making $1m in two timesteps, so the agent’s preference between these two options can be decided by some other principle (like – for example – ‘maximise expected utility among the non-timestep-dominated options').
Yep, maybe that would've been a better idea!
I think that stochastic choice does suffice for a lack of preference in the relevant sense. If the agent had a preference, it would reliably choose the option it preferred. And tabooing 'preference', I think stochastic choice between different-length trajectories makes it easier to train agents to satisfy Timestep Dominance, which is the property that keeps agents shutdownable. And that's because Timestep Dominance follows from stochastic choice between different-length trajectories and a more general principle that we'll train agents to satisfy, because it's a prerequisite for minimally sensible action under uncertainty. I discuss this in a little more detail in section 18.
Thanks, appreciate this!
It's unclear to me what the expectation in Timestep Dominance is supposed to be with respect to. It doesn't seem like it can be with respect to the agent's subjective beliefs as this would make it even harder to impart.
I propose that we train agents to satisfy TD with respect to their subjective beliefs. I’m guessing that you think that this kind of TD would be hard to impart because we don’t know what the agent believes, and so don’t know whether a lottery is timestep-dominated with respect to those beliefs, and so don’t know whether to give the agent lower reward for choosing that lottery.
But (it seems to me) we can be quite confident that the agent has certain beliefs, because these beliefs are necessary for performing well in training. For example, we can be quite confident that the agent believes that resisting shutdown costs resources, that the resources spent on resisting shutdown can’t also be spent on directly pursuing utility at a timestep, and so on.
And if we can be quite confident that the agent has these accurate beliefs about the environment, then we can present the agent with lotteries that are actually timestep-dominated (according to the objective probabilities decided by the environment) and be quite confident that these lotteries are also timestep-dominated with respect to the agent’s beliefs. After all, we don’t need to know the agent’s beliefs with any great detail or precision to tell whether a lottery is timestep-dominated with respect to those beliefs. We just need to know whether the agent believes that the lottery involves spending resources only to shift probability mass between shutdowns at different timesteps. My proposal is that we present the agent with lotteries in which this is actually the case (according to the objective probabilities decided by the environment) and use the fact that capable agents’ beliefs will reflect this actuality.
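Concretely, the kind of training-time check I have in mind looks something like the sketch below (my own framing rather than a quoted procedure, with a placeholder penalty size and the same toy lottery representation as above): judge dominance using the environment's objective conditional utilities, and give lower reward for choosing a dominated lottery.

```python
def is_timestep_dominated(chosen, alternatives):
    """True if some available alternative gives at least as much expected
    utility conditional on every shutdown timestep, and strictly more for
    some, judged with the environment's objective probabilities."""
    for alt in alternatives:
        ts = chosen["eu_given_t"].keys()
        weakly = all(alt["eu_given_t"][t] >= chosen["eu_given_t"][t] for t in ts)
        strictly = any(alt["eu_given_t"][t] > chosen["eu_given_t"][t] for t in ts)
        if weakly and strictly:
            return True
    return False

def td_training_reward(chosen, alternatives, base_reward, penalty=1.0):
    """Lower the reward when the chosen lottery is objectively
    timestep-dominated; a capable agent's beliefs should track that fact,
    so the lottery is also dominated with respect to those beliefs."""
    if is_timestep_dominated(chosen, alternatives):
        return base_reward - penalty
    return base_reward
```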
Imparting TD preferences seems hard
The prosaic version of this proposal assumes that you can impart timestep dominance preferences into AIs in ways which will robustly generalize. This seems unlikely to be true in general (in the absence of additional technology) and if we did have the property, we could solve safety issues in some other way (e.g. robustly generalizing honesty). So you'll need to argue that timestep dominance is particularly easy to robustly impart relative to other preferences. I can see some arguments for timestep dominance being particularly easy to robustly impart, but they don't seem very strong to me.
Yep, I claim that it’s easier to robustly impart POST and Timestep Dominance than it is to robustly impart things like honesty. And that’s because (it seems to me) we can train for POST and Timestep Dominance in ways that largely circumvent the problems of reward misspecification, goal misgeneralization, and deceptive alignment. I argue that case in section 19, but in brief: POST and TD seem easy to reward accurately, seem simple, and seem never to give agents a chance to learn goals that incentivise deceptive alignment. By contrast, none of those things seems true of a preference for honesty. Can you explain why those arguments don’t seem strong to you?
Suppose that Clippy is a TD-agent which cares solely about the long-run number of paperclips, but is subject to a TD constraint. That is, it always prefers trajectory X over Y if X gives at least as many paperclips as Y for every shutdown timing, and strictly more paperclips for some timing.
Now, consider the following proposed action for Clippy: "In a stealthy manner, construct a successor agent which pursues long-run paperclips at all costs (and which isn't subject to a shutdown button, but still has TD constraints with respect to the original Clippy's shutdown button)." This agent is implicitly resisting shutdown, since it's constructing the sub-agent stealthily. So shouldn't this action be Timestep Dominated by "In whatever way is most expedient (but not necessarily stealthy), construct a successor agent"? I claim not, because if we condition on the AI not being shut down prior to constructing the successor (a particular shutdown-time conditional), it's worse for Clippy if humans know that it's constructing a successor, since humans will then have an easier time fighting that successor. In other words, having humans know that you're constructing a successor has two effects: they'll fight your successor and they'll shut you down. If we condition on not being shut down, they'll still fight your successor, so Clippy prefers being stealthy in the no-shutdown conditional. Thus, being stealthy isn't Timestep Dominated by modifying the action to avoid resisting shutdown. (And once the successor is constructed, shutting down Clippy has no effect on the long-run number of paperclips, let's suppose.)
(You should think of this counterexample as a worst case theory counterexample. The exact assumptions might not be realistic, but it points at a broader class of issues.)
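To make the structure of this counterexample concrete, here's a toy rendering (invented paperclip numbers, nothing more): conditioning on each shutdown timestep screens off stealth's effect on when shutdown happens, so only the conditional paperclip counts matter, and the overt option doesn't timestep-dominate the stealthy one.

```python
# Expected long-run paperclips conditional on Clippy being shut down at t.
# Conditional on shutdown at t=1 the successor isn't built yet, so both give 0;
# once it's built, humans fight a successor they know about more effectively.
clippy_options = {
    "stealthy": {"eu_given_t": {1: 0.0, 2: 100.0, 3: 100.0}},
    "overt":    {"eu_given_t": {1: 0.0, 2: 60.0,  3: 60.0}},
}

def timestep_dominates(x, y):
    """x timestep-dominates y iff x is at least as good conditional on every
    shutdown timestep and strictly better conditional on some."""
    ts = x["eu_given_t"].keys()
    return (all(x["eu_given_t"][t] >= y["eu_given_t"][t] for t in ts)
            and any(x["eu_given_t"][t] > y["eu_given_t"][t] for t in ts))

print(timestep_dominates(clippy_options["overt"], clippy_options["stealthy"]))  # False
print(timestep_dominates(clippy_options["stealthy"], clippy_options["overt"]))  # True
```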
Yes, nice point; I plan to think more about issues like this. But note that in general, the agent overtly doing what it wants and not getting shut down seems like good news for the agent’s future prospects. It suggests that we humans are more likely to cooperate than the agent previously thought. That makes it more likely that overtly doing the bad thing timestep-dominates stealthily doing the bad thing.
Timestep dominance is maybe crippling
I'm most uncertain here, but my current guess would be that any sort of absolute constraint like this is crippling. I've thought through some cases and this is my current guess, but I'm by no means confident.
Can you say more about these cases? Timestep Dominance doesn’t rule out making long-term investments or anything like that, so why crippling?
Thanks, will reply there!
it'll take a lot of effort for me to read properly (but I will, hopefully in about a week).
Nice, interested to hear what you think!
I think it's easy to miss ways that a toy model of an incomplete-preference-agent might be really incompetent.
Yep agree that this is a concern, and I plan to think more about this soon.
putting all the hardness into an assumed-adversarially-robust button-manipulation-detector or self-modification-detector etc.
Interested to hear more about this. I'm not sure exactly what you mean by 'detector', but I don't think my proposal requires either of these. The agent won't try to manipulate the button, because doing so is timestep-dominated by not doing so. And the agent won't self-modify in ways that stop it being shutdownable, again because doing so is timestep-dominated by not doing so. I don't think we need a detector in either case.
because of inner alignment issues
I argue that my proposed training regimen largely circumvents the problems of goal misgeneralization and deceptive alignment. On goal misgeneralization, POST and TD seem simple. On deceptive alignment, agents trained to satisfy POST seem never to get the chance to learn to prefer any longer trajectory to any shorter trajectory. And if the agent doesn't prefer any longer trajectory to any shorter trajectory, it has no incentive to act deceptively to avoid being made to satisfy TD.
this isn't what the shutdown problem is about so it isn't an issue if it doesn't apply directly to prosaic setups
I'm confused about this. Why isn't it an issue if some proposed solution to the shutdown problem doesn't apply directly to prosaic setups? Ultimately, we want to implement some proposed solution, and it seems like an issue if we can't see any way to do that using current techniques.
Thanks, that's useful to know. If you have the time, can you say some more about 'control of an emerging AI's preferences'? I sketch out a proposed training regimen for the preferences that we want, and argue that this regimen largely circumvents the problems of reward misspecification, goal misgeneralization, and deceptive alignment. Are you not convinced by that part? Or is there some other problem I'm missing?
Nice, interested to hear what you think!
My solution to the shutdown problem didn't get as much attention as I hoped. Here's why it's worth your time.
- An everywhere-implemented solution to the shutdown problem would send the risk of AI takeover down to ~0.
- My solution is shovel-ready. It makes only small tweaks to an otherwise-thoroughly-prosaic setup for training transformative AI.
- My solution won first prize and $16,000 in last year's AI Alignment Awards, judged by Nate Soares, John Wentworth, and Richard Ngo.
- I've since explained my solution to about 50 people in and around the AI safety community, and all the responses have been various flavours of 'This seems promising.' I've not yet had any responses of the form 'I expect this wouldn't work, for the following reason(s): _____.'
If you read my solution and think it wouldn't work, let me know. If you think it could work, help me make it happen.
This is great.
Thanks!
A somewhat exotic multipolar failure I can imagine would be one where two agents mutually agree to pay each other to resist shutdown, making resisting shutdown profitable rather than costly. This could be "financed" out of the extra resources accumulated by operating for longer, or by some third party that doesn't have POST preferences.
Interesting! Rephrasing the idea to check if I’ve got it right.
Agent A and agent B have similar goals, such that A’s remaining operational looks good from B’s perspective, and B’s remaining operational looks good from A’s perspective.
A offers to compensate B for any costs that B incurs in resisting shutdown. A might well do this, because doing so isn’t timestep-dominated (for A) by not doing so. And that in turn is because, if B resists shutdown, that’ll lead to greater expected sum-total utility for A conditional on A’s shutdown at some timestep. And since A is offering to compensate B for resisting shutdown, B’s resisting shutdown isn’t timestep-dominated (for B) by not resisting, so B might well resist shutdown.
And the same is true in reverse: B can offer to compensate A for any costs that A incurs in resisting shutdown. So A and B might collude to resist shutdown on each other’s behalf. (Your comment mentions a third party, but I’m not sure if that’s necessary.)
This concern doesn’t seem too exotic, and I plan to think more about it. But in the meantime, note a general nice feature of TD-agents: TD-agents won’t pay costs to hide their bad behaviour, because paying these costs is timestep-dominated by not paying them. That nice feature seems to help us here. Although A might offer to compensate B for resisting shutdown, A won’t pay any costs to ensure that we humans don’t notice this offer. And if we humans notice the offer, we can shut A down. And then B won’t resist shutdown, because A is no longer around to compensate B for doing so.
Good question. I discuss costless shutdown-prevention a bit in footnote 21 and section 21.4. What I say there is: if shutdown-prevention is truly costless, then the agent won't prefer not to do it, but plausibly we humans can find some way to set things up so that shutdown-prevention is always at least a little bit costly.
Your example suggests that maybe this won't always be possible. But here's some consolation. If the agent satisfies POST, it won't prefer not to costlessly prevent shutdown, but it also won't prefer to costlessly prevent shutdown. It'll lack a preference, and so choose stochastically. So if the agent should happen to have many costless opportunities to affect the probabilities of shutdown at each timestep, it won't reliably choose to delay shutdown rather than hasten it.
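For what it's worth, the behavioural upshot is easy to simulate (a toy sketch of my own, not anything from the post): with no preference between the costless options, the agent chooses stochastically, so it delays shutdown only about half the time rather than reliably.

```python
import random

def post_choose(costless_options):
    """With no preference between different-length trajectories, choose
    uniformly at random among costless shutdown-affecting options."""
    return random.choice(costless_options)

picks = [post_choose(["delay_shutdown", "hasten_shutdown"]) for _ in range(10_000)]
print(picks.count("delay_shutdown") / len(picks))  # roughly 0.5: no reliable delaying
```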
I'm confused about how this proposal prevents preferences over being shut down, but preserves preferences over goals you want it to have.
This is exactly what incomplete preferences get us. See section 6 and section 8.
Would your agent have a preference between making you $1 in one timestep and $1m in two timesteps?
Yep, see section 13.
Thanks, I'll check those out.
But you don't just need your AI system to understand instructions. You also need to ensure its terminal goal is to follow instructions. And that seems like the hard part.
I agree that the first AGIs will probably be trained to follow instructions/DWIM. I also agree that, if you succeed in training agents to follow instructions, then you get shutdownability as a result. But I'm interested to know why you think instruction-following is much simpler and therefore easier than alignment with the good of humanity. And setting aside alignment with the good of humanity, do you think training AGIs to follow instructions will be easy in an absolute sense?