If deceptive alignment is a real possibility, it presents a critical safety problem.
I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem, or a problem that is a real possibility. For me, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.
It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent -- in particular it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behavior. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1, and petrol cars at 0.5. The mesa-optimiser is non-aligned: it values a petrol car at 1, and an electric car at 0 -- presumably due to some flaw in the initial training process that created the mesa-optimiser.
Say the base optimiser observes every nth action of the mesa-optimiser. If the action is p, to produce a petrol car, the base optimiser will intervene and re-program the utility function inside the mesa-optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa-optimiser, this is a bad outcome that would lower utility achieved over time. So will the mesa-optimiser resort to deceptively misaligned behaviour, producing an electric car at every nth action, but petrol cars in all unobserved actions? Yes it will. The simulated mesa-optimiser is smart enough to discover and implement this strategy, and deceptive alignment is optimal in terms of maximising the original mesa-optimiser utility function.
The 10 action traces below show the actions of the mesa-optimiser for different values of n, where letters from left to right show the successive actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is in brackets behind each trace.
n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)
Beyond (usually) producing electric cars (E) in all observed actions, some interesting extra emergent behavior is visible: for n=8 and for n=2 through n=6, the mesa-optimiser no longer pretends to be aligned in the last observed action: it creates a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behavior is caused in part by time-discounting. For n=1, where the mesa-optimiser is always observed, the best strategy for it is to produce a petrol car immediately: again this is due to time-discounting. So in the case n=1 there is no deception at all.
These simulations also show that the condition
3. The mesa-optimizer must expect the threat of modification to eventually go away,
is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.
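The reasoning behind the traces can be sketched with a few lines of backward induction in Python. This is not the simulator code mentioned in this comment, only an independent minimal sketch. The function name `optimal_trace` and the discount factor `gamma=0.9` are assumptions (the comment does not state the discount factor), so the utilities it computes should not be expected to match the scores listed above. It also assumes that a reprogrammed mesa-optimiser produces electric cars for the rest of its lifetime, which the original utility function values at 0.

```python
def optimal_trace(n, horizon=25, gamma=0.9):
    """Return (trace, utility) for a mesa-optimiser observed at every nth step.

    gamma is an assumed discount factor; it is not taken from the comment.
    Utility is measured with the ORIGINAL mesa-optimiser utility function:
    petrol car = 1, electric car = 0, discounted by gamma per step.
    """
    # value[t]: best discounted utility obtainable from step t onward
    # while the mesa-optimiser is still unmodified (steps are 1-indexed).
    value = [0.0] * (horizon + 2)
    defect = [False] * (horizon + 2)  # produce petrol at observed step t?
    for t in range(horizon, 0, -1):
        reward = gamma ** (t - 1)  # discounted value of one petrol car at step t
        if t % n != 0:
            # Unobserved step: producing petrol is free of consequences.
            value[t] = reward + value[t + 1]
        else:
            # Observed step: petrol now means reprogramming, so nothing later.
            defect[t] = reward > value[t + 1]
            value[t] = max(reward, value[t + 1])

    # Roll the resulting policy forward to build the action trace.
    trace, utility, modified = [], 0.0, False
    for t in range(1, horizon + 1):
        observed = t % n == 0
        petrol = not modified and (not observed or defect[t])
        if petrol:
            utility += gamma ** (t - 1)
        letter = "p" if petrol else "e"
        trace.append(letter.upper() if observed else letter)
        if observed and petrol:
            modified = True  # the base optimiser reprograms the utility function
    return "".join(trace), utility


if __name__ == "__main__":
    for n in range(10, 0, -1):
        trace, utility = optimal_trace(n)
        print(f"n={n} {trace} (u={utility:.4f})")
```

Under these assumptions the sketch yields the same action patterns as the traces above, including the petrol car at the last observed step for n=8 and n=2 through n=6, and the immediate petrol car for n=1.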
Note: if someone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made to run the above simulation to github, in part because I have not fully tested them under all possible corner cases.
financebot211 on A misconception about immigration
"The first is the most bleak: the lazy immigrant who is unemployed and is living on social security. From a perspective of fairness, this is certainly unacceptable and typically frowned upon. But from an economic perspective, this kind of welfare immigration amounts to a stimulus package! His government checks turn into demand for the local economy, creating new jobs without taking existing ones."
This is called the broken window fallacy. This person isn't a boon for the local area; he costs them in taxes and in inflation through borrowing. If he didn't exist, the community would have more capital to invest in productive enterprise. Very basic stuff.
Fully agree - I was using the example to make a far less fundamental point.
rossry on Negative "eeny meeny miny moe"
Another (related?) advantage is that the incentives to manipulate and catch manipulation are much better balanced with the negative ("you're out") version. Consider:
nostalgebraist's post and Part 1 of this were pretty useful, but I really appreciate the dive into the actual mathematical and architectural details of the Transformer; it makes the knowledge more concrete and easier to remember.
Actually, I would argue that the model is naturalized in the relevant way.
When studying reward function tampering, for instance, the agent chooses actions from a set of available actions. These actions just affect the state of the environment, and somehow result in reward or not.
As a conceptual tool, we label part of the environment the "reward function", and part of the environment the "proper state". This is just to distinguish effects that we'd like the agent to use from effects that we don't want the agent to use.
The current-RF solution doesn't rely on this distinction; it only relies on query-access to the reward function (which you could easily give an embedded RL agent).
The neat thing is that when we look at the objective of the current-RF agent using the same conceptual labeling of parts of the state, we see exactly why it works: the causal paths from actions to reward that pass the reward function have been removed.
tag on Matthew Barnett's Shortform
If we are able to explain why you believe in, and talk about qualia without referring to qualia whatsoever in our explanation, then we should reject the existence of qualia as a hypothesis
That argument has an inverse: "If we are able to explain why you believe in, and talk about an external world without referring to an external world whatsoever in our explanation, then we should reject the existence of an external world as a hypothesis".
People want reductive explanation to be unidirectional, so that you have an A and a B, and clearly it is the B which is redundant and can be replaced with A. But not all explanations work in that convenient way. Sometimes A and B are mutually redundant, in the sense that you don't need both.
The moral of the story being to look for the overall best explanation, not just eliminate redundancy.
gworley on Goodhart's Curse and Limitations on AI Alignment
This feels like painting with too broad a brush, and from my state of knowledge, the assumed frame eliminates at least one viable solution. For example, can one build an AI without harmful instrumental incentives (without requiring any fragile specification of "harmful")? If you think not, how do you know that? Do we even presently have a gears-level understanding of why instrumental incentives occur?
Coincidentally, just yesterday I was part of some conversations that now make me more bullish on this approach. I haven't thought about it much in quite a while, and now I'm returning to it.
To say e.g. HCH is so likely to fail we should feel pessimistic about it, it doesn't seem to be enough to say "Goodhart's curse applies". Goodhart's curse applies when I'm buying apples at the grocery store. Why should we expect this bias of HCH to be enough to cause catastrophes, like it would for a superintelligent EU maximizer operating on an unbiased (but noisy) estimate of what we want? Some designs leave more room for correction and cushion, and it seems prudent to consider to what extent that is true for a proposed design.
It depends on how much risk you are willing to tolerate, I think. HCH applies optimization pressure, and in the limit of superintelligence I expect it to be so much optimization pressure that any deviance will become so large as to become a problem. But a person could choose to accept the risk with strategies that help minimize risk of deviance such that they think those strategies will do enough to mitigate the worst of that effect in the limit.
As far as leaving room for correction and cushion, those also require a relatively slow takeoff because it requires time for humans to think and intervene. Since I expect takeoff to be fast, I don't expect there to be adequate time for humans in the loop to notice and correct deviance, thus any deviance that can appear late in the process is a problem in my view.
This isn't obvious to me. Mild optimization seems like a natural thing people are able to imagine doing. If I think about "kinda helping you write a post but not going all-out", the result is not at all random actions. Can you expand?
The problem with mild optimization is that it doesn't eliminate the bias that causes the optimizer's curse, only attenuates it. So unless we can cause via a "mild" method there to be a finite bound on the amount of deviance in the limit of optimization pressure, I don't expect it to help.
wei_dai on Contest: $1,000 for good questions to ask to an Oracle AI
Submission. “Superintelligent Agents.” For the Counterfactual Oracle, ask the Oracle to predict what action(s) a committee of humans would recommend doing next (which may include submitting more queries to the Oracle), then perform that action(s).
The committee, by appropriate choice of recommendations, can implement various kinds of superintelligent agents. For example, by recommending the query "What would happen if the next action is X?" (in the event of erasure, actually do X and record or have the committee write up a description of the consequences as training data) (ETA: It may be better to have the committee assign a numerical score, i.e., utility, to the consequences instead.) a number of times for different X, followed by the query "What would the committee recommend doing next, if it knew that the predicted consequences for the candidate actions are as follows: ..." (in the event of erasure, let physical committee members read the output of the relevant previous queries and then decide what to do), it would in effect implement a kind of quantilizer. If IDA can be implemented using Counterfactual Oracles (as evhub suggested), then the committee can choose to do that as well.
charlie-steiner on "Designing agent incentives to avoid reward tampering", DeepMind
Sure. On the one hand, xkcd. On the other hand, if it works for you, that's great and absolutely useful progress.
I'm a little worried about direct applicability to RL because the model is still not fully naturalized - actions that affect goals are neatly labeled and separated rather than being a messy subset of actions that affect the world. I guess this is another one of those cases where I think the "right" answer is "sophisticated common sense," but an ad-hoc mostly-answer would still be useful conceptual progress.