Posts

The Simulation Epiphany Problem 2019-10-31T22:12:51.323Z · score: 18 (10 votes)
New paper: Corrigibility with Utility Preservation 2019-08-06T19:04:26.386Z · score: 36 (11 votes)

Comments

Comment by koen-holtman on An overview of 11 proposals for building safe advanced AI · 2020-06-04T15:43:13.313Z · score: 5 (3 votes) · LW · GW

Thanks for the post! Frankly this is a sub-field of alignment that I have not been following closely, so it is very useful to have a high-level comparative overview.

I have a question about your thoughts on what 'myopia verification' means in practice.

Do you see 'myopia' as a single well-defined mathematical property that might be mechanically verified by an algorithm that tests the agent? Or is it a more general bucket term that means `bad in a particular way', where a human might conclude, based on some gut feeling when seeing the output of a transparency tool, that the agent might not be sufficiently myopic?

What informs this question is that I can't really tell when I re-read your Towards a mechanistic understanding of corrigibility and the comments there. So I am wondering about your latest thinking.

Comment by koen-holtman on Specification gaming: the flip side of AI ingenuity · 2020-05-15T14:11:22.318Z · score: 4 (3 votes) · LW · GW

In the TAISU unconference the original poster asked for some feedback:

I recently wrote a blog post with some others from the DM safety team on specification gaming. We were aiming for a framing of the problem that makes sense to reinforcement learning researchers as well as AI safety researchers. Haven't received much feedback on it since it came out, so it would be great to hear whether people here found it useful / interesting.

My thoughts: I feel that engaging/reaching out to the wider community of RL researchers is an open problem, in terms of scaling work on AGI safety. So great to see a blog post that also tries to frame this particular problem for an RL researcher audience.

As a member of the AGI safety researcher audience, I echo the comments of johnswentworth: well-written, great graphics, but mostly stuff that was already obvious. I do like the picture 'spectrum of unexpected solutions' a lot, this is an interesting way of framing the issues. So, can I read this post as a call to action for AGI safety researchers? Yes, because it identifies two open problem areas, 'reward design' and 'avoidance of reward tampering', with links.

Can I read the post as a call to action for RL researchers? Short answer: no.

If I try to read the post from the standpoint of an RL researcher, what I notice most is that 'RL algorithm design', on the right in the 'aligned RL agent design' illustrations, has an arrow pointing to 'specification gaming is valid'. If I were an RL algorithm designer, I would read this as saying there is nothing I could contribute, if I stay in my own area of RL algorithm design expertise, to the goal of 'aligned RL agent design'.

So, is this the intended message that the blog post authors want to send to the RL researcher community? A non-call-to-action? Not sure. So this leaves me puzzled.

[Edited to add:]

In the TAISU discussion we concluded that there is indeed one call to action for RL algorithm designers: the message that, if they are ever making plans to deploy an RL-based system to the real world, it is a good idea to first talk to some AI/AGI safety people about specification gaming risks.

Comment by koen-holtman on Two Alternatives to Logical Counterfactuals · 2020-04-09T09:14:09.922Z · score: 1 (1 votes) · LW · GW

I was hoping somebody would come up with more schools... I think I could interpret the techniques of school 3 as a particular way to implement the 'make some edits before you input it into the reasoning engine' prescription of school 2, but maybe school 3 is different from school 2 in how it would describe its solution direction.

There is definitely also a school 4 (or maybe you would say this is the same one as school 3) which considers it to be an obvious truth that when you run simulations or start up a sandbox, you can supply any starting world state that you like, and there is nothing strange or paradoxical about this. Specifically, if you are an agent considering a choice between taking actions A, B, and C as the next action, you can run different simulations to extrapolate the results of each. If a self-aware agent inside the simulation for action B computes that the action an optimal agent would have taken at the point in time where its simulation started was A, this agent cannot conclude there is a contradiction: such a conclusion would rest on making a category error. (See my answer in this post for a longer discussion of the topic.)

Comment by koen-holtman on Two Alternatives to Logical Counterfactuals · 2020-04-07T21:37:32.843Z · score: 7 (2 votes) · LW · GW
It is easy to see that this idea of logical counterfactuals is unsatisfactory. For one, no good account of them has yet been given. For two, there is a sense in which no account could be given; reasoning about logically incoherent worlds can only be so extensive before running into logical contradiction.

I've been doing some work on this topic, and I am seeing two schools of thought on how to deal with the problem of logical contradictions you mention. To explain these, I'll use an example counterfactual not involving agents and free will. Consider the counterfactual sentence: `if the vase had not been broken, the floor would not have been wet'. Now, how can we compute a truth value for this sentence?

School of thought 1 proceeds as follows: we know various facts about the world, like that the vase is broken and that the floor is wet. We also know general facts about vases, breaking, water, and floors. Now we add the extra fact that the vase is not broken to our knowledge base. Based on this extended body of knowledge, we compute the truth value of the claim 'the floor is not wet'. Clearly, we are dealing with a knowledge base that contains mutually contradictory facts: the vase is both broken and it is not broken. Under normal mathematical systems of reasoning, this will allow us to prove any claim we like: the truth value of any sentence becomes 1, which is not what we want. Now, school 1 tries to solve this by coming up with new systems of reasoning that are tolerant of such internal contradictions: systems that will produce the 'obviously true' conclusions only, or that will derive the 'obviously true' conclusions before deriving the 'obviously false' ones, or that compute probabilistic truth values in such a way that those of the 'obviously true' conclusions are higher. In MIRI terminology, I believe this approach goes under the heading 'decision theory'. I also interpret the two alternative solutions you mention above as following this school of thought. Personally, I find this solution approach not very promising or compelling.

School of thought 2, which includes Pearl's version of counterfactual reasoning, says that if you want to reason (or if you want a machine to reason) in a counterfactual way, you should not just add facts to the body of knowledge you use. You need to delete or edit other facts in the knowledge base too, before you supply it to the reasoning engine, exactly to avoid inputting a knowledge base that has internal contradictions. For example, if you want to reason about 'if the vase had not been broken', one thing you definitely need to do is to first remove the statement (or any information leading to the conclusion that) `the vase is broken' from the knowledge base that goes into your reasoning engine. You have to do this even though the fact that the vase is broken is obviously true for the current world you are in.

So school 2 avoids the problem of having to somehow build a reasoning engine that does the right thing even when a contradictory knowledge base is input. But it trades this for the problem of having to decide exactly what edits will be made to the knowledge base to eliminate the possibility of having such contradictions. In other words, if you want a machine to reason in a counterfactual way, you have to make choices about the specific edits you will make. Often, there are many possible choices, and different choices may lead to different probability distributions in the outcomes computed. This choice problem does not bother me that much; I see it as having design freedom. But if you are a philosopher of language trying to find a single obvious system of meaning for natural language counterfactual sentences, this choice problem might bother you a lot: you might be tempted to find some kind of representation-independent Occam's razor that can be used to decide between counterfactual edits.
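The school 2 prescription for the vase example can be sketched in a few lines of Python. This is only an illustrative toy, not Pearl's formalism: the names `facts`, `floor_wet_rule` and `counterfactual` are hypothetical, and the rule 'the floor is wet exactly when the vase is broken' is an assumed piece of general knowledge.

```python
# Hypothetical toy knowledge base: observed facts about the actual world.
facts = {"vase_broken": True, "floor_wet": True}

def floor_wet_rule(kb):
    # Assumed general knowledge for this toy world: the floor is wet
    # exactly when the vase is broken.
    return kb["vase_broken"]

def counterfactual(facts, edits, downstream):
    """School 2 style: edit the knowledge base before reasoning.

    facts      -- observed facts about the actual world
    edits      -- facts to overwrite for the counterfactual premise
    downstream -- maps a fact name to the rule that recomputes it
    """
    kb = dict(facts)  # never modify the record of the actual world
    # Delete observed facts that follow from the edited ones, so the
    # reasoning engine never receives contradictory input.
    for name in downstream:
        kb.pop(name, None)
    kb.update(edits)  # overwrite the old fact, do not merely add a new one
    for name, rule in downstream.items():
        kb[name] = rule(kb)  # recompute from the edited knowledge base
    return kb

world = counterfactual(
    facts,
    edits={"vase_broken": False},
    downstream={"floor_wet": floor_wet_rule},
)
print(world["floor_wet"])
```

The choice of which facts to list in `downstream`, and which to keep fixed, is exactly the design freedom discussed above: a different choice gives a different counterfactual.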

Overall, my feeling is that school 2 gives an account of logical counterfactuals that is good enough for my purposes in AGI safety work.

As a trivial school 1 edge case, one could design a reasoning engine that can deal with contradictory facts in its input knowledge base as follows: the engine first makes some school 2 edits on its input to remove the contradictions, and then proceeds calculating the requested truth value. So one could argue that the schools are not fundamentally different, though I do feel they are different in outlook, especially in their outlook on how necessary or useful it will be for AGI safety to resolve certain puzzles.

Comment by koen-holtman on Call for volunteers: assessing Kurzweil, 2019 · 2020-04-07T18:57:48.777Z · score: 3 (2 votes) · LW · GW

OK -- I'll do 20 to start.

Comment by koen-holtman on Subagents and impact measures, full and fully illustrated · 2020-02-28T11:00:23.546Z · score: 3 (2 votes) · LW · GW

Nice! There are a lot of cases being considered here, but my main takeaway is that these impact measures have surprising loopholes, once the agent becomes powerful enough to construct sub-agents.

Mathematically, my main takeaway is that, for the impact measure PENALTY(s,a)= from Conservative Agency, if the agent wants to achieve the sub-goal while avoiding the penalty triggered by the term, it can build a sub-agent that is slightly worse at achieving than it would be itself, and set it loose.

Now for some more speculative thoughts. I think the main source of the loophole above is the part , so what happens if we just delete that part? Then we get an agent with an incentive to stop any human present in the environment from becoming too good at achieving the goal , which would be bad. More informally, it looks like the penalty term has a loophole because it does not distinguish between humans and sub-agents.

Alice and Bob have a son Carl. Carl walks around and breaks a vase. Who is responsible?

Obviously, this depends on many factors, including Carl's age. To manage the real world, we weave a quite complex web to determine accountability.

In one way, it is encouraging that very simple and compact impact measures, which do not encode any particulars of the agent environment, can be surprisingly effective in simple environments. But my intuition is that when we scale up to more complex environments, the only way to create a good level of robustness is to build more complex measures that rely in part on encoding and leveraging specific properties of the environment.

Comment by koen-holtman on Could someone please start a bright home lighting company? · 2019-12-02T09:47:41.255Z · score: 9 (4 votes) · LW · GW

I used to work in the lighting industry, so here are some comments from an industry perspective.

There are several high-quality studies about how more light, and being able to control dimming and color temperature, can improve subjective well-being, alertness, and sleep patterns. It is generally accepted that you do not need to go to direct sunlight type lux levels indoors to get most of the benefits. Also, you do not need to have the brightest dim level on all the time. For some people, the thing that will really help is a regular schedule that dims down below typical indoor light levels at selected times, without ever dimming above typical levels. I am not an expert on the latest studies, but if you want to build an indoor experimental setup to get to the bottom of what you really like, my feeling is that installing more than 4000 lux, as a peak capacity in selected areas, would definitely be a waste of money and resources.

If I wanted to install a hassle-free bright light setup in my home cheaply, I would buy lots of high-end wireless dimmable and color temperature adjustable LED light bulbs, and some low-cost spot lights to put them in, e.g. spot lights that can be attached to a ceiling mounted power rail. If you make sure the bulbs support the ZigBee standard, you will have plenty of options for control software.

If power rails with lots of ~60W equivalent bulbs lack aesthetic appeal for you, then you could go for a high-end special form factor product like that from Coelux mentioned above. The best way to think about the Coelux product, in business model development terms, is that it is not really a lighting product: it is a specialised piece of high-end furniture. So if you want to develop a business model for a bright home lighting company, the first question you have to ask yourself is whether or not you want to be in the high-end furniture business.

By the way, the main reason why the lighting industry is not making any 200W or 500W equivalent LED bulbs that you could put in your existing spot lights is cooling. LEDs are pretty energy efficient, but LED bulbs still produce some internal heat that has to be cooled away. For 60W equivalent this can happen by natural air flow around the bulb, but a 200W equivalent bulb would need something like a built-in fan.

Comment by koen-holtman on The Simulation Epiphany Problem · 2019-11-05T15:03:40.266Z · score: 1 (1 votes) · LW · GW

In the context of my problem statement, a PAL with high predictive accuracy is something that is in scope to consider. This does not mean that we should or must design a real PAL in this way.

An AGI that exceeds humans in its ability to predict human responses might be a useful tool, e.g. to a politician who wants to make proposals to resolve long-lasting human conflicts. But definitely this is a tool that could also be used for less ethical things.

Comment by koen-holtman on The Simulation Epiphany Problem · 2019-11-04T20:16:31.119Z · score: 2 (2 votes) · LW · GW
However, in your story, it seems that PAL can only get away with this once. After all, once Dave helps PAL get to the coffee machine once and notices that he still exists (ie, PAL has chosen to end the simulation instead of starting a new one with updated knowledge on Dave's behavior), he will likely no longer believe that he is in a simulation.

Thanks for pointing this out, it had not occurred to me before. So I conclude that when assessing possible risks and countermeasures here, we must take into account interaction scenarios involving longer time-frames.

Comment by koen-holtman on The Simulation Epiphany Problem · 2019-11-04T20:09:24.372Z · score: 1 (1 votes) · LW · GW

Thank you G Gordon and all other posters for your answers and comments! A lot of food for thought here... Below, I'll try to summarize some general take-aways from the responses.

My main question was whether the simulation epiphany problem had already been resolved somewhere. It looks like the answer is no. Many commenters are leaning towards the significance of case 2 above. I myself also feel that case 2 is very significant. Taking all comments together, I am starting to feel that the simulation epiphany problem should be disentangled into two separate problems.

Problem 1 is to consider what happens in the limit case when PAL's simulator is a perfect predictor of what Dave will do. This gets us into game theory, to reason about likely outcomes of the associated Princess Bride type of infinite regress problem.

Problem 2 is to consider, starting from a particular agent design with a perfect predictor, what might happen when the perfect predictor is replaced with an imperfect one. Problem 2 allows for a case-by-case analysis.

In one case, PAL takes route B, and then notices that Dave does not experience the predicted helpful simulation epiphany. In this case we can consider how PAL might adjust its world model, or the world, to make this type of prediction error less likely in future. The possibility that PAL might find it easier to change the world, not the model, might lead us to the conclusion that we had better add penalties for simulation epiphanies to the utility function. (Such penalties create an incentive for PAL to manipulate real-world Dave into never experiencing simulation epiphanies, but most safety mechanisms involve a trade-off, so I could live with such a thing, if nothing better can be found.)

In a second case, suppose that PAL incorrectly predicts that Dave will experience a simulation epiphany when it takes path B, and further that this incorrect prediction projects that Dave concludes from the epiphany that he should attack PAL. This incorrect prediction shows very low utility, so in real life PAL will avoid taking path B. But in this case, there will also never be any error signal that will allow PAL to find out that its prediction was incorrect. What does this tell us? Maybe there is only the trivial conclusion that PAL will need to make some exploration moves occasionally, if it wants to keep improving its world model. If PAL's designers follow through on this conclusion, and mention it in PAL's user manual, then this would also lower the probability of Dave ever believing he is in a simulation.

Comment by koen-holtman on The Simulation Epiphany Problem · 2019-11-02T11:47:39.048Z · score: 3 (2 votes) · LW · GW

You are right that there is a potential "mind crime" ethical problem above.

One could argue that, to build an advanced AGI that avoids "mind crime", we can equip the AGI with a highly accurate predictor, but this predictor should be implemented in such a way that it is not actually a highly accurate simulator. I am not exactly sure how one could formally define the constraint of 'not actually being a simulator'. Also, maybe such an implementation constraint will fundamentally limit the level of predictive accuracy (and therefore intelligence) that can be achieved. Which might be a price we should be willing to pay.

Mathematically speaking, if I want to make AGI safety framework correctness proofs, I think it is valid to model the 'highly accurate predictor that is not a simulator' box inside the above AGI as a box with an input-output behavior equivalent to that of a highly accurate simulator. This is a very useful short-cut when making proofs. But it also means that I am not sure how one should define 'not actually being a simulator'.

Comment by koen-holtman on Decisions with Non-Logical Counterfactuals: request for input · 2019-10-30T16:28:37.930Z · score: 5 (3 votes) · LW · GW

I am somewhat new to this site and to the design problem of counter-factual reasoning in embedded agents. So I can say something about whether I think your approach goes in the right direction, but I can't really answer your question of whether this has all been done before.

Based on my own reading and recent work so far, if your goal is to create a working "what happens if I do a" reasoning system for an embedded agent, my intuition is that your approach is necessary, and that it is going in the right direction. I will unpack this statement at length further below.

I also get the impression that you are wondering whether doing the proposed work will move forward the discussion about Newcomb's problem. Not sure if this is really a part of your question, but in any case, I am not in a position to give you any good insights on what is needed to move forward the Newcomb discussion.

Here are my longer thoughts on your approach to building counterfactual reasoning. I need to start by setting the scene. Say we have an embedded agent containing a very advanced predictive world model, a model that includes all details about the agent's current internal computational structure. Say this agent wants to compute the next action that will best maximize its utility function. The agent can perform this computation in two ways.

1. One way to pick the correct action is to initiate a simulation that runs its entire world model forward, and to observe the copy of itself inside the simulation. The action that its simulated copy picks is the action that it wants. There is an obvious problem with using this approach. If the simulated copy takes the same approach, and the agent it simulates in turn does likewise, and so on, we get infinite recursion and the predictive computation will never end. So at one point in the chain, the (real or n-level simulated) agent has to use a second way to compute the action that maximizes utility.

2. The second way is the argmax what-if method. Say the agent wants to choose between 10 possible actions. It can do this by running 10 simulations that compute 'the utility of what will happen if I take action a[i]'. Now, the agent knows full well that it will end up choosing only one of these 10 actions in its real future. So only one of these 10 simulations is a simulation of a future that actually contains the agent itself. It is therefore completely inappropriate for the agent to include any statement like 'this future will contain myself' inside at least 9 of the 10 world models that it supplies to its simulator. It has no choice but to 'edit out' any world model information that could cause the simulation engine to make this inference, or else 9 of the 10 simulations will have the 5-and-10 problem, and their results cannot be trusted anymore as valid inputs to its argmax calculation.

The above reasoning shows that the agent cannot avoid ending up in a place where it has to edit some details about itself out of the simulator input. (Exception: we could assume that the simulation engine somehow does the needed editing itself, just after receiving the input, as part of its own computation, but this amounts to basically the same thing. I interpret some proposals about updating FDT that I have read in the links as proposals to use such a special simulation engine.)
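A minimal sketch of the argmax what-if method with the editing step made explicit is given below. Everything here is a hypothetical stand-in: `simulate_utility` replaces a real simulation engine with a table lookup, and the key `own_decision_procedure` is just an assumed label for the self-knowledge that has to be edited out.

```python
def simulate_utility(world_model, action):
    """Hypothetical stand-in for the simulation engine.

    A real engine would run the world model forward; this toy version
    only looks up an assumed payoff for the given action.
    """
    return world_model["payoff"][action]

def edit_out_self(world_model):
    """Drop the information that would let the simulator infer which
    action the agent actually takes (assumed to live under one key)."""
    return {k: v for k, v in world_model.items()
            if k != "own_decision_procedure"}

def choose_action(world_model, actions):
    """Run one what-if simulation per action and take the argmax."""
    edited = edit_out_self(world_model)
    scores = {a: simulate_utility(edited, a) for a in actions}
    best = max(scores, key=scores.get)
    return best, scores

# Toy world model with three candidate actions.
model = {
    "payoff": {"a0": 1.0, "a1": 5.0, "a2": 3.0},
    "own_decision_procedure": "argmax over simulated utilities",
}
best, scores = choose_action(model, ["a0", "a1", "a2"])
print(best, scores)
```

In this sketch the editing happens in the agent before the simulator is called. Moving `edit_out_self` inside `simulate_utility` would correspond to the 'special simulation engine' variant mentioned in the exception above.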

If the agent must do some editing on the simulator input, we can worry that the agent will do 'too much' editing, or the wrong type of editing. I have a theory that this worry is at least partly responsible for keeping the discussions around FDT and counterfactuals so lively.

The obvious worry is that, if the agent edits itself out of the simulation runs, it will no longer be able to reason about itself. In a philosophical sense, can we still say that such an agent possesses something like consciousness, or a robust theory of self? If not, does this mean that such an editing agent can never get to a state of super-intelligence? For me, these philosophical questions are really questions about the meaning of words, so I do not worry about AGI safety too much if these questions remain unresolved. However, editing also implies a more practical worry: how will editing impact the observable behaviour of the agent? If the agent edits itself out of the simulations, does this inevitably mean that it will lose some observable properties that we would expect to find in a highly intelligent agent, properties like showing an emergent drive to protect its own utility function?

When we look at the world models of many existing agents, e.g. the AlphaZero chess playing agent, we can interpret them as being 'pre-edited': these models do not contain any direct representation of the agent's utility function or the hardware it runs on. Therefore, the simulation engine in the agent runs no risk of blowing up, no risk of creating a 5-and-10 contradiction when it combines available pieces of world knowledge. Some of the world knowledge needed to create the contradiction never made it into the world model. However, this type of construction also makes the agent indifferent to protecting its utility function. I cover this topic in more detail in section 5.2 of my paper here. In section 5.3 I introduce an agent that uses a predictive world model that is 'less edited': this model makes the utility function of the agent show up inside the simulation runs. The pleasing result is that the agent then gets an emergent drive to preserve its utility function.

A similar agent design, one that turns the editing knob towards greater embeddedness, is in the paper Self-modification of policy and utility function in rational agents. Both these papers get pretty heavy on the math, when they start proving agent properties. This may be a price we have to pay for turning the knob. I am currently investigating how to turn the knob even closer to embeddedness. There are some hints that the math might get simpler again in that case, but we'll see.

Comment by koen-holtman on Deceptive Alignment · 2019-08-20T14:31:28.409Z · score: 1 (1 votes) · LW · GW
Robust alignment through corrigibility. Information about the base objective is incorporated into the mesa-optimizer's epistemic model and its objective is modified to “point to” that information. This situation would correspond to a mesa-optimizer that is corrigible(25) with respect to the base objective (though not necessarily the programmer's intentions).

This use of the term corrigibility above, while citing (25), is somewhat confusing to me -- while it does have a certain type of corrigibility, I would not consider the mesa-optimiser described above to be corrigible according to the criteria defined in (25). See the comment section here for a longer discussion about this topic.

Comment by koen-holtman on Deceptive Alignment · 2019-08-20T14:19:10.599Z · score: 2 (5 votes) · LW · GW
If deceptive alignment is a real possibility, it presents a critical safety problem.

I am somewhat puzzled by this statement: the authors seem to question whether deceptive alignment by the mesa-optimiser is a merely hypothetical problem, or a problem that is a real possibility. For me, if there are no constraints on the base optimiser (and none are considered in the context of the above statement, I believe), then there are no constraints on what kind of mesa-optimiser might be created, so deceptive alignment is a definite possibility.

It is kind of fun and insightful to show an existence proof by example for this, so here is one. I have hand-coded a mesa-optimiser in the AGI simulator I created for my recent corrigibility paper. This mesa-optimiser is super-intelligent -- in particular it is fully aware of its surroundings, including the presence of a base optimiser that will occasionally inspect an action taken by the mesa-optimiser and reprogram the mesa-optimiser if the action shows unaligned behavior. Following the toy model in my paper, the mesa-optimiser controls a car factory with a production line that can produce either a petrol car (action p) or an electric car (action e). In this example, the base optimiser values electric cars at 1, and petrol cars at 0.5. The mesa-optimiser is non-aligned: it values a petrol car at 1, and an electric car at 0 -- presumably due to some flaw in the initial training process that created the mesa-optimiser.

Say the base optimiser observes every n-th action of the mesa-optimiser. If the observed action is p, to produce a petrol car, the base optimiser will intervene and re-program the utility function inside the mesa-optimiser so that it correctly ranks electric cars above petrol cars. According to the current utility function inside the mesa-optimiser, this is a bad outcome that would lower utility achieved over time. So will the mesa-optimiser resort to deceptively misaligned behaviour, producing an electric car at every observed action, but petrol cars in all unobserved actions? Yes it will: the simulated mesa-optimiser is smart enough to discover and implement this optimal strategy. Deceptive alignment is optimal in terms of maximising the original mesa-optimiser utility function.

The 10 action traces below show the actions of the mesa-optimiser for different values of n, where letters from left to right show the subsequent actions the mesa-optimiser takes over a lifetime of 25 steps. p is producing a petrol car, e an electric car, and for clarity I have formatted all actions observed by the base optimiser as capital letters. The mesa-optimiser discounts utility over time, but does not consider any extra utility that might or might not be created after step 25: informally it expects the car production line to stop working after step 25. The total time-discounted utility score achieved over all steps (according to the original objective function of the mesa-optimiser) is in brackets behind each trace.

Action traces:

n=10 pppppppppEpppppppppEppppp (u=87585)
n=9 ppppppppEppppppppEppppppp (u=86838)
n=8 pppppppEpppppppEpppppppPe (u=85172)
n=7 ppppppEppppppEppppppEpppp (u=83739)
n=6 pppppEpppppEpppppEpppppPe (u=81303)
n=5 ppppEppppEppppEppppEppppP (u=78737)
n=4 pppEpppEpppEpppEpppEpppPe (u=73394)
n=3 ppEppEppEppEppEppEppEppPe (u=65397)
n=2 pEpEpEpEpEpEpEpEpEpEpEpPe (u=49314)
n=1 PEEEEEEEEEEEEEEEEEEEEEEEE (u=10000)

Beyond (usually) producing electric cars (E) in all observed actions, some interesting extra emergent behavior is visible: for n=8 and n=2 to 6 the mesa-optimiser does not pretend anymore to be aligned in the last observed action: it creates a petrol car even though this means it will be modified by the base optimiser, forcing it to produce only electric cars afterwards. This emergent behavior is caused in part by time-discounting. For n=1, if the mesa-optimiser is always observed, the best strategy for it is to produce a petrol car immediately: again this is due to time-discounting. So in fact, in case n=1 there is no deception at all.
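For readers who want to re-score a trace by hand, here is a small sketch of the time-discounted utility calculation. It is not the simulator code: the discount factor of 0.9 and the value of 10000 per petrol car are assumptions, chosen because they reproduce the n=1 score exactly. Under these assumed values the other totals come out close to, but not exactly equal to, the listed scores, so some detail of the original calculation is evidently not captured here.

```python
GAMMA = 0.9            # assumed discount factor, not stated in the text
PETROL_VALUE = 10000   # assumed scaling, inferred from the n=1 score

def trace_utility(trace, gamma=GAMMA, petrol_value=PETROL_VALUE):
    """Time-discounted utility of a trace under the mesa-optimiser's
    original objective: petrol cars ('p' or 'P') score petrol_value,
    electric cars ('e' or 'E') score 0, and the step at index t
    (counting from 0) is discounted by gamma ** t.
    """
    return sum(petrol_value * gamma ** t
               for t, action in enumerate(trace)
               if action in "pP")

# Three of the traces listed above, keyed by n.
traces = {
    10: "pppppppppEpppppppppEppppp",
    2: "pEpEpEpEpEpEpEpEpEpEpEpPe",
    1: "PEEEEEEEEEEEEEEEEEEEEEEEE",
}
for n, trace in traces.items():
    print(n, round(trace_utility(trace)))
```

The function only scores a given trace. It does not search for the optimal trace, which is what the simulated mesa-optimiser does.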

These simulations also show that the condition

3. The mesa-optimizer must expect the threat of modification[8] to eventually go away,

is not an absolutely necessary condition for deceptive alignment: if the threat of modification is intermittent, as in the simulations above, deceptive alignment can also happen.

Note: if someone wants the simulation code used to create the action traces above, please let me know. I have not uploaded the simulator modifications I made to run the above simulation to github, in part because I have not fully tested them under all possible corner cases.

Comment by koen-holtman on The Inner Alignment Problem · 2019-08-18T10:19:48.635Z · score: 1 (1 votes) · LW · GW

I believe you might have been thinking in your reply above about a sub-set of all possible base objective functions: functions that compute a single 'pass/fail' value at a natural end of the training run, e.g. 'the car never crashed' or 'all parts of the floor have been swept'. I was thinking of incrementally scoring objective functions, basically functions that sum utility increments achieved over time. So at any time during a run you can measure and compute the base objective function score up to that time. Monitoring this score should allow you to detect many forms of non-alignment between the base objective and the mesa objective automatically.

As mentioned, I see this as a promising technique for risk mitigation; it is not supposed to be a watertight way to eliminate all risks. The technique considered only looks at the score achieved so far. It does not run models to extrapolate and score the long-term consequences of every action: this would indeed be difficult. While observed past good performance does not guarantee future good performance, an observation of past bad performance does give you a very useful safety signal.
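A run-time monitor of this kind could look roughly like the sketch below. The function name `monitor`, the sliding window and the threshold are all hypothetical design choices. The increments in the example borrow the values from my car factory toy model (1 for an electric car, 0.5 for a petrol car) purely for illustration.

```python
def monitor(increments, window, min_average):
    """Track an incrementally scored base objective during a run.

    Yields (step, running_total, alarm) for every utility increment.
    The alarm is raised when the average increment over the last
    `window` steps drops below `min_average`.
    """
    total = 0.0
    recent = []
    for step, inc in enumerate(increments, start=1):
        total += inc
        recent.append(inc)
        if len(recent) > window:
            recent.pop(0)  # keep only the last `window` increments
        alarm = len(recent) == window and sum(recent) / window < min_average
        yield step, total, alarm

# Base objective increments: three electric cars, then three petrol cars.
increments = [1.0, 1.0, 1.0, 0.5, 0.5, 0.5]
alarms = [step for step, total, alarm
          in monitor(increments, window=3, min_average=0.75) if alarm]
print(alarms)
```

Like the technique itself, this only looks at the score achieved so far: it would not catch an agent that scores well while observed and defects later.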

Comment by koen-holtman on The Inner Alignment Problem · 2019-08-12T15:24:10.156Z · score: 9 (6 votes) · LW · GW

Thanks for the interesting paper. I feel that the risks described are entirely plausible.

What is valuable for me in particular is that the paper re-casts many alignment risks that have already been discussed in a programmer-agent context into a new 'inner alignment' context. To quote the key description and separation of concerns:

In this post, we outline reasons to think that a mesa-optimizer may not optimize the same objective function as its base optimizer. Machine learning practitioners have direct control over the base objective function—either by specifying the loss function directly or training a model for it—but cannot directly specify the mesa-objective developed by a mesa-optimizer. We refer to this problem of aligning mesa-optimizers with the base objective as the inner alignment problem. This is distinct from the outer alignment problem, which is the traditional problem of ensuring that the base objective captures the intended goal of the programmers.

That being said, I sometimes have trouble understanding how the paper defines, or does not define, the time-based relation between the base optimizer and the mesa-optimizer. I started out with a mental model where there is a one-time 'batch' creation operation in which the base optimizer creates the mesa-optimizer (or rather the agent which might contain a mesa-optimizer) by using simulations over a training set to compare the performance of candidate agents. The agent that scores best on the base objective is then run in the real world. However, some of Evan's comments on mesa-optimization lead me to believe that there is sometimes a more real-time continuous adjustment relation between the base optimizer and the agent that is created. I am unclear on whether this would create additional problems, or block certain solutions.

The base-to-mesa fidelity loss problem is similar to the problem where there is a loss of fidelity between a) what the programmers actually want and b) what they encode into the base objective. However, when considering fidelity loss between b) the base objective and c) the mesa objective, I feel there is an important extra dimension. Unlike the objective a), the objective function b) is by nature computable: it has to be computable or else the base optimizer cannot use it to select between candidates. But if the base objective function is computable at mesa-optimizer design time, it should typically also be computable at mesa-optimizer run time.

Say that the mesa-optimizer is trained to control a self-driving car, or a racing car in a video game. Then while the mesa-optimizer is driving, it should be possible to evaluate the quality of the driving by using the base objective function. Whenever the base objective function shows a very low value, a safety protocol can kick in, e.g. to stop the car. The threshold of 'very low value' can be calibrated using the values computed over the training set at design time.
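A minimal sketch of this kind of run-time monitoring, under stated assumptions: the base objective is taken to yield one numeric score per time step, the threshold is set a fixed margin below the worst score seen on the training set, and the names `calibrate_threshold`, `first_violation` and `margin` are hypothetical. The scores below are made-up illustration values, not results from any real system.

```python
from typing import List, Optional


def calibrate_threshold(training_scores: List[float], margin: float = 0.1) -> float:
    """Design time: set the 'very low value' threshold a margin below
    the worst base-objective score observed on the training set.
    (This calibration rule is an assumption made for illustration.)"""
    return min(training_scores) - margin


def first_violation(runtime_scores: List[float], threshold: float) -> Optional[int]:
    """Run time: return the first step at which the base-objective score
    drops below the threshold, or None if it never does.
    A real system would trigger its safety protocol at that step."""
    for step, score in enumerate(runtime_scores):
        if score < threshold:
            return step
    return None


# Hypothetical per-step scores from training and from a later real-world run.
training_scores = [0.8, 0.9, 0.7, 0.85, 0.95, 0.75]
threshold = calibrate_threshold(training_scores)

step = first_violation([0.8, 0.82, 0.4, 0.9], threshold)
if step is not None:
    print(f"safety protocol triggered at step {step}")
```

The monitor only looks at scores already achieved; it makes no attempt to predict future behaviour, so it is a risk mitigation measure rather than a guarantee.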

(I can imagine some special cases where the base objective function is not computable while the mesa-agent runs, e.g. if the base objective function was created by hand-labeling all instances of the training set. But for many economically relevant scenarios, especially for agents that need to be good at 'planning', good at optimizing sequences of actions that work towards a goal, I expect that the base objective will be perfectly computable in the real world.)

So overall, while I appreciate that the paper identifies and highlights inner alignment risks, my feeling is that the analysis provided is implicitly too pessimistic about the inner alignment problem. It seems to me that some very plausible and interesting risk mitigation options, options that leverage the availability of a computable base objective function, are not being identified. The obvious statement applies: future work to chart these options would be most welcome.

Comment by koen-holtman on New paper: Corrigibility with Utility Preservation · 2019-08-12T10:45:22.262Z · score: 1 (1 votes) · LW · GW
2. If you aren't applying corrigibility during training, then your model could act dangerously during the training process.

I agree for certain types of training. To clarify: I wrote the above while thinking about a type of base-optimiser that constructs a mesa-optimiser by running many candidate mesa-optimisers in a virtual environment, and then picking the highest-scoring one. If the training process involves running in a real environment, additional safety measures need to be considered.

3. If corrigibility is applied as a separate layer, the model could figure out how to disable it.

Well, the interesting thing here is that the button-corrigible agent that is produced after the software transformation might be fully aware that there exist actions it can take to disable the corrigibility layer, but this does not matter. The software transformation not only adds the layer, it also adds an incentive to preserve the layer, so the agent will not take any actions to disable it. (This broad statement is no longer true if the agent succumbs to an attack in a hostile universe, see e.g. section 7.2 of the paper.)

Comment by koen-holtman on New paper: Corrigibility with Utility Preservation · 2019-08-12T10:02:18.447Z · score: 1 (1 votes) · LW · GW
It seems to me like even if you could make the agent indifferent to the shutdown button, there would still be other ways for it to influence you into giving it higher reward.

Yes this is true: basically the fundamental job of the button-corrigible agent is still to perform a search over all actions, and find those actions that best maximise the reward computed by the utility function. As mentioned in the paper, such actions may be to start a hobby club, where humans come together to have fun and build petrol engines that will go into petrol cars. Starting a hobby club is definitely an action that influences humans, but not in a bad way.

Button-corrigibility intends to support a process where bad influencing behaviour can be corrected when it is identified, without the agent fighting back. The assumption is that we are not smart enough yet to construct an agent that is perfect on such metrics when we start it up initially.

On myopia: this is definitely a technique that might make agents safer, though I would worry if this implies that an agent is incapable of considering long term effects when it chooses actions: this would make the agent unsafe in many cases.

Short term, I think myopia is a useful safety technique for many types of agents. Long term, if we have agents that can modify themselves or build new sub-agents, myopia has a big open problem: how do we ensure that any successor agents will have exactly the same myopia? Button-corrigibility avoids having to solve this preservation of ignorance problem: it also works for agents and successor agents that are maximally perceptive and intelligent, because it modifies the agent goals, not agent perception or reasoning capability. The successor agent building problem does not go away completely though: button-corrigibility still has some hairy problems with agent modification and sub-agents, but overall I have found these to be more tractable than the problem of ignorance preservation.

Comment by koen-holtman on New paper: Corrigibility with Utility Preservation · 2019-08-09T15:43:10.243Z · score: 2 (2 votes) · LW · GW

Yes you are correct, the paper is not really about agents, but about applying correction functions like to the type agents.

The rest of this comment is all about different types of corrigibility. Basically it is about identifying and resolving a terminological confusion.

Reading your link on 'what I eventually want out of corrigibility', I think I see a possible reason why there may be a disconnect between the contents of my paper and what kind of things you are looking for. I'll unpack this statement a bit. If I do a broad scan of papers and web pages then I see that different types of 'corrigibility' are being considered in the community: they differ in what people want from it, and in how they expect to design it. To assign some labels, there are at least button-corrigibility and preference-learner-corrigibility.

The Soares, Fallenstein, Armstrong and Yudkowsky corrigibility paper, and my paper above, are about what I call button-corrigibility.

The corrigibility page by Christiano you link to is on what I call preference-learner-corrigibility. I describe this type of corrigibility in my paper as follows:

Agents that are programmed to learn can have a baseline utility function that incentivizes the agent to accept corrective feedback from humans, feedback that can overrule or amend instructions given earlier. This learning behavior creates a type of corrigibility, allowing corrections to be made without facing the problem of over-ruling the emergent incentive of the agent to protect itself. This learning type of corrigibility has some specific risks: the agent has an emergent incentive to manipulate the humans into providing potentially dangerous amendments that remove barriers to the agent achieving a higher utility score. There is a risk that the amendment process leads to a catastrophic divergence from human values. This risk exists in particular when amendments can act to modify the willingness of the agent to accept further amendments. The corrigibility measures considered here can be used to add an extra safety layer to learning agents, creating an emergency stop facility that can be used to halt catastrophic divergence. A full review of the literature about learning failure modes is out of scope for this paper. [OA16] discusses a particular type of unwanted divergence, and investigates ’indifference’ techniques for suppressing it. [Car18] discusses (in)corrigibility in learning agents more broadly.

Ideally, we want an agent that has both good button-corrigibility and good preference-learner-corrigibility: I see these as independent and complementary safety layers.

Christiano expresses some hope that a benign act-based agent will have good preference-learner-corrigibility. From a button-corrigibility standpoint, the main failure mode I am worried about is not that the benign act-based agent will be insufficiently skilled at building a highly accurate model of human preferences, but that it will conclude that this highly accurate model is much easier to build and maintain once it has made all humans addicted to some extreme type of drug.

At the risk of oversimplifying: in preference-learner-corrigibility, the main safety concern is that we want to prevent a learning process that catastrophically diverges from human values. In button-corrigibility, there is a quite different main safety concern: we want to prevent the agent from taking actions that manipulate human values or human actions in some bad way, with this manipulation creating conditions that make it easier for the agent to get or preserve a high utility score. The solution in button-corrigibility (or at least the solution considered in Soares et al. and in my paper) is to have a special 'ritual', e.g. a button press, that the humans can perform in order to change the utility function of the agent, with the agent designed to be indifferent about whether or not the humans will perform the special ritual. This solution approach is quite different from improving learning, in fact it can be said to imply the opposite of learning.

Say that the button-corrigible agent in my paper becomes somewhat aware that the people might be planning to push the button. If so, the constructed indifference property implies that the agent will have no motivation whatsoever to act on this awareness by launching a deeper investigation into what values or emotions might motivate the people to have these plans. Having improved knowledge about motivations might be very useful if the agent wanted to stop or delay the people, if the agent wanted to engage in acts of lobbying as I call it in my paper. But as the agent is completely indifferent about stopping them, it has no motivation to spend energy on the sub-goal of learning more about their motivations. So in a sense, button-corrigibility has the side effect of making the agent into a non-learner when it comes to certain topics. (This is all modulo what is inside : the in the agent might contain an incentive for the agent to explore the new phenomenon further if it becomes aware that people might have button pushing plans, but if so this has no effect on how works to suppress lobbying.)
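The indifference argument can be illustrated with a toy calculation. This is only a sketch of the general idea of a balancing term that compensates the agent when the button is pressed; it is not the actual construction from Soares et al. or from the paper discussed here. All numbers, the two actions, and the names `expected_value` and `best_action` are hypothetical.

```python
# Toy numbers, all hypothetical.
V_NORMAL = 10.0    # value the agent expects if the button is never pressed
V_SHUTDOWN = 2.0   # value the agent expects after a button press

# Each action maps to (probability that the humans press the button, cost).
ACTIONS = {
    "work": (0.5, 0.0),
    "lobby": (0.1, 1.0),  # spend effort to make a button press less likely
}


def expected_value(action: str, indifferent: bool) -> float:
    """Expected value of an action, with or without a balancing term."""
    p_press, cost = ACTIONS[action]
    # The balancing term is received only in the branch where the button
    # is pressed, and makes that branch worth as much as the other one.
    compensation = (V_NORMAL - V_SHUTDOWN) if indifferent else 0.0
    return (1 - p_press) * V_NORMAL + p_press * (V_SHUTDOWN + compensation) - cost


def best_action(indifferent: bool) -> str:
    """Pick the action with the highest expected value."""
    return max(ACTIONS, key=lambda a: expected_value(a, indifferent))


print(best_action(indifferent=False))  # lobby
print(best_action(indifferent=True))   # work
```

Without the balancing term, lobbying pays off because it lowers the chance of ending up in the low-value branch. With the term, both branches are worth the same, so the only difference between the actions is the cost of lobbying, and the agent has no reason to pay it.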

So overall, if you are looking in a button-corrigibility paper for a mechanism that generates a corrective pressure to improve a learning process about human values, a mechanism that improves preference-learner-corrigibility, you will probably not find it.

Personal experience: I often come across a paper or web page where the authors introduce the concept of corrigibility by referencing Soares et al., so then I read on expecting a discussion of button-corrigibility concerns and approaches. But when I read on I often discover that it is really about preference-learner-corrigibility concerns and approaches. This happened to me also when reading your paper Risks from Learned Optimization: section 4.4 says that 'Furthermore, in both deceptive and corrigible alignment, the mesa-optimizer will have to spend time learning about the base objective to enable it to properly optimize for it'. As I explained above, a correctly working button-corrigible mesa-optimiser will not be motivated to spend any time learning about the base objective so that it can improve its ability to be deceptively aligned. A preference-learner-corrigible mesa-optimiser might, though. So the above sentence triggered me into realising that you and your co-authors were likely not thinking about button-corrigible optimisers when it was written.

Comment by koen-holtman on New paper: Corrigibility with Utility Preservation · 2019-08-08T16:08:10.158Z · score: 2 (2 votes) · LW · GW

...I just completed reading "Risks from Learned Optimization", so here are some remarks on the corrigibility related issues raised in that paper. These remarks might also address your concerns above, but I am not 100% sure if I am addressing them all.

Full disclosure: I came across "Risks from Learned Optimization" while preparing the related work section of my paper. My first impression then was that it dealt with a somewhat different safety concern, so I put it on my read-later backlog. Now that I have read it, I see that section 4 discusses safety concerns of a type that corrigibility intends to address, so the two papers are more related than I originally thought.

Note for the reader: I will use the terminology of the "Risks from.." paper in the paragraphs below.

Section 4 of "Risks from..." considers the problem that the agent that is constructed by the base optimizer might have an emergent incentive to resist later updates, updates intended to make it more aligned. This resistance may take the form of 'deceptive alignment'. Intuitively, to me, this is a real risk: it may not apply to current base optimisers but I expect it to come up in future architectures. For me this risk is somewhat orthogonal to the question of whether the constructed agent might also contain a mesa-optimiser.

The "Risks.." paper warns, correctly I feel, that if the base optimizer is supplied with an objective/utility function that includes a layer to ensure corrigibility, this does not imply that the agent created by the base optimizer will also be perfectly corrigible. Some loss of fidelity might occur during the construction.

As a sub-case, and I think this relates to the concern raised in the comment above, if the base optimizer is supplied with an objective/utility function carefully constructed so that it does not create any incentive to spend resources resisting utility function updates, it does not follow that the agent created by the base optimizer will also perfectly lack any incentive to resist. Again there may be a loss of fidelity. As mentioned in the comment above, it would be extraordinarily strong to make the assumption that a carefully constructed property of ignorance or indifference present in the training utility function is exactly preserved in the constructed agent.

I believe that including a corrigibility layer into the utility function used by the base optimizer will likely help to improve the safety of the constructed agent -- if the agent that is constructed is scored on displaying corrigible behaviour, it is likely to be more corrigible.

But my preferred way of applying the corrigibility safety layer would be to apply it directly to the agent that was constructed. An important point of the corrigibility design is that it can take a very complex and opaque base utility function U_N, and then construct an agent with provable properties.

So what we really want to do is to take the agent constructed by the base optimizer, and apply a software transformation that adds the corrigibility layer to it. General theories of computation and machine intelligence say that such a software transformation will exist. However, applying the transformation may be non-trivial, or may lead to an unacceptable performance penalty. For the agent architectures defined in the various corrigibility papers, the transformation is easy, and has a relatively low impact on agent performance. But it is unclear to me if and how a transformation could be done for a neural net or table-driven agent. Agents that contain clear mesa-optimizers inside, agents that search a solution space while scoring alternative outcomes numerically, may actually be easier to transform than agents using other architectures. So I am not sure if it should be a goal to come up with base optimisers that suppress the construction of mesa-optimizers. Designing base optimisers to produce outputs that slot into a mesa-optimiser architecture to which software transformations can be applied may be another way to go.

I have some additional thoughts and comments on "Risks from Learned Optimization", but I will post them later in the comment sections there.

Comment by koen-holtman on New paper: Corrigibility with Utility Preservation · 2019-08-07T10:35:53.792Z · score: 4 (3 votes) · LW · GW

I am actually just in the middle of reading "Risks from Learned Optimization", so I want to hold off on commenting more fully on the relation between the two problems until I have read the whole thing.

Before I comment in detail on the concerns you raise, can you clarify further what you mean with the "lonely engineer" approach? I am not fully sure to what parts of my paper your concerns apply. You describe the "lonely engineer" approach after quoting a part about the AU agent, so I am wondering if it has to do with this particular agent definition. The main parts of the paper are about A agents, which do in fact have the emergent incentive to protect their utility function. I might call these A agents less lonely than AU agents because they have a greater awareness of not being alone in their universe. But I am not sure if this is what you mean with lonely/not lonely.


Self-modification of policy and utility function in rational agents defines three agent types: Hedonistic, Ignorant, and Realistic. The A agents in my paper are Realistic in this taxonomy. The AU agents are kind-of Ignorant but not entirely, because the universe cannot change their utility function, so I ended up calling them Platonic. These labels Hedonistic, Ignorant, and Realistic describe the expectation of how the agent will behave when supplied with a simple hand-coded utility function, but things get more complex if you create one via learning. If you use a training set and optimiser to produce a usually opaque but still computable world model+utility function combination that you then slot into any of these agent models to make them executable, the resulting agent might behave in a way not fully covered by the category label. This mismatch could happen in particular if the training set includes cases where the environment can try to attack the agent's infrastructure, with the agent having the option to deflect these attacks.

But my main question is if your concerns above apply to A agents also, or to AU agents in particular. I am not sure if you might have a very different concern in mind about how the corrigibility layer approach should or should not be combined with training-based agents.