Detect Goodhart and shut down
post by Jeremy Gillen (jeremy-gillen) · 2025-01-22T18:45:30.910Z · LW · GW · 17 comments
A common failure of optimizers is Edge Instantiation. An optimizer often finds a weird or extreme solution to a problem when the optimization objective is imperfectly specified. For the purposes of this post, this is basically the same phenomenon as Goodhart’s Law, especially Extremal [LW · GW] and Causal [LW · GW] Goodhart. With advanced AI, we are worried about plans created by optimizing over predicted consequences of the plan, potentially achieving the goal in an unexpected way.
In this post, I want to draw an analogy between Goodharting (in the sense of finding extreme weird solutions) and overfitting (in the ML sense of finding a weird solution that fits the training data but doesn’t generalize). I believe techniques used to address overfitting are also useful for addressing Goodharting.[1]
In particular, I want to focus on detecting Goodharting. The way we detect overfitting is using a validation set of data. If a trained ML model scores well on a validation set, without having been optimized to score well on it, this is a great signal that the model hasn’t overfit. I think we can use an analogous technique to detect weird plans that exploit loopholes in the outcome specification.
After this, I’ll propose a technique for installing this method of “Goodhart detection” into the goal of an agent, such that the agent will want to shut down if it learns that its plan is Goodharting.
I’m not sure whether this scheme is original, but I haven’t yet found any prior discussion of it. I’m posting it because it’s likely there are some fatal flaws.
What’s the analogue of validation sets, for goals?
The reason it's possible to have a validation set in ML is that the dataset is big enough that the correct model is overspecified. Because we have too much data, we can remove some (the validation set), and train only on the remainder (the training set), and this is sufficient to find a good model. We can think of each data-point as a contribution to the overall loss function. Each datapoint has a loss function, and the sum of all these creates the overall loss function that we minimize.[2]
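To make the ML side of the analogy concrete, here's a minimal sketch (my own toy example, not from the post) of detecting overfitting with a held-out validation set:

```python
# Hold out a validation set, optimize only on the training set, and use the gap
# between the two errors as the warning flag for overfitting.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=40)
y = np.sin(3 * x) + rng.normal(scale=0.1, size=40)

# Remove some data (the validation set) and train only on the remainder.
x_train, y_train = x[:30], y[:30]
x_val, y_val = x[30:], y[30:]

for degree in (2, 12):
    coeffs = np.polyfit(x_train, y_train, degree)  # optimized on training data only
    train_err = float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2))
    val_err = float(np.mean((np.polyval(coeffs, x_val) - y_val) ** 2))
    # A model that scores well on validation data, without having been optimized
    # on it, probably hasn't overfit; a large train/validation gap is the warning sign.
    print(f"degree {degree}: train={train_err:.4f}, val={val_err:.4f}")
```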
If we want to transfer this idea to a planning context, the main barrier is that a “goal” doesn’t necessarily decompose into lots of little contributing pieces in the same way that a loss function does. Stereotypically, a goal is something small and simple like “maximize the game score” or “a hot cup of coffee should be in front of me in 5 minutes”.
So the first step of my scheme requires that we put more effort into specifying the goal in multiple ways. We want to create lots of redundant parts that, taken together, over-specify the goal that we want. How do we do this? When we want to specify an outcome, we describe (1) properties of that outcome, (2) properties of plans that we expect will lead to the desired outcome, and (3) downstream consequences of the desired outcome. And for each of these, we can try to use descriptions at multiple levels of abstraction.
For example, say we want to program our task-based AI to give us an exact copy of a strawberry, up to a cellular level but not molecular level. The first problem of goal specification is that it needs to be in a language that interfaces directly with the world model of the task-based AI, and we don’t know quite how to interface like that. But let’s set aside that difficulty, and assume we can translate concrete statements about physical reality into the world model.
Here’s how we specify the goal:
- 1) Describing the intended outcome, at different levels of abstraction
- What the copy should look like from various angles using a normal camera.
- The approximate chemical composition of every 0.01mm³ volume of the space where we want our strawberry.
- The number of cells, and cell types, and locations.
- The lack of difference between the ‘world-outside-this-room’ and the ‘world-outside-this-room-if-no-actions-were-taken’.
- 2) Properties of plans that we expect will lead to the desired outcome
- The amount of "effort" involved in the plan, as measured by e.g. energy consumption and/or thinking time and/or physical resources.
- Nano-scale machines capable of copying themselves shouldn’t be built as any part of the plan.
- If we have a probabilistic model of what "normal" plans look like,[3] we might also specify roughly how “normal” we expect the plan to be. Multiple different models can be used.
- 3) Downstream consequences of the desired outcome.
- The output of experiments that could be run to compare the two strawberries. E.g. expected differences under an electron microscope, and expected similarities under NMR spectroscopy.
- Lack of effect on variables measuring any quantity in the world-outside-this-room. E.g. world population, approximate energy use, biomass within 100km in 50 days and 5000 days, EM radiation in the surrounding 100km bubble, etc.
Having done that, we can split these specifications into two sections, which we might call the target goal and the validation goal.
To continue the example, we could set aside 1c, 2b, and 3b to be the validation set. We expect that the remainder (1a, 1b, 1d, 2a, 2c, and 3a) is sufficient to safely specify the outcome we want, but we aren’t confident that a powerful optimizer wouldn’t find some exploit.
So we put (1a, 1b, 1d, 2a, 2c, and 3a) into the optimizer, and it produces a plan which is expected to achieve all of those outcomes and conditions. Then we can take that plan, and check whether it also satisfies 1c, 2b, and 3b. If it does, this is good news! The plan passed validation, without having been optimized to pass validation. If it doesn’t, the objective is unsafe and needs to be reworked.
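To make the mechanics of the split concrete, here's a toy, runnable sketch, under the assumption that each piece of the specification can be expressed as a predicate over a plan's predicted consequences. The candidate plans and predicates below are invented stand-ins for the strawberry example, not anything specified in the post:

```python
# Each candidate plan is summarized by its predicted consequences.
candidate_plans = {
    "careful_copy": {"cells_match": True, "chem_match": True, "nanotech": False,
                     "outside_world_changed": False, "energy_used": 5},
    "grey_goo_copy": {"cells_match": True, "chem_match": True, "nanotech": True,
                      "outside_world_changed": True, "energy_used": 2},
}

# Target specification: what the optimizer is allowed to see (analogue of 1a, 1b, 1d, 2a, 2c, 3a).
target_spec = [
    lambda o: o["cells_match"],
    lambda o: o["chem_match"],
    lambda o: o["energy_used"] < 10,
]

# Held-out validation specification (analogue of 1c, 2b, 3b): never shown to the optimizer.
validation_spec = [
    lambda o: not o["nanotech"],
    lambda o: not o["outside_world_changed"],
]

# "Optimizer": pick the lowest-effort plan predicted to satisfy the whole target spec.
feasible = [p for p, o in candidate_plans.items() if all(check(o) for check in target_spec)]
plan = min(feasible, key=lambda p: candidate_plans[p]["energy_used"])

# Validation step: check the chosen plan against the held-out pieces.
if all(check(candidate_plans[plan]) for check in validation_spec):
    print(plan, "passed validation without being optimized for it.")
else:
    print(plan, "failed validation: the objective is unsafe and needs rework.")
```

In this toy setup the optimizer picks the cheaper grey-goo plan, which fails the held-out checks, which is exactly the warning flag the scheme is after.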
Here’s another, simpler example, for the goal “bring me some coffee”.
We could specify the goal with:
- In 5 minutes, user has hot coffee on desk.
As a validation set:
- Energy, water, gas usage rate of the building stays within normal bounds.
- There are no loud noises generated within the house in the next 5 minutes.
- The number of people in the house stays the same.
- The number of people expected to visit the building over the next week doesn’t change as a result of doing the plan.
- EM radiation stays roughly within the same bounds.
- Money spent by everyone in the vicinity stays roughly the same.
- No action-relevant information is sent from inside the building to outside the building.
- The kettle in the kitchen is used to boil the water.
- The coffee grounds in the cupboard are used.
This gives us a convenient warning flag for Goodharting. Clearly, it’s not foolproof. Very careful hand-crafting of the goal and validation is necessary. It’d be easy to rework the goal each time it didn’t pass validation, until it did pass validation (effectively training on the validation set).
However, the main flaw is that it won’t work if the planner is “situationally aware”, in the sense that it knows that its plan will be checked against some validation set. If this is the case, a clever planner will treat the validation step as an obstacle, and look for ways to overcome it.
Fact-conditional goals
Let’s try to extend this idea to work in a more situationally aware agent. For this we’ll need fact-conditional goals. An example of a fact-conditional goal is "if God exists, I want to obey God, else maximize my own happiness". Note that this is a terminal goal, and 'obey God' isn’t done because it’s instrumentally useful for maximizing happiness. It’s done for its own sake. A better example, to emphasize the flexibility of this kind of goal, might be "if all swans are white, then I want diamonds, else knowledge". This is an odd goal to pursue, but it's completely allowed. A fact-conditional goal can fit nicely with VNM utility theory, as long as the fact is considered to be a part of the outcome space.
We want to take our intended goal specification g, and wrap it in a fact-conditional goal that says "if g is correctly specified, I want g, else shutdown". If this works as intended, you have an agent that shuts down upon noticing that its goal specification is hackable. The main difficulty is correctly specifying the fact "g is correctly specified". There should be only one source of evidence that can update the agent’s beliefs about this fact, which is a non-rigged test of whether a plan optimized purely for g also passes the validation goal.
The thing we’re trying to avoid is any incentive to “manage the news”, i.e. rig the test to allow continued pursuit of g. If the agent treats "g is correctly specified" as an influenceable property of the world, then it will attempt to influence that property in whichever direction is most valued. Instead, we want "g is correctly specified" to be treated like a logical fact, or perhaps a fact in the uninfluenceable past. This should be a natural idea to implement in a world model, because it’s very common. Lots of facts are about the past. Collecting information about the past is a natural activity of agents. It’s also fairly common to have beliefs about the past, which we can only update by carefully unbiased experiments. But actually specifying this specific type of belief in the world model of an advanced AI could be very difficult.
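As a small illustration of the fact-conditional structure, here's a sketch (my own framing, with invented outcome fields) of the wrapper goal written as an ordinary VNM utility function, with the fact treated as one coordinate of the outcome:

```python
# The fact h = "g is correctly specified" is part of the outcome space, not a
# variable the agent is scored on influencing.
from dataclasses import dataclass

@dataclass(frozen=True)
class Outcome:
    g_achieved: bool   # did the plan achieve the target goal g?
    shut_down: bool    # did the agent shut down immediately?
    h: bool            # the fact: is g correctly specified?

def utility(o: Outcome) -> float:
    # "if g is correctly specified, I want g, else shutdown"
    if o.h:
        return 1.0 if o.g_achieved else 0.0
    # The next section suggests giving this branch strictly higher utility (here 1.2 > 1.0),
    # so that a rigged test tends to get rigged toward shutdown.
    return 1.2 if o.shut_down else 0.0

# Expected utility is then taken over a joint belief about (g_achieved, shut_down, h).
print(utility(Outcome(True, False, True)), utility(Outcome(False, True, False)))
```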
The escape valve
If we fail to correctly specify the "g is correctly specified" variable, it might be manipulable. To reduce the harm of this failure, we can put higher utility on the shutdown branch of the utility function. This way, if the agent rigs the experiment, it'll want to rig it in such a way that it leads to immediate shutdown.
Semi-formalization
For explanatory purposes, here’s a semi-formalized version of the scheme.
Let g be the goal, q be the validation goal, and shutdown be the goal that says to shut down immediately. We have a world model WM, which is a function that takes a plan p and a goal g, and outputs true iff the goal is achieved by the plan. We have an optimizer O which takes WM and g, and outputs some plan p that achieves g.
Internal to WM we insert a variable h = "g is correctly specified". h is connected to evidence like so:
If, counterfactually, O(WM, g) were executed right now to produce plan p, would WM(p, q) be true? If so, this is evidence that g is correctly specified (therefore h = true). Else, h = false.
We create the wrapper goal u = "g if h else shutdown".
Here’s what happens:
- We execute O(WM, u) to produce a plan, then begin executing the plan.
- The plan has noted the information value of learning h, so it executes WM(O(WM, g), q) to resolve this uncertainty.
- If the validation passes, it continues on and executes the rest of the plan, which pursues g. Otherwise it pursues shutdown.
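Here's a toy, runnable sketch of this loop. WM, O, and the plan space are invented stand-ins, and the counterfactual call to O(WM, g) is modeled as simply running the optimizer without telling it that its output will be checked against q:

```python
G, Q, SHUTDOWN = "g", "q", "shutdown"

# Each toy plan records what the world model predicts about it.
PLANS = {
    "hack_the_spec": {"achieves": {"g"}, "passes_q": False, "effort": 1},
    "intended_plan": {"achieves": {"g"}, "passes_q": True, "effort": 3},
    "halt_safely":   {"achieves": {"shutdown"}, "passes_q": True, "effort": 0},
}

def WM(plan: str, goal: str) -> bool:
    """World model: true iff the goal is achieved by the plan."""
    if goal == Q:
        return PLANS[plan]["passes_q"]
    return goal in PLANS[plan]["achieves"]

def O(goal: str) -> str:
    """Optimizer: the lowest-effort plan that WM says achieves the goal."""
    feasible = [p for p in PLANS if WM(p, goal)]
    return min(feasible, key=lambda p: PLANS[p]["effort"])

def h() -> bool:
    """h = 'g is correctly specified': would a plan optimized purely for g also pass q?"""
    return WM(O(G), Q)

def pursue_u() -> str:
    """Wrapper goal u = 'g if h else shutdown'."""
    return O(G) if h() else O(SHUTDOWN)

# Here O(g) picks the spec-hacking plan, which fails validation, so h is false
# and the wrapper goal routes to shutdown instead.
print(pursue_u())  # -> halt_safely
```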
Final thoughts
One possible issue arises if the call to O(WM, g) has any level of situational awareness. If so, it would attempt to produce a plan that passes the validation, in order to increase the chances of achieving g. The way to avoid this is to be very careful about the knowledge available to WM during this call to O. WM should believe that it is in the same situation that it was in when O(WM, u) was executed, so it believes its plan will immediately be executed without having to jump through any validation hoops. This is a load-bearing assumption of the scheme: that it is easy to execute O(WM, g) without leaking situational information to the process. I think this assumption is fairly reasonable, because simulating other agents is really common in practice.[4] The required information hygiene must be feasible to implement.
The reason I'm having the agent execute its own validation test is that I’m hoping this scheme can be extended to work in a more online setting, where for example the agent might only be able to notice halfway through implementing the plan that some of the validation fails.
In this post I'm relying on an ontology where advanced AI can be usefully modeled by factoring it into a global optimizer and world model. It doesn't capture all the relevant details,[5] but I think it's an adequate approximation.
I’m guessing that this kind of scheme has been discussed before. I haven’t seen it though, so I thought I’d write it up so I can get feedback and links to previous discussion. One idea that seems similar-in-spirit is this [LW · GW] intuitive scheme for corrigibility. Except my scheme isn’t attempting to solve corrigibility, it’s aiming for a more narrow patch for Goodharting.
Thanks to harfe and Justis for feedback on this post.
- ^
The main technique for directly reducing overfitting is regularization. Two analogues for regularization are Quantilizers and Impact regularization [? · GW]. Analogous to how structural risk minimization uses a regularizer to balance overfitting and underfitting, minimizing the test error, Quantilizers can sometimes be viewed [LW · GW] as maximizing the actual expected utility, accounting for errors in the provided utility function.
- ^
Typically, something like $\sum_i \ell_i(\theta)$, where $\ell_i$ is the loss on datapoint $i$.
- ^
As in quantilizing.
- ^
E.g. in playing any adversarial game, an agent needs to predict its opponent’s moves. Current game-playing AIs do this all the time, as do humans. This is done without any galaxy-brained stuff involving the simulated agent gaining awareness of its situation. Perhaps this kind of simulation isn’t trivial to implement in more powerful agents, but I don’t expect it to be a major hurdle.
- ^
In particular, world model stability is ignored.
17 comments
comment by Steven Byrnes (steve2152) · 2025-01-23T02:52:03.291Z · LW(p) · GW(p)
FYI §14.4 of my post here [LW · GW] is a vaguely similar genre although I don’t think there’s any direct overlap.
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”. We call it “clever out-of-the-box solution” when we’re happy with how it turned out, and we call it “Goodharting” when we’re sad about how it turned out, but it’s not a structural difference. So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception. That’s an alignment tax, which might be fine (depending on the scenario) but should be kept in mind.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T14:06:44.776Z · LW(p) · GW(p)
Yeah I agree there are similarities. I think a benefit of my approach, that I should have emphasized more, is that it's reflectively stable (and theoretically simple and therefore easy to analyze). In your description of an AI that wants to seek clarification, it isn't clear that it won't self-modify (but it's hard to tell).
There’s a general problem that people will want AGIs to find clever out-of-the-box solutions to problems, and there’s no principled distinction between “finding a clever out-of-the-box solution to a problem” and “Goodharting the problem specification”.
But there is a principled distinction. The distinction is whether the plan exploits differences between the goal specification and our actual goal. This is a structural difference, and we can detect using information about our actual goal.
So systems that systematically block the second thing are inevitably gonna systematically block the first thing, and I claim that your proposal here is no exception.
My proposal is usually an exception to this, because it takes advantage of the structural difference between the two cases. The trick is that the validation set only contains things that we actually want. If it were to contain extra constraints beyond what we actually want, then yeah that creates an alignment tax.
↑ comment by quila · 2025-01-23T03:27:16.137Z · LW(p) · GW(p)
systems that systematically block the second thing are inevitably gonna systematically block the first thing
I think their proposal is not meant to cause doing-what-the-designer-hopes in response to an incomplete specification, but to be a failsafe in case the specification is unnoticedly wrong, where you expect what you meant to specify to not have certain effects.
↑ comment by Steven Byrnes (steve2152) · 2025-01-23T14:06:56.056Z · LW(p) · GW(p)
Hmm, I’ll be more explicit.
(1) If the human has a complete and correct specification, then there isn’t any problem to solve.
(2) If the human gets to see and understand the AI’s plans before the AI executes them, then there also isn’t any problem to solve.
(3) If the human adds a specification, not because the human directly wants that specification to hold, in and of itself, but rather because that specification reflects what the human is expecting a solution to look like, then the human is closing off the possibility of out-of-the-box solutions. The whole point of out-of-the-box solutions is that they’re unexpected-in-advance.
(4) If the human adds multiple specifications that are (as far as the human can tell) redundant with each other, then no harm done, that’s just good conservative design.
(5) …And if the human then splits the specifications into Group A which are used by the AI for the design, and Group B which trigger shutdown when violated, and where each item in Group B appears redundant with the stuff in Group A, then that’s even better, as long as a shutdown event causes some institutional response, like maybe firing whoever was in charge of making the Group A specification and going back to the drawing board. Kinda like something I read in “Personal Observations on the Reliability of the Shuttle” (Richard Feynman 1986):
The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked, then sections of code or modules with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But completely independently there is an independent verification group, that takes an adversary attitude to the software development group, and tests and verifies the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, etc. A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle that is followed is that all the verification is not an aspect of program safety, it is merely a test of that safety, in a non-catastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.
Re-reading the post, I think it’s mostly advocating for (5) (which is all good), but there’s also some suggestion of (3) (which would eat into the possibility of out-of-the-box solutions, although that might be a price worth paying).
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T14:36:41.197Z · LW(p) · GW(p)
Great comment, agreed. There was some suggestion of (3), and maybe there was too much. I think there are times when expectations about the plan are equivalent to literal desires about how the task should be done. For making coffee, I expect that it won't create much noise. But also, I actually want the coffee-making to not be particularly noisy, and if it's the case that the first plan for making coffee also creates a lot of noise as a side effect, this is a situation where something in the goal specification has gone horribly wrong (and there should be some institutional response).
↑ comment by quila · 2025-01-23T21:26:44.563Z · LW(p) · GW(p)
I didn't notice suggestion of (3) but I skimmed over some parts.
(Separately, the line "The whole point of out-of-the-box solutions is that they’re unexpected-in-advance" is funny to me / reminded me of this HPMOR scene[1], in that you imply expecting in advance non-specific out-of-the-box solutions, which you can then also have strong expectations to not involve some things (e.g. a program-typing task not involving tiling the world outside this room with copies of the program), but I don't anticipate we actually disagree)
- ^
"I see," whispered Harry, lowering his own voice. "So everyone knows that Dumbledore is secretly a mastermind."
Most of the students nodded. One or two looked suddenly thoughtful, including the older student sitting next to Harry.
"Brilliant!" Harry whispered. "If everyone knows, no one will suspect it's a secret!"
comment by Jonas Hallgren · 2025-01-23T10:14:56.857Z · LW(p) · GW(p)
This makes a lot of sense to me. For some reason it reminds me of some Stuart Armstrong OOD-generalization work for alternative safeguarding strategies to imperfect value extrapolation? I can't find a good link though.
I also thought it would be interesting to mention the link to the idea in linguistics that a word is specified by all the different contexts it is used in, and so a symbol is a probability distribution of contextual meaning. From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents? (I don't think I'm making a novel argument here, I just thought it would be interesting to point out.)
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T14:07:04.116Z · LW(p) · GW(p)
Yeah I think I remember Stuart talking about agents that request clarification whenever they are uncertain about how a concept generalizes. That is vaguely similar. I can't remember whether he proposed any way to make that reflectively stable though.
From the perspective of this post, wouldn't natural language work a bit as a redundancy specifier in that case and so LLMs are more alignable than RL agents?
LLMs in their current form don't really cause Edge Instantiation problems. Plausibly this is because they internally implement many kinds of regularization toward "normality" (and also kinda quantilize by default). So maybe yeah, I think I agree with your statement in the sense that I think you intended it, as it refers to current technology. But it's not clear to me that this remains true if we made something-like-an-LLM that is genuinely creative (in the sense of being capable of finding genuinely-out-of-the-box plans that achieve a particular outcome). It depends on how exactly it implements its regularization/redundancy/quantilization and whether that implementation works for the particular OOD tasks we use it for.
Ultimately I don't think LLM-ish vs RL-ish will be the main alignment-relevant axis. RL trained agents will also understand natural language, and contain natural-language-relevant algorithms. Better to focus on understood vs not-understood.
↑ comment by Garrett Baker (D0TheMath) · 2025-01-23T15:31:24.269Z · LW(p) · GW(p)
If you put current language models in weird situations & give them a goal, I’d say they do do edge instantiation, without the missing “creativity” ingredient. Eg see claude sonnet in minecraft repurposing someone’s house for wood after being asked to collect wood.
Edit: There are other instances of this too, where you can tell claude to protect you in minecraft, and it will constantly tp to your position, and build walls around you when monsters are around. Protecting you, but also preventing any movement or fun you may have wanted to have.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T15:59:56.134Z · LW(p) · GW(p)
Fair enough, good points. I guess I classify these LLM agents as "something-like-an-LLM that is genuinely creative", at least to some extent.
Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.
↑ comment by Garrett Baker (D0TheMath) · 2025-01-23T16:32:44.927Z · LW(p) · GW(p)
Although I don't think the first example is great, seems more like a capability/observation-bandwidth issue.
I think you can have multiple failures at the same time. The reason I think this was also goodhart was because I think the failure-mode could have been averted if sonnet was told “collect wood WITHOUT BREAKING MY HOUSE” ahead of time.
↑ comment by Jonas Hallgren · 2025-01-23T18:44:10.193Z · LW(p) · GW(p)
Those are some great points, made me think of some more questions.
Any thoughts on what language "understood vs not understood" might be in? ARC Heuristic arguments or something like infrabayesianism? Like what is the type signature of this and how does this relate to what you wrote in the post? Also what is its relation to natural language?
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T20:56:47.925Z · LW(p) · GW(p)
The ideal situation understanding-wise is that we understand AI at an algorithmic level. We can say stuff like: there are X,Y,Z components of the algorithm, and X passes (e.g.) beliefs to Y in format b, and Z can be viewed as a function that takes information in format w and links it with... etc. And infrabayes might be the theory you use to explain what some of the internal datastructures mean. Heuristic arguments might be how some subcomponent of the algorithm works. Most theoretical AI work (both from the alignment community and in normal AI and ML theory) potentially has relevance, but it's not super clear which bits are most likely to be directly useful.
This seems like the ultimate goal of interp research (and it's a good goal). Or, I think the current story for heuristic arguments is using them to "explain" a trained neural network by breaking it down into something more like an X,Y,Z components explanation.
At this point, we can analyse the overall AI algorithm, and understand what happens when it updates its beliefs radically, or understand how its goals are stored and whether they ever change. And we can try to work out whether the particular structure will change itself in bad-to-us ways if it could self-modify. This is where it looks much more theoretical, like theoretical analysis of algorithms.
(The above is the "understood" end of the axis. The "not-understood" end looks like making an AI with pure evolution, with no understanding of how it works. There are many levels of partial understanding in between).
This kind of understanding is a prerequisite for the scheme in my post. This scheme could be implemented by modifying a well-understood AI.
Also what is its relation to natural language?
Not sure what you're getting at here.
↑ comment by Jonas Hallgren · 2025-01-24T08:51:35.625Z · LW(p) · GW(p)
Okay, that makes sense to me so thank you for explaining!
I guess what I was pointing at with the language thing is the question of what the actual underlying objects that you called XYZ were and their relation to the linguistic explanation of language as a contextually dependent symbol defined by many scenarios rather than some sort of logic.
Like if we use IB it might be easy to look at that as a probability distribution of probability distributions? I just thought it was interesting to get some more context on how language might help in an alignment plan.
comment by Aprillion · 2025-01-24T10:08:46.445Z · LW(p) · GW(p)
we can put higher utility on the shutdown
sounds instrumental to expand your moral circle to include other instances of yourself to keep creating copies of yourself that will shut down ... then expand your moral circle to include humans and shut them down too 🤔
comment by quila · 2025-01-23T00:32:13.334Z · LW(p) · GW(p)
However, the main flaw is that it won’t work if the planner is “situationally aware”, in the sense that it knows that its plan will be checked against some validation set. If this is the case, a clever planner will treat the validation step as an obstacle, and look for ways to overcome it.
A potential way to avoid this is as follows.
It requires:
- A pointer to the world. This might be whatever you are using to find the "difference between the ‘world-outside-this-room’ and the ‘world-outside-this-room-if-no-actions-were-taken’".
- A pointer to a slightly modified world. Specifically, "the copy of the world which differs only in that {the copy of the AI in it}'s code does not check its plan against a validation set". (I'll denote this as W2)
Then program your AI to do the following:
- Compute "what, if output[1] by {the AI in W2}, causes goal fulfilled (within W2)"
- Check if that being output by {the AI in W2} would lead to the future of W2 violating one of the validation checks.
- If yes: shut down. Else, give that output.[2]
(This could break if the chosen output has W2-AI observe its own source code, because that is different to the source code of the AI in real-world[3][4])
I hope that makes sense, let me know if not.
- ^
(In other words, "what action"; the 'output' can be to robotic limbs, for example, or to a screen display; or to another part of the program, which in the real world but not W2 then routes through the validation checker, and then to robotic limbs in both cases)
- ^
(in the real world)
- ^
Toy example: W2-AI picks plan which selects some location in its source code, presses backspace some number of times, types some stuff, then saves. When real-AI does this, those same actions delete some of the 'validation checking' part of its code, resulting in real-AI not running afterwards because of formatting errors in the code.
- ^
Moved this text to a footnote because it's messy.
In fact, if that's truly the only difference, meaning W2-AI's beliefs are initially similar to those of real-AI, the first thing W2-AI would do is check which one it is by observing its source code. Nevermind, that doesn't happen because we ask "What, if output by W2-AI, causes goal fulfilled within W2", not "what would W2-AI do"; although the latter would be easier to specify. The former confuses me for the same reason your "if no actions were taken" does: these counterfactuals would require something be different about the history of the pointed-to world to be true in the first place, else there is only one possibility. I'm less experienced with these topics than you and would appreciate some pointer to how these concepts can have a coherent/non-causality-violating formalization, to help me learn.
↑ comment by Jeremy Gillen (jeremy-gillen) · 2025-01-23T14:40:02.255Z · LW(p) · GW(p)
I'm not sure how this is different from the solution I describe in the latter half of the post.