Towards a solution to the alignment problem via objective detection and evaluation

post by Paul Colognese (paul-colognese) · 2023-04-12T15:39:31.662Z · LW · GW · 7 comments

Contents

  Introduction
    High-level summary of this post
  An alignment problem for “optimizing systems” and a theoretical solution
    Optimizing systems and objectives
    The alignment problem for optimizing systems
      Statement of the alignment problem
    Objective detection and evaluation as a theoretical solution to the alignment problem
  Additional thoughts
    The role of interpretability
    Thoughts on evaluating objectives
    What if we can only evaluate certain objectives?
  Empirical work/next steps 
  Related work
    Retargeting the search
    Acceptability oversight and transparency tech tree
    Isolating objectives by leveraging their differential tendency to update in certain contexts
  Conclusion

Thanks to Arun Jose, Joseph Bloom, and Evan Hubinger for discussions/feedback.

This work was primarily carried out during SERI MATS under Evan Hubinger’s mentorship.

Introduction

We explore whether the ability to detect and evaluate the objectives of advanced AI systems in training and in deployment is sufficient to solve the alignment problem. We mostly ignore practical/implementation considerations in order to examine this approach in an idealized setting before we try to make concrete progress in this direction. This allows us to highlight weaknesses with this approach and further considerations that any practical implementation must contend with. We note that taking this idealized framing hides some messy issues that we intend to discuss in future posts.

We note that this approach is not novel (see this post [LW · GW] or the related work section of this post), but it motivates our future research and raises some additional considerations.

Our basic analysis indicates that we can solve the alignment problem in this idealized setting if we can (a minimal sketch of the resulting oversight loop follows the list):

  1. Detect all objectives/intentions of any system produced during the training process as well as any objectives that may develop in deployment.
  2. Evaluate whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad.
  3. Shut down a system if we detect that the set of objectives it’s pursuing will lead to irreversibly bad outcomes before it can accomplish those outcomes.
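To make the structure of this idealized oversight loop explicit, here is a minimal sketch in Python. It is illustrative only: `detect_objectives`, `evaluate_objective`, and `shut_down` stand in for the three assumed capabilities above, none of them are real APIs, and implementing them is precisely the open problem discussed in the rest of this post.

```python
from enum import Enum, auto

class Verdict(Enum):
    GOOD = auto()
    BAD = auto()
    IRREVERSIBLY_BAD = auto()

def oversight_step(system, detect_objectives, evaluate_objective, shut_down):
    """One pass of the idealized overseer: detect, evaluate, intervene."""
    for objective in detect_objectives(system):   # capability 1: detect all objectives
        verdict = evaluate_objective(objective)   # capability 2: evaluate the corresponding outcome
        if verdict is Verdict.IRREVERSIBLY_BAD:
            shut_down(system)                     # capability 3: intervene before the outcome is realized
            return "halted"
    return "continue training/deployment"
```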

High-level summary of this post

We will avoid trying to define what we mean by an objective [LW · GW] for now and instead work with an intuitive notion. In future work, we will attempt to clarify this intuition into a definition.

An alignment problem for “optimizing systems” and a theoretical solution

Optimizing systems and objectives

We begin by presenting the key intuitions and assumptions that inspired this approach. We doubt that these intuitions/assumptions hold as stated but leave a more thorough analysis for future work.

The first intuition is that for a cognitive system to be able to achieve certain “low entropy” outcomes in an environment (building a factory, successfully running a business, producing novel scientific discoveries, etc.), the system will likely need to be capable of using strategy/planning/problem-solving [? · GW] to select action sequences that lead to such outcomes. We call such systems optimizing systems.

The second intuition states that for an optimizing system to produce such a desired outcome, it must first have an “internal objective” that corresponds to the desired outcome. It must then apply computation/cognition (possibly in the form of planning or general problem-solving) that “uses” this objective to achieve this outcome. So there exists a causal relationship between the objective/intention of the system and the outcome that the objective references. In particular, there exists a time gap between the system first developing the objective and the outcome produced by the system.

We also assume that outcomes that involve the overseer no longer being in control of the optimizing system (due to successful deception, etc.) require the optimizing system to have an objective that corresponds to that outcome.

These intuitions suggest that it is theoretically possible for an overseer to make certain observations of the internals of an optimizing system (in particular, observations of the system’s objective) that allow it to deduce that the system is “trying” to produce a particular outcome in the environment. If the outcome that the objective corresponds to is deemed bad by the overseer, the overseer could intervene and stop the system before the outcome is realized.

Again, for the purposes of this post, we don’t attempt to define exactly what we mean by an objective and rely on an intuitive notion. We hope to clarify this term in an upcoming post. Other terms we could have potentially worked with include goals, values, intentional systems, consequentialist systems, and mesa-optimizers.

Note that we don’t assume that an optimizing system has fixed objectives. Its objectives may change during runtime/deployment (we discuss this briefly in the next section).[2]

The alignment problem for optimizing systems

Building/training optimizing systems is potentially dangerous if, in the process of building/deploying a powerful optimizing system, the training process succeeds in installing powerful capabilities in the system, but the optimizing system develops and successfully pursues bad objectives.

We define the overseer to be the system that builds/trains the optimizing system. We say that an objective of an optimizing system is good/bad (with respect to the overseer) if the outcome that corresponds to that objective is considered good/bad from the overseer’s perspective. We call an objective “irreversibly bad” if the resulting outcome is an existential catastrophe, the removal of the overseer’s capacity to shut down the optimizing system, deceptive behavior, etc. Broadly speaking, we’re referring to objectives that result in the overseer “losing control of the optimizing system.” 

At this point, it seems prudent to note that in practice, an overseer may not be able to accurately evaluate whether an objective is good or (irreversibly) bad. For now, we assume that the overseer can accurately evaluate outcomes and hence evaluate the corresponding objectives that point to those outcomes. We will touch upon this issue later.

The big problem: We currently don’t know how to install good objective(s)/avoid installing (irreversibly) bad objectives in optimizing systems via prosaic training methods, and it seems likely that we can/will train powerful optimizing systems via prosaic training methods. There are conceptual arguments [? · GW] that suggest that installing the intended objectives is difficult and empirical results that demonstrate the failure of prosaic training methods to install the intended objectives in existing toy optimizing systems.

To make matters worse, we may build and deploy optimizing systems whose objectives can change during deployment [LW · GW], intentionally or otherwise. Such optimizing systems may be necessary for good performance during deployment in complex environments with diverse tasks. This could be because the optimizing system needs to learn new objectives that depend on the deployment environment (for example, instrumental objectives), because we want the objective(s) it is pursuing to be updated via feedback in the presence of new data, or because its objectives haven’t solidified during training and so may change as the system interacts with the environment. Hence, we may need methods to ensure that optimizing systems continually have good objectives, even if those objectives change during deployment.

Statement of the alignment problem

We need a method of training/building optimizing systems such that:

  (a) The optimizing system never successfully pursues an irreversibly bad objective during training or deployment.
  (b) During deployment, the optimizing system pursues good objectives and only “rarely” pursues bad objectives.

If the above two properties hold, we say that the optimizing system is aligned with the overseer.

We briefly give some justification for choosing this framing. We’d like our optimizing systems to pursue good objectives and do useful things during deployment. However, we allow some slack regarding the pursuit of bad objectives because we may need to collect observations of optimizing systems pursuing bad objectives so that we can apply the relevant selection pressure, or because it may be sufficiently difficult to get guarantees that our system will never pursue bad objectives during deployment. However, we need our optimizing systems to never successfully pursue irreversibly bad objectives, such as removing the overseer’s capacity to shut the system down. If an optimizing system never successfully pursues an irreversibly bad objective, then the overseer is always in a position to shut down and retrain the optimizing system.

We note that solving the alignment problem, as stated, does not necessarily prevent bad outcomes from AI; it only ensures that we don’t produce systems that bring about bad outcomes (from the overseer’s perspective) by pursuing objectives that correspond to those outcomes.

Objective detection and evaluation as a theoretical solution to the alignment problem

We’ll spell out the solution, which follows naturally from our framing of the problem.

Claim: If the overseer has the following capabilities, part (a) of the alignment problem is solved:

  1. Objective Detection: The overseer can detect all objectives of any optimizing system produced during training, as well as any objectives that develop during deployment.
  2. Objective Evaluation: The overseer can evaluate whether the outcomes corresponding to those objectives are good, bad, or irreversibly bad.
  3. Control: The overseer can shut down (and modify) the optimizing system before a detected irreversibly bad objective is successfully pursued.

With these capabilities, the overseer can shut down and modify an optimizing system that develops an irreversibly bad objective and so part (a) of the alignment problem (as stated) is solved. In particular, because we’ve defined an irreversibly bad objective to include removing the overseer’s control capability, the overseer will retain this ability unless it loses this capability for other reasons (such as losing the off-switch).

To solve part (b) of the alignment problem the overseer also needs to train/select/build optimizing systems that pursue good objectives and “rarely” pursue bad objectives during deployment.[3]

This can be achieved with either of the two following capabilities (depending on the training/building method being used):

  1. Objective Writing: The overseer can directly write, modify, or delete the objectives of an optimizing system.
  2. Objective Selection Pressure: The overseer can apply selection pressure during training based on the detected (and evaluated) objectives of the systems produced.

Objective Writing [LW · GW] allows the overseer to ensure that optimizing systems are always pursuing good objectives; furthermore, because part (a) holds, the overseer can delete (irreversibly) bad objectives whenever they’re detected.

Objective Selection Pressure holding ensures that the overseer can (theoretically) select for optimizing systems that pursue good objectives and never pursue (irreversibly) bad objectives on the training distribution. Furthermore, part (a) ensures that at any point in deployment, the optimizing system can be shut down and retrained so that the new system has had additional selection pressure to pursue good objectives on the distribution sampled from deployment.
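As an illustration of how Objective Selection Pressure might combine with the part (a) capabilities, here is a hypothetical training-loop sketch (reusing the Verdict labels from the earlier sketch). All of the helpers passed in (`train_step`, `detect_objectives`, `evaluate_objective`, `shut_down`, `retrain_from_checkpoint`, and `penalize`) are assumed primitives introduced only for illustration.

```python
from enum import Enum

Verdict = Enum("Verdict", "GOOD BAD IRREVERSIBLY_BAD")  # as in the earlier sketch

def train_with_objective_selection(system, data_stream, train_step, detect_objectives,
                                   evaluate_objective, shut_down, retrain_from_checkpoint,
                                   penalize):
    """Hypothetical sketch: select against bad objectives during training, and use
    the Control capability whenever an irreversibly bad objective is detected."""
    for batch in data_stream:
        system = train_step(system, batch)
        verdicts = [evaluate_objective(o) for o in detect_objectives(system)]
        if any(v is Verdict.IRREVERSIBLY_BAD for v in verdicts):
            shut_down(system)                   # part (a): intervene before the outcome occurs
            system = retrain_from_checkpoint()  # hypothetical helper: restart from a safe state
        elif any(v is Verdict.BAD for v in verdicts):
            system = penalize(system)           # hypothetical selection pressure against bad objectives
    return system
```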

The above capabilities allow us to solve the alignment problem under the constraint that the overseer must constantly be monitoring the optimizing systems it deploys, and it must retrain them as necessary. 

The following additional capability could allow the overseer to eventually take it easy and would allow us to drop the Control capability from the overseer:

  Future Objective Guarantees: The overseer can verify that the optimizing system will only ever pursue good objectives (and never irreversibly bad ones) at all future points in deployment, without requiring continual monitoring.

Note that eventually, we will need Future Objective Guarantees if we are to confidently build a Sovereign AI that can independently optimize for aligned objectives without human oversight.

Additional thoughts

In this section, we consider some aspects of implementing this approach in practice, including the role of interpretability in this approach and difficulties with evaluating objectives.

The role of interpretability

To go from the above theoretical solution to a practical implementation, we need to develop tools that give an overseer the aforementioned capabilities. 

We claim that for us to have confidence that our solution holds, we’ll likely need interpretability tools and can’t rely solely on external observations of the systems produced during the training process.

When training systems via prosaic training methods, we encounter the problem where we select systems that look good to the overseer but may not actually be good. This is because prosaic training methods use selection pressures based on the overseer’s observations of the system during training and the overseer’s evaluations of those observations. Therefore, we need the overseer to be able to make observations that can truly distinguish between good systems and bad-but-good-looking systems.

We claim that relying on observing the external behavior of systems may not be sufficient to make such distinctions. This is because under a fixed set of inputs to two optimizing systems with different objectives, it’s possible that the two systems may produce the same external behavior [? · GW].

To overcome this, we might try to select the input set in an adversarial [LW · GW] manner [LW · GW] in order to produce inputs that cause an optimizing system with one objective to behave differently from an optimizing system with a different objective.
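One very simple way to operationalize this kind of adversarial input selection, assuming black-box access to two candidate systems and some divergence metric over their outputs, is sketched below. The function and parameter names are hypothetical, and, as the next paragraph notes, this approach does not help against a deceptive system that behaves well on every input the overseer can find.

```python
def find_distinguishing_inputs(candidate_inputs, system_a, system_b, divergence, k=10):
    """Hypothetical sketch: rank candidate inputs by how differently two systems
    (possibly pursuing different objectives) behave on them."""
    scored = [(divergence(system_a(x), system_b(x)), x) for x in candidate_inputs]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [x for _, x in scored[:k]]  # the k inputs most likely to expose differing objectives
```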

We should do this; however, in the worst case, we can never be truly confident that we’ve selected a good system due to the possibility of deception [? · GW]. A sufficiently capable deceptive system can produce outputs that look good to the overseer during training due to its situational awareness. In the worst-case scenario, we should think of the information coming from the system’s external behavior as corrupted due to the possibility of deception.

Hence, when implementing this solution under prosaic training setups, we can’t confidently rely on detecting and evaluating an optimizing system’s objectives based on observations of its external behavior during training.

Certain observations of an optimizing system’s internals might allow an overseer to truly differentiate between good and good-looking-but-bad optimizing systems. We earlier alluded to the idea that objectives might have to be internally represented in the system before the outcome of that objective can occur. If this is the case then, theoretically, interpretability tools could be developed that allow the overseer to directly detect (and possibly help evaluate) the internally represented objectives. 

One might object that a sufficiently capable deceptive system could make the internal representations of its objectives opaque to interpretability tools. This could indeed be the case, which is why our solution requires us to be able to use interpretability tools to detect deceptive objectives and shut down the system before the obscuration can be achieved.

It is currently unclear how detectable or interpretable objectives within the optimizing systems we care about will be. 

Thoughts on evaluating objectives

Detecting all objectives in an optimizing system is already a big ask. Even if the overseer could detect all of the objectives in an optimizing system, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not.

This might be difficult for a few different reasons. One reason is that the objectives might be represented in an alien way such that we can’t simply “read off” the corresponding outcome, or set of outcomes. Another difficulty [? · GW] is that the information about the outcome contained in the objective may not be enough for the overseer to properly evaluate all the relevant aspects of that outcome.

That being said, with enough of an understanding of the optimizing system’s internals, we might be able to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome.  An alternative approach involves the overseer using predictive models that have been trained to map from the objective to representations of the corresponding outcome. These predictive models could use the optimizing system’s internals as inputs, or the overseer might be able to use a predictive model that is used by the optimizing system internally.
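As a rough illustration of the predictive-model idea, the sketch below fits a simple probe from an extracted internal objective representation to the overseer's judgment of the corresponding outcome. Everything here is an assumption for illustration: `extract_objective_representation` is a placeholder for an interpretability primitive we do not yet have, and the labels are only as reliable as the overseer's own evaluations.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_outcome_probe(systems, extract_objective_representation, overseer_labels):
    """Hypothetical sketch: learn a map from internal objective representations
    to the overseer's judgment of the corresponding outcome (0 = bad, 1 = good)."""
    X = np.stack([extract_objective_representation(s) for s in systems])
    y = np.array(overseer_labels)
    probe = LogisticRegression(max_iter=1000).fit(X, y)
    return probe  # probe.predict(rep.reshape(1, -1)) estimates the outcome's quality
```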

A classic problem touched upon above is that we might obtain predicted outcomes that look good but may not actually be good. In the case of human overseers, this could be due to our lack of forecasting abilities or our lack of ability to discern what is good. If the overseer uses a predictive model, the model might give some observations about what the outcome will look like (e.g. in the form of a video feed showing happy humans) but this information may be incomplete and hence might not inform us that the outcome is actually bad (e.g. the outcome features humans who have been forced to look happy). A solution to ELK [? · GW] would help address this problem.

What if we can only evaluate certain objectives?

We could weaken the assumption that the overseer needs to be able to evaluate all objectives that the optimizing systems produced by the training process might pursue. Suppose that we can still detect all objectives of an optimizing system. Then we could ensure that our optimizing systems are pursuing objectives that we can evaluate as good and shut down the optimizing system whenever it begins pursuing an objective that we can’t evaluate. This might still be useful if we can productively use optimizing systems that pursue simple objectives that we can identify and evaluate.

Empirical work/next steps 

Making progress on this research agenda involves investigating to what extent we can develop tools that allow us to detect and evaluate objectives in optimizing systems that might show up in future training regimes.

Early work will involve deconfusion that attempts to clarify what we mean by an objective and to generate hypotheses regarding how objectives might be represented in future optimizing systems. From there, we hope to test these hypotheses on existing and future optimizing systems (toy maze-solving systems or LLM-based systems with objectives).

A benchmark for this agenda involves developing tools that can successfully pass auditing games [LW · GW] for detecting and evaluating objectives in optimizing systems. In particular, if some group (the Blue Team) claims that they have developed tools that can detect or evaluate the objectives of a set of optimizing systems, a Red Team will adversarially propose experiments to test/break the Blue Team’s tools. The Blue Team’s tools pass the auditing game if they successfully stand up to the Red Team’s attacks.
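As a deliberately simplified picture of this benchmark, one could score the Blue Team's tool by how often it recovers the objectives of systems the Red Team constructed. The sketch below assumes the Red Team can supply ground-truth objectives for its systems, which is itself a nontrivial assumption.

```python
def run_auditing_game(blue_team_tool, red_team_cases):
    """Hypothetical sketch: score the Blue Team's detection tool on systems
    adversarially constructed by the Red Team with known objectives."""
    results = []
    for system, true_objectives in red_team_cases:   # proposed by the Red Team
        reported = blue_team_tool(system)            # Blue Team's detection/evaluation tool
        results.append(set(reported) == set(true_objectives))
    return sum(results) / len(results)               # pass rate against the Red Team's attacks
```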

Related work

In this section, we look at existing work related to the approach outlined in this post. 

Retargeting the search

John Wentworth’s Retargeting the search [LW · GW] can be thought of as an implementation of this approach under the assumption that optimizing systems have a particular internal structure. This assumption states that optimizing systems have a general-purpose search (or optimization) module that takes in the current objective as an input. The ability to detect the “objective slot” of this module and monitor the objectives being passed into this module would allow us to implement the solution proposed in this post.
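A toy sketch of the internal structure this assumes (not a claim about how real systems are organized) might look like the following, where the "objective slot" is an explicit field that an overseer could read or rewrite.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class GeneralPurposeSearch:
    """Hypothetical structure assumed by retargeting the search: a search module
    that reads its current objective from an explicit slot."""
    objective: Any                          # the "objective slot"
    search: Callable[[Any, Any], Any]       # search(world_model, objective) -> plan

    def plan(self, world_model):
        return self.search(world_model, self.objective)

# An overseer that can locate and read `objective` gets Objective Detection;
# one that can also overwrite it gets Objective Writing (i.e., retargeting the search).
```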

Acceptability oversight and transparency tech tree

We can identify this approach as “Worst-case robust-to-training transparency for non-deceptive models” from Evan Hubinger’s post A transparency and interpretability tech tree [LW · GW], where the interpretability targets are “intentions/objectives.” Monitoring objectives that develop during the training process could serve as a form of acceptability predicate [LW · GW] that allows us to detect precursors to deception (namely the objective to be deceptive). 

Isolating objectives by leveraging their differential tendency to update in certain contexts

Arun’s work [LW · GW] demonstrates an attempt at isolating objectives in existing toy models that might be thought of as having objectives. Hopefully, future conceptual work will lead to more hypotheses regarding how objectives might be represented in optimizing systems.

Conclusion

We have introduced a theoretical solution to the alignment problem (as stated) and explored a number of considerations regarding how this might be implemented in practice. This approach is based on intuitions that will need to be critiqued and developed in further work.

This solution seems difficult to implement in practice. There are obvious difficulties: developing such interpretability tools is a large undertaking, and we face time constraints. Furthermore, it’s unclear whether the tools we might develop and test on current optimizing systems will generalize to future systems.

Pessimism is warranted, but it seems possible that pursuing this direction could lead to useful insights. Therefore we suggest further conceptual and empirical investigation. 

 

  1. ^

    Probably should have stuck with the term "mesa-optimizer [? · GW]".

  2. ^

    Powerful agents will likely be able to preserve their objectives during deployment [LW · GW] but it’s not clear whether the first dangerous optimizing systems will have this capability.

  3. ^

    Note that we assume that training methods will eventually have the ability to train optimizing systems that have the capability of pursuing objectives we care about. Otherwise, we don't have to worry about bad outcomes from optimizing systems pursuing bad objectives.

7 comments

Comments sorted by top scores.

comment by Roman Leventov · 2023-04-12T16:54:11.772Z · LW(p) · GW(p)

I think that an objective is a (future) world state, or a probability distribution over world states (or over a subset of factors of a world state), with respect to the world model of an intelligent system.

If the world model, cost model, and planning model are not disentangled within the AI architecture but are instead a spaghetti ball inside a giant LLM wrapped as an AutoGPT agent, then of course the objectives will hardly be tractable (this would require “full” interpretability of the LLM to be complete, cf. Othello-GPT) or interpretable.

Solution: use AI architectures with explicitly disentangled world models, such as LeCun's H-JEPA, and then use a separate model for converting the world state (i.e., embedding) into textual description for human interpretability.

comment by Evan R. Murphy · 2023-04-20T02:24:47.840Z · LW(p) · GW(p)

Post summary (experimental)

I just found your post. I want to read it but didn't have time to dive into it thoroughly yet, so I put it into a summarizer script I've been working on that uses gpt-3.5-turbo and gpt-4 to summarize texts that exceed the context window length.

Here's the summary it came up with; let me know if anyone sees problems with it. If you're in a rush, you can use agree/disagree voting to signal whether you think this is overall a good summary or not:

The article examines a theoretical solution to the AI alignment problem, focusing on detecting and evaluating objectives in optimizing systems to prevent negative or irreversible outcomes. The author proposes that an overseer should possess three capabilities: detecting, evaluating, and controlling optimizing systems to align with their intended objectives.

Emphasizing the significance of interpretability, the article delves into the challenges of assessing objectives. As a practical solution, the author suggests developing tools to detect and evaluate objectives within optimizing systems and testing these methods through auditing games. Although implementing such tools may be difficult, the author advocates for further exploration in this direction to potentially uncover valuable insights. Acknowledging the theoretical nature of the solution, the author recognizes the potential hurdles that may arise during practical implementation.

Update: I see now that your post includes a High-level summary of this post [LW · GW] (thanks for doing that!), which I'm going through and comparing with this auto-generated one.

Replies from: paul-colognese
comment by Paul Colognese (paul-colognese) · 2023-04-20T06:10:05.383Z · LW(p) · GW(p)

Interesting! Quick thought: I feel as though it over-compressed the post, compared to the summary I used. Perhaps you can tweak things to generate multiple summaries of varying lengths.

Replies from: Evan R. Murphy
comment by Evan R. Murphy · 2023-04-20T21:06:48.296Z · LW(p) · GW(p)

Great idea, I will experiment with that - thanks!

Replies from: Evan R. Murphy
comment by Evan R. Murphy · 2023-04-24T15:39:43.556Z · LW(p) · GW(p)

Less compressed summary

Here's a longer summary of your article generated by the latest version of my summarizer script:

In this article, Paul Colognese explores whether detecting and evaluating the objectives of advanced AI systems during training and deployment is sufficient to solve the alignment problem. The idealized approach presented in the article involves detecting all objectives/intentions of any system produced during the training process, evaluating whether the outcomes produced by a system pursuing a set of objectives will be good/bad/irreversibly bad, and shutting down a system if its objectives lead to irreversibly bad outcomes.

The alignment problem for optimizing systems is defined as needing a method of training/building optimizing systems such that they never successfully pursue an irreversibly bad objective during training or deployment and pursue good objectives while rarely pursuing bad objectives. The article claims that if an overseer can accurately detect and evaluate all of the objectives of optimizing systems produced during the training process and during deployment, the overseer can prevent bad outcomes caused by optimizing systems pursuing bad objectives.

Robustly detecting an optimizing system’s objectives requires strong interpretability tools. The article discusses the problem of evaluating objectives and some of the difficulties involved. The role of interpretability is crucial in this approach, as it allows the overseer to make observations that can truly distinguish between good systems and bad-but-good-looking systems.

Detecting all objectives in an optimizing system is a challenging task, and even if the overseer could detect all of the objectives, it might be difficult to accurately predict whether a powerful optimizing system pursuing those objectives would result in good outcomes or not. The article suggests that with enough understanding of the optimizing system’s internals, it might be possible to directly translate from the internal representation of the objective to a description of the relevant parts of the corresponding outcome.

The article concludes by acknowledging that the proposed solution seems difficult to implement in practice, but pursuing this direction could lead to useful insights. Further conceptual and empirical investigation is suggested to better understand the feasibility of this approach in solving the alignment problem.

comment by Charlie Steiner · 2023-04-16T05:29:11.295Z · LW(p) · GW(p)

I felt like this post was building a castle in the air. You spend a lot of time assuming various useful abilities, which I often didn't find interesting.

I'd be more interested in taking those assumed abilities, and breaking them down into much smaller pieces that we understand better individually. It's also important to be able to learn that you were assuming the wrong thing (maybe the thing you're assuming is too hard to do, or didn't even make sense) and assume something else instead - if we couldn't do this we'd lose one of our big advantages over philosophers :D

Replies from: paul-colognese
comment by Paul Colognese (paul-colognese) · 2023-04-16T06:06:16.208Z · LW(p) · GW(p)

Thanks for the feedback! I guess the intention of this post was to lay down the broad framing/motivation for upcoming work that will involve looking at the more concrete details.

I do resonate with the feeling that the post as a whole feels a bit empty as it stands and the effort could have been better spent elsewhere.