Posts

[Interim research report] Evaluating the Goal-Directedness of Language Models 2024-07-18T18:19:04.260Z

Comments

Comment by Elizabeth Donoway (elizabeth-donoway) on [Interim research report] Evaluating the Goal-Directedness of Language Models · 2024-07-30T22:52:52.945Z · LW · GW

I really appreciate the thoughtful replies and feedback, and I didn't read any of your comments as rude or mean! I'd like to clarify a few points about our approach and its relevance:

We expect that good CoT reasoning helps current models better pursue and achieve goals. In the future, all of the reasoning needed to do this effectively might happen in the forward pass. However, we think it's likely that AGI could be achieved with models similar to current SOTA models, where sophisticated reasoning for goal pursuit will still need to happen in CoT. Even if this bet is wrong and models can do this reasoning entirely in the forward pass, our evals should still reveal goal-directed behavior because we assess actions the model takes that are instrumental to achieving the goal (not just the sensibility of the reasoning traces).

Regarding AI control scenarios, we understand that with good oversight methods, deceptive goal-pursuit would need to happen either in the forward pass or through obfuscated reasoning in CoT. While our scoring of reasoning traces may become less relevant as models improve at internal reasoning, our other metrics focus on how well the model carries out steps and decides on instrumental actions that lead to goal achievement. We believe this serves as a good proxy for goal-directedness, even if we can't directly observe how intrinsic goals arise or manifest from the training process.

A potential risk scenario we're considering is one where a highly capable model is developed that doesn't show obvious signs of misalignment in standard tests (e.g. typical use or general reasoning benchmarks) but has strong goal-pursuit capabilities. Without evaluations that specifically probe goal-directedness, such a model could potentially develop deceptive goals while becoming adept at hiding them. Our evals aim to surface whether models can perform the reasoning needed to competently pursue goals. They do this by measuring how well models carry out the steps needed to achieve those goals, and by identifying the circumstances that affect this behavior.

We remain agnostic about what sort of training produces these capabilities, and we think it should be possible to get a good idea of goal-directedness by assessing how well models can carry out goals. I would imagine that a capable, goal-directed model with intrinsic goals would also be able to pursue goals specified through prompting, since we train models to follow instructions. If this expectation is wrong, it would be a major limitation of our evals.

Our approach differs from typical capability evaluations in that we try to determine the utility of models' actions towards goal fulfillment, the situations or conditions under which models fail, the time horizons over which they can act and plan, and their corrigibility under adverse circumstances or when presented with conflicting goals. This allows us to assess aspects of goal-directedness that are distinct from general reasoning capabilities. This isn't touched on in the post, since we hadn't started implementing these tasks when we wrote it, but we're currently developing long-horizon tasks to better assess these aspects of goal-directedness, and expect to have preliminary results in a few weeks!
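
To make the distinction from a plain pass/fail benchmark a bit more concrete, here is a minimal, purely illustrative sketch of how per-episode results could be summarized along two of the axes above (utility of actions and time horizon). This is not our actual harness: the `Episode` fields, `action_utility`, and `summarize_by_horizon` are hypothetical names invented for this example, and it assumes each action has already been labeled as instrumental or not by some grader.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Episode:
    # All fields are hypothetical, chosen only for illustration.
    horizon: int          # number of steps the task requires
    useful_actions: int   # actions a grader judged instrumental to the goal
    total_actions: int    # all actions the model took
    goal_reached: bool    # whether the goal was achieved


def action_utility(ep: Episode) -> float:
    """Fraction of actions that advanced the goal; 0.0 if no actions were taken."""
    return ep.useful_actions / ep.total_actions if ep.total_actions else 0.0


def summarize_by_horizon(episodes):
    """Group episodes by horizon; report mean action utility and success rate."""
    groups = defaultdict(list)
    for ep in episodes:
        groups[ep.horizon].append(ep)
    return {
        h: {
            "mean_utility": sum(action_utility(e) for e in eps) / len(eps),
            "success_rate": sum(e.goal_reached for e in eps) / len(eps),
        }
        for h, eps in sorted(groups.items())
    }
```

Looking at both numbers per horizon is the point of the sketch: a model could take mostly useful actions and still fail to reach the goal on longer tasks, which a single success rate would hide.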