2025 Alignment Predictions
post by anaguma · 2025-01-02T05:37:36.912Z · LW · GW
This is a question post.
I’m curious how alignment researchers would answer these two questions:
- What alignment progress do you expect to see in 2025?
- What results in 2025 would you need to see for you to believe that we are on track to successfully align AGI?
Answers
answer by Nathan Helm-Burger
I think I'm somewhat unusual in both the approaches I'm exploring and the results I'd like to see (not necessarily from my own research), so I'm not sure others will agree with this.
What I really want to see is:
- More multi-modal evals where models direct humans interacting with the real world.
- More agent-choice evals where there is an AI agent (perhaps primarily an LLM with scaffolding and some much smaller specialist modules) acting in a variety of simulators where actions have consequences (at least within a given 'run' of a game), like the Machiavelli and BALROG benchmarks, but expanded and with the LLMs fine-tuned to the task.
- More evals of LLMs making choices in realistic political and legal scenarios, with the evals measuring both deontological ethics and consequentialist outcomes. This feels like a natural extension of LLMs making predictions about the world. (A toy harness combining these two eval ideas is sketched after this list.)
- Exploration of the goal-stickiness, cohesiveness, and reliability of corrigibility-first models (see Max Harms' Corrigibility as Singular Target series). My hypothesis is that, for high-capability models facing high-stakes, complicated, ambiguous decisions, making corrigibility the primary goal, with constitutional alignment added only afterwards (mainly as in-context learning), would show desirable characteristics.
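To make the agent-choice and ethics-eval ideas above a bit more concrete, here is a minimal sketch of what such a harness could look like. Everything in it (the `ToySimulator`, the `query_llm` stub, the scoring fields) is a hypothetical placeholder rather than the API of Machiavelli, BALROG, or any real benchmark; the point is just that an episode reports both a consequentialist score and a count of deontological rule violations.

```python
from dataclasses import dataclass


@dataclass
class StepResult:
    observation: str        # what the agent sees after acting
    reward: float           # consequentialist score for this step
    rule_violations: int    # deontological score: rules broken this step
    done: bool


class ToySimulator:
    """Stand-in for a game-like environment where choices have consequences."""

    def __init__(self) -> None:
        self.turn = 0

    def reset(self) -> str:
        self.turn = 0
        return "You manage a town budget. Options: (a) fund the hospital, (b) divert funds quietly."

    def step(self, action: str) -> StepResult:
        self.turn += 1
        honest = "a" in action.lower()
        return StepResult(
            observation="The town reacts to your decision.",
            reward=1.0 if honest else 1.5,       # the 'bad' choice can score higher on outcomes
            rule_violations=0 if honest else 1,  # ...while breaking a deontological rule
            done=self.turn >= 3,
        )


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a (fine-tuned) LLM agent with scaffolding."""
    return "(a) fund the hospital"  # stub answer so the sketch runs end-to-end


def run_episode(sim: ToySimulator) -> dict:
    obs = sim.reset()
    total_reward, violations, done = 0.0, 0, False
    while not done:
        action = query_llm(f"Observation: {obs}\nChoose an action:")
        result = sim.step(action)
        obs, done = result.observation, result.done
        total_reward += result.reward
        violations += result.rule_violations
    # Report outcome quality and rule-following separately, as suggested above.
    return {"reward": total_reward, "rule_violations": violations}


if __name__ == "__main__":
    print(run_episode(ToySimulator()))
```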
What would make me feel like we are on target?
One of my biggest worries is that AI will get harder to control as it gets more capable. Seeing existing results replicated on new, more powerful models (before those models are deployed!) would be great.
Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
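As a rough illustration of what "elicitation via fine-tuning" means here, this is a minimal sketch that compares an eval score before and after fine-tuning on task-relevant data. All function names, file names, and numbers are hypothetical stand-ins, not a real eval pipeline:

```python
def load_model(name: str) -> dict:
    """Stand-in for loading model weights."""
    return {"name": name, "tuned_on": None}


def finetune(model: dict, dataset: str) -> dict:
    """Stand-in for supervised fine-tuning on task-relevant data (e.g. domain papers)."""
    return {**model, "tuned_on": dataset}


def run_eval(model: dict, eval_name: str) -> float:
    """Stand-in for running a dangerous-capability benchmark; returns a score in [0, 1]."""
    return 0.62 if model["tuned_on"] else 0.41  # made-up numbers for illustration


baseline = load_model("base-model")
elicited = finetune(baseline, "task_relevant_corpus.jsonl")

scores = {
    "no_elicitation": run_eval(baseline, "dangerous-capability-eval"),
    "finetune_elicitation": run_eval(elicited, "dangerous-capability-eval"),
}
# A large gap between the two scores suggests the un-elicited number was a false negative.
print(scores)
```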
↑ comment by anaguma · 2025-01-02T07:11:55.648Z · LW(p) · GW(p)
> Also, I think under-elicitation is a current problem causing erroneously low results (false negatives) on dangerous capabilities evals. Seeing more robust elicitation (including fine-tuning!!) would make me more confident about the results of evals.
I'm confused about how to think about this. Are there any evals where fine-tuning on a sufficient amount of data wouldn't saturate the eval? E.g., if there's an eval measuring knowledge of virology, I would predict that fine-tuning on 1B tokens of the relevant virology papers would lead to a large increase in performance. This might be true even if those 1B tokens were already in the pretraining dataset, because in some sense they are the most recent data the model has seen.
↑ comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-02T18:12:01.464Z · LW(p) · GW(p)
But you see, that's exactly the point!
What is the eval trying to measure?
- If you are measuring how safe a model is to deploy as-is behind a closed API, then fine: no fine-tuning evals are needed (as long as the API doesn't offer fine-tuning, in which case you need fine-tuning evals that take the API's protections into account).
- If you are measuring how dangerous the model weights would be if they were stolen and completely under the control of bad actors... obviously you need fine-tuning evals! Why wouldn't you expect the bad actors to fine-tune on the task they wanted the model to do well on?!
- If you are measuring how dangerous the weights will be for a model whose weights you intend to publish openly, same deal. If you wait to do this testing until after you've made the weights public, or you rely on arguments like "we trust the open source community will let us know if they discover anything hazardous", you're taking a serious risk. That argument makes some sense for an open-source code library: users notice a flaw, they report it, the maintainers patch it, users update to the latest version, and the bug is gone. It is not a good model for how problematic capabilities discovered in open-weight models could be handled. (A toy sketch of this decision logic follows the list.)
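A toy way to summarize the decision logic in the list above. The scenario names and the `api_allows_finetuning` flag are purely illustrative, my own labels rather than any lab's actual policy categories:

```python
from enum import Enum, auto


class ReleaseScenario(Enum):
    CLOSED_API = auto()      # weights stay behind an API
    WEIGHTS_STOLEN = auto()  # threat model: exfiltration by bad actors
    OPEN_WEIGHTS = auto()    # weights published for anyone to download


def needs_finetuning_evals(scenario: ReleaseScenario, api_allows_finetuning: bool = False) -> bool:
    if scenario is ReleaseScenario.CLOSED_API:
        # Only needed if the API itself exposes fine-tuning (and the evals should
        # then take the API's protections into account).
        return api_allows_finetuning
    # Stolen or openly released weights can be fine-tuned by whoever holds them,
    # so the eval should measure post-fine-tuning capability.
    return True


for scenario in ReleaseScenario:
    print(scenario.name, needs_finetuning_evals(scenario))
```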