Some thoughts on automating alignment research

post by Lukas Finnveden (Lanrian) · 2023-05-26T01:50:20.099Z · LW · GW · 4 comments



As AI systems get more capable, they may at some point be able to help us with alignment research. This increases the chance that things turn out ok.[1] Right now, we don’t have any particularly scalable or competitive alignment solutions. But the methods we do have might let us use AI to vastly increase the amount of labor spent on the problem before AI has the capability and motivation to take over. In particular, if we’re only trying to make progress on alignment, the outer alignment problem is reduced to (i) recognising progress on sub-problems of alignment (potentially by imitating productive human researchers), and (ii) recognising dangerous actions, such as attempts at hacking the servers.[2]

But worlds in which we’re hoping for significant automated progress on alignment are fundamentally scary. For one, we don’t know in what order we’ll get capabilities that help with alignment vs. dangerous capabilities.[3] But even putting that aside, AIs will probably become helpful for alignment research around the same time as AIs become better at capabilities research. Once AI systems can significantly contribute to alignment (say, speed up research by >3x), superintelligence will be years or months away.[4] (Though intentional, collective slow-downs could potentially buy us more time. Increasing the probability that such slow-downs happen at key moments seems hugely important.)

In such a situation, we should be very uncertain about how things will go.

To illustrate one end of the spectrum: It’s possible that automated alignment research could save the day even in an extremely tense situation, where multiple actors (whether companies or nations) were racing towards superintelligence. I analyze this in some detail here. To briefly summarize:

In a bit more detail:

If this worked out as described, this could lead to much progress in alignment. And plausibly also progress in other areas — e.g. demos of risk that could help persuade other companies or governments that a slow-down is needed.

To point out some significant issues with that particular analysis:

So, to reiterate: significant automation of alignment is a real possibility, which could save an otherwise hopeless situation. But worlds in which this needs to be done on a short deadline are very scary worlds, because it’s plausible that sufficient boosts won’t be possible to achieve without danger, or that the danger will be misjudged, directly leading to AI takeover. If we could collectively slow down, that would be much safer.[12]

I’ll end this post by explicitly mentioning a few other strategic implications from the possibility of automating alignment research:

Acknowledgements: Thanks to Tom Davidson and Carl Shulman for comments and ideas. I work at Open Philanthropy but the views here are my own.


  1. Though it’s closely related to the scary insight that AI might accelerate AI research, which suggests shorter timelines and faster takeoff speeds. If you had previously reflected on “AI accelerating capabilities”, but not on “AI accelerating alignment”, then the latter should come as a positive update. But if you were to simultaneously update on both of them, I don’t know what the net effect on p(doom) is. ↩︎

  2. This wouldn’t solve inner alignment, by itself. But inner misalignment would probably be easier to notice in less capable models. And for models that aren’t yet superintelligent, you could reduce risk by reducing their opportunities to cause catastrophes and/or to escape the lab, including by training many AIs on recognising and sounding the alarm on such attempts. ↩︎

  3. There are different kinds of dangerous capabilities. In particular, there’s a spectrum from (i) capabilities that are only dangerous if they’re developed and deployed by sufficiently incautious actors to (ii) capabilities that would make the models dangerous even if cautious actors combined them with the best available alignment techniques. The former end of the spectrum is important for the question “How long do we have until incautious actors cause AI takeover?” (absent successful efforts at regulating them or otherwise making them cautious). The latter is important for the question “What sort of models could a maximally cautious actor use without risking AI takeover?”, which in turn determines what speed-up to alignment research we could expect. ↩︎

  4. See Tom Davidson’s work on takeoff speeds. ↩︎

  5. Which highlights the strong importance of info security. Even with good info security, leaking 0 information might be unrealistic, if the leader needs to widely deploy their technology to finance training. ↩︎

  6. More precisely: As long as they move as fast as their competitors will move when they get to the cautious coalition’s level of AI technology. This might differ from their competitors’ current pace if novel AI systems can be used to accelerate capabilities research. So if nearby competitors will use AI to significantly accelerate their capabilities research, then this condition would require the cautious coalition to be the first to do so. At some point, that will probably become very dangerous. At that point, the cautious coalition would need to slow down or pause. ↩︎

  7. According to this, there’s ~4e21 FLOP/s of computing capacity on GPUs and TPUs as of Q1 2023. (Most of this is on chips that would struggle to run SOTA models, so the numbers are more relevant for what you could get if chip manufacturers switched over to producing many more AI chips.) A quarter of this for a year would be 1e21 * (365 * 24 * 3600) ~= 3e28 FLOP. With 50% hardware utilization, that is ~1.5e28 FLOP of effective training compute, which would correspond to a Chinchilla-trained model with ~11 trillion parameters. ↩︎

  8. Serial researcher months per month = (0.1 serial-researcher-second/serial-token) * (50 serial-tokens/second) * (3600 * 24 * 30 seconds/month) / (3600 * 8 * 30 serial-researcher-seconds/serial-researcher-month) = 15 serial researcher months / month.

    In the above denominator, I use “8” instead of 24 hours per day, because humans only work ~⅓ of their life, while GPUs can run non-stop.

    In total, there are (0.01 researcher-seconds / token) * (2.3e7 tokens/second) ~= 2.3e5 researcher-seconds per second. So given a serial speed-up of 15x, that corresponds to 2.3e5 / 15 ~= 15000 parallel researchers. ↩︎

  9. A model with 1000x the compute (1.5e31) would need enough compute to process sqrt(1000)~=32x more data in parallel. And if it was trained in 12/4=3x shorter time, it would need 3x more compute in parallel. 32 * 3 ~= 100. ↩︎

  10. Imagine that the AI researchers invent new programming languages and write huge, new code-bases in them. And then imagine a human trying to arbitrate a debate about whether a line in the code introduces a vulnerability. ↩︎

  11. Among simplistic models you can use, it’s plausible to me that a better model would be to assume that tasks will be automated one at a time, and that once you’ve automated a task, it’s cheap to do it as much as you want (cf. Tom Davidson’s takeoff speeds model and Richard Ngo’s framework). I’d be interested in more research on what this would suggest about alignment automation. ↩︎

  12. This would be compatible with using AI assistance to make progress on key problems during the slow-down. If you had many eyes on that process and there was ample room to pause if something looked dangerous, then that would make AI assistance much safer than in a unilateral race-situation. ↩︎
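
The back-of-the-envelope arithmetic in footnotes 7 to 9 can be reproduced with a short script. This is only a sketch under stated assumptions: the function names are hypothetical, the per-token and throughput figures are taken as written in the footnotes, and the Chinchilla rule of thumb used here (training compute C ≈ 6·N·D with D ≈ 20·N tokens) is my assumption about what "Chinchilla-trained" means, not something the footnotes spell out.

```python
import math

SECONDS_PER_YEAR = 365 * 24 * 3600


def chinchilla_params(train_flop, tokens_per_param=20.0, flop_per_param_token=6.0):
    """Parameter count N solving C = 6 * N * D with D = 20 * N.

    Both constants are assumed rules of thumb, not figures from the post.
    """
    return math.sqrt(train_flop / (flop_per_param_token * tokens_per_param))


def serial_speedup(researcher_sec_per_token, tokens_per_sec, work_hours_per_day=8):
    """Serial researcher-months produced per calendar month (footnote 8).

    The model runs 24 hours a day, while a human researcher is assumed to
    work only `work_hours_per_day` hours a day.
    """
    model_seconds_per_month = 3600 * 24 * 30
    human_seconds_per_month = 3600 * work_hours_per_day * 30
    return (researcher_sec_per_token * tokens_per_sec
            * model_seconds_per_month / human_seconds_per_month)


def parallel_researchers(researcher_sec_per_token, total_tokens_per_sec, speedup):
    """Number of parallel researcher-equivalents, each running at `speedup`x."""
    return researcher_sec_per_token * total_tokens_per_sec / speedup


def parallel_compute_multiplier(compute_ratio, time_ratio):
    """Extra parallel compute for a bigger, faster training run (footnote 9).

    Assumes data scales as sqrt(compute), as in Chinchilla-style scaling,
    and that training `time_ratio`x faster needs `time_ratio`x more
    compute in parallel.
    """
    return math.sqrt(compute_ratio) * time_ratio


if __name__ == "__main__":
    # Footnote 7: a quarter of ~4e21 FLOP/s for a year, at 50% utilization.
    raw_flop = 1e21 * SECONDS_PER_YEAR
    effective_flop = 0.5 * raw_flop
    print(f"raw FLOP: {raw_flop:.2e}, effective FLOP: {effective_flop:.2e}")
    print(f"Chinchilla-style parameters: {chinchilla_params(effective_flop):.2e}")

    # Footnote 8: serial speed-up and parallel researcher count.
    speedup = serial_speedup(0.1, 50)
    print(f"serial speed-up: {speedup:.1f}x")
    print(f"parallel researchers: {parallel_researchers(0.01, 2.3e7, speedup):.0f}")

    # Footnote 9: 1000x the compute, trained 12/4 = 3x faster.
    print(f"parallel compute multiplier: {parallel_compute_multiplier(1000, 12 / 4):.0f}x")
```

Under these assumptions the script gives roughly 1.1e13 parameters, a 15x serial speed-up, about 15,000 parallel researchers, and a parallel compute multiplier of about 95x, in line with the rounded figures in the footnotes.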


Comments sorted by top scores.

comment by Tao Lin (tao-lin) · 2023-06-13T19:02:24.752Z · LW(p) · GW(p)

If automating alignment research worked, but took 1000+ tokens per researcher-second, it would be much more difficult to develop, because you'd need to run the system for 50k-1M tokens between each "reward signal or such". Once it's 10 or fewer tokens per researcher-second, it'll be easy to develop and improve quickly.

comment by RogerDearnaley (roger-d-1) · 2023-05-28T09:26:28.575Z · LW(p) · GW(p)

As long as all the agentic AGIs people are building are value learners (i.e. their utility function is hard-coded to something like "figure out what utility function humans in aggregate would want you to use if they understood the problem better, and use that"), then improving their understanding of the human values becomes a convergent instrumental strategy for them: obviously the better they understand the human-desired utility function, the better job they can do of optimizing it. In particular, if AGI's capabilities are large, and as a result many of the things it can do are outside the region of validity of its initial model of human values, and also it understands the concept of the region of validity of a model (a rather basic, obviously required capability for an AGI that can do research, so this seems like a reasonable assumption), then it can't use most of its capabilities safely, so solving that problem obviously becomes top priority. This is painfully obvious to us, so it should also be painfully obvious to an AGI capable of doing research.

In that situation, a fast takeoff should just cause you to get an awful lot of AGI intelligence focused on the problem of solving alignment. So, as the author mentions, perhaps we should be thinking about how we would maintain human supervision in that eventuality? That strikes me as a particular problem that I'd feel more comfortable to have solved by a human alignment researcher than an AGI one.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-28T13:48:47.096Z · LW(p) · GW(p)

If we solve the alignment problem, then we solve the alignment problem.

I agree with this true statement.

Replies from: roger-d-1
comment by RogerDearnaley (roger-d-1) · 2023-05-29T04:48:11.642Z · LW(p) · GW(p)

If we can solve enough of the alignment problem, the rest gets solved for us.

If we can get a half-assed approximate solution to the alignment problem, sufficient to semi-align a STEM-capable AGI value learner of about smart-human level well enough to not kill everyone, then it will be strongly motivated to solve the rest of the alignment problem for us, just as the 'sharp left turn' is happening, especially if it's also going Foom. So with value learning, there is a region of convergence around alignment.

Or, to reuse one of Eliezer's metaphors: if we can point the rocket on approximately the right trajectory, it will automatically lock on and course-correct from there.