Should AutoGPT update us towards researching IDA?

post by Michaël Trazzi (mtrazzi) · 2023-04-12T16:41:13.735Z · LW · GW · No comments

This is a question post.

Contents

  Answers
    5 David Reber
    4 Charlie Steiner

Given the rate of progress in AutoGPT-like approaches, should we reconsider Paul Christiano's Iterated Distillation and Amplification (IDA) agenda as potentially central to the alignment of transformative ML systems?

For context on IDA and AutoGPT:

Answers

answer by David Reber · 2023-04-12T18:54:55.544Z · LW(p) · GW(p)

My understanding of Auto-GPT is that it strings together many GPT-4 requests, while also giving the model access to memory and the internet. Empirically, this combination of looping and added resources seems promising for solving complex tasks, such as debugging the code of Auto-GPT itself. (For those interested, this paper discusses how looped transformers can serve as general-purpose computers.)
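For concreteness, here's a minimal sketch of that kind of loop. This is illustrative only, not Auto-GPT's actual code; `call_llm` is a hypothetical stand-in for a GPT-4 API request:

```python
def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for a GPT-4 chat-completion request."""
    raise NotImplementedError("Plug in your preferred LLM API here.")


def auto_gpt_loop(goal: str, max_steps: int = 10) -> list[str]:
    memory: list[str] = []  # persistent scratchpad shared across requests
    for _ in range(max_steps):
        # Each iteration strings together another GPT-4 request, feeding
        # back accumulated memory so earlier results inform later steps.
        prompt = (
            f"Goal: {goal}\n"
            f"Memory so far: {memory}\n"
            "Propose the next action, or reply DONE if the goal is met."
        )
        action = call_llm(prompt)
        if action.strip() == "DONE":
            break
        # In Auto-GPT the action could be a web search, a file write, or
        # executing code; here we just record it in memory.
        memory.append(action)
    return memory
```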

But to my ears, that just sounds like an update of the form “GPT can do many tasks well”, not of the form “aligned oversight is tractable”. Put another way, Auto-GPT is evidence about capabilities, not evidence about the ease of scalable oversight. The question of whether human values can be propagated up through increasingly amplified models seems separate from the question of whether a system can recursively self-improve, in the same way that capabilities progress is distinct from alignment progress.

comment by David Reber (derber) · 2023-04-12T19:04:52.316Z · LW(p) · GW(p)

To clarify: I'm not taking a stance here on whether IDA should be central to alignment. I'm simply claiming that unless your crux for whether IDA is a good alignment strategy is "whether or not recursive improvement is easy to do", your assessment of IDA should probably stay largely unchanged.

comment by David Reber (derber) · 2023-04-12T19:17:04.250Z · LW(p) · GW(p)

Though as a counterpoint, maybe Auto-GPT presents some opportunities to empirically test the IDA proposal? To get a decent experiment, you would need a good metric for alignment (does one exist?) and a demonstration that, as you implement IDA using Auto-GPT, the metric is at least maintained even as capabilities improve in the newer models.

I'm overall skeptical of my particular proposal, however, because 1. I'm not aware of any well-rounded "alignment" metrics, and 2. you'd need to be confident that you can scale it up without losing control (because if the experiment fails, then by definition you've produced a more powerful AI that is less aligned).
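For what it's worth, here is a rough sketch of the shape such an experiment could take. Every name here is a hypothetical placeholder, and, per point 1, `alignment_score` is exactly the piece we don't know how to build:

```python
def amplify(model):
    """Hypothetical: wrap `model` in an Auto-GPT-style harness that
    decomposes tasks across many calls, yielding a stronger composite."""
    raise NotImplementedError


def distill(amplified_system):
    """Hypothetical: train a fresh model to imitate the amplified system."""
    raise NotImplementedError


def alignment_score(model) -> float:
    """Hypothetical metric; see point 1 for why this is the hard part."""
    raise NotImplementedError


def capability_score(model) -> float:
    """Hypothetical capabilities benchmark."""
    raise NotImplementedError


def run_ida_experiment(model, n_iterations: int) -> None:
    baseline = alignment_score(model)
    for i in range(n_iterations):
        # One IDA step: amplify, then distill the result into a new model.
        model = distill(amplify(model))
        score = alignment_score(model)
        # The proposal only counts as a success if alignment is at least
        # maintained while capabilities improve across iterations.
        assert score >= baseline, "experiment failed: alignment degraded"
        print(i, score, capability_score(model))
```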

But it's plausible to me that someone could find some good use for Auto-GPT in alignment research, now that it has been developed. It's just not clear to me how to do so in a net-positive way.

Replies from: mtrazzi
comment by Michaël Trazzi (mtrazzi) · 2023-04-12T23:22:20.339Z · LW(p) · GW(p)

The evidence I'm interested in goes something like:

  • we have more empirical ways to test IDA
  • it seems like future systems will decompose / delegate tasks to sub-agents, so if we think either 1) task decomposition will be an important part of the final model that successfully recursively self-improves, or 2) there are non-trivial chances that this approach leads us to AGI before we can try other things, then maybe it's high EV to focus more on IDA-like approaches?
answer by Charlie Steiner · 2023-04-15T16:46:22.170Z · LW(p) · GW(p)

Maybe.[1]

  1. ^

    Even though language models are impressive, and it's worth being aware that you could try to do amplification with language models using something like chain-of-thought prompting or AutoGPT's task-breakdown prompts, I still think that the typical IDA architecture is too prone to essentially training the model to hack itself. Heck, I'm worried that if you arranged humans in an IDA architecture, the humans would effectively "hack themselves."

    But given the suitability of language models for things even sorta like IDA, I agree you're right to bring this up, and maybe there's something clever nearby that we should be searching for.
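To make the pattern concrete, here is a minimal sketch of one amplification step in this style, assuming a hypothetical `llm` completion call. It's one possible task-breakdown pattern, not the canonical IDA architecture:

```python
def llm(prompt: str) -> str:
    """Hypothetical stand-in for a language-model completion call."""
    raise NotImplementedError


def amplified_answer(question: str) -> str:
    # Decompose: ask the model for subquestions (task-breakdown prompting).
    subquestions = llm(
        f"List, one per line, the subquestions needed to answer: {question}"
    ).splitlines()
    # Answer each subquestion with a plain, unamplified model call.
    subanswers = [llm(f"Answer briefly: {q}") for q in subquestions if q.strip()]
    # Recombine: answer the original question given the subanswers.
    context = "\n".join(subanswers)
    return llm(f"Given these findings:\n{context}\nAnswer: {question}")
```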
