Quick thoughts on "scalable oversight" / "super-human feedback" research

post by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-01-25T12:55:31.334Z · LW · GW · 9 comments

Contents

9 comments

The current default view seems to roughly be:

A common example used for motivating scalable oversight is the "AI CEO".

My views are:

9 comments

Comments sorted by top scores.

comment by paulfchristiano · 2023-01-25T17:39:48.694Z · LW(p) · GW(p)

If you don't like AI systems doing tasks that humans can't evaluate, I think you should be concerned about the fact that people keep building larger models and fine-tuning them in ways that elicit intelligent behavior.

Indeed, I think current scaling up of language models is likely net negative (given our current level of preparedness) and will become more clearly net negative over time as risks grow. I'm very excited about efforts to monitor and build consensus about these risks, or to convince or pressure AI labs to slow down development as further scaling becomes more risky.

But I think something has gone quite wrong if our collective strategy ends up being "We keep training smarter systems, but fortunately we are only able to fine-tune them to do tasks with external feedback signals from the world." You're going to get increasingly smart systems that can predict and manipulate observable features of the world; this is just asking for a catastrophic failure from reward hacking or deceptive alignment.

Even the world where you keep building larger and larger language models, but avoid ever training them to act in the world, seems like a recipe for trouble. It creates an unstable situation where anyone who fine-tunes a model can cause a catastrophe, or where a tiny behavioral quirk of a model could lead it to take over the world.

If we want to avoid AI systems taking over the world then I think we should strongly prefer to do it by stopping the creation of systems smart enough to do so. That seems like a much better place for norms.

I think this entire picture becomes even more compelling if you are mostly worried about deceptive alignment. I think training bigger models seems like the overwhelmingly dominant risk factor, and making models of a fixed size more useful seems like an extremely plausible intervention for reducing the risk of deceptive alignment.

Perhaps the big difference is that I think it matters a lot whether AI systems are smart enough to be an AI CEO; and not so much whether people are actually trying to employ their AI as a CEO. The risk comes from a deceptive aligned AI (or reward-hacking AI) having that level of competence. If you are in that unfortunate world, then you probably do unfortunately want a bunch of aligned AIs doing complex stuff to help you survive.

If there is an attempt to build consensus in the ML community around "Hey you can train language models just try not to have them do complicated stuff" then I wouldn't push back on it. But I think saying "Don't try to align AI systems that do complex tasks because it interferes with norms against using AIs  to do complex tasks" is actively counterproductive. And I think this is a less reasonable ask than "just stop building such big language models," and so I'll argue against it being the main ask the safety community focuses on.

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-01-25T23:25:16.142Z · LW(p) · GW(p)

I understand your point of view and think it is reasonable.

However, I don't think "don't build bigger models" and "don't train models to do complicated things" need to be at odds with each other.  I see the argument you are making, but I think success on these asks are likely highly correlated via the underlying causal factor of humanity being concerned enough about AI x-risk and coordinated enough to ensure responsible AI development.

I also think the training procedure matters a lot (and you seem to be suggesting otherwise?), since if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time (if ever).

 

Replies from: ediz-ucar
comment by TheFigMaster (ediz-ucar) · 2024-02-29T13:07:54.375Z · LW(p) · GW(p)

if you don't do RL or other training schemes that seem designed to induce agentyness and you don't do tasks that use an agentic supervision signal, then you probably don't get agents for a long time

 

Is this really the case? If you imagine a perfect Oracle AI [? · GW], which is certainly not agenty, it seems to me that with some simple scaffolding, one could construct a highly agentic system. It would go something along the lines of 

  1. Setup API access to 'things' which can interact with the real world. 
  2. Ask the oracle 'What would be the optimal action if you want to do <insert-goal> via <insert-api-functions>?'
  3. Do the actions that are outputted.
  4. Some kind of looping mechanism to gain feedback from the world and account for it.

This is my line of reasoning why AIS matters for language models in general.

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-03-06T21:08:26.047Z · LW(p) · GW(p)

I meant "other training schemes" to encompass things like scaffolding that deliberately engineers agents using LLMs as components, although I acknowledge they are not literally "training" and more like "engineering".

Replies from: ediz-ucar
comment by TheFigMaster (ediz-ucar) · 2024-03-07T17:59:22.579Z · LW(p) · GW(p)

The thing that we care about is how long it takes to get to agents. If we put lots of effort making powerful Oracle systems or other non-agentic systems, we must assume that agentic systems will follow shortly. Someone will make them, even if you do not. 

Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-03-15T12:07:53.173Z · LW(p) · GW(p)

I don't disagree... in this case you don't get agents for a long time; someone else does though.

comment by Dan H (dan-hendrycks) · 2023-01-30T05:22:36.443Z · LW(p) · GW(p)

When ML models get more competent, ML capabilities researchers will have strong incentives to build superhuman models. Finding superhuman training techniques would be the main thing they'd work on. Consequently, when the problem is more tractable, I don't see why it'd be neglected by the capabilities community--it'd be unreasonable for profit maximizers not to have it as a top priority when it becomes tractable. I don't see why alignment researchers have to work in this area with high externalities now and ignore other safe alignment research areas (in practice, the alignment teams with compute are mostly just working on this area). I'd be in favor of figuring out how to get superhuman supervision for specific things related to normative factors/human values (e.g., superhuman wellbeing supervision), but researching superhuman supervision simpliciter will be the aim of the capabilities community.

Don't worry, the capabilities community will relentlessly maximize vanilla accuracy, and we don't need to help them.

comment by Charlie Steiner · 2023-01-25T17:43:00.636Z · LW(p) · GW(p)

I think I disagree with lots of things in this post, sometimes in ways that partly cancel each other out.

  • Parts of generalizing correctly involve outer alignment. I.e. building objective functions that have "something to say" about how humans want the AI to generalize.
  • Relatedly, outer alignment research is not done, and RLHF/P is not the be-all-end-all.
  • I think we should be aiming to build AI CEOs (or more generally, working on safety technology with an eye towards how it could be used in AGI that skillfully navigates the real world). Yes, the reality of the game we're playing with gung-ho orgs is more complicated, but sometimes, if you don't, someone else really will.
  • Getting AI systems to perform simpler behaviors safely also looks like capabilities research. When you say "this will likely require improving sample efficiency," a bright light should flash. This isn't a fatal problem - some amount of advancing capabilities is just a cost of doing business. There exists safety research that doesn't advance capabilities, but that subset has a lot of restrictions on it (little connection to ML being the big one). Rather than avoiding ever advancing AI capabilities, we should acknowledge that fact in advance and try to make plans that account for it.
Replies from: capybaralet
comment by David Scott Krueger (formerly: capybaralet) (capybaralet) · 2023-01-25T23:18:57.006Z · LW(p) · GW(p)

(A very quick response):


Agree with (1) and (2).  
I am ambivalent RE (3) and the replaceability arguments.
RE (4): I largely agree, but I think the norm should be "let's try to do less ambitious stuff properly" rather than "let's try to do the most ambitious stuff we can, and then try and figure out how to do it as safely as possible as a secondary objective".