Language Model Tools for Alignment Research

post by Logan Riggs (elriggs) · 2022-04-08

Contents

  How does this reduce x-Risk?
  Differential Impact
  Counterfactual Impact
  Current Work
  Open Problems/Future Work
  Who to Contact

I do not speak for the rest of the people working on this project.

[I think it's valuable to have clear, short intros on different research agendas & projects]

How does this reduce x-Risk?

AI will continue to become more powerful; we should leverage this to accelerate alignment research. Language model capabilities will follow this trend (even if transformers don't lead to AGI, whatever comes next will still be capable of language tasks, so this argument doesn't rely on transformers scaling). If you believe that certain research agendas reduce x-risk, then giving those researchers better tools to do their work faster also reduces x-risk.

Differential Impact

Tools that can accelerate alignment research can probably be repurposed to accelerate capabilities research, so wouldn't developing these tools be net negative?

Yes, they could be, especially if you gave them out to everybody, or if you were a for-profit company with incentives to do so.

Only giving them out to alignment researchers avoids most of this. There's also plenty of alignment-researcher-specific work to do, such as fine-tuning models on alignment writing and building interfaces and data-labeling pipelines tailored to alignment researchers (see Current Work below).

That said, we could still advance capabilities just by being the first to reach a capability and announcing that we succeeded. For example, OpenAI released their "inserting text" feature without explaining how they did it, yet the people I work with figured out a way to do it based on that information alone. The moral: even just announcing that you succeeded leaks bits of information that people in the know can work backwards from.

Counterfactual Impact

Even if a project like this never gets started, there are huge economic incentives to make similar products and sell them widely. Elicit, Cohere, Jasper (previously Jarvis), OpenAI, DeepMind, and more in the years to come will create these products, so why shouldn't we just use theirs, since they're likely to beat us to it and do it better?

Good point. Beyond the points made in "Differential Impact," having infrastructure and people ready to quickly integrate the latest advances into alignment researchers' workflows is useful even in this scenario. This includes engineers, pre-existing code for interfaces, and a data-labeling pipeline.

Current Work

We've scraped LessWrong (including the Alignment Forum), some blogs, relevant arXiv papers along with the papers they cite, and books. We are currently fine-tuning language models on this corpus and trying to get them to do useful tasks for us.
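To make the fine-tuning step concrete, here is a minimal sketch using the Hugging Face transformers and datasets libraries; the base model (gpt2), the corpus file name, and the hyperparameters are placeholder assumptions rather than a description of our actual setup:

```python
# Minimal causal-LM fine-tuning sketch (model, paths, and hyperparameters are
# placeholder assumptions, not the project's actual configuration).
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "gpt2"                    # stand-in; any causal LM works
CORPUS_PATH = "alignment_corpus.txt"   # hypothetical cleaned, scraped text

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Assume one document per line in the text file.
dataset = load_dataset("text", data_files={"train": CORPUS_PATH})["train"]

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="alignment-lm",
        num_train_epochs=1,
        per_device_train_batch_size=2,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice the corpus would be chunked per document and the run scaled up, but the shape of the pipeline is the same.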

We've also released a survey and talked to several people about what would be most useful for them. However, the most valuable feedback may come from actually demoing tools for different users.

Open Problems/Future Work

1. Cleaning up Data (see the sketch after this list)

2. Collecting and listing more sources of Alignment Data

3. Creating a pipeline of data-labelers for specific tasks

4. Trying to get our models to do specific tasks

5. Getting better feedback on the most useful tools for alignment researchers
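For item 1, here is a minimal sketch of one possible cleaning step: converting scraped HTML posts into deduplicated plain text. The directory layout, the use of BeautifulSoup, and the hash-based deduplication are illustrative assumptions, not a description of our actual pipeline:

```python
# Sketch of one cleaning step: scraped HTML posts -> deduplicated plain text.
# File layout and dedup strategy are assumptions for illustration.
import hashlib
from pathlib import Path

from bs4 import BeautifulSoup  # pip install beautifulsoup4

RAW_DIR = Path("raw_html")     # hypothetical directory of scraped posts
OUT_DIR = Path("clean_text")
OUT_DIR.mkdir(exist_ok=True)

seen_hashes = set()

for html_file in RAW_DIR.glob("*.html"):
    soup = BeautifulSoup(html_file.read_text(encoding="utf-8"), "html.parser")

    # Drop scripts, styles, and nav chrome that shouldn't go into training data.
    for tag in soup(["script", "style", "nav", "header", "footer"]):
        tag.decompose()

    text = "\n".join(
        line.strip() for line in soup.get_text("\n").splitlines() if line.strip()
    )

    # Skip empty pages and exact duplicates (e.g. posts scraped from both LW and AF).
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if not text or digest in seen_hashes:
        continue
    seen_hashes.add(digest)

    (OUT_DIR / f"{html_file.stem}.txt").write_text(text, encoding="utf-8")
```

Near-duplicate detection, footnote handling, and LaTeX-in-HTML are the messier parts that a real pipeline would need to address.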

Who to Contact

DM me. You can also join us in the #accelerating-alignment channel on the EleutherAI Discord server: https://discord.gg/67AYcKK6 (or DM me for a fresh link if that one has expired).
