Truthful LMs as a warm-up for aligned AGI

post by Jacob_Hilton · 2022-01-17T16:49:24.270Z · LW · GW · 14 comments

Contents

  Warm-ups for aligned AGI
  Truthful LMs as a good warm-up
  Why focus on negligent falsehoods?
  Medium-term vision for truthful LMs
  Comparison with related proposals
    Method-driven projects
    Aligning language models in general
  Common objections
    Lack of focus
    AI unboxing
    Similarity to capabilities research
  Conclusion
  Request for feedback

This post is heavily informed by prior work, most notably that of Owain Evans, Owen Cotton-Barratt and others (Truthful AI [AF · GW]), Beth Barnes (Risks from AI persuasion [AF · GW]), Paul Christiano (unpublished) and Dario Amodei (unpublished), but was written by me and is not necessarily endorsed by those people. I am also very grateful to Paul Christiano, Leo Gao, Beth Barnes, William Saunders, Owain Evans, Owen Cotton-Barratt, Holly Mandel and Daniel Ziegler for invaluable feedback.

In this post I propose to work on building competitive, truthful language models or truthful LMs for short. These are AI systems that are:

Such systems will likely be fine-tuned from large language models such as GPT-3, hence the name.

WebGPT is an early attempt in this direction. The purpose of this post is to explain some of the motivation for building WebGPT, and to seek feedback on this direction.

Truthful LMs are intended as a warm-up for aligned AGI. This term is used in a specific way in this post to refer to an empirical ML research direction with the following properties:

  1. Practical. The goal of the direction is plausibly achievable over the timescale of a few years.
  2. Valuable. The direction naturally leads to research projects that look helpful for AGI alignment.
  3. Mirrors aligned AGI. The goal is structurally similar to aligned AGI on a wide variety of axes.

The remainder of the post discusses:

Warm-ups for aligned AGI

There are currently a number of different empirical ML research projects aimed at helping with AGI alignment. A common strategy for selecting such projects is to choose a research goal that naturally leads to helpful progress, such as summarizing books or rarely describing injury in fiction [AF · GW]. Often, work on the project is output-driven, taking a no-holds-barred approach to achieving the selected goal, which has a number of advantages that aren't discussed here. On the other hand, goal selection is usually method-driven, tailored to test a particular method, such as recursive decomposition or adversarial training.

The idea of a warm-up for aligned AGI, as defined above, is to take the output-driven approach one step further. Instead of selecting projects individually, we attempt to choose a more ambitious research goal that naturally leads to helpful projects. Because it is harder to predict the course of research over multiple projects, we also try to make the goal structurally similar to aligned AGI, to make it more likely that unforeseen and auxiliary projects will also be valuable.

Whether this output-driven approach to project selection is preferable to the method-driven approach depends on more specific details that will be discussed later. But it is worth first outlining the advantages of each approach in broad strokes:

Truthful LMs as a good warm-up

In this section I will argue that truthful LMs serve as a particularly good warm-up for aligned AGI, in the sense defined above.

To begin with, truthful LMs are structurally similar to aligned AGI on a wide variety of axes:

Because of these similarities, working on truthful LMs offers numerous benefits. Perhaps most importantly, it naturally leads in several directions that are also attractive from a method-driven perspective:

In addition, there are a number of broader benefits to working on truthful LMs:

Overall, working on truthful LMs seems practical and valuable, and mirrors aligned AGI in enough ways to make it highly promising as an empirical ML research direction.

Why focus on negligent falsehoods?

Most of the arguments in favor of working on truthful LMs apply equally well to working on aligning language models in general. However, the definition of truthful LMs specifically singles out negligent falsehoods [AF · GW]: statements that are unacceptably likely to be false, and where it should have been feasible for an AI system to understand this. This is done for several reasons:

The most obvious drawback of focusing on negligent falsehoods is that they are more ambiguous than falsehoods in general. In practice, I think it will be fine to focus on falsehoods that are plausibly negligent: it will be OK if some effort goes into capabilities that improve truthfulness, as long as they do not become the main focus. Such capabilities may also enable new alignment strategies: for example, the use of retrieval in WebGPT opened the door to improved evaluation of factual accuracy via the use of references. For the purposes of evaluating progress, it will be fine to make reasonable judgments about how likely a falsehood is to be negligent.
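
As a toy illustration of the reference-based evaluation idea, here is a minimal sketch of checking an answer against the references the model itself cited. This is not WebGPT's actual pipeline, and the claim-splitting and support-checking helpers are hypothetical stand-ins for human judgment or an entailment model.

```python
# Toy sketch, not WebGPT's actual evaluation pipeline: flag claims in an answer
# that none of the model's own cited references support. Such unsupported claims
# are candidates for negligent falsehoods, since the model had the evidence in
# front of it when it made them. `extract_claims` and `claim_is_supported` are
# hypothetical stand-ins for real claim-splitting and support checks.

from dataclasses import dataclass, field


@dataclass
class Answer:
    text: str
    references: list[str] = field(default_factory=list)  # quoted passages the model cited


def extract_claims(text: str) -> list[str]:
    """Hypothetical: split an answer into individual factual claims."""
    return [s.strip() for s in text.split(".") if s.strip()]


def claim_is_supported(claim: str, references: list[str]) -> bool:
    """Hypothetical: does any cited reference support the claim?
    A real version would use human judgment or an entailment model."""
    return any(claim.lower() in ref.lower() for ref in references)  # placeholder check


def unsupported_claims(answer: Answer) -> list[str]:
    """Claims with no supporting reference: candidates for negligent falsehoods."""
    return [c for c in extract_claims(answer.text)
            if not claim_is_supported(c, answer.references)]
```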

Medium-term vision for truthful LMs

Truthful LMs are a target that could be pursued with various mindsets. At one extreme, one could take a very method-driven approach to selecting projects, and simply incorporate a preference for goals that can be framed in terms of truthfulness. At the other extreme, one could mostly try to make language models more useful, while adhering to relatively high standards of truthfulness along the way. Where to land on this spectrum depends on how one trades off the advantages of output-driven and method-driven approaches, as discussed above.

My tentative inclination is towards a middle ground, remaining slightly method-driven while having a clear medium-term vision for the next few years. In this spirit, here is a first attempt at such a vision.

The system is a truthful pure-text assistant:

I think that achieving such a system would be a lot of work, but would not require any fundamental insights, and could be achieved with pre-trained models of the future using the methods of WebGPT together with some form of debate and adversarial training.

Comparison with related proposals

There are a number of similar approaches that have recently been proposed or are currently being pursued. I am generally a fan of these approaches, but it is worth discussing how they compare.

Method-driven projects

Some alternative proposals also focus on improving the behavior of contemporary models, but are more method-driven:

As discussed above, there are trade-offs between being method-driven and being output-driven when selecting projects. Overall, it seems plausible to me that method-driven projects are currently the most valuable empirical ML projects on the margin, since they are the most carefully targeted. On the other hand, being output-driven is a longer-term play, and may be able to make better use of people who thrive on practical problems in particular. Hence I would argue in favor of a portfolio approach.

Aligning language models in general

Another category of proposals is very similar to working on truthful LMs, but focused on a more general notion of alignment than truthfulness:

I do think it makes sense to incorporate criteria other than truthfulness when aligning language models, and so these projects may end up being very similar to working on truthful LMs in practice. However, I would argue in favor of such projects placing particular emphasis on negligent falsehoods, for the reasons discussed above.

Common objections

Working on truthful LMs shares a number of possible objections with Aligning narrowly superhuman models [AF · GW]. In addition to these, there are some more specific objections.

Lack of focus

One concern with working on truthful LMs is that it will be insufficiently focused on the core parts of alignment, as a result of being too output-driven. I think this concern is pretty reasonable, and can largely be mitigated by not being completely output-driven, but instead retaining some of the method-driven mindset, as discussed above.

Exactly where to fall on this spectrum is a difficult question. I think that there are a couple of potential cruxes that lead people to have different intuitions on it:

AI unboxing

Another concern is that working on truthful LMs may lead to AI being "let out of the box" by encouraging research in which models interact with the external world agentically, in the manner of WebGPT.

I think this concern is worth taking seriously, but that the case for it is weak:

There is still an argument that there will be a period during which AI is capable enough to cause serious damage, but not capable enough to escape from sandboxed environments, and that setting precedents could worsen the risks posed during this interval. I don't currently find this argument persuasive, but would be interested to hear if there is a more persuasive version of it. That said, one bright line that stands out is training models to perform tasks that actually require real-world side effects, and I think it makes sense to think carefully before crossing that line.

Similarity to capabilities research

The output-driven approach has its advantages, but it also makes the research more similar to capabilities research, which exacerbates some other potential concerns. In each case, I think that the response given in Aligning narrowly superhuman models [AF · GW] remains valid, but each concern is worth commenting on:

Conclusion

I think that working on truthful LMs has a comparative advantage in worlds where:

  • We have around 10-40 years until transformative AI
  • Transformative AI is built using techniques that resemble modern deep learning
  • There is a slow takeoff
  • Alignment does not require vastly more theoretical insight (but may require some)
  • Our current picture of the risks posed by transformative AI is incomplete

These all seem like plausible assumptions to me, which probably goes some way towards explaining why I find truthful LMs compelling. I'm of course also keen on other work that is more valuable under different assumptions.

On the whole, working on truthful LMs seems highly promising to me as part of a portfolio of approaches aimed at AGI alignment, especially for people who are drawn to practical agendas.

Request for feedback

By default, this is the research direction I'll continue to pursue at OpenAI. It's therefore very valuable for me to know if it's horribly mistaken, or even if it's just clearly less valuable than alternative directions on the margin. Equally, if you're very excited by this research direction, then we should coordinate. In addition to leaving comments, please feel free to reach out to me at jhilton@openai.com if your feedback would be more convenient to give privately or via a different medium.

14 comments


comment by Charlie Steiner · 2022-01-19T08:26:51.325Z · LW(p) · GW(p)

Here's my worry.

If we adopt a little bit of deltonian pessimism [LW · GW] (though not the whole hog), and model present-day language models as doing something vaguely like nearest-neighbor interpolation in a slightly sophisticated latent space (while still being very impressive), then we might predict that there are going to be some ways of getting honest answers an impressive percentage of the time while staying entirely within the interpolation regime.

And then if you look at the extrapolation regime, it's basically the entire alignment problem squeezed into a smaller space! So I worry that people are going to do the obvious things, get good answers on 90%+ of human questions, and then feel some kind of pressure to write off the remainder as not that important ("we've got honest answers 98% of the time, so the alignment problem is like 98% solved"). When what I want is for people to use language models as a laboratory to keep being ambitious, and do theory-informed experiments that try to push the envelope in terms of extrapolating human preferences in a human-approved way.

Replies from: Jacob_Hilton
comment by Jacob_Hilton · 2022-01-19T17:05:22.371Z · LW(p) · GW(p)

I can think of a few different interpretations of your concern (and am interested to hear if these don't cover it):

  • There will be insufficient attention paid to robustness.
  • There will be insufficient attention paid to going beyond naive human supervision.
  • The results of the research will be misinterpreted as representing more progress than is warranted.

I agree that all of these are possibilities, and that the value of the endeavor could well depend on whether the people conducting (and communicating) the research are able to avoid pitfalls such as these.

There's certainly more object-level discussion to be had about how much emphasis should be placed on avoiding these particular pitfalls, and I'm happy to dig in to them further if you're able to clarify which if any of them capture your main concern.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2022-01-19T20:26:27.467Z · LW(p) · GW(p)

I think there are different kinds of robustness, and people focused on present-day applications (including tests that are easy to do in the present) are going to focus on the kinds of robustness that help with present-day problems. Being robust to malicious input from human teenagers will only marginally help make you robust to input from a future with lots of extra technological progress. They might have very different-looking solutions, because of factors like interpolation vs. extrapolation.

Framing it this way suggests one concrete thing I might hope for you to do, which is to create artificial problems for the language model that you think will exercise kinds of robustness and generalization not represented by the problem of fine-tuning GPT (or a BERT-based classifier) to be robust to the teenager distribution.

Replies from: Jacob_Hilton
comment by Jacob_Hilton · 2022-01-19T21:18:59.554Z · LW(p) · GW(p)

one concrete thing I might hope for you to do...

I think this is included in what I intended by "adversarial training": we'd try to find tasks that cause the model to produce negligent falsehoods, train the model to perform better at those tasks, and aim for a model that is robust to someone searching for such tasks.
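
A minimal sketch of that kind of loop, with the search, labelling and fine-tuning steps left as hypothetical callables rather than anything WebGPT actually implements:

```python
# Minimal sketch of the adversarial-training loop described above: repeatedly
# search for tasks that elicit negligent falsehoods, collect corrective data on
# them, and fine-tune. The three callables are hypothetical stand-ins.

from typing import Any, Callable


def adversarial_training(
    model: Any,
    find_failing_tasks: Callable[[Any], list[str]],     # search by humans and/or attacker models
    label_falsehoods: Callable[[Any, list[str]], Any],   # corrective labels / comparisons
    finetune: Callable[[Any, Any], Any],                 # training step on the new data
    rounds: int = 5,
) -> Any:
    for _ in range(rounds):
        failing_tasks = find_failing_tasks(model)  # tasks where the model states negligent falsehoods
        if not failing_tasks:
            break  # no failures found within the search budget
        data = label_falsehoods(model, failing_tasks)
        model = finetune(model, data)
    return model
```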

Replies from: Charlie Steiner
comment by Charlie Steiner · 2022-01-19T22:56:12.391Z · LW(p) · GW(p)

Sure - another way of phrasing what I'm saying is that I'm not super interested (as alignment research, at least) in adversarial training that involves looking at difficult subsets of the training distribution, or adversarial training where the proposed solution is to give the AI more labeled examples that effectively extend the training distribution to include the difficult cases.

It would be bad if we build an AI that wasn't robust on the training distribution, of course, but I think of this as a problem already being addressed by the field of ML without any need for looking ahead to AGI.

comment by Jon Garcia · 2022-01-17T21:41:55.187Z · LW(p) · GW(p)

I like this approach to alignment research. Getting AIs to be robustly truthful (producing language output that is consistent with their best models of reality, modulo uncertainty) seems like it falls in the same space as getting them to keep their goals consistent with their best estimates of human goals and values.

As for avoiding negligent falsehoods, I think it will be crucial for the AI to have explicit estimates of its uncertainty for anything it might try to say. To a first approximation, assuming the system can project statements to a consistent conceptual space, it could predict the variance in the distribution of opinions in its training data around any particular issue. Then it could use this estimate of uncertainty to decide whether to state something confidently, to add caveats to what it says, or to turn it into a question for the interlocutor.
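
A minimal sketch of that last step, assuming the uncertainty estimate is already available as a number between 0 and 1; the thresholds and phrasings are arbitrary placeholders.

```python
# Toy sketch: use a scalar uncertainty estimate (however it is obtained, e.g.
# from the spread of opinions in the training data) to decide whether to assert,
# hedge, or hand the question back to the interlocutor. Thresholds are arbitrary.

def phrase_with_uncertainty(statement: str, uncertainty: float,
                            assert_below: float = 0.2,
                            hedge_below: float = 0.6) -> str:
    if uncertainty < assert_below:
        return statement                                  # confident: state it plainly
    if uncertainty < hedge_below:
        return "As far as I can tell, " + statement       # add a caveat
    return "I'm not sure. What do you think: " + statement  # turn it into a question


# Example: phrase_with_uncertainty("the capital of Australia is Canberra.", 0.05)
```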

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-19T21:56:42.965Z · LW(p) · GW(p)

Thanks for this!

I think that working on truthful LMs has a comparative advantage in worlds where:
--We have around 10-40 years until transformative AI
--Transformative AI is built using techniques that resemble modern deep learning
--There is a slow takeoff
--Alignment does not require vastly more theoretical insight (but may require some)
--Our current picture of the risks posed by transformative AI is incomplete

Can you elaborate on what you mean by slow takeoff here?

Also, what do you mean by the current picture of the risks being incomplete? What would it even mean for our picture to be complete?

Replies from: Jacob_Hilton
comment by Jacob_Hilton · 2022-01-19T23:11:54.129Z · LW(p) · GW(p)

Thanks for these questions, these phrases were ambiguous or poorly chosen:

  • By "slow takeoff", I had in mind the "Paul slow takeoff [AF · GW]" definition, although I think the (related) "Continuous takeoff [AF · GW]" definition is more relevant to this post. The point is that trying to have alignment continually keep pace with capabilities, and to catch misalignment early, seems less valuable if there is going to be a sudden jump in capabilities. (I could be wrong about this, as I don't think I understand the fast takeoff viewpoint well.)
  • By "our current picture of the risks is incomplete", I meant something like: a significant portion of the total existential risk from AI comes from scenarios that have not yet been clearly articulated. More specifically, I had in mind power-seeking misalignment [EA · GW] as the most clearly articulated risk, so I think it would have been better to say: a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment. Examples of potential sources of such risk include AI persuasion [AF · GW], social upheaval, deliberate misuse, authoritarianism and unforeseen risks.
Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-20T01:12:12.923Z · LW(p) · GW(p)

Thanks, these clarifications are very helpful.

FWIW I think Paul slow takeoff is pretty unlikely for reasons to be found in this thread [LW(p) · GW(p)] and this post [LW · GW]. On the other hand, as someone who thinks fast takeoff (in various senses) is more likely than not, I don't yet see why that makes Truthful LM work significantly less useful. (By contrast I totally see why Truthful LM work is significantly less useful if AGI/TAI/etc. comes from stuff that doesn't resemble modern deep learning.)

"Catch misalignment early..." This makes it sound like misalignment is something that AIs don't have yet but might one day have, so we need to be vigilant and notice it when it appears. But instead isn't misalignment something that all AIs have by default?

My current view is that power-seeking misalignment will probably cause existential catastrophe, that persuasion tools happen first and have a >20% chance of destroying our ability to solve that problem, and that there are various philosophical and societal problems that could (>20%) get us even if we solve power-seeking misalignment. Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

Replies from: MondSemmel, Jacob_Hilton
comment by MondSemmel · 2022-01-20T15:32:08.246Z · LW(p) · GW(p)

FYI, the "this thread" link in your comment doesn't work. Apparently it's possible for a link to be simultaneously green and unclickable.

Replies from: Zack_M_Davis, daniel-kokotajlo
comment by Zack_M_Davis · 2022-01-20T15:44:11.205Z · LW(p) · GW(p)

(The underlying HTML is <a href="http://I think that working on truthful LMs has a comparative advantage in worlds where: We have around 10-40 years until transformative AI Transformative AI is built using techniques that resemble modern deep learning There is a slow takeoff Alignment does not require vastly more theoretical insight (but may require some) Our current picture of the risks posed by transformative AI is incomplete">this thread</a>. I am also surprised that this results in clicking being a no-op, rather than a "functional" link that leads to your browser's could-not-resolve-host page.)

comment by Jacob_Hilton · 2022-01-20T08:07:42.205Z · LW(p) · GW(p)

"Catch misalignment early..." - This should have been "scary misalignment", e.g. power-seeking misalignment, deliberate deception in order to achieve human approval, etc., which I don't think we've seen clear signs of in current LMs. My thinking was that in fast takeoff scenarios, we're less likely to spot this until it's too late, and more generally that truthful LM work is less likely to "scale gracefully" to AGI. It's interesting that you don't share these intuitions.

Does this mean I agree or disagree with "our current picture of the risks is incomplete?"

As mentioned, this phrase should probably be replaced by "a significant portion of the total existential risk from AI comes from risks other than power-seeking misalignment". There isn't supposed to be a binary cutoff for "significant portion"; the claim is that the greater the risks other than power-seeking misalignment, the greater the comparative advantage of truthful LM work. This is because truthful LM work seems more useful for addressing risks from social problems such as AI persuasion (as well as other potential risks that haven't been as clearly articulated yet, I think). Sorry that my original phrasing was so unclear.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2022-01-20T17:40:58.313Z · LW(p) · GW(p)

Nothing to apologize for, it was reasonably clear, I'm just trying to learn more about what you believe and why. This has been helpful, thanks!

I totally agree that in fast takeoff scenarios we are less likely to spot those things until it's too late. I guess I agree that truthful LM work is less likely to scale gracefully to AGI in fast takeoff scenarios... so I guess I agree with your overall point... I just notice I feel a bit confused and muddled about it, is all. I can imagine plausible slow-takeoff scenarios in which truthful LM work doesn't scale gracefully, and plausible fast-takeoff scenarios in which it does. At least, I think I can. The former scenario would be something like: It turns out the techniques we develop for making dumb AIs truthful stop working once the AIs get smart, for similar reasons that techniques we use to make small children be honest (or to put it more vividly, believe in Santa) stop working once they grow up. The latter scenario would be something like: Actually that's not the case, the techniques work all the way up past human level intelligence, and "fast takeoff" in practice means "throttled takeoff" where the leading AI project knows they have a few months' lead over everyone else and is using those months to do some sort of iterated distillation and amplification, in which it's crucial that the early stages be truthful and that the techniques scale to stage N overseeing stage N+1.