Desiderata for AI alignment

post by paulfchristiano · 2018-11-01

Contents

    1. Secure
    2. Competitive
    3. Scalable
    Aside: feasibility
  Conclusion

I’m most interested in algorithms that are secure, competitive, and scalable, and I think that most research programs are very unlikely to deliver these desiderata (this is why the lists above are so short).

Since these desiderata are doing a lot of work in narrowing down the space of possible research directions, it seems worthwhile to be thoughtful and clear about them. It would be easy to gloss over any of them as obviously unobjectionable, but I would be more interested in people pushing back on the strong forms than implicitly accepting a milder form.

1. Secure

Many pieces of software work “well enough” most of the time; we often learn this not through deep analysis but by just trying them and seeing what happens. “Works well enough” often breaks down when an adversary enters the picture.

Whether or not that’s a good way to build AI, I think it’s a bad way to do alignment research right now.

Instead, we should try to come up with alignment solutions that work in the least convenient world, when nature itself is behaving adversarially. Accomplishing this requires argument and analysis, and cannot be based exclusively on empirical observation.

AI systems obviously won’t work well in the worst case (there is no such thing as a free lunch), but it’s reasonable to hope that our AI systems will never respond to a bad input by actively trying to hurt us, at least as long as we remain in physical control of the computing hardware, the training process, and so on.

Why does security seem important?

(Previous posts: security and AI alignment)

2. Competitive

It’s easy to avoid building an unsafe AI system (for example: build a spreadsheet instead). The only question is how much you have to sacrifice to do it.

Ideally we’ll be able to build benign AI systems that are just as efficient and capable as the best AI we could build by any means. That means: we don’t have to do additional domain-specific engineering work to align our systems, benign AI doesn’t require too much more data or computation, and aligning our systems doesn’t force us to adopt particular techniques or restrict our choices in other ways.

(More precisely, I would consider an alignment strategy a success if the additional costs are sublinear: if the fraction of resources that need to be spent on alignment research and run-time overhead decreases as the AI systems become more powerful, converging towards 0.)
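One way to make that criterion precise (a formalization of my own; this notation does not appear in the post): write C_al(k) for the total cost, in data, compute, and engineering effort, of building an aligned AI system of capability k, and C_un(k) for the cost of the most efficient system of the same capability built without regard to alignment. The success condition is that the relative overhead vanishes as capability grows:

```latex
\[
  \frac{C_{\mathrm{al}}(k) - C_{\mathrm{un}}(k)}{C_{\mathrm{un}}(k)}
  \;\longrightarrow\; 0
  \qquad \text{as } k \to \infty .
\]
```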

Why is competitiveness important?

A. It’s easy to tell when a solution is plausibly competitive, but very hard to tell exactly how uncompetitive an uncompetitive solution will be. For example, if a purported alignment strategy requires an AI not to use technique or development strategy X, it’s easy to tell that this proposal isn’t competitive in general, but very hard to know exactly how uncompetitive it is.

As in the security case, it seems very easy to look into the fog of the future and say “well this seems like it will probably be OK” and then to turn out to be too optimistic. If we hold ourselves to the higher standard of competitiveness, it is much easier to stay honest.

Relatedly, we want alignment solutions that work across an extremely large range of techniques not just because we are uncertain about which techniques will be important, but because generalizing across all of the situations we can foresee is a good predictor of working for situations we can’t foresee.

B. You can’t unilaterally use uncompetitive alignment techniques; we would need global coordination to avoid trouble. If we don’t know how to build competitive benign AI, then users/designers of AI systems have to compromise efficiency in order to maintain reliable control over those systems. The most efficient systems will by default be built by whoever is willing to accept the largest risk of catastrophe (or perhaps by actors who consider unaligned AI a desirable outcome).

It may be possible to avert this kind of race to the bottom through effective coordination, for example by enforcing regulations that mandate adequate investments in alignment or restrict what kinds of AI are deployed. Enforcing such controls domestically is already a huge headache. But internationally things are even worse: a country that handicapped its AI industry in order to proceed cautiously would risk being overtaken by a less prudent competitor, and avoiding that race would require effective international coordination.

Ultimately society will be able and willing to pay some efficiency cost to reliably align AI with human interests. But the higher that cost, the harder the coordination problem that we will need to solve. I think the research community should be trying to make that coordination problem as easy as possible.

(Previous posts: prosaic AI alignment, a possible stance for AI control, efficient and safely scalable)

3. Scalable

Over time, we are acquiring more data, more powerful computers, richer model classes, better optimization algorithms, better exploration strategies, and so on. If we extrapolate these trends, we end up with very powerful models and policies.

Many approaches to alignment break down at some point in this extrapolation. For example, if we train an RL agent with a reward function which imperfectly approximates what we want, it is likely to fail once the agent becomes sufficiently sophisticated — unless the reward function itself becomes more sophisticated in parallel.
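As a toy illustration of that failure mode (a minimal sketch of my own, not an example from the post; every name in it is invented), here a proxy reward agrees with the true objective almost everywhere but contains one narrow flaw, and “optimization power” is crudely modeled as the number of candidate actions searched. A weak search typically lands near the intended optimum; a strong search reliably finds and exploits the flaw:

```python
import random

# True objective: we want the agent's action x to be close to 1.
def true_utility(x):
    return -(x - 1.0) ** 2

# Hand-engineered proxy reward: matches the true objective almost everywhere,
# except for a narrow flaw the designer never anticipated.
def proxy_reward(x):
    bonus = 10_000.0 if 41.9 < x < 42.1 else 0.0
    return true_utility(x) + bonus

# Crude stand-in for "optimization power": sample n candidate actions
# uniformly and keep the one the proxy scores highest.
def best_action(n, rng):
    return max((rng.uniform(-50.0, 50.0) for _ in range(n)), key=proxy_reward)

rng = random.Random(0)
for n in (10, 100, 100_000):
    x = best_action(n, rng)
    print(f"n={n:>6}  x={x:8.3f}  proxy={proxy_reward(x):10.2f}  true={true_utility(x):10.2f}")
```

Random search stands in for an RL agent only to keep the sketch self-contained; the same dynamic is what the paragraph predicts for sophisticated learned policies, unless the reward function grows more sophisticated in parallel.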

In contrast, let’s say that a technique is “scalable” if it continues to work just as well even when the underlying learning becomes much more powerful. (See also: Eliezer’s more colorful “omnipotence test.”)

This is another extremely demanding requirement. It rules out many possible approaches to alignment. For example, it probably rules out any approach that involves hand-engineering reward functions. More subtly, I expect it will rule out any approach that requires hand-engineering an informative prior over human values (though some day we will hopefully find a scalable approach to IRL).

Why is scalability important?

(Previous posts: scalable AI control)

Aside: feasibility

One might reject these desiderata because they seem too demanding: it would be great if we had a secure, competitive and scalable approach to alignment, but that might not be possible.

I am interested in trying to satisfy these desiderata despite the fact that they are quite demanding, for two reasons:

Conclusion

I think there is a lot of research to be done on AI alignment; we are limited by a lack of time and labor rather than by a lack of ideas about how to make progress.

Research relevant to alignment is already underway; researchers and funders interested in alignment can get a lot of mileage by supporting and fleshing out existing research programs in relevant directions. I don’t think it is correct to assume that if anyone is working on a problem then it is going to get solved — even amongst things that aren’t literally at the “no one else is doing it” level, there are varying degrees of neglect.

At the same time, the goals of alignment are sufficiently unusual that we shouldn’t be surprised or concerned to find ourselves doing unusual research. I think that area #3 on deliberation and amplification is almost completely empty, and will probably remain pretty empty until we have clearer statements of the problem or convincing demonstrations of work in that area.

I think the distinguishing feature of research motivated by AI alignment should be an emphasis on secure, competitive, and scalable solutions. I think these are very demanding requirements that significantly narrow down the space of possible approaches and which are rarely explicitly considered in the current AI community.

It may turn out that these requirements are infeasible; if so, one key output of alignment research will be a better understanding of the key obstacles. This understanding can help guide less ambitious alignment research, and can help us prepare for a future in which we won’t have a completely satisfying solution to AI alignment.

This post has mostly focused on research that would translate directly into concrete systems. I think there is also a need for theoretical research building better abstractions for reasoning about optimization, security, selection, consequentialism, and so on. It is plausible to me that we will produce acceptable systems with our current conceptual machinery, but if we want to convincingly analyze those systems then I think we will need significant conceptual progress (and better concepts may lead us to different approaches). I think that practical and theoretical research will be attractive to different researchers, and I don’t have strong views about their relative value.
