Discovering alignment windfalls reduces AI risk

post by goodgravy, stuhlmueller · 2024-02-28T21:23:27.876Z

This is a link post for https://blog.elicit.com/alignment-windfalls/

Contents

    Alignment taxes
    Alignment windfalls
  Companies as optimisers
    Shaping the landscape
    Companies greedily optimise within the known landscape
      Shaping knowledge of the landscape
  Factored cognition: an example of an alignment windfall
      Factored cognition as a windfall
      How we’ve been exploring factored cognition at Elicit
  Conclusion
    Want to help us?

Some approaches to AI alignment incur upfront costs to the creator (an “alignment tax”). In this post, I discuss “alignment windfalls”: strategies that serve the long-term public good while also delivering short-term benefits to a company.

My argument, in short:

  1. Just as there are alignment taxes, there are alignment windfalls.
  2. AI companies optimise within their known landscape of alignment taxes & windfalls.
  3. We can change what AI companies do by:
    1. Shaping the landscape of taxes and windfalls
    2. Shaping their knowledge of that landscape
  4. By discovering and advocating for alignment windfalls, we reduce AI risk overall because it becomes easier for companies to adopt more alignable approaches.

Alignment taxes

An “alignment tax” refers to the reduced performance, increased expense, or elongated timeline required to develop and deploy an aligned system compared to a merely useful one.

More specifically, let’s say an alignment tax is an investment that a company expects to help with the alignment of transformative AI, but that has a net negative impact on the company's bottom line over the next 3-12 months.

A few examples, from most to least concrete:

All of these require more investment than the less aligned baseline comparison, and companies will face hard decisions about which to pursue.

Alignment windfalls

On the other hand, there are some ideas and businesses where progress on AI safety is intrinsically linked to value creation. 

More specifically, let’s say an alignment windfall is an investment that a company expects to help with the alignment of transformative AI, and that also has a net positive impact on the company's bottom line over the next 3-12 months.

For example:

In practice, almost all ideas will have some costs and some benefits: finding ways to shape the economic environment so that they look more like windfalls is key to getting them implemented.
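To make the two definitions concrete, here is a minimal Python sketch that labels an investment as a tax or a windfall. The `Investment` class, its field names, and the example entries with their numbers are all hypothetical, invented purely for illustration; only the classification rule comes from the definitions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Investment:
    name: str
    helps_alignment: bool     # does the company expect it to help align transformative AI?
    net_impact_3_12mo: float  # expected bottom-line effect over 3-12 months (arbitrary units)


def classify(inv: Investment) -> str:
    """Label an investment using the post's definitions of tax and windfall."""
    if not inv.helps_alignment:
        return "neither"
    if inv.net_impact_3_12mo > 0:
        return "windfall"
    if inv.net_impact_3_12mo < 0:
        return "tax"
    return "neutral"


# Hypothetical entries with made-up numbers, for illustration only.
examples = [
    Investment("costly safety measure", True, -2.0),
    Investment("safety feature customers pay for", True, 3.0),
    Investment("pure capability push", False, 5.0),
]
labels = {inv.name: classify(inv) for inv in examples}
print(labels)
```

Under this framing, shaping the economic environment amounts to shifting `net_impact_3_12mo` for an alignment-helping investment from negative to positive.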

Companies as optimisers

Startup companies are among the best machines we've invented to create economic value through technological efficiency.

Two drivers behind why startups create such an outsized economic impact are:

  1. Lots of shots on goal. The vast majority of startups fail: perhaps 90% die completely and only 1.5% get to a solid outcome. As a sector, startups take a scattergun approach: each individual company is likely doomed, but the outsized upside for the lucky few means that many optimists are still willing to give it a go.
  2. Risk-taking behaviour. Startups thrive in legal and normative grey areas, where larger companies are constrained by their brand reputation, partnerships, or lack of appetite for regulatory risk.

In this way, the startup sector is effectively searching over our legal and ethical landscape for unexploited money-making ideas. They're looking for the most efficient way to create value, and they're willing to take risks to do so.

This optimisation pressure will be especially strong for artificial intelligence, because the upside for organisations leading the AGI race is gigantic. For startups and incumbents alike, there is enormous pressure against anything standing in the way: alignment taxes start to look a lot like inefficiencies to be eliminated, and alignment windfalls become very appealing.

Shaping the landscape

The technical approaches which lead to taxes and windfalls lie on a landscape that can be shaped in a few ways:

Regulation can levy taxes on unaligned approaches:

Public awareness can cause windfalls for aligned approaches:

Recruiting top talent is easier for safety-oriented companies:

In practice, there is often interplay between these. For example, regulation can help the public appreciate the dangers of products, and a positive public company profile is a boon for recruiting.

Companies greedily optimise within the known landscape

An important nuance with the above model is that companies don’t optimise within the true landscape: they optimise within the landscape they can access. Founders don’t know the truth about where all the windfalls lie: they have to choose what to build based on the set of possible approaches within reach.
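A toy Python sketch of this point, under heavy assumptions: the approach names and return figures below are invented, and a real founder's decision is of course not a single `max` call. It only illustrates that a greedy optimiser can pick an aligned approach only once that approach is inside the set it knows about.

```python
def greedy_choice(returns: dict[str, float], known: set[str]) -> str:
    """Pick the highest-return approach among those the founder knows about."""
    return max(known, key=lambda name: returns[name])


# Hypothetical "true landscape": expected short-term return of each approach.
true_landscape = {
    "scale up, patch later": 5.0,
    "aligned approach with a tax": 3.0,
    "aligned approach with a windfall": 6.0,
}

# The founder initially knows only part of the landscape.
known = {"scale up, patch later", "aligned approach with a tax"}
before = greedy_choice(true_landscape, known)

# Someone discovers and publicises the windfall; the landscape itself is unchanged.
known_after = known | {"aligned approach with a windfall"}
after = greedy_choice(true_landscape, known_after)

print(before, "->", after)
```

The true landscape is identical in both calls; only the known set differs, which is the lever that shaping knowledge of the landscape pulls.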

Here are a couple of reasons why the full landscape tends to be poorly known to AI startup founders:

In contrast, researchers in academia have much more latitude to explore completely unproven ideas lacking any clear path to practical application—and the expectation is that those results will be published publicly too. Unfortunately, because of the data and compute requirements associated with modern machine learning, it has become hard for aspiring researchers to do many forms of groundbreaking work outside of an industrial setting: they’re embedded in these myopic and secretive organisations.

Shaping knowledge of the landscape

What does it look like to shape the broader knowledge of this landscape?

Factored cognition: an example of an alignment windfall

Let’s consider a more detailed example. Elicit has been exploring one part of the landscape of taxes & windfalls, with a focus on factored cognition.[5]

Factored cognition as a windfall

Since the deep learning revolution, most progress on AI capability has been due to some combination of:

Normally, we do all three at the same time. We have basically thrown more and more raw material at the models, then poked them with RLHF until it seems sufficiently difficult to get them to be obviously dangerous. This is an inherently fragile scheme, and there are strong incentives to cut corners on the “now make it safe” phase.

Factored cognition is an alternative paradigm which offers a different path. Instead of solving harder problems with bigger and bigger models, we decompose the problem into a set of smaller, more tractable problems. Each of these smaller problems is solved independently and their solutions combined to produce a final result. In cases where factored cognition isn't great for generating a result, we can factor a verification process instead. Either way, we aim to keep the component models small and the tasks self-contained and supervisable.
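The decompose, solve, and combine pattern described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not Elicit's implementation: `factored_solve` and its parameters are hypothetical names, and a plain word-counting function stands in for the small component model.

```python
from typing import Callable, Sequence, TypeVar

P = TypeVar("P")  # problem type
S = TypeVar("S")  # sub-solution type
R = TypeVar("R")  # final result type


def factored_solve(
    problem: P,
    decompose: Callable[[P], Sequence[P]],
    solve_small: Callable[[P], S],
    combine: Callable[[Sequence[S]], R],
) -> tuple[R, list[tuple[P, S]]]:
    """Split a problem, solve each piece independently, then combine.

    Also returns a trace of (subproblem, sub-solution) pairs so that
    every intermediate step can be inspected and supervised.
    """
    subproblems = decompose(problem)
    partials = [solve_small(sub) for sub in subproblems]
    trace = list(zip(subproblems, partials))
    return combine(partials), trace


# Toy task: count the words in a text, one sentence at a time.
text = "Models are large. Tasks are small. Each step is checkable."

total, trace = factored_solve(
    text,
    decompose=lambda t: [s.strip() for s in t.split(".") if s.strip()],
    solve_small=lambda sentence: len(sentence.split()),  # stand-in for a small model
    combine=sum,
)

for sub, partial in trace:
    print(f"{sub!r}: {partial}")
print("total:", total)
```

The trace is the point of the design: because each sub-task is self-contained, a supervisor can check any single step without having to understand the whole computation.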

How we’ve been exploring factored cognition at Elicit

Elicit, our AI research assistant, is built using factored cognition: we decompose common research tasks into a sequence of steps, using gold standard processes like systematic reviews as a guide.

For our users, accuracy is absolutely crucial. They are pinning their reputation on the claims that they make, and therefore something which merely sounds plausible is nowhere near good enough. We need to earn their trust through solid epistemic foundations.

For Elicit, creating a valuable product is the same thing as building a truthful, transparent system. We don't have some people building an AI tool and also some people figuring out how to make it reliable. Trustworthiness is our value proposition.

Conclusion

Let's find and promote alignment windfalls!

Some of the top AI labs seem to be careful and deliberative actors, but as more competitors enter the race for AGI the pressure to avoid hindrance will increase. Future competitors may well be less cautious and even explicitly reject safety-related slow-downs.

If this is the world we're heading towards, AI safety measures which impose significant alignment taxes are at risk of being avoided. To improve outcomes in that world, we should discover and promote alignment windfalls, by which I mean mechanisms that harness the awesome efficiency of markets to create aligned AI systems.

I'm a proponent of other approaches, such as regulation, to guide us towards safe AI, but in high-stakes situations like this my mind turns to the Swiss cheese model used to reduce clinical accidents. We shouldn't hope for a panacea, which probably doesn't exist in any case. We need many independent layers of defence, each with its own strengths and (hopefully non-overlapping) weaknesses.

In my view, Elicit is the best example of an alignment windfall that we have today. To have maximum impact, we need to show that factored cognition is a powerful approach for building high-stakes ML systems. Elicit will be a compelling existence proof: an example which we hope other people will copy out of their own self-interest and—as such—make AI safer for everyone.

Want to help us?

We are building the best possible team to make this happen—you can see our open roles here!

Many thanks to @jungofthewon, Étienne Fortier-Dubois, @Owain_Evans, and @brachbach for comments on an early draft.


  1. ^
  2. ^

    Jan Leike estimated RLHF accounted for 5–20% of the total cost of GPT-3 Instruct, factoring in things like engineering work, hiring labelers, acquiring compute, and researcher effort.

  3. ^

    Of course, it’s far from a complete solution to alignment and might be more akin to putting lipstick on a shoggoth than deeply rewiring the model’s concepts and capabilities.

  4. ^

    But also: Some have noted that interpretability could be harmful if our increased understanding of the internals of neural networks leads to capability gains. Mechanistic interpretability would give us a tight feedback loop with which to better direct our search for algorithmic and training setup improvements.

  5. ^

    We expect to contribute to reducing AI risk by improving epistemics, but highlighting alignment windfalls from factored cognition is important to us as well.

1 comment

comment by stuhlmueller · 2024-02-28T23:35:48.405Z

Another potential windfall I just thought of: the kind of AI scientist system discussed by Bengio in this talk (older writeup). The idea is to build a non-agentic system that uses foundation models and amortized Bayesian inference to create and do inference on compositional and interpretable world models. One way this would be used is for high-quality estimates of p(harm|action) in the context of online monitoring of AI systems, but if it could work it would likely have other profitable use cases as well.