What does success look like?
post by Raymond D · 2025-01-23T17:48:35.618Z
The AI safety movement is currently pursuing many different agendas. Each is easy to motivate with a specific story of how things could naturally go wrong. For example:
- AI control: we lose control of our AIs
- Regulation on model deployment: someone deploys a dangerous model
- AI interpretability: we fail to understand how our AIs are actually reasoning
- Cybersecurity: someone steals model weights, or an AI exfiltrates itself
I would further claim that this is how most people tend to think about each agenda most of the time. If you’re bought in on x-risk, it’s much easier to describe very specific failures than it is to describe very specific success stories.
But in the long term, I think organising our efforts purely around avoiding the losing conditions is a bad strategy. So I believe it’s pretty useful for anyone working on existential risk to at least consider what the success story is.
Win/lose asymmetry
There are many ways to lose and many ways to win, but crucially, we need to avoid all of the paths to failure, whereas we only need to achieve one of the paths to success.
Of course, as a matter of strategy, it’s probably smart to spread your bets, especially when things are so uncertain, and it is important that we consistently avoid failure.
Still, I think it’s easy for individuals to slip into working based on more easily motivated stories about failure, and I think this:
- On net probably leads to a group-level misallocation of energy
- Contributes to an overall doomy/pessimistic attitude
It’s also actually pretty hard to describe and discuss good paths to success.
- They’re much more controversial
  - Talk about melting GPUs and unilateral pivotal acts has probably caused a reasonable amount of alienation
  - Attempts to characterise utopia often involve implicit claims about values, and “what’s good for people”
- By extension, they’re much easier to criticise
- In some ways it’s easier to describe a process than an end goal (e.g. getting AIs to align future AIs and deliberate on values), but then you also need to be sure the process will work
Nonetheless, I think that if humanity does eventually succeed, it’s likely to be because at some point someone had an actual plan for how to succeed, which some people then actually followed, rather than because we just kept dodging mistakes and putting out fires. We can only defer these questions for so long.
Noticing the gaps
I think really engaging with the question of what success looks like is pretty connected with actually noticing the gaps in our current approach.
My impression is that there are a few pretty huge unanswered questions in alignment, including things like:
- Supposing we can align an AI to arbitrary values, what do we even align it to? Is there even a coherent set of extrapolated values we could point an AGI to? (see e.g. here [LW · GW] for some discussion of how this could be hard)
- Supposing we can fully interpret the actual reasoning AGIs are performing, what does a world full of super-competent artificial agents look like, and is it safe for humans?
- Supposing we solve the theoretical and practical problems of control for pre-AGI agents with only a moderate alignment tax [? · GW], what sort of structures need to be in place for that to lead to people actually using AGIs safely and responsibly?
There’s a natural pull towards working ‘under the streetlight’ on the problems that seem easier to solve. I think it’s easy not to notice you’re doing this, and my impression is that the most reliable way to get out of that trap is to have really thought about what it would mean to solve the whole problem.
Robust, agnostic work
It’s also possible to do work that is just generally useful, with that as your goal. For example:
- Capacity-building: It’s helpful to have more smart people doing technical research
- AI interpretability: It seems broadly useful to understand models better
- Upskilling/building a network: It’s good to know influential, experienced people, and to be generally more capable yourself
I am in favour of this. But even here, I think it’s useful to have thought pretty hard at some point about what this is all building up to. Otherwise, there’s a risk that you:
- Stray into still being motivated just by avoiding failure
- Do superficially useful work which ultimately doesn’t end up being helpful
I also think it’s really important to notice the skulls. I believe there are pretty compelling cases that a lot of work on each of the ‘agnostic’ approaches I’ve described above has ended up causing more harm than good: growth that damages the community, dual-use research, and social pressure causing value drift, for example.
So what does success look like?
The goal of this piece is mainly to spur people towards asking this question for themselves, about the work that they’re doing. Nonetheless, I’ll try to give some examples that currently seem salient to me:
- Scalable oversight: We build a way to make recursively improving AIs which reliably inherit alignment/control, eventually giving us extremely powerful safe AIs
- Differential AI progress: We try to differentially advance the extent to which AI enables things like coordination and forecasting, such that humanity overall is well-equipped to handle any emergent threats (see e.g. here [LW · GW] and here [LW · GW])
- A permanent pause: We find a way to simply halt all future AI progress (I think this would be a pretty meagre success, but it’s conceivably the least bad outcome)
I think all of these proposals have serious challenges and need a lot more work. And I would really like for that work to happen.