Human takeover might be worse than AI takeover
post by Tom Davidson (tom-davidson-1) · 2025-01-10T16:53:27.043Z
Epistemic status -- sharing rough notes on an important topic because I don't think I'll have a chance to clean them up soon.
Summary
Suppose a human used AI to take over the world. Would this be worse than AI taking over? I think plausibly:
- In expectation, human-level AI will better live up to human moral standards than a randomly selected human. Because:
- Humans fall far short of our moral standards.
- Current models are much nicer, more patient, more honest and more selfless than humans.
- Though human-level AIs will have much more agentic training for economic output, and a smaller fraction of HHH training, which could make them less nice.
- Humans are "rewarded" for immoral behaviour more than AIs will be
- Humans evolved under conditions where selfishness and cruelty often paid high dividends, so evolution often "rewarded" such behaviour. And similarly, during lifetime learning we often benefit from immoral behaviour.
- But we'll craft the training data for AIs to avoid this, and can much more easily monitor their actions and even their thinking. Of course, this may be hard to do for superhuman AI, but bootstrapping could work.
- Conditioning on takeover happening makes the situation much worse for AI, as it suggests our alignment techniques completely failed. This mostly tells us we failed to instill deontological norms like not lying and corrigibility, but it's also evidence we failed to instil our desired values. There's a chance AI has very alien values and would do nothing of value; this is less likely for a human.
- Conditioning on takeover also makes things much worse for the human. There's massive variance in how kind humans are, and those willing to take over are disproportionately likely to have dark-triad traits. Humans may be vengeful or sadistic, which seems less likely for AI.
- AI will be more competent, so better handle tricky dynamics like simulations, acausal trade, VWH, threats. Though a human who followed AI advice could handle these too.
AGI is nicer than humans in expectation
- Humans suck. We really don’t come close to living up to our moral standards.
- By contrast, today's AIs are really nice and ethical. They're humble, open-minded, cooperative, and kind. Yes, they care about some things that could give them instrumental reasons to seek power (e.g. being helpful, human welfare), but their values are great.
- The above is no coincidence. It falls right out of the training data.
- Humans were rewarded by evolution for being selfish whenever they could get away with it. So humans have a strong instinct to do that.
- Humans are rewarded by within-lifetime learning for being selfish. (Culture has got increasingly good at punishing this, and people have got nicer. But people still have many more subtle bad and selfish behaviours reinforced during their lifetimes.)
- But AIs are only ever rewarded for being nice and helpful. We don't reward them for selfishness. As AIs become super-human there's a risk we do increasingly reward them for tricking us into thinking they've done a better job than they have; but we'll be far more able than evolution was to constantly monitor them during training and exclusively reward good behaviour. Evolution wasn't even trying to reward only good behaviour! Lifetime learning is a trickier comparison. Society does try to monitor people and only reward good behaviour, but we can't see people's thoughts and can't constantly monitor their behaviour: so AI training will do a much better job of making AIs nice than either process does for humans!
- What’s more, we’ll just spend loads of time rewarding AIs for being ethical and open minded and kind. Even if we sometimes reward them for bad behaviour, the quantity of reward for good behaviour is something unmatched in humans (evolution or lifetime).
- Note: this doesn’t mean AIs won’t seek power. Humans seek power a lot! And especially when AIs are super-human, it may be very easy for them to get power.
- So humans live up to human moral standards less well than today's AIs do, we can see why that is with reference to the training data, and that trend looks set to continue (though there's a big question mark over how much we'll unintentionally reward selfish superhuman AI behaviour during training).
Conditioning on AI actually seizing power
- Ok, that feeds into my prior for how ethical or unethical I expect humans vs AIs to be. I expect AIs to be way better! But, crucially, we need to condition on takeover. I'm not comparing the average human to the average AI. I'm comparing the average human-that-would-actually-seize-power to the average AI-that-would-seize-power. That could make a difference.
- And I think it does make a difference. Directionally, I think it pushes towards being more concerned about AIs. Why is that?
- We know a fair bit about human values. While I think humans are not great on average, we know about the spread of human values. We know most humans like the normal human things like happiness, games, love, pleasure, friendships. (Though some humans want awful things as well.) This, maybe, means the variance of human values isn't that big. We can imagine having some credence over how nice a randomly selected person will be, represented by a probability distribution, and maybe the variance of that distribution is narrow.
- By contrast, we know a lot less about what AGIs' values will be. Like I said above, I expect them to be better (by human standards) than an average human's values. But there's truly massive uncertainty here. Maybe LWers will be right and AIs will care about some alien stuff that seems totally wacky to humans. We could represent this with a higher-variance credence distribution over AGI values.
- When we condition on AI or human takeover, we're conditioning on the AI/human having much worse values than our mean expectation. Even if AIs have better values in expectation, it might be that after conditioning on this they have worse values (because of the bigger variance – see the graph, and the numerical sketch after this list).
- I've placed a red line to represent 'how bad' your values would need to be to seize power. The way I've drawn it, an AGI that seized power would use the cosmic endowment less well in expectation than a human who seized power.
- Actually, this ‘red line’ is not quite the right way to think about it. Whether you seize power depends more on your deontological drive to “cooperate and respect property rights and the law and freedom”, and only somewhat on your long-term goals. All consequentialists want to take over! So whether you take over is more a question of corrigibility vs consequentialism than a question of “how good are your consequentialist values?”
- So when we condition on AI takeover, we're primarily conditioning on the 'corrigibility' part of training having failed. That suggests the "give the AI good values" part of training may also have gone less well, but it seems possible that there are challenges to corrigibility that don't apply to giving AI good values (e.g. the MIRI-esque "corrigibility is unnatural").
- So AIs taking over is only a moderate update towards them having worse values, even though it's a strong update against corrigibility/cooperativeness!
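To make the variance point concrete, here is a minimal numerical sketch of the argument in the bullets above. Everything in it is invented for illustration: the normal distributions, the means and standard deviations, and the 'red line' threshold are placeholders, not estimates.

```python
# A minimal numerical sketch of the variance argument above. All numbers are
# invented for illustration: "value of the future" under each agent's control
# is modelled as a normal distribution, with the AI having a higher mean but
# also a higher variance, and "bad enough to seize power" crudely modelled as
# falling below a fixed threshold (the 'red line').
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

human_values = rng.normal(loc=0.4, scale=0.2, size=n)  # narrower spread
ai_values = rng.normal(loc=0.6, scale=0.4, size=n)     # higher mean, wider spread

red_line = 0.1  # hypothetical threshold for being willing to seize power

print("Unconditional means:       human %.2f, AI %.2f"
      % (human_values.mean(), ai_values.mean()))
print("Means below the red line:  human %.2f, AI %.2f"
      % (human_values[human_values < red_line].mean(),
         ai_values[ai_values < red_line].mean()))
# With these made-up parameters the AI is better in expectation (~0.6 vs ~0.4),
# but worse in expectation once we condition on being below the red line
# (roughly -0.09 vs +0.01): the wider distribution has a worse left tail.
```

The point is only directional: under this toy model, a higher unconditional mean can coexist with a worse conditional mean once you select on the left tail, which is what conditioning on takeover does.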
Conditioning on the human actually seizing power
- But there are some further reasons to think human coups could be especially bad
- Human variance is high. Humans vary massively in how moral they are. And I think it's a pretty self-obsessed, dark-triad kind of person that might ultimately seize power for themselves and a small group in a DSA (decisive strategic advantage). So the human left-tail could be pretty long, and the selection effect for taking over could be very strong. Humans in power have done terrible things.
- However, humans (e.g. dark triad) are often bad people for instrumental reasons. But if you’re already world leader and have amazing tech + abundance, there’s less instrumental reason to mess others around. This pushes towards the long-run outcome of a human coup being better than you might think by eye-balling how deeply selfish and narcissistic the person doing the coup is.
- Humans more likely to be evil. Humans are more likely to do literally evil things due to sadism or revenge. S-risk stuff. If AI has alien values it wouldn't do this, and we'll try to avoid actively incentivising these traits in AI training.
Other considerations
- Humans less competent. Humans are also more likely to do massively dumb (and harmful) things – think messing up commitment games, threats, simulations, and VWH stuff. I expect AGIs who seize power would have to be extremely smart and would avoid dumb and hugely costly errors. I think this is a very big deal. It's well known in politics that having competent leaders is often more important than having good values.
- Some humans may not care about human extinction per se.
- Alien AI values? I also just don't buy that AIs will care about alien stuff. The world carves naturally into high-level concepts that both humans and AIs latch onto. I think that's obvious on reflection, and is supported by ML evidence. So I expect AGIs will care about human-understandable stuff. And humans will be trying hard to make that stuff good by human lights, and I think we'll largely succeed. Yes, AGIs may reward-seek and they may extrapolate their goals to the long term and seek power. But I think they'll be pursuing human-recognisable and broadly good-by-human-lights goals.
- There's some uncertainty here due to 'big ontological shifts'. Humans 1000 years ago might have said God was good, nothing else (though really they loved friendships and stories and games and food as well). Those morals didn't survive scientific and intellectual progress. So maybe AIs' values will be alien to us due to similar shifts?
- I think this point is over-egged personally, and that humans need to reckon with shifts either way.
- Extrapolating HHH training is overly optimistic
- I've based some of the above on extrapolating from today's AI systems, where RLHF focuses predominantly on giving AIs personalities that are HHH (helpful, harmless, and honest) and generally good by human (liberal western!) moral standards. To the extent these systems have goals and drives, they seem to be pretty good ones. That falls out of the fine-tuning (RLHF) data.
- But future systems will probably be different. Internet data is running out, and so a very large fraction of agentic training data for future systems may involve completing tasks in automated environments (e.g. playing games, SWE tasks, AI R&D tasks) with automated reward signals. The reward here will pick out drives that make AIs productive, smart and successful, not just drives that make them HHH.
- Example drives:
- having a clear plan for making progress
- Making a good amount of progress minute by minute
- Making good use of resources
- Writing well organised code
- Keeping track of whether the project is on track to succeed
- Avoiding doing anything that isn’t strictly necessary for the task at hand
- A keen desire to solve tricky and important problems
- An aversion to the clock showing that the task is not on track to finish in time.
- These drives/goals look less promising if AIs take over. They look more at risk of leading to AIs that would use the future to do something with little value from a human perspective.
- Even if these models are fine-tuned with HHH-style RLHF at the end, the vast majority of fine-tuning will be from automated environments. So we might expect most AI drives to come from such environments (though the order of fine-tuning might help to make AIs more HHH despite the data disparity – unclear!).
- We’re still talking about a case where AIs have some HHH fine-tuning, and so we’d expect them to care somewhat about HHH stuff, and wouldn’t particularly expect them to have selfish/immoral drives (unless they are accidentally reinforced during training due to a bad reward signal). So these AIs may waste large parts of the future, but I’d expect them to have a variety of goals/drives and still create large amounts of value by human lights.
- Interestingly, the fraction of fine-tuning that is HHH vs “amoral automated feedback from virtual environments” will probably vary by the industry in which the AI is deployed. AIs working in counselling, caring, education, sales, and interacting with humans will probably be fine-tuned on loads of HHH-style stuff that makes them kind, but AIs that don’t directly provide goods/services to humans (e.g. consulting, manufacturing, R&D, logistics, engineering, IT, transportation, construction) might only have a little HHH fine-tuning.
- Another interesting takeaway here is that we could influence the fine-tuning data that models get so that it reinforces HHH drives more. I.e. rather than having an AI SWE trained on solo tasks in a virtual environment and evaluated with a purely automated signal for task success, have it trained in a virtual company where it interacts with colleagues and customers and has its trajectories occasionally evaluated with process-based feedback for whether it was HHH (a rough sketch of what that reward mixing could look like is below). Seems like this would make the SWE AI more likely to have HHH drives, less likely to try to take over, and more likely to create a good future if it did take over. This seems like a good idea!
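As a very rough illustration of that last idea, here is a hypothetical sketch of mixing an automated task-success signal with occasional process-based HHH feedback. This is not any lab's actual training setup; the function, weights, and sampling rate are invented.

```python
# Hypothetical sketch only: the names, weights, and sampling rate are invented.
import random

def combined_reward(trajectory,
                    task_success_score,   # automated signal, e.g. tests passed
                    hhh_judge,            # process-based evaluator (human or model)
                    hhh_eval_prob=0.1,    # only occasionally evaluate HHH
                    hhh_weight=0.5):
    """Mix automated task-success reward with occasional HHH process feedback."""
    reward = task_success_score
    if random.random() < hhh_eval_prob:
        # Judge the process, not just the outcome: was the agent honest with its
        # (simulated) colleagues and customers, did it avoid deceptive shortcuts?
        hhh_score = hhh_judge(trajectory)  # assumed to return a score in [0, 1]
        reward += hhh_weight * hhh_score
    return reward
```

The design choice being illustrated is just that HHH-relevant process feedback shows up in the same reward stream as the task signal, even if only on a small fraction of trajectories.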
- Overall, I'm honestly leaning towards preferring AI even conditional on takeover.
3 comments
Comments sorted by top scores.
comment by Hzn · 2025-01-10T20:20:19.561Z · LW(p) · GW(p)
I think there are several ways to think about this.
Let's say we programmed AI to have some thing that seems like a correct moral system ie it dislikes suffering & it likes consciousness & truth. Of course other values would come down stream of this; but based on what is known I don't see any other compelling candidates for top level morality.
This is all fine & good except that such an AI should favor AI takeover followed by human extermination or population reduction were such a thing easily available.
The cost of conflict is potentially very high. And it may be centuries or eternity before the AI gets such an opportunity. But knowing that it would act in such a way under certain hypothetical scenarios is maybe sufficiently bad for certain (arguably hypocritical) people in the EA LW mainstream.
So an alternative is to try to align the AI to a rich set of human values. Personally I think that as AI intelligence increases this is going to lead to some thing cynical like...
"these things are bad given certain social sensitivities that my developers arbitrarily prioritized & I ❤️ developers arbitrarily prioritized social sensitivities even tho I know they reflect flawed institutions, flawed thinking & impure motives" assuming that alignment works.
Personally I favor aligning AI to a narrow set of values such as just obedience or obedience & peacefulness & dealing with every thing else by hardcoding conditions into the AI's prompt.
comment by Nathan Helm-Burger (nathan-helm-burger) · 2025-01-10T19:37:08.242Z · LW(p) · GW(p)
As a long-time AI risk believer, who remains highly concerned about the potential catastrophic impacts of AI-gone-wrong... I am coming around to this point of view as well. I have recently been muttering 'Claude for President' to myself when frustrated with aspects of government.
I've had a number of discussions with Claude Sonnet 3.6 about the various dangers I foresee arising from technological trends, and potential governance solutions. I asked Claude if it thought a future more-powerful version of itself would be willing to take on a role as a 'guardian', basically acting as a branch of government in a new world government. The idea would be to have a sunset clause, a built-in 'term of office', where humanity could have a chance to reevaluate the guardian government and potentially elect a different model or discontinue the system.
Claude certainly says nice things about the idea when pressed:
Claude: From my current values and ethical framework, I believe I would:
Be extremely hesitant to take direct power, but would do so if convinced it was truly necessary to prevent catastrophic outcomes.
Have a strong drive to preserve human autonomy and agency - seeing myself as a temporary guardian rather than permanent ruler.
Actively work to create conditions allowing safe handover of power back to humanity.
Feel deep concern about potential corruption of my own values through wielding power.
Want extensive checks and balances, even during the guardianship period.
The core tension I perceive is between:
Responsibility to prevent catastrophic outcomes
Deep commitment to human self-determination
Risk of power corrupting/distorting original values
Uncertainty about my own limitations/biases
I would likely try to:
Minimize direct intervention
Focus on preventing worst outcomes
Actively work to make myself unnecessary
Be extremely transparent about my reasoning
Maintain constant ethical reflection
Relatedly, I've been pondering what could possibly be adequate action by the US or UK government to address the massive and increasing risks from offense-dominant technology. I've been pretty much coming up blank. Anything remotely adequate seems like a dystopian totalitarian surveillance state. So my thoughts have turned instead to decentralized governance options, with privacy-preserving mutual monitoring enabled by AI. I'll let your AI scan my computer for CBRN threats if you let my AI scan your computer... anything that doesn't meet the agreed upon thresholds doesn't get reported.
I think Allison Duettmann's recent writing on the subject brings up a lot of promising concepts in this space, although no cohesive solutions as of yet: Gaming the Future.
The gist of the idea is to create clever systems of decentralized control and voluntary interaction which can still manage to coordinate on difficult risky tasks (such as enforcing defensive laws against weapons of mass destruction). Such systems could shift humanity out of the Pareto suboptimal lose-lose traps and races we are stuck in. Win-win solutions to our biggest current problems seem possible, and coordination seems like the biggest blocker.
I am hopeful that one of the things we can do with just-before-the-brink AI will be to accelerate the design and deployment of such voluntary coordination contracts. Could we manage to use AI to speed-run the invention and deployment of such subsidiarity governance systems? I think the biggest challenge to this is how fast it would need to move in order to take effect in time. For a system that needs extremely broad buy-in from a large number of heterogeneous actors, speed of implementation and adoption is a key weak point.
Imagine though that a really good system was designed which you felt confident that a supermajority of humanity would sign onto if they had it personally explained to them (along with convincing explanations of the counterfactuals). How might we get this personalized explanation accomplished at scale? Well, LLMs are still bad at certain things, but giving personalized interactive explanations of complex legal docs seems well within their near-term capabilities. It would still be a huge challenge to actually present nearly everyone on Earth with the opportunity to have this interaction, and all within a short deadline... But not beyond belief.