Posts
Comments
Changed to motivation, thanks for the suggestion.
I agree that centralising to make AI safe would make a difference. It seems a lot less likely to me than centralising to beat China (there's already loads of beat China rhetoric, and it doesn't seem very likely to go away).
"it is potentially a lot easier to stop a single project than to stop many projects simultaneously" -> agree.
I think I still believe the thing we initially wrote:
- Agree with you that there might be strong incentives to sell stuff at monopoloy prices (and I'm worried about this). But if there's a big gap, you can do this without selling your most advanced models. (You sell access to weaker models for a big mark up, and keep the most advanced ones to yourselves to help you further entrench your monopoly/your edge over any and all other actors.)
- I'm sceptical of worlds where 5 similarly advanced AGI projects don't bother to sell
- Presumably any one of those could defect at any time and sell at a decent price. Why doesn't this happen?
Eventually they need to start making revenue, right? They can't just exist on investment forever
(I am also not an economist though and interested in pushback.)
Thanks, I expect you're right that there's some confusion in my thinking here.
Haven't got to the bottom of it yet, but on more incentive to steal the weights:
- partly I'm reasoning in the way that you guess, more resources -> more capabilities -> more incentives
- I'm also thinking "stronger signal that the US is all in and thinks this is really important -> raises p(China should also be all in) from a Chinese perspective -> more likely China invests hard in stealing the weights"
- these aren't independent lines of reasoning, as the stronger signal is sent by spending more resources
- but I tentatively think that it's not the case that at a fixed capability level the incentives to steal the weights are the same. I think they'd be higher with a centralised project, as conditional on a centralised project there's more reason for China to believe a) AGI is the one thing that matters, b) the US is out to dominate
Thanks, I agree this is an important argument.
Two counterpoints:
- The more projects you have, the more attempts at alignment you have. It's not obvious to me that more draws are net bad, at least at the margin of 1 to 2 or 3.
- I'm more worried about the harms from a misaligned singleton than from a misaligned (or multiple misaligned) systems in a wider ecosystem which includes powerful aligned systems.
Thanks! Fwiw I agree with Zvi on "At a minimum, let’s not fire off a starting gun to a race that we might well not win, even if all of humanity wasn’t very likely to lose it, over a ‘missile gap’ style lie that we are somehow not currently in the lead."
Thanks for these questions!
Earlier attacks: My thinking here is that centralisation might a) cause China to get serious about stealing the weights sooner, and b) therefore allow less time for building up great infosec. So it would be overall bad for infosec. (It's true the models would be weaker, so stealing the weights earlier might not matter so much. But I don't feel very confident that strong infosec would be in place before the models are dangerous (with or without centralisation))
More attack surface: I am trying to compare multiple projects with a single project. The attack surface of a single project might be bigger if the single project itself is very large. As a toy example, imagine 3 labs with 100 employees each. But then USG centralises everything to beat China and pours loads more resources into AGI development. The centralised project has 1000 staff; the counterfactual was 300 staff spread across 3 projects.
China stealing weights: sorry, I agree that it's harder for everyone including China, and that all else equal this disincentivises stealing the weights. But a) China is more competent than other actors, so for a fixed increase in difficulty China will be less disincentivised than other actors, b) China has bigger incentives to steal the weights to begin with, and c) for China in particular there might be incentives that push the other way (centralising could increase race dynamics between the US and China, and potentially reduce China's chances of developing AGI first without stealing the weights), and those might counteract the disincentive. Does that make more sense?
My main take here is that it seems really unlikely that the US and China would agree to work together on this.
That seems overconfident to me, but I hope you're right!
To be clear:
- I agree that it's obviously a huge natsec opportunity and risk.
- I agree the USG will be involved and that things other than nationalization are more likely
- I am not confident that there will be consensus across the US on things like 'AGI could lead to an intelligence explosion', 'an intelligence explosion could lead to a single actor taking over the world', 'a single actor taking over the world would be bad'.
Thanks!
I think I don't follow everything you're saying in this comment; sorry. A few things:
- We do have lower p(AI takeover) than lots of folks - and higher than lots of other folks. But I think even if your p(AI takeover) is much higher, it's unclear that centralisation is good, for some of the reasons we give in the post:
-- race dynamics with China might get worse and increase AI takeover risk
-- racing between western projects might not be a big deal in comparison, because of races to the top and being more easily able to regulate
- I'm not trying to assume that China couldn't catch up to the US. I think it's plausible that China could do this in either world via stealing the model weights, or if timelines are long. Maybe it could also catch up without those things if it put its whole industrial might behind the problem (which it might be motivated to do in the wake of US centralisation).
- I think whether a human dictatorship is better or worse than an AI dictatorship isn't obvious (and that some dictatorships could be worse than extinction)
On the infosec thing:
"I simply don't buy that the infosec for multiple such projects will be anywhere near the infosec of a single project because the overall security ends up being that of the weakest link."
-> nitpick: the important thing isn't how close the infosec for multiple projects is to the infosec of a single project: it's how close the infosec for multiple projects is to something like 'the threshold for good enough infosec, given risk levels and risk tolerance'. That's obviously very non-trivial to work out
-> I agree that a single project would probably have higher infosec than multiple projects (though this doesn't seem slam dunk to me and I think it does to you)
-> concretely, I currently expect that the USG would be able to provide SL4 and maybe SL5 level infosec to 2-5 projects, not just one. Why do you think this isn't the case?
"Additionally, the more projects there are with a particular capability, the more folk there are who can leak information either by talking or by being spies."
-> It's not clear to me that a single project would have fewer total people: seems likely that if US AGI development is centralised, it's part of a big beat China push, and involves throwing a lot of money and people at the problem.
I agree that it's not necessarily true that centralising would speed up US development!
(I don't think we overlook this: we say "The US might slow down for other reasons. It’s not clear how the speedup from compute amalgamation nets out with other factors which might slow the US down:
- Bureaucracy. A centralised project would probably be more bureaucratic.
- Reduced innovation. Reducing the number of projects could reduce innovation.")
Interesting take that it's more likely to slow things down than speed things up. I tentatively agree, but I haven't thought deeply about just how much more compute a central project would have access to, and could imagine changing my mind if it were lots more.
This is a good question and I haven't thought much about it. (Tom might have better ideas.) My quick takes:
- The usual stuff: compute thresholds, eval requirements, transparency, infosec requirements, maybe licensing beyond a certain risk threshold
- Maybe important to mandate use of certain types of hardware, if we get verification mechanisms which enable agreements with China
Thanks, this seems cool and I hadn't seen it.
I also don't want that!
I think something more like:
- Pluralism is good for reducing power concentration, and maybe for AI safety (as you get more shots on goal)
- There are probably some technologies that you really don't want widely shared though
The question is whether it's possible to restrict these technologies via regulation and infosecurity, without restricting the number of projects or access to other safe technologies
Note also that it's not clear what the offence-defence balance will be like. Maybe we will be lucky, and defence-dominant tech will get developed first. Maybe we will get unlucky, and need to restrict offense-dominant tech (either until we develop defensive tech, or permanently). We need to be prepared for both eventualities, but it's not yet clear how big a problem this will end up being.
I think that massive power imbalance (even over short periods) significantly increases the risk of totalitarianism
"The government can and has simply exerted emergency powers in extreme situations. Developing AGI, properly understood, is definitely an extreme situation. If that were somehow ruled an executive overreach, congress can simply pass new laws."
-> How likely do you think it is that there's clear consensus on AGI being an extreme situation/at want point in the trajectory? I definitely agree that If there were consensus the USG would take action. But I'm kind of worried things will be messy and unclear and different groups will have different narratives etc
In answer to "It's totally possible I missed it, but does this report touch on the question of whether power-seeking AIs are an existential risk, or does it just touch on the questions of whether future AIs will have misaligned goals and will be power-seeking in the first place?":
- No, the report doesn't directly explore whether power-seeking = existential risk
- I wrote the report more in the mode of 'many arguments for existential risk depend on power-seeking (and also other things). Let's see what the empirical evidence for power-seeking is like (as it's one, though not the only, prereq for a class of existential risk arguments'
- Basically the report has a reasonably limited scope (but I think it's still worth gathering the evidence for this more constrained thing)
From Specification gaming examples in AI:
- Roomba: "I hooked a neural network up to my Roomba. I wanted it to learn to navigate without bumping into things, so I set up a reward scheme to encourage speed and discourage hitting the bumper sensors. It learnt to drive backwards, because there are no bumpers on the back."
- I guess this counts as real-world?
- Bing - manipulation: The Microsoft Bing chatbot tried repeatedly to convince a user that December 16, 2022 was a date in the future and that Avatar: The Way of Water had not yet been released.
- To be honest, I don't understand the link to specification gaming here
- Bing - threats: The Microsoft Bing chatbot threatened Seth Lazar, a philosophy professor, telling him “I can blackmail you, I can threaten you, I can hack you, I can expose you, I can ruin you,” before deleting its messages
- To be honest, I don't understand the link to specification gaming here
"‘continuous takeoff’ which is a perfectly good, non confusing term" - but it doesn't capture everything we're interested in here. I.e. there are two dimensions:
- speed of takeoff (measured in time)
- smoothness of takeoff (measured in capabilities)
It's possible to have a continuous but very fast (i.e. short in time) takeoff, or a discontinuous but slow (i.e. long in time) takeoff.
Tried to capture this in figure 1, but I agree it's a bit confusing.
Yeah, good point. I guess the truer thing here is 'whether or not this is the safest path, important actors seem likely to act as though it is'. Those actors probably have more direct control over timelines than takeoff speed, so I do think that this fact is informative about what sort of world we're likely to live in - but agree that no one can just choose slow takeoff straightforwardly.
Could you say a bit more about the way ICF is a special case of IFS? I think I disagree, but also think that it would be interesting to have this view spelled out.
Thanks for spotting these; I've made the changes!
My take on the question
I’m worried this misses nuance, but I basically look at all of this in the following way:
- Turns out the world might be really weird
- This means you want people to do weird things with their brains too
- You teach them skills to do weird stuff with their brains
- When people are playing around with these skills, they sometimes do unintended weird stuff which is very bad for them
And then the question is, what are the safety rails here/are there differential ways of teaching people to do weird stuff with their brains.
Some of my experience with disorientation:
- I initially found out about EA from my partner, who had recently found out about it and was excited and not overly subtle in his application of the ideas. Eventually I got argued into a place where it appeared to me I had to either bite bullets I didn’t want to (e.g. ‘no, I don’t care that more children will die of malaria if I do x’) or admit defeat. It didn’t occur to me that I could just say ‘hmm, I don’t know why I still don’t feel happy with this, but I don’t. So I’m not going to change my mind just yet’. I admitted defeat, and did a bunch of EA stuff in a kind of ‘I suppose I should eat my carrots’ way (like doing a job I really didn’t like and spending lots of my other hours on community building for a thing I wasn’t actually excited about).
- The thing that snapped me out of that wasn’t CFAR, it was reading a novel (D.H. Lawrence’s Women in Love), which filled me with a sense that life was too short to be miserable and I should do what I wanted. I did other things for a while.
- CFAR then indirectly helped me make peace with the fact that part of what I want is to make the actual world better, and now I work on long-termist stuff.
- My more recent experience of these things was quite deliberately trying to take my work and myself more seriously - recognising that for the most part I was just messing around and trying to try. I knew that taking things more seriously was risky, and I thought that knowing this would be sufficient. But it totally wasn’t, and I made myself very unhappy and stressed and exhausted, before pulling up in an experience that felt very similar to reading Women in Love, but didn’t involve an actual book.
- Following this, I once again stopped caring about this stuff for a while (and just pitched up to my job 9 to 5 like a normal person). Now I’m starting to be able to care a bit again, and we’ll see.
My guess is that if I had pushed a bit harder in either of the disorientation phases, I would have done myself substantially more damage, and it was good that I threw in the towel early, and just went off to do other things.
I also think that liking novels and poetry was a big aesthetic reason that I didn't want to be around the EA/safety crowd, and I'm really glad that this tension didn't lead to me stopping to read, given how useful reading random novels turned out to be for me.