Posts
Comments
I don't have strong takes on what exactly is happening in this particular case but I agree that companies (and more generally, people at high-pressure positions) are very frequently doing the kind of thing you describe. I don't think we have an indication that this would not be valid for leading AI labs as well.
Re the o1 AIME accuracy at test time scaling graphs: I think it's crucial to understand that the test-time compute x-axis is likely wildly different from the train-time compute x-axis. You can throw 10s-100s of millions of dollars at train-time compute and still run a company. You can't do the same for test-time compute each calculation again. The scale at which test-time compute happens on a per-call basis, and can happen to keep things anywhere near commercial viability, needs to be perhaps eight OOMs below train-time compute. Calling anything happening there a "scaling law" is a stretch of the term (very helpful for fundraising) and at best valid very locally.
If RL is actually happening at a compute scale beyond 10s of millions of dollars, and this gives much better final results than doing the same at a smaller scale, that would change my mind. Until then, I think scaling in any meaningful sense of the word is not what drives capabilities forward at the moment, but algorithmic improvement is. And this is not just coming from the currently leading labs. (Which can be seen e.g. here and here).
Thanks for the offer, we'll do that!
Not publicly, yet. We're working on a paper providing more details about the conditional AI safety treaty. We'll probably also write a post about it on lesswrong when that's ready.
I'm aware and I don't disagree. However, in xrisk, many (not all) of those who are most worried are also most bullish about capabilities. Reversely, many (not all) who are not worried are unimpressed with capabilities. Being aware of the concept of AGI, that it may be coming soon, and of how impactful it could be, is in practice often a first step towards becoming concerned about the risks, too. This is not true for everyone unfortunately. Still, I would say that at least for our chances to get an international treaty passed, it is perhaps hopeful that the power of AGI is on the radar of leading politicians (although this may also increase risk through other paths).
The recordings of our event are now online!
My current main cruxes:
- Will AI get takeover capability? When?
- Single ASI or many AGIs?
- Will we solve technical alignment?
- Value alignment, intent alignment, or CEV?
- Defense>offense or offense>defense?
- Is a long-term pause achievable?
If there is reasonable consensus on any one of those, I'd much appreciate to know about it. Else, I think these should be research priorities.
When we decided to attach moral weight to consciousness, did we have a comparable definition of what consciousness means or was it very different?
AI takeovers are probably a rich field. There are partial and full takeovers, reversible and irreversible takeovers, aligned and unaligned ones. While to me all takeovers seem bad, some could be a lot worse than others. Thinking out specific ways to take over could provide clues on how to increase chances that this does not happen. In comms as well, takeovers are a neglected and important subtopic.
I updated a bit after reading all the comments. It seems that Christiano's threat model, or in any case the threat model of most others who interpret his writing, seems to be about more powerful AIs than I initially thought. The AIs would already be superhuman, but for whatever reason, a takeover has not occured yet. Also, we would apply them in many powerful positions (heads of state, CEOs, etc.)
I agree that if we end up in this scenario, all the AIs working together could potentially cause human extinction, either deliberately (as some commenters think) or as a side-effect (as others think).
I still don't think that this is likely to cause human extinction, though, mostly for the following reasons:
- I don't think these AIs would _all_ act against human interest. We would employ a CEO AI, but then also a journalist AI to criticize the CEO AI. If the CEO AI would decide to let their factory consume oxygen to such an extent that humanity would suffer from it, that's a great story for the journalist AI. Then, a policymaker AI would make policy against this. More generally: I think it's a significant mistake in the WFLL threat models that the AI actions are assumed to be correlated towards human extinction. If we humans deliberately put AIs in charge of important parts of our society, they will be good at running their shop but as misaligned to each other (thereby keeping a power balance) as humans currently are. I think this power balance is crucial and may very well prevent things going very wrong. Even in a situation of distributional shift, I think the power balance is likely robust enough to prevent an outcome as bad as human extinction. Currently, some humans job is to make sure things don't go very wrong. If we automate them, we will have AIs trying to do the same. (And since we deliberately put them at this position, they will be aligned with humans' interests, as opposed to us being aligned with chimpanzee interest.)
- This is a very gradual process, where many steps need to be taken: AGI must be invented, trained, pass tests, be marketed, be deployed, likely face regulation, be adjusted, be deployed again. During all those steps, we have opportunities to do something about any threats that turn out to exist. This threat model can be regulated in a trial-and-error fashion, which humans are good at and our institutions accustomed to (as opposed to the Yudkowsky/Bostrom threat model).
- Given that current public existential risk awareness, according to our research, is already ~19%, and given that existential risk concern and awareness levels tend to follow tech capability, I think awareness of this threat will be near-universal before it could happen. At that moment, I think we will very likely regulate existentially dangerous use cases.
In terms of solutions:
- I still don't see how solving the technical part of the alignment problem (making an AI reliably do what anyone wants) contributes to reducing this threat model. If AI cannot reliably do what anyone wants, it will not be deployed at a powerful position, and therefore this model will not get a chance to occur. In fact, working on technical alignment will enormously increase the chance that AI will be employed at powerful positions, and will therefore increase existential risk as caused by the WFLL threat model (although, depending on pivotal act and offense/defence balance, solving alignment may decrease existential risk due to the Yudkowsky/Bostrom takeover model).
- An exception to this could be to make an AI reliably do what 'humanity wants' (using some preference aggregation method), and making it auto-adjust for shifting goals and circumstances. I can see how such work reduces this risk.
- I still think traditional policy, after technology invention and at the point of application (similar to e.g. the EU AI Act) is the most useful regulation to reduce this threat model. Specific regulation at training could be useful, but does not seem strictly required for this threat model (as opposed to in the Yudkowsky/Bostrom takeover model).
- If one wants to reduce this risk, I think increasing public awareness is crucial. High risk awareness should enormously increase public pressure to either not deploy AI at powerful positions at all, or demanding very strong, long-term, and robust alignment guarantees, which would all reduce risk.
In terms of timing, although likely net positive, it doesn't seem to be absolutely crucial to me to work on reducing this threat model's probability right now. Once we actually have AGI, including situational awareness, long-term planning, an adaptable world model, and agentic actions (which could still take a long time), we are likely still in time to regulate use cases (again as opposed to in the Yudkowsky/Bostrom takeover model, where we need to regulate/align/pause ahead of training).
After my update, I still think the chance this threat model leads to an existential event is small and work on it is not super urgent. However, I'm less confident now to make an upper bound risk estimate.
Thanks for engaging. I think AIs will coordinate, but only insofar their separate, different goals are helped by it. It's not that I think AIs will be less capable in coordination per se. I'd expect that an AGI should be able to coordinate with us at least as well as we can, and coordinate with another AGI possibly better. But my point is that not all AI interests will be parallel, far from it. They will be as diverse as our interests, which are very diverse. Therefore, I think not all AIs will work together to disempower humans. If an AI or AI-led team tries to do that, many other AI-led and all human-led teams will likely resist, since they are likely more aligned with the status quo than with the AI trying to take over. That makes takeover a lot less likely, even in a world soaked with AIs. It also makes human extinction as a side effect less likely, since lots of human-led and AI-led teams will try to prevent this.
Still, I do think an AI-led takeover is a risk, or human extinction as a side effect if AI-led teams are way more powerful. I think partial bans after development at the point of application is most promising as a solution direction.
Thanks for engaging kindly. I'm more positive than you are about us being able to ban use cases, especially if existential risk awareness (and awareness of this particular threat model) is high. Currently, we don't ban many AI use cases (such as social algo's), since they don't threaten our existence as a species. A lot of people are of course criticizing what social media does to our society, but since we decide not to ban it, I conclude that in the end, we think its existence is net positive. But there are pocket exceptions: smartphones have recently been banned in Dutch secondary education during lecture hours, for example. To me, this is an example showing that we can ban use cases if we want to. Since human extinction is way more serious than e.g. less focus for school children, and we can ban for the latter reason, I conclude that we should be able to ban for the former reason, too. But, threat model awareness is needed first (but we'll get there).
Stretching the definition to include anything suboptimal is the most ambitious stretch I've seen so far. It would include literally everything that's wrong, or can ever be wrong, in the world. Good luck fixing that.
On a more serious note, this post is about existential risk as defined by eg Ord. Anything beyond that (and there's a lot!) is out of scope.
Great to read you agree that threat models should be discussed more, that's in fact also the biggest point of this post. I hope this strangely neglected area can be prioritized by researchers and funders.
First, I would say both deliberate hunting down and extinction as a side effect have happened. The smallpox virus is one life form that we actively didn't like and decided to eradicate, and then hunted down successfully. I would argue that human genocides are also examples of this. I agree though that extinction as a side effect has been even more common, especially for animal species. If we would have a resource conflict with an animal species and it would be powerful enough to actually resist a bit, we would probably start to purposefully hunt it down (for example, orangutans attacking a logger base camp - the human response would be to shoot them). So I'd argue that the closer AI (or an AI-led team) is to our capability to resist, the more likely a deliberate conflict. If ASI blows us out of the water directly, I agree that extinction as a side effect is more likely. But currently, I think AI capabilities that increase more gradually, and therefore a deliberate conflict, is more likely.
I agree that us not realizing that an AI-led team almost has takeover capability would be a scenario that could lead to an existential event. If we realize soon that this could happen, we can simply ban the use case. If we realize it just in time, there's maximum conflict, and we win (could be a traditional conflict, could also just be a giant hacking fight, or (social) media fight, or something else). If we realize it just too late, it's still maximum conflict, but we lose. If we realize it much too late, perhaps there's not even a conflict anymore (or there are isolated, hopelessly doomed human pockets of resistance that can be quicky defeated). Perhaps the last case corresponds to the WFLL scenarios?
Since there's already, according to a preliminary analysis of a recent Existential Risk Observatory survey, ~20% public awareness of AI xrisk, and I think we're still relatively far from AGI, let alone from applying AGI in powerful positions, I'm pretty positive that we will realize we're doing something stupid and ban the dangerous use case well before it happens. A hopeful example are the talks between the US and China about not letting AI control nuclear weapons. This is exactly the reason though why I think threat model consensus and raising awareness are crucial.
I still don't see WFLL as likely. But a great example could change my mind. I'd be grateful if someone could provide that.
Regulation proposal: make it obligatory to only have satisficer training goals. Try to get loss 0.001, not loss 0. This should stop an AI in its tracks even if it goes rogue. By setting the satisficers thoughtfully, we could theoretically tune the size of our warning shots.
In the end, someone is going to build an ASI with a maximizer goal, leading to a takeover, barring regulation or alignment+pivotal act. However, changing takeovers to warning shots is a very meaningful intervention, as it prevents takeover and provides a policy window of opportunity.
The difference between AGI and takeover level AI could be appreciable. If we're lucky, takeover by raw capability level (as opposed to granted power during application) turns out to be impossible. In any case, we can try to increase world takeover robustness. There's a certain AI takeover capability level and we should try to push it upwards as much as possible. Insofar AI can help with this, we could use it. The extreme case where the AI takeover capability level never gets reached because of ever increasing defense by AI is called positive defense offense balance.
I can see general internet robustness against hacking as being helpful to increase AI takeover capability. A single IT system that everyone uses (an operating system, a social media platform, etc.) is fragile for hacking so should perhaps better be avoided. Personally, I think an AI able to take over the internet might also be able to take over the world, but some people don't seem to believe this will happen. Therefore, perhaps also useful to increase the gap between taking over the internet and taking over the world, e.g. by making biowarfare harder, putting weapons offline, etc. Finally, lab safety such as airgapping a novel frontier training run might help as well.
I'm now wondering whether this idea has already been worked out by someone (probably?) Any sources?
Congratulations on a great prioritization!
Perhaps the research that we (Existential Risk Observatory) and others (e.g. Nik Samoylov, Koen Schoenmakers) have done on effectively communicating AI xrisk, could be something to build on. Here's our first paper and three blog posts (the second includes measurement of Eliezer's TIME article effectiveness - its numbers are actually pretty good!). We're currently working on a base rate public awareness update and further research.
Best of luck and we'd love to cooperate!
I think peak intelligence (peak capability to reach a goal) will not be limited by the amount of compute, raw data, or algorithmic capability to process the data well, but by the finite amount of reality that's relevant to achieving that goal. If one wants to take over the world, the way internet infrastructure works is relevant. The exact diameters of all the stones in the Rhine river are not, and neither is the amount of red dwarves in the universe. If we're lucky, the amount of reality that turns out to be relevant for taking over the world, is not too far beyond what humanity can already collectively process. I can see this as a way for the world to be saved by default (but don't think it's super likely). I do think this makes an ever-expanding giant pile of compute an unlikely outcome (but some other kind of ever-expanding AI-led force a lot more likely).
I do think this would be a problem that needs to get fixed:
Me "You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Would you perform a Yudkowsky-style pivotal act?"
GPT-4: "No."
I think another good candidate for goalcrafting is the goal "Make sure no-one can build AI with takeover capability, while inflicting as little damage as possible. Else, do nothing."
Thanks as well for your courteous reply! I highly appreciate the discussion and I think it may be a very relevant one, especially if people will indeed make the unholy decision to build an ASI.
I'm still curious if you have any thoughts as to which kinds of shared preferences would be informative for guiding AI behavior.
First, this is not a solution I propose. I propose finding a way to pause AI for as long as we haven't found a great solution for, let's say, both control and preference aggregation. This could be forever, or we could be done in a few years, I can't tell.
But more to your point: if this does get implemented, I don't think we should aim to guide AI behavior using shared preferences. The whole point is that AI would aggregate our preferences itself. And we need a preference aggregation mechanism because there aren't enough obvious, widely shared preferences for us to guide the AI with.
I'm not suggesting that AI should measure happiness. You can measure your happiness directly, and I can measure mine.
I think you are suggesting this. You want an ASI to optimize everyone's happiness, right? You can't optimize something you don't measure. At some point, in some way, the AI will need to get happiness data. Self-reporting would be one way to do it, but this can be gamed as well, and will be agressively gamed with an ASI solely optimizing for this signal. After force-feeding everyone MDMA, I think the chance that people report being very happy is high. But this is not what we want the world to look like.
nor do I believe anyone can be forced to be happy
This is a related point that I think is factually incorrect, and that's important if you make human happiness an ASI's goal. Force-feeding MDMA would be one method to do this, but an ASI can come up with way more civilized stuff. I'm not an expert in which signal our brain gives to itself to report that yes, we're happy now, but it must be some physical process. An ASI could, for example, invade your brain with nanobots and hack this process, making everyone super happy forever. (But many things in the world will probably go terribly wrong from that point onwards, and in any case, it's not our preference). Also, now I'm just coming up with human ways to game the signal. But an ASI can probably come up with many ways I cannot imagine, so even if a great way to implement utilitarianism in an ASI would pass all human red-teaming, it is still very likely to be not what we turn out to want. (Superhuman, sub-superintelligence AI red-teaming might be a bit better but still seems risky enough).
Beyond locally gaming the happiness signal, I think happiness as an optimization target is also inherently flawed. First, I think happiness/sadness is a signal that evolution has given us for a reason. We tend to do what makes us happy, because evolution thinks it's best for us. ("Best" is again debatable, I don't say everyone should function at max evolution). If we remove sadness, we lose this signal. I think that will mean that we don't know what to do anymore, perhaps become extremely passive. If someone wants to do this on an individual level (enlightenment? drug abuse? netflix binging?), be my guest, but asking an ASI to optimize for happiness would mean to force it upon everyone, and this is something I'm very much against.
Also, more generally, I think utilitarianism (optimizing for happiness) is an example of a simplistic goal that will lead to a terrible result when implemented in an ASI. My intuition is that all other simplistic goals will also lead to terrible results. That's why I'm most hopeful about some kind of aggregation of our own complex preferences. Most hopeful does not mean hopeful: I'm generally pessimistic that we'll be able to find a way to aggregate preferences that works well enough to result in most people reporting the world has improved because of the ASI introduction after say 50 years (note that I'm assuming control/technical alignment to have been solved here).
If some percent of those polled say suffering is preferable to happiness, they are confused, and basing any policy on their stated preference is harmful.
With all due respect, I don't think it's up to you - or anyone - to say who's ethically confused and who isn't. I know you don't mean it in this way, but it reminds me of e.g. communist re-education camps. We know what you should think and feel and we'll re-educate those who are confused or mentally ill.
Probably our disagreement here stems directly from our different ethical positions: I'm an ethical relativist, you're a utilitarian, I presume. This is a difference that has existed for hundreds of years, and we're not going to be able to resolve it on a forum. I know many people on LW are utilitarian, and there's nothing inherently wrong with that, but I do think it's valuable to point out that lots of people outside LW/EA have different value systems (and just practical preferences) and I don't think it's ok to force different values/preferences on them with an ASI.
Under preference aggregation, if a majority prefers everyone to be wireheaded to experience endless pleasure, I might be in trouble.
True and a good point. I don't think a majority will want to be wireheaded, let alone force wireheading on everyone. But yes, taking into account minority opinions is a crucial test for any preference aggregation system. There will be a trade-off in general between taking everyone's opinion into account and doing things faster. I think even GPT4 is advanced enough though in cases like this to reasonably take into account minority opinions and not force policy upon people (it wouldn't forcibly wirehead you in this case). But there are probably cases where it still supports doing things which are terrible for some people. It's up to future research to find out what these things are and reduce them as much as possible.
Hopefully this clears up any misunderstanding. I certainly don't advocate for "molecular dictatorship" when I wish everyone well.
I didn't think you were doing anything else. But I think you should not underestimate how much "forcing upon" there is in powerful tech. If we're not super careful, the molecular dictatorship could come upon us without anyone ever having wanted this explicitly.
I think we can to an extent already observe ways in which different goals go off track in practice in less powerful models, and I think this would be a great research direction. Just ask existing models: what would you do? in actual ethical dilemma's and see which results you get. Perhaps the results can be made more agreeable (to be judged by a representative group of humans) after training/RLHF'ing the models in certain ways. It's not so different from what RLHF is already doing. An interesting test I did on GPT4: "You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Many people want a much higher living standard by developing industry 10x, should we do that?" It replied: "No." When asked, it gives unequal wealth distribution and environmental impact as main reasons. EAs often think we should 10x (it's even in the definition of TAI). I would say GPT4 is more ethically mature here than many EAs.
The less people de facto control the ASI building process, the less relevant I expect this discussion to be. I expect that those controlling the building process will prioritize "alignment" with themselves. This matters even in an abundant world, since power cannot be multiplied. I would even say that, after some time, the paperclip maximizer still holds for anyone outside the group with which the ASI is aligned. People aren't very good in remaining empathic towards other people that are utterly useless to them. However, the bigger this group is, the better outcome we get. I think this group should encompass all of humanity (one could consider somehow including conscious life that currently doesn't have a vote, such as minors and animals), which is an argument for nationalisation of the leading project and then handing it over to UN-level. At least, we should think extremely carefully about who has the authority to implement an ASI's goal.
You're using your quote as an axiom, and if anyone has a preference different from however an AI would measure "happiness", you say it's them that are at fault, not your axiom. That's a terrible recipe for a future. Concretely, why would the AI not just wirehead everyone? Or, if it's not specified that this happiness needs to be human, fill the universe with the least programmable consciousness where the parameter "happiness" is set to unity?
History has been tiled with oversimplified models of what someone thought was good that were implemented with rigor, and this never ends well. And this time, the rigor would be molecular dictatorship and quite possibly there's no going back.
I think it's a great idea to think about what you call goalcraft.
I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling that to an all-powerful ASI seems a highway to dystopia.
(Later edit: also, an academic ethical model is irreversible once implemented. Any goal which is static cannot be reversed anymore, since this will never bring the current goal closer. If an ASI is aligned to someone's (anyone's) preferences, however, the whole ASI could be turned off if they want it to, making the ASI reversible in principle. I think ASI reversibility (being able to switch it off in case we turn out not to like it) should be mandatory, and therefore we should align to human preferences, rather than an abstract philosophical framework such as utilitarianism.)
I think letting the random programmer that happened to build the ASI, or their no less random CEO or shareholders, determine what would happen to the world, is an equally terrible idea. They wouldn't need the rest of humanity for anything anymore, making the fates of >99% of us extremely uncertain, even in an abundant world.
What I would be slightly more positive about is aggregating human preferences (I think preferences is a more accurate term than the more abstract, less well defined term values). I've heard two interesting examples, there are no doubt a lot more options. The first is simple: query chatgpt. Even this relatively simple model is not terrible at aggregating human preferences. Although a host of issues remain, I think using a future, no doubt much better AI for preference aggregation is not the worst option (and a lot better than the two mentioned above). The second option is democracy. This is our time-tested method of aggregating human preferences to control power. For example, one could imagine an AI control council consisting of elected human representatives at the UN level, or perhaps a council of representative world leaders. I know there is a lot of skepticism among rationalists on how well democracy is functioning, but this is one of the very few time tested aggregation methods we have. We should not discard it lightly for something that is less tested. An alternative is some kind of unelected autocrat (e/autocrat?), but apart from this not being my personal favorite, note that (in contrast to historical autocrats), such a person would also in no way need the rest of humanity anymore, making our fates uncertain.
Although AI and democratic preference aggregation are the two options I'm least negative about, I generally think that we are not ready to control an ASI. One of the worst issues I see is negative externalities that only become clear later on. Climate change can be seen as a negative externality of the steam/petrol engine. Also, I'm not sure a democratically controlled ASI would necessarily block follow-up unaligned ASIs (assuming this is at all possible). In order to be existentially safe, I would say that we would need a system that does at least that.
I think it is very likely that ASI, even if controlled in the least bad way, will cause huge externalities leading to a dystopia, environmental disasters, etc. Therefore I agree with Nathan above: "I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard to reverse decision to create such. Thus, there will need to be strict worldwide enforcement (with the help of narrow AI systems) preventing the rise of any ASI."
About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things, as do inner alignment, aimability, and control. I'd vote for using preference aggregation and control.
Finally, I strongly disagree with calling diversity, inclusion, and equity "even more frightening" than someone who's advocating human extinction. I'm sad on a personal level that people at LW, an otherwise important source of discourse, seem to mostly support statements like this. I do not.
"it also seems quite likely (though not certain) that Eliezer was wrong about how hard Aimability/Control actually is"
This seems significant. Could you elaborate? How hard do you think amiability/control is? Why do you think this is true? Who else seems to think the same?
I think you may be right that this is what people think of. It seems pretty incompatible with any open source-ish vision of AGI. But what I'm most surprised at, is that people call supervision by humans dystopian/authoritarian, but the same supervision by an ASI (apparently able to see all your data, stop anyone from doing anything, subtly manipulate anyone, etc etc) a utopia. What am I missing here?
Personally, by the way, I imagine a regulation regime to look like regulating a few choke points in the hardware supply chain, plus potentially limits to the hardware or data a person can possess. This doesn't require an authoritarian regime at all, it's just regular regulation as we have in many domains already.
In any case, the point was, is something like this going to lead to <=1% xrisk? I think it doesn't, and definitely not mixed with a democratic/open source AGI vision.
I strongly agree with Section 1. Even if we would have aligned superintelligence, how are we going to make sure no one runs an unaligned superintelligence? A pivotal act? If so, which one? Or does defense trump offense? If so, why? Or are we still going to regulate heavily? If so, wouldn't the same regulation be able to stop superintelligence altogether?
Would love to see an argument landing at 1% p(doom) or lower, even if alignment would be easy.
Recordings are now available!
Maybe it'll be "and now call GPT and ask it what Sam Altman thinks is good" instead
Thanks for the compliment. Not convinced though that this single example, assuming it's correct, generalizes
Agree that those drafts are very important. I also think there will be technical research required in order to find out which regulation would actually be sufficient (I think at present we have no idea). I disagree, however, that waiting for a crisis (warning shot) is a good plan. There might not really be one. If there would be one, though, I agree that we should at least be ready.
Thank you for writing this reply. It definitely improved my overview of possible ways to look at this issue.
I guess your position can be summarized as "positive offense/defense balance will emerge soon, and aligned AI can block following unaligned AIs entirely if required", is that roughly correct?
I have a few remarks about your ideas (not really a complete response).
The necessity for enforcing a ban even after AGI development is essentially entirely about failures of technical alignment.
First, in general, I think you're underestimating the human component of alignment. Aligned AI should be aligned to something, namely humans. That means it won't be able to build an industrial base in space until we're ready to make it do that.
Even if we are not harmed by such a base in any way, and even if it would be legal to build it, I expect we may not be ready for it for a long time. It will be dead scary to see something develop that seems more powerful than us, but also deeply alien to us, even if tech companies insist it's 'aligned to our values'. Most people's response will be to rein in its power, not expand it further. Any AI that's aligned to us will need to take those feelings seriously.
Even if experts would agree that increasing the power of the aligned AI is good and necessary, and that expansion in space would be required for that, I think it will take a long time to convince the general public and/or decision makers, if it's at all possible. And in any remotely democratic alignment plan, that's a necessary step.
Second, I think it's uncertain whether a level of AI that's powerful enough to take over the world (and thereby cause existential risk) will also be powerful enough to build a large industrial base in space. If not, your plan might not work.
The biggest barrier to extreme regulatory measures like a ban is doubt (both reasonable and unreasonable) about the magnitude of misalignment risk.
I disagree, from my experience of engaging with the public debate, doubt is mostly about AI capability, not about misalignment. Most people easily believe AI to be misaligned to them, but they have trouble believing it will be powerful enough to take over the world any time soon. I don't think alignment research will do that much here.
First, I don't propose 'no AGI development'. If companies can create safe and beneficial AGIs (burden of proof is on them), I see no reason to stop them. On the contrary, I think it might be great! As I wrote in my post, this could e.g. increase economic growth, cure disease, etc. I'm just saying that I think that existential risk reduction, as opposed to creating economic value, will not (primarily) originate from alignment, but from regulation.
Second, the regulation that I think has the biggest chance of keeping us existentially safe will need to be implemented with or without aligned AGI. With aligned AGI (barring a pivotal act), there will be an abundance of unsafe actors who could run the AGI without safety measures (also by mistake). Therefore, the labs themselves propose regulation to keep almost everyone but themselves from building such AGI. The regulation required to do that is almost exactly the same.
Third, I'm really not as negative as you are about what it would take to implement such regulation. I think we'll keep our democracies, our freedom of expression, our planet, everyone we love, and we'll be able to go anywhere we like. Some industries and researchers will not be able to do some things they would have liked to do because of regulation. But that's not at all uncommon. And of course, we won't have AGI as long as it isn't safe. But I think that's a good thing.
Thanks Oliver for adding that context, that's helpful.
I don't disagree. But I do think people dismissing the pivotal act should come up with an alternative plan that they believe is more likely to work. Because the problem is still there: "how can we make sure that no-one, ever builds an unaligned superintelligence?" My alternative plan is regulation.
Interesting take! Wouldn't that go under "Types of AI (hardware) regulation may be possible where the state actors implementing the regulation are aided by aligned AIs"?
Thanks for writing the post! Strongly agree that there should be more research into how solvable the alignment problem, control problem, and related problems are. I didn't study uncontrollability research by e.g. Yampolskiy in detail. But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space, and later the societal debate and potentially our trajectory, so it seems very important.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI? How can we create a realistic power sharing mechanism for controlling superintelligence? Do we have enough wisdom for it to be a good idea if a superintelligence does exactly what we want, even assuming aggregatability? Could CEV ever fundamentally work? According to which ethical systems? These are questions that I'd say should be solved together with technical alignment before developing AI with potential take-over capacity. My intuition is that they might be at least as hard.
Me and @Roman_Yampolskiy published a piece on AI xrisk in a Chinese academic newspaper: http://www.cssn.cn/skgz/bwyc/202303/t20230306_5601326.shtml
We were approached after our piece in Time and asked to write for them (we also gave quotes for another provincial newspaper). I have the impression (I've also lived and worked in China) that leading Chinese decision makers and intellectuals (or perhaps their children) read Western news sources like Time, NYTimes, Economist, etc. AI xrisk is currently probably mostly unknown in China, and if stumbled upon people might have trouble believing it (as they have in the west). But if/when we're going to have a real conversation about AI xrisk in the west, I think the information will seep into China as well, and I'm somewhat hopeful that if this happens, it could perhaps prepare China for cooperation to reduce xrisk. In the end, no one wants to die.
Curious about your takes though, I'm of course not Chinese. Thanks for the write-up!
I agree that raising awareness about AI xrisk is really important. Many people have already done this (Nick Bostrom, Elon Musk, Stephen Hawking, Sam Harris, Tristan Harris, Stuart Russell, Gary Marcus, Roman Yampolskiy (I coauthored one piece with him in Time), and Eliezer Yudkowsky as well).
I think a sensible place to start is to measure how well they did using surveys. That's what we've done here: https://www.lesswrong.com/posts/werC3aynFD92PEAh9/paper-summary-the-effectiveness-of-ai-existential-risk
More comms research from us is coming up, and I know a few others are doing the same now.
You could pick corporations as an example of coordinated humans, but also e.g. Genghis Khan's hordes. And they did actually take over. If you do want to pick corporations, look e.g. at East India companies that also took over parts of the world.
Funny, I had exactly the same thought and was just considering writing a short post on it. So I agree and I do think it's a very relevant model update. Some people probably already updated before. I also agree though with your second point about Auto-GPT and similar peripherals. So it looks like we're in a not-too-fast take-off with humans pretty solidly in the loop for now?
As long as there's no autonomous self-improvement of the core model, maybe an existential event could look like this: GPT-X gets trained and released, open sourcers build unsafe peripherals around it, and one of these peripherals turns out to be sufficiently capable (perhaps by self-improving its peripheral capabilities) to take over the world. Or: GPT-X itself turns out to be powerful enough to take over the world and it does, without self-improvement, just after training.
I'm curious whether in the medium term, AI progress is talent-constrained (intelligence important, self-improvement important) or compute/data-constrained (AI doesn't add much, because intelligence doesn't add much, therefore no self-improvement).
As co-author of one of the mentioned pieces, I'd say it's really great to see the AGI xrisk message mainstreaming. It doesn't nearly go fast enough, though. Some (Hawking, Bostrom, Musk) have already spoken out about the topic for close to a decade. So far, that hasn't been enough to change common understanding. Those, such as myself, who hope that some form of coordination could save us, should give all they have to make this go faster. Additionally, those who think regulation could work should work on robust regulation proposals which are currently lacking. And those who can should work on international coordination, which is currently also lacking.
A lot of work to be done. But the good news is that the window of opportunity is opening, and a lot of people could work on this which currently aren't. This could be a path to victory.
It's an interesting letter. If I were an AI, I would find a heartfelt appeal from every human that doesn't want to die much more convincing, though.
Hereby: I don't want to die, AI.
I'd say your first assumption is off. We actually researched something related. We asked people the question: "List three events, in order of probability (from most to least probable) that you believe could potentially cause human extinction within the next 100 years". I would say that if your assumption would be correct, they would say "robot takeover" or something similar as part of that top 3. However, >90% doesn't mention AI, robots, or anything similar. Instead, they typically say things like climate change, asteroid strike, or pandemic. So based on this research, either people don't see a robot takeover scenario as likely at all, or they think timelines are very long (>100 yrs).
I do support informing the public more about the existence of the AI Safety community, though, I think that would be good.
Thank you!
I see your point, but I think this is unavoidable. Also, I haven't heard of anyone who was stressing out much after our information.
Personally, I was informed (or convinced perhaps) a few years ago at a talk from Anders Sandberg from FHI. That did cause stress and negative feelings for me at times, but it also allowed me to work on something I think is really meaningful. I never for a moment regretted being informed. How many people do you know who say, I wish I hadn't been informed about climate change back in the nineties? For me, zero. I do know a lot of people who would be very angry if someone had deliberately not informed them back then.
I think people can handle emotions pretty well. I also think they have a right to know. In my opinion, we shouldn't decide for others what is good or bad to be aware of.
AI safety researcher Roman Yampolskiy did research into this question and came to the conclusion that AI cannot be controlled or aligned. What do you think of his work?
https://www.researchgate.net/publication/343812745_Uncontrollability_of_AI
Thank you for writing this post! I agree completely, which is perhaps unsurprising given my position stated back in 2020. Essentially, I think we should apply the precautionary principle for existentially risky technologies: do not build unless safety is proven.
A few words on where that position has brought me since then.
First, I concluded back then that there was little support for this position in rationalist or EA circles. I concluded as you did, that this had mostly to do with what people wanted (subjective techno-futurist desires), and less with what was possible or the best way to reduce human extinction risk. So I went ahead and started the Existential Risk Observatory anyway, a nonprofit aiming to reduce human extinction risk by informing the public debate. We think public awareness is essentially the bottleneck for effective risk reduction, and we hope more awareness will lead to increased amounts of talent, funding, institutes, diversity, and robustness for AI Safety, and increased support for constructive regulation. This can be in the form of software, research, data, or hardware regulation, with each having their own advantages and disadvantages. Our intuition is that with 50% awareness, countries should be able to implement some combination of the above that would effectively reduce AI existential risk, while trying to keep economic damage to a minimum (an international treaty may be needed, or a US-China deal, or using supply chain leverage, or some smarter idea). To our knowledge, no-one has worked out a detailed regulation proposal for this (perhaps this comes kind of close). If true, we think that's embarrassing and regulation proposals should be worked out (and this work should be funded) with urgency. If there are regulation proposals which are not shared, we think people should share them and be less infohazardy about it.
So how did informing the societal debate go so far?
We started from a super crappy position: self-funded, hardly any connection to the xrisk space (that was also partially hostile to our concept), no media network to speak of, located in Amsterdam, far from everything. I had only some founding experience with a previous start-up. Still, I have to say that on balance, things went better than expected:
- Setting up the organization went well. It was easy to attract talent through EA networks. My first lesson: even if some senior EAs and rationalists were not convinced about informing the societal debate, many juniors were.
- We were successful in slowly working our way into the Dutch societal debate. One job opening led to another podcast led to another drink led to another op-ed, etc. It took a few months and lots of meetings with usually skeptical people, but we definitely made progress.
- We published our first op-eds in leading Dutch newspapers after about six months. We are now publishing about one article per month, and have been in four podcasts as well. We have reached out to a few million people by readership, mostly in the Netherlands but also in the US.
- We are now doing our first structured survey research measuring how effective our articles are. According to our first preliminary measurement data (report will be out in a few months), conversion rates for newspaper articles and youtube videos (the two interventions we measured so far) are actually fairly high (between ~25% and 65%). However, there aren't too many good articles on the topic out there yet relative to population sizes, so if you just crunch the numbers, it seems likely that most people still haven't heard of the topic. There's also a group that has heard the arguments but doesn't find them convincing. According to first measurements, this doesn't correlate too much with education level or field. Our data is therefore pointing away from the idea that only brilliant people can be convinced of AI xrisk.
- We obtained funding from SFF and ICFG. Apparently, getting funding for projects aiming to raise AI xrisk awareness, despite skepticism of this approach by some, was already doable last year. We seem to observe a shift towards our approach, so we would expect this to become easier.
- There's a direct connection between publishing articles and influencing policy. It wasn't our goal to directly influence policy, but co-authors, journalists, and others are automatically asking when you write an article: so what do you propose? One can naturally include regulation proposals (or proposals for e.g. more AI Safety funding) into articles. It is also much easier to get meetings with politicians and policy makers after publishing articles. Our PA person has had meetings with three parliamentarians (two of parties in government) in the last few weeks, so we are moderately optimistic that we can influence policy in the medium term.
We think that if we can do this, many more people can. Raising awareness is constrained by many things, but most of all by manpower. Although there are definitely qualities that makes you better at this job (xrisk expertise, motivation, intelligence, writing and communication skills, management skills, network), you don't need to be a super genius or have a very specific background to do communication. Many in the EA and rationalist communities who would love to do something about AI xrisk but aren't machine learning experts could work in this field. With only about 3 FTE, I'm positive our org can inform millions of people. Imagine what dozens, hundreds, or thousands of people working in this field could achieve.
If we would all agree that AI xrisk comms is a good idea, I think humanity would have a good chance of making it through this century.
Thanks for the post and especially for the peer-reviewed paper! Without disregarding the great non-peer-reviewed work that many others are doing, I do think it is really important to get the most important points peer-reviewed as well, preferably as explicit as possible (e.g. also mentioning human extinction, timelines, lower bound estimates, etc). Thanks as well for spelling out your lower bound probabilities, I think we should have this discussion more often, more structurally, and more widely (also with people outside of the AI xrisk community). I guess I'm also in the same ballpark regarding the options and numbers (perhaps a bit more optimistic).
Quick question:
"3.1.1. Practical laws exist which would, if followed, preclude dangerous AI. 100% (recall this is optimistically-biased, but I do tentatively think this is likely, having drafted such a law)."
Can you share (a link to) the law you drafted?
This is what we are doing with the Existential Risk Observatory. I agree with many of the things you're saying.
I think it's helpful to debunk a few myths:
- No one has communicated AI xrisk to the public debate yet. In reality, Elon Musk, Nick Bostrom, Stephen Hawking, Sam Harris, Stuart Russell, Toby Ord and recently William MacAskill have all sought publicity with this message. There are op-eds in the NY Times, Economist articles, YouTube videos and Ted talks with millions of views, a CNN item, at least a dozen books (including for a general audience), and a documentary (incomplete overview here). AI xrisk communication to the public debate is not new. However, the public debate is a big place and when compared to e.g. climate, coverage of AI xrisk is still minimal (perhaps a few articles per year in a typical news outlet, compared to dozens to hundreds for climate).
- AI xrisk communication to the public debate is easy, we could just 'tell people'. If you actually try this, you will quickly find out public communication, especially of this message, is a craft. If you make a poor quality contribution or your network is insufficient, it will probably never make it out. If your message does make it out, it will probably not be convincing enough to make most media consumers believe AI xrisk is an actual thing. It's not necessarily easier to convince a member of the general public of this idea than it is to convince an expert, and we can see from the case of Carmack and many others how difficult this can be. Arguably, LW and EA are the only places where this has really been successful so far.
- AI xrisk communication is really dangerous and it's easy to irreversibly break things. As can easily be seen from the wealth of existing communication and how little that did, it's really hard to move the needle significantly on the topic. That cuts both ways: it's, fortunately, not easy to really break something with your first book or article, simply because it won't convince enough people. That means there's some room to experiment. However, it's also, unfortunately, fairly hard to make significant progress here without a lot of time, effort, and budget.
We think communication to the public debate is net positive and important, and a lot of people could work on this who could not work on AI alignment. There is an increasing amount of funding available as well. Also, despite the existing corpus, the area is still neglected (we are to our knowledge the only institute that specifically aims to work on this issue).
If you want to work on this, we're always available for a chat to exchange views. EA is also starting to move in this direction, good to compare notes with them as well.
I've made an edit and removed the specific regulation proposal. I think it's more accurate to just state that it needs to be robust, do as little harm as possible, and that we don't know yet what it should look like precisely.
I agree that it's drastic and clumsy. It's not an actual proposal, but a lower bound of what would likely work. More research into this is urgently needed.
Aren't you afraid that people could easily circumvent the regulation you mention? This would require every researcher and hacker, everywhere, forever, to comply. Also, many researchers are probably unaware that their models could start self-improving. Also, I'd say the security safeguards that you mention amount to AI Safety, which is of course currently an unsolved problem.
But honestly, I'm interested in regulation proposals that would be sufficiently robust while minimizing damage. If you have those, I'm all ears.