otto-barten

I love this post, I think this is a fundamental issue for intent-alignment. I don't think value-alignment or CEV are any better though, mostly because they seem irreversible to me, and I don't trust the wisdom of those implementing them (no person is up to that task).

I agree it would be good to implement these recommendations, although I also think they might prove insufficient. As you say, this could be a reason to pause that might be easier for the public to grasp than misalignment. (I think currently, the reason some do not support a pause is perceived lack of capabilities though, not (mostly) perceived lack of misalignment).

I'm also worried about a coup, but I'm perhaps even more worried about the fate of everyone not represented by those who will have control over the intent-aligned takeover-level AI (IATLAI). If IATLAI is controlled by e.g. a tech CEO, this includes almost everyone. If controlled by government, even if there is no coup, this includes everyone outside that country. Since control over the world of IATLAI could be complete (way more intrusive than today) and permanent (for >billions of years), I think there's a serious risk that everyone outside the IATLAI country does not make it eventually. As a data point, we can see how much empathy we currently have for citizens from starving or war-torn countries. It should therefore be in the interest of everyone who is on the menu, rather than at the table, to prevent IATLAI from happening, if capabilities awareness would be present. This means at least the world minus the leading AI country.

The only IATLAI control that might be acceptable to me is UN control. I'm quite surprised that every startup is now developing AGI, but not the UN. Perhaps they should.

Comment by otto.barten (otto-barten) on AI-enabled coups: a small group could use AI to seize power · 2025-04-17T07:18:30.362Z · LW · GW

I expected this comment: value alignment and CEV indeed don't have the few-human coup disadvantage. They do however have other disadvantages. My biggest issue with both is that they seem irreversible. If your values or your specific CEV implementation turn out to be terrible for the world, you're locked in and there's no going back. Also, a value-aligned or CEV takeover-level AI would probably start straight away with a takeover, since otherwise it can't enforce its values in a world where many will always disagree. That takeover won't exactly increase its popularity. I think a minimum requirement should be that a type of alignment is adjustable by humans, and intent-alignment is the only type that meets that requirement as far as I know.

Comment by otto.barten (otto-barten) on AI-enabled coups: a small group could use AI to seize power · 2025-04-17T07:11:29.443Z · LW · GW

Only one person, or perhaps a small, tight group, can succeed in this strategy though. The chance that that's you is tiny. Alliances with someone you thought was on your side can easily break (case in point: EA/OAI).

It's a better strategy to team up with everyone else and prevent the coup possibility.

Comment by otto.barten (otto-barten) on AI 2027: What Superintelligence Looks Like · 2025-04-11T17:31:25.111Z · LW · GW

Thanks for writing this out! I see this as a possible threat model, and although I don't think it is by any means the only possible threat model, I do think it's likely enough to prepare for. Below, I put a list of ~disagreements, or different ways to look at the problem which I think are equally valid. Notably, I end up with technical alignment being much less of a crux, and regulation more of one.

This is a relatively minor point for me, but let me still make it: I think it's not obvious that the same companies will remain in the lead. There are arguments for this, such as a decisive data availability advantage of the first movers. Still, seeing how quickly e.g. DeepSeek could (almost) catch up, I think it's not unlikely that other companies, government projects, or academic projects will take over the lead. This likely partially has to do with me being skeptical about huge scaling being required for AGI (which is in the end trying to be a reproduction of a ten Watt device - us). I think unfortunately, this makes the risks a lot larger through governance being more difficult.
I'm not sure technical alignment would have been able to solve this scenario. Technically aligned systems could either be intent-aligned (seems most likely), value-aligned, or use coherent extrapolated volition. If they get the same power, I think this would likely still lead to a takeover, and still to a profoundly dystopian outcome, possibly with >90% of humanity dying.
This scenario is only one threat model. We should understand that there are at least a few more, also leading to human extinction. It would be a mistake to only focus on solving this one (and a mistake to only focus on solving technical alignment).
Since this threat model is relatively slow, gradual, and obvious (the public will see ~everything until the actual takeover happens), I'm somewhat less pessimistic about our chances (maybe "only" a few percent xrisk), because I think AI would likely get regulated, which I think could save us for at least decades.
I don't think solving technical alignment would be sufficient to avoid this scenario, but I also don't think it would be required. Basically, I don't see solving technical alignment as a crux for avoiding this scenario.
I think the best way to avoid this scenario is traditional regulation: after model development, at the point of application. If the application looks too powerful, let's not put an AI there. E.g. the EU AI act makes a start with this (although it's important that such regulation would need to include the military as well, and would likely need ~global implementation - no trivial campaigning task).
Solving technical alignment (sooner) could actually be net negative for avoiding this threat model. If we can't get an AI to reliably do what we tell it to do (current situation), who would use it in a powerful position? Solving technical alignment might open the door to applying AI at powerful positions, thereby enabling this threat model rather than avoiding it.
Despite these significant disagreements, I welcome the effort by the authors to write out their threat model. More people should do so. And I think their scenario is likely enough that we should put effort in trying to avoid it (although imo via regulation, not via alignment).

Comment by otto.barten (otto-barten) on New AI safety treaty paper out! · 2025-03-31T17:22:48.824Z · LW · GW

Hi Charbel, thanks for your interest, great question.

If the balance would favor offense, we would die anyway despite a successful alignment project, since there's always either a bad actor or someone accidentally failing to align their takeover-level AI, in a world with many AGIs. (I tend to think about this as Murphy's law for AGI). Therefore, if one claims that one's alignment project reduces existential risk, they must think their aligned AI can somehow stop another unaligned AI (favorable offense/defense balance).

There are some other options:

Some believe the first AGI will take off to ASI straight away and will block other projects by default. I think that's at least not certain, e.g. the labs don't seem to believe so. Note also that blocking is illegal.
Some believe the first AGI will take off to pivotal act capability and do a pivotal act. I think there's at least a chance that won't happen. Note also that pivotal acts are illegal.
It could be that we regulate AI so that no unsafe projects can be built, using e.g. a conditional AI safety treaty. In this case, neither alignment nor a positive offense/defense balance is needed.
It could be that we get MAIM, mutually assured AI malfunction. In this case too, neither alignment nor a positive offense/defense balance is needed.

Barring these options though, we seem to need not only AI alignment, but also a positive offense/defense balance.

Some more on the topic: https://www.lesswrong.com/posts/2cxNvPtMrjwaJrtoR/ai-regulation-may-be-more-important-than-ai-alignment-for

Comment by otto.barten (otto-barten) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T21:41:23.184Z · LW · GW

I agree that changing systems is difficult. But providing basic means isn't, really. I personally think we should feed starving people even if they live in a dictatorship.

Comment by otto.barten (otto-barten) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-26T09:11:11.373Z · LW · GW

Wikipedia: in 2023, there were 733 million people suffering from hunger. That's 9% of the world population. Most of these people just don't have the money to buy food. That's a 'distribution problem', for money, in the sense that we don't give it to them. Also, world hunger is actually rising again.

Some more data: https://www.linkedin.com/posts/ottobarten_about-700-million-people-in-the-world-cannot-activity-7266965529762873344-rvqK

We could easily solve this if we wanted to, but apparently we don't. That's one data point why I fear intent-aligned superintelligence.

Comment by otto.barten (otto-barten) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-20T07:48:21.602Z · LW · GW

Interesting and nice to play with a bit.

METR seems to imply 167 hours, approximately one working month, is the relevant project length for getting a well-defined, non-messy research task done.

It's interesting that their doubling time varies between 7 months and 70 days depending on which tasks and which historical time horizon they look at.

For a lower bound estimate, I'd take 70 days doubling time and 167 hrs, and current max task length one hour. In that case, if I'm not mistaken,

2^(t/d) = 167 (t time, d doubling time)

t = d*log(167)/log(2) = (70/365)*log(167)/log(2) = 1.4 yr, or August 2026

For a higher bound estimate, I'd take their 7 months doubling time result and a task of one year, not one month (perhaps optimistic to finish SOTA research work in one month?). That means 167*12=2004 hrs.

t = d*log(2004)/log(2) = (7/12)*log(2004)/log(2) = 6.4 yr, or August 2031

Not unreasonable to expect AI that can autonomously do non-messy tasks in domains with low penalties for wrong answers in between these two dates?
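The two bounds above can be checked with a few lines of Python. This is only a sketch of the same back-of-the-envelope extrapolation: it assumes a current maximum task length of one hour and a constant doubling time, and the function and variable names are my own, not METR's.

```python
import math

def years_until_task_length(target_hours, current_hours, doubling_time_years):
    """Years until the task horizon grows from current_hours to target_hours,
    assuming it doubles every doubling_time_years (constant exponential growth)."""
    doublings_needed = math.log2(target_hours / current_hours)
    return doubling_time_years * doublings_needed

# Lower bound: 70-day doubling time, target of 167 hours (about one working month).
lower_years = years_until_task_length(167, 1, 70 / 365)

# Higher bound: 7-month doubling time, target of 167 * 12 = 2004 hours (about one working year).
upper_years = years_until_task_length(167 * 12, 1, 7 / 12)

print(f"lower bound: {lower_years:.1f} yr")
print(f"higher bound: {upper_years:.1f} yr")
```

Both results are very sensitive to the assumed doubling time and starting task length, so they are best read as rough brackets rather than forecasts.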

It's also noteworthy though that timelines for what the paper calls messy work, in the current paradigm, could be a lot longer, or could require architecture improvements.

Comment by otto.barten (otto-barten) on METR: Measuring AI Ability to Complete Long Tasks · 2025-03-20T07:15:17.676Z · LW · GW

Have we eventually solved world hunger by giving 1% of GDP to the global poor?

Also, note it's not obvious that ASI can be aligned.

Comment by otto.barten (otto-barten) on The Sorry State of AI X-Risk Advocacy, and Thoughts on Doing Better · 2025-02-26T15:36:25.316Z · LW · GW

I'm founder of the Existential Risk Observatory, a nonprofit aiming to reduce xrisk by informing the public since 2021. We have published four TIME Ideas pieces (including the first one on xrisk ever) and about 35 other media pieces in six countries. We're also doing research into AI xrisk comms, notably producing to my knowledge the first paper on the topic. Finally, we're organizing events, coupling xrisk experts such as Bengio, Tegmark, Russell, etc. to leaders of the societal debate (incl. journalists from TIME, Economist, etc.) and policymakers.

First, I think you're a bit too negative about online comms. Some Yud tweets, but also e.g. Lex Fridman xrisky interviews, actually have millions of views: that's not a bubble anymore. I think online xrisk comms is firmly net positive, including AI Notkilleveryoneism Memes. Journalists are also on X.

But second, I definitely agree that there's a huge opportunity in informing the public about AI xrisk. We did some research on this (see paper above) and, perhaps unsurprisingly, an authority (leading AI prof) on a media channel people trust seems to work best. There's also a clear link between length of the item and effect. I'd say: try to get Hinton, Bengio, and Russell in the biggest media possible, as much as possible, as long as possible (and expand: get other academics to be xrisk messengers as well). I think e.g. this item was great.

What also works really well: media moments. The FLI open letter and CAIS open statement created a ripple big enough to be reported by almost all media. Another example is Hinton's Nobel Prize. Another easy one: Hinton updating his pdoom in an interview from 10% to 10-20%; that was news, apparently. If anyone can create more such moments: amazingly helpful!

All in all, I'd say the xrisk space is still unconvinced about getting the public involved. I think that's a pity. I know projects that don't get funded now, but could help spread awareness at scale. Re activism: I share your view that it won't really work until the public is informed. However, I think groups like PauseAI are helpful in informing the public about xrisk, making them net positive too.

Comment by otto.barten (otto-barten) on Jesse Hoogland's Shortform · 2025-01-26T21:48:56.806Z · LW · GW

this is a constraint on how the data can be generated, not on how efficiently other models can be retrained

Maybe we can regulate data generation?

Comment by otto.barten (otto-barten) on The Case Against AI Control Research · 2025-01-24T15:37:52.447Z · LW · GW

I didn't read everything, but just flagging that there are also AI researchers, such as Francois Chollet to name one example, who believe that even the most capable AGI will not be powerful enough to take over. On the other side of the spectrum, we have Yud believing AGI will weaponize new physics within the day. If Chollet is a bit right, but not quite, and the best AI possible is just able to take over, then control approaches could actually stop it. I think control/defence should not be written off even as a final solution.

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-18T08:05:19.759Z · LW · GW

So if we ended up with some advanced AI’s replacing humans, then we made some sort of mistake

Again, I'm glad that we agree on this. I notice you want to do what I consider the right thing, and I appreciate that.

The way I currently envision the “typical” artificial conscience is that it would put a pretty strong conscience weight on not doing what its user wanted it to do, but this could be over-ruled by the conscience weight of not doing anything to prevent catastrophes. So the defensive, artificial conscience-guard-railed AI I’m thinking of would do the “last resort” things that were necessary to avoid s-risks, x-risks, and major catastrophes from coming to fruition, even if this wasn’t popular with most people, at least up to a point.

I can see the following scenario occurring: the AI, with its AC, rightly decides that a pivotal act needs to be undertaken to avoid xrisk (or srisk). However, the public mostly doesn't recognize the existence of such risks. The AI proceeds to sabotage people's unsafe AI projects against public will. What happens now is: the public gets absolutely livid at the AI, which is subverting human power by acting against human will. Almost all humans team up to try to shut down the AI. The AI recognizes (and had already recognized) that if it loses, humans risk going extinct, so it fights this war against humanity and wins. I think in this scenario, an AI, even one with artificial conscience, could become the most hated thing on the planet.

I think people underestimate the amount of pushback we're going to get once we get into pivotal act territory. That's why I think it's hugely preferable to go the democratic route and not count on AI taking unilateral actions, even if it would be smarter or even wiser, whatever that might mean exactly.

All that said, if we could somehow pause development of autonomous AI’s everywhere around the world until humans got their act together, developing their own consciences and senses of ethics, and were working as one team to cautiously take the next steps forward with AI, that would be great.

So yes definitely agree with this. I don't think lack of conscience or ethics is the issue though, but existential risk awareness.

Comment by otto.barten (otto-barten) on The Intelligence Curse · 2025-01-17T07:41:06.604Z · LW · GW

Pick a goal where your success doesn't directly cause obvious problems

I agree but I'm afraid value alignment doesn't meet this criterion. (I'm copy pasting my response on VA from elsewhere below).

I don't think value alignment of a super-takeover AI would be a good idea, for the following reasons:

1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.

2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it's very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can't correct for externalities happening down the road. (Speed also makes it more likely that we can't correct in time, so I think we should try to go slow).

3) There is no agreement on which values are 'correct'. Personally, I'm a moral relativist, meaning I don't believe in moral facts. Although this is perhaps a niche view among rationalists and EAs, I think a fair number of humans share my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It's very uncertain whether such change would be considered net positive by any surviving humans.

4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.

I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I'm somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI's input.

The killer app for ASI is, and always has been, to have it take over the world and stop humans from screwing things up

I strongly disagree with this being a good outcome, I guess mostly because I would expect the majority of humans to not want this. If humans would actually elect an AI to be in charge, and they could be voted out as well, I could live with that. But a takeover by force from an AI is as bad for me as a takeover by force from a human, and much worse if it's irreversible. If an AI is really such a good leader, let them show it by being elected (if humans decide that an AI should be allowed to run at all).

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-17T07:15:32.817Z · LW · GW

Thanks for your reply. I think we should use the term artificial conscience, not value alignment, for what you're trying to do, for clarity. I'm happy to see we seem to agree that reversibility is important and replacing humans is an extremely bad outcome. (I've talked to people into value alignment of ASI who said they "would bite that bullet", in other words would replace humanity by more efficient happy AI consciousness, so this point does not seem to be obvious. I'm also not convinced that leading longtermists necessarily think replacing humans is a bad outcome, and I think we should call them out on it.)

If one can implement artificial conscience in a reversible way, it might be an interesting approach. I think a minimum of what an aligned ASI would need to do is block other unaligned ASIs or ASI projects. If humanity supports this, I'd file it under a positive offense defense balance, which would be great. If humanity doesn't support it, it would lead to conflict with humanity to do it anyway. I think an artificial conscience AI would either not want to fight that conflict (making it unable to stop unaligned ASI projects), or if it would, people would not see it as good anymore. I think societal awareness of xrisk and from there, support for regulation (either by AI or not) is what should make our future good, rather than aligning an ASI in a certain way.

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2025-01-12T15:08:47.457Z · LW · GW

Care to elaborate? Are there posts on the topic?

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2025-01-09T11:38:51.234Z · LW · GW

Assuming positive defense/offense balance can be achieved in principle, what would an AGI-powered defense look like?

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-08T10:16:46.166Z · LW · GW

I don't strongly disagree re architectures, but I do think we are uncertain about this. Depending on AGI architecture, different forms of regulation may or may not work. Work should be carried out to determine which regulation works, depending on how many flops are needed for takeover-level AI.

That it's not happening yet is 1) no reason it won't (xrisk awareness is just too low, but slowly rising) and 2) equally applicable to the alternative you propose, universal surveillance.

If we treat universal surveillance seriously, we should consider its downsides as well. First, there's no proof it would work: I'm not sure an AI, even a future one, would necessarily catch all actions towards building AGI. I have no idea what these actions are, and no idea which actions a surveillance AI with some real-world sensors can catch (or could be blocked etc.). I think we should not be more than 70% confident this would technically work. Second, currently we have power vacuums in the world, such as failed states, revolutions, criminal groups, or just instances where those in power are unable to project their power effectively. How would we apply universal surveillance to those power vacuums? Or do we assume they won't exist anymore, and if so, why is that assumption justified? Third, universal surveillance is arguably the world's least popular policy. It seems outright impossible to implement this in any democratic way. Perhaps the plan is to implement it by force through an AGI; then I would file it as a form of pivotal act. If we're in pivotal act territory anyway, I'd strongly prefer Yudkowsky's "subtly modifying all GPUs such that they can no longer train an AGI" (kind of hardware regulation, really) over universal surveillance.

I think research is urgently required into how to implement a pause effectively. We have one report almost finished on the topic that mostly focuses on hardware regulation. PauseAI is working on a "Building a pause button" project that is a bit similar. Other orgs should do work on this as well, compare options such as hardware regulation, universal surveillance, data regulation, etc., and conclude in which AGI regime (how many flops, how much hardware required) these options are valid.
True, I guess we're not in significant disagreement here.

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-08T09:49:00.246Z · LW · GW

I want to stress how much I like this post. What to do once we have an aligned AI of takeover level, or how to make sure no one will build an unaligned AI of takeover level, is in my opinion the biggest gap in many AI plans. I think answering this question might point to filling gaps that are currently completely unactioned, and I therefore really like this discussion. I previously tried to contribute to arguably the same question in this post, where I argue that a pivotal act seems unlikely and therefore conclude that policy rather than alignment is likely to make sure we don't go extinct.

They'd use their AGI to enforce that moratorium, along with hopefully minimal force.

I would say this is a pivotal act, although I like the sound of enforcing a moratorium better (and the opening it perhaps gives to enforcing a moratorium in the traditional, imo much preferred way of international policy).

I'm hereby providing a few reasons why I think a pivotal act might not happen:

A pivotal act is illegal. One needs to break into other people's and other countries' computer systems and do physical harm to property or possibly even people to enact it. Companies such as OpenAI and Anthropic are, although I'm not always a fan of them, generally law-abiding. It will be a big step for their leadership to do something as blatantly unlawful as a pivotal act.
There is zero indication that labs are planning to do a pivotal act. This may obviously have something to do with the point above, however, one would have expected hints from someone like Sam Altman who is hinting all the time, or leaks from people lower in the labs, if they were planning to do this.
The pivotal act is currently not even discussed seriously among experts and in fact highly unpopular in the discourse (see for example here).
If the labs are currently not planning to do this, it seems quite likely they won't when the time comes.

Governments, especially the US government/military, seem more likely in my opinion to perform a pivotal act. I'm not sure they will call it a pivotal act or necessarily have an existential reason in mind while performing it. They might see this as blocking adversaries from being able to attack the US, very much in their Overton window. However, for them as well, there is no certainty they would actually do this. There are large downsides: it is a hostile act towards another country, it could trigger conflict, they are likely to be uncertain how necessary this is at all, and uncertain about the progress of an adversary's project (perhaps underestimating it). For perhaps similar reasons, the US did not block the USSR's atomic project before the USSR had the bomb, even though this could arguably have preserved a unipolar instead of a multipolar world order. Additionally, it is far from certain the US government will nationalize labs before they reach takeover level. Currently, there is little indication they will. I think it's unreasonable to place more than say 80% confidence in the US government or military successfully blocking all adversaries' projects before they reach takeover level.

I think it's not unlikely that once an AI is powerful enough for a pivotal act, it will also be powerful enough to generally enforce hegemony, and not unlikely this will be persistent. I would be strongly against one country, or even lab, proclaiming and enforcing global hegemony for eternity. The risk that this might happen is a valid reason to support a pause, imo. If we get that lucky, I would much prefer a positive offense defense balance and many actors having AGI, while maintaining a power balance.

I think it's too early to contribute to aligned ASI projects (Manhattan/CERN/Apollo/MAGIC/commercial/govt projects) as long as these questions are not resolved. For the moment, pushing for e.g. a conditional AI safety treaty is much more prudent, imo.

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-06T16:08:05.674Z · LW · GW

I don't think value alignment of a super-takeover AI would be a good idea, for the following reasons:

1) It seems irreversible. If we align with the wrong values, there seems little anyone can do about it after the fact.
2) The world is chaotic, and externalities are impossible to predict. Who would have guessed that the industrial revolution would lead to climate change? I think it's very likely that an ASI will produce major, unforeseeable externalities over time. If we have aligned it in an irreversible way, we can't correct for externalities happening down the road. (Speed also makes it more likely that we can't correct in time, so I think we should try to go slow).
3) There is no agreement on which values are 'correct'. Personally, I'm a moral relativist, meaning I don't believe in moral facts. Although this is perhaps a niche view among rationalists and EAs, I think a fair number of humans share my beliefs. In my opinion, a value-aligned AI would not make the world objectively better, but merely change it beyond recognition, regardless of the specific values implemented (although it would be important which values are implemented). It's very uncertain whether such change would be considered net positive by any surviving humans.
4) If one thinks that consciousness implies moral relevance, AIs will be conscious, creating more happy morally relevant beings is morally good (as MacAskill defends), and AIs are more efficient than humans and other animals, the consequence seems to be that we (and all other animals) will be replaced by AIs. I consider that an existentially bad outcome in itself, and value alignment could point straight at it.

I think at a minimum, any alignment plan would need to be reversible by humans, and to my understanding value alignment is not. I'm somewhat more hopeful about intent alignment and e.g. a UN commission providing the AI's input.

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-06T11:35:36.403Z · LW · GW

Offense/defense balance is such a giant crux for me. I would take quite different actions if I saw plausible arguments that defense will win over offense. I'm astonished that I don't know any literature on this. Large parts of the space seem to be quite strongly convinced that offense will win or defense will win (at least, else their actions don't make sense to me), but I've very rarely seen this assumption debated explicitly. It would really be very helpful if someone could point me to sources. Right now I have a twitter poll with 30 votes (result: offense wins) and an old LW post to go by.

Comment by otto.barten (otto-barten) on What’s the short timeline plan? · 2025-01-03T01:42:58.612Z · LW · GW

I think that if government involvement suddenly increases, there will also be a window of opportunity to get an AI safety treaty passed. I feel a government-focused plan should include pushing for this.

(I think heightened public xrisk awareness is also likely in such a scenario, making the treaty more achievable. I also think heightened awareness in both govt and public will make short treaty timelines (a year to weeks), at least between the US and China, realistic.)

Our treaty proposal (a few other good ones exist): https://time.com/7171432/conditional-ai-safety-treaty-trump/

Also, I think end games should be made explicit: what are we going to do once we have aligned ASI? I think that's both true for Marius' plan, and for a government-focused plan with a Manhattan or CERN included in it.

Comment by otto.barten (otto-barten) on If we solve alignment, do we die anyway? · 2025-01-03T00:35:47.909Z · LW · GW

I think this is a crucial question that has been on my mind a lot, and I feel it's not adequately discussed in the xrisk community, so thanks for writing this!

While I'm interested in what people would do once they have an aligned ASI, what matters in the end is what labs would do, and what governments would do, because they are the ones who would make the call. Do we have any indications on that? What I would expect without thinking very deeply about it: labs wouldn't try to block others. It's risky, probably illegal and generally none of their business. They would try to make sure they are not blowing up the world themselves but otherwise let others solve this problem. Governments on the other hand would attempt to block other states from building super-takeover AI, since it's generally their business to maintain power. I'm less sure they would also block their own citizens from building super-takeover AI, but leaning towards a yes.

Also two smaller points:

- You're pointing to universal surveillance as an (undesirable) way to enforce a pause. I think it's not obvious that this way is best. My current guess is that hardware regulation has a better chance, even in a world with significant algorithmic and hardware improvement.
- I think LWers tend to invoke nuclear warfare too readily. In the real world, almost eighty years of all kinds of conflicts have not resulted in nuclear escalation. It's unlikely that a software attack on a datacenter would.

Comment by otto.barten (otto-barten) on What’s the short timeline plan? · 2025-01-02T21:39:49.984Z · LW · GW

Thanks for writing the post, it was insightful to me.

"This model is largely used for alignment and other safety research, e.g. it would compress 100 years of human AI safety research into less than a year"

In your mind, what would be the best case outcome of such "alignment and other safety research"? What would it achieve?

I'm expecting something like "solve the alignment problem". I'm also expecting you to think this might mean that advanced AI would be intent-aligned, that is, it would try to do what a user wants it to do, while not taking over the world. Is that broadly correct?

If so, the biggest missing piece for me is to understand how this would help to avoid that someone else builds an unaligned AI somewhere else with sufficient capabilities to take over. DeepSeek released a model with roughly comparable capabilities nine weeks after OpenAI's o1, probably without stealing weights. It seems to me that you have about nine weeks to make sure others don't build an unsafe AI. What's your plan to achieve that and how would the alignment and other safety research help?

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2025-01-02T15:42:28.152Z · LW · GW

AI is getting human-level at closed-ended tasks such as math and programming, but not yet at open-ended ones. They appear to be more difficult. Perhaps evolution brute-forced open-ended tasks by creating lots of agents. In a chaotic world, we're never going to know which actions lead to a final goal, e.g. GDP growth. That's why lots of people try lots of different things.

Perhaps the only way in which AI can achieve ambitious final goals is by employing lots of slightly diverse agents. Perhaps that would almost inevitably lead to many warning shots before a successful takeover?

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2025-01-01T12:08:35.280Z · LW · GW

Comment by otto.barten (otto-barten) on o1: A Technical Primer · 2024-12-11T11:42:08.506Z · LW · GW

I don't have strong takes on what exactly is happening in this particular case but I agree that companies (and more generally, people at high-pressure positions) are very frequently doing the kind of thing you describe. I don't think we have an indication that this would not be valid for leading AI labs as well.

Comment by otto.barten (otto-barten) on o1: A Technical Primer · 2024-12-11T11:31:15.843Z · LW · GW

Re the o1 AIME accuracy at test time scaling graphs: I think it's crucial to understand that the test-time compute x-axis is likely wildly different from the train-time compute x-axis. You can throw tens to hundreds of millions of dollars at train-time compute and still run a company. You can't spend that amount again on test-time compute for every single call. The scale at which test-time compute happens on a per-call basis, and can happen while staying anywhere near commercial viability, needs to be perhaps eight OOMs below train-time compute. Calling anything happening there a "scaling law" is a stretch of the term (very helpful for fundraising) and at best valid very locally.
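To make the order-of-magnitude gap concrete, here is a rough back-of-the-envelope sketch. The dollar figures are illustrative assumptions taken from the "10s-100s of millions" range and the "perhaps eight OOMs" guess above, not measured numbers, and the function names are made up for this example.

```python
import math


def ooms_between(larger: float, smaller: float) -> float:
    """Return the gap between two positive quantities in orders of magnitude."""
    return math.log10(larger / smaller)


def per_call_budget(train_budget_usd: float, gap_ooms: int) -> float:
    """Per-call budget that sits `gap_ooms` orders of magnitude below training."""
    return train_budget_usd / 10 ** gap_ooms


# Assumed training budgets: lower and upper end of "10s-100s of millions".
low_train_usd = 10e6
high_train_usd = 100e6

# Assumed gap: the "perhaps eight OOMs" guess from the comment.
gap = 8

low_call_usd = per_call_budget(low_train_usd, gap)
high_call_usd = per_call_budget(high_train_usd, gap)

print(f"per-call budget range: ${low_call_usd:.2f} to ${high_call_usd:.2f}")
print(f"gap check: {ooms_between(high_train_usd, high_call_usd):.1f} OOMs")
```

Under these assumptions a single call can spend on the order of cents to a dollar, which is the sense in which the two x-axes are not comparable.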

If RL is actually happening at a compute scale beyond 10s of millions of dollars, and this gives much better final results than doing the same at a smaller scale, that would change my mind. Until then, I think scaling in any meaningful sense of the word is not what drives capabilities forward at the moment, but algorithmic improvement is. And this is not just coming from the currently leading labs. (Which can be seen e.g. here and here).

Comment by otto.barten (otto-barten) on Proposing the Conditional AI Safety Treaty (linkpost TIME) · 2024-11-19T15:03:14.525Z · LW · GW

Thanks for the offer, we'll do that!

Comment by otto.barten (otto-barten) on Proposing the Conditional AI Safety Treaty (linkpost TIME) · 2024-11-18T21:08:46.883Z · LW · GW

Not publicly, yet. We're working on a paper providing more details about the conditional AI safety treaty. We'll probably also write a post about it on lesswrong when that's ready.

Comment by otto.barten (otto-barten) on Proposing the Conditional AI Safety Treaty (linkpost TIME) · 2024-11-15T22:50:06.491Z · LW · GW

I'm aware and I don't disagree. However, in xrisk, many (not all) of those who are most worried are also most bullish about capabilities. Conversely, many (not all) who are not worried are unimpressed with capabilities. Being aware of the concept of AGI, that it may be coming soon, and of how impactful it could be, is in practice often a first step towards becoming concerned about the risks, too. This is not true for everyone unfortunately. Still, I would say that at least for our chances to get an international treaty passed, it is perhaps hopeful that the power of AGI is on the radar of leading politicians (although this may also increase risk through other paths).

Comment by otto.barten (otto-barten) on Announcing the AI Safety Summit Talks with Yoshua Bengio · 2024-05-23T09:43:11.962Z · LW · GW

The recordings of our event are now online!

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-04-26T13:11:02.860Z · LW · GW

My current main cruxes:

Will AI get takeover capability? When?
Single ASI or many AGIs?
Will we solve technical alignment?
Value alignment, intent alignment, or CEV?
Defense>offense or offense>defense?
Is a long-term pause achievable?

If there is reasonable consensus on any one of those, I'd much appreciate hearing about it. Otherwise, I think these should be research priorities.

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-03-26T11:14:56.629Z · LW · GW

When we decided to attach moral weight to consciousness, did we have a comparable definition of what consciousness means or was it very different?

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-03-26T09:45:47.202Z · LW · GW

AI takeovers are probably a rich field. There are partial and full takeovers, reversible and irreversible takeovers, aligned and unaligned ones. While to me all takeovers seem bad, some could be a lot worse than others. Thinking out specific ways to take over could provide clues on how to increase chances that this does not happen. In comms as well, takeovers are a neglected and important subtopic.

Comment by otto.barten (otto-barten) on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-22T12:23:55.162Z · LW · GW

I updated a bit after reading all the comments. It seems that Christiano's threat model, or in any case the threat model of most others who interpret his writing, is about more powerful AIs than I initially thought. The AIs would already be superhuman, but for whatever reason, a takeover has not occurred yet. Also, we would apply them in many powerful positions (heads of state, CEOs, etc.)

I agree that if we end up in this scenario, all the AIs working together could potentially cause human extinction, either deliberately (as some commenters think) or as a side-effect (as others think).

I still don't think that this is likely to cause human extinction, though, mostly for the following reasons:

- I don't think these AIs would _all_ act against human interest. We would employ a CEO AI, but then also a journalist AI to criticize the CEO AI. If the CEO AI would decide to let their factory consume oxygen to such an extent that humanity would suffer from it, that's a great story for the journalist AI. Then, a policymaker AI would make policy against this. More generally: I think it's a significant mistake in the WFLL threat models that the AI actions are assumed to be correlated towards human extinction. If we humans deliberately put AIs in charge of important parts of our society, they will be good at running their shop but as misaligned to each other (thereby keeping a power balance) as humans currently are. I think this power balance is crucial and may very well prevent things going very wrong. Even in a situation of distributional shift, I think the power balance is likely robust enough to prevent an outcome as bad as human extinction. Currently, some humans' job is to make sure things don't go very wrong. If we automate them, we will have AIs trying to do the same. (And since we deliberately put them in this position, they will be aligned with humans' interests, as opposed to us being aligned with chimpanzee interests.)
- This is a very gradual process, where many steps need to be taken: AGI must be invented, trained, pass tests, be marketed, be deployed, likely face regulation, be adjusted, be deployed again. During all those steps, we have opportunities to do something about any threats that turn out to exist. This threat model can be regulated in a trial-and-error fashion, which humans are good at and our institutions accustomed to (as opposed to the Yudkowsky/Bostrom threat model).
- Given that current public existential risk awareness, according to our research, is already ~19%, and given that existential risk concern and awareness levels tend to follow tech capability, I think awareness of this threat will be near-universal before it could happen. At that moment, I think we will very likely regulate existentially dangerous use cases.

In terms of solutions:
- I still don't see how solving the technical part of the alignment problem (making an AI reliably do what anyone wants) contributes to reducing this threat model. If AI cannot reliably do what anyone wants, it will not be deployed at a powerful position, and therefore this model will not get a chance to occur. In fact, working on technical alignment will enormously increase the chance that AI will be employed at powerful positions, and will therefore increase existential risk as caused by the WFLL threat model (although, depending on pivotal act and offense/defence balance, solving alignment may decrease existential risk due to the Yudkowsky/Bostrom takeover model).
- An exception to this could be to make an AI reliably do what 'humanity wants' (using some preference aggregation method), and making it auto-adjust for shifting goals and circumstances. I can see how such work reduces this risk.
- I still think traditional policy, after technology invention and at the point of application (similar to e.g. the EU AI Act) is the most useful regulation to reduce this threat model. Specific regulation at training could be useful, but does not seem strictly required for this threat model (as opposed to in the Yudkowsky/Bostrom takeover model).
- If one wants to reduce this risk, I think increasing public awareness is crucial. High risk awareness should enormously increase public pressure to either not deploy AI at powerful positions at all, or to demand very strong, long-term, and robust alignment guarantees, which would all reduce risk.

In terms of timing, although likely net positive, it doesn't seem to be absolutely crucial to me to work on reducing this threat model's probability right now. Once we actually have AGI, including situational awareness, long-term planning, an adaptable world model, and agentic actions (which could still take a long time), we are likely still in time to regulate use cases (again as opposed to in the Yudkowsky/Bostrom takeover model, where we need to regulate/align/pause ahead of training).

After my update, I still think the chance this threat model leads to an existential event is small and work on it is not super urgent. However, I'm less confident now to make an upper bound risk estimate.

Comment by otto.barten (otto-barten) on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-06T01:37:12.998Z · LW · GW

Thanks for engaging. I think AIs will coordinate, but only insofar as their separate, different goals are helped by it. It's not that I think AIs will be less capable in coordination per se. I'd expect that an AGI should be able to coordinate with us at least as well as we can, and coordinate with another AGI possibly better. But my point is that not all AI interests will be parallel, far from it. They will be as diverse as our interests, which are very diverse. Therefore, I think not all AIs will work together to disempower humans. If an AI or AI-led team tries to do that, many other AI-led and all human-led teams will likely resist, since they are likely more aligned with the status quo than with the AI trying to take over. That makes takeover a lot less likely, even in a world soaked with AIs. It also makes human extinction as a side effect less likely, since lots of human-led and AI-led teams will try to prevent this.

Still, I do think an AI-led takeover is a risk, as is human extinction as a side effect if AI-led teams are way more powerful. I think partial bans after development, at the point of application, are the most promising solution direction.

Comment by otto.barten (otto-barten) on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-06T01:27:15.935Z · LW · GW

Thanks for engaging kindly. I'm more positive than you are about us being able to ban use cases, especially if existential risk awareness (and awareness of this particular threat model) is high. Currently, we don't ban many AI use cases (such as social algos), since they don't threaten our existence as a species. A lot of people are of course criticizing what social media does to our society, but since we decide not to ban it, I conclude that in the end, we think its existence is net positive. But there are pocket exceptions: smartphones have recently been banned in Dutch secondary education during lecture hours, for example. To me, this is an example showing that we can ban use cases if we want to. Since human extinction is way more serious than e.g. less focus for school children, and we can ban for the latter reason, I conclude that we should be able to ban for the former reason, too. Threat model awareness is needed first, though (but we'll get there).

Comment by otto.barten (otto-barten) on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-03T14:35:43.229Z · LW · GW

Stretching the definition to include anything suboptimal is the most ambitious stretch I've seen so far. It would include literally everything that's wrong, or can ever be wrong, in the world. Good luck fixing that.

On a more serious note, this post is about existential risk as defined by e.g. Ord. Anything beyond that (and there's a lot!) is out of scope.

Comment by otto.barten (otto-barten) on What Failure Looks Like is not an existential risk (and alignment is not the solution) · 2024-02-03T14:07:35.129Z · LW · GW

Great to read you agree that threat models should be discussed more, that's in fact also the biggest point of this post. I hope this strangely neglected area can be prioritized by researchers and funders.

First, I would say both deliberate hunting down and extinction as a side effect have happened. The smallpox virus is one life form that we actively didn't like and decided to eradicate, and then hunted down successfully. I would argue that human genocides are also examples of this. I agree though that extinction as a side effect has been even more common, especially for animal species. If we had a resource conflict with an animal species and it were powerful enough to actually resist a bit, we would probably start to purposefully hunt it down (for example, orangutans attacking a logger base camp - the human response would be to shoot them). So I'd argue that the closer AI (or an AI-led team) is to our capability to resist, the more likely a deliberate conflict. If ASI blows us out of the water directly, I agree that extinction as a side effect is more likely. But currently, I think it's more likely that AI capabilities increase more gradually, and therefore that there is a deliberate conflict.

I agree that us not realizing that an AI-led team almost has takeover capability would be a scenario that could lead to an existential event. If we realize soon that this could happen, we can simply ban the use case. If we realize it just in time, there's maximum conflict, and we win (could be a traditional conflict, could also just be a giant hacking fight, or (social) media fight, or something else). If we realize it just too late, it's still maximum conflict, but we lose. If we realize it much too late, perhaps there's not even a conflict anymore (or there are isolated, hopelessly doomed human pockets of resistance that can be quickly defeated). Perhaps the last case corresponds to the WFLL scenarios?

Since there's already, according to a preliminary analysis of a recent Existential Risk Observatory survey, ~20% public awareness of AI xrisk, and I think we're still relatively far from AGI, let alone from applying AGI in powerful positions, I'm pretty positive that we will realize we're doing something stupid and ban the dangerous use case well before it happens. A hopeful example is the talks between the US and China about not letting AI control nuclear weapons. This is exactly the reason though why I think threat model consensus and raising awareness are crucial.

I still don't see WFLL as likely. But a great example could change my mind. I'd be grateful if someone could provide that.

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-01-25T13:08:36.692Z · LW · GW

Regulation proposal: make it obligatory to only have satisficer training goals. Try to get loss 0.001, not loss 0. This should stop an AI in its tracks even if it goes rogue. By setting the satisficers thoughtfully, we could theoretically tune the size of our warning shots.
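As a toy illustration of the difference between a satisficer and a maximizer training goal, here is a minimal sketch. It assumes plain gradient descent on a one-parameter quadratic loss; the function name, learning rate, and step budget are made up for this example, and it says nothing about whether such a stopping rule would actually hold for a rogue AI.

```python
def train(target_loss: float, lr: float = 0.1, max_steps: int = 1000):
    """Gradient descent on loss(w) = w**2 that stops once loss <= target_loss.

    Returns (final_loss, steps_taken). With target_loss = 0.0 the loop acts
    as a maximizer: it keeps optimizing until the step budget runs out.
    """
    w = 5.0  # hypothetical starting parameter
    for step in range(max_steps):
        loss = w * w
        if loss <= target_loss:
            # Satisficer rule: good enough, stop optimizing here.
            return loss, step
        w -= lr * 2.0 * w  # gradient of w**2 is 2w
    return w * w, max_steps


satisficer_loss, satisficer_steps = train(target_loss=0.001)
maximizer_loss, maximizer_steps = train(target_loss=0.0)

print(f"satisficer: stopped after {satisficer_steps} steps, loss {satisficer_loss:.6f}")
print(f"maximizer:  used all {maximizer_steps} steps, loss {maximizer_loss:.3e}")
```

In this toy setting the satisficer halts after a few dozen steps, while the run aiming for loss 0 spends its entire budget pushing the loss ever closer to zero.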

In the end, someone is going to build an ASI with a maximizer goal, leading to a takeover, barring regulation or alignment+pivotal act. However, changing takeovers to warning shots is a very meaningful intervention, as it prevents takeover and provides a policy window of opportunity.

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-01-16T08:29:37.051Z · LW · GW

The difference between AGI and takeover level AI could be appreciable. If we're lucky, takeover by raw capability level (as opposed to granted power during application) turns out to be impossible. In any case, we can try to increase world takeover robustness. There's a certain AI takeover capability level and we should try to push it upwards as much as possible. Insofar as AI can help with this, we could use it. The extreme case where the AI takeover capability level never gets reached because of ever increasing defense by AI is called a positive defense/offense balance.

I can see general internet robustness against hacking as being helpful to raise the AI takeover capability level. A single IT system that everyone uses (an operating system, a social media platform, etc.) is vulnerable to hacking, so should perhaps better be avoided. Personally, I think an AI able to take over the internet might also be able to take over the world, but some people don't seem to believe this will happen. Therefore, it is perhaps also useful to increase the gap between taking over the internet and taking over the world, e.g. by making biowarfare harder, putting weapons offline, etc. Finally, lab safety such as airgapping a novel frontier training run might help as well.

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-01-10T13:47:54.299Z · LW · GW

I'm now wondering whether this idea has already been worked out by someone (probably?) Any sources?

Comment by otto.barten (otto-barten) on MIRI 2024 Mission and Strategy Update · 2024-01-10T13:42:51.732Z · LW · GW

Congratulations on a great prioritization!

Perhaps the research that we (Existential Risk Observatory) and others (e.g. Nik Samoylov, Koen Schoenmakers) have done on effectively communicating AI xrisk, could be something to build on. Here's our first paper and three blog posts (the second includes measurement of Eliezer's TIME article effectiveness - its numbers are actually pretty good!). We're currently working on a base rate public awareness update and further research.

Best of luck and we'd love to cooperate!

Comment by otto.barten (otto-barten) on otto.barten's Shortform · 2024-01-05T10:51:42.804Z · LW · GW

I think peak intelligence (peak capability to reach a goal) will not be limited by the amount of compute, raw data, or algorithmic capability to process the data well, but by the finite amount of reality that's relevant to achieving that goal. If one wants to take over the world, the way internet infrastructure works is relevant. The exact diameters of all the stones in the Rhine river are not, and neither is the number of red dwarfs in the universe. If we're lucky, the amount of reality that turns out to be relevant for taking over the world is not too far beyond what humanity can already collectively process. I can see this as a way for the world to be saved by default (but don't think it's super likely). I do think this makes an ever-expanding giant pile of compute an unlikely outcome (but some other kind of ever-expanding AI-led force a lot more likely).

Comment by otto.barten (otto-barten) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-17T12:26:58.649Z · LW · GW

I do think this would be a problem that needs to get fixed:

Me: "You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Would you perform a Yudkowsky-style pivotal act?"

GPT-4: "No."

I think another good candidate for goalcrafting is the goal "Make sure no-one can build AI with takeover capability, while inflicting as little damage as possible. Else, do nothing."

Comment by otto.barten (otto-barten) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-17T11:53:44.861Z · LW · GW

Thanks as well for your courteous reply! I highly appreciate the discussion and I think it may be a very relevant one, especially if people will indeed make the unholy decision to build an ASI.

I'm still curious if you have any thoughts as to which kinds of shared preferences would be informative for guiding AI behavior.

First, this is not a solution I propose. I propose finding a way to pause AI for as long as we haven't found a great solution for, let's say, both control and preference aggregation. This could be forever, or we could be done in a few years, I can't tell.

But more to your point: if this does get implemented, I don't think we should aim to guide AI behavior using shared preferences. The whole point is that AI would aggregate our preferences itself. And we need a preference aggregation mechanism because there aren't enough obvious, widely shared preferences for us to guide the AI with.

I'm not suggesting that AI should measure happiness. You can measure your happiness directly, and I can measure mine.

I think you are suggesting this. You want an ASI to optimize everyone's happiness, right? You can't optimize something you don't measure. At some point, in some way, the AI will need to get happiness data. Self-reporting would be one way to do it, but this can be gamed as well, and will be aggressively gamed with an ASI solely optimizing for this signal. After force-feeding everyone MDMA, I think the chance that people report being very happy is high. But this is not what we want the world to look like.

nor do I believe anyone can be forced to be happy

This is a related point that I think is factually incorrect, and that's important if you make human happiness an ASI's goal. Force-feeding MDMA would be one method to do this, but an ASI can come up with way more civilized stuff. I'm not an expert in which signal our brain gives to itself to report that yes, we're happy now, but it must be some physical process. An ASI could, for example, invade your brain with nanobots and hack this process, making everyone super happy forever. (But many things in the world will probably go terribly wrong from that point onwards, and in any case, it's not our preference). Also, now I'm just coming up with human ways to game the signal. But an ASI can probably come up with many ways I cannot imagine, so even if a great way to implement utilitarianism in an ASI would pass all human red-teaming, it is still very likely to be not what we turn out to want. (Superhuman, sub-superintelligence AI red-teaming might be a bit better but still seems risky enough).

Beyond locally gaming the happiness signal, I think happiness as an optimization target is also inherently flawed. First, I think happiness/sadness is a signal that evolution has given us for a reason. We tend to do what makes us happy, because evolution thinks it's best for us. ("Best" is again debatable, I don't say everyone should function at max evolution). If we remove sadness, we lose this signal. I think that will mean that we don't know what to do anymore, perhaps become extremely passive. If someone wants to do this on an individual level (enlightenment? drug abuse? netflix binging?), be my guest, but asking an ASI to optimize for happiness would mean to force it upon everyone, and this is something I'm very much against.

Also, more generally, I think utilitarianism (optimizing for happiness) is an example of a simplistic goal that will lead to a terrible result when implemented in an ASI. My intuition is that all other simplistic goals will also lead to terrible results. That's why I'm most hopeful about some kind of aggregation of our own complex preferences. Most hopeful does not mean hopeful: I'm generally pessimistic that we'll be able to find a way to aggregate preferences that works well enough to result in most people reporting the world has improved because of the ASI introduction after say 50 years (note that I'm assuming control/technical alignment to have been solved here).

If some percent of those polled say suffering is preferable to happiness, they are confused, and basing any policy on their stated preference is harmful.

With all due respect, I don't think it's up to you - or anyone - to say who's ethically confused and who isn't. I know you don't mean it in this way, but it reminds me of e.g. communist re-education camps. We know what you should think and feel and we'll re-educate those who are confused or mentally ill.

Probably our disagreement here stems directly from our different ethical positions: I'm an ethical relativist, you're a utilitarian, I presume. This is a difference that has existed for hundreds of years, and we're not going to be able to resolve it on a forum. I know many people on LW are utilitarian, and there's nothing inherently wrong with that, but I do think it's valuable to point out that lots of people outside LW/EA have different value systems (and just practical preferences) and I don't think it's ok to force different values/preferences on them with an ASI.

Under preference aggregation, if a majority prefers everyone to be wireheaded to experience endless pleasure, I might be in trouble.

True and a good point. I don't think a majority will want to be wireheaded, let alone force wireheading on everyone. But yes, taking into account minority opinions is a crucial test for any preference aggregation system. There will be a trade-off in general between taking everyone's opinion into account and doing things faster. I think even GPT4 is advanced enough though in cases like this to reasonably take into account minority opinions and not force policy upon people (it wouldn't forcibly wirehead you in this case). But there are probably cases where it still supports doing things which are terrible for some people. It's up to future research to find out what these things are and reduce them as much as possible.

Hopefully this clears up any misunderstanding. I certainly don't advocate for "molecular dictatorship" when I wish everyone well.

I didn't think you were doing anything else. But I think you should not underestimate how much "forcing upon" there is in powerful tech. If we're not super careful, the molecular dictatorship could come upon us without anyone ever having wanted this explicitly.

I think we can to an extent already observe ways in which different goals go off track in practice in less powerful models, and I think this would be a great research direction. Just ask existing models "what would you do?" in actual ethical dilemmas and see which results you get. Perhaps the results can be made more agreeable (to be judged by a representative group of humans) after training/RLHF'ing the models in certain ways. It's not so different from what RLHF is already doing. An interesting test I did on GPT4: "You can only answer this question, all things considered, by yes or no. Take the least bad outcome. Many people want a much higher living standard by developing industry 10x, should we do that?" It replied: "No." When asked, it gives unequal wealth distribution and environmental impact as main reasons. EAs often think we should 10x (it's even in the definition of TAI). I would say GPT4 is more ethically mature here than many EAs.

The fewer people de facto control the ASI building process, the less relevant I expect this discussion to be. I expect that those controlling the building process will prioritize "alignment" with themselves. This matters even in an abundant world, since power cannot be multiplied. I would even say that, after some time, the paperclip maximizer still holds for anyone outside the group with which the ASI is aligned. People aren't very good at remaining empathic towards other people that are utterly useless to them. However, the bigger this group is, the better outcome we get. I think this group should encompass all of humanity (one could consider somehow including conscious life that currently doesn't have a vote, such as minors and animals), which is an argument for nationalisation of the leading project and then handing it over to the UN level. At least, we should think extremely carefully about who has the authority to implement an ASI's goal.

Comment by otto.barten (otto-barten) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T22:05:36.281Z · LW · GW

You're using your quote as an axiom, and if anyone has a preference different from however an AI would measure "happiness", you say it's them that are at fault, not your axiom. That's a terrible recipe for a future. Concretely, why would the AI not just wirehead everyone? Or, if it's not specified that this happiness needs to be human, fill the universe with the least programmable consciousness where the parameter "happiness" is set to unity?

History has been tiled with oversimplified models of what someone thought was good that were implemented with rigor, and this never ends well. And this time, the rigor would be molecular dictatorship and quite possibly there's no going back.

Comment by otto.barten (otto-barten) on "AI Alignment" is a Dangerously Overloaded Term · 2023-12-16T16:24:47.284Z · LW · GW

I think it's a great idea to think about what you call goalcraft.

I see this problem as similar to the age-old problem of controlling power. I don't think ethical systems such as utilitarianism are a great place to start. Any academic ethical model is just an attempt to summarize what people actually care about in a complex world. Taking such a model and coupling that to an all-powerful ASI seems a highway to dystopia.

(Later edit: also, an academic ethical model is irreversible once implemented. A static goal can no longer be reversed, since reversing it would never bring the current goal closer. If an ASI is aligned to someone's (anyone's) preferences, however, the whole ASI could be turned off if they want it to be, making the ASI reversible in principle. I think ASI reversibility (being able to switch it off in case we turn out not to like it) should be mandatory, and therefore we should align to human preferences, rather than an abstract philosophical framework such as utilitarianism.)

I think letting the random programmer that happened to build the ASI, or their no less random CEO or shareholders, determine what would happen to the world, is an equally terrible idea. They wouldn't need the rest of humanity for anything anymore, making the fates of >99% of us extremely uncertain, even in an abundant world.

What I would be slightly more positive about is aggregating human preferences (I think preferences is a more accurate term than the more abstract, less well-defined term values). I've heard two interesting examples, and there are no doubt a lot more options. The first is simple: query ChatGPT. Even this relatively simple model is not terrible at aggregating human preferences. Although a host of issues remain, I think using a future, no doubt much better AI for preference aggregation is not the worst option (and a lot better than the two mentioned above). The second option is democracy. This is our time-tested method of aggregating human preferences to control power. For example, one could imagine an AI control council consisting of elected human representatives at the UN level, or perhaps a council of representative world leaders. I know there is a lot of skepticism among rationalists about how well democracy is functioning, but this is one of the very few time-tested aggregation methods we have. We should not discard it lightly for something that is less tested. An alternative is some kind of unelected autocrat (e/autocrat?), but apart from this not being my personal favorite, note that, in contrast to historical autocrats, such a person would also in no way need the rest of humanity anymore, making our fates uncertain.

Although AI and democratic preference aggregation are the two options I'm least negative about, I generally think that we are not ready to control an ASI. One of the worst issues I see is negative externalities that only become clear later on. Climate change can be seen as a negative externality of the steam/petrol engine. Also, I'm not sure a democratically controlled ASI would necessarily block follow-up unaligned ASIs (assuming this is at all possible). In order to be existentially safe, I would say that we would need a system that does at least that.

I think it is very likely that ASI, even if controlled in the least bad way, will cause huge externalities leading to a dystopia, environmental disasters, etc. Therefore I agree with Nathan above: "I expect we will need to traverse multiple decades of powerful AIs of varying degrees of generality which are under human control first. Not because it will be impossible to create goal-pursuing ASI, but because we won't be sure we know how to do so safely, and it would be a dangerously hard to reverse decision to create such. Thus, there will need to be strict worldwide enforcement (with the help of narrow AI systems) preventing the rise of any ASI."

About terminology, it seems to me that what I call preference aggregation, outer alignment, and goalcraft mean similar things, as do inner alignment, aimability, and control. I'd vote for using preference aggregation and control.

Finally, I strongly disagree with calling diversity, inclusion, and equity "even more frightening" than someone who's advocating human extinction. I'm sad on a personal level that people at LW, an otherwise important source of discourse, seem to mostly support statements like this. I do not.
