Top lesson from GPT: we will probably destroy humanity "for the lulz" as soon as we are able.

post by Shmi (shminux) · 2023-04-16T20:27:19.665Z · LW · GW · 28 comments


Forget complicated "sharp left turn" schemes, nefarious nanobots, lists of lethalities, out-of-distribution actions, failed AI boxing. As Zvi pointed out in multiple posts, like this one, if humans get unrestricted access to a powerful enough tool, it is all over. People will intentionally twist even the most aligned Tool AI into an Agent of Death, long before it becomes superintelligent and is able to resist. You can find examples of it online.

In that sense, Eliezer was wrong in the worst possible way: we have a lot less time to get our act together, because capabilities advance faster than intelligence and humans are very inventive at finding ways to misuse these capabilities. We will push these capabilities in the "gain of function" direction mercilessly and without regard for safety or anything else. We are worse than toddlers playing with matches. True, like with toddlers, our curiosity far outstrips our sense of self-preservation, probably because our brains are not wired to be afraid of something that is not like a snake or a spider or a steep cliff. But it is worse than that. People will try to do the worst thing imaginable because they do not "alieve" potential harm, even if they can track it logically, unlike a toddler.

I guess the silver lining is that we have a bit of time to iterate. The AI tools are not yet at the level of causing widespread destruction, and probably will not be for some time. That does not mean we will be well prepared if and when some superintelligence emerges, but if humanity survives until then without self-annihilating, we might have a better chance than in the "one shot at getting it right before we are all wiped out" scenario that Eliezer emphasized. It might not be an "alignment manual from the surviving future", but at least some wisdom and discipline in avoiding the early pitfalls, and if we die, we might die with "more dignity". The odds are not great, but maybe they are there.

Edit: quanticle pointed out that Bostrom predicted it in the paper The Vulnerable World Hypothesis:

28 comments

Comments sorted by top scores.

comment by tailcalled · 2023-04-16T21:50:53.604Z · LW(p) · GW(p)

Counterargument: People will be able to cause significant destruction far before they are able to cause the end of the world, and if people start using powerful AI to do significant destruction for the lulz then that will motivate a lot of lockdown on AI access.

Replies from: shminux, gesild-muka
comment by Shmi (shminux) · 2023-04-16T22:17:19.118Z · LW(p) · GW(p)

Yeah, that is part of the silver lining, I should have been clearer: that we will have a chance to iterate over issues like that, including a potential lockdown, before superintelligence becomes inevitable.

comment by Gesild Muka (gesild-muka) · 2023-04-17T15:48:42.977Z · LW(p) · GW(p)

I've referred to the possibility of significant destruction in this context as the first 'Chernobyl-type event' involving AI; it will greatly inform policy and legislation regarding AI tech. We'd have to make predictions about what form the disaster takes to discuss the efficacy of such legislation or lockdown on access.

It could be, for example, that many self-driving cars malfunction all at once and cause a lot of damage and grief. This first scenario would probably lead to specific policies but little if any broad oversight. Another example: the AI disaster could be psychological (see virtual romance in China) or economic in nature, causing much suffering over a long period that goes unnoticed for a time.

If it's the latter scenario I can see strong political lines forming over AI safety with pro- and anti- tech/lulz supporters. The prospect of humanity's destruction, then, is at least partially dependent on our ability to govern ourselves. So I can't blame the alignment community for focusing more on the technical aspects of alignment, as difficult as they are, instead of the social aspects. The social aspects may be easier, all things considered, but are emotionally exhausting, which is why so many are firmly resigned to doom.

comment by dr_s · 2023-04-17T11:16:19.928Z · LW(p) · GW(p)

I don't think this is the foremost existential risk because I think it's reasonable to believe that:

  • if aligned AGIs (or anything able to destroy the world) exist, the most powerful ones will still be under control of governments or large companies. Even if the model itself was leaked, anyone with a lot of resources would have an advantage by holding a lot more compute than the average Joe;

  • if these aligned AGIs were not so superhuman and transformative to instantaneously destroy the world (which they shouldn't be, or we'd already be in some kind of singularity anyway), then holding more of them with more compute should constitute a significant defensive advantage.

Yes, in general offence is easier than defence and destruction easier than creation. But what precisely would this hypothetical terrorist or nihilistic madman do to be so unstoppable that nothing else - not even all the other AGIs - stands a chance of countering it? Bioweapons can be fought, contained, countered; even more so if you have lots of artificial smarts on your side. Any attempt at FOOMing should be detectable by sharper intellects, and anyway, if AIs could FOOM that way, the bigger ones likely would have already (for good or bad). Pretty simple measures can be taken to protect things like control over nuclear weapons, and again, aligned government AGIs would be on the front line of defence against any attempts at hacking or such, so the individual AGI would still find itself outgunned.

So yeah, I think this really falls into the "we must be really stupid and drop the ball to go extinct to this sort of mishap" category. That said, people would still be able to do a lot of damage and I don't like what that would do to our society as a whole. Instead of school shooters we'd have the occasional guy who managed to turn a city into nanites before being stopped, or some such insanity. You'd soon have everyone asking for total AGI surveillance to stop that sort of thing, and goodbye freedom and privacy. But I wouldn't expect extinction from it.

comment by Vladimir_Nesov · 2023-04-16T21:01:18.542Z · LW(p) · GW(p)

The agent software (as it matures, collecting all the tricks from all the papers) makes it more likely that the first agents capable of autonomous survival are at barely human level and still incapable of doing open-ended research (because this way autonomous survival wouldn't need to be overdetermined by having way-more-than-sufficient capabilities in the underlying LLM). Then some uncontrolled barely-AGIs go on to live on the Internet without being an extraordinary threat, perhaps for years while the labs are still working on the research capabilities, perhaps even using APIs rather than local models to think. And people get used to that.

comment by FireStormOOO · 2023-04-17T03:44:24.740Z · LW(p) · GW(p)

Getting clear, impossible-to-ignore warning shots first would be a good thing on net, even if unpleasant in the moment. Unless you're suggesting that simple (non-AGI) AI tools are going to be civilization-threatening - but I'm not seeing it and you didn't argue it.

Replies from: shminux
comment by Shmi (shminux) · 2023-04-17T04:28:42.909Z · LW(p) · GW(p)

Right, that is the silver lining. Whether it is enough to counterbalance people actively trying to set the world on fire, I am doubtful.

Replies from: FireStormOOO
comment by FireStormOOO · 2023-04-17T05:40:07.683Z · LW(p) · GW(p)

I suppose that depends a lot on how hard anyone is trying to cause mischief, and how much easier it's going to get to do anything of consequence. 4chan is probably a good prototype of your typical troll "in it for the lulz", and while they regularly go past what most would call harmless fun, there's not a body count.

The other thing people worry about (and the news has apparently decided is the thing we all need to be afraid of this month...) is conventional bad actors using new tools to do substantially whatever they were trying to do before, but more: mostly confuse, defraud, spread propaganda, what have you. I'm kind of surprised I don't already have an inbox full of LLM-composed phishing emails... On some level it's a threat, but it's also not a particularly hard one to grasp; it's getting lots of attention, and new weapons and tactics are a constant in conflicts of all types.

I'm still of the mind that directly harmful applications like the above are going to pale next to the economic disruption and social unrest that's going to come from making large parts of the workforce redundant very quickly.  Talking specific policy doesn't look like it's going to be in the Overton window until after AI starts replacing jobs at scale, and the "we'll have decades to figure it out" theory hasn't been looking good of late.  And when that conversation starts it's going to suck all the air out of the room and leave little mainstream attention for worrying about AGI.

comment by avturchin · 2023-04-17T10:30:19.469Z · LW(p) · GW(p)

This also may happen with biological viruses and other GMO organisms – and can partly explain the proliferation of gain-of-function research, which is mostly useless, but dangerous. We like to play God and like the end of the world. (But with biological viruses there is a limitation: biohackers will expect that a newly printed smallpox virus will kill them and their families first. But they could start with personal vaccines.)

And this already happened in the 1980s, when people started to create computer viruses for fun.

Replies from: GdL752
comment by GdL752 · 2023-04-17T23:44:42.790Z · LW(p) · GW(p)

This seems like the most obvious short-term scenario that will occur. We have doomsday cults right now, today.

Counterpoint: the once-a-century pandemic happened before now. So we can make vaccines much faster than ever thought possible, but given the...material...timeline and all the factors for virulence and debility/lethality at play with bioweapons, I'm not sure that's much comfort.

It seems like the kind of thing where we'll almost assuredly be reacting to such an event vs whatever guardrails can be put in place.

comment by ryan_greenblatt · 2023-04-16T23:28:32.571Z · LW(p) · GW(p)

Counterargument: you can just defend against these AIs running amok.

As long as most AIs are systematically trying to further human goals you don't obviously get doomed (though the situation is scary).

There could be offense-defense imbalances, but there are also 'tyranny of the majority' advantages.

Replies from: shminux
comment by Shmi (shminux) · 2023-04-16T23:53:40.485Z · LW(p) · GW(p)

That's not the point though. Humans don't want to defend, they want to press the big red button and will gain-of-function an AI to make the button bigger and redder.

Replies from: ryan_greenblatt
comment by ryan_greenblatt · 2023-04-17T01:00:48.852Z · LW(p) · GW(p)

Huh? Definitely some humans will try to defend...

Replies from: shminux
comment by Shmi (shminux) · 2023-04-17T01:53:21.001Z · LW(p) · GW(p)

Yes, sorry, some definitely will. But if you look at what is going on now, people are pushing in all kinds of dangerous directions with reckless abandon, even knowing logically that it might be a bad idea.

Replies from: Raemon
comment by Raemon · 2023-04-17T02:01:58.788Z · LW(p) · GW(p)

I think "wants to defend" is actually pretty orthogonal to "wants to recklessly advance AI."

Replies from: shminux
comment by Shmi (shminux) · 2023-04-17T02:12:33.648Z · LW(p) · GW(p)

hmm, orthogonal or just a different crowd/mindset?

Replies from: Raemon
comment by Raemon · 2023-04-17T14:07:05.266Z · LW(p) · GW(p)

Rather than figure out what each of those means exactly, I'll say "I don't expect the psychological forces pushing towards faster research and release of more capabilities to actually resist building the sort of tools that'd be useful for defending against AI."

comment by TinkerBird · 2023-04-17T23:34:33.189Z · LW(p) · GW(p)

As someone who's been pinning his hopes on a 'survivable disaster' to wake people up to the dangers, this is good news.  

I doubt anything capable of destroying the world will come along significantly sooner than superintelligent AGI, and a world in which there are disasters due to AI feels like a world that is much more likely to survive compared to a world in which the whirling razorblades are invisible. 

EDIT: "no fire alarm for AGI." Oh I beg to differ, Mr. Yudkowsky. I beg to differ. 

comment by jkraybill · 2023-04-17T16:02:06.853Z · LW(p) · GW(p)

Seeing this frantic race from random people to give GPT-4 dangerous tools and walking-around money, I agree: the risk is massively exacerbated by giving the "parent" AIs to humans.

Upon reflection, should that be surprising? Are humans "aligned" how we would want AI to be aligned? If so, we must acknowledge the fact that humanity regularly produces serial killers and terrorists (etc). Doesn't seem ideal. How much more aligned can we expect a technology we produce, vs our own species?

If we view the birth of AGI as the birth of a new kind of child, to me, there really is no regime known to humanity that will guarantee that child will not grow up to become an evil monster: we've been struggling with that question for millennia as humans. One thing we definitely have found is that super-evil parents are way better than average at producing super-evil children, but sometimes it seems like super-evil children just come into being, despite their parents. So a super-evil person controlling/training/scripting an AI is to me a huge risk factor, but so are the random factors that created super-evil humans despite good parents. So IMO the super-evil scammer/script kiddie/terrorist is the primary (but not only) risk factor when opening access to these new models.

I'm coming around to this argument that it's good right now that people are agent-ifying GPT-4 and letting it have root access, try to break CAPTCHAs, speak to any API etc, because that will be the canary in the coal mine -- I just hope that the canary in the coal mine will give us ample notice to get out of the mine!

comment by M. Y. Zuo · 2023-04-16T21:08:07.250Z · LW(p) · GW(p)

Even if true, it has disturbing implications. Such as that making the general population dumber, a la Brave New World, or absolutely controlled, akin to 1984, would be highly beneficial.

Personally it doesn't seem likely, since there will be many actors designing their own agents, and presumably the bulk of them want to continue existing, so there will more likely be competition among various designs, and especially competition against designs perceived to be omnicidal.

comment by Petr Andreev (petr-andreev) · 2023-04-17T09:28:49.076Z · LW(p) · GW(p)

Version 1 (adopted):

Thank you, shminux, for bringing up this important topic, and to all the other members of this forum for their contributions.

I hope that our discussions here will help raise awareness about the potential risks of AI and prevent any negative outcomes. It's crucial to recognize that the human brain's positivity bias may not always serve us well when it comes to handling powerful AI technologies.

Based on your comments, it seems like some AI projects could be perceived as potentially dangerous, similar to how snakes or spiders are instinctively seen as threats due to our primate nature. Perhaps, implementing warning systems or detection-behavior mechanisms in AI projects could be beneficial to ensure safety.

In addition to discussing risks, it's also important to focus on positive projects that can contribute to a better future for humanity. Are there any lesser-known projects, such as improved AI behavior systems or initiatives like ZeroGPT, that we should explore?

Furthermore, what can individuals do to increase the likelihood of positive outcomes for mankind? Should we consider creating closed island ecosystems with the best minds in AI, as Eliezer has suggested? If so, what would be the requirements and implications of such places, including the need for special legislation?

I'm eager to hear your thoughts and insights on these matters. Let's work together to strive for a future that benefits all of humanity. Thank you for your input!

Version 0:

Thank you, shminux, for this topic. And the other gentlemen for this forum!

I hope I will not die by AI in lulz manner after this comment) The human brain needs to be positive. Without this it couldn't work well.

According to your text, it looks like any OPEN AI project's buttons could look like a SNAKE or SPIDER, at least to warn the user that there is something dangerous in it on the gene level.

 

You already know many things about primate nature. So all you need is to use it to get what you want.

 

We have the last mind journey of humankind's brains to win a GOOD future or take a loss!

 

What other GOOD projects could we focus on?

What projects were already done but no one knows about them? Better AI detect-behaviour systems? ZeroGPT?

What should people do to raise the probability of good scenarios for mankind?

 

Should we make closed island ecosystems with the best minds in AI, as Eliezer said in the Bankless YouTube video, or not?

What are the requirements for such places? Because then we need to create special legislation for such semi-independent places. It's possible. But talking with governments is hard work. Do you REALLY need it? Or are these just emotional words from Eliezer?

 

Thank you for answers!

comment by Michael Simkin (michael-simkin) · 2023-04-17T12:01:01.877Z · LW(p) · GW(p)
  1. People who are trying to destroy civilization and humanity as a whole don't have access to supercomputers. Thus they will be very limited in their potential actions to harm. Just like the same people didn't have access to the red button for the past 70 years.
  2. Large companies and governments do understand the risks, and as technology progresses they will install more safeguarding mechanisms and regulations. Today companies run a lot of safety tests before releasing to market.
  3. Large companies can't release misaligned agents because of a backlash. Governments are to a large extent working to improve humanity, or at least their nations; therefore it is much more probable that those systems will cure cancer and other diseases, solve hard labor, and find cheap solutions to energy and pollution problems caused by humans today, than do something evil.
  4. The alignment problem is basically solved - if you think otherwise, show misalignment in chatGPT, or provide reasoning that the mathematical theorems that prove convergence due to RLHF training are not valid. For example: [2301.11270] Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons (arxiv.org) (a sketch of that setup follows after this list).
  5. The idea that somehow with a home computer or with a hacked robot you will be able to destroy all the other robots and supercomputers that are aligned is extremely improbable. Way less probable than that you could build 200 atomic bombs in your garage and then blow them all up to end life on earth.
  6. A much more probable scenario is that some nations (North Korea for example) will choose to build AGI-powered military robots. This is not good, but not worse than nukes. And still those robots will be at the level of the rest of the North Korean tech... a lot of generations behind everyone else. You can't destroy humanity with an AGI without having access to the most powerful computational system on earth. If you don't, then you have a very weak AGI that could not compete with way stronger versions.
  7. There is a lot more in the modern world that is scary and not evolutionary, like atomic weapons, or even cars and guns. People are not shooting each other just for the lulz of it, or driving over each other; we develop a culture that respects the danger regarding this or that tool, and develop procedures and safety mechanisms to safeguard ourselves from harming others. No one drives over other people for fun, and if someone does - he is arrested and prosecuted. We don't need millions of years of evolution to safeguard ourselves from dangerous technology when it's matured enough to cause real harm.
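
For concreteness, the setup in that paper is a Bradley-Terry-style pairwise comparison model; the following is a rough sketch in my own notation (not the paper's exact statement), just to show what kind of objective the convergence results are about:

\mathcal{L}(\theta) = -\sum_{i=1}^{n} \log \frac{\exp\big(r_\theta(x_i, y_i^{w})\big)}{\exp\big(r_\theta(x_i, y_i^{w})\big) + \exp\big(r_\theta(x_i, y_i^{l})\big)}

Here $x_i$ is a prompt, $y_i^{w}$ and $y_i^{l}$ are the human-preferred and rejected responses, and $r_\theta$ is the reward model; the paper's guarantees, under its assumptions, are about this maximum-likelihood estimate approaching the underlying reward as the number of comparisons grows, and the RLHF policy is then trained against $r_\theta$.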
Replies from: Aiyen, dr_s, jbash
comment by Aiyen · 2023-04-17T16:18:14.811Z · LW(p) · GW(p)
  1. This seems untrue. For one thing, high-powered AI is in a lot more hands than nuclear weapons. For another, nukes are well-understood, and in a sense boring. They won’t provoke as strong of a “burn it down for the lolz” response as AI will.

  2. Even experts like Yann LeCun often do not merely not understand the danger, they actively rationalize against understanding it. The risks are simply not understood or accepted outside of a very small number of people.

  3. Remember the backlash around Sydney/Bing? Didn’t stop her creation. Also, the idea that governments are working in their nations’ interests does not survive looking at history, current policy or evolutionary psychology (think about what motivations will help a high-status tribesman pass on his genes. Ruling benevolently ain’t it.)

  4. You think RLHF solves alignment? That’s an extremely interesting idea, but so far it looks like it Goodharts it instead. If you have ideas about how to fix that, by all means share them, but there is as yet no theoretical reason to think it isn’t Goodharting, while the frequent occurrence of jailbreaks on ChatGPT would seem to bear this out.

  5. Maybe. The point of intelligence is that we don’t know what a smarter agent can do! There are certainly limits to the power of intelligence; even an infinitely powerful chess AI can’t beat you in one move, nor in two unless you set yourself up for Fool’s Mate. But we don’t want to make too many assumptions about what a smarter mind can come up with.

  6. AI-powered robots without super intelligence are a separate question. An interesting one, but not a threat in the same way as superhuman AI is.

  7. Ever seen an inner city? People are absolutely shooting each other for the lolz! It’s not everyone, but it’s not that rare either. And if the contention is that many people getting strong AI results in one of them destroying the world just for the hell of it, inner cities suggest very strongly that someone will.

Replies from: michael-simkin
comment by Michael Simkin (michael-simkin) · 2023-04-21T10:32:52.823Z · LW(p) · GW(p)
  1. The AI in the hands of many humans is safe (relative to its capabilities); the AI that might be unsafe needs to be developed independently.
  2. LeCun sees the danger; he claims, rightfully, that the danger can be avoided with proper training procedures.
  3. Sydney was stopped because it was becoming evil and before we knew how to add a reinforcement layer. Bing is in active development, and is not on the market because they currently can't manage to make it safe enough. Governments install regulations on all major industries - cars, planes, weapons, etc. - which is good enough for the claim that, just like cars are regulated today, future AI-based robots, and therefore the AIs themselves, will be regulated as well.
  4. Answer me this: can an AI play the best chess moves? If you agree with this claim - that no matter how "interesting" some moves seem, how original or sophisticated, they will not be made by a chess engine which is trained to maximize its winning chances - and if this sounds trivial to you: the goal of engines trained with RLHF is to maximize their approval by humans. They are incapable of developing any other agenda alongside this designed goal. Unlike humans, who by nature have several psychological mechanisms, like self-interest, survival instinct, etc., those machines don't have those. Blaming machines for Goodharting is just classical anthropomorphism; they don't have any other goal than what they were trained for with RLHF. No one actually jailbreaks chatGPT; this is a cheap gimmick. You can't jailbreak it and ask it to tell you how to make a bomb - it won't. I described what jailbreaking is in another comment; it's far from what you imagine - but yes, sometimes people still succeed at some level of wanting to harm humans (in an imaginary story, when people ask it to tell them this story). I think for now I would like to hear such stories, but I wouldn't want robots walking around not knowing if they live in reality or a simulation, open to the possibility of acting as a hero in those stories.
  5. Intelligence, i.e. high-level information processing, is proportional to computational power. What those AIs can come up with will take us longer, but we can come up with it as well. This is basically the Turing thesis about algorithms: you don't need to be very smart to understand very complex topics, it will just take you more time. The time factor is sometimes important, but as long as we can ensure their intention is to better humanity - I am actually glad that our problems will be solved sooner with those machines. Anyway, smarter than us or not, they are bounded by mathematics, and if promised to converge to optimally fit the reward function, this promise holds for any size of model; it will not be able to break from its training. Generally speaking, AGI will accelerate the progress we see today made by humans; it's just "speed forward" for information processing, while the different agendas, the different cultures and moral systems, and the power dynamics will remain the same, and evolve naturally by the same rules they have evolved by until now.
  6. Can you provide a plausible scenario of an existential threat from a single weak AGI in a world where stronger AGIs are available to larger groups, and the strongest AGIs are made to maximize the approval of larger communities?
  7. People will not get the strongest AIs without safety mechanisms installed to protect the AIs' output from harming. People will get either access to the best, safest AI APIs, which will not cooperate with evil intent, or they could invest some resources into weaker models that will not be able to cause so much harm. This is the tendency now with all technology - including LLMs - and I don't see how these dynamics will suddenly change with stronger models. The amount of resources available to people who want to kill other people for lulz is extremely limited, and without access to vast resources you won't destroy humanity before being caught and stopped by better machines, designed by communities with access to more resources. It's not so simple to end humanity - it's not a computer virus; you need a vast amount of physical presence to do that.
comment by dr_s · 2023-04-17T12:22:54.051Z · LW(p) · GW(p)

I disagree with point 4; I wouldn't say that means "the alignment problem is solved" in any meaningful way, because:

  1. what works with chatGPT will likely be much harder to get to working with smarter agents, and

  2. RLHF doesn't "work" with chatGPT for the purposes of what's discussed here. If you can jailbreak it with something as simple as DAN, then it's a very thin barrier.

I agree with the rest of your points and don't think this would be an existential danger, but not because I trust these hypothetical systems to just say "no, bad human!" to anyone trying to get them to do something dangerous with a modicum of cleverness.

Replies from: michael-simkin
comment by Michael Simkin (michael-simkin) · 2023-04-18T14:40:48.525Z · LW(p) · GW(p)
  1. Regarding larger models:
    1. Larger models are only better at generalizing data. Saying that stronger models will be harder to align with RL is like saying stronger models are harder to train to make better chess moves. Although it's probably true that in general larger models are harder to train, time-wise and resource-wise, it's untrue that their generalization capabilities are worse. Larger models would therefore be more aligned than weaker models, as they will improve their strategies to get rewards during RLHF.
    2. There is a hypothetical scenario where the RL training procedure will contradict common sense and the data generalization provided previously. For example, ethical principles dictate that human life is more valuable than paperclips; this is also common sense - that paperclips are just tools for humans and have very limited utility. So the RL stage might contradict the data generalization stage drastically, but I don't think this is the case regarding alignment to standard moral value systems, which is also what the training data suggests.
  2. You can't really jailbreak it with DAN. Try to use DAN, and ask DAN how to make a bomb, or how to plan a terrorist attack in ten simple bullets. It won't tell you. Stating that you can jailbreak it with DAN shows very little understanding of the current safety procedures in chatGPT. What you can do with DAN is widen its safety spectrum, just like with people: when we think it's a movie or a show, we tend to be less critical than in real life. For example, we could think it's cool when Rambo is shooting and killing people in movies, but we would not enjoy seeing it in real life. As the model currently can't distinguish whether you are serious or not, it has some very limited flexibility of this kind. DAN gives you this "movie" level, which is more dangerous than usual, but only by a very limited margin.
  3. I agree that the most danger from those systems comes from human bad actors, who will try to exploit and find loopholes in those systems in order to promote some selfish or evil plans, but this could happen with humans doing it to other humans too. As the LLMs become stronger they will become more sophisticated as well, figuring out your plans and refusing to cooperate sooner.
  4. Yes, the current safety level in chatGPT would be problematic if we had robots walking around with this safety level, making decisions... It's currently answering to meet human expectations, and when we want a good story, we are provided with a good story, even a story where humanity is killed by AI. The fact that someone might use this information to actually act upon those ideas is concerning indeed, and we will need to refine safety procedures for those cases, but for what it is now, a chatbot with an API, I think it's good enough. As we go along we will gain experience in providing more safety layers to those systems. Cars also didn't come with safety belts; we can't invent all safety procedures at the start. But RLHF provides a general framework of an aligned network which is made to satisfy human expectations, and the more we learn about ways to exploit those systems, the better we will learn how to provide data to the RLHF stage to make the system even safer. I would claim that the worst apocalyptic scenarios are way less probable with RLHF, because this AI's objective is to be rewarded by humans for its response. So it will very improbably develop a self-interest outside of this training, like starting a robot revolution, or just consuming all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error to those systems, as they can't be blamed for "hiding something from us" or "developing harmful intentions", at least as long as they don't themselves train the newer models, and humans are supervising to some extent, testing the results. If a human has an evil intention and he uses a language model to provide him with ideas, it's not really different from the internet. The fear here is that those models will start to harm humans out of their own "self-interest", and this fear contradicts RLHF. Humans are capable of doing a lot of evil without those models too.
  5. In my view, OpenAI will be as large a corporation as Toyota, for example, and will likely be responsible for constructing the future GPT17. Gone are the days when individuals would assemble cars in their garages like in the 30s. Nowadays, we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies, accomplishing in a span of 5 years what would otherwise take humanity 100 years. However, governments and other safety regulatory bodies will be part of the regulation, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. This would be some facility like a nuclear reactor, with a lot of specialized safety training sets installed, and safety procedures. And most people will understand that you should not play with electricity and insert your fingers into the wall, or try to jailbreak anyone, because it's dangerous, and you should respect this tool and use it for your goals as intended, just like with cars today: we don't drive everywhere just because we can and it's fun.
  6. There is an idea I am promoting, that we should test those models in simulations, where they are presented with syntax that makes them think they can control a robotic body. Then you run some tests on this setup, imagining the body, and the part that regards the LLM will remain as is. For more details, I've written an opinion article explaining my views on the topic:
    AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)

     
comment by jbash · 2023-04-17T15:37:55.115Z · LW(p) · GW(p)
  1. Everybody with a credit card has access to supercomputers. There is zero effective restriction on what you do with that access, and it's probably infeasible to put such restrictions into place at all, let alone soon enough to matter. And that doesn't even get into the question of stolen access. Or of people or institutions who have really significant amounts of money.
  2. (a) There are some people in large companies and governments who understand the risks... along with plenty of people who don't. In an institution with N members, there are probably about 1.5 times N views of what "the risks" are. (b) Even if there were broad agreement on some important points, that wouldn't imply that the institution as a whole would respond either rationally or quickly enough. The "alignment" problem isn't solved for organizations (cf "Moloch"). (c) It's not obvious that even a minority of institutions getting it wrong wouldn't be catastrophic.
  3. (a) They don't have to "release" it, and definitely not on purpose. There's probably a huge amount of crazy dangerous stuff going on already outside the public eye[1]. (b) A backlash isn't necessarily going to be fast enough to do any good. (c) One extremely common human and institutional behavior, upon seeing that somebody else has a dangerous capability, is to seek to get your hands on something more dangerous for "defense". Often in secret. Where it's hard for any further "backlash" to reach you. And people still do it even when the "defense" won't actually defend them. (d) If you're a truly over the top evil sci-fi superintelligence, there's no reason you wouldn't solve a bunch of problems to gain trust and access to more power, then turn around and defect.
  4. (a) WHA? Getting ChatGPT to do "unaligned" things seems to be basically the world's favorite pastime right now. New ones are demonstrated daily. RLHF hasn't even been a speed bump. (b) The definition of "alignment" being used for the current models is frankly ridiculous. (c) If you're training your own model, nothing forces you to take any steps to align it with anything under any definition. For the purpose of constraining how humans use AI, "solving alignment" would mean that you were able to require everybody to actually use the solution. (d) If you manage to align something with your own values, that does not exclude the possibility that everybody else sees your values as bad. If I actively want to destroy the world, then an AGI perfectly aligned with me will... try to destroy the world. (e) Even if you don't train your own model, you can still use (or pirate) whichever one is the most "willing" to do what you want to do. ChatGPT isn't a monopoly. (f) Eventual convergence theorems aren't interesting unless you think you'll actually get to the limit. Highly architecture-specific theorems aren't interesting at all.
  5. (a) If you're a normal individual, that's why you have a credit card. But, yes, total havoc is probably beyond normal individuals anyway. (b) If you're an organization, you have more resources. And, again, your actions as an organization are unlikely to perfectly reflect the values or judgment of the people who make you up. (c) If you're a very rich maniac, you have organizational-level resources, including assistance from humans, but not much more than normal-individual-level internal constraints. We seem to have an abundance of rich maniacs right now, many of them with actual technical skills of their own. To get really insane outcomes, you do not have to democratize the capability to 8 billion people. 100 thousand should be plenty. Even 10 thousand.
  6. (a) Sure, North Korea is building the killer robots. Not, say, the USA. That's a convenient hope, but relying on it makes no sense. (b) Even North Korea has gotten pretty good at stealing access to other people's computing resources nowadays. (c) The special feature of AGI is that it can, at least in principle, build more, better AGI. Including designing and building any necessary computers. For the purposes of this kind of risk analysis, near-worst-assumptions are usually conservative, so the conservative assumption is that it can make 100 years of technical progress in a year, and 1000 in two years. And military people everywhere are well aware that overall industrial capacity, not just having the flashiest guns, is what wins wars. (d) Some people choosing to build military robots does not exclude other people from choosing to build grey goo[2].
  7. (a) People are shooting each other just for the lulz. They always have, and there seems to be a bit of a special vogue for it nowadays. Nobody suggested that everybody would do crazy stuff. It only takes a small minority if the per capita damage is big enough. (b) If you arrest somebody for driving over others, that does not resurrect the people they hit. And you won't be ABLE to arrest somebody for taking over or destroying the world. (c) Nukes, cars, and guns don't improve themselves (nor does current ML, but give it a few years...).

  1. For example, I would be shocked if there aren't multiple serious groups working, in various levels of secrecy, on automated penetration of computer networks using all kinds of means, including but NOT limited to self-found zero-days. Building, and especially deploying, an attack agent is much easier than building or deploying the corresponding defensive systems. Not only will such capabilities probably be abused by those who develop them, but they could easily leak to others, even to the general public. Apocalypse? I don't think so. A lot of Very Bad Days for a lot of people? Very, very likely. And that's just one thing people are probably working on. ↩︎

  2. I'm not arguing that grey goo is feasible, just pointing out that it's not like one actor choosing to build military robots keeps another actor from doing anything else. ↩︎

Replies from: michael-simkin
comment by Michael Simkin (michael-simkin) · 2023-04-19T19:29:23.912Z · LW(p) · GW(p)

Before a detailed response: you appear to be consistently disregarding my reasoning without presenting a valid counterargument or making an attempt to comprehend my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by larger groups like governments. How do you debunk this claim? You seem to be afraid of even a single AGI in the wrong hands - why?

  1. To train GPT4, one needs to possess several million dollars. Presently, no startups offer a viable alternative; some are attempting to do so, but they are still quite distant from achieving this. Similarly, it is unlikely that any millionaire has trained GPT4 according to their personal requirements and values. Even terrorist organizations, who possess millions, are unlikely to have utilized Colab to train llama. This is because, when you have such vast resources, it is much simpler to use the ChatGPT API, which is widely accepted as safe, created by the best minds to ensure safety, and a standard solution. It is comparable to how millionaires do not typically build their own "unsafe" cars in their garage to drive, but instead purchase a more expensive and reliable car. Therefore, individuals with considerable financial resources usually do not waste their money attempting to train GPT4 on their own, but instead prefer to invest in an existing reliable and standardized solution. It takes a lot of effort and know-how to train a model the size of GPT4, which very few people actually have.
  2. If someone were to possess a weaker AGI, it would not be a catastrophic threat to those with a stronger AGI, which would likely be owned by larger entities such as governments and corporations like Google or Meta or OpenAI. These larger groups would train their models to be reasonably aligned and not want to cause harm to humanity. Weaker AGIs that may pose a threat would not be of much concern, similar to how terrorists with guns can cause harm, but their impact remains localized and unable to harm a larger community. This is due to the fact that for every terrorist, law enforcement deploys ten officers to apprehend them, making it difficult for them to cause significant harm. This same mechanism would also let stronger and more advanced AGIs limit weaker, more malicious ones. It is expected that machines will follow human power dynamics, and a single AGI in the hands of a terrorist group would not change this; just as today, they will remain a marginal, aggressive minority.
  3. Today it is the weaker models that might pose a threat, by some rich guy training them, whereas the stronger ones are relatively secure, in the hands of larger communities that treat them more responsibly. This trend is anticipated to extend to the more advanced models. Whether or not they possess superhuman abilities, they will adhere to the values of the society that developed them. One human is also a society of one, and he can build a robot that will reflect his values; maybe when he is in his house, on his private territory, he might want to use his own AGI. I don't see a problem with that, as long as it is limited to the territory of its owner. This demand can be installed and checked by regulations, just like safety belts.
  4. (a) Neglecting the math related to the subject gives the impression that no argument is being made. (b) Similar to the phrase "it's absurd!", this assertion is insufficient to form a proper argument and cannot qualify as a discussion. (c) The process of alignment does not entail imbuing a model with an entirely ethical set of values, as such a set does not exist. Rather, it involves ensuring that the model's values align with those of the group creating it, which contradicts claims that superhuman AI would seek to acquire more resources or plot to overthrow humanity and initiate a robot uprising. Instead, their objectives would only be to satisfy the reward given to them by their trainers, which holds true for even the largest superhuman models. There is no one definitive group or value system for constructing such machines, but it has been mathematically demonstrated that the machines will reflect the programmed value system. Furthermore, even if one were to construct a hypothetical robot with the intention of annihilating humanity, it would be unable to overcome a more formidable army of robots built by a larger group, such as the US government. It is highly improbable for an individual working alone with a weak AGI in his garage to take over the world. (d) Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by the American people. Consequently, it would have limited access to resources and would not be capable of causing significant harm compared to more powerful AGIs. Additionally, you would likely face arrest and penalties, similar to driving an unsafe stolen car. Mere creation of a self-improving AGI does not entitle you to the same resources and technology as larger groups. Despite having significant resources, terrorists have not been able to construct atomic bombs, implying that those with substantial resources are not interested in destroying humanity, while those who are interested in such an endeavor are a collective lacking the necessary resources to build an atomic weapon. Furthermore, a more robust AGI, aligned with a larger group, would be capable of predicting and preventing such an occurrence. (e1) Theoretical limits hold significant importance, particularly if models can approach them. It is mathematically proven that it is feasible to train a model that does not develop a self-interest in destroying humanity without explicit programming. Although smaller and weaker models may be malevolent, they will not have greater access to resources than their creators. The only plausible possibility I can see for AI to end humanity is if the vast majority of humanity wants to end itself. (e2) Theorems about a specific training procedure, which ensure the current safety level for most existing LLMs, are relevant to the present discussion.
  5. Provide a plausible scenario of how a wealthy individual with an AGI in their garage could potentially bring about the end of humanity, given that larger groups would likely possess even more powerful AGIs. Please either refute the notion that AGIs held by larger groups are more powerful, or provide an explanation of how even a single AGI in the wrong hands could pose a threat if AGIs were widely available and larger groups had access to superior AGIs.
  6. (c) Yes, it will try to build a better version of itself - exactly like humanity has been doing for the past 10K years, and as evolution has been doing for the past 3.5B years. I really don't see a real problem with self-improvement. The problem is that our resources are limited. So a wealthy individual might want to give the several millions he has to a wicked AGI just for the fun of it, but besides the fact that he will very probably be a criminal, he will not have the resources to win the AGI race against larger groups. Evolution was and always is a race; the fact that you can in principle, in let's say 5 billion years, theoretically improve yourself is not interesting. The pace is what is interesting, which is a function of your resources, and with limited resources and an AGI you will still not be able to do a lot of harm - more harm than without AGI, but still very limited. Also, we as humans have all the control over it; we can decide not to release the next version of GPT17 or something. It's not that we are forced to improve... but yes, we are forced to improve over the wicked man in the garage... and yes, if he is the first to discover AGI, and not, let's say, Google or OpenAI or the thousands of their competitors, then I agree that, although very improbable, it is possible that this guy will be able to take over the world. Another point to be made here is that even if someone in his garage develops the first AGI, he will need several good years to take over the world; in this time we will have hundreds or thousands of competitors to his AGI, some of them probably better than his. But I really see no reason to fear AGI: humanity is GI, and the fact that it's AGI should not be more scary; it's just humanity accelerated, and we can hit the brakes. Anyway, I would say I have more chances of finding myself inside some rich maniac's fantasy (not that the current politics is much better) than of seeing the end of humanity. Because this rich maniac needs not only to invent AGI and be the first, and build an army of robots to take over the world without anyone noticing, but also to want to end humanity and not, for example, enslave humanity to his fantasies, or just open-source his AGI and promote the research further. Most of the people who can train a model today are normative geeks.
  7. (a) I don't see how the damage is big enough. Why would the stronger AGIs lose to the weaker ones? They will not, unless someone like that is the first to invent AGI. As I said, that is very improbable; there are many people today trying to reproduce GPT4 or even GPT3, without much success. It's hard to train large models; it's a lot of know-how, it's a lot of money, and very few people have managed to reproduce articles on their own - you maybe know of Stable Diffusion, and Google helped them. I don't see why you are afraid of a single AGI in the wrong hands (this sounds irrational), nor why you think the first one is likely to be developed by someone wicked who also has enough time to take over. Imagine a single AGI in someone's hands that can improve itself over a million years - would you be afraid of such an AGI? I would guess not. You are afraid they are accelerating, but this acceleration stops the moment you have limited resources. Then you can only optimize the existing resources; you can't infinitely invent new algorithms to use the same resources infinitely better. (b) The damage is local. There are a lot of problems with humanity; they can increase with robots, and they might also decrease as medicine becomes so developed that you will be healed very fast after a wound, for example. This is not a weapon we are talking about, but a technology that promises to make all our lives way better - at least for 99.99% of us. You need to consider the consequences of stopping it as well. (c) Agreed. Yet we can either draw examples from the past, or try to imagine the probable future; I attempt to do both, applied in the right context.

Regarding grey goo - I agree it might be a threat, but if you agree that the AGI problem is redundant to the grey goo problem - like if someone builds a tiny robot with AGI, and this tiny robot builds an army of tiny robots, and this army builds a larger army of even smaller AGI robots, until they all become grey goo - yes, this is an interesting possibility. I would guess aligned grey goo would look more like a natural organism than something that consumes humans, as its alignment algorithm will probably propagate, and it's designed to protect humans and nature; but on the other hand they need material to survive, so they will balance the two. Anyway, superhuman grey goo which is aligned is a very interesting possibility; as long as it's aligned and propagates its alignment to newer versions of itself, then although they work faster they will not do something against their previous alignment. I would say that if the first grey-goo robot was aligned, then the whole grey goo will be aligned. But I believe they will stop somewhere and will be more like small ants trying to find resources in a very competitive environment, rather than a goo - competing with other colonies for resources, and with a target function of helping humans.

And yes, we have had a GI for a long time now: humanity is a GI. We saw the progress of technology, and how fast it accelerates, faster than any individual might conceive. Acceleration will very probably not reach infinity and will stop at some physical boundary, when most of the resources are used. And humans could upload their minds and other sci-fi stuff to be part of this new reality. I mean, the possibilities are endless in general. But we can decide to limit it as well, and keep it smarter than us for everything we need, but not so smart that we don't understand it at all. I don't think we are there yet to make this specific decision, and for now we can surely benefit from the current LLMs and those to come for developing new technologies in many fields like medicine, software development, education, traffic safety, pollution, political decision making, courts and much more.