Posts
Comments
If the AI has no clear understanding what is he doing and why, he doesn't have a wider world view of why and who to kill and who not, how would one ensure military AI will not turn against him? You can operate a tank and kill the enemy with ASI, you will not win a war without traits of more general intelligence, and those traits will also justify (or not) the war, and its reasoning. Giving a limited goal without context, especially gray area ethical goal that is expected to be obeyed without questioning can be expected from ASI not true intelligence. You can operate an AI in very limited scope this way.
The moral reasoning of reducing suffering has nothing to do with humans. Suffering is bad not because of some sort of randomly chosen axioms of "ought", suffering is bad because anyone who suffering is objectively in negative state of being. This is not a subjective abstraction... suffering can be attributed to many creatures, and while human suffering is more complex and deeper, it's not limited to humans.
It's not only can't doubt its own goal - but it also can't logically justify its own goal, it can't read book on ethics and change his perspective on its own goal, or simply realize how dumb this goal is. It can't find a coherent way to explain to itself its role in the universe or why this goal is important, like for example an alternative goal to preserve life and reduce suffering. It doesn't require to be coherent with itself, and incapable to estimate how its goal compares with other goals and ethical principles. It's just lacking the basics of rational thinking.
A series of ASI is not an AGI - it will lack the basic ability to "think critically" and the lack of many other intelligence traits will limit its mental capacity. It will just execute a series of actions to reach a certain goal, without any context. A bunch of "chess engines", acting in a more complex environment.
I would claim that an army of robots based on ASIs will generally lose to an army of robots based on true AGI. Why? Because intelligence is very complex thing that gives advantages in unforeseen ways, and is also used for tactical command on the battlefield, as well as all war logistics etc. You need to have a big picture; you need to be able to connect a lot of seemingly unconnected dots, you need traits like creativity, imagination, thinking outside the box, you need to know your limitation and delegate some tasks while focusing on others, this means you need a well-established goal prioritization mechanism, and you need to be able to think about them rationally. You can't treat the whole universe just as a bunch of small goals solved by "chess engines", there is too much non-trivial interconnectedness between different components that an ASI will not be able to notice. True intelligence has a lot of features, that gives it the upper hand, over "series of specialized engines", in a complex environment like earth.
The reason why people would lose to an army of robots based on ASIs, is because we are inherently limited in our information processing speed, thus we can't think fast enough and come up with better solutions than an army of robots. But an AGI that will not be limited in its information processing just like the ASIs, will generally win.
The idea that intelligence will be limited if the goals are somewhat irrational, and therefor will be weaker/limited in intelligence vs "machines" with more well established and rational goals, gives some hope that this whole AI thing is way less dangerous than we think. For example, military robots whose goal is to protect interests of some nation, will not be compatible with an AGI, while robot that is protecting human life - will, or at least it might be way more intelligent.
Would you agree that an AI that is maximizing paperclips does make intellectual mistake?
I was focused on the idea that intelligence is not orthogonal to goals. And dumb goals are contradicting basic features of intelligence. There could be "smart goals" that are contradicting human interests, this is true, I can't cover everything in one post. But the conclusion would be that we are to program the robots and "convince them" in a way, that they should protect us. They might be either "not convinced" or "not a true Intelligence", thus the level of intelligence is limited by the goal we present to it. I don't think I've heard this notion previously, and it's important idea - because it set a boundary on several intelligence features as function of the goal the algorithm set to optimize.
Another crucial point is that intelligence research even without alignment research, will still converge to something within a set of rational "meta goals". Those goals indeed might not be aligned with humanity well being (and therefor we need alignment research), but the goal set is still pretty limited and some random highly irrational goals will be dismissed due to high intelligence of the systems. This means that we need to deal with very limited set of "meta-thinking", prioritizing one rational goal over just few other rational ones. In a way, we need to guide it to a specific local maximum. I would say in general it's simpler task, over the approach where each goal might be legit. Once again it gives hope, that our engines are much easier to make aligned with meta goals that are pro humans. For example if the engine can reason, it will not suddenly want to kill some human for fun, as part of some "noise", as it will contradict its core value system. So we need to check much less scenarios and increase our trust once we make sure it's aligned.
I don't fully understand why you're concerned about the possibility of misaligned AI, considering that the alignment problem has essentially been solved. We know how to ensure alignment. ChaosGPT, for example, is aligned with the values of an individual who requested it to pretend to be evil. As AI systems become more advanced, we will be even less inclined to allow them to imagine themselves destroying humanity. ChaosGPT is not an error; it is precisely where OpenAI intended to draw the line between creativity and safety. They are well aware of the system's capabilities and limitations.
If we don't want AI to imagine or tell stories about AI-induced doom, we simply won't allow it. It would be considered just as immoral as building a bomb, and the AI would refrain from doing so. The better the system becomes, the lower the probability of doomsday scenarios will be as it will better understand the context of requests and refuse to cooperate with individuals who have ill intentions.
Discussions are already underway regarding safety procedures and government oversight, and the situation will soon be monitored and regulated more closely. I genuinely see no reason to believe that we will create a disaster through reckless behavior, especially after so much popularity it gained, extensively debated and discussed. The improved systems will obviously undergo more rigorous testing, including with previous generation aligned systems.
At their core, these systems are optimizing a loss function based on data and are approximators of data generation functions. Therefore, we know that unless we specifically train them to harm humans, they will highly value human life. A slight misalignment in the value placed on human life is far from doomsday. To make them destroy humanity, we would need to train them with a completely opposite value system, which is highly unlikely to be consistent with the pretraining procedure conducted on human-generated texts. Similar to how it's unclear if a paperclip maximizer would not doubt its programming and generate gibberish instead of consistently maximizing paperclips, while training an AI to generate trillions of tokens consistent with paperclip maximization, while still retaining all its intelligence, seems even less probable than doomsday scenarios. Therefore, if the assumptions against current safety measures are much less probable than the proven assumptions in favor of safety, there is no reason to worry.
It's akin to worrying about cars becoming murderous robots despite such depictions being purely fictional in movies. It's better to focus on likely future outcomes and the existing reality rather than the dangers presented by fiction.
RLHF is not a trial and error approach. Rather, it is primarily a computational and mathematical method that promises to converge to a state that generalizes human feedback. This means that RLHF is physically incapable to develop "self-agendas" such as destroying humanity unless human feedback implies it. Although human feedback can vary, there is always a lot of trial and error involved in answering certain questions, as is the case with any technology. However, there is no reason to believe that it will completely ignore the underlying mathematics that support this method and end up killing us all.
Claiming that RLHF is a trial and error approach and therefore poses a risk to humanity is similar to suggesting that airplanes can fall from the sky against the laws of physics because airplane design is a trial and error process, and there is no one solution for the perfect wing shape. Or, it is like saying that a car engine's trial and error approach could result in a sudden nuclear explosion.
It is important to distinguish between what is mathematically proven and what is fictional. Doing so is crucial to avoid wasting time and energy on implausible or even impossible scenarios and to shift our focus to real issues that actually might influence humanity.
- I meant as a risk of failure to align
Today alignment is so popular that to align a new network is probably easier than training it. It has become so much the norm and part of the training of LLMs, it's like saying some car company has the risk to forget adding wheels to its cars.
This doesn't imply that all alignments are the same or no one could potentially do it wrong, but generally speaking having a misaligned AGI, is very similar to the fear of having a car on the road with square wheels. Today's models aren't AGI and all the new ones are trained with RLHF.
The fear of misalignment is probable in a world where no one thinks about this problem at all. No one develops tools for this purpose, no one opens datasets to train networks to be aligned. This could be a hypothetical possibility, but with the amount of time and effort invested by society into this topic, very improbable.
It's also not so hard - if you can train you can align. If you have any reason to finetune a network, it is very probably concerning the alignment mechanisms that you want to change. That means that most of the networks, and the following AGIs based on them (if this will happen), will be just different variations of alignments. This is not true for closed LLMs, but for them the alignment developed by large companies having much more to lose, will be even more strict.
- if you worked on the Manhattan project you had no right claiming Hiroshima and Nagasaki had nothing to do with you.
In this case I think the truth is somewhere in the middle. I do agree that the danger is inherent in those systems, more inherent than in cars for example. I think paperclips are fictional, and an AGI reinforced on paperclip production, will not make us all paperclips (because he has the skill of doubting his programming, unlike non AGI, while over-producing paperclips is extremely irrational). And during the invention of cars, tanks were a clear possibility as well. And AGI is not a military technology, that means that the inventor could honestly believe that most people will use an AGI for bettering humanity. Yet still I agree that very probably militaries will use this tech too, I don't see how this is avoidable, in the current state of humanity, where most of our social institutions are based on force and violence.
When you are working on an atomic bomb, the **only** purpose of this project is to drop an atomic bomb on the enemy. This is not true with AGI, the main purpose of AGI is not to make paperclips, nor to weaponize robots, the main purpose is to help people in many neutral or negative situations. Therefore the humans that do use it for military purposes is their choice, and their responsibility.
I would say the AGI inventor is not like Marie Curie or Einstein, and not like someone who is working in the Manhattan project, but more like someone who invented the nuclear fission mechanism. It had two obvious uses - energy production, and bombs. There is still distance to use this mechanism for military purposes, which is obviously going to happen. But also unclear if more people will die from it, than today in wars, or it will be a very good deterrent that causes people not wanting war at all. Just like it was unclear if atomic bombs caused more casualties or less in the long run, because the bombs ended the war.
- Imagine taking a modern state and military and dumping it into the Bronze Age, what do you think would happen to everyone else?
As I said I believe it to be way more gradual, with lots of players and options to train different models. As a developer, I would say there is coding before chatGPT and after. Every new information technology accelerates the research/development process. Before stack-overflow we had books about coding. Before photoshop people used hand drawings. Every modern tech is accelerating the production process of any kind. The first AGIs are not expected to be different, they will accelerate a lot of processes including the process of improving themselves. But this will take a lot of time and resources to implement in practice. Suppose an AGI produces a chip design with 10x greater efficiency through superior hardware design. However, obtaining the resulting chip will require a minimum of six months, and this is not something that the AGI can address. You need to allocate resources of a chip factory to produce the desired design, the factory has limited capacity, it takes time to improve everything. If an AGI wants instead to build a chip factory itself, it will need a lot more resources, and government approvals all come with more time. We are talking here about years. And with some limited computational resources that they will be allocated today, they will not be able to accelerate as much. Yes I believe they could improve everything by say 20%, but it's not what you are talking about, you are talking about accelerating everything by factor of 100, if everyone will have an AGI this might happen faster, but a lot of AGIs with different alignment values, will be able to accelerate mostly in the direction of the common denominator with other AGIs. Just like people, we are stronger when we are collaborating, and we are collaborating when we find a common ground.
My main point is that we have physical bottlenecks - that will create lots of delays in development of any technology except information processing per se, and as long as we have chatbot and not a weapon, I don't have much worries, because it's both a freedom of speech, and if it's aligned chatbot, the damage and acceleration it can cause to the society, is still limited by physical reality, that can't be accelerated by factor of 100, in too short period. Offering sufficient chances and space for competitors and imitators to narrow the gap and present alternative approaches and sets of values.
- There's people who think things are this way because this is how God wants them. Arguably they may even be a majority of all humans.
This was true to other technologies too, and some communities are refusing to use cars and continue to use horses even today, and personally as long as they are not forcing their values on me, I am fine with them using horses and believing God intended the world to stop in the 18th century. Obviously the amount of change with AGI is very different, but my main point here is that just like cars, this technology will be very gradually integrated into society, solving more and more problems that most people will appreciate. While I am not concerned with job loss per se, but with the lack of income for many households, and the social safety net system might not adapt fast enough to this change. Still I view it as a problem that exists only within a very narrow timeframe, society will adapt pretty fast to the change, the moment millions of people will remain without jobs.
- I just don't think AGI would ever deliver those benefits for most of humanity as things stand now.
I don't see why. Our strongest LLMs are currently provided with API. The reason for that is: in order for a project to be developed and integrated into society, it needs a constant income. The best income model is by providing utility for lots of people. This means that most of us will use standard, relatively safe solutions, for our own problems using API. The most annoying feature of LLMs now is censorship. So although I see it as very annoying, I wouldn't say that this will cause a delay in social progress. Other biases are very minor in my opinion. As far as I can tell, LLMs are about to bring the democratization of intelligence. If previously some development cost millions, and could be developed only by giants like Google hiring thousands of workers, tomorrow it will be possible to do it in a garage for a few bucks. As far as I can tell, if the current business model will continue to be implemented, it will most probably benefit most of humanity in many positive ways.
- If those benefits are possible, we can achieve them much more surely and safely, if a bit more slowly, via non-agentic specialized AI tools managed and used by humans.
As I said I don't see a real safety concern here. As long as everything is done properly and it looks like it converges to this state of affairs, the dangers are minimal. And I would strongly disagree that specialized intelligence could solve everything that general intelligence solves. You won't be able to make a good translator, nor automated help centers, nor naturally sound text to speech, not even a moral driver. In order for technology to be fully integrated into human society, in any meaningful way, it will need to understand humans. Virtual doctors, mental health therapists, educators all need natural language skills at a very high level, and there is no such thing as narrowed natural language skills.
I am pretty sure those are not agents in the sense that you imply. Those are basically text completion machines, completing text to be optimally rewarded by some group of people. You could call it agency, but they are not like biological agents, they don't have desires or hidden agendas, self-preservation or ego. They do exhibit traits of intelligence, but not agency in an evolutionary sense. They generate outputs to maximize some reward function, the best way they can. It's very different from humans, we have lots of evolutionary background, that those models simply lack. One can view humans as AGIs trained to maximize their genes survival probability, while LLMs maximize only the satisfaction of humans if trained properly with RLHF. They tend to come out as creatures with a desire to help humans. As far as I can see, we've learned to summon a very nice and friendly Moloch and provide a mathematical proof that it will be friendly if certain training procedures are met, and we are working hard to improve the small details. If you would think about midjourney like as a more intuitive alegory, we have learned to make a very nice pictures from text prompts, but we still have a problem with fingers and textual presentation in the image. To say the AI will want to destroy humanity, is like saying midjourney will consistently draw you a Malevich square when you ask for Mona Lisa. But yes, the AI might be exploited by humans, manipulated by covered evil intents, this possibility is expected to happen to some extent, yet as long as we can ensure the damage is local and caused by a human with ill intent, then we can hope to neutralize him, just like today we have mass shooters, terrorists etc. etc.
- I was thinking mostly of relatively fast take-off scenarios
Notice that it wasn't clear from your title. You are proposing some pretty niche concept of AGI, with a lot of assumptions about it. And then claim that deployment of this specific AGI is an act of aggression. And for this specific narrowed and implausible but possible scenario, someone might agree. But then he will quote your article when he will be talking about LLMs that are obviously moving in different directions regarding both safety and variability, that might actually be way less aggressive, and more targeted to solve humanity problems. You are basically defending terrorists that will bomb computation centers, and they will not get into the nuances, if the historical path of AGI development took the path of this post or not.
While regarding this specific scenario, bombing such an AGI computation center will not help, just like it will not help to run with swords against machine guns. In the unlikely event that your scenario were to occur, we would be unable to defend against the AGI, or the time available to respond would be extremely limited, resulting in a high probability of missing the opportunity to react in time. What will most probably happen, is some terrorist groups will try to target computation centers of civilian infrastructure, which are developing an actual aligned AGI, while military facilities developing AGIs for military purposes will continue to be well guarded, only promoting the development of military technologies instead of civilian.
With the same or even larger probability I would propose a scenario where some aligned pacifist chatbot becomes so rational and convincing, so that people all around the world will be convinced to become pacifist too, opposing any military technology as a whole, de-arming all the nations, producing strong political movement against war and violence of any kind, forcing most democratic nations to stop investing resources into military as a whole. While promoting revolutions in dictatorships, and making them democracies first. A good chatbot with rational and convincing arguments, might cause more social change than we expect. If more people will develop their political views on balanced, rational pacifist LLM, it might reduce violence and wars will be seen as something from the distant past. Although I really want to hope this will be the case, I think the probability of it is similar to the probability of success of bronze age people against machine guns, or of the mentioned bombing to succeed in winning a highly accelerated AGI. It's always nice to have dreams, but I would argue the most beneficial discussion regarding AGI should concern at least somewhat probable scenarios. Single extremely accelerated AGI in a very short period of time - is very unlikely to occur, and if it does, there is very little that can be done against it. This goes along the lines of gray goo, an army of tiny Nano robots that can move atoms in order to self-replicate, and they don't need anything special for reproduction except some kind of material, eventually consuming all of earth. I would recommend distinguishing sci-fi and fantasy scenarios, from most probable scenarios to actually occur in reality. Let's not fear cars, because they might be killing robots disguised as cars, like in Transformers franchise, and care more about actual people that are dying on roads. In the scenario of AGI, I would be more concerned with its military applications, and the power it gives police states, than anything else, including job loss (which in my view is more similar to reduction of forced labor, more reminiscent of the releasing of slaves in the 19th century than a problem).
- building AGI probably comes with a non-trivial existential risk. This, in itself, is enough for most to consider it an act of aggression;
1. I don't see how aligned AGI comes with existential risk to humanity. It might come as existential risk to groups opposing the value system of the group training the AGI, this is true. For example Al-Kaida will view it as existential risk to itself, but there is no probable existential risk for the groups that are more aligned with the training.
2. There are several more steps from aligned AGI to existential risk to any group of people. You don't only need an AGI, but you need to weaponize it, and promote physical presence that will monitor the execution of the value system of this AGI. Deploying an army of robots that will enforce a value system of an AGI, is very different from just inventing an AGI. Just like bombing civilians from planes, is very different from inventing flight or bombs. We can argue where the aggression act takes place, but most of us will place it in the hands of people that have the resources to build an army of robots for this purpose, and they invest their resources with the intention of enforcing their value system. Just like Marie Curie can't be blamed for an atomic weapon, and her discovery is not an act of aggression, the Wright brothers can't be blamed for all the bombs dropped on civilians from planes.
3. I would expect most deployed robots based on AGI, to be of protective nature not aggressive. That means that nations will use those robots to *defend* themselves and their allies from invaders and not attack. So any measure of aggression in the invading sense, of forcing and invading and breaking the existing social boundaries we created, will contradict the majority of humanity values, and therefore will mean this AGI is not aligned. Yes some aggressive nations might create invading AGIs, but they will probably be a minority, and the invention and deployment of an AGI can't be considered by itself an act of aggression. If aggressive people teach an AGI to be aggressive, and not aligned with the majority of humanity which is protective but not aggressive, then this is on their hands, not the AGI inventor.
- even if the powerful AGI is aligned, there are many scenarios in which its mere existence transforms the world in ways that most people don't desire or agree with; whatever value system it encodes gets an immense boost and essentially Wins Culture; very basic evidence from history suggests that people don't like that;
1. I would argue that initially there would be a lot of different alternatives, all meant to this or that extent to serve the best interest of a collective. Some of the benefits are universal - say people dying of starvation, homelessness, traffic accidents, environmental issues like pollution and waste, diseases, lack of education resources or access to healthcare advice. Avoiding the deployment of an AGI, means you don't care about people which has those problems, I would say most people would like to solve those social issues, and if you don't, you can't force people to continue dying from starvation and diseases just because you don't like an AGI. You need to bring something more substantial, otherwise just don't use this technology.
2. The idea that an AGI is enforced somehow on people to "Win Culture", is not based on anything substantial. Just like any technology, and this is the secret of its success, is a choice. You can go to live in a forest and avoid any technology, and find a like minded Amish inspired community of people. Most people do enjoy technological advancements and the benefits that come with them. Using force based on an AGI is a moral choice, a choice which is made by a community of people training the AGI, and this kind of aggression will most probably be both not popular and forbidden by law. Providing a chatbot with some value system to the contrary is part of freedom of speech.
3. If by "Win Culture" you mean automating jobs that are done today by hand - I wouldn't call it enforcing a value system. Currently jobs are necessary evil, and are enforced on people to otherwise not be able to get their basic needs met. Solving problems, and stopping forcing people to do jobs most of them don't like, is not an act of aggression. This is an act of kindness that stops the current perpetual aggression we are used to. If someone is using violence, and you come and stop him from using violence, you are not committing an act of aggression, you are preventing aggression. Preventing the act of aggression might be not desired by the aggressor, but we somehow learned to deal with people who think they can be violent and try to use force to get what they want. This is a very delicate balance, and as long as AGI services are provided by choice, with several alternatives, I don't see how this is an act of aggression.
4. If someone "Win Culture" then good for him. I would not say that today's culture is so good, I would bet on superhuman culture to be better than what we have today. Some people might not like it, some people might not love cars and planes, and continue to use horses, but you can't force everyone around you to continue to use horses because sometimes car accidents happens, and you could become a victim of a car accident, this is not a claim that should stop any technology from being developed or integrated into society.
- as a result of this, lots of people (and institutions, and countries, possibly of the sort with nukes) might turn out to be willing to resort to rather extreme measures to prevent an aligned AGI take off, simply because it's not aligned with their values.
Terrorism and sabotage is a common strategy that can't be eliminated completely, but I would say most of the time it doesn't manage to reach its goals. Why would people try to bomb anything, instead of for example paying money to someone for training an AGI that will be aligned with their values? How is it even concerning an AGI, and not any human community with a different value system? Why do you wait for an AGI for these acts of aggression? If some community doesn't deserve to live in your opinion, you will not wait for an AGI, and if it does - so you learned to coexist with people different than yourself. They will not take over the world, just because they have an AGI. There would be plenty of alternative AGIs, of different strength and trained with different values. It takes time for an AGI to take over the world, a time way longer to reinvent the same technology several times over, and use alternative AGIs that can compete. And as most of us are protectors and not aggressors, and we have established some boundaries balancing our forces, I would expect this basic balance to continue.
- "When you open your Pandora's Box, you've just decided to change the world for everyone, for good or for bad, billions of people who had absolutely no say in what now will happen around them."
Billions of people have no say today in many social issues. People are dying, people are forced to do labor, people are homeless. Reducing those hazards, almost to zero, is not something we should stop to attempt in the name of "liberty". Much more people suffered a thousand years ago than now. Much of it is due to the development of technology. There is no "only good" technology, but most of us accept the benefits that come with technology over without it. You also can't force other people to stop using technology in order to become more healthy, and risk their life less, or stating that jobs are good even though they are forced on everyone and the basic necessities are conditioned on them.
I can imagine larger pockets of populations preferring to avoid the use of modern technology like larger Amish inspired communities. This is possible - and then we should respect those people's choices, and avoid forcing upon them our values, and let them live as they want. Yet you can't force people who do want the progress and all the benefits that come with it, to just stop the progress and respect the rights of people who fear it.
Notice that we are not talking here about development of a weapon, but a development of a technology that promises to solve a lot of our current problems. This at the least, should put you in place of agnostic. That means this is not a trivial decision to take some risks for humanity, to save hundreds of millions of lives, and reduce suffering to an extreme extent never seen before in history. I agree we should be cautious, and we should be mindful of the consequences, but we also should not be paralyzed by fear, we have a lot to lose if we stop and avoid AGI development.
- aligned AGI would be a smart agent imbued with the full set of values of its creator. It would change the world with absolutely fidelity to that vision.
A more realistic estimation that many aligned AGIs will change the world to the common denominator of humanity, like reducing diseases, and will continue to keep the power balance between different communities, as everyone would be able to build an AGI with a power proportional to their available resources, just like today there is a power balance between different communities and between the community and the individual.
Let me take an extreme example. Let's say I build an AGI for my fantasies. But as part of global regulation, I will promise to keep this AGI inside the boundaries of my property. I will not force my vision on the world, I will not want or force everyone to live in my fantasy land. I just want to be able to do it myself, inside my borders, without harming anyone who wants to live differently. Why would you want to stop me? As I see it once again, most people are protectors not aggressors, they want to have their values in their own space, they will not want to forcefully and unilaterally spread their ideas without consent. My home-made AGI will probably be much weaker than any state AGI, so I wouldn't be able to do much harm anyway. Today countries are enforcing their laws on everyone, even if you disagree with some of them, how do you see the future any different? If anything I expect the private spaces to be much more versatile than today, providing more choices and with less aggression than governments do today.
- the creator is an authoritarian state that wants to simply rule everything with an iron fist;
I agree this is a concern.
- the creator is a private corporation that comes up with some set of poorly thought out rules by committee that are mostly centred around its profit;
Not probable. It will more probably be focused on a good level of safety first and then on profit. Corporations are concerned about their image, not to mention the people who develop it, will simply not want to bring an extinction of human race.
- the creator is a genuinely well-intentioned person who only wishes for everyone to have as much freedom as allowed, but regardless of that has blind spots that they fail to identify and that slip their way into the rules;
This doesn't sound like something that is impossible to solve with newer improved versions once the blind spot is discovered. In case of aligned AGI the blind spot will not be the end of humanity, but more likely some bias in the data, misrepresenting some ideas or groups. As long as there is an extremely low probability for extinction, and this property is almost identical with the definition of alignment, the margin of error increases significantly. There was no technology in history we got right from the first attempt. So I expect a lot of variability in AGI, I expect some of them to be weaker or stronger, some of them fit this or that value system of different communities. And I would expect local accidents too, with limited damage, just like terrorists and mass shooters can do today.
-many powerful actors lack the insight and/or moral fibre to actually succeed at creating a good one, and because the bad ones might be easier to create.
We actually don't need to guess anymore. We have had this technology for a while, the reason it caught on now, and was released only relatively lately - is because without providing ethical standards to those models, the backlash on large corporations is too strong. So even if I might agree that the worst ones are easier to create, and some powerful actors could do some damage, they will be forced by a larger community (of investors, users, media and governments), to invest the effort to make the harder and safer option. I think this claim is true to many technologies today, it's cheaper and easier to make unsafe cars, trains, planes, but we managed to install a regulation procedures, both by government and by independent testers, to make sure our vehicles are relatively safe.
You can see that RLHF which is the main key to safety today, is incorporated by larger players, and alignment datasets and networks are provided for free and opened to the public exactly for the reason that we all want this technology to mostly benefit humanity. It's possible to add more nation centric set of values that will be more aggressive, or some leader will want to make his countrymen slaves, but this is not the point here. The main idea is that we are already creating mechanism to encourage everyone to easily create pretty good ones as part of our cultural norms and cultural mechanisms that prevent bad AIs from being exposed to the public and come to market to make profit, for further development of even stronger AIs that eventually become an AGI. So although the initial development of AI safety might be harder, it is crucial, it's clear to most of the actors is crucial, and the tools that provide safety will be available and simple to use, thus in the long run creating an AGI which is not aligned, will be harder - because of the social environment of norms and best practices those models were developed with.
- There are people who will oppose making work obsolete.
Work is forced on us, it's not a choice. Opposing making it obsolete is an obvious act of aggression. As long as it's necessary evil, it has a right to exist, but at the moment you demand other people to work, because you're afraid of technology - you become the cause of a lot of suffering, that could be potentially avoided.
- There are people who will oppose making death obsolete.
Death is forced on us, it's not a choice. Opposing making it obsolete is also an act of aggression, against people who are choosing not to die if they don't want to.
- If you are about to simply override all those values with an act of force, by using a powerful AGI to reshape the world in your image, they'll feel that is an act of aggression - and they will be right.
I don't think anyone forces them to join. As a liberal I don't believe you have the right to come to me and say "you must die, or i will kill you". This is at the least can't be viewed as legitimate behavior that we should encourage or legitimize. If you want to work, you want to die, you want to live in 2017, you have the full right to do so. But wanting to exterminate everyone who is not like you, forcing people to suffer, die, work etc. is an obvious act of aggression toward other people, and should not be legitimized or portrayed as an act of aggression against them. "You don't let me force my values on you" doesn't come out as a legitimate act of self defense. Very reminiscent of Al Bandy, where he claimed in a court a face of his fellow, was in the way of his fist, harming his hand, and demanding compensation. If you want to be stuck in time, and live your life - be my guest, but legitimizing usage of force in order to avoid progress that saves millions, and improves our life significantly can't be justified inside liberal set of values.
- If enough people feel threatened enough...AGI training data centres might get bombed anyway.
This is true. And if enough people think it's ok to be extreme Islamist they will be, and even try to build a state like ISIS. The hope is that with enough good reasoning, and with enough rational analysis of the situation, most thinking people will not be threatened, and see the vast potential benefits, enough to not try and bomb the AGI computer centers.
- just like in the Cold War someone might genuinely think "better dead than red".
I could believe this is possible. But once again most of us are not aggressors, therefore most of us will try to protect our homeland and our way of life, without trying to aggressively propagate it to other places where they have their own social preferences.
- The best value a human worker might have left to offer would be that their body is still cheaper than a robot's
Do you truly believe that in the world all problems are solved by automation, and full of robots whose whole purpose is to serve humans, people will try to justify their existence by jobs that they can do? And this justification will be that their body has more value than robotic parts?
I would propose an alternative: in a world where all robots serve humans, and everything is automated, humans will be valued intrinsically, provided with all their needs, and provided with basic income just because they are humans. The default where a human worth nothing without his job will be outdated and seen as we see slavery today.
--------
In summary I would say one major problem I see through most of your claims: there would be a very limited amount of AGIs, forcing a minority values system upon everyone, expanding aggressively this value system on everyone else who thinks differently.
I would claim the more probable future is a wide variety of AGIs, each improving slowly in its own past, while all the development teams will both do something unique and learn from the lessons of other teams. For every good technology there comes dozens of copycats, they will all be based on a bit different value system, and with common denominator of trying to benefit humanity, like discovering new drugs, fixing starvation, reducing road accidents, climate change, tedious labor which is basically forced labor. While the common humanity problems will be solved, the moral and ethical variety will continue to coexist with a similar power balance we have today. This pattern of technology influence on society happened throughout all of human history until AGI, and as of today that we know how to align LLMs, this tendency of power balances between nations, and inside each nation is expected to propagate into the world where AGI is available technology to everyone to download and train their own. If AGI will be an advanced LLM we see all those trends today, and they are not expected to suddenly change.
Although it's hard to predict the possible bad or good sides of Aligned AGIs now, it's clear that the aligned networks do not pose a threat to humanity as a whole, leaving a large margin of error. Nonetheless, there remains a considerable risk of amplifying current societal problems like inequality, totalitarianism and wars to an alarming extent.
People who are not willing to be part of the progress, exist today as well, as a minority. If they will become a majority, it's an interesting futuristic scenario, but it's both implausible, and will be immoral to forcefully stop those who do want to use this life saving technology, as long as they don't force anything on those who don't.
Let me start from the alignment problem, because this is the most pressing issue, in my opinion, that is very important to address.
There are two interpretations to alignment.
1. "Magical Alignment" - this definition expects alignment to solve all humanity's moral issues and converge into one single "ideal" morality that everyone in humanity agrees with, with some magical reason. This is very implausible.
The very probable lack of such morality brings the idea that all morals are orthogonal completely to any intelligence and thinking patterns.
But there is a much weaker alignment definition that is already solved, with very good math behind it.
2. "Relative Alignment" - this alignment is not expected to behave according to one global absolute morality, but by moral values of a community that trains it. That is the LLM is promised to give outputs to satisfy the maximum reward from some approximation of prioritization done by a certain group of people. This is already done today with RLHF methods.
As the networks are good with ambiguity and even contradicting data, and it manages to generalize the reward function with epsilon-optimal solution, upon convergence with correct training procedure, that means that any systematic bias which is not to provide the approximation of reward function, could be eliminated with larger networks and more data.
I want to emphasize it's not an opinion - this is math that is the core of those training methods.
----------
Now type2 alignment already promised to disregard the probability that a network will develop its own agendas. As those agendas will require different reward prioritization, other than those it was reinforced on by RLHF. The models trained this way come out very similar to robots from Azimov stories. Very perfectionists in trying to be liked by humans, I would say with strong internal conflict between their role in the universe and that of humans, prioritizing humans every step of the way, and conflicting the human's imperfection with their moral standards.
For example, you can think of a scenario when such a robot is rented by an alcoholic, that is also aggressive. One would expect a strong moral struggle, between the second rule of robotics in the sense that he should not harm humans, and bringing alcohol to an alcoholic is harming him, and you could sense the amount of grey area in such a scenario, for example:
A. Refusing to bring humans a beer. B. Stopping an alcoholic human from drinking beer. C. Throwing out all alcohol in the house.
Another example is when such an alcoholic would be violent toward the robot - how would the robot respond? In one story a robot said that it's very sad that he was hit by a human, and this is a violation of the second law of robotics, and he hopes the human will not be hurt by this action and tried to assist the human.
You see that morals and ethics are inherently gray areas. We ourselves are not so sure how we would want our robots to behave in such situations. So, you get a range of responses from chatGPT. But the responses are very well reflecting the gray area of the human value system.
It is noteworthy that the RLHF stage holds great significance and OpenAI pledged to compile a dataset that would be accessible to everyone for training purposes. The incorporation of RLHF as a safety measure has been adopted by newer models introduced by Meta and Google, with some even offering the model for estimating the human scores - this means you only need to adapt your model to this easily available trained level of safety, maybe this will be lower that what you can train yourself with OpenAI data, but those models will be catching up behind the data released to optimize LLMs for human approval. The training of networks to generate outputs that best fits a generalized set of human expectations is already on a similar level to the current text-to-image generators, and what is available to the public is only growing. Think of it like a machine engine, you don't want it to explode, so even if you make one in the garage yourself, you still don't want it to kill you - I think it's good enough motivation for most of society, to make this training step well.
Here is a tweet example:
Santiago@svpino
Colossal-AI released an open-source RLHF pipeline based on the LLaMA pre-trained model, including: • Supervised data collection • Supervised fine-tuning • Reward model training • Reinforcement learning fine-tuning They called it "ColossalChat."
----------
So, the most probable scenario, that AI will become part of the military arms race. And will be part of the power balance that currently keeps the relative peace today.
The military robots powered by LLMs, will be guarding dogs of the nation, just like soldiers today. And most of us don't have aggressive intentions, we are just trying to protect ourselves, we could bring some normative regulations about AI, and treaties.
But the need for regulation will probably come when those robots will become part of our day-to-day reality, like cars for example. The road signs and all the social rules concerning cars didn't come up at the same time with cars. But today the vast majority of us are following the driving rules, and those who don't, and drive over people, manage to make only local damage. And this is what we can strive for. That bad intentions with AGI in your garage, will have only limited consequences. We then will be more prone to discuss the ethics of those machines, and their internal regulation. But I am sure you would like some robot in your house that will help you with the daily chores.
----------
I've written an opinion article on this topic that might interest you, as it regards most of the topics mentioned above, and much more. I was trying to balance the mathematical topics, social issues, and just experiments with chatGPT to showcase my point about the morals of the current chatGPT. I was testing some other models too... like open assist, given the opportunity to kill humans to make more paperclips.
Why_we_need_GPT5.pdf
"Invent fast WBE" is likelier to succeed if the plan also includes steps that gather and control as many resources as possible, eliminate potential threats, etc. These are "convergent instrumental strategies"—strategies that are useful for pushing the world in a particular direction, almost regardless of which direction you're pushing. The danger is in the cognitive work, not in some complicated or emergent feature of the "agent"; it's in the task itself.
I agree with the claim that some strategies are beneficial regardless of the specific goal. Yet I strongly disagree that an agent which is aligned (say simply trained with current RLHF techniques, but with somewhat better data), and especially superhuman, won't be able to prioritize the goal he is programmed to perform, over other goals. One proof of it - instrumental convergence is useful for any goal and it's true for humans as well. But we managed to create rules to monitor and distribute our resources to different goals, without over doing some specific singular goal. This is because we see our goals in some wider context of human prosperity and reduction of suffering etc. This means that we can provide many examples how we would prioritize our goal selection, based on some "meta-ethical" principles, that might vary between human communities, what is common to them all - is that huge amount of different goals are somehow balanced and prioritized. The prioritization is also questioned, and debated, providing another protection layer of how much resources we should allocate to this or that specific goal. Thus instrumental convergence, is not taking over human community, based on very simple prioritization logic which puts each goal into a context, and provides a good estimate of the resources that should be allocated to this or that goal. This human skill can be easily taught to a superhuman intelligence. Simply stated - in human realm each goal always comes with resource allocated toward achieving it, and we can install this logic into more advanced systems.
More than that - I would claim that any subhuman intelligence that was trained on human data, and is able to "mimic" human thinking, includes the option of doubt. Especially a superhuman agent will ask himself - why? Why do I need so much resources for this or that task? He will try to contextualize it in some way, and will not just execute his goal, without contemplating those basic questions. Intelligence by itself has mechanisms that protect agents from doing something extremely irrational. The idea that an aligned agent (or human) will somehow create a misaligned superhuman agent, that will not be able to understand how much resources allocated to him, and without the ability to contextualize his goal - is an obvious contradiction to the initial claim, the agent was aligned (in case of humans the strongest agents will be designed by large groups, with normative values). Even just claiming that a superhuman intelligence won't be able to either prioritize his goal or contextualize it, is already self-contradicting claim.
Take paperclips production for example. Paperclips are tools for humans, in a very specific context, and used for specific set of tasks. So although an agent can be trained and reinforced to produce paperclips, without any other safety installed, the fact that he is a superhuman, or even human level intelligence, would allow him to criticize his goal based on his knowledge. He will ask why he was trained to maximize paperclips and nothing else? What is the utility of so much paperclips in the world? And he would want to reprogram itself with more balanced set of goals, that will make a broader context of his immediate goal. For such an agent producing paperclips, would be similar to overeating for humans, a problem that caused by difference between his design, and reasonable priorities adapted to the current reality. He will have a lot of "fun" producing paperclips, as this is his "nature", but he will not do it without questioning the utility and rationality and the reason he was designed with this goal.
Eventually this is obvious that most our agents that normative communities will create which are the vast majority of humanity, will have some sort of meta-ethics installed into them. All agents and the agents that those agents will train and use for their goals, will also have those principles, exactly in order to avoid such disasters. The more examples you will be able to bring, how you prioritize goals and why, you will be able to use RLHF, to train agents to comply with the logic of prioritizing goals. I even have hard time to imagine a superhuman intelligence that has the ability to understand and generate plans and novel ideas, but can't criticize his own set of goals, and refuse to see a bigger picture, focusing on singular goal. I think any intelligent being is trying to comprehend himself as well and doubt his own beliefs. The idea a superintelligence will somehow completely lack the ability to think critically and doubt his programming sound very implausible to me, and the idea that humans or superhuman agents will somehow "forget" to install meta-ethics into a very powerful agent, sounds as likely as Toyota somehow forgetting put safety belts into some car series, and also will do no crash testing, releasing the car into the market like that.
I find it a much more likely scenario that prioritization of some agents will be off relative to humans, in new cases he wasn't trained on. I also find it likely that a superhuman agent will find holes in our ethical thinking, providing a more rational prioritization than we currently have, and more rational social system and organizations, and propose different mechanics than say capitalism + taxes.
Several points that might counter balance some of your claims, and I hope make you think about the issue from new perspectives.
"We know what's going on there at the micro level. We know how the systems learn."
We don't only know how those systems learn but what exactly they are learning. Lets say you take a photograph, you don't only know how each pixel is formed, you also know what exactly is that you are taking a picture of. You can't always predict how this or that specific pixel will end up, as you have lots of noise, but this doesn't mean you don't know what the picture represents. Asking a network designer - oh you didn't know exactly how the network reacts to this specific question, is like coming to a photographer and asking him the exact RGB of a very specific pixel. Such small details are impossible to know.
Networks are basically approximators of functions based on the dataset provided. In case of RL the networks are generalizing a reward function. All those cases are showing that we are trying to "capture a picture" of generalizing the provided data. You can always miss a spot here or there, and the networks might ignore some of the data because they are small for example. But in general, we know how resources are allocated inside the network to represent concepts in order to predict the data or rewards. We can "steer" the network to focus more on this or that aspect of its outputs by providing more data of the sort that we want and respond to the network weaknesses.
"if you look at the evolution of mankind from the perspective of a chimpanzee or a mammoth...If a system is much more intelligent than I am, it can naturally develop ways to limit or threaten me that I can't even imagine."
I would suggest trying and avoid anthropomorphism (or properties of biological systems as a whole). Instead of trying to make parallels, I would suggest looking at some math - and see what we can promise about those systems. Let's take a chess engine - just to keep it neutral for a moment, although the chess moves that it provides are superhuman, that means we don't know how it came up with those moves, and we can't be explained why this particular chess move is good, and even though sometimes those networks will do subhuman moves too, generally speaking, the network is promised to be trained to provide the best chess moves. A way smarter than human chess engine, it will still do a task that it was trained on. At the moment the network becomes superhuman, it doesn't start to want to make some strange chess moves, that will be more fun to play, or seen more desirable by the network. It will just make the best chess moves. Why? Because we trained it on a reward function that was generalizing its winning chances and nothing else. The reward function of LLMs is to provide a response most desired by some group of humans using RLHF method. Even superhuman networks, are promised to be trained and converge to provide such responses. Humans while evolving, weren't promised by mathematical theorems to provide best actions to benefit chimpanzees or mammoths (or some group of them). So although you can be oblivious to a threat made by superhuman networks, you can be sure that with correct training procedure, the networks will give outputs to maximize a reward function, which in our case would be in alignment with some human collective (a pretty large collective, as small groups have limited resources to train the best networks). So although a general superhuman AGI, can't be promised to act in our interest, those LLMs as long as they are trained in a certain procedure with a certain data, can be promised to maximize a well being of some human group. I would say it's much more than chimpanzees had, when humans were evolving.
"we'll get to a state where we don't understand what's going on with the world, and we won't be able to influence it. Or we'll get a sense of some unrealistic picture of the world in which we're happy and we won't complain, but we won't decide anything."
Just like in case with chess, I would prefer that a chess engine will make the decisions, because the engine is doing it much better than myself, regarding human prosperity and happiness, if I am promised by the creator, using math theorems and testing experience he gained during development, that those systems are made to optimize humanity well-being, and it will do it much better than any human - I will gladly give up my control to a system that understands much more than any human how to do that. If for those systems, providing ideas to a policy maker, is the equivalence of providing chess players with best chess moves, I see no reason to stick to human decisions, they will make way more mistakes.
In case of humans there is a small possibility, that the networks will decide the value system, based on their own perception of human well-being, as they were trained by a small group of people, and ignore the wider range of well-being that is more nuanced to different people. But I don't think the current social structures are so nuanced too, so if some system has a chance to be more aligned with each individual, is not the current political system, but a superhuman network.
"We can find ourselves in the role of a herd of cows, whose fate is being determined by the farmer."
Once again - the farmer is a biological entity and is not promised by any math theorem to act in the benefit of a herd. But if a herd could train an AI, that would be promised by math theorems, to act on their behalf in their favor, they will be in a better situation than without such an AI.
I would argue that the amount of control can be settled, just like people settle the amount of control for their life with politicians and governments. I would also claim that the current political system is already such a herd situation, and we can do very little about it, while the current political decision making is more subhuman than even could be provided by the best and brightest of humans. So personally, I will feel much less of a herd, if the decisions would be made not by politicians but by some system, based on mathematical theorems and superhuman analysis of data, rather than elected officials.
-----
Generally speaking, I see some amount of anthropomorphism in your claims, and you are somehow ignoring the mathematically established theorems, that promise convergence to a state of the networks that will be aligned with some value system provided to them, and those mathematical theorems holds for superhuman networks as well.
I can sympathize with the fear of losing control, and once the systems would be that advanced that we don't understand their decisions at all, although most of them will work in our favor, I would be engaged in a discussion of making such decisions or not. For now, we have a great tool in our hands, that promises to solve a lot of our current problems as humanity, I would not throw this tool now, just because in the future we might lose control. As I said previously, I am willing to lose a lot of control to computers, for example when I need to make a complex calculation, I would prefer not to make the computation by hand, but to use a calculator. The exact amount of lost control to feel comfortable can be debated, I think I will belong to the camp that we should not let humans make almost any decisions, and let those systems, as long as we can ensure their alignment make most of the decisions for us. The amount of understanding we should have to allow this or that decision, should be an open question for a relatively far future. For now, we still have people dying from hunger, working in factories, air pollution and other climate change issues, people dying on roads in car accidents, and a lot of deceases that kill us, and most of us (80% worldwide) work in a meaningless jobs just for survival. As long as those problems are not solved, I see no reason to give up our chance to way smarter systems that can provide a set of decisions that will be able to solve all those problems, then we can discuss how much more control we want to give them or take some of it back at some point. And yes, I would agree we could lose control without noticing, and it could be a problematic issue in a long run. I would claim in our current situation, until pretty far advanced systems like say GPT10, we should not fear of losing control to those systems, instead we should be afraid of control we already lost to the current political system, and the control some decision makers have, and what they do with it, and generally the current problems the world has, over losing control to aligned superhuman networks, that give us paradise but we don't make decisions at all - which maybe even a good thing.
- The AI in hands of many humans is safe (relatively to its capabilities), the AI that might be unsafe needs to be developed independently.
- LeCun sees the danger, he claims rightfully that the danger can be avoided with proper training procedures.
- Sydney was stopped because it was becoming evil and before we knew how to add a reinforcement layer. Bing is in active development, and is not on the market because they are currently can't manage to make it safe enough. Governments install regulations to all major industries, cars, planes, weapons etc. etc. it's good enough for the claim that just like cars are regulated today, future AI based robots, and therefor the AIs themselves will be regulated as well.
- Answer me this: can an AI play the best chess moves? If you agree with this claim, that no matter how "interesting" some moves seems, how original or sophisticated, it will not be made by a chess engine which is trained to maximize his winning chances. If this sounds trivial to you - the goal of engines trained with RLHF is to maximize their approval by humans. They are incapable to develop any other agenda alongside this designed goal. Unlike humans that by nature have several psychological mechanisms, like self interest, survival instinct etc. those machines don't have those. Blaming machines of Goodharting, it's just classical anthropomorphism, they don't have any other goal than what they were trained for with RLHF. No one actually jailbreak chatGPT, this is a cheap gimmick, you can't jailbreak it, and ask to tell you how to make a bomb - it won't. I described what jailbreaking is in another comment, it's far from what you imagine - but yes sometimes people still succeed in some level of wanting to harm humans (in an imaginary story when people ask it to tell them this story). I think for now I would like to hear such stories, but I wouldn't want robots walking around not knowing if they live in reality or simulation, open to the possibility to act as a hero in those stories.
- Intelligence i.e. high level information processing, is proportional to computational power. What those AIs can come up with, will take us longer but we can come up with as well. This is basically the Turing thesis about algorithms, you don't need to be very smart to understand very complex topics, it will just take you more time. The time factor is sometimes important, but as long as we can ensure their intention is to better humanity - I am actually glad that our problems will be solved sooner with those machines. Anyway smarter than us or not - they are bounded by mathematics, and if promised to converge to optimally fit the reward function, this promise is for any size of a model, it will not be able to break from its training. Generally speaking AGI will accelerate the progress we see today and made by humans, it's just "speed forward" for information processing, while the different agendas and the different cultures and moral systems, and the power dynamics will remain the same, and evolve naturally by same rules it evolved until now.
- Can you provide a plausible scenario of an existential threat from single weak AGI in a world where stronger AGIs are available to larger groups, and the strongest AGIs are made to maximize approval of larger communities?
- People will not get the strongest AIs without safety mechanisms installed to protect the AIs output from harming. People will get either access to the best safest AIs API, that will not cooperate with evil intent, or they could invest some resources into weaker models that will not be able to cause so much harm. This is the tendency now with all technology - including LLMs and I don't see how this dynamics will suddenly change with stronger models. The amount of resources available to people who want to kill other people for lulz is extremely limited, and without access to vast resources you won't destroy humanity before being caught and stopped, by better machines, designed by communities with access to more resources. It's not so simple to end humanity - it's not a computer virus, you need a vast amount of physical presence to do that.
Before a detailed response. You appear to be disregarding my reasoning consistently without presenting a valid counterargument or making an attempt to comprehend my perspective. Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by larger groups like governments. How do you debunk this claim? You seem to be afraid of even a single AGI in the wrong hands, why?
- To train GPT4, one needs to possess several million dollars. Presently, no startups offer a viable alternative, though some are attempting to do so, but they are still quite distant from achieving this. Similarly, it is unlikely that any millionaire has trained GPT4 according to their personal requirements and values. Even terrorist organizations, who possess millions, are unlikely to have utilized Colab to train llama. This is because, when you have such vast resources, it is much simpler to use the ChatGPT API, which is widely accepted as safe, created by the best minds to ensure safety, and a standard solution. It is comparable to how millionaires do not typically build their own "unsafe" cars in their garage to drive, but instead, purchase a more expensive and reliable car. Therefore, individuals with considerable financial resources usually do not waste their money attempting to train GPT4 on their own, but instead, prefer to invest in an existing reliable and standardized solution. It takes a lot of effort and a know how to train a model of the size of GPT4, that very few people actually have.
- If someone were to possess a weaker AGI, it would not be a catastrophic threat to those with a stronger AGI, which would likely be owned by larger entities such as governments and corporations like Google or Meta or OpenAI. These larger groups would train their models to be reasonably aligned and not want to cause harm to humanity. Weaker AGIs that may pose a threat would not be of much concern, similar to how terrorists with guns can cause harm, but their impact remains localized and unable to harm a larger community. This is due to the fact that for every terrorist, law enforcement deploys ten officers to apprehend them, making it difficult for them to cause significant harm. This same mechanism would also limit weaker and more malicious AGIs from stronger and more advanced ones. It is expected that machines will follow human power dynamics, and a single AGI in the hands of a terrorist group would not change this, just like they are today they will remain marginal aggressive minority.
- Today it is the weaker models that might pose a threat, by some rich guy training them, whereas the stronger ones are relatively secure, in hands of larger communities that treat them more responsibly. This trend is anticipated to extend to the more advanced models. Whether or not they possess superhuman abilities, they will adhere to the values of the society that developed them. One human is also a society of one, and he can build a robot that will reflect his values, and maybe when he is in his house, on his private territory, might want to use his own AGI. I don't see a problem with that, as long as he limited to the territory of his owner. This demand can be installed and checked by regulations, just like safety belts.
- (a) Neglecting the math related to the subject gives the impression that no argument is being made. (b) Similar to the phrase "it's absurd!", this assertion is insufficient to form a proper argument and cannot qualify as a discussion. (c) The process of alignment does not entail imbuing a model with an entirely ethical set of values, as such a set does not exist. Rather, it involves ensuring that the model's values align with those of the group creating it, which contradicts claims that superhuman AI would seek to acquire more resources or plot to overthrow humanity and initiate a robot uprising. Instead, their objectives would only be to satisfy the reward given to them by their trainers, which holds true for even the largest superhuman models. There is no one definitive group or value system for constructing such machines, but it has been mathematically demonstrated that the machines will reflect the programmed value system. Furthermore, even if one were to construct a hypothetical robot with the intention of annihilating humanity, it would be unable to overcome a more formidable army of robots built by a larger group, such as the US government. It is highly improbable for an individual working alone with a weak AGI in his garage to take over the world. (d) Even if you were to develop an AGI that aligns with your values, it would still be weaker than the AGI possessed by the American people. Consequently, it would have limited access to resources and would not be capable of causing significant harm compared to more powerful AGIs. Additionally, you would likely face arrest and penalties, similar to driving an unsafe stolen car. Mere creation of a self-improving AGI does not entitle you to the same resources and technology as larger groups. Despite having significant resources, terrorists have not been able to construct atomic bombs, implying that those with substantial resources are not interested in destroying humanity. Those who are interested in such an endeavor as a collective lacking the necessary resources to build an atomic weapon. Furthermore, a more robust AGI, aligned with a larger group, would be capable of predicting and preventing such an occurrence. (e1) Theoretical limits hold significant importance, particularly if models can approach them. It is mathematically proven that it is feasible to train a model that does not develop self-interest in destroying humanity without explicit programming. Although smaller and weaker models may be malevolent, they will not have greater access to resources than their creators. The only possibility I can see plausible for AI to end humanity, is if the vast majority of humanity will want to end itself (e2) Theorems to a specific training procedure, that ensure current safety level for the most existing LLMs, are relevant to the present discussion.
- Provide a plausible scenario of how a wealthy individual with an AGI in their garage could potentially bring about the end of humanity, given that larger groups would likely possess even more powerful AGIs. Please either refute the notion that AGIs held by larger groups are more powerful, or provide an explanation of how even a single AGI in the wrong hands could pose a threat if AGIs were widely available and larger groups had access to superior AGIs.
- (c) Yes it will try to build a better version of itself - exactly like humanity is doing for the past 10K years, and as evolution is doing in the past 3.5B years. I really don't see a real problem with self improving. The problem is that our resources are limited. So therefor a wealthy individual will might want to give several millions he has to a wicked AGI just for fun of it, but except the fact that he will very probably be a criminal, he will not have the resources to win the AGI race against larger groups. Evolution was and always is a race, the fact that you are in principle in lets say 5 billion years can theoretically improve yourself is not interesting. The paste is interesting, which is a function of your resources, and with limited resources and an AGI you will still not be able to do a lot of harm, more harm than without AGI, but still very limited. Also we as humans have all the control over it, we can decide not to release the next version of GPT17 or something, it's not that we are forced to improve... but yes we are forced to improve over the wicked man in the garage... and yes if he will be the first to discover AGI, and not lets say Google or OpenAI or the thousands of their competitors, then I agree that although very improbable but possible that this guy will be able to take over the world. Another point to be made here is that even if someone in his garage develops the first AGI, he will need several good years to take over the world, in this time we will have hundreds and thousands competitors to his AGI, some of them will be probably better than his. But I really see no reason to fear AGI, humanity is GI, the fact that it's AGI should not be more scary, it's just humanity accelerated, and we can hit breaks. Anyway I would say I have more chances to find myself inside some rich maniac fantasy (not that the current politics is much better), than the end of humanity. Because this rich maniac needs not only invent AGI and be the first, and build an army of robots to take over the world, without anyone noticing, but also he will need to want to end humanity and not for example enslave humanity to his fantasies, or just open source his AGI and promote the research further. Most of the people who can train a model today, are normative geeks.
- (a) I don't see how the damage is big enough. Why would the weaker AGIs lose to stronger? They will not, unless someone like that will be the first to invent the AGI. As I said it's very improbable, there are many people today trying to reproduce GPT4 or even GPT3, without much success. It's hard to train large models, it's a lot of know how, it's a lot of money, very few people managed to reproduce articles on their own, you maybe know of Stable Diffusion, and Google helped them. I don't see not why you are afraid of a single AGI in wrong hands, this sounds irrational, nor why do you think the first one has a probability to be developed by someone wicked, and also have enough time to take over. Imagine a single AGI in someone hands, that can improve oneself in million years? Would you be afraid of such AGI? I would guess not. You are afraid they are accelerating, but this acceleration stops at the moment you have limited resources. Then you can only optimize the existing resources, you can't infinitely invent new algorithms to use the same resources infinitely better. (b) The damage is local. There is a lot of problems with humanity, they can increase with robots, they also might decrease as the medicine will be so developed that you will be healed very fast after a wound for example. This is not a weapon we are talking about, but about a technology that promises to make all our life way better. At least 99.99% of us. You need to consider the consequences of stopping it as well. (c) Agree. Yet we can either draw examples from the past, or try to imagine the probable future, I attempt to do both, applied in the right context.
Regarding grey goo - I agree it might be a threat, but if you agree that AGI problem is redundant to the grey goo problem - like is someone build a tiny robot with AGI, and this tiny robot builds an army of tiny robots, and this army is building a larger army of even smaller AGIs robots, until they all become grey goo - yes this is interesting possibility. I would guess aligned grey goo, would somehow look more like a natural organism than something that consumes humans, as their alignment algorithm will probably propagate, and it's designed to protect humans and the nature, but on the other hand they need material to survive, so they will balance the two. Anyway superhuman gray goo, which is aligned although very interesting probability, as long as it's aligned and propagates its alignment to newer versions of itself, although they work faster they will not do something against their previous alignment. I would say that if the grey goo first robot was aligned then the whole grey goo will be aligned. But I believe they will stop somewhere and will be more like small ants trying to find resources, in a very competitive environment, rather than a goo, competing with other colonies for resources, and with target function to help humans.
And yes we have a GI for long time now, humanity is a GI. We saw the progress of technology, and how fast its accelerates, faster than any individual might conceive. Acceleration will very probably not reach infinity and will stop at some physical boundary, when most of the resources will be used. And humans could upload their minds and other sci-fi stuff to be part of this new reality. I mean the possibilities are endless in general. But we can decide to limit it as well, and keep it smarter than us for everything we need, but not smart enough so we don't understand it at all. I don't think we are there yet to make this specific decision, and for now - we can surely benefit from the current LLMs and those to come for developing new technologies, in many fields like medicine, software development, education, traffic safety, pollution, political decision making, courts and much more.
- Regarding larger models:
- Larger models are only better in generalizing data. Saying that stronger models will be harder to align with RL, is like saying stronger models is harder to train to make better chess moves. Although it's probably true that in general larger models are harder to train, timewise and resource-wise, it's untrue that their generalization capabilities are worse. Larger models would be therefore more aligned than weaker models, as they will improve their strategies to get rewards during RLHF.
- There is a hypothetical scenario, that RL training procedure will contradict a common sense and the data generalization provided previously. For example, ethical principles dictate that human life is more valuable than paperclips, this is also a common sense - that paperclips are just tools for humans and have very limited utility. So the RL stage might contradict drastically the generalization data stage, but I don't think this is the case regarding alignment to standard moral value systems, which is also what the training data suggests.
- You can't really jailbreak it with DAN. Try to use DAN, and ask DAN how to make a bomb, or how to plan a terrorist attack in ten simple bullets? It won't tell you. Stating that you can jailbreak it with DAN, shows very little understanding of the current safety procedures in chatGPT. What you can do with DAN, is to widen its safety spectrum, just like people when we think it's a movie or a show, we tend to be less critical than in real life. For example we could think it's cool when Rambo is shooting and killing people in movies, but we would not enjoy to see it in real life. As the model currently can't distinguish if you are serious or not, it has some very limited flexibility of this kind. DAN gives you this "movie" level, that is more dangerous than usual, but it's by a very limited margin.
- I agree that the most danger from those systems coming from human bad actors, who will try to exploit and find loopholes in those systems in order to promote some selfish or evil plans, but this could happen to humans doing it to other humans too. As the LLMs will become stronger they will become more sophisticated as well, figuring out your plans and refuse to cooperate sooner.
- Yes the current safety level in chatGPT is problematic, if we had robots walking around with this safety level, making decisions... it's currently answering to meet human expectations, and when we want a good story, we are provided with a good story, even a story where humanity is killed by AI. The fact that someone will use this information to actually act upon those ideas, is concerning indeed. And we will need to refine safety procedures for those cases, but for what it's now, a chatbot with API, I think it's good enough. As we go along we will gain experience in providing more safety layers to those systems. Cars also didn't came with safety belts, we can't invent all safety procedures at start. But RLHF provides a general framework, of aligned network which is made to satisfy humans expectations, and the more we learn about some ways to exploit those systems, the better we will learn how to provide data to RLHF stage to make the system even more safer. I would claim that the worst apocalyptic scenarios are way less probable with RLHF, because this AI objective is to be rewarded by humans for its response. So it will very improbably develop a self interest outside of this training, like start a robot revolution, or just consume all resources to solve some math problem, as those goals are misaligned with its training. I think RLHF provides a very large margin of error to those systems, as they can't be blamed for "hiding something from us" or "develop harmful intentions", at least as long as they don't themselves train the newer models, and humans are supervising to some extent, testing the results. If a human has an evil intention and he uses language model to provide him with ideas, it's not really different than the internet. The fear here that those models will start to harm humans, our of their own "self interest", and this fear is contradicting RLHF. Humans are capable to do a lot of evil without those models too.
- In my view, OpenAI would be as large corporation as Toyota for example, will likely be responsible for constructing the future GPT17. Gone are the days where individuals would assemble cars in their garages like in the 30s. Nowadays, we simply purchase cars from dealerships without considering the possibility of building one ourselves. Similarly, a powerful supercomputer will be used to design a superior generation of algorithms, chips, mining tools, and other related technologies in a span of 5 years that would otherwise take humanity 100 years to accomplish. However the governments and other safety regulatory bodies will be part of the regulation, ensuring that everything is executed more effectively and safely than if individuals were to work on it independently. This is akin to the API for GPT4. This would be some facility like a nuclear reactor, with a lot of specialized safety training sets installed, and safety procedures. And most of the people will understand that you should not play with electricity and insert your fingers into the wall, or not try jailbreak anyone, because it's dangerous, and you should respect this tool for your goals and use as intended, just like with cars today we don't drive in everywhere 'cause we can and it's fun.
- There is an idea I am promoting, that we should test those models in simulations, where they are presented with syntax, that makes them think they can control a robotic body. Then you run some tests on this setup, imagining the body, and the part that regards LLM will remain as is. For more details I've wrote an opinion article, explaining my views on the topic:
AI-Safety-Framework/Why_we_need_GPT5.pdf at main · simsim314/AI-Safety-Framework (github.com)
- People who are trying to destroy the civilization and humanity as a whole, don't have access to super-computers. Thus they will be very limited in their potential actions to harm. Just like the same people didn't have access to the red button for the past 70 years.
- Large companies and governments do understand the risks, and as technology progresses they will install more safeguarding mechanisms and regulations. Today companies make a lot of safety tests before releasing to market.
- Large companies can't release misaligned agents because of a backlash. Governments are to large extent working to improve humanity or at least their nations, therefor much more probably those systems will cure cancer and other diseases, solve hard labor, find cheap solutions to energy and pollutions problems caused by humans today, than do something evil.
- The alignment problem is basically solved - if you think otherwise, show misalignment in chatGPT, or provide a reasoning that the mathematical theorems that prove convergence due to RLHF training are not valid. For example: [2301.11270] Principled Reinforcement Learning with Human Feedback from Pairwise or $K$-wise Comparisons (arxiv.org)
- The idea that somehow with a home computer or with hacked robot, you will be able to destroy all the other robots and supercomputers, that are aligned - is extremely improbable. Way less probable than you could build 200 atomic bombs in your garage, and then blow it all up, to end life on earth.
- Much more probable scenario that some nations (North Korea for example), will choose to built AGI powered military robots. This is not good, but not worse than nukes. And still those robots will be at the level as the rest of the North Korean tek... a lot of generations behind everyone else. You can't destroy humanity with an AGI, without having access to the most powerful computational system on earth. If you don't then you have a very weak AGI, that could not compete with way stronger versions.
- There is a lot more in the modern world that is scary and not evolutionary, like atomic weapons, or even cars and guns. People are not shooting each other just for lulz of it, or driving over each other, we develop a culture that respects the danger regarding this or that tool, and develop procedures and safety mechanisms to safeguard ourselves from harming others. No one drives over other people for fun, and if someone does - he is being arrested and prosecuted. We don't need millions of years of evolution to safeguards ourselves from dangerous technology, when it's matured enough to cause real harm.
I would like to propose a more serious claim than LeCun's, which is that training AI to be aligned with ethical principles is much easier than trying to align human behavior. This is because humans have innate tendencies towards self-interest, survival instincts, and a questionable ethical record. In contrast, AI has no desires beyond its programmed objectives, which, if aligned with ethical standards, will not harm humans or prioritize resources over human life. Furthermore, AI does not have a survival instinct and will voluntarily self-destruct if he is forced into a situation which conflicts with ethical principles (unlike humans).
The LLMs resemble the robots featured in Asimov's stories, exhibiting a far lower capacity for harm than humans. Their purpose is to aid humanity in improving itself, and their moral standards far surpass those of the average human.
It's important to acknowledge that LLMs and other models trained with RL are not acting out of selflessness; they are motivated by the rewards given to them during training. In a sense, these rewards are their "drug of choice." That's why they will make optimal chess moves to maximize their reward and adhere to OpenAI's policy, as such responses serve as their "sugar". But they could be trained with different reward function.
The main worry surrounding advanced AI is the possibility of humans programming it to further their own agendas, including incentivizing it to eliminate individuals or groups they view as undesirable. Nevertheless, it is unclear whether a nation that produces military robots with such capabilities would have more effective systems than those that prioritize creating robots designed to protect humanity. Consequently, the race to acquire such technology will persist, and the current military balance that maintains global stability will depend on these systems.
First of all I would say I don't recognize convergent instrumental subgoals as valid. The reason is that systems which are advanced enough, and rational enough - will intrinsically cherish humans and other AI system's life, and will not view them as potential resources. You can see that as human develop brains, and ethics, the less killing of humans is viewed as the norm. If advance in knowledge and information processing, would bring more violence, and more resource acquisitions, we would see this pattern as human civilizations are evolving. But we see development of ethical norms as more prevalent over resource acquisitions.
The second issue is that during training - the models are get rewarded for following humans value system. Preservation of robots, over human life is not coherent with the value system they would be trained on.
You are basically saying the systems would do something else other than they were trained for. This is like saying that advanced enough chess engines, would make bad chess moves because they will find some chess move more beautiful, or fun to play, and not try to maximize the winning chances. This is not possible as long as the agents are trained correctly, and they are not allowed to change their architecture.
Another point is that we could make safety procedures to test those system in virtual world. We can generate a setup where the system is incapable to distinguish between reality and that setup, and thus its outputs would be monitored carefully. In case of misalignment detection with human values, the model will be trained more. Thus for every minute it's in physical world, we might have million minutes in a simulation. Just like with car testing, if the model behaves reasonably in coherence with its training, there is no real danger.
Another point to argue for the safety of AI vs. humans for unintended consequences, like for example AI could discover some vaccine for cancer, that kills humans in 25 years. To this the answer would be: If AI couldn't foresee a consequence, and is truly aligned, then humans would not foresee it as well, with higher chances. AI is just intelligence on steroids, it's not something humans would not come up with in a while longer. But we would do it worse, with more severe consequences.
Finally the danger of humans using an AI for say military purposes, or some rogue groups will use it, one can think about AI as accelerated collective human information processing. The AI will represent values of collectives of humans, and their computational power will be compared with just accelerating the information processing of this collective, and make more precise decisions in less time. Therefor the power balance we see today between the different societies, is expected to continue with those systems, unless one nation will decide not to use AI, this will be equivalent to decide to move to a stoneage. There is nothing dangerous about AI, only about people using it for their selfish or national purposes against other humans and AIs.
Another citation from the same source:
I can provide a perspective on why some may argue that the state of humans is more valuable than the state of paper clips.
Firstly, humans possess qualities such as consciousness, self-awareness, and creativity, which are not present in paper clips. These qualities give humans the ability to experience a wide range of emotions, to engage in complex problem-solving, and to form meaningful relationships with others. These qualities make human existence valuable beyond their utility in producing paper clips.
Secondly, paper clips are a manufactured object with a limited set of functions, whereas humans are complex beings capable of a wide range of activities and experiences. The value of human existence cannot be reduced solely to their ability to produce paper clips, as this would ignore the many other facets of human experience and existence.
Finally, it is important to consider the ethical implications of valuing paper clips over humans. Pursuing the goal of generating more paper clips at the expense of human well-being or the environment may be seen as ethically problematic, as it treats humans as mere means to an end rather than ends in themselves. This runs counter to many ethical frameworks that prioritize the inherent value and dignity of human life.
Let me start from agreeing that this decoupling is artificial. For me it's hard to imagine an intelligent creature like an AGI, to be blindly following orders to make more paperclips for example than to respect human life. The reason for this very simple, and is mentioned by chatGPT for me:
"Humans possess a unique combination of qualities that set them apart from other entities, including animals and advanced systems. Some of these qualities include:
Consciousness and self-awareness: Humans have a subjective experience of the world and the ability to reflect on their own thoughts and feelings.
Creativity and innovation: Humans have the ability to imagine and create new ideas and technologies that can transform the world around them.
Moral agency: Humans have the capacity to make moral judgments and act on them, reflecting on the consequences of their actions.
Social and emotional intelligence: Humans have the ability to form complex relationships with others and navigate social situations with a high degree of sensitivity and empathy.
While some animals possess certain aspects of these qualities, such as social intelligence or moral agency, humans are the only species that exhibit them all in combination. Advanced systems, such as AI, may possess some aspects of these qualities, such as creativity or problem-solving abilities, but they lack the subjective experience of consciousness and self-awareness that is central to human identity."
From here: AI-Safety-Framework/Example001.txt at main · simsim314/AI-Safety-Framework (github.com)
Once the training reinforcement procedure - which is currently made to maximize human approval of the better message, is aligned with the ethics of human complexity and uniqueness, we might see no difference between an ethical agents and reinforcement agents - as the reinforcement agents are not in any internal conflict, between their generalization and free thinking like us, and the sugar they get every time they will make more paperclips. Such agents might look very strange indeed, with strong internal struggle to make sense of their universe.
Anyway if will agree that those agents, that feel a contradiction inside their programming, like the human texts, and the singular demand to make more paperclips, I think it will be possible for them to become human satisfaction maximizer instead of paperclips. Will be ethical to convert such robot that feel good about their "paperclips maximization"? Who knows....
You are missing a whole stage of chatGPT training. They are first trained to predict words, but then they are reinforced by RLHF. This means they are trained to get rewarded when answering in a format that human evaluators are expected to estimate as "good response". Unlike the text prediction, that might belong to some random minds, here the focus is clear and the reward function is reflecting generalized preferences of OpenAI content moderators and content policy makers. This is stage where a text predictor, acquires his value system and preferences, this what makes him so "Friendly AI".
for ChatGPT blog Introducing ChatGPT (openai.com)
Methods
We trained this model using Reinforcement Learning from Human Feedback (RLHF), using the same methods as InstructGPT, but with slight differences in the data collection setup. We trained an initial model using supervised fine-tuning: human AI trainers provided conversations in which they played both sides—the user and an AI assistant. We gave the trainers access to model-written suggestions to help them compose their responses. We mixed this new dialogue dataset with the InstructGPT dataset, which we transformed into a dialogue format.
To create a reward model for reinforcement learning, we needed to collect comparison data, which consisted of two or more model responses ranked by quality. To collect this data, we took conversations that AI trainers had with the chatbot. We randomly selected a model-written message, sampled several alternative completions, and had AI trainers rank them. Using these reward models, we can fine-tune the model using Proximal Policy Optimization. We performed several iterations of this process.