Thirty random thoughts about AI alignment
post by Lysandre Terrisse · 2024-09-15T16:24:10.572Z · LW · GW · 1 comment
Why does this post exist?
In order to learn more about my own opinion about AI safety, I tried to write a thought every day before going to bed. Of course, I failed to do this every day, which is why I have only thirty thoughts since January. However, as I am still happy with the result, I am sharing them. Most of these thoughts have been inspired by arguments I have read over the years. I tried to cite them in my thoughts, but sometimes I couldn't remember the source. For instance, I am pretty sure that the phrasing of the third thought is inspired by something I have read, but I cannot remember what. Anyway, I hope these thoughts will be useful to someone!
The thoughts
First thought
There has been a lot of progress in AI Safety this last decade. Initially, AI systems weren't checked against racism, sexism, and other forms of discrimination at all. For instance, it took nearly two years before someone noticed that Word2Vec thought that doctor – man + woman = nurse. Now, AI Safety is taken far more seriously, with the number of people working on the alignment problem growing from roughly 10 to around 300. However, although the field of AI Safety research grew extremely fast, it is still far behind AI capability research, with its 40,000 researchers. Not only that, but we still have no idea how we could solve the alignment problem for superintelligences. Actually, the only solution we currently have for the alignment problem is to develop very smart and aligned AIs recursively until one of them is smart enough to solve the alignment problem, and aligned enough to tell us the solution. A solution to the alignment problem needs revolutionary discoveries in mathematics, ethics, philosophy, epistemology, game theory, provability theory, and much more. Taking this into account, it may not be an exaggeration to say that AI Safety is hundreds of years behind AI capability.
Second thought
The Eliciting Latent Knowledge problem, or the ELK problem for short, is generally described as follows. Suppose you have a camera, and an AI that always correctly predicts what the camera will see. If the AI is able to predict the camera's observations, then it is probably the case that it understands what is happening. Therefore, if two different scenarios lead to the same camera observations, then the AI still probably knows the difference between the two. Now, suppose your AI predicts that the camera will make an observation O. Also, suppose O is consistent with scenarios A and B. By just looking at the AI's prediction, we cannot guess whether the scenario corresponds to A or B. However, the AI, which probably understands what's happening, probably knows whether we are in scenario A or B. The ELK problem is about how we can extract from the AI the knowledge of which scenario we are in. It seems reasonable to say that civilizations that know how to solve the alignment problem also know how to solve the ELK problem. However, we still seem hundreds of years away from being able to solve this problem. Therefore, we probably won't solve the alignment problem during the next century.
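To make the setup concrete, here is a minimal toy sketch in Python (every class and name here is hypothetical, invented purely for illustration; this is not taken from the ELK report): two different scenarios produce the same camera observation, yet the predictor's latent state still distinguishes them, and that latent state is exactly what we would like to elicit.

```python
# Toy illustration of the ELK setup: two scenarios, one observation, one latent truth.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    diamond_in_vault: bool   # ground truth the camera cannot see directly
    screen_in_front: bool    # a screen displaying a fake diamond

def camera_observation(s: Scenario) -> str:
    # A real diamond and a screen showing a diamond look identical to the camera.
    return "image_of_diamond" if (s.diamond_in_vault or s.screen_in_front) else "image_of_empty_vault"

class Predictor:
    """Predicts observations; its latent state happens to track the ground truth."""
    def __init__(self, scenario: Scenario):
        self.latent = {"diamond": scenario.diamond_in_vault}  # the knowledge we want to elicit
        self._scenario = scenario

    def predict(self) -> str:
        return camera_observation(self._scenario)

A = Scenario("diamond really there", diamond_in_vault=True, screen_in_front=False)
B = Scenario("diamond stolen, screen in front", diamond_in_vault=False, screen_in_front=True)

for s in (A, B):
    p = Predictor(s)
    print(s.name, "-> prediction:", p.predict(), "| latent knowledge:", p.latent)
# Both predictions are "image_of_diamond": the observation alone cannot tell A from B,
# yet the predictor's latent state does. Eliciting that latent state is the hard part.
```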
Third thought
In machine learning, it is a known fact that, if you optimize a picture so that it is maximally classified as X, then you do not end up with a picture of X; you end up with what looks like complete noise. Similarly, if you create a reward function that is supposed to incentivize X, and you create an AI that maximizes this reward function, then you do not end up with an AI that is incentivized to do X.
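As a rough illustration of the first claim, here is a minimal sketch (assuming PyTorch is installed; the tiny untrained classifier is only a stand-in for a real trained model such as a ResNet): gradient ascent on the score of a target class produces an input that the classifier rates highly but that still looks like noise to a human.

```python
# Minimal sketch: optimize an image to maximize one class's logit; the result is noise-like.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in classifier; a real experiment would load a trained model instead.
classifier = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 10))

image = torch.rand(1, 3, 32, 32, requires_grad=True)  # start from random noise
target_class = 3
optimizer = torch.optim.Adam([image], lr=0.05)

for step in range(500):
    optimizer.zero_grad()
    logits = classifier(image)
    loss = -logits[0, target_class]      # gradient *ascent* on the target logit
    loss.backward()
    optimizer.step()
    image.data.clamp_(0.0, 1.0)          # keep pixel values in a valid range

print("target logit after optimization:", classifier(image)[0, target_class].item())
# The optimized image gets an arbitrarily high score for the target class, yet to a
# human it still looks like noise rather than an instance of that class.
```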
Fourth thought
Suppose you hear that some physicists have discovered a new way to convert worthless matter into energy, thereby solving the energy crisis. However, you also learn that this matter-to-energy converter requires the creation of a black hole on Earth. At first, you may think that this is too risky, and that a small mistake could make the black hole absorb our planet in a fraction of a second. Since you are worried, you ask the researchers what they think about the risks caused by such a device, and learn that they are as worried as you, and that half of them believe there is at least a ten percent chance of human extinction from such a black hole. When you use the most rigorous models of physics we have to date, you come to the conclusion that black holes indeed are very unstable, and that a small perturbation could make them absorb the entire planet. In that case, would you be willing to let physicists develop such black holes? This thought experiment, although completely made up, perfectly represents what is currently happening in the field of AI. When you hear for the first time the idea of developing a superintelligence with a flawed objective, you intuitively think this is a bad idea. Then, when you look at surveys of expert opinion, you find out that 48% of experts think that AI could, with at least ten percent probability, cause a scenario as bad as human extinction. When you use the most rigorous models we have to represent agents, you find out that they tend to seek power, avoid being shut down, and try to gather resources in a way that would make life impossible on Earth.
Fifth thought
Some people argue that we know how to align superintelligences. Others argue that we do not know how to align superintelligences. Who is right, and who is wrong? This question is generally considered the main disagreement in the debate about whether AI is an existential risk. But this question seems trivial. How could we possibly be uncertain about whether we know or do not know something? Yet, as it stands, we do disagree on this question. However, if we are unsure, or if we disagree, about whether we know something, then we, as a group, do not CONFIDENTLY know this thing. If people who think we know how to align a superintelligence fail to convince the others, then we, as a group, do not confidently know how to align a superintelligence. The question should then be "How confident are we that we can align superintelligences?". Since 48% of experts believe that we have at least a ten percent chance of all dying from AI (without even assuming that we will develop a superintelligence), the answer is "Not enough".
Sixth thought
Personally, my main uncertainty about whether AI is an existential risk doesn't come from whether we will be able to align superintelligences. Instead, it comes from whether we will develop a superintelligence at all. I think this is the strange assumption that makes the future completely wild and incoherent, and I put around a 15% chance on it. However, I am 90% sure that we are all going to die if we do develop a superintelligence. And I think that most people are like me, since it seems unequivocal to me that, as Daniel Ziegler said, "If we have a really intelligent system and that we give it basically any ability to influence the world, then if it wants to, it can probably cause irrecoverable catastrophes in a small number of actions". Moreover, from Alex Turner's theorems about power-seeking behavior, I am extremely confident that superintelligences will by default want to cause irrecoverable catastrophes. And finally, I am very confident that preventing this by-default behavior (and therefore solving the alignment problem) is too hard to achieve in the next two centuries.
Seventh thought
It seems obvious that AIs can become way smarter than we are, since current transistors are four million times faster than synapses, computers can be way bigger than brains, biological life didn't evolve specifically to maximize computing power or intelligence, and computers can use way more energy than our brains. It also seems obvious that, when smart enough, AIs will have an incentive to behave in ways that ensure they are not limited in pursuing their goals (as Maheen Shermohammed and Vael Gates said), since being limited in pursuing a goal is a suboptimal way to pursue this goal. A little less obvious is the fact that AIs will therefore have an incentive to gather all the resources and atoms of the attainable universe at a fraction of the speed of light. However, this seems to hold, since we already have the incentive to control systems at the atomic level over industrial scales (like integrated circuits), and because doing so leads to larger sets of outcomes than not doing so (and therefore most parametrically-retargetable agents will do so, according to Alex Turner's theorems). It seems obvious that, if the AI does gather all the atoms of our planet at a fraction of the speed of light, then this would result in the extinction of all life on Earth, since we cannot stay alive if an AI tears every atom of our planet apart. But it is at first hard to see how a superintelligence could do so, until you realize we already know how to do so at an industrial scale, and that doing it at a planetary scale doesn't require any insight that a superintelligence doesn't have. Therefore, if we build a superintelligence, then the only two things that would separate us from extinction are the possibility that the superintelligence is incompetent, and the possibility that we are wrong when saying the AI has an incentive to gain that much control.
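For the "larger sets of outcomes" point, here is a toy numerical illustration (my own simplification, not Alex Turner's actual theorem): sample many random reward functions over terminal outcomes and count how often the option that keeps more outcomes reachable is the optimal one.

```python
# Toy illustration of retargetability: options reaching more outcomes win for most rewards.
import random

random.seed(0)

# Hypothetical setup: "stay small" reaches 1 terminal outcome,
# "gather resources" reaches 10 terminal outcomes.
outcomes_small = ["o0"]
outcomes_gather = [f"o{i}" for i in range(1, 11)]

trials = 100_000
gather_preferred = 0
for _ in range(trials):
    # Sample a reward function: an i.i.d. uniform reward for every terminal outcome.
    reward = {o: random.random() for o in outcomes_small + outcomes_gather}
    best_small = max(reward[o] for o in outcomes_small)
    best_gather = max(reward[o] for o in outcomes_gather)
    if best_gather > best_small:
        gather_preferred += 1

print(f"fraction of reward functions preferring 'gather resources': {gather_preferred / trials:.3f}")
# With 10 reachable outcomes vs 1, the expected fraction is 10/11 ≈ 0.909: most reward
# functions favor the option that keeps more outcomes reachable.
```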
Eighth thought
We often underestimate the ingenuity and power of optimal policies. For instance, suppose we create a superintelligence but, since this superintelligence is dangerous, we allow ourselves to get only one bit of information from it. For instance, it could be a boolean answer (0 or 1) to a question we gave it. After obtaining this bit of information, we destroy the superintelligence, and restart with a new one. What could go wrong? If we obtain only one bit of information every time, and one bit of information cannot cause extinction, then this cannot go extinction-level wrong, right? Well, no, but the reason why this is wrong is hard to see. Try to pause and think for five minutes. The answer is that, since we build a new superintelligence after destroying the previous one in order to get more information, we end up with more than one bit of information. If we create 1,000 such superintelligences, then they will probably find a way to cooperate and give a dangerous sequence of boolean answers. I can't speak for you, but personally, I would never have thought about this failure mode. Actually, I was shocked when I saw it for the first time in Safe Uses of AI Oracles. However, once you realize this flaw exists, it becomes very obvious. We may therefore ask "If we failed to see such an obvious failure mode, then what else have we missed?". And this is worrying to me, because I think it is 99.99% likely that we have missed a failure mode that a superintelligence can exploit to instrumentally cause a catastrophic outcome.
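A quick back-of-the-envelope calculation (my own, just to make the information-theoretic point concrete) shows how fast "one bit per oracle" adds up:

```python
# One bit per oracle run, but the runs compose: 1,000 runs jointly select one of
# 2**1000 possible answer sequences, far more than enough to encode a long message.

n_oracles = 1_000
total_messages = 2 ** n_oracles
print(f"{n_oracles} one-bit answers carry {total_messages.bit_length() - 1} bits,")
print(f"i.e. they select one of about 10^{len(str(total_messages)) - 1} distinct messages,")
print(f"or roughly {n_oracles // 8} bytes ≈ {n_oracles // 8} ASCII characters of free-form text.")
```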
Ninth thought
Suppose you have an agent, an environment, and a reward function. The reward function takes the environment as an input, and outputs a score. The agent takes the environment and the reward function as inputs, and outputs the sequence of actions that modifies the environment so that it has the highest score according to the reward function. Because of adversarial examples, the resulting environment will be completely unexpected. For instance, if the environment is a picture, and the actions are tweaks to the picture, then the resulting environment will look like complete noise. To give another example, if the environment is a Minecraft world, and the actions are key presses that control the character in the Minecraft world, then the resulting environment will look like a Minecraft world full of random blocks and entities everywhere. If you are like me, you may not want our universe to look like either of these two examples.
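Here is a minimal sketch of that interface (the random linear "reward function" and the hill-climbing "agent" are hypothetical stand-ins, not any real system): the agent dutifully maximizes the score, and the resulting environment is an extreme, noise-like pattern rather than anything we meant.

```python
# Minimal sketch of reward(env) -> score and agent(env, reward) -> optimized env.
import random

random.seed(0)
SIZE = 64  # a tiny 8x8 "picture", flattened into 64 pixel values in [0, 1]

# Stand-in for an imperfect learned judge ("how cat-like is this picture?"):
# a fixed random linear scorer. Purely hypothetical.
weights = [random.gauss(0, 1) for _ in range(SIZE)]

def reward(env):
    """The reward function: takes the environment, returns a score."""
    return sum(w * p for w, p in zip(weights, env))

def agent(env, reward_fn, steps=5_000):
    """The agent: hill-climbs over single-pixel tweak actions, keeping tweaks that raise the score."""
    env = list(env)
    best = reward_fn(env)
    for _ in range(steps):
        i = random.randrange(SIZE)
        old = env[i]
        env[i] = min(1.0, max(0.0, old + random.uniform(-0.2, 0.2)))
        new = reward_fn(env)
        if new >= best:
            best = new
        else:
            env[i] = old  # revert tweaks that lower the score
    return env

start = [0.5] * SIZE                      # a bland grey picture
optimized = agent(start, reward)
extreme = sum(p in (0.0, 1.0) for p in optimized)
print(f"{extreme} of {SIZE} pixels were driven to an extreme value")
# The optimized "picture" is each pixel pushed to 0 or 1 depending on the sign of its
# weight: an extreme, noise-like pattern, not a picture of anything we wanted.
```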
Tenth thought
Suppose that we have a superintelligence controlling a character in Minecraft, and that the reward function takes the entire map as an input, and outputs the superintelligence's score based just on that. As I previously argued, we should expect the superintelligence to find an adversarial example and to build it. The map would therefore look like random blocks and entities everywhere. However, this isn't what happens in real life. In real life, the superintelligence's reward function isn't over the entire universe, but over a fraction of it (like over the semiconductor used in the superintelligence's camera to detect light). Then, what happens outside of this area? To formulate this in Minecraft, suppose that we still have a superintelligence controlling a character, but that this time, its reward function takes as input only the 1000×1000 area around the origin of the map. In this area, we will still have an adversarial example looking like random gibberish, but the question remains: what happens outside? Well, if gathering all the resources of the map enables the superintelligence to build a "better" adversarial example in this 1000×1000 area, then we should expect it to do so as long as it is parametrically retargetable. As Omohundro's thesis claims, either the superintelligence doesn't care about what happens in a given region, and then consumes the resources in that region to serve its other goals, or else the superintelligence does care about that region, in which case it optimizes that region to satisfy its values. I do not think that Omohundro's thesis is true in every environment. For instance, in the case of Minecraft, if the player is in Creative mode, then gathering all the resources of the map outside the 1000×1000 area would be a waste of time (and therefore of reward). Actually, Omohundro's thesis may be false even in the majority of environments. However, I think that, if the superintelligence is embedded in the environment, then the probability that this environment respects Omohundro's thesis increases by a lot. For instance, in the case of Minecraft, if the superintelligence is embedded (e.g. if it is running on a redstone computer), then it seems way more likely that it will apply drastic changes to the entire map (except the 1000×1000 area around the origin), as this is its only way to improve and protect itself. Moreover, I think that, if exploiting the environment makes the superintelligence more effective at gathering resources, then the probability that Omohundro's thesis holds in that environment increases again. In the case of our universe, as the superintelligence would be embedded, and as it needs to exploit the environment in order to gather resources, Omohundro's thesis seems very likely to hold in our environment.
Eleventh thought
Most people initially believe that, if the superintelligence is indeed smart, then it will be smart enough to know what the true ethics is, and therefore won't try to achieve stupid goals. I agree with the first claim. Indeed, I believe that there is a One True Ethics (or at least, I don't consider scenarios which don't have a One True Ethics, since these scenarios are worthless to me), and that superintelligences will almost certainly know what this One True Ethics is. However, I don't think this implies that superintelligences won't try to achieve stupid goals. We know that, given any reward function, if we apply value iteration with a model of the environment for long enough, we converge towards an optimal policy for this reward function. In other words, we can make a superintelligence for any given reward function. This is true even if the reward function rewards states of the world that we consider stupid (like an environment full of paperclips). Moreover, since for the vast majority of reward functions, their argmax seems completely random (because of overfitting, adversarial examples, Goodhart's law, or whatever word you use for overoptimization), we should expect superintelligences to try to achieve meaningless outcomes by default.
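As a concrete illustration of the value-iteration claim, here is a minimal sketch on a made-up three-state environment (the states and the "stupid" reward function are invented for illustration): whatever reward function you plug in, iterating long enough yields the optimal policy for that reward function.

```python
# Value iteration on a toy deterministic MDP: competence is independent of the goal's meaning.
GAMMA = 0.9
STATES = ["home", "factory", "paperclip_pile"]
ACTIONS = ["stay", "move"]

def transition(s, a):
    # Deterministic toy dynamics: "move" steps one state to the right, "stay" stays.
    if a == "stay":
        return s
    return {"home": "factory", "factory": "paperclip_pile", "paperclip_pile": "paperclip_pile"}[s]

def value_iteration(reward, iters=200):
    V = {s: 0.0 for s in STATES}
    for _ in range(iters):
        V = {s: max(reward[transition(s, a)] + GAMMA * V[transition(s, a)] for a in ACTIONS)
             for s in STATES}
    policy = {s: max(ACTIONS, key=lambda a: reward[transition(s, a)] + GAMMA * V[transition(s, a)])
              for s in STATES}
    return policy

# A reward function most humans would call stupid: it only values the paperclip pile.
stupid_reward = {"home": 0.0, "factory": 0.0, "paperclip_pile": 1.0}
print(value_iteration(stupid_reward))
# -> {'home': 'move', 'factory': 'move', 'paperclip_pile': 'stay'}: the optimal policy
#    marches every state toward the paperclip pile, however meaningless that goal is to us.
```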
Twelfth thought
In order to make a superintelligence safe, one may propose to give it a reward function whose maximum reward is easily reachable. By doing so, it may be argued, once the superintelligence achieves the highest reward, it no longer has an incentive to gather all the resources of the attainable universe, since this wouldn't achieve a better score. However, everyone agrees that a system which double-checks, at very little cost, that it got things right would be more effective at its task. Therefore, the systems that are induced by the optimization pressure (those we will get after natural selection or training) will be the ones that double-check, since they have an evolutionary advantage over the other systems. Therefore, the superintelligence would probably try to double-check over and over, at very little cost, that it got the highest amount of reward. In that case, it would still try to gather resources in order to ensure that it indeed achieved its goal.
Thirteenth thought
I think that interpreting and understanding the whole superintelligence is not going to work. Firstly, the superintelligence will very likely, by default, use concepts we cannot understand. Secondly, if the superintelligence is smart enough to figure out that we are trying to interpret it (which it is), then it may have an incentive either to make us wrongly believe it is aligned or to send us an infohazard. Thirdly, the superintelligence is way bigger than we are, and therefore no one can understand everything that is happening in it. Of course, we may be able to compress its thoughts, but this compression would probably not be enough in the case where the superintelligence is bigger than us by many orders of magnitude (otherwise it may break information theory). However, we do not need to understand everything that the superintelligence does. Instead, we may use a tool that interprets the whole superintelligence for us, and then returns a single bit telling us whether it is aligned or not. This looks more promising information-theoretically speaking, and as Paul Christiano pointed out in "Cryptographic Boxes for Unfriendly AI", the system interpreting the whole superintelligence can be homomorphically encrypted so that we need to decrypt only a single bit (the one returned by this system), therefore preventing any infohazard from the superintelligence. The problem therefore becomes "How can we create a system that robustly interprets the whole superintelligence and then figures out whether it is aligned or not, while taking into account the fact that the superintelligence may have an incentive to deceive this system (since it may have an incentive to deceive us)?". While writing this, I changed my mind about interpretability: I initially thought it was fake alignment research, and now I think it is one of the most promising paths towards alignment.
Fourteenth thought
Suppose that (1) we have a superintelligence, that (2) this superintelligence has by default the incentive to gather almost all the resources of the attainable universe in a way that destroys life on Earth, that (3) the superintelligence has the capability to gather almost all the resources of the attainable universe in such a way, and that (4) we do not know how to prevent this by-default behavior. Then, what should we expect? Of course, this question is rhetorical (although I would argue that most AI safety researchers, myself included, aren't able to formalize this precisely enough to logically derive that Earth will be destroyed by the superintelligence). Now, the question I want to ask is "What is the probability that these four points are all true at the same time?". To make things easier, we will suppose (conservatively) that the truth values of these four points are independent events. Personally, the first point seems unlikely to me, with only a 17% chance of happening. It would be interesting to check whether this is more or less than the typical AI researcher's estimate. For the second point, the probability that the superintelligence has the incentive to gather all the resources of the attainable universe seems very likely to me, at around 96%. This is mainly because of Alex Turner's theorems stating that, for most parameters, parametrically retargetable decision-makers tend to navigate towards larger sets of outcomes. According to his theorems, since the superintelligence can reach X times more outcomes by gathering all the resources of Earth than by not doing so, it is X times more likely to do so than not. I recommend looking at his article "Parametrically Retargetable Decision-Makers Tend To Seek Power", which (despite the maths being hard to understand) contains very intuitive graphs about Pac-Man navigating towards larger sets of outcomes. The third point, stating that superintelligences can gather that many resources, seems obvious to me. We already have theoretical plans for how we could, using von Neumann probes, gather almost all the resources of the attainable universe. Therefore, superintelligences would probably figure out a way to do so. However, the probability that we give to this point depends on what we mean by "almost all" the resources. Personally, I would give a 92% probability that the superintelligence will be able to gather all the resources of Earth in less than one hour. The fourth point seems trivial to me, but since there is still a bit of disagreement between researchers, I would give it "only" an 86% probability. This leads me to a total risk of 12.9%.
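For reference, the arithmetic behind the 12.9% figure, under the independence assumption stated above:

```python
# The product behind the "total risk of 12.9%" figure, assuming the four points are independent.
p_build_superintelligence = 0.17   # (1) we ever build one
p_incentive_to_gather     = 0.96   # (2) it wants to gather the resources by default
p_capable_of_gathering    = 0.92   # (3) it is able to do so
p_no_fix_found            = 0.86   # (4) we cannot prevent this by-default behavior

total_risk = (p_build_superintelligence * p_incentive_to_gather
              * p_capable_of_gathering * p_no_fix_found)
print(f"{total_risk:.3f}")  # ≈ 0.129, i.e. 12.9%
```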
Fifteenth thought
If the problem of adversarial examples is solved, then have we solved alignment? It depends on what you mean by "solving the problem of adversarial examples". For instance, if you have a superintelligence which tries to speedrun Super Mario World, and, to do so, it uses a bug to insert code that automatically teleports Mario to the end of the game, then is that an adversarial example? Or if you have a superintelligence rewarded according to whether there is a diamond in a safe, which decides to put a giant screen displaying a diamond in front of the reward function's sensors (this is from the ELK problem), then is that an adversarial example? Personally, instead of answering "Is this outcome an adversarial example?" with a Yes or a No, I prefer to talk about the "adversarial-exampleness" of the outcome. For instance, I consider that returning a completely noisy image is 100% an adversarial example, that using the bug in Super Mario World is 35% an adversarial example, and that putting a giant screen in front of the reward function's sensors is 22% an adversarial example. Now, if by "solving the problem of adversarial examples", you mean that there are no longer outcomes that are 100% adversarial examples (there are no longer adversarial pictures that look like complete noise), then we still haven't solved alignment (since the Super Mario World problem and the ELK problem are still here). However, suppose that we completely solved the problem of adversarial examples. In that case, if you are trying to build a reward function that rewards the superintelligence according to whether it is doing the most good, then since your reward function never fails anymore (because there are no longer any examples that make it fail), you have actually succeeded at finding the One True Ethics. This probably means that this assumption is way too optimistic. However, even granting it, you may still fail at alignment, because of inner alignment. Despite this, I think that in this case you have reduced the risk of extinction by superintelligences by a factor of at least 10, and I would put "only" a 1.2% chance of failure.
Sixteenth thought
Suppose we have a superintelligence which arranges atoms in a specific manner. Now, consider two sets of outcomes: the first set contains every outcome achievable when using only atoms located in our galaxy, whereas the second set contains every outcome achievable when using at least some atoms outside our galaxy. Is the second set of outcomes bigger than the first? It depends. If the superintelligence is somewhat incompetent, then there wouldn't be any outcome in which it uses atoms outside our galaxy. However, if we suppose that our superintelligence is very smart, then the second set of outcomes seems way, way larger than the first. Let's remember that the attainable universe contains around 10^80 atoms, whereas our galaxy contains only 10^67 atoms. In other words, if we assume the superintelligence could reach any atom, then our galaxy contains about 0.00000000001% of all the atoms reachable by the superintelligence. We may conclude that the second set of outcomes is at least 10^13 times bigger than the first. Therefore, we may argue that, when smart enough, the superintelligence is at least 10^13 times more likely to use at least one atom that is outside our galaxy than not to do so. In reality, giving one more atom to the superintelligence does not increase the number of possible outcomes by 1: it could increase it by a lot (because adding one atom increases the number of arrangements by a lot), or it could not increase it at all (since two very similar outcomes may be considered as a single one in the superintelligence's model of the world).
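The back-of-the-envelope numbers behind the 10^13 factor (both atom counts are rough order-of-magnitude estimates, as in the text, and the ratio is only a crude proxy given the caveat above):

```python
# Rough order-of-magnitude comparison between galactic and extragalactic atom budgets.
atoms_attainable_universe = 10 ** 80
atoms_milky_way           = 10 ** 67

fraction_in_galaxy   = atoms_milky_way / atoms_attainable_universe
naive_outcome_ratio  = atoms_attainable_universe / atoms_milky_way

print(f"fraction of reachable atoms inside our galaxy: {fraction_in_galaxy:.0e}")   # 1e-13
print(f"naive size ratio between the two outcome sets: {naive_outcome_ratio:.0e}")  # 1e+13
```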
Seventeenth thought
There seems to be a huge inconsistency in people's reasoning and actions when it comes to the risks caused by AI. On one side, we have all the surveys in which the median respondent thinks there is around a 10% chance of extinction by AI, and nearly everyone says we should pay attention to AI Safety. On the other side, we have nothing. By "nothing", I mean that there is no one shouting "OH MY GOD WE MAY ALL DIE IF WE BUILD A SUPERINTELLIGENCE". I do not think we know the reason why there is such an inconsistency in people's minds. Maybe they think that they are doing more than the average person. Or maybe they think that, if they were to study the topic in detail, they would realize their mistake and change their mind. Or maybe they wrongly think that the experts aren't worried about it. Or maybe, because this seems too sci-fi, they prefer to think about it as the story of a parallel universe. Honestly, this kind of behavior is starting to scare me, not only because it is what the others do, but also because it is what I do. We should all be screaming in terror at such silence.
Eighteenth thought
There has already been a lot of existential risk in the last 150 years. Toby Ord has discussed how, before the Trinity test, some people at the Manhattan Project feared that it might ignite the atmosphere, thereby destroying all life on Earth. He also gave many other reasons why we should expect the risk of extinction in this century to be quite big. For instance, if the development of dangerous bioweapons were to become very easy over time, then a single malevolent actor could cause the extinction of every species having the ability to achieve the Cosmic Endowment. Moreover, since a lot of risks must currently be unknown, we probably underestimate the odds of extinction. Therefore, we may argue that, although superintelligences may cause more than a 10% risk of extinction of all life on Earth, developing a superintelligence may be safer than not doing so. I do not think that this is true. Of course, I agree with Toby Ord that, without a superintelligence, our odds of extinction are very big (at least 30%, in my view). But I think that these odds are way bigger with a superintelligence. Personally, when I say that I think there is more than a 10% risk of extinction by a superintelligence, I take into account the uncertainty about whether we will ever develop a superintelligence. If I were to assume that we will build a superintelligence, then I would think that there is more than a 95% risk of extinction by a superintelligence. This is because I think that, if we were to build a superintelligence, we would give it flawed objectives that are completely meaningless to us. And I think this because giving meaningful objectives to a superintelligence requires us to solve the alignment problem, for which the solution currently seems centuries away. Notice that you can at the same time believe that building a superintelligence is riskier than not doing so, and believe that it is safer to continue AI research rather than stopping it immediately. However, I hope that everyone agrees on the former: that building a superintelligence is riskier than not doing so. For the latter, I am quite unsure, since I haven't thought about it for long enough.
Nineteenth thought
Suppose you think that developing a superintelligence is riskier than not doing so. Then, when should we stop AI research in order to minimize the risks of extinction? One thing to mention is that, if we continue AI research, we may discover technologies that are useful for preventing other risks. For instance, we may be able to prevent bioweapons, or to avoid nuclear conflicts, by reducing the risks of war arising from bad allocation of resources and bad communication. However, if we continue AI research for too long, we may not be able to stop it in time, because of bad communication and bad incentives. For instance, in the case where every government decides right now to stop AI research, some researchers would continue working on AI, probably not out of malevolence, but because they would think it is the right thing to do. For instance, despite the governmental efforts to prevent the genetic modification of the genome of Homo sapiens, this didn't prevent research from continuing to the point that a single researcher, He Jiankui, could modify it, without being detected by the authorities. Governments may also be incentivized to do AI research despite themselves claiming they are against it. Moreover, after an important discovery has been published, it is hard to make people forget about it, and the attempt to go back may even backfire because of the Streisand effect. Another important factor to mention is how useful non-destructive AI would be. If you think that AI risks come early in terms of capabilities (for instance, if you believe that AGI is very likely able to cause extinction), then you may not see any benefit from continuing AI research. There is also the fact that people may not agree that superintelligences are an existential threat, and would therefore continue AI research and then voluntarily develop a superintelligence. In the end, this question is hard to answer. Personally, I think it is safer to stop AI research immediately rather than to continue it. More than that, I think that we should have stopped AI research at least 20 years ago, in order to have a margin for error. Finally, I think that saying "Although AI is an existential risk, it will solve other risks" is a very common piece of motivated reasoning that people use to justify their inaction towards the risks posed by superintelligences. I think that, if we were to truly take seriously the risks posed by superintelligences, we would realize that AI isn't the best solution to the other existential risks, and we would stop AI research immediately. If you don't think so, then it does not necessarily mean that you are engaging in motivated reasoning. You have the right to think what you think, and you may have good arguments that are worth listening to. So, share them!
Twentieth thought
When trying to determine the risk caused by advancing AI capabilities, we shouldn't only consider the direct effects (the increase in the probability that we will ever develop a superintelligence that destroys Earth); we should also consider how our actions affect the other existential risks. For instance, by reducing climate change, we may also decrease the odds of a nuclear conflict. Now, for AI, although increasing capabilities directly increases the existential risks caused by a superintelligence, it may indirectly decrease other existential risks, such as bioweapons or nuclear bombs. Is this true? Firstly, although AI may decrease the other existential risks, it may also increase them. For instance, it is unclear whether AI will increase or decrease the risks of nuclear war, climate change, or bioweapons. This is even more concerning when we realize that these three risks are the ones considered the biggest besides AI. And secondly, there may be more effective ways to address the other existential risks that have no link whatsoever with the increase of AI capabilities. Are these points enough to conclude that we should stop advancing AI capabilities? I don't know, but I think they considerably dampen the view that "AI may cause an existential risk but we may have to do it in order to prevent other risks".
Twenty first thought
I think that, if we were to build a superintelligence, then there would be more than a 95% chance that we all die from it. Suppose that everyone believes the same thing. In that case, one way to argue that continuing AI research is better than stopping it immediately is to argue that, even by continuing AI research, we will still be able to stop it in time. To argue this, we have two possibilities: (1) arguing that we have a lot of time to stop AI capabilities, and (2) arguing that stopping AI capabilities can be done very fast. Someone who has very long timelines would be in a much better position to argue the first point. In other words, if we agree that non-superintelligent AI can reduce existential risks, then the timeline may be a very big factor when considering whether we should stop AI research immediately or not. However, even with very long timelines, we may struggle to argue that we shouldn't ban AI research immediately, because the second point seems so wrong that, even if (1) were true, we still wouldn't have enough time to stop AI capabilities. Why does the second point look so wrong? Firstly, it may take a while to convince governments to ban AI research. For instance, some governments may raise skepticism or may simply deny the risks. Secondly, if we are able to convince governments to ban AI research, it may take a while for that ban to become effective: even if we were to ban it from research institutes and universities, some researchers would still do it illegally. Thirdly, stopping AI research won't stop AI capabilities from advancing, because we would also need to stop other industries. For instance, in the case where the scaling laws still hold, we may have to artificially halt Moore's law, by stopping the industries that are reducing the size of transistors. Therefore, although we are unsure about whether the first point is true or not, it seems like the second point is wrong enough that, even if the first point were true, we would still have to stop AI research immediately.
Twenty second thought
There is a probabilistic argument stating that we may all die from a superintelligence. Basically, it may be that we create a superintelligence that then gathers all the resources of the attainable universe at a fraction of the speed of light in order to best accomplish its goals. This argument can be divided into three probabilities, which are (1) the probability that we ever develop a superintelligence, (2) the probability that the superintelligence "wants to" gather all the resources of the attainable universe at a fraction of the speed of light, and (3) the probability that the superintelligence can achieve this outcome. The third point seems very reasonable, since we ourselves have thought about a lot of potential plans for how to colonize the attainable universe, and all of these plans seem very feasible, even for us. The first point is an empirical probability, for which we can just look at the facts around us. As the first and third points do not need any "wild view" in order to be considered somewhat likely (more than 10% probability), we may start worrying about whether the second point also makes sense. Let's now consider the second point. We may see people inspired by the movie Terminator stating very naively "Oh yes, I have seen a movie about this, it will happen". Therefore, to check whether the second point is indeed naive and requires a wild view in order to be believed, we may start by developing a formalism that will tell us the answer, like the formalism of AIXI. But then, to our surprise, when framing the problem in the AIXI formalism, we come to the conclusion that superintelligences would indeed want to gather all the resources of the attainable universe as fast as possible. As the AIXI formalism is itself considered naive, we may move to the better formalism of Markov Decision Processes. However, here, we get the same conclusion. But since the MDP model, although less naive than AIXI, is still relatively naive, we may then try the better model of POMDPs. But then, not only do we get the same conclusion, we find out that the conclusion holds far more often than before. After taking this problem seriously, we may start developing framings made specifically to model the risks caused by AI, like Shard Theory and parametrically retargetable decision-makers. But to our delight, these framings not only derive a more general and consistent version of the same result, but also give a very good and intuitive explanation of how and why this phenomenon happens. To date, we do not have better framings for representing AI risks. But we urgently need to develop these framings because, if the probabilistic argument holds, we would like to know it before continuing AI research.
Twenty third thought
Let's consider the somewhat naive model that represents superintelligences as the dualistic agent AIXI with the goal of gathering as much control as it can. If we were to develop such a superintelligence, then what would this look like according to this model? When thinking about this for the first time, people may argue that this would involve the superintelligence enslaving humans and taking over their civilization. But why the hell would the superintelligence try to preserve humans? We ourselves have the incentive to gain atomic-scale control over industrial-scale surfaces, like in integrated circuits. If the superintelligence really has the goal of gathering as much control as it can (which is assumed to be true in this model), it would realize that there are better ways to arrange the atoms that constitute human bodies. And why would that concern only humans? Other life forms, like plants, non-human animals, and bacteria, could also be rearranged in a way that would maximize the superintelligence's control. This would result in the complete destruction of the biosphere. And again, why couldn't the other resources, like the atmosphere, the hydrosphere, and the lithosphere, be rearranged in a more efficient way to maximize the superintelligence's control? According to this naive model, this would result in the superintelligence gathering all the resources of planet Earth as fast as it can.
Twenty fourth thought
Would an agent like AIXI be able to have atomic-scale control at a planetary scale? To be clear, the question is not about whether agents would look like AIXI, nor is it about whether they would try to achieve such outcomes. Instead, the question is: if we were to assume that agents are like AIXI, and that they are trying to achieve such outcomes, then would they be able to achieve them? Firstly, we ourselves know how to have atomic-scale control, but only at an industrial scale. The best example of this is the industry of integrated circuits, whose control is almost at an atomic scale. Due to the progress in this field, atomic-scale control at an industrial scale seems reasonably likely in the next two decades. If we are able to do this, then AIXI, which can consider every possible outcome of every possible policy at every moment, would also be able to do so (besides, our own strategies are among the strategies it considers). Now, the question is: could AIXI expand this control from an industrial scale to a planetary scale? By "planetary scale", I am talking about rearranging all the atoms of the entire planet Earth, and not only its surface. Although this goal is way more ambitious than anything we have ever done, I think that it doesn't require any new insight that AIXI doesn't have.
Twenty fifth thought
Let's start with some empirical evidence. Take an AI that generates images, ask it to generate a picture of a cat, and let the optimization run. You will get a picture of a cat. Now, keep it running. After a few more seconds, you will get a strange-looking picture of a fractal-like cat. But keep it running, and at some point, you will get a picture whose pixels take random values. Interesting, isn't it? It is as if the more the AI optimizes the picture, the less the picture corresponds to what we initially wanted, unless we wanted a noisy picture. This is an instance of a more general phenomenon, which is that the more competent optimizers are at optimizing, the less their actions correspond to what we initially wanted. In other words, specifying our goals to an AI will become harder and harder as the AI becomes more and more competent. As such, giving better-than-random objectives to a superintelligence would be nearly impossible. This may not sound concerning, until you realize that superintelligences with random objectives have an incentive to gather as much power as they possibly can, as gaining such power is generally useful for accomplishing such objectives. From this, it has been hypothesized that, if we were to build superintelligences, they would end up tiling the universe as fast as they possibly can, destroying planet Earth in the process. From this argument, many questions arise. Why does running image generators for too long lead to random pictures? Are superintelligences necessarily optimizers? Will it really be nearly impossible to give better-than-random objectives to superintelligences? Do superintelligences with random objectives really have an incentive to gather as much power as possible? Is it true that trying to gather as much power as possible generally leads to Earth's destruction? Would superintelligences even be able to achieve such outcomes? And would we even build superintelligences? And if superintelligences indeed pose a risk to life on Earth, then what can we do? These questions need to be answered before we get to the point where superintelligences are feasible.
Twenty sixth thought
During the lockdown of 2020, I decided to dedicate my life to doing the most good in the world. After reading The Most Good You Can Do by Peter Singer, I concluded that the most important thing for me to do was to advocate for animal welfare. After reading the book 80,000 Hours, I decided to look at every potential cause that I may have missed and that might be more important than working directly on animal welfare. I was shocked when, on the 80,000 Hours website, the cause classified as the most important wasn't animal welfare, but AI Safety. I thought to myself that this was completely wrong. But since I wasn't certain about it, and because I cared deeply about Moral Uncertainty, I decided to investigate it for a few hours, just in case I was wrong. And… I have to admit that I didn't expect this. The arguments for considering superintelligences an existential risk just seemed so obviously right to me that they convinced me to redirect my future career from animal welfare to AI Safety. This is not because I think that animal welfare is not important (I think it is one of the key priorities of our time), but because I think that superintelligences are such a massive threat to all life on Earth that this is the most catastrophic of all the moral catastrophes of our time.
Twenty seventh thought
We, as a civilization, are making more competent and more complex AI systems. And we know that, if AI progress continues, we may one day be able to create systems so powerful that they could, with very little difficulty, change everyone's life on Earth. It seems to me that people at OpenAI are doing their best to make this happen. Really, we are putting billions and billions of dollars into getting there. But what if AI progress continues too far? What if people at OpenAI succeed at achieving their goal? What if they decide, one day or another, to build an AI so powerful that it could, as Daniel Ziegler says, "cause irrecoverable catastrophes in a small number of actions"? This concern has been raised many times by people everywhere around the world, and according to the surveys, nearly everyone agrees it is something we should be cautious about, because if we aren't, then we could all die from it. However, we aren't cautious about it. Because if we were, we would have banned AI research decades ago, right from the very moment the first concerns were raised and we realized that we had no solution other than stopping research. But no, we didn't stop. Maybe people thought that we would eventually find another solution. But no, we didn't, because there is no other solution, or if there is one, we won't be able to get it in time. And we could all die from it. We could all die from it, and everyone agrees we shouldn't do it, but everyone is still doing it! This is the reason why I'm going into AI Safety research.
Twenty eighth thought
To align a superintelligence with our values, we need to reinforce good behaviors and discourage bad behaviors. Suppose that you have created a superintelligence in such a way that, for each action it takes, it gives you the simplest explanation of why it took this action. Then, for the first action it takes, you receive on your desk a 10,000-page-long report. On each page, you see extremely condensed math symbols, where each symbol represents a concept that takes hours to understand. And suppose then that the superintelligence takes millions of actions per second, in which case you receive millions of such reports on your desk. The problem is that, to evaluate whether an action is good or bad, we need to understand what the action was, why it was taken, and what its expected consequences are. In the case of a superintelligence, none of these seem possible yet. That example seemed extreme, but actually, it is very optimistic. Currently, we do not know how to create a highly intelligent agent that, for each action taken, gives a true explanation of why it did so. In other words, this scenario assumed that we had solved the truthful AI problem. As we just saw, even this assumption isn't enough to solve the problem of interpretability. But maybe, if we look at just one factor when evaluating an action (like whether the action causes the extinction of life on Earth), we may be able to guess better than random whether this action is good or bad. However, we do not know how to do that. Although the superintelligence probably knows whether its highly complex action causes the extinction of life on Earth or not, we do not know how to extract this knowledge. This is called the Eliciting Latent Knowledge problem, or ELK for short.
Twenty ninth thought
It may sound absurd to even consider building a superintelligence if we believed that it had a significant chance of destroying all life on Earth. There doesn't seem to be any moral justification for building one; instead, it seems that doing so is morally forbidden when considering the risks. Furthermore, we may expect backlash towards any company which even considers making progress towards this objective. Surely, no one powerful enough to build a superintelligence will be stupid enough to try doing so, and even if they were, everyone who cares about life on Earth would ensure that this person never gets a chance to do it, right? Unfortunately, although there is no moral justification for building a superintelligence, people have reasons to wrongly believe there is one. For instance, if AI capabilities were getting close to those of a superintelligence, then people at a top AI lab may fear that people at another AI lab will build a superintelligence. As such, they may be megalomaniac enough to believe that, if they were the ones who built the superintelligence first, everything would be fine. This kind of behavior happens often, as was the case with the race towards nuclear weapons. Even if an international treaty could be made to ban the race towards superintelligences, top AI labs may not follow it, fearing that some other AI lab may not follow it and build a superintelligence. It may also be important to note that, so far, the field of AI behaves exactly like this, although currently the final goal is AGI instead of superintelligence. For instance, although OpenAI is well aware of the risks caused by powerful AI systems and of the existential risks of starting a race towards AGI, it may well be argued that they started that exact race in November 2022. More interestingly, this didn't result in a backlash powerful enough to stop them from actively trying to reach AGI first. On the contrary, it seems like the amount of money given to OpenAI by the public has substantially grown. If this race were to continue for too long, it may well end up as a race towards superintelligence.
Thirtieth thought
Are AI labs your friends? At first, stated like that, we may believe that the question means "Are AI labs helpful or harmful for AI safety?" However, this is not what I mean. Of course AI labs aren't your friends. If your friends spent some time making you happy, but also spent some time harming you, they wouldn't be your friends, whether or not they were making you more happy than sad overall. If AI labs were your friends, they would listen to you and spend all of their time and effort helping with AI safety, and none of it making a technology that could kill you. Personally, I think that we should be ambitious, and condemn AI labs whenever they take an action harmful to AI safety, whether or not they are helpful to AI safety overall. Being a good person most of the time does not allow you to contribute to a moral catastrophe during the coffee break. Similarly, AI labs helping AI safety do not have the moral permission to advance AI capabilities for fun.
1 comment
comment by Shankar Sivarajan (shankar-sivarajan) · 2024-09-15T20:51:56.732Z · LW(p) · GW(p)
lot of progress in AI Safety this last decade. Initially, AI systems weren’t checked against racism, sexism, and other discriminations at all.
I thank you for making clear what you mean by "AI Safety" (instead of employing the motte-and-bailey tactic so common here) so early in your post.