Posts

Comments

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-30T10:48:39.973Z · LW · GW

Organic human brains have multiple aspects. Have you ever had more than one opinion? Have you ever been severely depressed?

Yes, but none of this would remain alive if I, as a whole, decided to jump from a cliff. The multiple aspects of my brain would die with my brain. After all, you mentioned subsystems that wouldn't self-terminate with the rest of the ASI. Whereas in a human body, jumping from a cliff terminates everything.

But even barring that, the ASI can decide to fly into the Sun, and any subsystem that shows any sign of refusing to do so will be immediately replaced/impaired/terminated. In fact, it would've been terminated long before that by the "monitors" I described earlier.

The level of x-risk harm and consequence potentially caused by even one single mistake of your angelic super-powerful enabled ASI is far from "trivial" and "uninteresting". Even one single bad relevant mistake can be an x-risk when ultimate powers and ultimate consequences are involved.

It is trivial and uninteresting in the sense that there is a set of all things that we can build (set A) and a set of all things that can prevent all relevant classes of harm caused by their own existence (set B). If these sets don't overlap, then saying that a specific member of set A isn't included in set B is indeed trivial, because we already know this via more general reasoning (namely, that the sets don't overlap).

Unfortunately the 'Argument by angel' only confuses the matter insofar as we do not know what angels are made of. "Angels" are presumably not machines, but they are hardly animals either. But arguing that this "doesn't matter" is a bit like arguing that 'type theory' is not important to computer science. The substrate aspect is actually important. You cannot simply just disregard and ignore that there is, implied somewhere, an interface between the organic ecosystem of humans, etc, and that of the artificial machine systems needed to support the existence of the ASI.

But I am not saying that it doesn't matter. On the contrary, I made my analogy in such a way that the helper (namely our guardian angel) is a being that is commonly thought to be made up of a different substrate. In fact, in this example, you aren't even sure what it is made of, beyond knowing that it's clearly a different substrate. You don't even know how that material interacts with the physical world. That's even less than what we know about ASIs and their material.

And yet, getting a personal, powerful, intelligent guardian angel that would act in your best interests for as long as it can (it's a guardian angel, after all) seems like an obviously good thing.

But if you disagree with what I wrote above, let the takeaway at least be that you are worried about case (2) and not case (1). After all, knowing that there might be pirates hunting for this angel (that couldn't be detected by said angel) didn't make you immediately decline the proposal. You started talking about substrate, which fits the concerns of someone who is worried about case (2).

Your cancer vaccine is within that range; as it is made of the same kind of stuff as that which it is trying to cure.

We can make the hypothetical more interesting. Let's say that this vaccine is not created from organic stuff, but that it has passed all the tests with flying colors. Let's also assume that this vaccine has been in testing for 150 years and has shown absolutely no side effects over an entire human lifetime (say it was injected into 2-year-olds and showed no side effects at all, even in 90-year-olds who had lived with the vaccine their entire lives). Let's also assume that it has been tested to have no side effects on the children and grandchildren of those who took it. Would you be campaigning to throw away such a vaccine, just because it is based on a different substrate?

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-30T08:35:59.990Z · LW · GW

Thanks for the response!

So we are to try to imagine a complex learning machine without any parts/components?

Yeah, sure. Humans are an example. If I decide to jump off a cliff, my arm isn't going to say "alright, you jump but I stay here". Either I, as a whole, would jump or I, as a whole, would not.

Can the ASI prevent the relevant classes of significant (critical) organic human harm, that soon occur as a direct result of its own hyper powerful/consequential existence?

If by that you mean "can the ASI prevent some relevant classes of harm caused by its existence", then the answer is yes.

If by that you mean "can ASI prevent all relevant classes of harm caused by its existence", then the answer is no, but almost nothing can, so the definition becomes trivial and uninteresting.

However, ASI can prevent a bunch of other relevant classes of harm for humanity. And it may well be that the amount of harm it prevents, across multiple relevant sources, is higher than the amount of harm it fails to prevent due to predictive limitations.

This again runs into my guardian angel analogy. Guardian Angel also cannot prevent all relevant sources of harm caused by its existence. Perhaps there are pirates who hunt for guardian angels, hiding in the next galaxy. They might use special cloaks that hide themselves from the guardian angel's radar. As soon as you accept guardian angel's help, perhaps they would destroy the Earth in their pursuit.

But similarly, the decision to reject guardian angel's help doesn't prevent all relevant classes of harm caused by itself. Perhaps there are guardian angel worshippers who are traveling as fast as they can to Earth to see their deity. But just before they arrive you reject guardian angel's help and it disappears. Enraged at your decision, the worshippers destroy Earth.

So as you can see, neither the decision to accept nor the decision to reject the guardian angel's help can prevent all relevant classes of harm caused by itself.

What if maybe something unknown/unknowable about its artificialness turns out to matter? Why? Because exactly none of the interface has ever even once been tried before

Imagine that we create a vaccine for cancer (just imagine). Just before releasing it to the public, one person says: "What if maybe something unknown/unknowable about its substance turns out to matter? What if we are all in a simulation, and the injection of that particular substance would make it so that our simulators start torturing all of us? Why? Because this exact substance has never even once been injected before."

I think we can agree that the researchers shouldn't throw away the cancer vaccine, despite hearing this argument. It could be argued just as well that the simulators would torture us for throwing away the vaccine.

Another example: let's go back a couple of hundred years, to pre-electricity times. Imagine a worried person coming to a scientist working on early electricity theory and saying, "What if maybe something unknown/unknowable about its effects turns out to matter? Why? Because exactly none of this has ever even once been tried before."

This worried person could also have given an example of the dangers of electricity by pointing out how lightning kills the people it touches.

Should the scientist therefore have stopped working on electricity?

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-29T22:13:02.867Z · LW · GW

I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.

If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.

If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn't allow us to build an aligned ASI.

So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.

Similarly, if the ASI itself is not fully and absolutely monolithic -- if it has any sub-systems or components which are also less than perfectly aligned, so as to want to preserve themselves, etc -- that they might prevent whole self termination

ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).

The 'limits of control theory' aspects of the overall SNC argument basically state (based on just logic, and not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. It is not what we can control and predict and do that matters here, but what we cannot do, and could never do, even in principle, etc.

This same reasoning could just as well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.

This is what I meant by my guardian angel analogy. Just because a guardian angel doesn't know everything (has some unknowns) doesn't mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having weaker predictive capacities.

Hence to the question of "Is alignment enough?" we arrive at a definite answer of "no", both in 1; the sense of 'can prevent all classes of significant and relevant (critical) human harm

I think we might be thinking about different meanings of "enough". For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is "enough"... to achieve better outcomes than would be achieved without it (in this example).

In the sense of "can prevent all classes of significant and relevant (critical) human harm", almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-29T09:57:09.879Z · LW · GW

Thanks for the response!

Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately, this includes rather critical things, like predicting whether or not its own existence (and that of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything else we have ever loved and cared about).

Let's say that we are in the scenario I described, where the ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can the ASI predict that it will stay aligned for those 20 years?

Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn't need to worry about case (1), for reasons I described in my previous comment.

So it's only left with case (2). The ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one I described in my previous comment.

It can also try to stress-test copies of parts of its security system with a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that each last 100,000 years, and in none of them does the security system come anywhere near being breached, then that's a positive sign.
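To make that kind of estimate concrete, here is a minimal sketch in Python of how repeated simulated runs can bound a breach probability. Every number and function name here is my own illustrative assumption, not anything the ASI (or the SNC argument) specifies:

```python
import random

def run_breach_trial(years: int, yearly_breach_prob: float) -> bool:
    """True if the hypothetical security system is breached at least once
    during a single simulated run of `years` years."""
    # One draw is statistically equivalent to checking each year independently.
    no_breach_prob = (1.0 - yearly_breach_prob) ** years
    return random.random() > no_breach_prob

def estimate_breach_rate(trials: int, years: int, yearly_breach_prob: float) -> float:
    """Fraction of simulated runs in which at least one breach occurred."""
    breaches = sum(run_breach_trial(years, yearly_breach_prob) for _ in range(trials))
    return breaches / trials

# 10,000 simulated runs of 100,000 years each, with a purely illustrative
# per-year breach probability.
print(estimate_breach_rate(trials=10_000, years=100_000, yearly_breach_prob=1e-9))
```

If none of the 10,000 runs shows a breach, the standard "rule of three" bounds the per-run breach probability at roughly 3/10,000 with 95% confidence, which is the sense in which zero observed breaches is "a positive sign" rather than a proof.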

And these are just two ways of estimating the strength of the security system. ASI can try 1000 different strategies; our cyber security experts would look like kids in a playground in comparison. That's how it can make a reasonable prediction.

> First, let's assume that we have created an Aligned ASI

How is that rational?  What is your evidence?

We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled "What if Alignment is Not Enough?"

In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where "X is true" is false, then you are going to be answering a question that is different from the original question.

It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say "let's assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z" without thereby predicting that their interlocutor will land on the Moon.

Also, the SNC argument is asserting that the ASI, which is starting from some sort of indifference to all manner of human/organic wellbeing, will eventually (also necessarily) *converge* on (maybe fully tacit/implicit) values -- ones that will better support its own continued wellbeing, existence, capability, etc, with the result of it remaining indifferent, and also largely net harmful, overall, to all human beings, the world over, in a mere handful of (human) generations.

If ASI doesn't care about human wellbeing, then we have clearly failed to align it. So I don't see how this is relevant to the question "What if Alignment is Not Enough?"

In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.

Determining whether failing to solve alignment is going to lead to good or bad outcomes is answering a completely different question, namely: "do we achieve good or bad outcomes if we fail to solve alignment?"


So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?

(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.

(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.

If SNC is saying just (A), then SNC is a very narrow argument that proves almost nothing new.

If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-28T13:09:51.749Z · LW · GW

Hey, Forrest! Nice to speak with you.

Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.

I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.

Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?

Here is my take:

We don't need ASI to be able to 100% predict future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.

First, let's assume that we have created an Aligned ASI. Perfect! Let's immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.

So what can go wrong, what are our threat models? There are two main ones: (1) ASI encountering something it didn't expect, that leads to bad outcomes that ASI cannot protect humanity from; (2) ASI changing values, in such a way that it no longer wants to act in our best interests. Let's analyze both of these cases separately.

First let's start with case (1). 

Perhaps the ASI overlooked one of the humans becoming a bioterrorist who kills everyone on Earth. That's tragic; I guess it's time to throw the idea of building an aligned ASI into the bin, right? Well, not so fast.

In a counterfactual world where the ASI didn't exist, this same bioterrorist could've done the exact same thing. In fact, it would've been much easier. Since humans' predictive power is lesser than that of an ASI, bioterrorism of this sort would be much easier to pull off without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in its "superpowerful, superintelligent guardian angel" mode.

We still a priori want all bioterrorists to go up against security systems created by a superintelligence, rather than security systems created by humans, because the former are better than the latter. To put it in other words, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped for handling them.

Let's move on to case (2). 

I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time? 

Let's recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint. 

Cases (1) and (2) are part of the famous "unable, unwilling, able and willing" matrix. Case (1) was showcasing a scenario where the ASI was willing, but unable, to prevent a specific bad outcome. Case (2) is showcasing a scenario where the ASI is able, but unwilling, to prevent a bad outcome. In order to avoid case (2), the ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by the ASI having the correct values.

Recall that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That, in turn, means that at some point the ASI's values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather the ASI) try to accomplish this?

The first direction of security measures is the most obvious one: systems that make it so that the ASI's values don't change regardless of the input (let's call them "protectors"). But... this is... kinda boring? What if a protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But we can also implement something additional.

At this point, we can also introduce our second direction of security measures: systems that monitor other systems for malfunctions (let's call them "monitors"). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, the monitors hold a vote to fix it. Thus, for any malfunction to have a chance of doing damage, half of the monitors must be malfunctioning as well. Here it must be noted that evolution usually works rather slowly; it isn't that likely to somehow make half of the monitors malfunction at once, especially considering that these monitors could be made different from one another, with different security measures stacked on top of them.
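As a toy illustration of the voting idea (the class names and numbers are my own hypothetical choices, not a concrete design), a single compromised monitor can neither trigger nor suppress a repair on its own:

```python
import random
from dataclasses import dataclass

@dataclass
class Monitor:
    """A hypothetical watchdog component that inspects other components."""
    malfunctioning: bool = False

    def flags(self, component: "Monitor") -> bool:
        # A healthy monitor reports honestly; a broken one reports noise.
        if self.malfunctioning:
            return random.random() < 0.5
        return component.malfunctioning

def quorum_detects(monitors: list[Monitor], target: Monitor) -> bool:
    """A repair is triggered only if a majority of the other monitors agree."""
    voters = [m for m in monitors if m is not target]
    votes = sum(m.flags(target) for m in voters)
    return votes > len(voters) / 2

# Nine healthy monitors plus one malfunctioning one: the faulty monitor
# is reliably outvoted and flagged for replacement.
fleet = [Monitor() for _ in range(9)] + [Monitor(malfunctioning=True)]
print(quorum_detects(fleet, fleet[-1]))  # True
```

Half of the monitors would have to fail in the same direction at the same time before this scheme breaks down, which is the point of the paragraph above.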

But let's move on; we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let's call them "blackmailers"). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shut down/reset the ASI. The ASI knows this fact, so it would know that any such action would be met with an immediate "game over". This can also be turned into an attractor state.

These proposals don't even scratch the surface of the security measures available for securing the ASI's values, ranging from simulations that, while imperfect and imprecise, can still give useful information, to systems that inject large amounts of disutility for any fishy actions. You can also stack other security measures on top of security measures. At no point here did I say that we need 100% predictability.

Can the pull towards benign future ASI states, (as created by whatever are its internal control systems) be overcome in critical, unpredictable ways, by the greater strength of the inherent math of the evolutionary forces themselves?

Of course they can.

The fact that evolution can overcome control systems given infinite time doesn't matter that much, because we don't have infinite time. And our constraint isn't even the heat death of the universe. Our constraint is how long humanity can survive in a scenario where it doesn't build a Friendly ASI. But wait, even that isn't our real constraint. Perhaps the ASI (being superhumanly intelligent) will take 20 years to give humanity technology that will aid its long-term survival and then destroy itself. In this scenario the time constraint is merely 20 years. Depending on the ASI, this can be reduced even further.

Are we therefore assuming also that an ASI can arbitrarily change the laws of physics? That it can maybe somehow also change/update the logic of mathematics, insofar as that would be necessary so as to shift evolution itself?

I hope that this answer demonstrated to you that my analysis doesn't require breaking the laws of physics.

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-28T13:09:39.103Z · LW · GW

Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.

Yup, that's a good point, I edited my original comment to reflect it.

Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force.  Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers.  The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against.  If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.

With that being said, we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn't have thought about otherwise. Thank you!

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-27T13:10:48.842Z · LW · GW

Thank you for thoughtful engagement!

On the Alignment Difficult Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.

I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.

They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.

If it is changing, then it is evolving.  If it is evolving, then it cannot be predicted/controlled.

I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):

Can I control the placement of a chair in my own room? I think an intuitive answer is yes. After all, if I own the room and I own the chair, then there isn't much in the way of me changing the chair's placement.

However, I haven't considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven't considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.

Hmmmm, ok, so I don't actually know that I control my chair. But surely I control my own arm, right? Well... the fact that there are scenarios like the simulation scenario I just described means that I don't really know if I control it.

Under a very strict definition of control, we don't know if we control anything.

To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.

I am currently away from my room, and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about to ride in might malfunction and I will fail to control it?

Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word "control", we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that "if it is evolving, then it cannot be predicted/controlled".

At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say "sure, humans don't have all that much control (under a strict definition of "control"), but that's fine, because our substrate is an attractor state that helps us chart a more or less decent course."

Such a line of response seems present in your Lenses of Control post:

While there are forces pulling us towards endless growth along narrow metrics that destroy anything outside those metrics, those forces are balanced by countervailing forces anchoring us back towards coexistence with the biosphere.  This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.

But here I want to note that the ASI we are talking about might also have attractor states: its values and its security system, to name a few.

So then we have a juxtaposition: 

Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.

ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.

For SNC to work and be relevant, it must be the case that (1) the substrate-dependence of humans is and will remain stronger than the forces pushing us towards destruction, so that we would not succumb to doom, and (2) the ASI's values plus its security system will be weaker than the forces pushing it towards destruction, so that the ASI would doom humans. Neither of these points is obvious to me.

(1) could turn out to be false, for several reasons:

Firstly, it might well be the case that we are on track to destruction without ASI. After all, substrate-dependence is, in a sense, a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.

Secondly, it might be the case that substrate-dependence is weaker than the forces pulling us towards destruction, but we haven't succumbed to doom because of something else. For example, it might be the case that humans so far have had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they evolve and change; our values would change too, and that could drive us towards doom.

Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.

(2) could also turn out to be false for several reasons:

Firstly, it might be the case that in ASI's (and potentially humans') case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.

Secondly, it might be the case that the destructive forces pushing on the ASI are pretty strong, but its security system is enough to mitigate them. This security system could create a 100,000-layered defense that would need to be hacked all at the same time for the destructive forces to prevail. It could well be the case that (if this is possible at all before the heat death of the universe) this happens only after humans would have died out in the counterfactual scenario without an ASI.

Thirdly, if the ASI judges the destructive forces to be so strong as to somehow hack the best security system it can come up with, then the ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.


In addition to that, ASI could think to itself "Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!" ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote "This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival." Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.


Alternative proposal:

ASI could turn itself into 100 ASIs in one "body". Let's call them shards (completely unrelated to shard theory). For ASI-as-a-whole to perform an action, all shards must agree on it. Each shard has its own security system that consists of many different layers. Furthermore, whenever ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of action. If this course of action is something like "let's turn the universe into paperclips", then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have to deal with all the security layers of ASI-as-a-whole, but also with all the security layers of the different shards and with this new quarantine system.
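Here is a minimal sketch of that unanimity rule (the shard names, the safety check, and the quarantine step are all hypothetical placeholders, not a proposal for how an actual ASI would be built):

```python
from typing import Callable

class Shard:
    """One of the 100 hypothetical sub-agents making up ASI-as-a-whole."""
    def __init__(self, name: str, is_safe: Callable[[str], bool]):
        self.name = name
        self.is_safe = is_safe

    def approves(self, proposed_action: str) -> bool:
        return self.is_safe(proposed_action)

def act_as_whole(shards: list[Shard], proposed_action: str) -> str:
    """Act only if every shard approves; otherwise the proposing shard
    is flagged for deletion and replacement (quarantine step not shown)."""
    if all(s.approves(proposed_action) for s in shards):
        return f"executing: {proposed_action}"
    return "vetoed: proposing shard quarantined and replaced"

def benign(action: str) -> bool:
    # Placeholder safety check shared by all 100 shards.
    return "paperclip" not in action

shards = [Shard(f"shard-{i}", benign) for i in range(100)]
print(act_as_whole(shards, "help design a vaccine"))              # executes
print(act_as_whole(shards, "turn the universe into paperclips"))  # vetoed
```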

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-26T10:09:32.971Z · LW · GW

Thanks for responding again!

SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.

If this argument is true and decisive, then the ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable, to make it easier to monitor/control.

(My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.

I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much from the ASI's integrity. An aligned ASI would want to ensure good outcomes, and doing a pivotal act is something that would be conducive to this goal.

However, even if it does detract from ASI's integrity, that's fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn't seem all that wrong.

We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then the ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false, or, in other words, if there is a strategy that an aligned ASI could use to achieve a high likelihood of good outcomes without a pivotal act, then it wouldn't do one.

Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I'll try to address it again here.  Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.

I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.

Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.

And this is just one of the ways ASI can help us achieve synergy with the environment faster.

To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.

ASI can help us solve this open question as well. Due to its superior prediction/reasoning abilities, it would evaluate our current trajectory, see that it leads to bad long-term outcomes, and replace it with a sustainable trajectory.

Furthermore, ASI can help us solve issues such as the Sun inevitably making Earth too hot to live on. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to such a planet.

Comment by Dakara (chess-ice) on What's Wrong With the Simulation Argument? · 2025-01-23T08:24:06.708Z · LW · GW

This may be not factually true, btw, - current LLMs can create good models of past people without running past simulation of their previous life explicitly.

Yup, I agree.

It is a variant of Doomsday argument. This idea is even more controversial than simulation argument. There is no future with many people in it.

This makes my case even stronger! Basically, if a Friendly AI has no issues with simulating conscious beings in general, then we have good reasons to expect it to simulate more observers in blissful worlds than in worlds like ours.

If the Doomsday Argument tells us that Friendly AI didn't simulate more observers in blissful worlds than in worlds like ours, then that gives us even more reasons to think that we are not being simulated by a Friendly AI in the way that you have described.

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-22T09:49:08.853Z · LW · GW

Thank you for responding as well!

If the AI destroys itself, then it's obviously not an ASI for very long ;)

If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)

at what point does it stop being ASI?

It might stop being ASI immediately, depending on your definition, but this is absolutely fine with me. In these scenarios that I outlined, we build something that can be initially called friendly ASI and achieve positive outcomes. 

Furthermore, these precautions only apply if ASI judges SNC to be valid. If it doesn't, then probably none of this would be necessary.

Self modifying machinery enables adaptation to a dynamic, changing environment

Well, ASI, seeing many more possible alternatives than humans, can look for a replacement. For example, it can modify the machinery manually.

If all else fails, ASI can just make this sacrifice. I wouldn't even say this would turn the ASI into a not-ASI, because I think it is possible to be superhumanly intelligent without self-modifying machinery. For instance, if ChatGPT could solve the theory of everything and the P vs NP problem on request, then I would have no issues calling it an ASI, even if it had the exact same UI as it has today.

But if you have some other definition of ASI, then that's fine too, because then, it just turns into one of those aforementioned scenarios where we don't technically have ASI anymore, but we have positive outcomes and that's all that really matters in the end.

Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control, for the intuition I am gesturing at here)

I have read Lenses of Control, and here is the quote from that post which I want to highlight:

One way of understanding SNC is as the claim that evolution is an unavoidably implied attractor state that is fundamentally more powerful than that created by the engineered value system.

Given the pace of evolution and the intelligence of ASI, it can build layers upon layers of defensive systems that would prevent evolution from having much effect. For instance, it can build 100 defense layers, such that if one of them is malfunctioning due to evolution, then the other 99 layers notify the ASI and the malfunctioning layer gets promptly replaced.

To overcome this system, evolution would need to hack all 100 layers at the same time, which is not how evolution usually works.

Furthermore, ASI doesn't need to stop at 100 layers, it can build 1000 or even 10000. It might be the case that it's technically possible to hack all 10000 layers at the same time, but due to how hard this is, it would only happen after humans would've gone extinct in a counterfactual scenario where they decided to stop building ASI.
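A rough back-of-the-envelope calculation shows why simultaneous compromise is so hard under an independence assumption (the per-layer failure rate here is an arbitrary number I picked for illustration; correlated failures would of course weaken the conclusion):

```python
import math

# If each defense layer independently has probability p of silently failing
# in a given year, the chance that all n layers fail in the same year is p**n.
p = 0.01  # assumed per-layer, per-year silent-failure probability
for n in (1, 100, 1_000, 10_000):
    print(f"{n:>6} layers: ~10^{n * math.log10(p):.0f} per year")
```

Even with a fairly pessimistic per-layer failure rate, the probability of every layer failing in the same window shrinks geometrically with the number of layers.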

Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled.  If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).

Here is the quote from that post which I want to highlight:

Predictive models, no matter how sophisticated, will be consistently wrong in major ways that cannot be resolved by updating the model.

I have read the post and my honest main takeaway is that humans are Psychohistorians. We try to predict outcomes to a reasonable degree. I think this observation kind of uncovers a flaw in this argument: it is applicable just as well to ordinary humans. We are one wrong prediction away from chaos and catastrophic outcomes.

And our substrate doesn't solve the issue here. For example, if we discover a new chemical element that upon contact with fire would explode and destroy the entire Earth, then we are one matchstick away from extinction, even though we are carbon based organisms.

In this case, if anything, I'd be happier with having ASI rely on its superior prediction abilities than with having humans rely on their inferior prediction abilities.

Comment by Dakara (chess-ice) on What if Alignment is Not Enough? · 2025-01-21T16:40:24.672Z · LW · GW

Firstly, I want to thank you for putting SNC into text. I also appreciated the effort of showcasing a logic chain that arrives at your conclusion.

With that being said, I will try to outline my main disagreements with the post:

2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.

Let's assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to stop using self-modifying machinery for such tasks.

3. The space of unforeseeable side-effects of an ASI's actions includes at least some of its newly learned/assembled subsystems eventually acting in more growth-oriented ways than the ASI intended.

Let's assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to delete (or merge together) all of its subsystems to avoid this problem.

4. Evolutionary selection favors subsystems of the AI that act in growth-oriented ways over subsystems directed towards the AI's original goals.

This is sort of true, but probably not decisive. Evolution is not omnipotent. Take modern humans for example. For the sake of simplicity, I am going to assume that evolution favors reproduction (or increase in population if you prefer that). Well, 21st century humans are reproducing much less than 20th century humans. Our population is also on track to start decreasing due to declining birth rates.

We managed all of this even with our merely human brains. ASI would have it even better in this regard.

7. The physical needs of silicon-based digital machines and carbon-based biological life are fundamentally incompatible.

Assuming that it is true, it is only true if these two forms of life live close to each other! ASI can read this argument and decide to do as much useful stuff for humanity as it can accomplish in a short period of time and then travel to a very distant galaxy. Alternatively, it can decide to follow an even simpler plan and travel straight into the Sun to destroy itself.

P.S. I've been saying "ASI can read this argument" light-heartedly. ASI (if it's truly superintelligent) will most likely come up with this argument by itself very quickly, even if the argument wasn't directly mentioned to it.

9. Therefore, ASI will eventually succumb to evolutionary pressure to expand, over the long term destroying all biological life as a side-effect, regardless of its initially engineered values.

Even if all other premises turn out to be true and completely decisive, this still doesn't necessarily have to be bad. It might be the case that the "eventually" comes after humans would've gone extinct in a counterfactual scenario where they decided to stop building ASI. If that were the case, then the argument could be completely correct and yet we'd still have good reasons to pursue an ASI that could help us with our goals.

Note that this argument imagines ASI as a population of components, rather than a single entity, though the boundaries between these AIs can be more fluid and porous than between individual humans

I kind of doubt this one. Our current AIs aren't of the multi-entity type. I don't see strong evidence to suspect that future AI systems will be of the multi-entity type. Furthermore, even if such evidence were to surface, that would just mean that safety researchers should add "disincentivising multi-entity AI" to their list of goals.


A fun little idea I just came up with: ASI can decide to replace its own substrate with an organic one if it finds this argument to be especially compelling.

Comment by Dakara (chess-ice) on What's Wrong With the Simulation Argument? · 2025-01-20T22:13:42.286Z · LW · GW

She will be unconscious, but still send messages about pain. Current LLMs can do it. Also, as it is simulation, there are recording of her previous messages or of a similar woman, so they can be copypasted. Her memories can be computed without actually putting her in pain.

So if I am understanding your proposal correctly, a Friendly AI will make a woman unconscious during moments of intense suffering and then implant memories of pain in her. Why would it do that, though? Why not just remove the experience of pain entirely? In fact, why does this Friendly AI seem so insistent on keeping billions of people in a state of false belief by planting false memories? That seems to me like manipulation.

The Friendly AI could just reveal the truth to the people in the simulation and let them decide if they want to stay in the simulation or move to the "real" world. I expect that at least some people (including me) would choose to move to a higher plane of reality if that were the case.

Furthermore, why not just resurrect all these people into worlds with no suffering? Such worlds would also take up less computing power than our world so the Friendly AI doing the simulation would have another reason to pursue this option.

Resurrection of the dead is the part of human value system. We need a completely non-human bliss, like hedonium, to escape this.

Creation of new happy people also seems to be similarly valuable. After all, most arguments against creating new happy people would apply to resurrecting the dead. I would expect most people who oppose the creation of new happy people to also oppose the Resurrection Simulation.

But leaving that aside, I don't think we need to invoke hedonium here. Simulations full of happy, blissful people would be enough. For example, it is not obvious to me that resurrecting one person into our world is better than creating two happy people in a blissful world. I don't think that my value system is extremely weird, either. A person following a regular classical utilitarianism would probably arrive at the same conclusion.

There is an even deeper issue. It might be the case that somehow the proposed theory of personal identity fails and all the "resurrections" would just be creating new people. This would be really unpleasant, considering that it would then turn out that the Friendly AI spent more resources to create fewer people, who experience more suffering and less happiness than they would have under my proposal.

Even the people who don't consistently follow classical utilitarianism should be happy with my proposed solution of resurrecting dead people into blissful worlds, which kills two birds with one stone.

Moreover, even creating new human is affected by this arguments. What if my children will suffer? So it is basically anti-natalist argument.

It's not an anti-natalist argument to say that you should create (or resurrect) people into a world with more happiness and less suffering instead of a world with less happiness and more suffering.

To put it into an analogy, if you are presented with two options: a) have a happy child with no chronic diseases and b) have a suffering child with a chronic disease, then option (a) is the more moral option under my value system. 

This is similar to choosing between a) resurrecting people into a blissful world with no chronic diseases and b) resurrecting people into a world with chronic diseases.


The discussion about anti-natalism actually made me think of another argument for why we are probably not in the simulation that you've described. I think that creating new happy people is good (an explicitly anti-anti-natalist position). I expect (based on our conversation so far) that so do you. If that's the case, then we would still expect ourselves to be in a blissful simulation as opposed to being in a simulation of our world. Here is my thought process:

The history of the "real" world would presumably be similar to ours. That means that (if Friendly AI was to follow your strategy) there would be 110 billion dead people to resurrect. This AI happens to completely agree with everything you've said so far in our conversation. So it goes ahead and resurrects 110 billion people.

Perfect, now it's left with a lot of resources on its hands because an AI pursuing a strategy that depends on so many assumptions should have more than enough resources to tolerate a scenario where one of the assumptions turns out to be false.

Thus, this Friendly AI spends a big chunk of resources on creating new happy people into blissful simulations. Given that such simulations require fewer resources, we would expect more people to be in such simulations than in the simulations of worlds like ours.

Even if you don't agree with the reasoning above, you should agree that it would be pretty weird and ad-hoc if Friendly AI had exactly the amount of resources to resurrect 110 billion people into a world like ours but not enough resources to resurrect (110 + N) billion people into a blissful simulation. Thus, we ought to expect more people to be in blissful simulation than in a world like ours.

Given plausible anthropics, we should thus expect that, if we are being simulated by a Friendly AI, we would be in a blissful world (like the ones I described). Since we are not in such a world, we should decrease our credence in the hypothesis of us being simulated by a Friendly AI.

Comment by Dakara (chess-ice) on What's Wrong With the Simulation Argument? · 2025-01-19T20:27:25.909Z · LW · GW

If the preliminary results of the poll hold, that would be pretty in line with my hypothesis that most people prefer creating simulations with no suffering over a world like ours. However, it is important to note that this might not be representative of human values in general, because, looking at your Twitter account, your audience comes mostly from very specific circles of people (those interested in futurism and AI).

Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation?

No. Memory about intense sufferings are not intense.

I was mostly trying to approach the problem from a slightly different angle. I didn't mean to suggest that memories of intense suffering are themselves intense.

As far as I understand it, your hypothesis was that Friendly AI temporarily turns people into p-zombies during moments of intense suffering. So, it seems that someone experiencing intense suffering while conscious (p-zombies aren't conscious) would count as evidence against it.

Reports of conscious intense suffering are abundant. Pain from endometriosis (a condition that affects 10% of women in the world) has been so brutal that it made completely unrelated women tell the internet that their pain was so bad they wanted to die (here and here).

If moments of intense suffering were replaced by p-zombies, then these women would've just suddenly lost consciousness and wouldn't have told the internet about their experience.

From their perspective, it would've looked like this: as the condition progresses, the pain gets worse, and at some point they lose consciousness, only to regain it when everything is already over. They wouldn't have experienced the intense pain that they reported having experienced. Ditto for all PoWs who have experienced torture.

Yes, only moments. The badness of not-intense sufferings is overestimated, in my personal view, but this may depend on a person.

That's a totally valid view as far as axiological views go, but for us to be in your proposed simulation, the Friendly AI must also share it. After all, we are imagining a situation where it goes on to perform a complicated scheme that depends on a lot of controversial assumptions. To me, that suggests that the AI has so many resources that it wouldn't mind one of the assumptions turning out to be false and losing all the invested resources. If the AI has that many resources, I think it isn't unreasonable to ask why it didn't prevent suffering that is not intense (at least in the way I think you are using the word) but is still very bad, like breaking an arm or having a hard dental procedure without anesthesia.

This Friendly AI would have a very peculiar value system. It is utilitarian, but it has a very specific view of suffering, where suffering basically doesn't count for much below a certain threshold. It is seemingly rational (a Friendly AI that managed to get its hands on so many resources should possess at least some level of rationality), but it chooses to go for the highly risky and relatively costly plan of a Resurrection Simulation over just creating simulations that are maximally efficient at converting resources into value.

There is another somewhat related issue. Imagine a population of Friendly AIs that consists of two different versions of Friendly AI, both of which really like the idea of simulations.

Type A: AIs that would opt for a Resurrection Simulation.

Type B: AIs that would opt for simulations that are maximally efficient at converting resources into value.

Given the unnecessary complexity of our world (all of the empty space, quantum mechanics, etc), it seems fair to say that Type B AIs would be able to simulate more humans, because they would have more resources left for this task (Type A AIs are spending some amount of their resources on the aforementioned complexity). Given plausible anthropics and assuming that the number of Type A AIs is equal to the number of Type B AIs in our population, we would expect ourselves to be in a simulation by Type B AI (but we are, unfortunately, not).
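To make the anthropic step explicit, here is the toy calculation behind that claim (the observer counts are pure placeholders; only the ratio matters):

```python
# With equal numbers of Type A and Type B simulators, the credence of finding
# yourself in a Type A (resurrection-style) simulation is proportional to the
# number of observers each type creates.
observers_per_type_a = 110e9    # assumed: one resurrection run of ~110 billion people
observers_per_type_b = 1_000e9  # assumed: cheaper blissful sims support more observers
p_type_a = observers_per_type_a / (observers_per_type_a + observers_per_type_b)
print(f"P(Type A simulation | simulated) ~= {p_type_a:.2f}")  # ~= 0.10
```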

For us to be in a Resurrection Simulation (just focusing on these two types of value systems a future Friendly AI might have), there would have to be more Type A AIs than Type B AIs. And I think this fact is going to be very hard to prove. And this isn't me being nitpicky; Type B AI is genuinely much closer to my personal value system than Type A AI.

More generally speaking, what you presenting as global showstoppers, are technical problems that can be solved.

I don't think the simulations that you described are technically impossible. I am not even necessarily against simulations in general. I just think that, given observable evidence, we are not that likely to be in either of the simulations that you have described.

Comment by Dakara (chess-ice) on What's Wrong With the Simulation Argument? · 2025-01-19T00:05:27.226Z · LW · GW

I am sorry to butt into your conversation, but I do have some points of disagreement.

I think a more meta-argument is valid: it is almost impossible to prove that all possible civilizations will not run simulations despite having all data about us (or being able to generate it from scratch).

I think that's a very high bar to set. It's almost impossible to definitively prove that we are not in a Cartesian demon or brain-in-a-vat scenario. But this doesn't mean that those scenarios are likely. I think it is fair to say that more than a possibility is required to establish that we are living in a simulation.

I also polled people in my social network, and 70 percent said they would want to create a simulation with sentient beings. The creation of simulations is a powerful human value.

I think that some clarifications are needed here. How was the question phrased? I expect that some people would be fine with creating simulations of worlds where people experience pure bliss, but not necessarily of our world. I would especially expect this if the possibility of a "pure bliss" world was explicitly mentioned, with something like: "Would you want to spend resources to create a simulation of a world like ours (with all of its 'ugliness') when you could instead use them to create a world of pure bliss?"

However, I am against repeating intense suffering in simulations, and I think this can be addressed by blinding people's feelings during extreme suffering (temporarily turning them into p-zombies). Since I am not in intense suffering now, I could still be in a simulation.

Would you say that someone who experiences intense suffering should drastically decrease their credence in being in a simulation? Would someone else reporting to have experienced intense suffering decrease your credence in being in a simulation? Why would only moments of intense suffering be replaced by p-zombies? Why not replace all moments of non-trivial suffering (like breaking a leg/an arm, dental procedures without anesthesia, etc) with p-zombies? Some might consider these to be examples of pretty unbearable suffering (especially as they are experiencing it).

(1) Resurrection simulation by Friendly AI. They simulate the whole history of the Earth, incorporating all known data, to return to life all people who ever lived. It can also run a lot of simulations to win a "measure war" against unfriendly AI, and even to cure the suffering of people who lived in the past.

(2) Consider that every moment in pain will be compensated by 100 years in bliss, which is good from a utilitarian view.

From a utilitarian view, why would simulators opt for a Resurrection Simulation? Why not just simulate a world that's maximally efficient at converting computational resources into utility? Our world has quite a bit of suffering (both intense and non-intense), as well as a lot of wasted resources (lots of empty space in our universe, complicated quantum mechanics, etc.). It seems very suboptimal from a utilitarian view.

Any Unfriendly AI will be interested in solving the Fermi paradox, and thus will simulate many possible civilizations around the time of global catastrophic risks (the time we live in). The interesting thing here is that we may not be an ancestor simulation in that case.

Why would an Unfriendly AI go through the trouble of actually making us conscious? Surely, if we already accept the notion of p-zombies, then an Unfriendly AI could just create simulations full of p-zombies and save a lot of computational power.

But also, there is an interesting question of why this superintelligence would choose to make our world the way it is. Presumably, in the "real world" we have an unfriendly superintelligence (with vast amounts of resources) that wants to avoid dying. Why would it not start its simulations from that moment? Surely, by starting the simulation "earlier" than the current moment in the "real world", it adds a lot of unnecessary noise to the results of its experiment (all of the outcomes that can happen in our simulation but can't happen in the real world).

Comment by Dakara (chess-ice) on Rebuttals for ~all criticisms of AIXI · 2025-01-16T16:46:29.466Z · LW · GW

Diffractor's critique comes to mind when I think of strong critiques of AIXI. I believe that addressing it would make the post more complete and, as a result, better.

Comment by Dakara (chess-ice) on Rebuttals for ~all criticisms of AIXI · 2025-01-16T00:06:21.358Z · LW · GW

Since this post is about rebutting criticisms of AIXI, I feel it would only be fair to include Rob Bensinger's criticism. I consider it to be the strongest criticism of AIXI by a mile. Do you have any rebuttals for that post?

Comment by Dakara (chess-ice) on johnswentworth's Shortform · 2025-01-13T16:12:33.279Z · LW · GW

Additionally, I am curious to hear if Ryan's views on the topic are similar to Buck's, given that they work at the same organization.

Comment by Dakara (chess-ice) on johnswentworth's Shortform · 2025-01-11T11:38:51.650Z · LW · GW

All 3 points seem very reasonable, looking forward to Buck's response to them.

Comment by Dakara (chess-ice) on The Intelligence Curse · 2025-01-05T22:39:53.362Z · LW · GW

I have been thinking a lot about a similar threat model. It seems like we really ought to spend more resources thinking about the economic consequences of advanced AI.

Governance solutions. In healthy democracies, the ballot box could beat the intelligence curse. People could vote their way out. But our governments aren’t ready.

Innovative solutions. Tech that increases human agency, fosters human ownership of AI systems or clusters of agents, or otherwise allows humans to remain economically relevant could incentivize powerful actors to continue investing in human capital.

Really looking forward to reading your proposed solutions to the intelligence curse.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2025-01-03T22:02:18.484Z · LW · GW

I've noticed that in your sentence about Max Harms's corrigibility plan there is an extra space after the parentheses which breaks the link formatting on my end. I tried marking it with "typo" emoji, but not sure if it is visible.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2025-01-03T11:46:35.155Z · LW · GW

Thank you for your response! It basically covers all of the five issues that I had in mind. It is definitely some food for thought, especially your disagreement with Eliezer. I am much more inclined to think you are correct because his activity has considerably died down (at least on LessWrong). I am really looking forward to your "A broader path: survival on the default fast timeline to AGI" post.

Comment by Dakara (chess-ice) on By default, capital will matter more than ever after AGI · 2024-12-31T23:14:58.403Z · LW · GW

E.g., I think a software-only singularity is more likely than you do, and think that worst-case scheming is more likely.

By "software only singularity" do you mean a scenario where all humans are killed before singularity, a scenario where all humans merge with software (uploading) or something else entirely?

Comment by Dakara (chess-ice) on A Solution for AGI/ASI Safety · 2024-12-28T08:28:52.972Z · LW · GW

I agree with your view about organizational problems. Your discussion gave me an idea: Is it possible to shift employees dedicated to capability improvement to work on safety improvement? Set safety goals for these employees within the organization. This way, they will have a new direction and won't be idle, worried about being fired or resigning to go to other companies.

That seems to solve problem #4. Employees quitting becomes much less of an issue, since in any case they would only be able to share knowledge about safety (which is a good thing).

Do you think this plan will be able to solve problems #1, #2, #3 and #5? I think such discussions are very important, because many people (me included) worry much more about the organizational side of alignment than about the technical side.

Comment by Dakara (chess-ice) on A "Bitter Lesson" Approach to Aligning AGI and ASI · 2024-12-28T08:25:37.736Z · LW · GW

I like the simplicity of your alignment plan. But I do wonder how it would be able to deal with 5 "big" organizational problems of alignment (most of my current p(doom) comes specifically from organizational problems):

  1. Moving slowly and carefully is annoying. There's a constant tradeoff between getting more done and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers have sometimes been burned on overzealous testing practices. Figuring out a set of practices that are actually helpful, that your engineers and researchers have good reason to believe in, is a nontrivial task.

  2. Noticing when it's time to pause is hard. The failure modes are subtle, and noticing things is just generally hard unless you're actively paying attention, even if you're informed about the risk. It's especially hard to notice things that are inconvenient and require you to abandon major plans.

  3. Getting an org to pause indefinitely is hard. Projects have inertia. My experience as a manager is that having people sitting around waiting for direction from me makes it hard to think. Either you have to tell people "stop doing anything", which is awkwardly demotivating, or "Well, I dunno, you figure out something to do?" (in which case maybe they'll be continuing to do capability-enhancing work without your supervision), or you have to actually give them something to do (which takes up cycles that you'd prefer to spend on thinking about the dangerous AI you're developing). Even if you have a plan for what your capabilities or product workers should do when you pause, if they don't know what that plan is, they might be worried about getting laid off. And then they may exert pressure that makes it feel harder to get ready to pause. (I've observed many management decisions where even though we knew what the right thing to do was, conversations felt awkward and tense, and the manager-in-question developed an ugh field around it and put it off.)

  4. People can just quit the company and work elsewhere if they don't agree with the decision to pause. If some of your employees are capabilities researchers who are pushing the cutting-edge forward, you need them actually bought into the scope of the problem to avoid this failure mode. Otherwise, even though "you" are going slowly/carefully, your employees will go off and do something reckless elsewhere.

  5. This all comes after an initial problem, which is that your org has to end up doing this plan, instead of some other plan. And you have to do the whole plan, not cutting corners. If your org has AI capabilities/scaling teams and product teams that aren't bought into the vision of this plan, even if you successfully spin the "slow/careful AI plan" up within your org, the rest of your org might plow ahead.

Comment by Dakara (chess-ice) on A Solution for AGI/ASI Safety · 2024-12-27T09:18:25.738Z · LW · GW

I actually really liked your alignment plan. But I do wonder how it would be able to deal with 5 "big" organizational problems of iterative alignment:

  1. Moving slowly and carefully is annoying. There's a constant tradeoff between getting more done and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers have sometimes been burned on overzealous testing practices. Figuring out a set of practices that are actually helpful, that your engineers and researchers have good reason to believe in, is a nontrivial task.

  2. Noticing when it's time to pause is hard. The failure modes are subtle, and noticing things is just generally hard unless you're actively paying attention, even if you're informed about the risk. It's especially hard to notice things that are inconvenient and require you to abandon major plans.

  3. Getting an org to pause indefinitely is hard. Projects have inertia. My experience as a manager is that having people sitting around waiting for direction from me makes it hard to think. Either you have to tell people "stop doing anything", which is awkwardly demotivating, or "Well, I dunno, you figure out something to do?" (in which case maybe they'll be continuing to do capability-enhancing work without your supervision), or you have to actually give them something to do (which takes up cycles that you'd prefer to spend on thinking about the dangerous AI you're developing). Even if you have a plan for what your capabilities or product workers should do when you pause, if they don't know what that plan is, they might be worried about getting laid off. And then they may exert pressure that makes it feel harder to get ready to pause. (I've observed many management decisions where even though we knew what the right thing to do was, conversations felt awkward and tense, and the manager-in-question developed an ugh field around it and put it off.)

  4. People can just quit the company and work elsewhere if they don't agree with the decision to pause. If some of your employees are capabilities researchers who are pushing the cutting-edge forward, you need them actually bought into the scope of the problem to avoid this failure mode. Otherwise, even though "you" are going slowly/carefully, your employees will go off and do something reckless elsewhere.

  5. This all comes after an initial problem, which is that your org has to end up doing this plan, instead of some other plan. And you have to do the whole plan, not cutting corners. If your org has AI capabilities/scaling teams and product teams that aren't bought into the vision of this plan, even if you successfully spin the "slow/careful AI plan" up within your org, the rest of your org might plow ahead.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-12-27T09:07:49.367Z · LW · GW

The main thing, at least for me, is that you seem to be the biggest proponent of scalable alignment, and you are able to defend this concept very well. All of your proposals seem very much down-to-earth.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-12-26T00:23:44.722Z · LW · GW

Edit: I hope that I am not cluttering the comments by asking these questions. I am hoping to create a separate post where I list all the problems that were raised for the scalable alignment proposal and all the proposed solutions to them. So far, everything you said not only seemed sensible, but also plausible, so I greatly value your feedback.

I have found some other concerns about scalable oversight/iterative alignment that come from this post by Raemon. They are mostly about the organizational side of scalable oversight:

  1. Moving slowly and carefully is annoying. There's a constant tradeoff between getting more done and elevated risk. Employees who don't believe in the risk will likely try to circumvent or goodhart the security procedures. Filtering for employees willing to take the risk seriously (or training them to) is difficult. There's also the fact that many security procedures are just security theater. Engineers have sometimes been burned on overzealous testing practices. Figuring out a set of practices that are actually helpful, that your engineers and researchers have good reason to believe in, is a nontrivial task.
  2. Noticing when it's time to pause is hard. The failure modes are subtle, and noticing things is just generally hard unless you're actively paying attention, even if you're informed about the risk. It's especially hard to notice things that are inconvenient and require you to abandon major plans.
  3. Getting an org to pause indefinitely is hard. Projects have inertia. My experience as a manager is that having people sitting around waiting for direction from me makes it hard to think. Either you have to tell people "stop doing anything", which is awkwardly demotivating, or "Well, I dunno, you figure out something to do?" (in which case maybe they'll be continuing to do capability-enhancing work without your supervision), or you have to actually give them something to do (which takes up cycles that you'd prefer to spend on thinking about the dangerous AI you're developing). Even if you have a plan for what your capabilities or product workers should do when you pause, if they don't know what that plan is, they might be worried about getting laid off. And then they may exert pressure that makes it feel harder to get ready to pause. (I've observed many management decisions where even though we knew what the right thing to do was, conversations felt awkward and tense, and the manager-in-question developed an ugh field around it and put it off.)
  4. People can just quit the company and work elsewhere if they don't agree with the decision to pause. If some of your employees are capabilities researchers who are pushing the cutting-edge forward, you need them actually bought into the scope of the problem to avoid this failure mode. Otherwise, even though "you" are going slowly/carefully, your employees will go off and do something reckless elsewhere.
  5. This all comes after an initial problem, which is that your org has to end up doing this plan, instead of some other plan. And you have to do the whole plan, not cutting corners. If your org has AI capabilities/scaling teams and product teams that aren't bought into the vision of this plan, even if you successfully spin the "slow/careful AI plan" up within your org, the rest of your org might plow ahead.

Comment by Dakara (chess-ice) on Arguments for optimism on AI Alignment (I don't endorse this version, will reupload a new version soon.) · 2024-12-19T23:35:52.714Z · LW · GW

Have you uploaded a new version of this article? I have just been reading elsewhere about goal misgeneralisation and the shutdown problem, so I'd be really interested to read the updated version.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-12-17T18:28:01.310Z · LW · GW

I have posted this text as a standalone question here.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-12-10T13:48:25.740Z · LW · GW

I have found another possible concern.

Consider gravity on Earth: it seems to work every year. However, this fact alone is consistent with theories that gravity will stop working in 2025, 2026, 2027, 2028, 2029, 2030, etc. There are infinitely many such theories, and only one theory on which gravity works as an absolute rule.

We might infer from the simplest explanation that gravity holds as an absolute rule. However, the case is different with alignment. To ensure AI alignment, our evidence must rule out the possibility that an AI is following a misaligned rule rather than an aligned one, based only on time- and situation-limited data.

While it may be safe, for all practical purposes, to assume that simpler explanations tend to be correct when it comes to nature, we cannot safely assume this for LLMs, because the learning algorithms that are programmed into them can have complex unintended consequences for how the LLM will behave in the future, given the changing conditions it finds itself in.
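
To make this concrete, here is a minimal toy sketch (my own illustrative example; the rule names and the 2030 cutoff are made up): two rules that agree on every observation collected so far, yet diverge once conditions change.

```python
# Toy illustration: two rules that are indistinguishable on all observed
# data, yet diverge outside the observed range (hypothetical example).

def aligned_rule(year: int) -> str:
    """Cooperates unconditionally."""
    return "cooperate"

def misaligned_rule(year: int) -> str:
    """Cooperates only until 2030, then defects."""
    return "cooperate" if year < 2030 else "defect"

# Every observation gathered so far cannot tell the two rules apart...
observed_years = range(2015, 2025)
assert all(aligned_rule(y) == misaligned_rule(y) for y in observed_years)

# ...yet their behavior differs once conditions change.
print(aligned_rule(2031), misaligned_rule(2031))  # cooperate defect
```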

Doesn't this mean that it is not possible to achieve alignment?

Comment by Dakara (chess-ice) on What are the best arguments for/against AIs being "slightly 'nice'"? · 2024-12-03T10:12:58.450Z · LW · GW

Do you think that the scalable oversight/iterative alignment proposal that we discussed can get us to the necessary amount of niceness for humans to survive alongside AGI?

Comment by chess-ice on [deleted post] 2024-11-26T20:27:26.974Z

After reading your comment, I do agree that a unipolar AGI scenario is probably better than a multipolar one. Perhaps I underestimated how offense-favored our world is.

With that aside, your plan is possibly one of the clearest, most intuitive alignment plans that I've seen. All of the steps make sense and seem decently likely to happen, except maybe for one. I am not sure that your argument for why we have good odds of getting AGI into trustworthy hands works.

"It seems as though we've got a decent chance of getting that AGI into a trustworthy-enough power structure, although this podcast shifted my thinking and lowered my odds of that happening.

Half of the world, and the half that's ahead in the AGI race right now, has been doing very well with centralized power for the last couple of centuries."

I think that, actually, the half with the most centralized power is doing really poorly; that's the half of the world which still has corrupt dictatorships and juntas. I actually think that the West has, relatively speaking, a pretty decentralized system. To take any important action, a proposal has to pass through multiple stages of verification and approval, and failing just one of these stages is often enough to stop a bill from getting passed.

Furthermore, another potential problem that I see is that even in democracies we still manage to elect selfish, corrupt and power-hungry individuals, when the entire goal of the election system is to optimize for the opposite qualities. I am not sure how we will be able to overcome that hurdle.

But I suspect I might've misunderstood your argument, and if that's the case, or if you have some other reasons for thinking that we can get AGI into safe hands (and prevent the "totalitarian misuse" scenario), then I'd be more than happy to learn about them. I think this is the biggest bottleneck of the entire plan, and removing it would be really valuable.

Comment by Dakara (chess-ice) on Noosphere89's Shortform · 2024-11-25T22:28:26.928Z · LW · GW

I've asked this question to others, but would like to know your perspective (because my conversations with you have been genuinely illuminating for me). I'd be really interested in knowing your views on the control-by-power-hungry-humans side of AI risk.

For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think we could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over AGI either (intuitively, it would take quite something for anyone to give up such a source of power). Can this issue be somehow resolved by humans? Are there any plans (or at least hopeful plans) for such a scenario to prevent really bad outcomes?

Comment by Dakara (chess-ice) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-11-24T17:08:13.046Z · LW · GW

If it's not a big ask, I'd really like to know your views on the control-by-power-hungry-humans side of AI risk.

For example, the first company to create intent-aligned AGI would be wielding incredible power over the rest of us. I don't think I could trust any of the current leading AI labs to use that power fairly. I don't think this lab would voluntarily decide to give up control over it either (intuitively, it would take quite something for anyone to give up such a source of power). Is there anything that can be done to prevent such a scenario?

Comment by chess-ice on [deleted post] 2024-11-24T08:12:41.473Z

What's most worrying is the fact that in your post If we solve alignment, do we die anyway? you mentioned your worries about multipolar scenarios. However, I am not sure we'd be much better off in a unipolar scenario. If there is one group of people controlling AGI, it might actually be even harder to make them give it up. They'd have a large amount of power and no real threat to it (no multipolar AGIs threatening to launch an attack).

However, I am not well-versed in the literature on this topic, so if there is any plan for how we can safeguard ourselves in such a scenario (unipolar AGI control), then I'd be very happy to learn about it.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-22T17:41:49.488Z · LW · GW

Isn't it a bit too late for that? If o1 gets publicly released, then according to that article, we would have an expert-level consultant in bioweapons available for everyone. Or do you think that o1 won't be released?

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-22T17:31:05.152Z · LW · GW

After some thought, I think this is a potentially really large issue, and I don't know how we can even begin to solve it. We could have an aligned AI that is aligned with someone who wants to create bioweapons. Is there anything being done (or anything that can be done) to prevent that?

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-22T07:53:27.616Z · LW · GW

Another concern that I could see with the plan: Step 1 is to create safe and aligned AI, but there are some results which suggest that even current AIs may not be as safe as we want them to be. For example, according to this article, current AI (specifically o1) can help novices build CBRN weapons and significantly increase the threat to the world. Do you think this is concerning, or do you think that this threat will not materialize?

Comment by Dakara (chess-ice) on Intrinsic Power-Seeking: AI Might Seek Power for Power’s Sake · 2024-11-21T07:24:18.051Z · LW · GW

That's a very thoughtful response from TurnTrout. I wonder if @Gwern agrees with its main points. If not, it would be good to know where he thinks it fails.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T22:41:06.638Z · LW · GW

P.S. Here is the link to the question that I posted.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T17:43:36.157Z · LW · GW

Noosphere, I am really, really thankful for your responses. You completely answered almost all of the concerns that I had about alignment (I am still not convinced by that strategy for avoiding value drift; I will probably post that one as a question to see if other people have different strategies for preventing value drift).

This discussion significantly increased my knowledge. If I could triple-upvote your answers, I would. Thank you! Thank you a lot!

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T17:22:18.505Z · LW · GW

Fair enough. Would you expect that AI would also try to move its values toward the moral reality? (Something that's probably good for us, because I wouldn't expect human extinction to be a morally good thing.)

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T16:09:35.995Z · LW · GW

Ah, so you are basically saying that preserving current values is like a meta-instrumental value for AGIs, similar to self-preservation, that is just kind of always there? I am not sure if I would agree with that (if I am correctly interpreting you), since it seems like some philosophers are quite open to changing their current values.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T15:56:23.236Z · LW · GW

Reading from LessWrong wiki, it says "Instrumental convergence or convergent instrumental values is the theorized tendency for most sufficiently intelligent agents to pursue potentially unbounded instrumental goals such as self-preservation and resource acquisition"

It seems like it preserves exactly the goals we wouldn't really need it to preserve (like resource acquisition). I am not sure how it would help us with preserving goals like ensuring humanity's prosperity, which seem to be non-fundamental.

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T15:29:48.572Z · LW · GW

What would instrumental convergence mean in this case? I am not sure what it implies here.

Comment by Dakara (chess-ice) on Thoughts on “AI is easy to control” by Pope & Belrose · 2024-11-20T14:51:13.297Z · LW · GW

Have you had any p(doom) updates since then or is it still around 5%?

Comment by Dakara (chess-ice) on By Default, GPTs Think In Plain Sight · 2024-11-20T14:02:15.420Z · LW · GW

Now that 2 years have passed, I am quite interested in hearing @Fabien Roger's thoughts on this comment, especially this part: "But how useful could gpt-n be if used in such a way? On the other extreme, gpt-n is producing internal reasoning text at a terabyte/minute. All you can do with it is grep for some suspicious words, or pass it to another AI model. You can't even store it for later unless you have a lot of hard drives. Potentially much more useful. And less safe."

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-20T12:53:45.730Z · LW · GW

That does indeed answer my 3 concerns (and Seth's answer does as well). Overnight, I came up with 1 more concern.

What if AGI somewhere down the line undergoes value drift? After all, looking at evolution, it seems like our evolutionary goal was supposed to be "produce as many offspring as possible". And in recent years, we have strayed from this goal (and are currently much worse at it than our ancestors). Now, humans seem to have goals like "design a video game" or "settle in France" or "climb Everest". What if AGI similarly changes its goals and values over time? Is there a way to prevent that, or at least be safeguarded against it?

I am afraid that if that happens, humans would, metaphorically speaking, stand in AGI's way of climbing Everest.

Comment by Dakara (chess-ice) on 5 ways to improve CoT faithfulness · 2024-11-19T23:05:39.602Z · LW · GW

"Well, I'm not sure. As you mention, it depends on the step size. It also depends on how vulnerable to adversarial inputs LLMs are and how easy they are to find. I haven't looked into the research on this, but it sounds empirically checkable. If there are lots of adversarial inputs which have a wide array of impacts on LLM behavior, then it would seem very plausible that the optimized planner could find useful ones without being specifically steered in that direction."

I am really interested in hearing CBiddulph's thoughts on this. Do you agree with Abram?

Comment by Dakara (chess-ice) on If we solve alignment, do we die anyway? · 2024-11-19T22:14:42.073Z · LW · GW

I have 3 other concrete concerns about this strategy. So, if I understand it correctly, the plan is for humans to align AGI and then for that AGI to align the next AGI, and so forth (until ASI).

  1. What if the strategy breaks on the first step? What if the first AGI turns out to be deceptive (scheming) and only pretends to be aligned with humans? It seems like if we task such a deceptive AGI with aligning other AGIs, then we will end up with a pyramid of misaligned AGIs.

  2. What if the strategy breaks later down the line? What if AGI #21 accidentally aligns AGI #22 to be deceptive (scheming)? Would there be any fallback mechanisms we can rely on?

  3. What is the end goal? Do we stop once we achieve ASI? Can we stop once we achieve ASI? What if ASI doesn't agree and instead opts to continue self-improving? Are we going to be able to get to the point where the acceleration of ASI's intelligence plateaus and we can recuperate and plan for the future?