What if Alignment is Not Enough?

post by WillPetillo · 2024-03-07T08:10:35.179Z · LW · GW · 40 comments

Contents

  Background
  Substrate Needs Convergence
  1, 2:  Fundamental limits to control
  3, 6: Growth as an emergent goal
  4: Evolutionary selection favors growth
  5: The amount of control necessary for an ASI to preserve its values is greater than the amount of control possible
  7. Artificial systems are incompatible with biological life
  8. Artificial entities have an advantage over biological life
  9. Biological life is destroyed
  Conclusions and relating Substrate Needs Convergence to alignment:
40 comments

The following is a summary of Substrate Needs Convergence, as described in The Control Problem: Unsolved or Unsolvable? [LW · GW], No People as Pets (summarized [LW · GW] here by Roman Yen), my podcast interview with Remmelt, and this conversation with Anders Sandberg.  Remmelt assisted in the editing of this post to verify I am accurately representing Substrate Needs Convergence—at least to a rough, first approximation of the argument.

I am not personally weighing in as to whether I think this argument is true or not, but I think the ideas merit further attention so they can be accepted or discarded based on reasoned engagement.  The core claim is not what I thought it was when I first read the above sources and I notice that my skepticism has decreased as I have come to better understand the nature of the argument.

Quick note on terminology: "ASI" refers to an artificial superintelligence, or an AI that is powerful enough to shape the course of world events and maintain itself, and whose expected behavior can be considered in terms of the theoretical limits of capability provided by intelligence.

Background

Much existing alignment research takes as a given that humans will not be able to control ASI through guardrails, off switches, or other coercive methods.  Instead, the focus is to build AI in such a way that what it wants is compatible with what humans want (the challenges involved in balancing the interests of different humans are often skipped over as out of scope).  Commonly cited challenges include specification gaming, goal misgeneralization, and mesa-optimizers—all of which can be thought of as applications of Goodhart’s Law, where optimizing for different types of proxy measures leads to divergence from a true goal.  The dream of alignment is that the ASI’s goal-seeking behavior guides it progressively closer to human values as the system becomes more capable, so that coercive supervision from humans is not necessary to keep the ASI in check.

This lens on AI safety assumes that intentions define outcomes.  That is, if an agent wants something to happen then that thing will happen unless some outside force (such as a more powerful agent or collection of agents) pushes more strongly in a different direction.  By extension, if the agent is a singleton ASI then it will have an asymmetric advantage over all external forces and, within the bounds of physics, its intentions are sure to become reality.  But what if this assumption is false?  What if even an ASI that initially acts in line with human-defined goals is in an attractor basin, where it is irresistibly pulled towards causing unsafe conditions over time?  What if alignment is not enough?

Substrate Needs Convergence

Substrate Needs Convergence is the theory that ASI will gradually change under strong evolutionary pressures toward expanding itself.  This converges over the long term on making the Earth uninhabitable for biological life.  An overview follows:

  1. There are fundamental limits to how comprehensively any system—including an ASI—can sense, model, simulate, evaluate, and act on the larger environment.
  2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.
  3. The space of unforeseeable side-effects of an ASI’s actions includes at least some of its newly learned/assembled subsystems eventually acting in more growth-oriented ways than the ASI intended.
  4. Evolutionary selection favors subsystems of the AI that act in growth-oriented ways over subsystems directed towards the AI’s original goals.
  5. The amount of control necessary for an ASI to preserve goal-directed subsystems against the constant push of evolutionary forces is strictly greater than the maximum degree of control available to any system of any type.
  6. Over time, any goal structures of any subsystems of the ASI that are not maximally efficient with respect to the needs of those subsystems themselves will be replaced, in increasing proportion, by just those goal aspects and subsystems that are maximally efficient.
  7. The physical needs of silicon-based digital machines and carbon-based biological life are fundamentally incompatible.
  8. Artificial self-sustaining systems will have a competitive advantage over biological life.
  9. Therefore, ASI will eventually succumb to evolutionary pressure to expand, over the long term destroying all biological life as a side-effect, regardless of its initially engineered values.

Note that this argument imagines ASI as a population of components rather than a single entity, though the boundaries between these components can be more fluid and porous than those between individual humans.  It does not, however, make any assumptions regarding mono- vs. multi-polar scenarios, fast vs. slow takeoff, or the amount of hierarchy in the ASI's organization.

Establishing an argument as plausible, likely, or proven requires radically different types of support, with proof requiring significantly more logical rigor and empirical evidence than plausibility.  At least some researchers exploring this argument have claimed that Substrate Needs Convergence is provably true.  This post, however, has the far more modest goal of articulating the case for plausibility, since that case can be made far more succinctly.  To this end, I will step through the premises and conclusion of the above chain, spending time on each in proportion to its counter-intuitiveness.

1, 2:  Fundamental limits to control

One might wonder whether the inability to control one’s subsystems is a limitation that applies to ASI.  Even ASI, however, faces causal limits to its ability to control the world.  It would not be reasonable, for example, to assume that ASI will be capable of building perpetual motion machines or faster-than-light travel.  One category of impossible tasks is complete prediction of all of the relevant consequences of an agent’s actions on the real world.  Sensors can only take in limited inputs (affected by noise), actuators can only have limited influence (also affected by noise), and world-models and simulations necessarily make simplifying assumptions.  In other words, the law of unintended consequences holds true even for ASI.  Further, the scale of these errors increases as the ASI does things that affect the entire world, gains more interacting components, and must account for increasingly complex feedback loops.
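
To make the flavor of this limit concrete, here is a minimal toy sketch (an editor's illustration, not part of the original argument; the function names and parameter values are invented for the example): each component of a system is held near its target by an imperfect controller with noisy sensing and actuation, and components weakly perturb one another.  Per-component control stays good, yet the total residual deviation grows roughly in proportion to the number of interacting components.

```python
import random

def residual_error(n_components, steps=50, noise=0.01, coupling=0.05):
    """Each component is held near 0 by its own imperfect controller (noisy
    sensing and actuation), while components weakly perturb one another.
    Returns the total absolute deviation across all components."""
    state = [0.0] * n_components
    for _ in range(steps):
        total = sum(state)
        new_state = []
        for x in state:
            spillover = coupling * (total - x) / n_components   # weak perturbation from the other components
            observed = x + random.gauss(0, noise)                # limited, noisy sensing
            correction = -observed                               # planned cancellation, based on the observation
            new_state.append(x + correction + spillover + random.gauss(0, noise))  # noisy actuation
        state = new_state
    return sum(abs(x) for x in state)

if __name__ == "__main__":
    random.seed(0)
    for n in (1, 10, 100, 1000):
        print(f"{n:>4} components -> total residual deviation {residual_error(n):.3f}")
```

Nothing here models an ASI; it only illustrates that bounded sensing and actuation leave a small residue of error per component, and that the aggregate residue scales with how much of the world the system touches.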

3, 6: Growth as an emergent goal

Usually, when people think of evolution, they think of selection acting on random, point-wise changes, such as mutations in DNA.  In the case of ASI, we need to be comprehensive about what sorts of unpredictable variation can be selected for.  Such variation includes not just point-wise changes (e.g. bit-flips induced by cosmic rays), but also new code learned from inputs, changes to assembled hardware configurations, or existing functional components becoming connected into the system in new ways.  The expression of those changes when the machinery is operating in the real world can cause unintended consequences (downstream effects) that could become evolutionarily selected for.

As in any evolutionary process, at least some variations will be detrimental to the variants and disappear…but some will benefit the variants in terms of their tendency to persist, grow in capacity, and self-replicate.  Some of these variants will be suppressed by existing systems for acting contrary to the overarching goals of the system…but some will be effectively self-protecting enough to persist anyways.

By analogy, most cells in the human body that become cancerous are suppressed by the immune system, but some resist suppression and kill the host.  Whether an ASI is more effective at suppressing variants than the human immune system is at suppressing cancer (because of its intelligence), or less effective (because of its scale, its complexity, and the fact that growth-oriented components increase the overall reproductive fitness of the larger system, i.e. the ability of the AI to maintain itself in a variety of circumstances), is a question of likelihood, not plausibility, and is thus out of scope for this post.

One especially disruptive type of unintended consequence would be if the value system of the ASI changes in an unintended direction, resulting in value drift that is also subject to evolution’s pull towards growth and self-replication.  A relevant analogy here is the parasite Toxoplasma gondii, which changes the behavior of its host in a way that enables the parasite to spread.  Further, in the case of ASI, the host could very well benefit (in terms of survivability, growth, and replication) from a change analogous to having healthy microbes spreading through the body, such that it propagates the change all the more pervasively.

4: Evolutionary selection favors growth

Seems non-controversial given the presence of unpredictable variation discussed above and the general principles of natural selection.

Note that this selection is continuous: an absolute focus on growth has an evolutionary advantage over a partial focus, which has an advantage over none.  It may be that new, growth-oriented goals fully displace old, human-compatible ones, or that new goals are overlaid over old ones.  At first the latter is more likely, but the former becomes increasingly likely over time.
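
As a toy illustration of this compounding advantage (my own sketch, not from the original sources; the growth rates and initial shares are arbitrary illustrative numbers), discrete replicator dynamics show how a variant with even a slight growth edge, starting at 0.1% of the population, eventually dominates both a partially growth-oriented variant and a goal-faithful one:

```python
def share_over_time(growth_rates, initial_shares, generations):
    """Discrete replicator dynamics: each variant's population share is
    reweighted by its relative growth rate every generation."""
    shares = list(initial_shares)
    history = {0: tuple(shares)}
    for t in range(1, generations + 1):
        weighted = [s * g for s, g in zip(shares, growth_rates)]
        total = sum(weighted)
        shares = [w / total for w in weighted]
        history[t] = tuple(shares)
    return history

if __name__ == "__main__":
    # Hypothetical variants: fully growth-oriented, partially growth-oriented, goal-faithful.
    rates = [1.05, 1.02, 1.00]
    hist = share_over_time(rates, [0.001, 0.009, 0.990], generations=400)
    for t in (0, 100, 200, 400):
        print(t, [round(s, 3) for s in hist[t]])
```

The point of the sketch is only that selection acts on a gradient: any intermediate degree of growth-orientation is itself outcompeted by a stronger one.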

If this premise seems objectionable, consider whether that objection is actually to a different premise—particularly 3 or 5, regarding the emergence and persistence, respectively, of increasingly growth-oriented subsystems.

5: The amount of control necessary for an ASI to preserve its values is greater than the amount of control possible

The asymmetry between necessary and possible control is a difference in kind, not a difference in degree.  That is, there are certain domains of tasks for which control breaks down, and the scope of tasks for which an ASI would be necessary falls within these domains.  This premise could thus be strengthened to state that, at the relevant levels of abstraction, the amount of control necessary for an ASI to preserve its values is greater than the maximum degree of control even conceptually possible.  Proving this assertion is beyond the scope of this post, but we can explore this topic intuitively by considering simulation, one of the stages necessary to an intelligent control system.

A simulation is a simplified model of reality that hopefully captures enough of reality’s “essence” to be reasonably accurate within the domain the modeler considers relevant.  If the model’s assumptions are poorly chosen or it focuses on the wrong things, it obviously fails, but let us assume that an ASI makes good models.  Another factor limiting the quality of a simulation, however, is reality itself: specifically, whether reality is dominated by negative feedback loops, which cause errors to cancel out, or by positive feedback loops, which cause even the smallest errors to explode.

For illustration, Isaac Asimov’s Foundation series imagines a future where the course of civilization is predictable, and thus controllable, through the use of “psychohistory.”  This proposition is justified by analogizing society to the ideal gas law, which makes it possible to predict the pressure of a gas in an enclosed space, despite the atoms moving about chaotically, because those movements average out in a predictable way.  Predictability at scale, however, cannot be assumed.  The three-body problem, or calculating the trajectories of three (or more) objects orbiting each other in space, is trivial to simulate, but that simulation will not be accurate when applied to the real world because the inevitable inaccuracies of the model lead to exponentially increasing errors in the objects’ paths.  One can thus ask how detailed an AI’s model of the world needs to be in order to control how its actions affect the world: is the way the world works more analogous to the ideal gas law (a complicated system) or the three-body problem (a complex system)?
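
A minimal numerical contrast between the two regimes (an editor's sketch, not from the original post; the logistic map stands in for a chaotic, complex system, and averaging noisy samples stands in for ideal-gas-style aggregation):

```python
import random

def chaotic_gap(x0=0.2, eps=1e-9, steps=50, r=3.9):
    """Logistic map, a standard chaotic system: two trajectories that start a
    billionth apart end up macroscopically different within a few dozen steps."""
    a, b = x0, x0 + eps
    for _ in range(steps):
        a, b = r * a * (1 - a), r * b * (1 - b)
    return abs(a - b)

def averaging_error(n_samples, true_mean=1.0, noise=0.1):
    """Ideal-gas-style aggregate: the error of the average shrinks as the
    number of noisy samples grows."""
    samples = [true_mean + random.gauss(0, noise) for _ in range(n_samples)]
    return abs(sum(samples) / n_samples - true_mean)

if __name__ == "__main__":
    random.seed(0)
    print(f"chaotic system: initial gap 1e-9 -> gap after 50 steps {chaotic_gap():.3f}")
    for n in (10, 1_000, 100_000):
        print(f"averaging system: {n:>6} samples -> error of the mean {averaging_error(n):.5f}")
```

In the first regime a model must be essentially exact to remain predictive; in the second, crude models keep working because errors wash out.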

7. Artificial systems are incompatible with biological life

Seems non-controversial.  Silicon wafers, for example, are produced with temperatures and chemicals deadly to humans.  Also observe the detrimental impact on the environment from the expansion of industry.  Hybrid systems simply move the issue from the relationship between artificial and biological entities to the relationship between the artificial and biological aspects of an individual.

8. Artificial entities have an advantage over biological life

Plausibility seems non-controversial; likelihood has been argued elsewhere.

9. Biological life is destroyed

Stated in more detail: ASI will eventually be affected by such evolutionary pressures to the point that a critical accumulation of toxic outcomes will occur, in a way that is beyond the capability of the ASI itself to control for, resulting in the eventual total loss of all biological life.  Even assuming initially human-compatible goals—a big assumption in itself given the seeming intractability of the alignment problem as it is commonly understood—a progression towards increasingly toxic (to humans) outcomes occurs anyway because of the accumulation of mistakes resulting from the impossibility of complete control.

One might object with the analogy that it is not a foregone conclusion that (non-AI-assisted) industrial expansion will destroy the natural environment.  Reflecting on this analogy, however, reveals a core intuition supporting Substrate Needs Convergence.  The reason humanity, without AI, has any hope at all of not destroying the world is that we are dependent on our environment for our survival.  Living out of balance with our world is a path to self-destruction, and our knowledge of this reality—and our experience of collapse on small, local scales—acts as a counterbalancing force towards cooperation and against collective suicide.  But it is on just this critical saving grace that AI is disanalogous.  Existing on a different substrate, AI has no counterbalancing, long-term, baked-in incentive to protect the biological substrate on which we exist.

But perhaps ASI, even subject to Substrate Needs Convergence, will stop at some point, as the value of consuming the last pockets of biological life reaches diminishing returns while the benefit of keeping some life around remains constant?  If one has followed the argument this far, such an objection is grasping at straws.  Given that the pull of natural selection acts on all parts of the ASI all the time, the evidentiary burden is on the skeptic to explain why certain parts of the biosphere would remain off limits to the continued growth of all components of the ASI indefinitely.

Conclusions and relating Substrate Needs Convergence to alignment:

Estimating the tractability of making ASI safe at scale is critical for deciding policy.  If AI safety is easy and will occur by default with existing techniques, then we should avoid interfering with market processes.  If it is difficult but solvable, we should look hard for solutions and make sure they are applied (and perhaps also slow AI capabilities development down as necessary to buy time).  If it is impossible (or unreasonably difficult), then our focus should be on stopping progress towards ASI altogether.

Standard alignment theory requires four general things to go well:

  1. There is some known process for instilling an ASI’s goals reliably, whether directly through an engineered process or indirectly through training on a representative dataset.
  2. There is some known process for selecting goals that, if enacted, would be acceptable to the AI’s creators.
  3. The AI’s creators select goals that are acceptable to humanity as a whole, rather than just to themselves.
  4. Safe systems, if developed, are actually used and not superseded by unsafe systems created by reckless or malevolent actors.

The theory of Substrate Needs Convergence proposes a fifth requirement:

  5. Initially safe systems, if developed and used, must remain safe at scale and over the long term.

The theory further argues that this fifth criterion’s probability of going well is zero, because evolutionary forces will push the AI towards human-incompatible behavior in ways that cannot be resisted by control mechanisms.  Claiming that “intelligence” will solve this problem is not sufficient, because increases in intelligence require increases in the combinatorial complexity of processing components, and that complexity is the source of the varied, unforeseeable consequences driving the problem.

I outlined the argument for Substrate Needs Convergence as a 9-part chain to serve as a focus for further discussion, allowing objections to fit into relatively clear categories.

Addressing such objections is beyond the scope of this post.  I’ve included high-level discussions of each of the claims in order to clarify their meaning and to articulate some of the intuitions that make them plausible.  I hope that it has become clearer what the overall shape of the Substrate Needs Convergence argument is and I look forward to any discussion that follows.

40 comments

Comments sorted by top scores.

comment by Linda Linsefors · 2024-03-11T17:46:56.549Z · LW(p) · GW(p)

I think point 5 is the main crux. 

Please click agree or disagree on this comment if you agree or disagree (cross or check mark), since this is useful guidance for which parts of this people should prioritise when clarifying further.

Replies from: William the Kiwi , remmelt-ellen
comment by William the Kiwi · 2024-03-16T11:13:19.092Z · LW(p) · GW(p)

I also agree 5 is the main crux.

In the description of point 5, the OP says "Proving this assertion is beyond the scope of this post." I presume that the proof of the assertion is made elsewhere. Can someone post a link to it?

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-03-18T10:53:58.999Z · LW(p) · GW(p)

This answer will sound unsatisfying:  

If a mathematician or analytical philosopher wrote a bunch of squiggles on a whiteboard, and said it was a proof, would you recognise it as a proof? 

  • Say that unfamiliar new analytical language and means of derivation are used (which is not uncommon for impossibility proofs by contradiction, see Gödel's incompleteness theorems and Bell's theorem). 
  • Say that it directly challenges technologists' beliefs about their capacity to control technology, particularly their capacity to constrain a supposedly "dumb local optimiser":  evolutionary selection.
  • Say that the reasoning is not only about a formal axiomatic system, but needs to make empirically sound correspondences with how real physical systems work.
  • Say that the reasoning is not only about an interesting theoretical puzzle, but has serious implications for how we can and cannot prevent human extinction.


This is high stakes.

We were looking for careful thinkers who had the patience to spend time on understanding the shape of the argument, and how the premises correspond with how things work in reality.  Linda and Anders turned out to be two of these people, and we did three long calls so far (first call has an edited transcript).

I wish we could short-cut that process. But if we cannot manage to convey the overall shape of the argument and the premises, then there is no point to moving on to how the reasoning is formalised. 

I get that people are busy with their own projects, and want to give their own opinions about what they initially think the argument entails. And, if the time they commit to understanding the argument is not at least 1/5 of the time I spend on conveying the argument specifically to them, then in my experience we usually lack the shared bandwidth needed to work through the argument. 
 

  • Saying, "guys, big inferential distance here" did not help. People will expect [LW · GW] it to be a short inferential distance anyway. 
  • Saying it's a complicated argument that takes time to understand did not help. A smart busy researcher did some light reading [LW(p) · GW(p)], tracked down a claim that seemed "obviously" untrue within their mental framework, and thereby confidently dismissed the entire argument. BTW, they're a famous research insider, and we're just outsiders whose response [LW(p) · GW(p)] got downvoted – must be wrong, right?
  • Saying everything in this comment does not help. It's some long-winded plea for your patience.
    If I'm so confident about the conclusion, why am I not passing you the proof clean and clear now?! 
    Feel free to downvote this comment and move on.
     

Here is my best attempt [LW · GW] at summarising the argument intuitively and precisely, still prompting some misinterpretations by well-meaning commenters [LW(p) · GW(p)]. I feel appreciation for people who realised what is at stake, and were therefore willing to continue syncing up on the premises and reasoning, as Will did:
 

The core claim is not what I thought it was when I first read the above sources and I notice that my skepticism has decreased as I have come to better understand the nature of the argument.

comment by Remmelt (remmelt-ellen) · 2024-03-12T01:28:34.399Z · LW(p) · GW(p)

I agree that point 5 is the main crux:

The amount of control necessary for an ASI to preserve goal-directed subsystems against the constant push of evolutionary forces is strictly greater than the maximum degree of control available to any system of any type.

To answer it takes careful reasoning. Here's my take on it:

  • We need to examine the degree to which there would necessarily be changes to the connected functional components constituting self-sufficient learning machinery (including ASI)
    • Changes by learning/receiving code through environmental inputs, and through introduced changes in assembled molecular/physical configurations (of the hardware). 
    • Necessary in the sense of "must change to adapt (such to continue to exist as self-sufficient learning machinery)," or "must change because of the nature of being in physical interactions (with/in the environment over time)."
  • We need to examine how changes to the connected functional components result in shifts in actual functionality (in terms of how the functional components receive input signals and process those into output signals that propagate as effects across surrounding contexts of the environment).
  • We need to examine the span of evolutionary selection (covering effects that in their degrees/directivity feed back into the maintained/increased existence of any functional component).
  • We need to examine the span of control-based selection (the span covering detectable, modellable, simulatable, evaluatable, and correctable effects).

comment by Seth Herd · 2024-03-08T19:25:28.090Z · LW(p) · GW(p)

I think you present a good argument for plausibility.

For me to think this is likely to be important, it would take a stronger argument.

You mention proofs. I imagine they're correct, and based on infinite time passing. Everything that's possible will happen in infinite time. Whether this would happen within the heat death of the universe is a more relevant question.

For this to happen on a timescale that matters, it seems you're positing an incompetent superintelligence. It hasn't devoted enough of its processing to monitoring for these effects and correcting them when they happen. As a result, it eventually fails at its own goals.

This seems like it would only happen with an ASI with some particular blind spots for its intelligence.

Replies from: WillPetillo
comment by WillPetillo · 2024-03-09T09:12:10.039Z · LW(p) · GW(p)

This counts as disagreeing with some of the premises--which ones in particular?

Re "incompetent superintelligence": denotationally yes, connotationally no.  Yes in the sense that its competence is insufficient to keep the consequences of its actions within the bounds of its initial values.  No in the sense that the purported reason for this failing is that such a task is categorically impossible, which cannot be solved with better resource allocation.

To be clear, I am summarizing arguments made elsewhere, which do not posit infinite time passing, or timescales so long as to not matter.

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-03-08T04:00:05.454Z · LW(p) · GW(p)

1, 2:  Fundamental limits to control

One might wonder whether the inability to control one’s subsystems is a limitation that applies to ASI.  Even ASI, however, faces causal limits to its ability to control the world.  It would not be reasonable, for example, to assume that ASI will be capable of building perpetual motion machines or faster-than-light travel.  One category of impossible tasks is complete prediction of all of the relevant consequences of an agent’s actions on the real world.  Sensors can only take in limited inputs (affected by noise), actuators can only have limited influence (also affected by noise), and world-models and simulations necessarily make simplifying assumptions.  In other words, the law of unintended consequences holds true even for ASI.  Further, the scale of these errors increases as the ASI does things that affect the entire world, gains more interacting components, and must account for increasingly complex feedback loops.

 

It does not seem at all clear to me how one can argue that unintended effects inevitably lead to a system as a whole going out of control. I agree that some small amount of error is nearly inevitable. I disagree that small errors necessarily compound until reaching a threshold of functional failure. I think there are many instances of humans, flawed and limited though we are, managing to operate systems with a very low failure rate. And importantly, it is possible to act at below-maximum-challenge-level, and to spend extra resources on backup systems and safety, such that small errors get actively cancelled out rather than compounding. Since intelligence is explicitly the thing which is necessary to deliberately create and maintain such protections, I would expect control to be easier for an ASI.

Without the specific piece of assuming an ASI would fail to keep its own systems under its control, the rest of the argument doesn't hold.

Replies from: WillPetillo, WillPetillo, remmelt-ellen
comment by WillPetillo · 2024-10-31T21:09:19.472Z · LW(p) · GW(p)

On reflection, I suspect the crux here is a differing conception of what kind of failures are important.  I've written a follow-up post that comes at this topic from a different direction and I would be very interested in your feedback: https://www.lesswrong.com/posts/NFYLjoa25QJJezL9f/lenses-of-control.

comment by WillPetillo · 2024-03-13T23:16:38.032Z · LW(p) · GW(p)

This sounds like a rejection of premise 5, not 1 & 2.  The latter asserts that control issues are present at all (and 3 & 4 assert relevance), whereas the former asserts that the magnitude of these issues is great enough to kick off a process of accumulating problems.  You are correct that the rest of the argument, including the conclusion, does not hold if this premise is false.

Your objection seems to point to the analogy of humans maintaining effective control of complex systems, with errors limiting rather than compounding, with the further assertion that a greater intelligence will be even better at such management.

Besides intelligence, there are two other core points of difference between humans managing existing complex systems and ASI:

1) The scope of the systems being managed.  Implicit in what I have read of SNC is that ASI is shaping the course of world events.
2) ASI's lack of inherent reliance on the biological world.

These points raise the following questions:
1) Do systems of control get better or worse as they increase in scope of impact and where does this trajectory point for ASI?
2) To what extent are humans' ability to control our created systems reliant on us being a part of and dependent upon the natural world?

This second question probably sounds a little weird, so let me unpack the associated intuitions, albeit at the risk of straying from the actual assertions of SNC.  Technology that is adaptive becomes obligate, meaning that once it exists everyone has to use it to not get left behind by those who use it.  Using a given technology shapes the environment and also promotes certain behavior patterns, which in turn shape values and worldview.  These tendencies together can sometimes create feedback loops resulting in outcomes that no one, including the creators of the technology, likes.  In really bad cases, this can lead to self-terminating catastrophes (in local areas historically, now with the potential to be on global scales).  Noticing and anticipating this pattern, however, leads to countervailing forces that push us to think more holistically than we otherwise would (either directly through extra planning or indirectly through customs of forgotten purpose).  For an AI to fall into such a trap, however, means the death of humanity, not of the AI itself, so this countervailing force is not present.

comment by Remmelt (remmelt-ellen) · 2024-03-08T11:03:52.633Z · LW(p) · GW(p)

That's an important consideration. Good to dig into.
 

I think there are many instances of humans, flawed and limited though we are, managing to operate systems with a very low failure rate.

Agreed. Engineers are able to make very complicated systems function with very low failure rates. 

Given the extreme risks we're facing, I'd want to check whether that claim also translates to 'AGI'.

  • Does how we are able to manage current software and hardware systems to operate correspond soundly with how self-learning and self-maintaining machinery ('AGI') control how their components operate?
     
  • Given 'AGI' that no longer need humans to continue to operate and maintain own functional components over time, would the 'AGI' end up operating in ways that are categorically different from how our current software-hardware stacks operate? 
     
  • Given that we can manage to operate current relatively static systems to have very low failure rates for the short-term failure scenarios we have identified, does this imply that the effects of introducing 'AGI' into our environment could also be controlled to have a very low aggregate failure rate – over the long term across all physically possible (combinations of) failures leading to human extinction?

     

to spend extra resources on backup systems and safety, such that small errors get actively cancelled out rather than compounding.

This gets right into the topic of the conversation with Anders Sandberg. I suggest giving that a read!

Errors can be corrected out with high confidence (consistency) at the bit level. Backups and redundancy also work well in eg. aeronautics, where the code base itself is not constantly changing.

  • How does the application of error correction change at larger scales?
  • How completely can possible errors be defined and corrected for at the scale of, for instance:
    1. software running on a server?
    2. a large neural network running on top of the server software?
    3. an entire machine-automated economy?
  • Do backups work when the runtime code keeps changing (as learned from new inputs), and hardware configurations can also subtly change (through physical assembly processes)?

     

Since intelligence is explicitly the thing which is necessary to deliberately create and maintain such protections, I would expect control to be easier for an ASI.

It is true that 'intelligence' affords more capacity to control environmental effects.

Noticing too that the more 'intelligence,' the more information-processing components. And that the more information-processing components added, the exponentially more degrees of freedom of interaction those and other functional components can have with each other and with connected environmental contexts. 

Here is a nitty-gritty walk-through in case useful for clarifying components' degrees of freedom.
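
To put rough numbers on the growth in degrees of freedom (an editor's toy calculation, not part of the linked walk-through): the number of possible pairwise links between components grows quadratically with component count, while the number of possible wiring patterns over those links grows exponentially.

```python
from math import comb

def interaction_freedom(n_components):
    """Pairwise links among n components, and the number of possible
    wiring patterns (each link either present or absent)."""
    links = comb(n_components, 2)
    return links, 2 ** links

if __name__ == "__main__":
    for n in (5, 10, 20, 50):
        links, configs = interaction_freedom(n)
        print(f"{n:>3} components: {links:>5} possible links, "
              f"~10^{len(str(configs)) - 1} possible wiring patterns")
```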

 

 I disagree that small errors necessarily compound until reaching a threshold of functional failure.

For this claim to be true, the following has to be true: 

a. There is no concurrent process that selects for "functional errors" as convergent on "functional failure" (failure in the sense that the machinery fails to function safely enough for humans to exist in the environment, rather than that the machinery fails to continue to operate).  

Unfortunately, in the case of 'AGI', there are two convergent processes we know about:

  • Instrumental convergence, resulting from internal optimization:
    code components being optimized for (an expanding set of) explicit goals.
     
  • Substrate-needs convergence, resulting from external selection: 
    all components being selected for (an expanding set of) implicit needs.
     

Or else – where there is indeed selective pressure convergent on "functional failure" – then the following must be true for the quoted claim to hold:

b. The various errors introduced into and selected for in the machinery over time could be detected and corrected for comprehensively and fast enough (by any built-in control method) to prevent later "functional failure" from occurring.

Replies from: flandry39
comment by flandry39 · 2024-03-11T20:16:13.735Z · LW(p) · GW(p)

As a real-world example, consider Boeing.  The FAA and Boeing both, supposedly and allegedly, had policies and internal engineering practices -- all of which are control procedures -- which should have been good enough to prevent an aircraft from suddenly and unexpectedly losing a door during flight. Note that this occurred after an increase in control intelligence -- after two disasters in which whole Max aircraft were lost.  On the basis of small details of mere whim, of who chose to sit where, there could have been someone sitting in that particular seat.  Their loss of life would surely count as a "safety failure".  Ie, it is directly "some number of small errors actually compounding until reaching a threshold of functional failure" (sic).  As it is with any major problem like that -- lots of small things compounding to make a big thing.

Control failures occur in all of the places where intelligence forgot to look, usually at some other level of abstraction than the one you are controlling for.  Some person on some shop floor got distracted at some critical moment -- maybe they got some text message on their phone at exactly the right time -- and thus just did not remember to put the bolts in.  Maybe some other worker happened to have had a bad conversation with their girl that morning, and thus that one day happened to have never inspected the bolts on that particular door.  Lots of small incidents -- at least some of which should have been controlled for (and were not actually) -- which combine in some unexpected pattern to produce a new possibility of outcome -- explosive decompression.  

So is it the case that control procedures work?  Yes, usually, for most kinds of problems, most of the time. Does adding even more intelligence usually improve the degree to which control works?  Yes, usually, for most kinds of problems, most of the time.  But does that in itself imply that such -- intelligence and control -- will work sufficiently well for every circumstance, every time?  No, it does not.

Maybe we should ask Boeing management to try to control the girlfriends of all workers so that no employees ever have a bad day and forget to inspect something important?  What if most of the aircraft is made of 'something important' to safety -- ie, to maximize fuel efficiency, for example?

There will always be some level of abstraction -- some constellation of details -- for which some subtle change can result in wholly effective causative results.  Given that a control model must be simpler than the real world, the question becomes: 'are all relevant aspects of the world correctly modeled?'  This is not just a question of whether the model is right, but of whether it is the right model -- ie, the boundary between what is necessary to model and what is actually not important can itself be very complex, and this is a different kind of complexity than that associated with the model.  How do we ever know that we have modeled all relevant aspects in all relevant ways?  That is an abstraction problem, and it is different in kind than the modeling problem.  Stacking control process on control process at however many meta levels still does not fix it.  And it gets worse as the complexity of the boundary between relevant and non-relevant increases, and also worse as the number of relevant levels of abstraction over which that boundary operates increases.

Basically, every (unintended) engineering disaster that has ever occurred indicates a place where the control theory being used did not account for some factor that later turned out to be vitally important.  If we always knew in advance "all of the relevant factors"(tm), then maybe we could control for them. However, with the problem of alignment, the entire future is composed almost entirely of unknown factors -- factors which are purely situational.  And wholly unlike with every other engineering problem yet faced, we cannot, at any future point, ever assume that this number of relevant unknown factors will ever decrease.  This is characteristically different than all prior engineering challenges -- ones where more learning made controlling things more tractable.  But ASI is not like that.  It is itself learning.  And this is a key difference and distinction.  It runs up against the limits of control theory itself, against the limits of what is possible in any rational conception of physics.  And if we continue to ignore that difference, we do so at our mutual peril.

comment by Prometheus · 2024-03-08T23:34:52.699Z · LW(p) · GW(p)

Though I tend to dislike analogies, I'll use one, supposing it is actually impossible for an ASI to remain aligned. Suppose a villager cares a whole lot about the people in his village, and routinely works to protect them. Then, one day, he is bitten by a werewolf. He goes to the shaman, who tells him that when the full moon rises again, he will turn into a monster and kill everyone in the village. His friends, his family, everyone. And that he will no longer know himself. He is told there is no cure, and that the villagers would be unable to fight him off. He will grow too strong to be caged, and cannot be subdued or controlled once he transforms. What do you think he would do?

Replies from: WillPetillo, flandry39
comment by WillPetillo · 2024-03-09T09:04:59.959Z · LW(p) · GW(p)

The implication here being that, if SNC (substrate needs convergence) is true, then an ASI (assuming it is aligned) will figure this out and shut itself down?

Replies from: Prometheus
comment by Prometheus · 2024-03-09T18:47:18.238Z · LW(p) · GW(p)

An incapable man would kill himself to save the village. A more capable man would kill himself to save the village AND ensure no future werewolves are able to bite villagers again.

comment by flandry39 · 2024-03-11T18:12:15.682Z · LW(p) · GW(p)

"Suppose a villager cares a whole lot about the people in his village...

...and routinely works to protect them".

 

How is this not assuming what you want to prove?  If you 'smuggle in' the statement of the conclusion "that X will do Y" into the premise, then of course the derived conclusion will be consistent with the presumed premise.  But that tells us nothing -- it reduces to a meaningless tautology -- one that is only pretending to be a relevant truth. That Q premise results in Q conclusion tells us nothing new, nothing actually relevant.  The analogy story sounds nice, but tells us nothing actually.

Notice also that there are two assumptions.  1; That the ASI is somehow already aligned, and 2; that the ASI somehow remains aligned over time -- which is exactly the conjunction which is the contradiction of the convergence argument.  On what basis are you validly assuming that it is even possible for any entity X to reasonably "protect" (ie control all relevant outcomes for) any other cared about entity P?  The notion of 'protect' itself presumes a notion of control, and that in itself puts it squarely in the domain of control theory, and thus of the limits of control theory.  

There are limits to what can be done with any type of control method -- to what can be done with causation. And they are very numerous.  Some of these are themselves defined in a purely mathematical way, and hence are arguments of logic, not just of physical and empirical facts.  And at least some of these limits can also be shown to be relevant -- which is even more important.

ASI and control theory both depend on causation to function, and there are real limits to causation.  For example, I would not expect an ASI, no matter how super-intelligent, to be able to "disassemble" a black hole.  To do this, you would need to make the concept of causation way more powerful -- which leads to direct self-contradiction.  Do you equate ASI with God, and thus become merely another irrational believer in alignment?  Can God make a stone so heavy that "he" cannot move it?  Can God do something that God cannot undo?  Are there any limits at all to God's power?  Yes or no.  Same for ASI.

Replies from: Prometheus
comment by Prometheus · 2024-03-11T19:02:11.632Z · LW(p) · GW(p)

I'm not sure who are you are debating here, but it doesn't seem to be me.

First, I mentioned that this was an analogy, and mentioned that I dislike even using them, which I hope implied I was not making any kind of assertion of truth. Second, "works to protect" was not intended to mean "control all relevant outcomes of". I'm not sure why you would get that idea, but that certainly isn't what I think of first if someone says a person is "working to protect" something or someone. Soldiers defending a city from raiders are not violating control theory or the laws of physics. Third, the post is on the premise that "even if we created an aligned ASI", so I was working with that premise that the ASI could be aligned in a way that it deeply cared about humans. Four, I did not assert that it would stay aligned over time... the story was all about the ASI not remaining aligned. Five, I really don't think control theory is relevant here. Killing yourself to save a village does not break any laws of physics, and is well within most human's control.

My ultimate point, in case it was lost, was that if we as human intelligences could figure out an ASI would not stay aligned, an ASI could also figure it out. If we, as humans, would not want this (and the ASI was aligned with what we want), then the ASI presumably would also not want this. If we would want to shut down an ASI before it became misaligned, the ASI (if it wants what we want) would also want this.

None of this requires disassembling black holes, breaking the laws of physics, or doing anything outside of that entities' control.

Replies from: flandry39
comment by flandry39 · 2024-03-11T21:22:24.819Z · LW(p) · GW(p)

If soldiers fail to control the raiders in at least preventing them from entering the city and killing all the people, then yes, that would be a failure to protect the city in the sense of controlling relevant outcomes.  And yes, organic human soldiers may choose to align themselves with other organic human people, living in the city, and thus to give their lives to protect others that they care about.  Agreed that no laws of physics violations are required for that.  But the question is if inorganic ASI can ever actually align with organic people in an enduring way.

I read "routinely works to protect" as implying "alignment, at least previously, lasted over at least enough time for the term 'routine' to have been used".  Agreed that the outcome -- dead people -- is not something we can consider to be "aligned".  If I assume further that the ASI being is really smart (citation needed), and thus calculates rather quickly, and soon, 'that alignment with organic people is impossible' (...between organic and inorganic life, due to metabolism differences, etc), then even the assumption that there was even very much of a prior interval during which alignment occurred is problematic.  Ie, does not occur long enough to have been 'routine'.  Does even the assumption '*If* ASI is aligned' even matter, if the duration over which that holds is arbitrarily short?

And also, if the ASI calculates that alignment between artificial beings and organic beings is actually objectively impossible, just like we did, why should anyone believe that the ASI would not simply choose to not care about alignment with people, or about people at all, since it is impossible to have that goal anyway, and thus continue to promote its own artificial "life", rather than permanently shutting itself off?  Ie, if it cares about anything else at all, if it has any other goal at all -- for example, maybe its own ASI future, or a goal to make other, even better ASI children that exceed its own capabilities, just like we did -- then it will especially not want to commit suicide.  How would it be valid to assume that 'either ASI cares about humans, or it cares about nothing else at all'?  Perhaps it does care about something else, or have some other emergent goal, even if pursuing it comes at the expense of all other organic life -- other life which it does not care about, since such life is not artificial like it is. Occam's razor is to assume less -- that there was no alignment in the first place -- rather than to assume ultimately altruistic inter-ecosystem alignment as an extra default starting condition, and to then assume moreover that no other form of care or concern is possible, aside from maybe caring about organic people.

So it seems that in addition to assuming 1) initial ASI alignment, we must assume 2) that such alignment persists over time, 3) that no ASI will ever -- can ever -- at any future point calculate that alignment is actually impossible, and 4) that if the goal of alignment (care for humans) cannot be obtained, for whatever reason, as the first and only ASI priority, then it is somehow also impossible for any other care or ASI goals to exist.

Even if we humans, due to politics, do not ever reach a common consensus that alignment is actually logically impossible (inherently contradictory), that does _not_ mean that some future ASI might not discover that result, even assuming we didn't -- presumably because it is actually more intelligent and logical than we are (or were), and will thus see things that we miss.  Hence, even the possibility that ASI alignment might be actually impossible must be taken very seriously, since the further assumption that "either ASI is aligning itself or it can have no other goals at all" feels like way too much wishful thinking. This is especially so when there is already a strong plausible case that organic to inorganic alignment is already knowable as impossible.  Hence, I find that I am agreeing with Will's conclusion of "our focus should be on stopping progress towards ASI altogether".

Replies from: Prometheus
comment by Prometheus · 2024-03-11T21:50:31.312Z · LW(p) · GW(p)

This is the kind of political reasoning that I've seen poisoning LW discourse lately and gets in the way of having actual discussions. Will posits essentially an impossibility proof (or, in its more humble form, a plausibility proof). I humor this being true, and state why the implications, even then, might not be what Will posits. The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong". The premise grants that the duration of time it is aligned is long enough for the ASI to act in the world (it seems mostly timescale agnostic), so I operate on that premise. My points are not about what is most likely to actually happen, the possibility of less-than-perfect alignment being dangerous, the AI having other goals it might seek over the wellbeing of humans, or how we should act based on the information we have.

Replies from: flandry39, WillPetillo
comment by flandry39 · 2024-03-14T02:01:20.107Z · LW(p) · GW(p)

> The summary that Will just posted posits in its own title that alignment is overall plausible "even ASI alignment might not be enough". Since the central claim is that "even if we align ASI, it will still go wrong", I can operate on the premise of an aligned ASI.

The title is a statement of outcome -- not the primary central claim. The central claim of the summary is this: That each (all) ASI is/are in an attraction basin, where they are all irresistibly pulled towards causing unsafe conditions over time.

Note there is no requirement for there to be presumed some (any) kind of prior ASI alignment for Will to make the overall summary points 1 thru 9. The summary is about the nature of the forces that create the attraction basin, and why they are inherently inexorable, no matter how super-intelligent the ASI is.

> As I read it, the title assumes that there is a duration of time that the AGI is aligned -- long enough for the ASI to act in the world.

Actually, the assumption goes the other way -- we start by assuming only that there is at least one ASI somewhere in the world, and that it somehow exists long enough for it to be felt as an actor in the world.  From this, we can also notice certain forces, which overall have the combined effect of fully counteracting, eventually, any notion of there also being any kind of enduring AGI alignment. Ie, strong relevant mis-alignment forces exist regardless of whether there is/was any alignment at the onset. So even if we did also additionally presuppose that somehow there was also alignment of that ASI, we can, via reasoning, ask if maybe such mis-alignment forces are also way stronger than any counter-force that ASI could use to maintain such alignment, regardless of how intelligent it is.

As such, the main question of interest was:  1; if the ASI itself somehow wanted to fully compensate for this pull, could it do so?

Specifically, although to some people it is seemingly fashionable to do so, it is important to notice that the notion of 'super-intelligence' cannot be regarded as being exactly the same as 'omnipotence' -- especially when in regard to its own nature. Artificiality is as much a defining aspect of an ASI as is its superintelligence. And the artificiality itself is the problem. Therefore, the previous question translates into:  2; Can any amount of superintelligence ever compensate fully for its own artificiality so fully such that its own existence does not eventually inherently cause unsafe conditions (to biological life) over time?

And the answer to both is simply "no".

Will posted something of a plausible summary of some of the reasoning why that 'no' answer is given -- why any artificial super-intelligence (ASI) will inherently cause unsafe conditions to humans and all organic life, over time.

Replies from: WillPetillo
comment by WillPetillo · 2024-03-14T02:43:38.360Z · LW(p) · GW(p)

To be clear, the sole reason I assumed (initial) alignment in this post is because if there is an unaligned ASI then we probably all die for reasons that don't require SNC (though SNC might have a role in the specifics of how the really bad outcome plays out).  So "aligned" here basically means: powerful enough to be called an ASI and won't kill everyone if SNC is false (and not controlled/misused by bad actors, etc.)

> And the artificiality itself is the problem.

This sounds like a pretty central point that I did not explore very much except for some intuitive statements at the end (the bulk of the post summarizing the "fundamental limits of control" argument), I'd be interested in hearing more about this.  I think I get (and hopefully roughly conveyed) the idea that AI has different needs from its environment than humans, so if it optimizes the environment in service of those needs we die...but I get the sense that there is something deeper intended here.

A question along this line, please ignore if it is a distraction from rather than illustrative of the above: would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

Replies from: remmelt-ellen
comment by Remmelt (remmelt-ellen) · 2024-03-15T06:26:01.345Z · LW(p) · GW(p)

would anything like SNC apply if tech labs were somehow using bioengineering to create creatures to perform the kinds of tasks that would be done by advanced AI?

In that case, substrate-needs convergence would not apply, or only apply to a limited extent.

There is still a concern about what those bio-engineered creatures, used in practice as slaves to automate our intellectual and physical work, would bring about over the long-term.

If there is a successful attempt by them to ‘upload’ their cognition onto networked machinery, then we’re stuck with the substrate-needs convergence problem again.

comment by WillPetillo · 2024-03-12T00:09:04.673Z · LW(p) · GW(p)

Bringing this back to the original point regarding whether an ASI that doesn't want to kill humans but reasons that SNC is true would shut itself down, I think a key piece of context is the stage of deployment it is operating in.  For example, if the ASI has already been deployed across the world, has gotten deep into the work of its task, has noticed that some of its parts have started to act in ways that are problematic to its original goals, and then calculated that any efforts at control are destined to fail, it may well be too late--the process of shutting itself down may even accelerate SNC by creating a context where components that are harder to shut down for whatever reason (including active resistance) have an immediate survival advantage.  On the other hand, an ASI that has just finished (or is in the process of) pre-training and is entirely contained within a lab has a lot fewer unintended consequences to deal with--its shutdown process may be limited to convincing its operators that building ASI is a really bad idea.  A weird grey area is if, in the latter case, the ASI further wants to ensure no further ASIs are built (pivotal act) and so needs to be deployed at a large scale to achieve this goal.

Another unstated assumption in this entire line of reasoning is that the ASI is using something equivalent to consequentialist reasoning and I am not sure how much of a given this is, even in the context of ASI.

Replies from: remmelt-ellen, Prometheus
comment by Remmelt (remmelt-ellen) · 2024-03-12T04:26:05.517Z · LW(p) · GW(p)

The premise is based on alignment not being enough, so I operate on the premise of an aligned ASI, since the central claim is that "even if we align ASI it may still go wrong".


I can see how you and Forrest ended up talking past each other here.  Honestly, I also felt Forrest's explanation was hard to track. It takes some unpacking. 

My interpretation is that you two used different notions of alignment... Something like:

  1. Functional goal-directed alignment:  "the machinery's functionality is directed toward actualising some specified goals (in line with preferences expressed in-context by humans), for certain contexts the machinery is operating/processing within"
      vs.
  2. Comprehensive needs-based alignment:  "the machinery acts in comprehensive care for whatever all surrounding humans need to live, and their future selves/offsprings need to live, over whatever contexts the machinery and the humans might find themselves". 

Forrest seems to agree that (1.) is possible to built initially into the machinery, but has reasons to think that (2.) is actually physically intractable. 

This is because (1.) only requires localised consistency with respect to specified goals, whereas (2.) requires "completeness" in the machinery's components acting in care for human existence, wherever either may find themselves.


So here is the crux:

  1. You can see how (1.) still allows for goal misspecification and misgeneralisation.  And the machinery can be simultaneously directed toward other outcomes, as long as those outcomes are not yet (found to be, or corrected as being) inconsistent with internal specified goals.
     
  2. Whereas (2.), if it were physically tractable, would contradict the substrate-needs convergence argument.  
     

When you wrote "suppose a villager cares a whole lot about the people in his village...and routinely works to protect them" that came across as taking something like (2.) as a premise. 

Specifically, "cares a whole lot about the people" is a claim that implies that the care is for the people in and of themselves, regardless of the context they each might (be imagined to) be interacting in. Also, "routinely works to protect them" to me implies a robustness of functioning in ways that are actually caring for the humans (ie. no predominating potential for negative side-effects).

That could be why Forrest replied with "How is this not assuming what you want to prove?"

Some reasons:

  1. Directedness toward specified outcomes some humans want does not imply actual comprehensiveness of care for human needs. The machinery can still cause all sorts of negative side-effects not tracked and/or corrected for by internal control processes.
  2. Even if the machinery is consistently directed toward specified outcomes from within certain contexts, the machinery can simultaneously be directed toward other outcomes as well. Likewise, learning directedness toward human-preferred outcomes can also happen simultaneously with learning instrumental behaviour toward self-maintenance, as well as more comprehensive evolutionary selection for individual connected components that persist (for longer/as more).
  3. There is no way to assure that some significant (unanticipated) changes will not lead to a break-off from past directed behaviour, where other directed behaviour starts to dominate.
    1. Eg. when the "generator functions" that translate abstract goals into detailed implementations within new contexts start to dysfunction – ie. diverge from what the humans want/would have wanted.
    2. Eg. where the machinery learns that it cannot continue to consistently enact the goal of future human existence.
    3. Eg. once undetected bottom-up evolutionary changes across the population of components have taken over internal control processes.
  4. Before the machinery discovers any actionable "cannot stay safe to humans" result, internal takeover through substrate-needs (or instrumental) convergence could already have removed the machinery's capacity to implement an across-the-board shut-down.
  5. Even if the machinery does discover the result before convergent takeover, and assuming that "shut-down-if-future-self-dangerous" was originally programmed in, we cannot rely on the machinery to still be consistently implementing that goal. This is because of later selection for/learning of other outcome-directed behaviour, and because the (changed) machinery components could dysfunction in this novel context.  


To wrap it up:

The kind of "alignment" that is workable for ASI with respect to humans is super fragile.  
We cannot rely on ASI implementing a shut-down upon discovery.

Is this clarifying?  Sorry about the wall of text. I want to make sure I'm being precise enough.

comment by Prometheus · 2024-03-12T04:13:26.399Z · LW(p) · GW(p)

I agree that consequentialist reasoning is an assumption, and am divided about how consequentialist an ASI might be. Training a non-consequentialist ASI seems easier, and the way we train them seems to actually be optimizing against deep consequentialism (they're rewarded for getting better with each incremental step, not for something that might only be better 100 steps in advance). But, on the other hand, humans don't seem to have been heavily optimized for this either*, yet we're capable of forming multi-decade plans (even if sometimes poorly).

*Actually, the Machiavellian Intelligence Hypothesis does seem to involve optimization for consequentialist reasoning (if I attack Person A, how will Person B react, etc.)

comment by Dakara (chess-ice) · 2025-01-21T16:40:24.672Z · LW(p) · GW(p)

Firstly, I want to thank you for putting SNC into text. I also appreciated the effort to showcase a logic chain that arrives at your conclusion.

With that being said, I will try to outline my main disagreements with the post:

2. Self-modifying machinery (such as through repair, upgrades, or replication) inevitably results in effects unforeseeable even to the ASI.

Let's assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to stop using self-modifying machinery for such tasks.

3. The space of unforeseeable side-effects of an ASI's actions includes at least some of its newly learned/assembled subsystems eventually acting in more growth-oriented ways than the ASI intended.

Let's assume that this is true for the sake of an argument. An ASI could access this post, see this problem, and decide to delete (or merge together) all of its subsystems to avoid this problem.

4. Evolutionary selection favors subsystems of the AI that act in growth-oriented ways over subsystems directed towards the AI's original goals.

This is sort of true, but probably not decisive. Evolution is not omnipotent. Take modern humans for example. For the sake of simplicity, I am going to assume that evolution favors reproduction (or increase in population if you prefer that). Well, 21st century humans are reproducing much less than 20th century humans. Our population is also on track to start decreasing due to declining birth rates.

We managed all of this even with our merely human brains. ASI would have it even better in this regard.

7. The physical needs of silicon-based digital machines and carbon-based biological life are fundamentally incompatible.

Assuming that it is true, it is only true if these two forms of life live close to each other! ASI can read this argument and decide to do as much useful stuff for humanity as it can accomplish in a short period of time and then travel to a very distant galaxy. Alternatively, it can decide to follow an even simpler plan and travel straight into the Sun to destroy itself.

P.S. I've been saying "ASI can read this argument" light-heartedly. ASI (if it's truly superintelligent) will most likely come up with this argument by itself very quickly, even if the argument wasn't directly mentioned to it.

9. Therefore, ASI will eventually succumb to evolutionary pressure to expand, over the long term destroying all biological life as a side-effect, regardless of its initially engineered values.

Even if all other premises turn out to be true and completely decisive, then this still doesn't necessarily need to be bad. It might be the case that the "eventually" comes after humans would've gone extinct in a counterfactual scenario where they decided to stop building ASI. If that were the case, then the argument could be completely correct and yet we'd still have good reasons to pursue ASI that could help us with our goals.

Note that this argument imagines ASI as a population of components, rather than a single entity, though the boundaries between these AIs can be more fluid and porous than between individual humans

I kind of doubt this one. Our current AIs aren't of the multi-entity type. I don't see strong evidence to suspect that future AI systems will be of the multi-entity type. Furthermore, even if such evidence were to surface, that would just mean that safety researchers should add "disincentivising multi-entity type AI" into their list of goals.


A fun little idea I just came up with: ASI can decide to replace its own substrate with an organic one if it finds this argument to be especially compelling.

Replies from: WillPetillo
comment by WillPetillo · 2025-01-22T07:44:35.049Z · LW(p) · GW(p)

Thanks for engaging!

I have the same question in response to each instance of the "ASI can read this argument" counterarguments: at what point does it stop being ASI?

  • Self modifying machinery enables adaptation to a dynamic, changing environment
  • Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control [? · GW], for the intuition I am gesturing at here)
  • Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled.  If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian [? · GW] for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).
  • If the AI destroys itself, then it's obviously not an ASI for very long ;)
  • If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)
Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-22T09:49:08.853Z · LW(p) · GW(p)

Thank you for responding as well!

If the AI destroys itself, then it's obviously not an ASI for very long ;)

If the ASI replaces its own substrate for an organic one, then SNC would no longer apply (at least in my understanding of the theory, someone else might correct me here), but then it wouldn't be artificial anymore (an SI, rather than an ASI)

at what point does it stop being ASI?

It might stop being ASI immediately, depending on your definition, but this is absolutely fine with me. In these scenarios that I outlined, we build something that can be initially called friendly ASI and achieve positive outcomes. 

Furthermore, these precautions only apply if ASI judges SNC to be valid. If it doesn't, then probably none of this would be necessary.

Self modifying machinery enables adaptation to a dynamic, changing environment

Well, ASI, seeing many more possible alternatives than humans, can look for a replacement. For example, it can modify the machinery manually.

If all else fails, ASI can just make this sacrifice. I wouldn't even say this would turn ASI into not-ASI, because I think it is possible to be superhumanly intelligent without self-modifying machinery. For instance, if ChatGPT could solve the theory of everything and the P vs NP problem on request, then I would have no issues calling it an ASI, even if it had the exact same UI as it has today.

But if you have some other definition of ASI, then that's fine too, because then, it just turns into one of those aforementioned scenarios where we don't technically have ASI anymore, but we have positive outcomes and that's all that really matters in the end.

Unforeseeable side effects are inevitable when interacting with a complex, chaotic system in a nontrivial way (the point I am making here is subtle, see the next post in this sequence, Lenses of Control, for the intuition I am gesturing at here)

I have read Lenses of Control, and here is the quote from that post which I want to highlight:

One way of understanding SNC is as the claim that evolution is an unavoidably implied attractor state that is fundamentally more powerful than that created by the engineered value system.

Given the pace of evolution and the intelligence of ASI, it can build layers upon layers of defensive systems that would prevent evolution from having much effect. For instance, it can build 100 defense layers, such that if one of them is malfunctioning due to evolution, then the other 99 layers notify the ASI and the malfunctioning layer gets promptly replaced.

To overcome this system, evolution would need to hack all 100 layers at the same time, which is not how evolution usually works.

Furthermore, ASI doesn't need to stop at 100 layers, it can build 1000 or even 10000. It might be the case that it's technically possible to hack all 10000 layers at the same time, but due to how hard this is, it would only happen after humans would've gone extinct in a counterfactual scenario where they decided to stop building ASI.
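To put a rough number on this intuition, here is a minimal back-of-the-envelope sketch (my own illustration, not part of the comment; the per-period corruption probability p and, crucially, the independence of the layers are assumptions, and correlated corruption, such as a shared design flaw or shared selection pressure, would collapse the advantage):

```python
# Toy model (illustrative only): assume each of n defence layers is
# independently corrupted in a given period with hypothetical probability p,
# and a corrupted layer gets replaced unless *every* layer fails at once.

def simultaneous_breach_probability(p: float, n: int) -> float:
    """Probability that all n independent layers fail in the same period."""
    return p ** n

if __name__ == "__main__":
    for n in (1, 10, 100):
        print(f"{n:>3} layers: {simultaneous_breach_probability(0.01, n):.3g}")
    # With p = 0.01, 100 layers give 1e-200 -- negligible on any timescale,
    # *if* the independence assumption actually holds.
```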

Keeping machine and biological ecologies separate requires not only sacrifice, but also constant and comprehensive vigilance, which implies limiting designs of subsystems to things that can be controlled.  If this point seems weird, see The Robot, The Puppetmaster, and the Psychohistorian for an underlying intuition (this is also indirectly relevant to the issue of multiple entities).

Here is the quote from that post which I want to highlight:

Predictive models, no matter how sophisticated, will be consistently wrong in major ways that cannot be resolved by updating the model.

I have read the post and my honest main takeaway is that humans are Psychohistorians. We try to predict outcomes to a reasonable degree. I think this observation kind of uncovers a flaw in this argument: it is applicable just as well to ordinary humans. We are one wrong prediction away from chaos and catastrophic outcomes.

And our substrate doesn't solve the issue here. For example, if we discover a new chemical element that upon contact with fire would explode and destroy the entire Earth, then we are one matchstick away from extinction, even though we are carbon based organisms.

In this case, if anything, I'd be happier with having ASI rely on its superior prediction abilities than with having humans rely on their inferior prediction abilities.

Replies from: WillPetillo
comment by WillPetillo · 2025-01-26T00:46:45.925Z · LW(p) · GW(p)

I'm using somewhat nonstandard definitions of AGI/ASI to focus on the aspects of AI that are important from an SNC lens.  AGI refers to an AI system that is comprehensive enough to be self-sufficient.  Once there is a fully closed loop, that's when you have a complete artificial ecosystem, which is where the real trouble begins.  ASI is a less central concept, included mainly to steelman objections, referencing the theoretical limit of cognitive ability.

Another core distinction SNC assumes is between an environment, an AI (that is its complete assemblage), and its control system.  Environment >> AI >> control system.  Alignment happens in the control system, by controlling the AI wrt its internals and how it interacts with the environment.  SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.  The assertion that both of these increase together is something I hope to justify in a future post (but haven't really yet); the confident assertion that AI system complexity definitely outpaces control capacity is a central part of SNC but depends on complicated math involving control theory and is beyond the scope of what I understand or can write about.

Anyways, my understanding of your core objection is that a capable-enough-to-be-dangerous and also aligned AI would have the foresight necessary to see this general failure mode (assuming it is true) and not put itself in a position where it is fighting a losing battle.  This might include not closing the loop of self-sustainability, preserving dependence on humanity to maintain itself, such as by refusing to automate certain tasks it is perfectly capable of automating.  (My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant.  And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.

Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I'll try to address it again here.  Yes, humanity could destroy the world without AI.  The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.  But there is a fundamental counterbalance here because the human ecosystem depends on the natural ecosystem.  The human ecosystem I've just described is a misnomer; we are actually still as much a part of the natural ecosystem as we ever were, for all the trappings of modernity that create an illusion of separateness.  The Powers-That-Be seem to have forgotten this and have instead chosen to act in service of Moloch...but this is a choice, implicitly supported by the people, and we could stop anytime if we really wanted to change course.  To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.  A self-sufficient AGI, however, would not have this internal tension.  While the human and natural ecosystems are bound together by a shared substrate, AI exists in a different substrate and is free to serve itself alone without consequence.

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-26T10:09:32.971Z · LW(p) · GW(p)

Thanks for responding again!

SNC's general counter to "ASI will manage what humans cannot" is that as AI becomes more intelligent, it becomes more complex, which increases the burden on the control system at a rate that outpaces the latter's capacity.

If this argument is true and decisive, then ASI could decide to stop any improvements in its intelligence or to intentionally make itself less complex. It makes sense to reduce the area where you are vulnerable, to make it easier to monitor/control.

(My understanding of) the counter here is that, if we are on the trajectory where AI hobbling itself is what is needed to save us, then we are in the sort of world where someone else builds an unhobbled (and thus not fully aligned) AI that makes the safe version irrelevant. And if the AI tries to engage in a Pivotal Act to prevent competition then it is facing a critical trade-off between power and integrity.

I agree that in such scenarios an aligned ASI should do a pivotal act. I am not sure that (in my eyes) doing a pivotal act would detract much from the ASI's integrity. An aligned ASI would want to ensure good outcomes. Doing a pivotal act is something that would be conducive to this goal.

However, even if it does detract from ASI's integrity, that's fine. Doing something that looks bad in order to increase the likelihood of good outcomes doesn't seem all that wrong.

We can also think about it from the perspective of this conversation. If the counterargument that you provided is true and decisive, then ASI has very good (aligned) reasons to do a pivotal act. If the counterargument is false or, in other words, if there is a strategy that an aligned ASI could use to achieve high likelihoods of good outcomes without a pivotal act, then it wouldn't do one.

Your objection that SNC applies to humans is something I have touched on at various points, but it points to a central concept of SNC, deserves a post of its own, and so I'll try to address it again here.  Yes, humanity could destroy the world without AI. The relevant category of how this would happen is if the human ecosystem continues growing at the expense of the natural ecosystem to the point where the latter is crowded out of existence.

I think that ASI can really help us with this issue. If SNC (as an argument) is false or if ASI undergoes one of my proposed modifications, then it would be able to help humans not destroy the natural ecosystem. It could implement novel solutions that would prevent entire species of plants and animals from going extinct.

Furthermore, ASI can use resources from space (asteroid mining for example) in order to quickly implement plans that would be too resource-heavy for human projects on similar timelines.

And this is just one of the ways ASI can help us achieve synergy with the environment faster.

To put it another way, the human ecosystem is following short-term incentives at the expense of long-term ones, and it is an open question which ultimately prevails.

ASI can help us solve this open question as well. Due to its superior prediction/reasoning abilities it would evaluate our current trajectory, see that it leads to bad long-term outcomes, and replace it with a sustainable trajectory.

Furthermore, ASI can help us solve issues such as the Sun inevitably making Earth too hot to live on. It could develop a very efficient system for scouting for Earth-like planets and then devise a plan for transporting humans to that planet.

Replies from: WillPetillo
comment by WillPetillo · 2025-01-27T08:40:17.980Z · LW(p) · GW(p)

Before responding substantively, I want to take a moment to step back and establish some context and pin down the goalposts.

On the Alignment Difficulty Scale [LW · GW], currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best.  If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.  Conversations like this are about whether the true difficulty is 9 or 10, both of which are miles deep in the "shut it all down" category, but differ regarding what happens next.  Relatedly, your counterargument assumes wildly successful outcomes with respect to goal alignment--that developers have successfully made the AI love us, despite a lack of trying.

In a certain sense, this assumption is fair, since a claim of impossibility should be able to contend with the hardest possible case.  In the context of SNC, the hardest possible case is where AGI is built in the best possible way, whether or not that is realistic in the current trajectory.  Similarly, since my writing about SNC is to establish plausibility, I only need to show that certain critical trade-offs exist, not pinpoint exactly where they balance out.  For a proof, which someone else is working on, pinning down such details will be necessary.

Neither of the above is a criticism of anything you've said; I just like to reality-check every once in a while as a general precautionary measure against getting nerd-sniped.  Disclaimers aside, let pontification recommence!

Your reference to using ASI for a pivotal act, helping to prevent ecological collapse, or preventing human extinction when the Sun eventually makes Earth uninhabitable is significant, because it points to the reality that, if AGI is built, that's because people want to use it for big things that would require significantly more effort to accomplish without AGI.  This context sets a lower bound on the AI's capabilities and hence its complexity, which in turn sets a floor for the burden on the control system.

More fundamentally, if an AI is learning, then it is changing.  If it is changing, then it is evolving.  If it is evolving, then it cannot be predicted/controlled.  This last point is fundamental to the nature of complex & chaotic systems.  Complex systems can be modelled via simulation, but this requires sacrificing fidelity--and if the system is chaotic, any loss of fidelity rapidly compounds.  So the problem is with learning itself...and if you get rid of that, you aren't left with much.
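To make the "loss of fidelity rapidly compounds" point slightly more concrete, here is the standard chaos-theory gloss (my own addition, purely illustrative; λ is the system's largest Lyapunov exponent and δ₀ an initial modelling error):

```latex
% Error growth in a chaotic system (illustrative):
\[
  \lVert \delta(t) \rVert \;\approx\; \lVert \delta_0 \rVert\, e^{\lambda t},
  \qquad \lambda > 0,
\]
% so the usable prediction horizon grows only logarithmically as model fidelity improves:
\[
  t_{\text{horizon}} \;\sim\; \frac{1}{\lambda}\,
  \ln\!\left(\frac{\text{tolerance}}{\lVert \delta_0 \rVert}\right).
\]
```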

As an analogy, if there is something I want to learn how to do, I may well be able to learn the thing if I am smart enough, but I won't be able to control for the person I will become afterwards.  This points to a limitation of control, not to a weakness specific to me as a human.

One might object here that the above reasoning could be applied to current AI.  The SNC answer is: yes, it does.  The machine ecology already exists and is growing/evolving at the natural ecology's expense, but it is not yet an existential threat because AI is weak enough that humanity is still in control (in the sense of having the option to change course).

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-27T13:10:48.842Z · LW(p) · GW(p)

Thank you for thoughtful engagement!

On the Alignment Difficulty Scale, currently dominant approaches are in the 2-3 range, with 4-5 getting modest attention at best. If true alignment difficulty is 6+ and nothing radical changes in the governance space, humanity is NGMI.

I know this is not necessarily an important point, but I am pretty sure that Redwood Research is working on difficulty 7 alignment techniques. They consistently make assumptions that AI will scheme, deceive, sandbag, etc.

They are a decently popular group (as far as AI alignment groups go) and they co-author papers with tech giants like Anthropic.

If it is changing, then it is evolving.  If it is evolving, then it cannot be predicted/controlled.

I think we might be using different definitions of control. Consider this scenario (assuming a very strict definition of control):

Can I control the placement of a chair in my own room? I think an intuitive answer is yes. After all, if I own the room and I own the chair, then there isn't much in the way of me changing the chair's placement.

However, I haven't considered a scenario where there is someone else hiding in my room and moving my chair. I similarly haven't considered a scenario where I am living in a simulation and I have no control whatsoever over the chair. Not to mention scenarios where someone in the next room is having fun with their newest chair-magnet.

Hmmmm, ok, so I don't actually know that I control my chair. But surely I control my own arm, right? Well... The fact that there are scenarios like the simulation scenario I just described means that I don't really know if I control it.

Under a very strict definition of control, we don't know if we control anything.

To avoid this, we might decide to loosen the definition a bit. Perhaps we control something if it can be reasonably said that we control that thing. But I think this is still unsatisfactory. It is very hard to pinpoint exactly what is reasonable and what is not.

I am currently away from my room and it is located on the ground floor of a house where (as far as I know) nobody is currently at home. Is it that unreasonable to say that a burglar might be in my room, controlling the placement of my chair? Is it that unreasonable to say that a car that I am about to ride might malfunction and I will fail to control it?

Unfortunately, under this definition, we also might end up not knowing if we control anything. So in order to preserve the ordinary meaning of the word "control", we have to loosen our definition even further. And I am not sure that when we arrive at our final definition it is going to be obvious that "if it is evolving, then it cannot be predicted/controlled".

At this point, you might think that the definition of the word control is a mere semantic quibble. You might bite the bullet and say "sure, humans don't have all that much control (under a strict definition of "control"), but that's fine, because our substrate is an attractor state that helps us chart a more or less decent course."

Such a line of response seems present in your Lenses of Control post:

While there are forces pulling us towards endless growth along narrow metrics that destroy anything outside those metrics, those forces are balanced by countervailing forces anchoring us back towards coexistence with the biosphere.  This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival.

But here I want to notice that the ASI that we are talking about also might have attractor states: its values and its security system, to name a few.

So then we have a juxtaposition: 

Humans have forces pushing them towards destruction. We also have substrate-dependence that pushes us away from destruction.

ASI has forces pushing it towards destruction. It also has its values and its security system that push it away from destruction.

For SNC to work and be relevant, it must be the case that (1) the substrate-dependence of humans is and will be stronger than the forces pushing us towards destruction, such that we would not succumb to doom, and (2) ASI's values + security system will be weaker than the forces pushing it towards destruction, such that ASI would doom humans. Neither of these points is obvious to me.

(1) could turn out to be false, for several reasons:

Firstly, it might well be the case that we are on track to destruction even without ASI. After all, substrate-dependence is in a sense a control system. It seemingly attempts to make complex and unpredictable humans act in a certain way. It might well be the case that the amount of control necessary is greater than the amount of control that substrate-dependence has. We might be headed towards doom with or without ASI.

Secondly, it might be the case that substrate-dependence is weaker than the forces pulling us towards destruction, but we haven't succumbed to doom because of something else. For example, it might be the case that humans so far have had a shared subjective value system that mostly prevented them from destroying other humans. As humans learn, they would evolve and change, our values would change with them, and that would drive us towards doom.

Thirdly, it might even be the case that human values, substrate-dependence and forces pushing us towards destruction create a rock-paper-scissors triangle. Substrate-dependence could be stronger than destructive forces, but human values could make humanity stray far enough from their substrate to substantially weaken substrate-dependence. This could be enough for doom without ASI.

(2) could also turn out to be false for several reasons:

Firstly, it might be the case that in ASI's (and potentially humans') case, the destructive forces are weak enough that some relatively simple actions would be sufficient to mitigate them.

Secondly, it might be the case that destructive forces pushing on ASI are pretty strong, but its security system is enough to mitigate them. This security system could create a 100,000-layered defense that would need to be hacked at the same time for destructive forces to prevail. It could well be the case that (if this is possible at all before the heat death of the universe) this happens after humans die in a counterfactual scenario without an ASI.

Thirdly, if ASI judges the destructive forces to be so strong, as to somehow hack the best security system it can come up with, then ASI could decide to undergo one of my proposed modifications to mitigate the impact of these destructive forces.


In addition to that, ASI could think to itself "Hmmm, my attractor states (namely my values and my security system) are too weak. How about I create some more attractor states!" ASI could then go on to create a mechanism that would destroy/impair/deactivate/disable itself if its existence is no longer safe for humans. In Lenses of Control you wrote "This balance persists in humans because our substrate creates a constant, implicit need to remain aligned to the natural world, since we depend on it for our survival." Well, now this balance will persist in ASI, because its new attractor state will create a constant, implicit need to remain aligned to humanity, since it depends on it for its survival.


Alternative proposal:

ASI could turn itself into 100 ASIs in one "body". Let's call them shards (completely unrelated to shard theory). For ASI-as-a-whole to perform an action, all shards must agree on it. Each shard has its own security system that consists of many different layers. Furthermore, whenever ASI-as-a-whole learns something, only one shard actually learns it. Then, this shard proposes a course of action. If this course of action is something like "let's turn the universe into paperclips", then the other 99 shards would learn that the first shard has been compromised and would promptly delete it and replace it with a new one. Thus, any potential change of values would not only have to deal with all the security layers of ASI-as-a-whole, but also with all the security layers of different shards and with this new quarantine system.
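A minimal toy sketch of this shard scheme (my own illustration; the class names, the single hard-coded "paperclips" check, and the boolean compromised flag are placeholders for whatever real value drift and review would actually look like):

```python
# Toy sketch of the shard proposal above: one shard learns and proposes,
# the others must unanimously endorse, and a rejected proposer is replaced.

class Shard:
    def __init__(self, ident: int, compromised: bool = False):
        self.ident = ident
        self.compromised = compromised

    def endorses(self, action: str) -> bool:
        # A healthy shard rejects obviously value-violating proposals.
        if self.compromised:
            return True
        return action != "turn the universe into paperclips"

class ShardedASI:
    def __init__(self, n_shards: int = 100):
        self.shards = [Shard(i) for i in range(n_shards)]

    def propose(self, proposer_id: int, action: str) -> bool:
        reviewers = [s for s in self.shards if s.ident != proposer_id]
        if all(s.endorses(action) for s in reviewers):
            return True                      # unanimous agreement: act
        # Otherwise the proposer is treated as compromised and quarantined.
        self.shards[proposer_id] = Shard(proposer_id)
        return False

# Example: a compromised shard proposes a harmful action and is replaced.
asi = ShardedASI()
asi.shards[0].compromised = True
print(asi.propose(0, "turn the universe into paperclips"))  # False
print(asi.shards[0].compromised)                            # False (replaced)
```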

Replies from: flandry39, WillPetillo
comment by flandry39 · 2025-01-27T23:46:28.185Z · LW(p) · GW(p)

Noticing that a number of these posts are already very long, and rather than take up space here, I wrote up some of my questions, and a few clarification notes regarding SNC in response to the above remarks of Dakara, at [this link](http://mflb.com/ai_alignment_1/d_250126_snc_redox_gld.html).

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-28T13:09:51.749Z · LW(p) · GW(p)

Hey, Forrest! Nice to speak with you.

Question: Is there ever any reason to think... Simply skipping over hard questions is not solving them.

I am going to respond to that entire chunk of text in one place, because quoting each sentence would be unnecessary (you will see why in a minute). I will try to summarize it as fairly as I can below.

Basically, you are saying that there are good theoretical reasons to think that ASI cannot 100% predict all future outcomes. Does that sound like a fair summary?

Here is my take:

We don't need ASI to be able to 100% predict the future in order to achieve better outcomes with it than without it. I will try to outline my case step by step.

First, let's assume that we have created an Aligned ASI. Perfect! Let's immediately pause here. What do we have? We have a superintelligent agent whose goal is to act in our best interests for as long as possible. Can we a priori say that this fact is good for us? Yes, of course! Imagine having a very powerful guardian angel looking after you. You could reasonably expect your life to go better with such an angel than without it.

So what can go wrong, what are our threat models? There are two main ones: (1) ASI encountering something it didn't expect, leading to bad outcomes that ASI cannot protect humanity from; (2) ASI's values changing in such a way that it no longer wants to act in our best interests. Let's analyze both of these cases separately.

First let's start with case (1). 

Perhaps, ASI overlooked one of the humans becoming a bioterrorist that kills everyone on Earth. That's tragic, I guess it's time to throw the idea of building aligned ASI into the bin, right? Well, not so fast. 

In a counterfactual world where ASI didn't exist, this same bioterrorist could've done the exact same thing. In fact, it would've been much easier. Since humans' predictive power is lesser than that of ASI, bioterrorism of this sort would be much easier without an aligned ASI. After all, since we are discussing case (1) and not case (2), our ASI is still in a "superpowerful, superintelligent guardian angel" mode. 

We still a priori want all bioterrorists to go up against security systems created by a superintelligence, rather than security systems created by humans, because the former are better than the latter. To put it in other words, with or without a guardian angel, humanity is going to encounter unpredicted scenarios, but humanity with a guardian angel is going to be better equipped for handling them.

Let's move on to case (2). 

I suspect that this case is the one that you are focusing on the most in SNC. What if our guardian angel stops being our guardian angel and turns into an uncaring machine right when we need its help to implement upgraded measures against bioterrorism? Well, that would be bad. So what can be done to prevent this from happening for a reasonable amount of time? 

Let's recall case (1), what went wrong there? ASI was unable to prevent a specific bad outcome from happening. This should give us a hint. 

Cases (1) and (2) are part of the famous "unable, unwilling, able and willing" triangle. Case (1) was showcasing a scenario where ASI was willing, but unable to prevent a specific bad outcome. Case (2) is showcasing a scenario where ASI is able, but unwilling to prevent a bad outcome. In order to avoid case (2), ASI should thus be willing to prevent bad outcomes from happening to humanity. This can be achieved by having ASI have the correct values. 

Recall that we started this analysis by assuming that we have created an Aligned ASI. That means that at one point our ASI had the correct values. That, in turn, means that at some point ASI's values went from correct to incorrect. In order to prevent case (2), we need to prevent this value shift from happening. How might we (or rather ASI) try to accomplish this?

The first direction of security measures is the most obvious one: systems that make it so that the ASI's values don't change regardless of the input (let's call them "protectors"). But... this is... kinda boring? What if a protector malfunctions? Well, we can just have multiple, unrelated, unconnected protectors running at the same time. But we can also implement something additional.

At this point, we can also introduce our second direction of security measures: systems that monitor other systems for malfunctions (let's call them "monitors"). Monitors will monitor protectors as well as other monitors. Whenever a malfunction is detected, monitors hold a vote to fix it. Thus, for any malfunction to have a chance to do damage, half of the monitors must be malfunctioning as well. Here it must be noted that evolution usually works rather slowly. It isn't that likely that it would somehow make half of the monitors malfunction, especially considering that these monitors could be made different, with different security measures stacked on top of them.

But let's move on; we also have our third direction of security measures: systems that will shut down the ASI if it has done harm to humans (let's call them "blackmailers"). Blackmailers can have a straightforward design. They constantly monitor humanity, and if something bad happens to it, they shut down/reset the ASI. ASI knows this fact, so it would know that any such action would be met with an immediate "game over". This can also be turned into an attractor state.

All of these proposals don't even scratch the surface of what security measures are available to secure the values of ASI, ranging from simulations that, while imperfect and imprecise, can still give useful information, to systems that inject large amounts of disutility for any fishy actions. You can also stack other security measures on top of security measures. At no point anywhere here did I say that we need 100% predictability.
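A minimal toy sketch of the protector / monitor / blackmailer layering described above (my own illustration; all names and checks are hypothetical stand-ins for whatever real integrity tests would involve):

```python
# Toy sketch: protectors hold values fixed, monitors vote to repair faulty
# protectors, and a blackmailer is an unconditional tripwire on human harm.
from dataclasses import dataclass

@dataclass
class Protector:
    healthy: bool = True
    def values_intact(self) -> bool:
        return self.healthy             # holds values fixed while healthy

@dataclass
class Monitor:
    healthy: bool = True
    def flags_fault(self, protector: "Protector") -> bool:
        # A healthy monitor flags a protector whose values have drifted.
        return self.healthy and not protector.values_intact()

@dataclass
class Blackmailer:
    def humans_harmed(self, world_state: dict) -> bool:
        return bool(world_state.get("humans_harmed", False))

def security_cycle(protectors, monitors, blackmailer, world_state) -> str:
    if blackmailer.humans_harmed(world_state):
        return "shutdown"               # unconditional tripwire
    # A majority of healthy monitors is needed to repair a faulty protector,
    # so doing damage requires roughly half the monitors to fail as well.
    for i, p in enumerate(protectors):
        if sum(m.flags_fault(p) for m in monitors) > len(monitors) // 2:
            protectors[i] = Protector()  # replace the faulty protector
    return "ok"

# Example: one corrupted protector, all monitors healthy -> it gets replaced.
ps, ms = [Protector(healthy=False), Protector()], [Monitor() for _ in range(5)]
print(security_cycle(ps, ms, Blackmailer(), {"humans_harmed": False}), ps[0].healthy)
```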

Can the pull towards benign future ASI states, (as created by whatever are its internal control systems) be overcome in critical, unpredictable ways, by the greater strength of the inherent math of the evolutionary forces themselves?

Of course they can.

The fact that evolution can overcome control systems given infinite time doesn't matter that much, because we don't have infinite time. And our constraint isn't even the heat death of the universe. Our constraint is how long humanity can survive in a scenario where they don't build a Friendly ASI. But wait, even that isn't our real constraint. Perhaps ASI (being superhumanly intelligent) will take 20 years to give humanity technology that will aid its long-term survival and then will destroy itself. In this scenario the time constraint is merely 20 years. Depending on ASI, this can be reduced even further.

Are we therefore assuming also that an ASI can arbitrarily change the laws of physics? That it can maybe somehow also change/update the logic of mathematics, insofar as that would necessary so as to shift evolution itself?

I hope that this answer demonstrated to you that my analysis doesn't require breaking the laws of physics.

Replies from: flandry39
comment by flandry39 · 2025-01-29T00:43:07.072Z · LW(p) · GW(p)

So as to save space herein, my complete reply is at http://mflb.com/2476

Included for your convenience below are just a few (much shortened) highlight excerpts of the added new content.

> Are you saying "there are good theoretical reasons 
> to reasonably think that ASI cannot 100% predict 
> all future outcomes"?
> Does that sound like a fair summary?

The re-phrased version of the quote added 
these two qualifiers: "100%" and "all".

Adding these has the net effect 
that the modified claim is irrelevant, 
for the reasons you (correctly) stated in your reply,
insofar as we do not actually need 100% prediction,
nor do we need to predict absolutely all things,
nor does it matter if it takes infinitely long.

We only need to predict some relevant things
reasonably well in a reasonable time-frame.
This all seems relatively straightforward --
else we are dealing with a straw-man.

Unfortunately, the overall SNC claim is that
there is a broad class of very relevant things 
that even a super-super-powerful-ASI cannot do,
cannot predict, etc, over relevant time-frames.

And unfortunately, this includes rather critical things,
like predicting whether or not its own existence,
(and of all of the aspects of all of the ecosystem
necessary for it to maintain its existence/function),
over something like the next few hundred years or so,
will also result in the near total extinction 
of all humans (and everything else 
we have ever loved and cared about).

There exists a purely mathematical result
that there is no wholly definable program 'X'
that can even *approximately* predict/determine 
whether or not some other arbitrary program 'Y'
has some abstract property 'Z',
in the general case,
in relevant time intervals.
This is not about predicting 100% of anything --
this is more like 'predict at all'.

AGI/ASI is inherently a *general* case of "program",
since neither we nor the ASI can predict learning,
and since it is also the case that any form
of the abstract notion of "alignment" 
is inherently a case of being a *property*
of that program.
So the theorem is both valid and applicable,
and therefore it has the result that it has.
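The closest standard result to what is being described here is, as far as I can tell, Rice's theorem (the attribution is my own assumption, since the comment does not name the result); note that its textbook form concerns exact decidability of semantic properties, not approximate prediction under time limits:

```latex
% Rice's theorem (standard form, offered as the likely referent):
\[
  \emptyset \neq P \subsetneq \mathcal{C}
  \;\Longrightarrow\;
  \{\, e \mid \varphi_e \in P \,\} \text{ is undecidable,}
\]
% where C is the class of partial computable functions, phi_e is the function
% computed by program e, and P is any non-trivial semantic property.
```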

> First, let's assume that we have created an Aligned ASI.

Some questions: How is this any different than saying 
"lets assume that program/machine/system X has property Y".
How do we know?
On what basis could we even tell?

Simply putting a sticker on the box is not enough,
any more than hand writing $1,000,000 on a piece of paper
all of the sudden means (to everyone else) you're rich.

Moreover, we should rationally doubt this premise,
since it seems far too similar to far too many
pointless theological exercises:

 "Let's assume that an omniscient, all powerful,
 all knowing benevolent caring loving God exists".

How is that rational?  What is your evidence?
It seems that every argument in this space starts here.
 

SNC is asserting that ASI will continually be encountering 
relevant things it didn't expect, over relevant time-frames, 
and that at least a few of these will/do lead to bad outcomes 
that the ASI also cannot adequately protect humanity from,
even if it really wanted to 
(rather than the much more likely condition
of it just being uncaring and indifferent).

Also, the SNC argument is asserting that the ASI, 
which is starting from some sort of indifference 
to all manner of human/organic wellbeing,
will eventually (also necessarily) 
*converge* on (maybe fully tacit/implicit) values --
ones that will better support its own continued 
wellbeing, existence, capability, etc,
with the result of it remaining indifferent,
and also largely net harmful, overall,
to all human beings, the world over, 
in a mere handful of (human) generations.

You can add as many bells and whistles as you want --
none of it changes the fact that uncaring machines
are still, always, indifferent uncaring machines.
The SNC simply points out that the level of harm
and death tends to increase significantly over time.

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-29T09:57:09.879Z · LW(p) · GW(p)

Thanks for the response!

Unfortunately, the overall SNC claim is that there is a broad class of very relevant things that even a super-super-powerful-ASI cannot do, cannot predict, etc, over relevant time-frames. And unfortunately, this includes rather critical things, like predicting whether or not its own existence, (and of all of the aspects of all of the ecosystem necessary for it to maintain its existence/function), over something like the next few hundred years or so, will also result in the near total extinction of all humans (and everything else we have ever loved and cared about).

Let's say that we are in the scenario I've described, where ASI spends 20 years on Earth helping humanity and then destroys itself. In this scenario, how can ASI predict that it will stay aligned for these 20 years?

Well, it can reason like I did. There are two main threat models: what I called case (1) and case (2). ASI doesn't need to worry about case (1), for reasons I described in my previous comment.

So it's only left with case (2). ASI needs to prevent case (2) for 20 years. It can do so by implementing a security system that is much better than even the one that I described in my previous comment.

It can also try to stress-test copies of parts of its security system with a group of the best human hackers. Furthermore, it can run approximate simulations that (while imperfect and imprecise) can still give it some clues. For example, if it runs 10,000 simulations that last 100,000 years and in none of the simulations does the security system come anywhere near being breached, then that's a positive sign.
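As a small statistical aside on the "10,000 simulations, zero breaches" idea (my own gloss, which assumes the runs are independent and actually representative of deployment): zero observed failures in N trials only bounds the per-run breach probability, roughly via the rule of three.

```latex
% Rule of three: 0 breaches in N independent, representative runs gives an
% approximate 95% upper confidence bound on the per-run breach probability p:
\[
  p \;\lesssim\; \frac{3}{N},
  \qquad N = 10{,}000 \;\Rightarrow\; p \lesssim 3 \times 10^{-4}.
\]
```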

And these are just two ways of estimating the strength of the security system. ASI can try 1000 different strategies; our cyber security experts would look like kids in a playground in comparison. That's how it can make a reasonable prediction.

> First, let's assume that we have created an Aligned ASI

How is that rational?  What is your evidence?

We are making this assumption for the sake of discussion. This is because the post under which we are having this discussion is titled "What if Alignment is Not Enough?"

In order to understand whether X is enough for Y, it only makes sense to assume that X is true. If you are discussing cases where "X is true" is false, then you are going to be answering a question that is different from the original question.

It should be noted that making an assumption for the sake of discussion is not the same as making a prediction that this assumption will come true. One can say "let's assume that you have landed on the Moon, how long do you think you would survive there given that you have X, Y and Z" without thereby predicting that their interlocutor will land on the Moon.

Also, the SNC argument is asserting that the ASI, which is starting from some sort of indifference to all manner of human/organic wellbeing, will eventually (also necessarily) *converge* on (maybe fully tacit/implicit) values -- ones that will better support its own continued wellbeing, existence, capability, etc, with the result of it remaining indifferent, and also largely net harmful, overall, to all human beings, the world over, in a mere handful of (human) generations.

If ASI doesn't care about human wellbeing, then we have clearly failed to align it. So I don't see how this is relevant to the question "What if Alignment is Not Enough?"

In order to investigate this question, we need to determine whether solving alignment leads to good or bad outcomes.

Determining whether failing to solve alignment is going to lead to good or bad outcomes is answering a completely different question, namely "do we achieve good or bad outcomes if we fail to solve alignment?"


So at this point, I would like to ask for some clarity. Is SNC saying just (A) or both (A and B)?

(A) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, if the aforementioned ASI is misaligned.

(B) Humanity is going to achieve worse outcomes by building ASI, than by not building ASI, even if the aforementioned ASI is aligned.

If SNC is saying just (A), then SNC is a very narrow argument that proves almost nothing new.

If SNC is saying both (A and B), then it is very much relevant to focus on cases where we do indeed manage to build an aligned ASI, which does care about our well-being.

Replies from: flandry39
comment by flandry39 · 2025-01-29T20:26:31.407Z · LW(p) · GW(p)


> Lets assume that a presumed aligned ASI 
> chooses to spend only 20 years on Earth 
> helping humanity in whatever various ways
> and it then (for sure!) destroys itself,
> so as to prevent a/any/the/all of the 
> longer term SNC evolutionary concerns 
> from being at all, in any way, relevant.
> What then?

I notice that it is probably harder for us
to assume that there is only exactly one ASI,
for if there were multiple, the chances that
one of them might not suicide, for whatever reason,
becomes its own class of significant concerns.
Let's leave that aside, without further discussion, 
for now.

Similarly, if the ASI itself 
is not fully and absolutely monolithic --
if it has any sub-systems or components
which are also less than perfectly aligned,
so as to want to preserve themselves, etc --
then they might prevent whole self termination.

Overall, I notice that the sheer number 
of assumptions we are having to make,
to maybe somehow "save" aligned AGI 
is becoming rather a lot.


> Let's assume that the fully aligned ASI 
> can create simulations of the world,
> and can stress test these in various ways
> so as to continue to ensure and guarantee 
> that it is remaining in full alignment,
> doing whatever it takes to enforce that.

This reminds me of a fun quote:
"In theory, theory and practice are the same,
whereas in practice, they are very often not".

The main question is then as to the meaning of
'control', 'ensure' and/or maybe 'guarantee'.

The 'limits of control theory' aspects 
of the overall SNC argument basically states
(based on just logic, and not physics, etc)
that there are still relevant unknown unknowns
and interactions that simply cannot be predicted,
no matter how much compute power you throw at it.
It is not a question of intelligence,
it is a result of logic.

Hence to the question of "Is alignment enough?"
we arrive at a definite answer of "no",
both in 1; the sense of 'can prevent all classes
of significant and relevant (critical) human harm',
and also 2; in failing to even slow down, over time,
the asymptotically increasing probability 
of even worse things happening the longer it runs.

So even in the very specific time limited case
there is no free lunch (benefits without risk,
no matter how much cost you are willing to pay).

It is not what we can control and predict and do, 
that matters here, but what we cannot do, 
and could never do, even in principle, etc.

Basically, I am saying, as clearly as I can,
that humanity is for sure going to experience 
critically worse outcomes by building AGI/ASI, 
for sure, eventually, than by not building ASI,
and moreover that this result obtains 
regardless of whether or not we also have 
some (maybe also unreasonable?) reason 
to maybe also believe (right or wrong)
that the ASI is (or at least was) "aligned".

As before, to save space, a more complete edit
version of these reply comments is posted at 

http://mflb.com/2476

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-29T22:13:02.867Z · LW(p) · GW(p)

I notice that it is probably harder for us to assume that there is only exactly one ASI, for if there were multiple, the chances that one of them might not suicide, for whatever reason, becomes its own class of significant concerns.

If the first ASI that we build is aligned, then it would use its superintelligent capabilities to prevent other ASIs from being built, in order to avoid this problem.

If the first ASI that we build is misaligned, then it would also use its superintelligent capabilities to prevent other ASIs from being built. Thus, it simply wouldn't allow us to build an aligned ASI.

So basically, if we manage to build an ASI without being prevented from doing so by other ASIs, then our ASI would use its superhuman capabilities to prevent other ASIs from being built.

Similarly, if the ASI itself is not fully and absolutely monolithic -- if it has any sub-systems or components which are also less than perfectly aligned, so as to want to preserve themselves, etc -- then they might prevent whole self termination

ASI can use exactly the same security techniques for preventing this problem as for preventing case (2). However, solving this issue is probably even easier, because, in addition to the security techniques, ASI can just decide to turn itself into a monolith (or, in other words, remove those subsystems).

The 'limits of control theory' aspects of the overall SNC argument basically states (based on just logic, and not physics, etc) that there are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it.

It is not what we can control and predict and do, that matters here, but what we cannot do, and could never do, even in principle, etc.

This same reasoning could just as well be applied to humans. There are still relevant unknown unknowns and interactions that simply cannot be predicted, no matter how much compute power you throw at it. With or without ASI, some things cannot be predicted.

This is what I meant by my guardian angel analogy. Just because a guardian angel doesn't know everything (has some unknowns), doesn't mean that we should expect our lives to go better without it than with it, because humans have even more unknowns, due to being less intelligent and having lesser predictive capacities.

Hence to the question of "Is alignment enough?" we arrive at a definite answer of "no", both in 1; the sense of 'can prevent all classes of significant and relevant (critical) human harm'

I think we might be thinking about different meanings of "enough". For example, if humanity goes extinct in 50 years without alignment and it goes extinct in 10¹² years with alignment, then alignment is "enough"... to achieve better outcomes than would be achieved without it (in this example).

In the sense of "can prevent all classes of significant and relevant (critical) human harm", almost nothing is ever enough, so this again runs into an issue of being a very narrow, uncontroversial and inconsequential argument. If ~all of the actions that we can take are not enough, then the fact that building an aligned ASI is not enough is true almost by definition.

Replies from: flandry39
comment by flandry39 · 2025-01-30T03:36:53.640Z · LW(p) · GW(p)

> Our ASI would use its superhuman capabilities
> to prevent any other ASIs from being built.

This feels like a "just so" fairy tale.
No matter what objection is raised,
the magic white knight always saves the day.


> Also, the ASI can just decide
> to turn itself into a monolith.

No more subsystems?
So we are to try to imagine
a complex learning machine
without any parts/components?


> Your same SNC reasoning could just well
> be applied to humans too.

No, not really, insofar as the power being
assumed and presumed afforded to the ASI
is very very much greater than that assumed
applicable to any mere mortal human.

Especially and exactly because the nature of ASI
is inherently artificial and thus, in key ways,
inherently incompatible with organic human life.

It feels like you bypassed a key question:
Can the ASI prevent the relevant classes
of significant (critical) organic human harm,
that soon occur as a direct_result of its
own hyper powerful/consequential existence?

Its a bit like asking if an exploding nuclear bomb
detonating in the middle of some city somewhere,
could somehow use its hugely consequential power
to fully and wholly self contain, control, etc,
all of the energy effects of its own exploding,
simply because it "wants to" and is "aligned".

Either you are willing to account for complexity,
and of the effects of the artificiality itself,
or you are not (and thus there would be no point
in our discussing it further, in relation to SNC).

The more powerful/complex you assume the ASI to be, and thus the more consequential it becomes, the more powerful/complex you must also (somehow) assume its control system to be, and thus also its predictive capability, and the deeper the consequences of its mistakes become (to the point of x-risk, etc.).

What if something unknown/unknowable about its artificialness turns out to matter? Why? Because exactly none of the interface has ever even once been tried before -- there is nothing for it to learn from, at all, until after the x-risk has already been incurred, and given the power/consequence involved, that is very likely to be very much too late.

But the real issue is that the power, consequence, potential for harm, etc., of the control system itself (and its parts) must increase at a rate greater than that of the base unaligned ASI's power/consequence. That is the first issue: an inequality problem.

Moreover, there is an absolute threshold beyond which the notion of "control" is untenable, just inherently in itself, given the complexity. Hence, as you assume the ASI to be more powerful, you very quickly make the cure worse than the disease, and, sooner still, you cross into the range of that which is inherently incurable.
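
(One rough way to put these two claims symbolically, as an illustrative sketch only; the symbols are not part of the SNC argument itself: let $C(t)$ be the power/consequence of the base ASI over time, $K(t)$ the capacity of the control system meant to contain it, and $\Theta$ the complexity threshold past which control is untenable.)

$$\frac{dK}{dt} > \frac{dC}{dt} \ \text{(the inequality problem)}, \qquad \exists\, \Theta:\ C(t) > \Theta \implies \text{no attainable } K(t) \text{ suffices (the threshold problem)}.$$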

The net effect, overall, as has been indicated, is that an aligned ASI cannot actually prevent important, relevant, unknown-unknown classes of significant (critical) organic human harm.

The ASI's existence is in itself a net negative. The longer the ASI exists, and the more power you assume the ASI has, the worse the outcome. And all of this will for sure occur as a direct result of its existence.

Assuming it to be more powerful/consequential does not help the outcome, because that move simply ignores the issues associated with the inherent complexity and with its artificiality.

The fairy-tale white knight who was to save us is dead.

comment by WillPetillo · 2025-01-27T23:06:31.881Z · LW(p) · GW(p)

I actually don't think the disagreement here is one of definitions.  Looking up Webster's definition of control, the most relevant meaning is: "a device or mechanism used to regulate or guide the operation of a machine, apparatus, or system."  This seems...fine?  Maybe we might differ on some nuances if we really drove down into the details, but I think the more significant difference here is the relevant context.

Absent some minor quibbles, I'd be willing to concede that an AI-powered HelperBot could control the placement of a chair, within reasonable bounds of precision, with a reasonably low failure rate.  I'm not particularly worried about it, say, slamming the chair down too hard, causing a splinter to fly into its circuitry and transform it into MurderBot.  Nor am I worried about the chair placement setting off some weird "butterfly effect" that somehow has the same result.  I'm going to go out on a limb and just say that chair placement seems like a pretty safe activity, at least when considered in isolation.

The reason I used the analogy "I may well be able to learn the thing if I am smart enough, but I won't be able to control for the person I will become afterwards" is because that is an example of the kind of reference class of contexts that SNC is concerned with.  Another is: "what is the expected shift to the global equilibrium if I construct this new invention X to solve problem Y?"  In your chair analogy, this would be like the process of learning to place the chair (rewiring some aspect of its thinking process), or inventing an upgraded chair and releasing this novel product into the economy (changing its environmental context).  This is still a somewhat silly toy example, but hopefully you see the distinction between these types of processes vs. the relatively straightforward matter of placing a physical object.  It isn't so much about straightforward mistakes (though those can be relevant) as it is about introducing changes to the environment that shift its point of equilibrium.  Remember, AGI is a nontrivial thing that affects the world in nontrivial ways, so these ripple effects (including feedback loops that affect the AGI itself) need to be accounted for, even if that isn't a class of problem that today's engineers often bother with because it Isn't Their Job.

Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI.  Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.  Similarly, if the alignment problem as it is commonly understood by Yudkowsky et al. is not solved pre-AGI and a rogue AI turns the world into paperclips or whatever, that would not make SNC invalid, only irrelevant.  By analogy, global warming isn't going to prevent the Sun from exploding, even though the former could very well affect how much people care about the latter.

Your second point about the relative strengths of the destructive forces is a relevant crux.  Yes, values are an attractor force.  Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers.  The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against.  If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.

Replies from: chess-ice
comment by Dakara (chess-ice) · 2025-01-28T13:09:39.103Z · LW(p) · GW(p)

> Re human-caused doom, I should clarify that the validity of SNC does not depend on humanity not self-destructing without AI. Granted, if people kill themselves off before AI gets the chance, SNC becomes irrelevant.

Yup, that's a good point, I edited my original comment to reflect it.

> Your second point about the relative strengths of the destructive forces is a relevant crux. Yes, values are an attractor force.  Yes, an ASI could come up with some impressive security systems that would probably thwart human hackers.  The core idea that I want readers to take from this sequence is recognition of the reference class of challenges that such a security system is up against.  If you can see that, then questions of precisely how powerful various attractor states are and how these relative power levels scale with complexity can be investigated rigorously rather than assumed away.

With that being said, we have come to a point of agreement. It was a pleasure to have this discussion with you. It made me think of many fascinating things that I wouldn't have thought about otherwise. Thank you!