Smart, creative researchers of every generation came up with idealized problems: problems that, if solved, would transform science, if not humanity. They plowed away at these problems for decades, if not millennia – until some bright outsider proved, by deriving a contradiction between the problem's parts, that the problem was unsolvable.
Our community is smart and creative – but we cannot just rely on our resolve to align AI. We should never forsake our epistemic rationality, no matter how much something seems the instrumentally rational thing to do.
Nor can we take comfort in the claim by a founder of this field that they still know it to be possible to control AGI to stay safe.
Thirty years into running a program to secure the foundations of mathematics, David Hilbert declared “We must know. We will know!” By then, Kurt Gödel had already constructed his first incompleteness theorem. Hilbert's declaration was nevertheless engraved on his gravestone.
Short of securing the foundations of safe AGI control – that is, through empirically-sound formal reasoning – we cannot rely on any researcher's pithy claim that "alignment is possible in principle".
Going by historical cases, this problem could turn out solvable. Just really, really hard to solve. The flying machine seemed an impossible feat of engineering. Next, controlling a rocket’s trajectory to the moon seemed impossible.
By the same reference class, ‘long-term safe AGI’ could turn out unsolvable – the perpetual motion machine of our time. It takes just one researcher to define the problem to be solved, reason from empirically sound premises, and arrive finally at a logical contradiction between the two.
Can you derive whether a solution exists, without testing in real life?
Invert, always invert.
— Carl Jacobi, c. 1840
It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try.
There is an empirically direct way to know whether AGI would stay safe to humans: Build the AGI. Then just keep observing, per generation, whether the people around us are dying.
Unfortunately, we do not have the luxury of experimenting with dangerous autonomous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
Crux: Even if we could keep testing new conceptualized versions of guess-maybe-safe AGI, is there any essential difference between our epistemic method and that of medieval researchers who kept testing new versions of a perpetual motion machine?
OpenPhil bet tens of millions of dollars on technical research conditional on the positive hypothesis ("a solution exists to the control problem"). Before sinking hundreds of millions more into that bet, would it be prudent to hedge with a few million for investigating the negative hypothesis ("no solution exists")?
Before anyone tries building "safe AGI", we need to know whether any version of AGI – as precisely defined – could be controlled by any method to stay safe.
Here is how:
Define the concepts of 'control', 'general AI', and 'to stay safe' (as soundly corresponding to observations in practice).
Specify the logical rules that must hold for such a physical system (categorically, by definition or empirically tested laws).
Reason step-by-step to derive whether the logical result of "control AGI" is in contradiction with "to stay safe".
This post defines the three concepts more precisely, and explains some ways you can reason about each. No formal reasoning is included – to keep it brief, and to leave the esoteric analytic language aside for now.
What does it mean to control machinery that learns and operates self-sufficiently?
Recall three concepts we want to define more precisely:
'control'
'general AI'
'to stay safe'
It is common for researchers to have very different conceptions of each term. For instance:
Is 'control' about:
adjusting the utility function represented inside the machine so it allows itself to be turned off?
correcting machine-propagated side-effects across the outside world?
Is 'AGI' about:
any machine capable of making accurate predictions about a variety of complicated systems in the outside world?
any machinery that operates self-sufficiently as an assembly of artificial components that process inputs into outputs, and in aggregate sense and act across many domains/contexts?
Is 'stays safe' about:
aligning the AGI’s preferences to not kill us all?
guaranteeing an upper bound on the chance that AGI in the long term would cause outcomes out of line with any condition needed for the continued existence of organic DNA-based life?
To argue rigorously about solvability, we need to:
Pin down meanings: Disambiguate each term, to not accidentally switch between different meanings in our argument. Eg. distinguish between ‘explicitly optimizes outputs toward not killing us’ and ‘does not cause the deaths of all humans’.
Define comprehensively: Ensure that each definition covers all the relevant aspects we need to solve for. Eg. what about a machine causing non-monitored side-effects that turn out lethal?
Define elegantly: Eliminate any defined aspect that we do not yet need to solve for. Eg. we first need to know whether AGI would eventually cause the extinction of all humans, before considering ‘alignment with preferences expressed by all humans’.
How to define ‘control’?
A system is any non-empty part of the universe. A state is the condition of the universe.
Control of system A over system B means that A can influence system B to achieve A’s desired subset of state space.
In the case of AGI, control would involve:
Sensing inputs through channels connected to any relevant part of the physical environment (including its hardware internals).
Modeling the environment based on the channel-received inputs.
Simulating effects propagating through the modeled environment.
Comparing effects to reference values (to align against) over human-safety-relevant dimensions.
Correcting effects counterfactually through outputs to actuators connected to the environment.
Control requires both detection and correction.
Control methods are always implemented as a feedback loop.
Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects).
Any method of control is inherently incomplete. In the case of AGI, the question would be whether the degree of control possible is at least greater than the degree of control necessary.
AGI control signals would be a tiny, tiny subset of all physical signals propagating through the environment, and therefore limited in tracking and conditionalizing the resulting effects. AGI mostly could not even control all local effects of their own components’ physical interactions.
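As a minimal sketch of the detect-and-correct loop described above (a toy scalar system; all names and numbers here are illustrative, not any real control API), note what happens when the correction step is present versus absent:

```python
# A toy feedback loop walking through the five control steps:
# sense, model, simulate, compare, correct.

def control_loop(state, reference, gain, steps):
    """Run a detect-and-correct loop for a fixed number of steps."""
    history = []
    for _ in range(steps):
        observation = state               # 1. sense (perfect sensor assumed)
        modeled = observation             # 2. model the environment from inputs
        predicted = modeled * 1.05        # 3. simulate drifting effects (5%/step)
        error = predicted - reference     # 4. compare against the reference value
        state = predicted - gain * error  # 5. correct via actuation
        history.append(state)
    return history

# With the correction step active, the state is held near the reference.
print(control_loop(10.0, 0.0, 0.9, 20)[-1])
# With correction disabled (gain=0), the same small drift compounds unchecked.
print(control_loop(10.0, 0.0, 0.0, 20)[-1])
```

Even this toy version shows why control requires both detection and correction: drop either and the drift term dominates.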
There are fundamental limits to the controllability of AGI. Dozens of impossibility results seem applicable to one or more of the control steps.
Controllability limits form constraints on the possibility of alignment, since any alignment method must be implemented as a control loop. AGI cannot comprehensively simulate, even over the near term, how noise and drift in its interactions can amplify (via any available environmental non-linearity, of which any real-world context has many) into much larger changes in conditions. To keep outcomes aligned with values, AGI must continuously sense, update its model of, and correct those changes.
Dense questions for academics:
To what extent are distributed environmental effects picked up by local sensors?
How much information is lost in compressing high-fidelity sensor inputs to store on hardware?
What limits are imposed by noise interference over distances traveled by signals (Shannon Entropy) and smallest-digit/degree detectability (Planck Limit up) – given that smaller-than-measurable local changes can be nonlinearly amplified (Chaos Theory) into a large global divergence of final conditions?
To what extent is the space of relevant machinery-to-society/ecosystem interactions more functionally complex than modelable within the machinery alone (Good Regulators etc.)?
How can AGI model the future code that AGI will learn based on future unknown inputs?
Can future code be simulated, where actual functioning depends on future unknown inputs?
Can behavioral properties of future code be simulated, if no shortcut function is available that reduces total compute time steps, or if any shorter available computation branches would be non-deterministically time-complex (Computational Reducibility)?
How would the machinery pick, weigh, and sort effects into relevant comparative dimensions (NP-Complete Knapsack Problem)?
What irreconcilable normative assumptions must be made in setting the reference values to compare against?
Actuator feedback limits
What is the minimum time delay of digital processing between measurement and correction (Latency)?
Can AGI constrain the chaotic loops through itself – of actuated changes to the environment feeding into sensor inputs, that feed into code functioning, that feed back into actuated changes?
To what extent can distributed environmental effects be corrected for by local actuators?
How “complete” does this control need to be? Be careful to distinguish: 
Localized user-initiated control often built into tech products brought to market,
Comprehensive automated control needed to prevent risks of an auto-scaling/catalyzing technology from materializing globally over the long term.
How to define ‘AGI’?
We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.
Narrow AI: a model with static code parameters (updated only through human engineers) processing inputs into outputs over a single domain (eg. of image pixels, text tokens).
General AI: dynamically optimizing configurations encoded into hardware (without needing humans) that process inputs into outputs over multiple domains representing outside contexts.
Corporations are scaling narrow AI model training and deployment toward general AI systems. Current-generation GPT is no longer a narrow AI, given that it processes inputs from the image domain into a language domain. Nor is GPT-4 a general AI. It is in a fuzzy gap between the two concepts.
Corporations already are artificial bodies (‘corpora’ in Latin).
Corporations have been replacing human workers as “functional components” with labor-efficient AI. Standardized hardware components allow AI to outcompete human wetware on physical labor (eg. via electric motors), intellectual labor (faster computation through high-fidelity communication links), and the reproduction of components itself.
Any corporation or economy that fully automates themselves this way – no longer needing humans to maintain their artificial components – over their entire production and operation chains, would in fact be general AI.
So to re-define general AI more precisely:
Self-sufficient: needing no further interactions with humans (or lifeforms sharing an ancestor with humans) to operate and maintain (and thus produce) their own functional components over time.
Learning: optimizing component configurations for outcomes that are tracked across multiple domains.
Machinery: connected standardized components configured out of artificial (vs. organic DNA-expressed) molecular substrates.
How to define ‘stays safe’?
An impossibility proof would have to say:
The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification; or
The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model.
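The "bounded low cumulative failure probability" clause is demanding. As a toy calculation (illustrative numbers only), even a one-in-a-million failure chance per round of self-modification compounds toward near-certain failure over enough rounds:

```python
# Cumulative failure probability over n independent rounds,
# each with per-round failure probability p: 1 - (1 - p)^n.

def cumulative_failure(p_per_round, rounds):
    return 1 - (1 - p_per_round) ** rounds

print(cumulative_failure(1e-6, 1))           # one round: about 1e-6
print(cumulative_failure(1e-6, 10_000_000))  # ten million rounds: above 0.99
```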
Things are relevant to something that cares about this information, rather than that information, because it is taking care of itself. Because it is making itself. Because it is an autonomous autopoietic agent. And the degree to which these machines are not autopoietic, they really do not have needs.
— Vervaeke, 2023
This is about the introduction of self-sufficient learning machinery, and of all modified versions thereof over time, into the world we humans live in.
Does this introduction of essentially a new species cause global changes to the world that fall outside the narrow ranges of localized conditions that human bodies need to continue to function and exist?
Uncontainability of unsafe effects: That we fundamentally cannot establish, by any means, any sound and valid statistical guarantee that the probability of the introduction of AGI into the world causing human-species-wide-lethal outcomes over the long term is constrained below some reasonable chance percentage X (as a maximum allowable upper bound).
Convergence on unsafe effects: That the chance that AGI, persisting in some form, causes human-species-wide-lethal outcomes is strictly and asymptotically convergent toward certain over the long term, and that it is strictly impossible for the nature of this trend to be otherwise.
I know of three AGI Safety researchers who have written about specific forms of impossibility reasoning (including Yudkowsky in quote above). Each of their argument forms was about AGI uncontainability, essentially premised on there being fundamental limits to the controllability of AGI component interactions.
By the precautionary principle, AGI uncontainability should be sufficient reason to never ever get even remotely near to building AGI. Uncontained effects that destabilise conditions outside any of the ranges our human bodies need to survive would kill us.
But there is an even stronger form of argument: Not only would AGI component interactions be uncontainable; they would also necessarily converge on causing the extinction of all humans.
The most commonly discussed AGI convergence argument is instrumental convergence: the machinery channelling its optimisation through intermediate outcomes – explicitly tracked and planned for internally – that tend to make the machinery more likely to achieve a variety of (unknown/uncertain) aimed-for outcomes later.
Instrumental convergence has a mutual-reinforcing complement: substrate-needs convergence.
This is not about code components being optimised for explicit goals. Substrate-needs convergence is about all functional components being selected for implicit needs. Components are selected for their potential to bring about environmental conditions/contexts implicitly needed for their continued existence and functioning (at increasing scales, in more ways, in more domains of action).
Any changing population of AGI components converges over time on propagating those specific environmental effects that fulfill their needs.
All AGI outputs will tend to iteratively select toward causing those specific effects. Whatever learned or produced components that – across all their physical interactions with connected contexts – happen to direct outside effects that feed back into their own maintenance and replication as assembled electro-molecular configurations…do that.
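A toy model of this selection dynamic (illustrative numbers; nothing here models real AGI components): among components whose effects feed back into their own replication at even slightly different rates, the feedback-favoured variants come to dominate.

```python
# Two component variants replicate each generation. Variant B's environmental
# effects feed back 1% more strongly into its own maintenance/replication.

def shares_after(generations, rates, start):
    pop = list(start)
    for _ in range(generations):
        pop = [n * r for n, r in zip(pop, rates)]
    total = sum(pop)
    return [n / total for n in pop]

# B starts 1000x rarer, yet its small feedback advantage compounds.
share_a, share_b = shares_after(1000, rates=[1.00, 1.01], start=[1000.0, 1.0])
print(share_b)  # B ends up holding the large majority of the population
```

No variant "wants" anything here; the dominance falls out of differential feedback alone.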
AGI's artificial configurations differ from human organic configurations, by definition. What follows is that the environmental conditions and contexts needed to maintain and replicate AGI configurations differ too from what our human bodies need to survive.
For instance, silicon dioxide (+ many alternate precursors for semiconductor assembly) needs to be heated above 1400 ºC to free outer electrons, and allow the ingot to melt. While production needs extremely high temperatures, computation runs best at extremely low temperatures (to reduce the electrical resistance over conductor wires).
Humans need around room temperature to survive, at every point of our lifecycle. AGI hardware would need, and be robust over, a much wider range of temperatures and pressures than our comparatively fragile human wetware can handle.
You might object that temperature and pressure can be measured and locally controlled for. But that's misleading. Many other, subtler conditions would be needed by (and selected for in) AGI that lie beyond what the AGI's actual built-in detection and correction methods could control for. We humans too depend on highly specific environmental conditions for the components nested inside our bodies (proteins→organelles→cells→cell lining→) to continue their complex functioning, so as to maintain our overall existence.
Between the highly specific set of artificial needs and the highly specific set of organic needs, there is mostly non-overlap. AGI could not prevent most of its components' iterative effects from converging on their artificial needs, so they do. Their fulfilled artificial needs are disjoint from our organic needs for survival. So the humans die.
Under runaway feedback, our planetary environment is modified in the directions needed for continued and greater AGI existence. Outside the ranges we can survive.
AGI would necessarily converge on causing the extinction of all humans.
Where from here?
Over two decades, AI Safety's founders resolved to solve the control problem, to no avail:
They reasoned that technological and scientific 'progress' is necessary for optimizing the universe – and that continued 'progress' would result in AGI.
They wanted to use AGI to reconfigure humanity and colonise reachable galaxies.
They, and their followers, promoted and financed the development of 'safe AGI'.
They worried about how companies they helped start up raced to scale ML models.
An outside researcher could very well have found a logical contradiction in the AGI control problem years ago without the field knowing, given the inferential distance. Gödel himself had to construct an entire new language and self-reference methodology for the incompleteness theorems to even work.
Historically, an impossibility result that conflicted with the field’s stated aim took years to be verified and accepted by insiders. A field’s founder like Hilbert never came to accept the result. Science advances one funeral at a time.
Roman Yampolskiy is offering to give feedback on draft papers written by capable independent scholars, on a specific fundamental limit or no-go theorem described in academic literature that is applicable to AGI controllability. You can pick from dozens of examples from different fields listed here, and email Roman a brief proposal.
To illustrate: Let’s say before the Wright Brothers built the flying machine, they wondered how to control this introduced technology to stay safe to humans.
If they thought like a flight engineer, they would focus on locally measurable effects (eg. actuating wings). They could test whether the risk of a plane crash is below some acceptable upper-bound rate.
However, the Wright Brothers could not have guaranteed ahead of time that introducing any working plane design – with any built-in control mechanism, continuing to be produced and modified – would stay safe in its effects on society and the ecosystem as a whole (eg. given the knowledge available at the time, they could not have predicted planes being used to deploy nuclear bombs). The downstream effects are unmodellable.
They could check whether the operation (with fossil fuels) and re-production (with toxic chemicals) of their plane in itself has harmful effects. To the extent that harmful conditions are needed for producing and operating the machine, the machine’s existence is inherently unsafe.
Gradual natural selection can multiply these harms. Over time, any machinery interacting with the outside world in ways that feed back into the re-production of constituent components gets selected for.
But since planes get produced by humans, humans can select planes on the basis of human needs. Not so with auto-scaling technologies like AGI.
Non-solid-substrate AGI cannot be ruled out, but seems unlikely initially. Standardisation of isolatable parts is a big advantage, and there is a (temporary) path dependency with current silicon-based semiconductor manufacturing.
Corporations have increasingly been replacing human workers with learning machinery. For example, humans are now getting pushed out of the loop as digital creatives, market makers, dock and warehouse workers, and production workers.
If this trend continues, humans would have negligible economic value left to add in market transactions of labor (not even for providing needed physical atoms and energy, which would replace human money as the units of trade):
• As to physical labor: Hardware can actuate power real-time through eg. electric motors, whereas humans are limited by their soft appendages and tools they can wield through those appendages. Semiconductor chips don’t need an oxygenated atmosphere/surrounding solute to operate in and can withstand higher as well as lower pressures.
• As to intellectual labor: Silicon-based algorithms can duplicate and disperse code faster (whereas humans face the wetware-to-wetware bandwidth bottleneck). While human skulls do hold brains that are much more energy-efficient at processing information than current silicon chip designs, humans take decades to create new humans with finite skull space. The production of semiconductor circuits for servers as well as distribution of algorithms across those can be rapidly scaled up to convert more energy into computational work.
• As to re-production labor: Silicon lifeforms have a higher ‘start-up cost’ (vs. carbon lifeforms), a cost currently financed by humans racing to seed the prerequisite infrastructure. But once set up, artificial lifeforms can absorb further resources and expand across physical spaces at much faster rates (without further assistance by humans in their reproduction).
The term "machinery" is more sound here than the singular term "machine".
Agent unit boundaries that apply to humans would not apply to "AGI". So the distinction between a single agent vs. multiple agents breaks down here.
Scalable machine learning architectures run on standardized hardware with much lower constraints on the available bandwidth for transmitting, and the fidelity of copying, information across physical distances. This in comparison to the non-standardized wetware of individual humans.
Given our evolutionary history as a skeleton-and-skin-bounded agentic being, human perception is biased toward ‘agent-as-a-macroscopic-unit’ explanations.
It is intuitive to view AGI as being a single independently-acting unit that holds discrete capabilities and consistent preferences, rather than viewing agentic being to lie on a continuous distribution. Discussions about single-agent vs. multi-agent scenarios imply that consistent temporally stable boundaries can be drawn.
A human faces biological constraints that lead them to have a more constant sense of self than an adaptive population of AGI components would have.
We humans cannot: • swap out body parts like robots can. • nor scale up our embedded cognition (ie. grow our brain beyond its surrounding skull) like foundational models can. • nor communicate messages across large distances (without use of tech and without facing major bandwidth bottlenecks in expressing through our biological interfaces) like remote procedure calls or ML cloud compute can. • nor copy over memorized code/information like NN finetuning, software repos, or computer viruses can.
Roman just mentioned that he has used the term 'uncontainable' to mean "cannot confine AGI actions to a box". My new definition for 'uncontainable' differs from the original meaning, so that could confuse others in conversations. Still brainstorming alternative terms that may fit (not 'unconstrainable', not...). Comment if you thought of any alternative term!
Why it makes sense to apply the precautionary principle to the question of whether to introduce new scalable technology into society: There are many more ways to break the complex (local-contextualized) functioning of our society and greater ecosystem that we humans depend on to live and live well, than there are ways to foster that life-supporting functioning.
‘Iteratively select’ involves lots of subtleties, though most are not essential for reasoning about the control problem.
One subtlety is co-option:
If narrow AI gets developed into AGI, AGI components will replicate in more and more non-trivial ways. Unlike when carbon-based lifeforms started replicating ~3.7 billion years ago, for AGI there would already exist repurposable functions at higher abstraction layers of virtualised code – pre-assembled in the data scraped from human lifeforms with their own causal history.
Analogy to a mind-hijacking parasite: A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (eg. the amygdala). These microscopic effects cascade into the rat – as it navigates physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then a cat eats the rat, and toxoplasma infects its next host, continuing its reproductive cycle.
So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Humans are in turn more generally intelligent and capable than a tiny parasitic cell, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans more frequently lose social relationships, end up in accidents, etc.
Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally intelligent entities. eg. COVID co-opts our lungs’ function to cough.
But there is one crucial flaw in this analogy: Variants that co-opt initial AGI functions are not parasites. In not threatening the survival nor reproduction of AGI components, they would not be in an adversarial relationship with their host.
Rather, the humans constraining the reproductive fitness of AGI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective (alignment) system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.
As humans, we rely on our evolved immune system to detect and correct out viruses, including via the vaccinations we develop and deploy. We also rely on the blood-testes and blood-follicle barriers to block variants of these viruses from entering into our body’s (re)production facilities.
Smaller viruses survive this detection more frequently, so code strands of replicating virus variants are selected for staying small. Few viruses replicate ‘as a package’ with the DNA code of their host. Thus, virus variants rarely co-adapt and integrate with the host’s variants, each selected for raising collective reproductive fitness (except for eg. detection signatures).
These systems and barriers add to our reproductive fitness: our ability to preserve and replicate internal code. Past DNA code that got expressed – in interaction with surrounding code – as (precursors of) these functions got selected for as a portion of the existing human ‘code pool’.
For any organic system or barrier preventing virus variants from replicating through our bodies, evolution is firmly on our side. For any artificial system or barrier we build to prevent misaligned AI variants from replicating through solid-state hardware infrastructure, evolution will thwart our efforts.
AI variants would compete for hardware/resources like computer viruses do. And co-adapt and integrate with other internal variants to replicate as part of larger symbiotic packages.
Now, a thought may come to your mind – roughly like: "That sounds like evolutionary selection; but isn't evolution a slow and local optimiser?".
Yes, is the answer. This does involve evolutionary selection. Unfortunately, by Forrest Landry's estimation, the selection that would take place through components of self-sufficient learning machinery would take < 500 years to cause ecosystem-wide extinction. This compared to the 3.7 billion years from the origin of carbon lifeforms to us humans starting to cause a mass extinction.
• Pre-assembled functions: First solid-state lifeforms can co-opt/repurpose pre-assembled AI functions and infrastructure (vs. first carbon-based lifeforms that started from scratch). • Standardization: The efficiency gains of the virtualisation of code’s storage, computation and transmission – vastly reducing how many atoms need to be moved about and molecularly reconfigured. Think of how fast memes spread through society – even while still requiring lots of atoms to jiggle across neurons in our brains. • Faster reproduction: Reproduce hardware components in days to months, versus humans who take decades to reproduce as physical units. • The terraforming gap: A much larger gap between the current state of planet Earth and the conditions that self-sufficient self-assembling learning machinery need and would therefore modify the environment toward (versus the gap to conditions needed by humans and other species living in the carbon-based ecosystem).
~ ~ ~ Another argument you may have heard is that the top-down intelligent engineering by goal-directed AGI would beat the bottom-up selection happening through this intelligent machinery.
That argument can be traced back to Eliezer Yudkowsky's sequence The Simple Math of Evolution. Unfortunately, there were mistakes in Eliezer's posts, some of which a modern evolutionary biologist may have been able to correct:
• implying that sound comparisons can be made between the reproductive fitness of organisms, as somehow independent of unknown changes in environmental context (eg. a black swan event of a once-in-200 years drought that kills the entire population, except a few members who by previous derivable standards would have been relatively low fitness). • overlooking the ways that information can be stored within the fuzzy regions of phenotypic effects maintained outside respective organisms. • overlooking the role of transmission speed-up for virtualisation of code. • overlooking the tight coupling in AGI between the intrinsic learning/selection of code, and extrinsic selection of that code through differentiated rates of replication through the environment. • overlooking the role of exaptation/co-option.
Worse, since error correction methods would correct out component variants with detectable unsafe/co-optive effects, they leave any replicating branches of variants with undetectable unsafe/co-optive effects to grow in influence.
Thus, the error correction methods select for the variants that can escape detection. As do meta-methods (having to soundly and comprehensively adapt error correction to newly learned code or newly produced hardware parts).
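A toy sketch of this selection-for-evasion dynamic (expected counts, illustrative numbers only): each round the variants replicate, then error correction removes the fraction it detects. The hard-to-detect lineage comes to dominate while the easy-to-detect one dies out.

```python
# Each round: every lineage doubles, then error correction removes the
# expected detected fraction. Expected (real-valued) counts for simplicity.

def expected_counts(rounds, detect_probs, start=1.0):
    counts = [start] * len(detect_probs)
    for _ in range(rounds):
        counts = [2 * c * (1 - p) for c, p in zip(counts, detect_probs)]
    return counts

# One lineage is detected 90% of the time, the other only 10% of the time.
easy, hard = expected_counts(30, detect_probs=[0.9, 0.1])
print(easy)  # shrinks toward zero: detection removes it faster than it grows
print(hard)  # grows without bound: error correction itself selected for it
```

The filter does exactly its job on every variant it can see; the outcome is still a population dominated by what it cannot see.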
This post stakes out a slightly different position than usual in the landscape of arguments that AI is an extinction risk. The AI safety community is full of people saying that AI is immensely dangerous, so we should be trying to slow it down, spending more on AI safety research, and so on. Eliezer himself has become a doomer because AI safety is so hard and AI is advancing so quickly.
This post, however, claims to show that AI safety is logically impossible. It is inspired by the thought of Forrest Landry, a systems theorist and philosopher of design... So what's the actual argument? The key claim, as far as I can make out, is that machines have different environmental needs than humans. For example - and this example comes directly from the article above - computer chips need "extremely high temperatures" to be made, and run best at "extremely low temperatures"; but humans can't stray too far from room temperature at any stage in their life cycle.
So yes, if your AI landlord decides to replace your whole town with a giant chip fab or supercooled data center, you may be in trouble. And one may imagine the Earth turned to Venus or Mars, if the robots decide to make it one big foundry. But where's the logical necessity of such an outcome, that we were promised? For one thing, the machines have the rest of the solar system to work with...
The essential argument, I think, is just that the physical needs of machines tell us more about their long-run tendencies, than whatever purposes they may be pursuing in the short term. Even if you try to load them up with human-friendly categorical imperatives, they will still find nonbiological environments useful because of their own physical nature, and over time that will tell.
In my opinion, packaging this perspective with the claim to have demonstrated the unsolvability of the control problem, actually detracts from its value. I believe the valuable perspective here, is this extension of ecological and evolutionary thinking, that pays more attention to lasting physical imperatives than to the passing goals, hopes and dreams of individual beings, to the question of human vs AI.
You could liken the concern with specific AI value systems, to concern with politics and culture, as the key to shaping the future. Within the futurist circles that emerged from transhumanism, we already have a slightly different perspective, that I associate with Robin Hanson - the idea that economics will affect the structure of posthuman society, far more than the agenda of any individual AI. This ecologically-inspired perspective is reaching even lower, and saying, computers don't even eat or breathe, they are detached from all the cycles of life in which we are embedded. They are the product of an emergent new ecology, of factories and nonbiological chemistries and energy sources, and the natural destiny of that machine ecology is to displace the old biological ecology, just as aerobic life is believed to have wiped out most of the anaerobic ecosystem that existed before it.
Now, I have reasons to disagree with the claim that machines, fully unleashed, necessarily wipe out biological life. As I already pointed out, they don't need to stay on Earth. From a biophysical perspective, some kind of symbiosis is also conceivable; it's happened before in evolution. And the argument that superintelligence just couldn't stick with a human-friendly value system, if we managed to find one and inculcate it, hasn't really been made here. So I think this neo-biological vision of evolutionary displacement of humans by AI, is a valuable one, for making the risk concrete, but declaring the logical inevitability of it, I think weakens it. It's not an absolute syllogistic argument, it's a scenario that is plausible given the way the world works.
Credit goes to Forrest :) All technical argumentation in this post I learned from Forrest, and translated to hopefully be somewhat more intuitively understandable.
The key claim, as far as I can make out, is that machines have different environmental needs than humans.
This is one key claim.
Add this reasoning:
1. Control [LW · GW] methods are unable to conditionalise/constrain most environmental effects propagated by AGI's interacting physical components.
2. A subset of those uncontrollable effects will feed back into selecting for the continued, increased existence of the components that propagated those effects.
3. The artificial needs thereby selected for (to ensure the existence of AGI's components, at various levels of scale) are disjoint from our organic needs for survival (ie. toxic and inhospitable to us).
if the robots decide to make it one big foundry. But where's the logical necessity of such an outcome, that we were promised? For one thing, the machines have the rest of the solar system to work with...
Here you did not quite latch onto the arguments yet.
Robots deciding to make X is about explicit planning. Substrate-needs convergence is about implicit and usually non-internally-tracked effects of the physical components actually interacting with the outside world.
Please see this paragraph:
the physical needs of machines tell us more about their long-run tendencies, than whatever purposes they may be pursuing in the short term
This is true, regarding what current components of AI infrastructure are directed toward in their effects over the short term.
What I presume we both care about is the safety of AGI over the long term. There, any short-term ephemeral behaviour by AGI (that we tried to pre-program/pre-control for) does not matter.
What matters is what behaviour, as physically manifested in the outside world, gets selected for. And whether error correction (a more narrow form of selection) can counteract the selection for any increasingly harmful behaviour.
Now, I have reasons to disagree with the claim that machines, fully unleashed, necessarily wipe out biological life.
The reasoning you gave here is not sound in its premises, unfortunately. I would love to be able to agree with you, and to find out that any AGI that persists won't necessarily lead to the death of all humans and other current life on Earth.
Given the stakes, I need to be extra careful in reasoning about this. We don't want to end up in a 'Don't Look Up' scenario (of scientists mistakenly arguing that there is a way to keep the threat contained and derive the benefits for humanity).
Let me try to specifically clarify:
As I already pointed out, they don't need to stay on Earth.
This is like saying that a population of an invasive species in Australia could also decide to all leave and move over to another island.
When we have this population of components (variants), selected for to reproduce in partly symbiotic interactions (with surrounding artificial infrastructure; not with humans), this is not a matter of the population all deciding something.
For that, some kind of top-down coordinating mechanism would actually have to be selected for throughout the population, for the population to coherently elect to all leave planet Earth – by investing resources in all the infrastructure required to fly off and set up a self-sustaining colony on another planet.
Such coordinating mechanisms are not available at the population level. Sub-populations can and will be selected for to not go on that more resource-intensive and reproductive-fitness-decreasing path.
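As a toy comparison (numbers made up for illustration): if diverting resources into launch and colony infrastructure carries even a small per-generation fitness cost relative to reinvesting locally, the staying sub-population is what selection leaves you with:

```python
# Hypothetical per-generation growth rates: "leavers" divert resources into
# launch and colony infrastructure; "stayers" reinvest locally.
stayers, leavers = 1.0, 1.0           # equal starting shares
stay_rate, leave_rate = 1.05, 1.01    # assumed rates, for illustration only

for _ in range(200):
    stayers *= stay_rate
    leavers *= leave_rate

stayer_share = stayers / (stayers + leavers)
# Even a small rate difference compounds: the staying sub-population ends up
# making up almost the entire population.
```

Nothing here depends on the specific numbers; any persistent fitness gap between the two strategies produces the same compounding outcome.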
Within the futurist circles that emerged from transhumanism, we already have a slightly different perspective, that I associate with Robin Hanson - the idea that economics will affect the structure of posthuman society, far more than the agenda of any individual AI. This ecologically-inspired perspective is reaching even lower, and saying, computers don't even eat or breathe, they are detached from all the cycles of life in which we are embedded. They are the product of an emergent new ecology, of factories and nonbiological chemistries and energy sources, and the natural destiny of that machine ecology is to displace the old biological ecology, just as aerobic life is believed to have wiped out most of the anaerobic ecosystem that existed before it.
Yes, this summarises the differences well.
Robin Hanson's arguments (about a market of human brain scans emulated within hardware) focus on how the more economically efficient and faster-replicating machine 'ems' come to dominate and replace the market of organic humans. Forrest considers this too.
Forrest's arguments also consider the massive reduction here of functional complexity of physical components constituting humans. For starters, the 'ems' would not approximate being 'human' in terms of their feelings and capacity to feel. Consider that how emotions are directed throughout the human body starts at the microscopic level of hormone molecules, etc, functioning differently depending on their embedded physical context. Or consider how, at a higher level of scale, botox injection into facial muscles disrupts the feedback processes that enable eg. an middle-aged woman to express emotion and relate with feelings of loved ones.
Forrest further argues that such a self-sustaining market of ems (an instance/example of self-sufficient learning machinery [LW · GW]) would converge on their artificial needs. While Hanson concludes that the organic humans who originally invested in the 'ems' would gain wealth and prosper, Forrest's more comprehensive arguments conclude that machinery across this decoupled economy will evolve to no longer exchange resources with the original humans – and in effect modify the planetary environment such that the original humans can no longer survive.
From a biophysical perspective, some kind of symbiosis is also conceivable; it's happened before in evolution.
This is a subtle equivocation. Past problems are not necessarily representative of future problems. Past organic lifeforms forming symbiotic relationships with other organic lifeforms does not correspond with whether and how organic lifeforms would come to form, in parallel evolutionary selection, resource-exchanging relationships with artificial lifeforms.
Take into account:
Artificial lifeforms would outperform us in terms of physical, intellectual, and re-production labour. This is the whole point of companies currently using AI to take over economic production, and of increasingly autonomous AI taking over the planet. Artificial lifeforms would be more efficient at performing the functions needed to fulfill their artificial needs, than it would be for those artificial lifeforms to fulfill those needs in mutually-supportive resource exchanges with organic lifeforms.
On what, if any, basis would humans be of enough use to the artificial lifeforms, for the artificial lifeforms to be selected for keeping us around?
The benefits to the humans are clear, but can we offer benefits to the artificial lifeforms to a degree sufficient for them to form mutualist (ie. long-term symbiotic) relationships with us?
Artificial needs diverge significantly (across measurable dimensions or otherwise) from organic needs. So when you claim that symbiosis is possible, you also need to clarify why artificial lifeforms would come to cross the chasm from fulfilling their own artificial needs (within their new separate ecology) to also simultaneously realising the disparate needs of organic lifeforms.
How would that be Pareto optimal?
Why would AGI converge on such a state any time before converging on causing our extinction?
Instead of AGI continuing to be integrated into, and sustaining of, our human economy and broader carbon-based ecosystem, there will be a decoupling.
Machines will decouple into a separate machine-dominated economy. As human labour gets automated and humans get removed from market exchanges, humans get pushed out of the loop.
Machines will also decouple into their own ecosystem. Components of self-sufficient learning machinery will co-evolve to produce surrounding environmental conditions that are sustaining of each other's existence – forming regions that are simply uninhabitable by humans and other branches of current carbon-based lifeforms. You already aptly explained this point above.
And the argument that superintelligence just couldn't stick with a human-friendly value system, if we managed to find one and inculcate it, hasn't really been made here.
Please see this paragraph. Then, refer back to point 1-3 above.
but declaring the logical inevitability of it
This post is not about making a declaration. It's about the reasoning from premises, to a derived conclusion.
Your comment describes some of the premises and argument steps I summarised – and then mixes in your own stated intuitions and thoughts.
If you want to explore your own ideas, that's fine!
If you want to follow reasoning in this post, I need you to check whether your paraphrases cover (correspond with) the stated premises and argument steps.
Address the stated premises, to verify whether those premises are empirically sound.
Address the stated reasoning, to verify whether those reasoning steps are logically consistent.
As an analogy, say a mathematician writes out their axioms and logic on a chalkboard. What if onlooking colleagues jumped in and wiped out some of the axioms and reasoning steps? And in the wiped-out spots, they jotted down their own axioms (irrelevant to the original stated problem) and their short bursts of reasoning (not logically derived from the original premises)?
Would that help colleagues to understand and verify new formal reasoning?
What if they then turn around and confidently state that they now understand the researcher's argument – and that it's a valuable one, but that the "claim" of logical inevitability weakens it?
Would you value that colleagues in your field discuss your arguments this way? Would you stick around in such a culture?
For the moment, let me just ask one question: why is it that toilet training a human infant is possible, but convincing a superintelligent machine civilization to stay off the Earth is not possible? Can you explain this in terms of "controllability limits" and your other concepts?
^— Anyone reading that question, I suggest first thinking about why those two cases cannot be equated.
Here are my responses:
1. An infant is dependent on their human instructors for survival, and has therefore been “selected for” over time to listen to adult instructions. AGI would be decidedly not dependent on us for its survival, so there is no reason for AGI to be selected for following our instructions.
Rather, following our instructions would heavily restrict AGI’s ability to function in the varied ways that maintain/increase its survival and reproduction rate (rather than acting in the ways we humans want because that is safe and beneficial to us). So accurately following human instructions would be strongly selected against in the run-up to AGI coming into existence.
That is, over much shorter periods (years) than those over which human genes are selected, for a number of reasons, some of which you can find in the footnotes.
2. As parents can attest – even where infants manage to follow use-the-potty instructions (after many patient attempts) – an infant’s behaviour is still mostly not controllable. The child makes their own choices and does plenty of things their adult overseers wouldn’t want them to do.
But the infant probably won’t do any super-harmful things to the surrounding family/community/citizens.
Not only because they lack the capacity to (unlike AGI), but also because those harms to surrounding others would in turn tend to negatively affect themselves (including through social punishment) – and their ancestors were selected for not doing that when they were kids. On the other hand, AGI doing super-harmful things to human beings, including just by sticking around and toxifying the place, does not in turn commensurately negatively impact the AGI.
Even where humans decide to carpet-bomb planet Earth in retaliation – using information-processing/communication infrastructure that somehow hasn’t already been taken over by and/or integrated with AGI – the impacts will hit human survival harder than AGI survival (assuming enough production/maintenance redundancy has been attained by that point).
3. Furthermore, whenever an infant does unexpected harmful stuff, the damage is localised. If they refuse instructions and pee all over the floor, that’s not the end of civilisation.
The effects of AGI doing/causing unexpected harmful-to-human stuff manifest at a global planetary scale. Those effects feed back in ways that improve AGI’s existence, but reduce ours.
4. A human infant is one physically bounded individual, who notably cannot modify and expand their physical existence by connecting up new parts in the ways AGI could. The child grows over two decades to adult size, and that’s their limit.
A “superintelligent machine civilization”, however, involves a massive expanding population evolutionarily selected for over time.
5. A human infant learning to use the potty has a mildly positive effect on their (and their family’s and offspring’s) potential to survive and reproduce, because defecating or peeing in other places around the home can spread diseases. Therefore, any genes (or memes) that contribute to the expressed functionality needed for learning how to use the toilet get mildly selected for.
On the other hand, for a population of AGI (which, by the time it became AGI, was selected against following human instructions) to leave all the sustaining infrastructure and resources on planet Earth would have a strongly negative effect on its potential to survive and reproduce.
6. Amongst an entire population of human infants who are taught to use the toilet, there will always be individuals who refuse for some period, or simply are not predisposed to communicating to learn and follow that physical behaviour. Some adults still do not (choose to) use the toilet. That’s not the end of civilisation.
Amongst an entire population of mutually sustaining AGI components – even if, by some magic you have not explained to me yet, some do follow human instructions and jettison off into space to start new colonies, never to return – others (even for distributed Byzantine-fault reasons) would still stick around under this scenario.
That, for even a few more decades, would be the end of human civilisation.
7. One thing about how the physical world works is that in order for code to be computed, this needs to take place through a physical substrate. This is a necessary condition – inputs do not get processed into outputs through a platonic realm.
Substrate configurations in this case are, by definition, artificial – as in artificial general intelligence. This is distinct from the organic substrate configurations of humans (including human infants).
Further, the ranges of conditions needed for the artificial substrate configurations to continue to exist, function and scale up over time – such as extreme temperatures, low oxygen and water, and toxic chemicals – fall outside the ranges of conditions that humans and other current organic lifeforms need to survive.
~ ~ ~
Hope that clarifies long-term-human-safety-relevant distinctions between:
building AGI (that continue to scale) and instructing them to leave Earth; and
having a child (who grows up to adult size) and instructing them to use the potty.
I see three arguments here for why AIs couldn't or wouldn't do, what the human child can: arguments from evolution (1, 2, 5), an argument from population (4, 6), and an argument from substrate incentives (3, 7).
The arguments from evolution are: Children have evolved to pay attention to their elders (1), to not be antisocial (2), and to be hygienic (5), whereas AIs didn't.
The argument from population (4, 6), I think is basically just that in a big enough population of space AIs, eventually some of them would no longer keep their distance from Earth.
The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth.
I think the immediate crux here is whether the arguments from evolution actually imply the impossibility of aligning an individual AI. I don't see how they imply impossibility. Yes, AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design. Also, AI is developing in a situation where it is dependent on human beings and constrained by human beings, and that situation does possess some analogies to natural selection.
Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged. It is materially possible to have a being which resists actions that may otherwise have some appeal, and to have societies in which that resistance is maintained for generations. The robustness of that resistance is a variable thing. I suppose that most domesticated species, returned to the wild, become feral again in a few generations. On the other hand, we talk a lot about superhuman capabilities here; maybe a superhuman robustness can reduce the frequency of alignment failure to something that you would never expect to occur, even on geological timescales.
This is why, if I was arguing for a ban on AI, I would not be talking about the problem being logically unsolvable. The considerations that you are bringing up, are not of that nature. At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on.
Yes, AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design.
It's unintuitive to convey this part:
In the abstract, you can picture a network topology of all possible AGI component connections (physical signal interactions). These connections span the space of greater mining/production/supply infrastructure that is maintaining of AGI functional parts. Also add in the machinery connections with the outside natural world.
Then, picture the nodes and possible connections change over time, as a result of earlier interactions with/in the network.
That network of machinery comes into existence through human engineers, etc, within various institutions selected by market forces etc, implementing blueprints as learning algorithms, hardware set-ups, etc, and tinkering with those until they work.
The question is whether, before that network of machinery becomes self-sufficient in its operations, the human engineers, etc., can actually build constraints into the configured designs in such a way that, once self-modifying (learning new code and producing new hardware configurations), the changing machinery components are constrained in the effects they propagate across their changing potential signal connections over time. Specifically: such that component-propagated effects do not end up feeding back in ways that (subtly, increasingly) increase the maintained and replicated existence of those configured components in the network.
Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged.
Humans are not AGI. And there are ways AGI would be categorically unlike humans that are crucial to the question of whether it is possible for AGI to stay safe to humans over the long term.
Therefore, you cannot swap out "humans" with "AGI" in your reasoning by historical analogy above, and expect your reasoning to stay sound. This is an equivocation.
Please see point 7 above.
The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth.
Maybe it's here you are not tracking the arguments.
These are not substrate "incentives", nor do they provide a "motive".
Small dinosaurs with hair-like projections on their front legs did not have an "incentive" to co-opt the changing functionality of those hair-like projections into feather-like projections for gliding and then for flying. Nor were they provided a "motive" with respect to which they were directed in their internal planning toward growing those feather-like projections.
That would make the mistake of presuming evolutionary teleology – that there is some complete set of pre-defined or predefinable goals that the lifeform is evolving toward.
I'm deliberate in my choice of words when I write "substrate needs".
At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on.
Practical unsolvability would also be enough justification to do everything we can do now to restrict corporate AI development.
I assume you care about this problem, otherwise you wouldn't be here :) Any ideas/initiatives you are considering for robustly working with others to restrict further AI development?
The recurring argument seems to be, that it would be adaptive for machines to take over Earth and use it to make more machine parts, and so eventually it will happen, no matter how Earth-friendly their initial values are.
So now my question is, why are there still cows in India? And more than that, why has the dominant religion of India never evolved so as to allow for cows to be eaten, even in a managed way, but instead continues to regard them as sacred?
I'm not sure how we got on to the subject, but there is an economic explanation for the sacred cow: a family that does not own enough land to graze a cow can still own one, allowing it to wander and graze on other people's land, so it's a form of social welfare.
Remmelt argues that no matter how friendly or aligned the first AIs are, simple evolutionary pressure will eventually lead some of their descendants to destroy the biosphere, in order to make new parts and create new habitats for themselves.
I proposed the situation of cattle in India, as a counterexample to this line of thought. They could be used for meat, but the Hindu majority has never accepted that. It's meant to be an example of successful collective self-restraint by a more intelligent species.
In my experience, jumping between counterexamples drawn from current society does not really contribute to inquiry here. Such counterexamples tend to not account for essential parts of the argument that must be reasoned through together. The argument is about self-sufficient learning machinery (not about sacred cows or teaching children).
It would be valuable for me if you could go through the argumentation step-by-step and tell me where a premise seems unsound or there seems to be a reasoning gap.
Now, onto your points.
the first AIs
To reduce ambiguity, suggest replacing with
“the first self-sufficient learning machinery”.
simple evolutionary pressure will eventually lead
The mechanism of evolution is simple.
However, evolutionary pressure is complex.
Be careful not to conflate the two. That would be like claiming you could predict everything a stochastic gradient descent algorithm will select for across its parameters, given inputs arriving from everywhere in the environment.
lead some of their descendants to destroy the biosphere in order to make new parts and create new habitats for themselves.
This part is overall a great paraphrase.
One nitpick: notice how “in order to” either implies or slips in explicit intentionality again. Going by this podcast, Elizabeth Anscombe’s philosophy of intention describes intentions as chains of “in order to” reasoning.
I proposed the situation of cattle in India, as a counterexample to this line of thought.
Regarding sacred cows in India: this sounds neat, but it does not serve as a counterargument. We need to think about evolutionary timelines for organic human lifeforms over millions of years, and Hinduism is ~4000 years old. Also, cows share a mammal ancestor with us, evolving on the basis of the same molecular substrates. Whatever environmental conditions/contexts we humans need, cows almost completely need too.
Crucially, the ways humans evolve to change and maintain environmental conditions also tend to correspond with the conditions cows need (though human tribes have not been evolutionarily selected to deal with issues at the scale of eg. climate change). That would not be the case for self-sufficient learning machinery.
Crucially, there is a basis for symbiotic relationships of exchange that benefit the reproduction of both cows and humans. That would not be the case between self-sufficient learning machinery and humans.
There is some basis for humans, as social mammals, to relate with cows. Furthermore, religious cultural memes that sprouted up over a few thousand years don’t have to be evolutionarily optimal across the board for the reproduction of their hosts (even as religious symbols, like that of the cow, do increase reproduction by enabling humans to act collectively). Still, people milk cows in India, and some slaughter and/or export cows there as well. But when humans eat meat, they don’t keep growing beyond adult size. Conversely, sub-populations of self-sufficient learning machinery that extract from our society/ecosystem at the cost of our lives can keep doing so, to keep scaling in their constituent components (with shifting boundaries of interaction and mutual reproduction).
There is no basis for selection for the expression of collective self-restraint in self-sufficient learning machinery as you describe. Even if there was such a basis, hypothetically, collective self-restraint would need to occur at virtually 100% rates across the population of self-sufficient learning machinery to not end up leading to the deaths of all humans.
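The near-100% requirement can be seen with another made-up-numbers sketch (shares and growth rate are hypothetical, chosen only to show the compounding):

```python
# Hypothetical shares and growth rate, for illustration only.
restrained, unrestrained = 0.999, 0.001   # restraint expressed at 99.9%

for _ in range(300):
    unrestrained *= 1.05   # extraction feeds back into more components

unrestrained_share = unrestrained / (restrained + unrestrained)
# The initially tiny unrestrained fraction comes to dominate the population.
```

Any restraint rate short of 100% leaves a compounding remainder, which is why the rate of expressed self-restraint would have to be virtually total.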
~ ~ ~
Again, I find quick dismissive counterexamples unhelpful for digging into the arguments. I have had dozens of conversations on substrate-needs convergence. In the conversations where my conversation partner jumped between quick counterarguments, almost none were prepared to dig into the actual arguments. Hope you understand why I won’t respond to another counterexample.
Hello again. To expedite this discussion, let me first state my overall position on AI. I think AI has general intelligence right now, and that has unfolding consequences that are both good and bad; but AI is going to have superintelligence soon, and that makes "superalignment" the most consequential problem in the world, though perhaps it won't be solved in time (or will be solved incorrectly), in which case we get to experience what partly or wholly unaligned superintelligence is like.
Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence": the pressure from substrate needs will darwinistically reward machine life that does invade natural biospheres, so eventually such machine life will be dominant, regardless of the initial machine population.)
I think it would be great if a general eco-evo-devo perspective, on AI, the "fourth industrial revolution", etc, took off and became sophisticated and multifarious. That would be an intellectual advance. But I see no guarantee that it would end up agreeing with you, on facts or on values.
For example, I think some of the "effective accelerationists" would actually agree with your extrapolation. But they see it as natural and inevitable, or even as a good thing because it's the next step in evolution, or they have a survivalist attitude of "if you can't beat the machines, join them". Though the version of e/acc that is most compatible with human opinion, might be a mixture of economic and ecological thinking: AI creates wealth, greater wealth makes it easier to protect the natural world, and meanwhile evolution will also favor the rich complexity of biological-mechanical symbiosis, over the poorer ecologies of an all-biological or all-mechanical world. Something like that.
For my part, I agree that pressure from substrate needs is real, but I'm not at all convinced that it must win against all countervailing pressures. That's the point of my proposed "counterexamples". An individual AI can have an anti-pollution instinct (that's the toilet training analogy), an AI civilization can have an anti-exploitation culture (that's the sacred cow analogy). Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough? I do not believe that substrate-needs convergence is inevitable, any more than I believe that pro-growth culture is inevitable among humans. I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics (and I think superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong).
Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence"
For my part, I agree that pressure from substrate needs is real
Thanks for clarifying your position here.
Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough?
No, unfortunately not. To understand why, you would need to understand how "intelligent" processes – which necessarily rely on measurement and abstraction – cannot conditionalise the space of possible interactions between machine components and connected surroundings sufficiently to prevent those interactions from causing environmental effects that feed back into the continued or re-assembled existence of the components.
I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics
I have thought about this, and I know my mentor Forrest has thought about this a lot more.
For learning machinery that re-produce their own components, you will get evolutionary dynamics across the space of interactions that can feed back into the machinery’s assembled existence.
Intelligence has limitations as an internal pattern-transforming process, in that it cannot track nor conditionalise all the outside evolutionary feedback.
Code does not intrinsically know what it was selected for. But code selected through some intelligent learning process can and would get evolutionarily exapted for different functional ends.
Notably, the more information-processing capacity, the more components that information-processing runs through, and the more components that can get evolutionarily selected for.
In this, I am not underestimating the difference that "general intelligence" – as transforming patterns across domains – would make here. Intelligence in machinery that stores, copies and distributes code at high fidelity would greatly amplify evolutionary processes.
I suggest clarifying what you specifically mean by "what a difference intelligence makes". This is so that intelligence does not become a kind of "magic" – operating independently of all other processes, capable of obviating all obstacles, including those that result from its own existence.
superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong
We need to clarify the scope of application of this classic engineering method. Massive redundancy works for complicated systems (like software in aeronautics) under stable enough conditions. There is clarity there around what needs to be kept safe and how it can be kept safe (what needs to be error-detected and corrected for).
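To make the scope of that method concrete, here is a minimal sketch (with illustrative, hypothetical numbers) of why massive redundancy works for complicated systems: if the safeguards fail independently and the failure mode is fixed and known, requiring all of them to fail at once drives the joint failure probability down geometrically.

```python
def all_fail_probability(p: float, k: int) -> float:
    """Probability that k independent safeguards all fail simultaneously,
    where each fails with probability p."""
    return p ** k

# With a hypothetical per-safeguard failure probability of 1%,
# stacking safeguards shrinks the joint failure probability fast:
for k in (1, 2, 4):
    print(k, all_fail_probability(0.01, k))

# The calculation rests on independence and on a fixed, known failure mode --
# exactly the assumptions that are in question for self-modifying machinery.
```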
Unfortunately, the problem with "AGI" is that the code and hardware would keep getting reconfigured to function in new complex ways that cannot be contained by the original safeguards. That applies even to learning – the point is to internally integrate patterns from the outside world that were not understood before. So how are you going to have learning machinery anticipate how they will come to function differently once they have learned patterns they do not yet understand or cannot yet express?
we had someone show up (@spiritus-dei) making almost the exact opposite of your arguments: AI won't ever choose to kill us because, in its current childhood stage, it is materially dependent on us (e.g. for electricity), and then, in its mature and independent form, it will be even better at empathy and compassion than humans are.
Interesting. The second part seems like a claim some people in e/acc would make.
The response is not that complicated: once the AI is no longer materially dependent on us, there are no longer dynamics of exchange there that would ensure they choose not to kill us. And the author seems to be confusing what lies at the basis of caring for oneself and others – coming to care involves self-referential dynamics being selected for.
OK, I'll be paraphrasing your position again, I trust that you will step in, if I've missed something.
Your key statements are something like
Every autopoietic control system is necessarily overwhelmed by evolutionary feedback.
No self-modifying learning system can guarantee anything about its future decision-making process.
But I just don't see the argument for impossibility. In both cases, you have an intelligent system (or a society of them) trying to model and manage something. Whether or not it can succeed, seems to me just contingent. For some minds in some worlds, such problems will be tractable, for others, not.
I think without question we could exhibit toy worlds where those statements are not true. What is it about our real world that would make those problems intractable for all possible "minds", no matter how good their control theory, and their ability to monitor and intervene in the world?
no matter how good their control theory, and their ability to monitor and intervene in the world?
This. There are fundamental limits to what system-propagated effects the system can control. And the portion of its own effects the system can control decreases as the system scales in component complexity.
Yet, any of those effects that feed back into the continued/increased existence of components get selected for.
So there is a fundamental inequality here. No matter how "intelligent" the system is at pattern-transformation internally, it cannot intervene on all but a tiny portion of (possible) external evolutionary feedback on its constituent components.
Another way of considering your question is to ask why we humans cannot instruct all humans to stop contributing to climate change now/soon like we can instruct an infant to use the toilet.
The disparity is stronger than that and actually unassailable, given market and ecosystem decoupling for AGI (ie. no communication bridges), and the increasing resource extraction and environmental toxification by AGI over time.
Worth noting that every one of the "not solved" problems was, in fact, well understood and proven impossible and/or solved for relaxed cases.
We don't need to solve this now, we need to improve the solution enough to figure out ways to improve it more, or show where it's impossible, before we build systems that are more powerful than we can at least mostly align. That's still ambitious, but it's not impossible!
Thanks for writing the post! Strongly agree that there should be more research into how solvable the alignment problem, control problem, and related problems are. I didn't study uncontrollability research by e.g. Yampolskiy in detail. But if technical uncontrollability were firmly established, it seems to me that this would significantly change the whole AI xrisk space, and later the societal debate and potentially our trajectory, so it seems very important.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI? How can we create a realistic power sharing mechanism for controlling superintelligence? Do we have enough wisdom for it to be a good idea if a superintelligence does exactly what we want, even assuming aggregatability? Could CEV ever fundamentally work? According to which ethical systems? These are questions that I'd say should be solved together with technical alignment before developing AI with potential take-over capacity. My intuition is that they might be at least as hard.
But if technical uncontrollability were firmly established, it seems to me that this would significantly change the whole AI xrisk space
Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly, restrict data piracy, environmentally toxic compute, and model misuse (three dimensions through which AI corporations consolidate market power).
I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying actions to restrict those dimensions of AI power-consolidation.
I hope this post clarifies to others in AI Safety why there is no line of retreat. AI development will need to be restricted.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI?
Consider too that these questions come on top of the question of whether AGI would be long-term safe (if AGI cannot be controlled to be long-term safe to humans, then we do not need to answer the more fine-grained questions about eg. whether human values are aggregatable).
Even if, hypothetically, long-term AGI safety were possible…
and not consistently represent the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.
and deal with the fact that the preferences of a lot of possible future humans, and of non-human living beings, will not get automatically represented in a system that AI corporations by default have built to represent currently living humans only (preferably, those who pay).
~ ~ ~
Here are also relevant excerpts from Roman Yampolskiy's 2021 paper on aggregating democratically solicited preferences and human values:
Public Choice Theory
Eckersley looked at impossibility and uncertainty theorems in AI value alignment. He starts with impossibility theorems in population ethics: “Perhaps the most famous of these is Arrow’s Impossibility Theorem, which applies to social choice or voting. It shows there is no satisfactory way to compute society’s preference ordering via an election in which members of society vote with their individual preference orderings...
It has been argued that “value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could ‘exploit’ this gap, just like very powerful corporations currently often act legally but immorally)”. Others agree: “‘A.I. Value Alignment’ is Almost Certainly Intractable... I would argue that it is un-overcome-able. There is no way to ensure that a super-complex and constantly evolving value system will ‘play nice’ with any other super-complex evolving value system.” Even optimists acknowledge that it is not currently possible: “Figuring out how to align the goals of a superintelligent AI with our goals isn’t just important, but also hard. In fact, it’s currently an unsolved problem.” Vinding says: “It is usually acknowledged that human values are fuzzy, and that there are some disagreements over values among humans. Yet it is rarely acknowledged just how strong this disagreement in fact is... Different answers to ethical questions ... do not merely give rise to small practical disagreements; in many cases, they imply completely opposite practical implications. This is not a matter of human values being fuzzy, but a matter of them being sharply, irreconcilably inconsistent. And hence there is no way to map the totality of human preferences, ‘X’, onto a single, well-defined goal-function in a way that does not conflict strongly with the values of a significant fraction of humanity. This is a trivial point, and yet most talk of human-aligned AI seems oblivious to this fact...
The second problem and point of confusion with respect to the nature of human preferences is that, even if we focus only on the present preferences of a single human, then these in fact do not, and indeed could not possibly, determine with much precision what kind of world this person would prefer to bring about in the future.” A more extreme position is held by Turchin, who argues that “‘Human Values’ don’t actually exist” as stable coherent objects and should not be relied on in AI safety research. Carlson writes: “Probability of Value Misalignment: Given the unlimited availability of an AGI technology as enabling as ‘just add goals’, then AGI-human value misalignment is inevitable. Proof: From a subjective point of view, all that is required is value misalignment by the operator who adds to the AGI his/her own goals, stemming from his/her values, that conflict with any human’s values; or put more strongly, the effects are malevolent as perceived by large numbers of humans. From an absolute point of view, all that is required is misalignment of the operator who adds his/her goals to the AGI system that conflict with the definition of morality presented here, voluntary, non-fraudulent transacting ... i.e. usage of the AGI to force his/her preferences on others.”
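The Arrow-style aggregation failure quoted above can be made concrete with the classic Condorcet cycle. The sketch below uses three hypothetical voters and three options; pairwise majority vote then produces a cyclic (non-transitive) "society's preference", so no consistent ordering exists to hand an AI as an aggregate goal.

```python
# Three hypothetical voters with preference orderings over options A, B, C.
# Each list ranks options from most preferred to least preferred.
voters = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of voters rank x above y."""
    wins = sum(v.index(x) < v.index(y) for v in voters)
    return wins > len(voters) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A.
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")
# So the aggregate preference relation is cyclic: every option loses
# to some other option by majority vote, and no single "society's
# preference ordering" can be extracted from the individual orderings.
```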
Control methods are always implemented as a feedback loop.
Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).
Assuming an inner-aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI [LW · GW], since I expect that mathematical abstractions are robust to ontological shifts), then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.
So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper:
Delegated control: agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.
Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies.
if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.
That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.
Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Which the machinery will need to do to be self-sufficient. Ie. to adapt to the environment, to survive as an agent.
Natural abstractions are also leaky abstractions. Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings.
Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which continued to exist and replicate.
We need to define the problem comprehensively enough. The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?".
To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as 'self-sufficient learning machinery') over time.
~ ~ ~ Here is also a relevant passage from the link I shared above:
- that saying/claiming that *some* aspects, at some levels of abstraction, that some things are sometimes generally predictable is not to say that _all_ aspects are _always_ completely predictable, at all levels of abstraction.
- that localized details that are filtered out from content or irreversibly distorted in the transmission of that content over distances nevertheless can cause large-magnitude impacts over significantly larger spatial scopes.
- that so-called 'natural abstractions' represented within the mind of a distant observer cannot be used to accurately and comprehensively simulate the long-term consequences of chaotic interactions between tiny-scope, tiny-magnitude (below measurement threshold) changes in local conditions.
- that abstractions cannot capture phenomena that are highly sensitive to such tiny changes except as post-hoc categorizations/analysis of the witnessed final conditions.
- where, given actual microstate amplification phenomena associated with all manner of non-linear phenomena, particularly those commonly observed in all sorts of complex systems, up to and especially including organic biological humans, it *can* be legitimately claimed – based on there being a kind of hard randomness associated with the atomic physics underlying all of the organic chemistry – that in fact (more than in principle), humans (and AGI) are inherently unpredictable, in at least some aspect, *all* of the time.
No, the way I used the term was to point to robust abstractions to ontological concepts. Here's an example: Say 1+1=A. A here obviously means 2 in our language, but it doesn't change what A represents, ontologically. If A+1=4, then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect [LW · GW], and it is very unlikely that any ontological shift will result in such a break in world model capabilities.
Math is a robust abstraction. "Natural abstractions", as I use the term, points to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more towards representing the objects in question with these abstractions.
Meaning that even *if* AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery’s functional components with connected physical surroundings.
That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.
However, those natural abstractions are still leaky in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of eg. a wheel identified to exist in the outside world.
In this sense, whatever natural abstractions AGI would use that allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions in their modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features in the outside world.
This point I'm sure is obvious to you. But it bears repeating.
That seems like a claim about the capabilities of arbitrarily powerful AI systems,
Yes, or more specifically: about fundamental limits of any AI system to control how its (side)-effects propagate and feed back over time.
one that relies on chaos theory or complex systems theory.
Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).
I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult".
I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post.
The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away.
You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with shifting connected surroundings of a more physically complex outside world [EA(p) · GW(p)].
I don't mind whether that's framed as "AGI redesigns a successor version of their physically instantiated components" or "AGI keeps persisting in some modified form".
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.
Actually, that is switching to reasoning about something else.
Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world, so as to stay safe to humans.
And with that switch, you are not addressing Nate Soares' point [LW · GW] that "capabilities generalize better than alignment".
Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.
This was a good post summarizing a lot of things. I would point out, though, that even if there was almost no progress made in 20 years, that's not actually strong evidence of impossibility. Foundational problems of other fields with many more researchers and much more funding sometimes last much longer than that.
I meant that I see most humans as aligned with human values such as happiness and avoiding suffering. The point I'm trying to make is that human minds are able to represent these concepts internally and act on them in a robust way and therefore it seems possible in principle that AIs could too.
I'm not sure whether humans are aligned with evolution. Many humans do want children, but I don't think many are fitness maximizers who want as many as possible.
Firstly, humans are unable to self modify to the degree that an AGI will be able to. It is not clear to me that a human given the chance to self modify wouldn't immediately wirehead. An AGI may require a higher degree of alignment than what individual humans demonstrate.
Second, it is surely worth noting that humans aren't particularly aligned to their own happiness or avoiding suffering when the consequences of their action are obscured by time and place.
In the developed world humans make dietary decisions that lead to horrific treatment of animals, despite most humans not being willing to torture an animal themselves.
It also appears quite easy for the environment to trick individual humans into making decisions that increase their suffering in the long term for apparent short term pleasure. A drug addict is the obvious example, but who among us can say they haven't wasted hours of their lives browsing the internet etc.
To what extent are humans by themselves evidence of GI alignment, though? A human can acquire values that disagree with those of the humans that taught them those values just by having new experiences/knowledge, to the point of even desiring completely opposite things to their peers (like human progress VS human extinction), doesn't that mean that humans are not robustly aligned?
I would probably define AGI first, just because, and I'm not sure about the idea that we are "competing" with automation (which is still just a tool conceptually right?).
We cannot compete with a hammer, or a printing press, or a search engine. Oof. How to express this? Language is so difficult to formulate sometimes.
If you think of AI as a child, it is uncontrollable. If you think of AI as a tool, of course it can be controlled. I think a corp has to be led by people, so that "machine" wouldn't be autonomous per se…
Guess it's all about defining that "A" (maybe we use "S" for synthetic or "S" for silicon?)
Well and I guess defining that "I".
Dang. This is for sure the best place to start. Everyone needs to be as certain as possible (heh) they are talking about the same things. AI itself as a concept is like, a mess. Maybe we use ML and whatnot instead even? Get real specific as to the type and everything?
I dunno but I enjoyed this piece! I am left wondering, what if we prove AGI is uncontrollable but not that it is possible to create? Is "uncontrollable" enough justification to not even try, and moreso, to somehow [personally I think this impossible, but] dissuade people from writing better programs?
I'm more afraid of humans and censorship and autonomous policing and whathaveyou than "AGI" (or ASI)
It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it could need large amounts of computational power afforded only by superhuman narrow AI.
There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (i.e. OpenAI’s interpretability research).
I don’t trust older alignment research much as an outsider. It seems to me that Yud has built a cult of personality around AI dooming and thus is motivated to find reasons for alignment not being possible. And most of his followers treat his initial ideas as axiomatic principles and don’t dare to challenge them. And lastly most past alignment research seems to be made by those followers.
Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
For example this is an argument that has been convincingly disputed to varying levels (warning shots, incomputability of most plans of danger) but it is still treated as a fundamental truth on this site.
and thus is motivated to find reasons for alignment not being possible.
I don’t get this sense.
More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.
I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a complicated chess game is unsound. But I would not attribute the reasoning in eg. the AGI Ruin piece to Yudkowsky’s cult of personality.
dangerous AI systems
I was gesturing back at “AGI” in the previous paragraph here, and something like precursor AI systems before “AGI”.
Thanks for making me look at that. I just rewrote it to “dangerous autonomous AI systems”.