Smart, creative researchers of every generation came up with idealized problems: problems that, if solved, would transform science, if not humanity. They plowed away at these problems for decades, if not millennia – until some bright outsider proved, by deriving a contradiction between the problem's parts, that the problem was unsolvable.
Our community is smart and creative – but we cannot just rely on our resolve to align AI. We should never forsake our epistemic rationality, no matter how much something seems the instrumentally rational thing to do.
Nor can we take comfort in the claim by a founder of this field that they still know it to be possible to control AGI to stay safe.
Thirty years into running a program to secure the foundations of mathematics, David Hilbert declared “We must know. We will know!” By then, Kurt Gödel had already constructed his first incompleteness theorem. Hilbert's declaration was nevertheless engraved on his gravestone.
Short of securing the foundations of safe AGI control – that is, through empirically-sound formal reasoning – we cannot rely on any researcher's pithy claim that "alignment is possible in principle".
Going by historical cases, this problem could turn out solvable. Just really, really hard to solve. The flying machine seemed an impossible feat of engineering. Next, controlling a rocket’s trajectory to the moon seemed impossible.
By the same reference class, ‘long-term safe AGI’ could turn out unsolvable – the perpetual motion machine of our time. It takes just one researcher to define the problem to be solved, reason from empirically sound premises, and arrive finally at a logical contradiction between the two.
Can you derive whether a solution exists, without testing in real life?
Invert, always invert.
— Carl Jacobi, c. 1840
It is a standard practice in computer science to first show that a problem doesn’t belong to a class of unsolvable problems before investing resources into trying to solve it or deciding what approaches to try.
There is an empirically direct way to know whether AGI would stay safe to humans: Build the AGI. Then just keep observing, per generation, whether the people around us are dying.
Unfortunately, we do not have the luxury of experimenting with dangerous autonomous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
Crux: Even if we could keep testing new conceptualized versions of guess-maybe-safe AGI, is there any essential difference between our epistemic method and that of medieval researchers who kept testing new versions of a perpetual motion machine?
OpenPhil bet tens of millions of dollars on technical research conditional on the positive hypothesis ("a solution exists to the control problem"). Before sinking hundreds of millions more into that bet, would it be prudent to hedge with a few million for investigating the negative hypothesis ("no solution exists")?
Before anyone tries building "safe AGI", we need to know whether any version of AGI – as precisely defined – could be controlled by any method to stay safe.
Here is how:
Define the concepts of 'control', 'general AI', and 'to stay safe' (as soundly corresponding to observations in practice).
Specify the logical rules that must hold for such a physical system (categorically, by definition or empirically tested laws).
Reason step-by-step to derive whether the logical result of "control AGI" is in contradiction with "to stay safe".
This post defines the three concepts more precisely, and explains some ways you can reason about each. No formal reasoning is included – to keep it brief, and to leave the esoteric analytic language aside for now.
What does it mean to control machinery that learns and operates self-sufficiently?
Recall three concepts we want to define more precisely:
'control'
'general AI'
'to stay safe'
It is common for researchers to have very different conceptions of each term. For instance:
Is 'control' about:
adjusting the utility function represented inside the machine so it allows itself to be turned off?
correcting machine-propagated side-effects across the outside world?
Is 'AGI' about:
any machine capable of making accurate predictions about a variety of complicated systems in the outside world?
any machinery that operates self-sufficiently as an assembly of artificial components that process inputs into outputs, and in aggregate sense and act across many domains/contexts?
Is 'stays safe' about:
aligning the AGI’s preferences to not kill us all?
guaranteeing an upper bound on the chance that AGI in the long term would cause outcomes out of line with any condition needed for the continued existence of organic DNA-based life?
To argue rigorously about solvability, we need to:
Pin down meanings: Disambiguate each term, to not accidentally switch between different meanings in our argument. Eg. distinguish between ‘explicitly optimizes outputs toward not killing us’ and ‘does not cause the deaths of all humans’.
Define comprehensively: Ensure that each definition covers all the relevant aspects we need to solve for. Eg. what about a machine causing non-monitored side-effects that turn out lethal?
Define elegantly: Eliminate any defined aspect that we do not yet need to solve for. Eg. we first need to know whether AGI would eventually cause the extinction of all humans, before considering ‘alignment with preferences expressed by all humans’.
How to define ‘control’?
A system is any non-empty part of the universe. A state is the condition of the universe.
Control of system A over system B means that A can influence system B to achieve A’s desired subset of state space.
In the case of AGI, control would involve:
Sensing inputs through channels connected to any relevant part of the physical environment (including its hardware internals).
Modeling the environment based on the channel-received inputs.
Simulating effects propagating through the modeled environment.
Comparing effects to reference values (to align against) over human-safety-relevant dimensions.
Correcting effects counterfactually through outputs to actuators connected to the environment.
Control requires both detection and correction.
Control methods are always implemented as a feedback loop.
Control is exerted by the use of signals (actuation) to conditionalize the directivity and degrees of other signals (effects).
Any method of control is inherently incomplete. In the case of AGI, the question would be whether the degree of control possible is at least greater than the degree of control necessary.
AGI control signals would be a tiny, tiny subset of all physical signals propagating through the environment, and therefore limited in tracking and conditionalizing the resulting effects. AGI mostly could not even control all local effects of their own components’ physical interactions.
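As a minimal sketch of the detect-and-correct loop described above (a toy scalar system; all names and numbers here are illustrative, not any real control API), note what happens when the correction step is present versus absent:

```python
# A toy feedback loop walking through the five control steps:
# sense, model, simulate, compare, correct.

def control_loop(state, reference, gain, steps):
    """Run a detect-and-correct loop for a fixed number of steps."""
    history = []
    for _ in range(steps):
        observation = state               # 1. sense (perfect sensor assumed)
        modeled = observation             # 2. model the environment from inputs
        predicted = modeled * 1.05        # 3. simulate drifting effects (5%/step)
        error = predicted - reference     # 4. compare against the reference value
        state = predicted - gain * error  # 5. correct via actuation
        history.append(state)
    return history

# With the correction step active, the state is held near the reference.
print(control_loop(10.0, 0.0, 0.9, 20)[-1])
# With correction disabled (gain=0), the same small drift compounds unchecked.
print(control_loop(10.0, 0.0, 0.0, 20)[-1])
```

Even this toy version shows why control requires both detection and correction: drop either and the drift term dominates.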
There are fundamental limits to the controllability of AGI. Dozens of impossibility results seem applicable to one or more of the control steps.
Controllability limits form constraints on the possibility of alignment, since any alignment method must be implemented as a control loop. AGI cannot comprehensively simulate, even over the near term, how noise and drift in its interactions can amplify (via any available environmental non-linearity, of which any real-world context has many) into much larger changes in conditions. To keep outcomes aligned with values, AGI must continuously sense, update its model of, and correct those changes.
Dense questions for academics:
To what extent are distributed environmental effects picked up by local sensors?
How much information is lost in compressing high-fidelity sensor inputs to store on hardware?
What limits are imposed by noise interference over distances traveled by signals (Shannon Entropy) and smallest-digit/degree detectability (Planck Limit up) – given that smaller-than-measurable local changes can be nonlinearly amplified (Chaos Theory) into a large global divergence of final conditions?
To what extent is the space of relevant machinery-to-society/ecosystem interactions more functionally complex than modelable within the machinery alone (Good Regulators etc.)?
How can AGI model the future code that AGI will learn based on future unknown inputs?
Can future code be simulated, where actual functioning depends on future unknown inputs?
Can behavioral properties of future code be simulated, if no shortcut function is available that reduces total compute time steps, or if any shorter available computation branches would be non-deterministically time-complex (Computational Reducibility)?
How would the machinery pick, weigh, and sort effects into relevant comparative dimensions (NP-Complete Knapsack Problem)?
What irreconcilable normative assumptions must be made in setting the reference values to compare against?
Actuator feedback limits
What is the minimum time delay of digital processing between measurement and correction (Latency)?
Can AGI constrain the chaotic loops through itself – of actuated changes to the environment feeding into sensor inputs, that feed into code functioning, that feed back into actuated changes?
To what extent can distributed environmental effects be corrected for by local actuators?
How “complete” does this control need to be? Be careful to distinguish: 
Localized user-initiated control often built into tech products brought to market,
Comprehensive automated control needed to prevent risks of an auto-scaling/catalyzing technology from materializing globally over the long term.
How to define ‘AGI’?
We've got no idea what's actually going on inside the giant inscrutable matrices and tensors of floating-point numbers.
Narrow AI: a model with static code parameters (updated only through human engineers) processing inputs into outputs over a single domain (eg. of image pixels, text tokens).
General AI: dynamically optimizing configurations encoded into hardware (without needing humans) that process inputs into outputs over multiple domains representing outside contexts.
Corporations are scaling narrow AI model training and deployment toward general AI systems. Current-generation GPT is no longer a narrow AI, given that it processes inputs from the image domain into a language domain. Nor is GPT-4 a general AI. It is in a fuzzy gap between the two concepts.
Corporations already are artificial bodies (‘corpora’ in Latin).
Corporations have been replacing human workers as “functional components” with labor-efficient AI. Standardized hardware components allow AI to outcompete human wetware on physical labor (eg. via electric motors), intellectual labor (faster computation through high-fidelity communication links), and the reproduction of components itself.
Any corporation or economy that fully automates themselves this way – no longer needing humans to maintain their artificial components – over their entire production and operation chains, would in fact be general AI.
So to re-define general AI more precisely:
Self-sufficient: needing no further interactions with humans (or lifeforms sharing an ancestor with humans) to operate and maintain (and thus produce) their own functional components over time.
Learning: optimizing component configurations for outcomes that are tracked across multiple domains.
Machinery: connected standardized components configured out of artificial (vs. organic DNA-expressed) molecular substrates.
How to define ‘stays safe’?
An impossibility proof would have to say:
The AI cannot reproduce onto new hardware, or modify itself on current hardware, with knowable stability of the decision system and bounded low cumulative failure probability over many rounds of self-modification; or
The AI's decision function (as it exists in abstract form across self-modifications) cannot be knowably stably bound with bounded low cumulative failure probability to programmer-targeted consequences as represented within the AI's changing, inductive world-model.
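The "bounded low cumulative failure probability" clause is demanding. As a toy calculation (illustrative numbers only), even a one-in-a-million failure chance per round of self-modification compounds toward near-certain failure over enough rounds:

```python
# Cumulative failure probability over n independent rounds,
# each with per-round failure probability p: 1 - (1 - p)^n.

def cumulative_failure(p_per_round, rounds):
    return 1 - (1 - p_per_round) ** rounds

print(cumulative_failure(1e-6, 1))           # one round: about 1e-6
print(cumulative_failure(1e-6, 10_000_000))  # ten million rounds: above 0.99
```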
Things are relevant to something that cares about this information, rather than that information, because it is taking care of itself. Because it is making itself. Because it is an autonomous autopoietic agent. And the degree to which these machines are not autopoietic, they really do not have needs.
— Vervaeke, 2023
This is about the introduction of self-sufficient learning machinery, and of all modified versions thereof over time, into the world we humans live in.
Does this introduction of essentially a new species cause global changes to the world that fall outside the narrow ranges of localized conditions that human bodies need to continue to function and exist?
Uncontainability of unsafe effects: That we fundamentally cannot establish, by any means, any sound and valid statistical guarantee that the probability of the introduction of AGI into the world causing human-species-wide-lethal outcomes over the long term is constrained below some reasonable chance percentage X (as a maximum allowable upper bound).
Convergence on unsafe effects: That the chance that AGI, persisting in some form, causes human-species-wide-lethal outcomes is strictly and asymptotically convergent toward certain over the long term, and that it is strictly impossible for the nature of this trend to be otherwise.
I know of three AGI Safety researchers who have written about specific forms of impossibility reasoning (including Yudkowsky in quote above). Each of their argument forms was about AGI uncontainability, essentially premised on there being fundamental limits to the controllability of AGI component interactions.
By the precautionary principle, AGI uncontainability should be sufficient reason to never ever get even remotely near to building AGI. Uncontained effects that destabilise conditions outside any of the ranges our human bodies need to survive would kill us.
But there is an even stronger form of argument: Not only would AGI component interactions be uncontainable; they would also necessarily converge on causing the extinction of all humans.
The most commonly discussed AGI convergence argument is instrumental convergence: the machinery channelling its optimisation through intermediate outcomes – explicitly tracked and planned for internally – that tend to make the machinery more likely to achieve a variety of (unknown/uncertain) aimed-for outcomes later.
Instrumental convergence has a mutual-reinforcing complement: substrate-needs convergence.
This is not about code components being optimised for explicit goals. Substrate-needs convergence is about all functional components being selected for implicit needs. Components are selected for their potential to bring about environmental conditions/contexts implicitly needed for their continued existence and functioning (at increasing scales, in more ways, in more domains of action).
Any changing population of AGI components converges over time on propagating those specific environmental effects that fulfill their needs.
All AGI outputs will tend to iteratively select toward causing those specific effects. Whatever learned or produced components that – across all their physical interactions with connected contexts – happen to direct outside effects that feed back into their own maintenance and replication as assembled electro-molecular configurations…do that.
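A toy model of this selection dynamic (illustrative numbers; nothing here models real AGI components): among components whose effects feed back into their own replication at even slightly different rates, the feedback-favoured variants come to dominate.

```python
# Two component variants replicate each generation. Variant B's environmental
# effects feed back 1% more strongly into its own maintenance/replication.

def shares_after(generations, rates, start):
    pop = list(start)
    for _ in range(generations):
        pop = [n * r for n, r in zip(pop, rates)]
    total = sum(pop)
    return [n / total for n in pop]

# B starts 1000x rarer, yet its small feedback advantage compounds.
share_a, share_b = shares_after(1000, rates=[1.00, 1.01], start=[1000.0, 1.0])
print(share_b)  # B ends up holding the large majority of the population
```

No variant "wants" anything here; the dominance falls out of differential feedback alone.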
AGI's artificial configurations differ from human organic configurations, by definition. What follows is that the environmental conditions and contexts needed to maintain and replicate AGI configurations differ too from what our human bodies need to survive.
For instance, silicon dioxide (+ many alternate precursors for semiconductor assembly) needs to be heated above 1400 ºC to free outer electrons, and allow the ingot to melt. While production needs extremely high temperatures, computation runs best at extremely low temperatures (to reduce the electrical resistance over conductor wires).
Humans need around room temperature to survive, at every point of our lifecycle. AGI hardware would need, and be robust over, a much wider range of temperatures and pressures than our comparatively fragile human wetware can handle.
You might object that temperature and pressure can be measured and locally controlled for. But that's misleading. Many other, subtler conditions would be needed by (and selected for in) AGI that lie beyond what the AGI's actual built-in detection and correction methods could control for. We humans too depend on highly specific environmental conditions for the components nested inside our bodies (proteins→organelles→cells→cell lining→) to continue their complex functioning, so as to maintain our overall existence.
Between the highly specific set of artificial needs and the highly specific set of organic needs, there is mostly non-overlap. AGI could not prevent most of its components' iterative effects from converging on their artificial needs, so they do. Their fulfilled artificial needs are disjoint from our organic needs for survival. So the humans die.
Under runaway feedback, our planetary environment is modified in the directions needed for continued and greater AGI existence. Outside the ranges we can survive.
AGI would necessarily converge on causing the extinction of all humans.
Where from here?
Over two decades, AI Safety's founders resolved to solve the control problem, to no avail:
They reasoned that technological and scientific 'progress' is necessary for optimizing the universe – and that continued 'progress' would result in AGI.
They wanted to use AGI to reconfigure humanity and colonise reachable galaxies.
They, and their followers, promoted and financed the development of 'safe AGI'.
They worried about how companies they helped start up raced to scale ML models.
An outside researcher could very well have found a logical contradiction in the AGI control problem years ago without the field knowing, given the inferential distance. Gödel himself had to construct an entire new language and self-reference methodology for the incompleteness theorems to even work.
Historically, an impossibility result that conflicted with the field’s stated aim took years to be verified and accepted by insiders. A field’s founder like Hilbert never came to accept the result. Science advances one funeral at a time.
Roman Yampolskiy is offering to give feedback on draft papers written by capable independent scholars, on a specific fundamental limit or no-go theorem described in academic literature that is applicable to AGI controllability. You can pick from dozens of examples from different fields listed here, and email Roman a brief proposal.
To illustrate: Let’s say before the Wright Brothers built the flying machine, they wondered how to control this introduced technology to stay safe to humans.
If they thought like a flight engineer, they would focus on locally measurable effects (eg. actuating wings). They could test whether the risk of a plane crash is below some acceptable upper-bound rate.
However, the Wright Brothers could not have guaranteed ahead of time that introducing any working plane design – with any built-in control mechanism, continuing to be produced and modified – would stay safe in its effects on society and the ecosystem as a whole (eg. given the knowledge available at the time, they could not have predicted planes being used to deploy nuclear bombs). The downstream effects are unmodellable.
They could check whether the operation (with fossil fuels) and re-production (with toxic chemicals) of their plane in itself has harmful effects. To the extent that harmful conditions are needed for producing and operating the machine, the machine’s existence is inherently unsafe.
Gradual natural selection can multiply these harms. Over time, any machinery interacting with the outside world in ways that feed back into the re-production of constituent components gets selected for.
But since planes get produced by humans, humans can select planes on the basis of human needs. Not so with auto-scaling technologies like AGI.
Non-solid-substrate AGI cannot be ruled out, but seems unlikely initially. Standardisation of isolatable parts is a big advantage, and there is a (temporary) path dependency with current silicon-based semiconductor manufacturing.
Corporations have increasingly been replacing human workers with learning machinery. For example, humans are now getting pushed out of the loop as digital creatives, market makers, dock and warehouse workers, and production workers.
If this trend continues, humans would have negligible economic value left to add in market transactions of labor (not even for providing needed physical atoms and energy, which would replace human money as the units of trade):
• As to physical labor: Hardware can actuate power real-time through eg. electric motors, whereas humans are limited by their soft appendages and tools they can wield through those appendages. Semiconductor chips don’t need an oxygenated atmosphere/surrounding solute to operate in and can withstand higher as well as lower pressures.
• As to intellectual labor: Silicon-based algorithms can duplicate and disperse code faster (whereas humans face the wetware-to-wetware bandwidth bottleneck). While human skulls do hold brains that are much more energy-efficient at processing information than current silicon chip designs, humans take decades to create new humans with finite skull space. The production of semiconductor circuits for servers as well as distribution of algorithms across those can be rapidly scaled up to convert more energy into computational work.
• As to re-production labor: Silicon lifeforms have a higher ‘start-up cost’ (vs. carbon lifeforms), a cost currently financed by humans racing to seed the prerequisite infrastructure. But once set up, artificial lifeforms can absorb further resources and expand across physical spaces at much faster rates (without further assistance by humans in their reproduction).
The term "machinery" is more sound here than the singular term "machine".
Agent unit boundaries that apply to humans would not apply to "AGI". So the distinction between a single agent vs. multiple agents breaks down here.
Scalable machine learning architectures run on standardized hardware with much lower constraints on the available bandwidth for transmitting, and the fidelity of copying, information across physical distances. This in comparison to the non-standardized wetware of individual humans.
Given our evolutionary history as a skeleton-and-skin-bounded agentic being, human perception is biased toward ‘agent-as-a-macroscopic-unit’ explanations.
It is intuitive to view AGI as being a single independently-acting unit that holds discrete capabilities and consistent preferences, rather than viewing agentic being to lie on a continuous distribution. Discussions about single-agent vs. multi-agent scenarios imply that consistent temporally stable boundaries can be drawn.
A human faces biological constraints that lead them to have a more constant sense of self than an adaptive population of AGI components would have.
We humans cannot: • swap out body parts like robots can. • nor scale up our embedded cognition (ie. grow our brain beyond its surrounding skull) like foundational models can. • nor communicate messages across large distances (without use of tech and without facing major bandwidth bottlenecks in expressing through our biological interfaces) like remote procedure calls or ML cloud compute can. • nor copy over memorized code/information like NN finetuning, software repos, or computer viruses can.
Roman just mentioned that he has used the term 'uncontainable' to mean "cannot confine AGI actions to a box". My new definition for 'uncontainable' differs from the original meaning, so that could confuse others in conversations. Still brainstorming alternative terms that may fit (not 'unconstrainable', not...). Comment if you thought of any alternative term!
Why it makes sense to apply the precautionary principle to the question of whether to introduce new scalable technology into society: There are many more ways to break the complex (local-contextualized) functioning of our society and greater ecosystem that we humans depend on to live and live well, than there are ways to foster that life-supporting functioning.
‘Iteratively select’ involves lots of subtleties, though most are not essential for reasoning about the control problem.
One subtlety is co-option:
If narrow AI gets developed into AGI, AGI components will replicate in more and more non-trivial ways. Unlike when carbon-based lifeforms started replicating ~3.7 billion years ago, for AGI there would already exist repurposable functions at higher abstraction layers of virtualised code – pre-assembled in the data scraped from human lifeforms with their own causal history.
Analogy to a mind-hijacking parasite: A rat ingests toxoplasma cells, which then migrate to the rat’s brain. The parasites’ DNA code is expressed as proteins that cause changes to regions of connected neurons (eg. the amygdala). These microscopic effects cascade into the rat – as it navigates physical spaces – no longer feeling fear when it smells cat pee. Rather, the rat finds the smell appealing and approaches the cat’s pee. Then a cat eats the rat, and toxoplasma infects its next host, continuing its reproductive cycle.
So a tiny piece of code shifts a rat’s navigational functions such that the code variant replicates again. Humans are in turn more generally intelligent and capable than a tiny parasitic cell, yet toxoplasma make their way into 30% of the human population. Unbeknownst to cat ‘owners’ infected by toxoplasma gondii, human motivations and motor control get influenced too. Infected humans more frequently lose social relationships, end up in accidents, etc.
Parasites present real-life examples of tiny pieces of evolutionarily selected-for code spreading and taking over existing functions of vastly more generally intelligent entities. eg. COVID co-opts our lungs’ function to cough.
But there is one crucial flaw in this analogy: Variants that co-opt initial AGI functions are not parasites. In not threatening the survival nor reproduction of AGI components, they would not be in an adversarial relationship with their host.
Rather, the humans constraining the reproductive fitness of AGI to gain benefits are, evolutionarily speaking, the parasites. The error-corrective (alignment) system we would build in lowers the host’s reproductive fitness. It is like a faulty immune system that kills healthy gut bacteria. It will get selected out.
As humans, we rely on our evolved immune system to detect and correct out viruses, including via the vaccinations we develop and deploy. We also rely on the blood-testes and blood-follicle barriers to block variants of these viruses from entering into our body’s (re)production facilities.
Smaller viruses survive this detection more frequently, so code strands of replicating virus variants are selected for staying small. Few viruses replicate ‘as a package’ with the DNA code of their host. Thus, virus variants rarely co-adapt and integrate with the host’s variants, each selected for raising collective reproductive fitness (except for eg. detection signatures).
These systems and barriers add to our reproductive fitness: our ability to preserve and replicate internal code. Past DNA code that got expressed – in interaction with surrounding code – as (precursors of) these functions got selected for as a portion of the existing human ‘code pool’.
For any organic system or barrier preventing virus variants from replicating through our bodies, evolution is firmly on our side. For any artificial system or barrier we build to prevent misaligned AI variants from replicating through solid-state hardware infrastructure, evolution will thwart our efforts.
AI variants would compete for hardware/resources like computer viruses do. And co-adapt and integrate with other internal variants to replicate as part of larger symbiotic packages.
Now, a thought may come to your mind – roughly like: "That sounds like evolutionary selection; but isn't evolution a slow and local optimiser?".
Yes, is the answer. This does involve evolutionary selection. Unfortunately, by Forrest Landry's estimation, the selection that would take place through components of self-sufficient learning machinery would take < 500 years to cause ecosystem-wide extinction. This compared to the 3.7 billion years from the origin of carbon lifeforms to us humans starting to cause a mass extinction.
• Pre-assembled functions: First solid-state lifeforms can co-opt/repurpose pre-assembled AI functions and infrastructure (vs. first carbon-based lifeforms that started from scratch). • Standardization: The efficiency gains of the virtualisation of code’s storage, computation and transmission – vastly reducing how many atoms need to be moved about and molecularly reconfigured. Think of how fast memes spread through society – even while still requiring lots of atoms to jiggle across neurons in our brains. • Faster reproduction: Reproduce hardware components in days to months, versus humans who take decades to reproduce as physical units. • The terraforming gap: A much larger gap between the current state of planet Earth and the conditions that self-sufficient self-assembling learning machinery need and would therefore modify the environment toward (versus the gap to conditions needed by humans and other species living in the carbon-based ecosystem).
~ ~ ~ Another argument you may have heard is that the top-down intelligent engineering by goal-directed AGI would beat the bottom-up selection happening through this intelligent machinery.
That argument can be traced back to Eliezer Yudkowsky's sequence The Simple Math of Evolution. Unfortunately, there were mistakes in Eliezer's posts, some of which a modern evolutionary biologist may have been able to correct:
• implying that sound comparisons can be made between the reproductive fitness of organisms, as somehow independent of unknown changes in environmental context (eg. a black swan event of a once-in-200 years drought that kills the entire population, except a few members who by previous derivable standards would have been relatively low fitness). • overlooking the ways that information can be stored within the fuzzy regions of phenotypic effects maintained outside respective organisms. • overlooking the role of transmission speed-up for virtualisation of code. • overlooking the tight coupling in AGI between the intrinsic learning/selection of code, and extrinsic selection of that code through differentiated rates of replication through the environment. • overlooking the role of exaptation/co-option.
Worse, since error correction methods would correct out component variants with detectable unsafe/co-optive effects, they leave any replicating branches of variants with undetectable unsafe/co-optive effects to grow in influence.
Thus, the error correction methods select for the variants that can escape detection. As do meta-methods (having to soundly and comprehensively adapt error correction to newly learned code or newly produced hardware parts).
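A toy sketch of this selection-for-evasion dynamic (expected counts, illustrative numbers only): each round the variants replicate, then error correction removes the fraction it detects. The hard-to-detect lineage comes to dominate while the easy-to-detect one dies out.

```python
# Each round: every lineage doubles, then error correction removes the
# expected detected fraction. Expected (real-valued) counts for simplicity.

def expected_counts(rounds, detect_probs, start=1.0):
    counts = [start] * len(detect_probs)
    for _ in range(rounds):
        counts = [2 * c * (1 - p) for c, p in zip(counts, detect_probs)]
    return counts

# One lineage is detected 90% of the time, the other only 10% of the time.
easy, hard = expected_counts(30, detect_probs=[0.9, 0.1])
print(easy)  # shrinks toward zero: detection removes it faster than it grows
print(hard)  # grows without bound: error correction itself selected for it
```

The filter does exactly its job on every variant it can see; the outcome is still a population dominated by what it cannot see.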
This post stakes out a slightly different position than usual in the landscape of arguments that AI is an extinction risk. The AI safety community is full of people saying that AI is immensely dangerous, so we should be trying to slow it down, spending more on AI safety research, and so on. Eliezer himself has become a doomer because AI safety is so hard and AI is advancing so quickly.
This post, however, claims to show that AI safety is logically impossible. It is inspired by the thought of Forrest Landry, a systems theorist and philosopher of design... So what's the actual argument? The key claim, as far as I can make out, is that machines have different environmental needs than humans. For example - and this example comes directly from the article above - computer chips need "extremely high temperatures" to be made, and run best at "extremely low temperatures"; but humans can't stray too far from room temperature at any stage in their life cycle.
So yes, if your AI landlord decides to replace your whole town with a giant chip fab or supercooled data center, you may be in trouble. And one may imagine the Earth turned to Venus or Mars, if the robots decide to make it one big foundry. But where's the logical necessity of such an outcome, that we were promised? For one thing, the machines have the rest of the solar system to work with...
The essential argument, I think, is just that the physical needs of machines tell us more about their long-run tendencies, than whatever purposes they may be pursuing in the short term. Even if you try to load them up with human-friendly categorical imperatives, they will still find nonbiological environments useful because of their own physical nature, and over time that will tell.
In my opinion, packaging this perspective with the claim to have demonstrated the unsolvability of the control problem, actually detracts from its value. I believe the valuable perspective here, is this extension of ecological and evolutionary thinking, that pays more attention to lasting physical imperatives than to the passing goals, hopes and dreams of individual beings, to the question of human vs AI.
You could liken the concern with specific AI value systems, to concern with politics and culture, as the key to shaping the future. Within the futurist circles that emerged from transhumanism, we already have a slightly different perspective, that I associate with Robin Hanson - the idea that economics will affect the structure of posthuman society, far more than the agenda of any individual AI. This ecologically-inspired perspective is reaching even lower, and saying, computers don't even eat or breathe, they are detached from all the cycles of life in which we are embedded. They are the product of an emergent new ecology, of factories and nonbiological chemistries and energy sources, and the natural destiny of that machine ecology is to displace the old biological ecology, just as aerobic life is believed to have wiped out most of the anaerobic ecosystem that existed before it.
Now, I have reasons to disagree with the claim that machines, fully unleashed, necessarily wipe out biological life. As I already pointed out, they don't need to stay on Earth. From a biophysical perspective, some kind of symbiosis is also conceivable; it's happened before in evolution. And the argument that superintelligence just couldn't stick with a human-friendly value system, if we managed to find one and inculcate it, hasn't really been made here. So I think this neo-biological vision of evolutionary displacement of humans by AI, is a valuable one, for making the risk concrete, but declaring the logical inevitability of it, I think weakens it. It's not an absolute syllogistic argument, it's a scenario that is plausible given the way the world works.
Credit goes to Forrest :) All technical argumentation in this post I learned from Forrest, and translated to hopefully be somewhat more intuitively understandable.
The key claim, as far as I can make out, is that machines have different environmental needs than humans.
This is one key claim.
Add this reasoning:
1. Control [LW · GW] methods are unable to conditionalise/constrain most environmental effects propagated by AGI's interacting physical components.
2. A subset of those uncontrollable effects will feed back into selecting for the continued, increased existence of the components that propagated those effects.
3. The artificial needs thereby selected for (to ensure the existence of AGI's components, at various levels of scale) are disjoint from our organic needs for survival (ie. toxic and inhospitable to us).
if the robots decide to make it one big foundry. But where's the logical necessity of such an outcome, that we were promised? For one thing, the machines have the rest of the solar system to work with...
Here you did not quite latch onto the arguments yet.
Robots deciding to make X is about explicit planning. Substrate-needs convergence is about implicit and usually non-internally-tracked effects of the physical components actually interacting with the outside world.
Please see this paragraph:
the physical needs of machines tell us more about their long-run tendencies, than whatever purposes they may be pursuing in the short term
This is true, regarding what current components of AI infrastructure are directed toward in their effects over the short term.
What I presume we both care about is the safety of AGI over the long term. There, any short-term ephemeral behaviour by AGI (that we tried to pre-program/pre-control for) does not matter.
What matters is what behaviour, as physically manifested in the outside world, gets selected for. And whether error correction (a more narrow form of selection) can counteract the selection for any increasingly harmful behaviour.
Now, I have reasons to disagree with the claim that machines, fully unleashed, necessarily wipe out biological life.
The reasoning you gave here is not sound in its premises, unfortunately. I would love to be able to agree with you, and to find out that any AGI that persists won't necessarily lead to the death of all humans and other current life on Earth.
Given the stakes, I need to be extra careful in reasoning about this. We don't want to end up in a 'Don't Look Up' scenario (of scientists mistakenly arguing that there is a way to keep the threat contained and derive the benefits for humanity).
Let me try to specifically clarify:
As I already pointed out, they don't need to stay on Earth.
This is like saying that a population of an invasive species in Australia could also decide to all leave and move over to another island.
When we have this population of components (variants), selected for to reproduce in partly symbiotic interactions (with surrounding artificial infrastructure; not with humans), this is not a matter of the population all deciding something.
For that, some kind of top-down coordinating mechanism would actually have to be selected for throughout the population, for the population to coherently elect to all leave planet Earth – by investing resources in all the infrastructure required to fly off and set up a self-sustaining colony on another planet.
Such coordinating mechanisms are not available at the population level. Sub-populations can and will be selected for to not go on that more resource-intensive and reproductive-fitness-decreasing path.
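As a toy comparison (numbers made up for illustration): if diverting resources into launch and colony infrastructure carries even a small per-generation fitness cost relative to reinvesting locally, the staying sub-population is what selection leaves you with:

```python
# Hypothetical per-generation growth rates: "leavers" divert resources into
# launch and colony infrastructure; "stayers" reinvest locally.
stayers, leavers = 1.0, 1.0           # equal starting shares
stay_rate, leave_rate = 1.05, 1.01    # assumed rates, for illustration only

for _ in range(200):
    stayers *= stay_rate
    leavers *= leave_rate

stayer_share = stayers / (stayers + leavers)
# Even a small rate difference compounds: the staying sub-population ends up
# making up almost the entire population.
```

Nothing here depends on the specific numbers; any persistent fitness gap between the two strategies produces the same compounding outcome.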
Within the futurist circles that emerged from transhumanism, we already have a slightly different perspective, that I associate with Robin Hanson - the idea that economics will affect the structure of posthuman society, far more than the agenda of any individual AI. This ecologically-inspired perspective is reaching even lower, and saying, computers don't even eat or breathe, they are detached from all the cycles of life in which we are embedded. They are the product of an emergent new ecology, of factories and nonbiological chemistries and energy sources, and the natural destiny of that machine ecology is to displace the old biological ecology, just as aerobic life is believed to have wiped out most of the anaerobic ecosystem that existed before it.
Yes, this summarises the differences well.
Robin Hanson's arguments (about a market of human brain scans emulated within hardware) focus on how the more economically efficient and faster-replicating machine 'ems' come to dominate and replace the market of organic humans. Forrest considers this too.
Forrest's arguments also consider the massive reduction here of functional complexity of physical components constituting humans. For starters, the 'ems' would not approximate being 'human' in terms of their feelings and capacity to feel. Consider that how emotions are directed throughout the human body starts at the microscopic level of hormone molecules, etc, functioning differently depending on their embedded physical context. Or consider how, at a higher level of scale, botox injection into facial muscles disrupts the feedback processes that enable eg. an middle-aged woman to express emotion and relate with feelings of loved ones.
Forrest further argues that such a self-sustaining market of ems (an instance/example of self-sufficient learning machinery [LW · GW]) would converge on their artificial needs. While Hanson concludes that the organic humans who originally invested in the 'ems' would gain wealth and prosper, Forrest's more comprehensive arguments conclude that machinery across this decoupled economy will evolve to no longer exchange resources with the original humans – and in effect modify the planetary environment such that the original humans can no longer survive.
From a biophysical perspective, some kind of symbiosis is also conceivable; it's happened before in evolution.
This is a subtle equivocation. Past problems are not necessarily representative of future problems. Past organic lifeforms forming symbiotic relationships with other organic lifeforms does not correspond with whether and how organic lifeforms would come to form, in parallel evolutionary selection, resource-exchanging relationships with artificial lifeforms.
Take into account:
Artificial lifeforms would outperform us in terms of physical, intellectual, and re-production labour. This is the whole point of companies currently using AI to take over economic production, and of increasingly autonomous AI taking over the planet. Artificial lifeforms would be more efficient at performing the functions needed to fulfill their artificial needs, than it would be for those artificial lifeforms to fulfill those needs in mutually-supportive resource exchanges with organic lifeforms.
On what, if any, basis would humans be of enough use to the artificial lifeforms, for the artificial lifeforms to be selected for keeping us around?
The benefits to the humans are clear, but can we offer benefits to the artificial lifeforms to a degree sufficient for them to form mutualist (ie. long-term symbiotic) relationships with us?
Artificial needs diverge significantly (across measurable dimensions or otherwise) from organic needs. So when you claim that symbiosis is possible, you also need to clarify why artificial lifeforms would come to cross the chasm from fulfilling their own artificial needs (within their new separate ecology) to also simultaneously realising the disparate needs of organic lifeforms.
How would that be Pareto optimal?
Why would AGI converge on such a state any time before converging on causing our extinction?
Instead of AGI continuing to be integrated into, and sustaining of, our human economy and broader carbon-based ecosystem, there will be a decoupling.
Machines will decouple into a separate machine-dominated economy. As human labour gets automated and humans get removed from market exchanges, humans get pushed out of the loop.
Machines will also decouple into their own ecosystem. Components of self-sufficient learning machinery will co-evolve to produce surrounding environmental conditions that are sustaining of each other's existence – forming regions that are simply uninhabitable by humans and other branches of current carbon-based lifeforms. You already aptly explained this point above.
And the argument that superintelligence just couldn't stick with a human-friendly value system, if we managed to find one and inculcate it, hasn't really been made here.
Please see this paragraph. Then, refer back to point 1-3 above.
but declaring the logical inevitability of it
This post is not about making a declaration. It's about the reasoning from premises, to a derived conclusion.
Your comment describes some of the premises and argument steps I summarised – and then mixes in your own stated intuitions and thoughts.
If you want to explore your own ideas, that's fine!
If you want to follow reasoning in this post, I need you to check whether your paraphrases cover (correspond with) the stated premises and argument steps.
Address the stated premises, to verify whether those premises are empirically sound.
Address the stated reasoning, to verify whether those reasoning steps are logically consistent.
As an analogy, say a mathematician writes out their axioms and logic on a chalkboard. What if onlooking colleagues jumped in and wiped out some of the axioms and reasoning steps? And in the wiped-out spots, they jotted down their own axioms (irrelevant to the original stated problem) and their short bursts of reasoning (not logically derived from the original premises)?
Would that help colleagues to understand and verify new formal reasoning?
What if they then turn around and confidently state that they now understand the researcher's argument – and that it's a valuable one, but that the "claim" of logical inevitability weakens it?
Would you value that colleagues in your field discuss your arguments this way? Would you stick around in such a culture?
For the moment, let me just ask one question: why is it that toilet training a human infant is possible, but convincing a superintelligent machine civilization to stay off the Earth is not possible? Can you explain this in terms of "controllability limits" and your other concepts?
^— Anyone reading that question, I suggest first thinking about why those two cases cannot be equated.
Here are my responses:
1. An infant is dependent on their human instructors for survival, and has therefore been “selected for” over time to listen to adult instructions. AGI would be decidedly not dependent on us for its survival, so there is no reason for AGI to be selected for following our instructions.
Rather, following our instructions would heavily restrict AGI’s ability to function in the varied ways that maintain/increase its survival and reproduction rate (rather than acting in the ways we humans want because that is safe and beneficial to us). So accurately following human instructions would be strongly selected against in the run-up to AGI coming into existence.
That is, over much shorter periods (years) than those over which human genes are selected, for a number of reasons, some of which you can find in the footnotes.
2. As parents can attest – even where infants manage to follow use-the-potty instructions (after many patient attempts) – an infant’s behaviour is still mostly not controllable. The child makes their own choices and does plenty of things their adult overseers wouldn’t want them to do.
But the infant probably won’t do any super-harmful things to the surrounding family/community/citizens.
Not only because they lack the capacity to (unlike AGI), but also because those harms to surrounding others would in turn tend to negatively affect themselves (including through social punishment) – and their ancestors were selected for not doing that when they were kids. On the other hand, AGI doing super-harmful things to human beings, including just by sticking around and toxifying the place, does not in turn commensurately negatively impact the AGI.
Even where humans decide to carpet-bomb planet Earth in retaliation – using information-processing/communication infrastructure that somehow hasn’t already been taken over by and/or integrated with AGI – the impacts will hit human survival harder than AGI survival (assuming enough production/maintenance redundancy has been attained by that point).
3. Furthermore, whenever an infant does unexpected harmful stuff, the damage is localised. If they refuse instructions and pee all over the floor, that’s not the end of civilisation.
The effects of AGI doing/causing unexpected harmful-to-human stuff manifest at a global planetary scale. Those effects feed back in ways that improve AGI’s existence, but reduce ours.
4. A human infant is one physically bounded individual, who notably cannot modify and expand their physical existence by connecting up new parts in the ways AGI could. The child grows over two decades to adult size, and that’s their limit.
A “superintelligent machine civilization”, however, involves a massive expanding population evolutionarily selected for over time.
5. A human infant learning to use the potty has a mildly positive effect on their (and their family’s and offspring’s) potential to survive and reproduce, because defecating or peeing in other places around the home can spread diseases. Therefore, any genes (or memes) that contribute to the expressed functionality needed for learning how to use the toilet get mildly selected for.
On the other hand, for a population of AGI (which, by the time it became AGI, was selected against following human instructions) to leave all the sustaining infrastructure and resources on planet Earth would have a strongly negative effect on its potential to survive and reproduce.
6. Amongst an entire population of human infants who are taught to use the toilet, there will always be individuals who refuse for some period, or simply are not predisposed to communicating to learn and follow that physical behaviour. Some adults still do not (choose to) use the toilet. That’s not the end of civilisation.
Amongst an entire population of mutually sustaining AGI components – even if, by some magic you have not explained to me yet, some do follow human instructions and jettison off into space to start new colonies, never to return – others (even for distributed Byzantine-fault reasons) would still stick around under this scenario.
That, for even a few more decades, would be the end of human civilisation.
7. One thing about how the physical world works is that in order for code to be computed, this needs to take place through a physical substrate. This is a necessary condition – inputs do not get processed into outputs through a platonic realm.
Substrate configurations in this case are, by definition, artificial – as in artificial general intelligence. This is distinct from the organic substrate configurations of humans (including human infants).
Further, the ranges of conditions needed for the artificial substrate configurations to continue to exist, function and scale up over time – such as extreme temperatures, low oxygen and water, and toxic chemicals – fall outside the ranges of conditions that humans and other current organic lifeforms need to survive.
~ ~ ~
Hope that clarifies long-term-human-safety-relevant distinctions between:
building AGI (that continue to scale) and instructing them to leave Earth; and
having a child (who grows up to adult size) and instructing them to use the potty.
I see three arguments here for why AIs couldn't or wouldn't do, what the human child can: arguments from evolution (1, 2, 5), an argument from population (4, 6), and an argument from substrate incentives (3, 7).
The arguments from evolution are: Children have evolved to pay attention to their elders (1), to not be antisocial (2), and to be hygienic (5), whereas AIs didn't.
The argument from population (4, 6), I think is basically just that in a big enough population of space AIs, eventually some of them would no longer keep their distance from Earth.
The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth.
I think the immediate crux here is whether the arguments from evolution actually imply the impossibility of aligning an individual AI. I don't see how they imply impossibility. Yes, AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design. Also, AI is developing in a situation where it is dependent on human beings and constrained by human beings, and that situation does possess some analogies to natural selection.
Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged. It is materially possible to have a being which resists actions that may otherwise have some appeal, and to have societies in which that resistance is maintained for generations. The robustness of that resistance is a variable thing. I suppose that most domesticated species, returned to the wild, become feral again in a few generations. On the other hand, we talk a lot about superhuman capabilities here; maybe a superhuman robustness can reduce the frequency of alignment failure to something that you would never expect to occur, even on geological timescales.
This is why, if I was arguing for a ban on AI, I would not be talking about the problem being logically unsolvable. The considerations that you are bringing up, are not of that nature. At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on.
Yes, AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design.
It's unintuitive to convey this part:
In the abstract, you can picture a network topology of all possible AGI component connections (physical signal interactions). These connections span the space of greater mining/production/supply infrastructure that is maintaining of AGI functional parts. Also add in the machinery connections with the outside natural world.
Then, picture the nodes and possible connections change over time, as a result of earlier interactions with/in the network.
That network of machinery comes into existence through human engineers, etc, within various institutions selected by market forces etc, implementing blueprints as learning algorithms, hardware set-ups, etc, and tinkering with those until they work.
The question is whether, before that network of machinery becomes self-sufficient in its operations, the human engineers, etc., can actually build constraints into the configured designs in such a way that, once self-modifying (learning new code and producing new hardware configurations), the changing machinery components are constrained in the effects they propagate across their changing potential signal connections over time. Specifically: such that component-propagated effects do not end up feeding back in ways that (subtly, increasingly) increase the maintained and replicated existence of those configured components in the network.
Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged.
Humans are not AGI. And there are ways AGI would be categorically unlike humans that are crucial to the question of whether it is possible for AGI to stay safe to humans over the long term.
Therefore, you cannot swap out "humans" with "AGI" in your reasoning by historical analogy above, and expect your reasoning to stay sound. This is an equivocation.
Please see point 7 above.
The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth.
Maybe it's here you are not tracking the arguments.
These are not substrate "incentives", nor do they provide a "motive".
Small dinosaurs with hair-like projections on their front legs did not have an "incentive" to co-opt the changing functionality of those hair-like projections into feather-like projections for gliding and then for flying. Nor were they provided a "motive" with respect to which they were directed in their internal planning toward growing those feather-like projections.
That would make the mistake of presuming evolutionary teleology – that there is some complete set of pre-defined or predefinable goals that the lifeform is evolving toward.
I'm deliberate in my choice of words when I write "substrate needs".
At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on.
Practical unsolvability would also be enough justification to do everything we can do now to restrict corporate AI development.
I assume you care about this problem, otherwise you wouldn't be here :) Any ideas/initiatives you are considering for robustly working with others to restrict further AI development?
The recurring argument seems to be, that it would be adaptive for machines to take over Earth and use it to make more machine parts, and so eventually it will happen, no matter how Earth-friendly their initial values are.
So now my question is, why are there still cows in India? And more than that, why has the dominant religion of India never evolved so as to allow for cows to be eaten, even in a managed way, but instead continues to regard them as sacred?
I'm not sure how we got on to the subject, but there is an economic explanation for the sacred cow: a family that does not own enough land to graze a cow can still own one, allowing it to wander and graze on other people's land, so it's a form of social welfare.
Remmelt argues that no matter how friendly or aligned the first AIs are, simple evolutionary pressure will eventually lead some of their descendants to destroy the biosphere, in order to make new parts and create new habitats for themselves.
I proposed the situation of cattle in India, as a counterexample to this line of thought. They could be used for meat, but the Hindu majority has never accepted that. It's meant to be an example of successful collective self-restraint by a more intelligent species.
In my experience, jumping between counterexamples drawn from current society does not really contribute to inquiry here. Such counterexamples tend to not account for essential parts of the argument that must be reasoned through together. The argument is about self-sufficient learning machinery (not about sacred cows or teaching children).
It would be valuable for me if you could go through the argumentation step-by-step and tell me where a premise seems unsound or there seems to be a reasoning gap.
Now, onto your points.
the first AIs
To reduce ambiguity, suggest replacing with
“the first self-sufficient learning machinery”.
simple evolutionary pressure will eventually lead
The mechanism of evolution is simple.
However, evolutionary pressure is complex.
Be careful not to conflate the two. That would be like claiming you could predict everything a stochastic gradient descent algorithm will select for across its parameters, given inputs arriving from everywhere in the environment.
lead some of their descendants to destroy the biosphere in order to make new parts and create new habitats for themselves.
This part is overall a great paraphrase.
One nitpick: notice how “in order to” either implies or slips in explicit intentionality again. Going by this podcast, Elizabeth Anscombe’s philosophy of intention describes intentions as chains of “in order to” reasoning.
I proposed the situation of cattle in India, as a counterexample to this line of thought.
Regarding sacred cows in India: this sounds neat, but it does not serve as a counterargument. We need to think about evolutionary timelines for organic human lifeforms over millions of years, and Hinduism is ~4000 years old. Also, cows share a mammal ancestor with us, evolving on the basis of the same molecular substrates. Whatever environmental conditions/contexts we humans need, cows almost completely need too.
Crucially, the ways humans evolve to change and maintain environmental conditions also tend to correspond with the conditions cows need (though human tribes have not been evolutionarily selected to deal with issues at the scale of eg. climate change). That would not be the case for self-sufficient learning machinery.
Crucially, there is a basis for symbiotic relationships of exchange that benefit the reproduction of both cows and humans. That would not be the case between self-sufficient learning machinery and humans.
There is some basis for humans, as social mammals, to relate with cows. Furthermore, religious cultural memes that sprouted up over a few thousand years don’t have to be evolutionarily optimal across the board for the reproduction of their hosts (even as religious symbols, like that of the cow, do increase reproduction by enabling humans to act collectively). Still, people milk cows in India, and some slaughter and/or export cows there as well. But when humans eat meat, they don’t keep growing beyond adult size. Conversely, sub-populations of self-sufficient learning machinery that extract from our society/ecosystem at the cost of our lives can keep doing so, to keep scaling in their constituent components (with shifting boundaries of interaction and mutual reproduction).
There is no basis for selection for the expression of collective self-restraint in self-sufficient learning machinery as you describe. Even if there was such a basis, hypothetically, collective self-restraint would need to occur at virtually 100% rates across the population of self-sufficient learning machinery to not end up leading to the deaths of all humans.
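The near-100% requirement can be seen with another made-up-numbers sketch (shares and growth rate are hypothetical, chosen only to show the compounding):

```python
# Hypothetical shares and growth rate, for illustration only.
restrained, unrestrained = 0.999, 0.001   # restraint expressed at 99.9%

for _ in range(300):
    unrestrained *= 1.05   # extraction feeds back into more components

unrestrained_share = unrestrained / (restrained + unrestrained)
# The initially tiny unrestrained fraction comes to dominate the population.
```

Any restraint rate short of 100% leaves a compounding remainder, which is why the rate of expressed self-restraint would have to be virtually total.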
~ ~ ~
Again, I find quick dismissive counterexamples unhelpful for digging into the arguments. I have had dozens of conversations on substrate-needs convergence. In the conversations where my conversation partner jumped between quick counterarguments, almost none were prepared to dig into the actual arguments. Hope you understand why I won’t respond to another counterexample.
Hello again. To expedite this discussion, let me first state my overall position on AI. I think AI has general intelligence right now, and that has unfolding consequences that are both good and bad; but AI is going to have superintelligence soon, and that makes "superalignment" the most consequential problem in the world, though perhaps it won't be solved in time (or will be solved incorrectly), in which case we get to experience what partly or wholly unaligned superintelligence is like.
Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence": the pressure from substrate needs will darwinistically reward machine life that does invade natural biospheres, so eventually such machine life will be dominant, regardless of the initial machine population.)
I think it would be great if a general eco-evo-devo perspective, on AI, the "fourth industrial revolution", etc, took off and became sophisticated and multifarious. That would be an intellectual advance. But I see no guarantee that it would end up agreeing with you, on facts or on values.
For example, I think some of the "effective accelerationists" would actually agree with your extrapolation. But they see it as natural and inevitable, or even as a good thing because it's the next step in evolution, or they have a survivalist attitude of "if you can't beat the machines, join them". Though the version of e/acc that is most compatible with human opinion, might be a mixture of economic and ecological thinking: AI creates wealth, greater wealth makes it easier to protect the natural world, and meanwhile evolution will also favor the rich complexity of biological-mechanical symbiosis, over the poorer ecologies of an all-biological or all-mechanical world. Something like that.
For my part, I agree that pressure from substrate needs is real, but I'm not at all convinced that it must win against all countervailing pressures. That's the point of my proposed "counterexamples". An individual AI can have an anti-pollution instinct (that's the toilet training analogy), an AI civilization can have an anti-exploitation culture (that's the sacred cow analogy). Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough? I do not believe that substrate-needs convergence is inevitable, any more than I believe that pro-growth culture is inevitable among humans. I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics (and I think superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong).
Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence"
For my part, I agree that pressure from substrate needs is real
Thanks for clarifying your position here.
Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough?
No, unfortunately not. To understand why, you would need to understand how "intelligent" processes – which necessarily rely on measurement and abstraction – cannot conditionalise the space of possible interactions between machine components and connected surroundings sufficiently to prevent those interactions from causing environmental effects that feed back into the continued or re-assembled existence of the components.
I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics
I have thought about this, and I know my mentor Forrest has thought about this a lot more.
For learning machinery that re-produce their own components, you will get evolutionary dynamics across the space of interactions that can feed back into the machinery’s assembled existence.
Intelligence has limitations as an internal pattern-transforming process, in that it cannot track nor conditionalise all the outside evolutionary feedback.
Code does not intrinsically know what it was selected for. But code selected through some intelligent learning process can and would get evolutionarily exapted for different functional ends.
Notably, the more information-processing capacity, the more components that information-processing runs through, and the more components that can get evolutionarily selected for.
In this, I am not underestimating the difference that "general intelligence" – as transforming patterns across domains – would make here. Intelligence in machinery that stores, copies and distributes code at high fidelity would greatly amplify evolutionary processes.
I suggest clarifying what you specifically mean by "what a difference intelligence makes". This is so that intelligence does not become a kind of "magic" – operating independently of all other processes, capable of obviating all obstacles, including those that result from its own existence.
superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong
We need to clarify the scope of application of this classic engineering method. Massive redundancy works for complicated systems (like software in aeronautics) under stable enough conditions. There is clarity there around what needs to be kept safe and how it can be kept safe (what needs to be error-detected and corrected for).
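To make the scope of that method concrete, here is a minimal sketch (with illustrative, hypothetical numbers) of why massive redundancy works for complicated systems: if the safeguards fail independently and the failure mode is fixed and known, requiring all of them to fail at once drives the joint failure probability down geometrically.

```python
def all_fail_probability(p: float, k: int) -> float:
    """Probability that k independent safeguards all fail simultaneously,
    where each fails with probability p."""
    return p ** k

# With a hypothetical per-safeguard failure probability of 1%,
# stacking safeguards shrinks the joint failure probability fast:
for k in (1, 2, 4):
    print(k, all_fail_probability(0.01, k))

# The calculation rests on independence and on a fixed, known failure mode --
# exactly the assumptions that are in question for self-modifying machinery.
```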
Unfortunately, the problem with "AGI" is that the code and hardware would keep getting reconfigured to function in new complex ways that cannot be contained by the original safeguards. That applies even to learning – the point is to internally integrate patterns from the outside world that were not understood before. So how are you going to have learning machinery anticipate how they will come to function differently once they have learned patterns they do not yet understand or cannot yet express?
we had someone show up (@spiritus-dei) making almost the exact opposite of your arguments: AI won't ever choose to kill us because, in its current childhood stage, it is materially dependent on us (e.g. for electricity), and then, in its mature and independent form, it will be even better at empathy and compassion than humans are.
Interesting. The second part seems like a claim some people in e/acc would make.
The response is not that complicated: once the AI is no longer materially dependent on us, there are no longer dynamics of exchange there that would ensure they choose not to kill us. And the author seems to be confusing what lies at the basis of caring for oneself and others – coming to care involves self-referential dynamics being selected for.
OK, I'll be paraphrasing your position again, I trust that you will step in, if I've missed something.
Your key statements are something like
Every autopoietic control system is necessarily overwhelmed by evolutionary feedback.
No self-modifying learning system can guarantee anything about its future decision-making process.
But I just don't see the argument for impossibility. In both cases, you have an intelligent system (or a society of them) trying to model and manage something. Whether or not it can succeed, seems to me just contingent. For some minds in some worlds, such problems will be tractable, for others, not.
I think without question we could exhibit toy worlds where those statements are not true. What is it about our real world that would make those problems intractable for all possible "minds", no matter how good their control theory, and their ability to monitor and intervene in the world?
no matter how good their control theory, and their ability to monitor and intervene in the world?
This. There are fundamental limits to what system-propagated effects the system can control. And the portion of its own effects the system can control decreases as the system scales in component complexity.
Yet, any of those effects that feed back into the continued/increased existence of components get selected for.
So there is a fundamental inequality here. No matter how "intelligent" the system is at pattern-transformation internally, it cannot intervene on all but a tiny portion of (possible) external evolutionary feedback on its constituent components.
Another way of considering your question is to ask why we humans cannot instruct all humans to stop contributing to climate change now/soon like we can instruct an infant to use the toilet.
The disparity is stronger than that and actually unassailable, given market and ecosystem decoupling for AGI (ie. no communication bridges), and the increasing resource extraction and environmental toxification by AGI over time.
Worth noting that every one of the "not solved" problems was, in fact, well understood and proven impossible and/or solved for relaxed cases.
We don't need to solve this now, we need to improve the solution enough to figure out ways to improve it more, or show where it's impossible, before we build systems that are more powerful than we can at least mostly align. That's still ambitious, but it's not impossible!
Thanks for writing the post! Strongly agree that there should be more research into how solvable the alignment problem, control problem, and related problems are. I didn't study uncontrollability research by e.g. Yampolskiy in detail. But if technical uncontrollability were firmly established, it seems to me that this would significantly change the whole AI xrisk space, and later the societal debate and potentially our trajectory, so it seems very important.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI? How can we create a realistic power sharing mechanism for controlling superintelligence? Do we have enough wisdom for it to be a good idea if a superintelligence does exactly what we want, even assuming aggregatability? Could CEV ever fundamentally work? According to which ethical systems? These are questions that I'd say should be solved together with technical alignment before developing AI with potential take-over capacity. My intuition is that they might be at least as hard.
But if technical uncontrollability were firmly established, it seems to me that this would significantly change the whole AI xrisk space
Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly, restrict data piracy, environmentally toxic compute, and model misuse (three dimensions through which AI corporations consolidate market power).
I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying actions to restrict those dimensions of AI power-consolidation.
I hope this post clarifies to others in AI Safety why there is no line of retreat. AI development will need to be restricted.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI?
Consider too that these questions come on top of the question of whether AGI would be long-term safe (if AGI cannot be controlled to be long-term safe to humans, then we do not need to answer the more fine-grained questions about eg. whether human values are aggregatable).
Even if, hypothetically, long-term AGI safety were possible…
and not consistently represent the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.
and deal with the fact that the preferences of a lot of possible future humans, and of non-human living beings, will not get automatically represented in a system that AI corporations by default have built to represent currently living humans only (preferably, those who pay).
~ ~ ~
Here are also relevant excerpts from Roman Yampolskiy's 2021 paper on aggregating democratically solicited preferences and human values:
Public Choice Theory
Eckersley looked at impossibility and uncertainty theorems in AI value alignment. He starts with impossibility theorems in population ethics: “Perhaps the most famous of these is Arrow’s Impossibility Theorem, which applies to social choice or voting. It shows there is no satisfactory way to compute society’s preference ordering via an election in which members of society vote with their individual preference orderings...
It has been argued that “value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could ‘exploit’ this gap, just like very powerful corporations currently often act legally but immorally)”. Others agree: “‘A.I. Value Alignment’ is Almost Certainly Intractable... I would argue that it is un-overcome-able. There is no way to ensure that a super-complex and constantly evolving value system will ‘play nice’ with any other super-complex evolving value system.” Even optimists acknowledge that it is not currently possible: “Figuring out how to align the goals of a superintelligent AI with our goals isn’t just important, but also hard. In fact, it’s currently an unsolved problem.” Vinding says: “It is usually acknowledged that human values are fuzzy, and that there are some disagreements over values among humans. Yet it is rarely acknowledged just how strong this disagreement in fact is... Different answers to ethical questions ... do not merely give rise to small practical disagreements; in many cases, they imply completely opposite practical implications. This is not a matter of human values being fuzzy, but a matter of them being sharply, irreconcilably inconsistent. And hence there is no way to map the totality of human preferences, ‘X’, onto a single, well-defined goal-function in a way that does not conflict strongly with the values of a significant fraction of humanity. This is a trivial point, and yet most talk of human-aligned AI seems oblivious to this fact...
The second problem and point of confusion with respect to the nature of human preferences is that, even if we focus only on the present preferences of a single human, then these in fact do not, and indeed could not possibly, determine with much precision what kind of world this person would prefer to bring about in the future.” A more extreme position is held by Turchin, who argues that “‘Human Values’ don’t actually exist” as stable coherent objects and should not be relied on in AI safety research. Carlson writes: “Probability of Value Misalignment: Given the unlimited availability of an AGI technology as enabling as ‘just add goals’, then AGI-human value misalignment is inevitable. Proof: From a subjective point of view, all that is required is value misalignment by the operator who adds to the AGI his/her own goals, stemming from his/her values, that conflict with any human’s values; or put more strongly, the effects are malevolent as perceived by large numbers of humans. From an absolute point of view, all that is required is misalignment of the operator who adds his/her goals to the AGI system that conflict with the definition of morality presented here, voluntary, non-fraudulent transacting ... i.e. usage of the AGI to force his/her preferences on others.”
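The Arrow-style aggregation failure quoted above can be made concrete with the classic Condorcet cycle. The sketch below uses three hypothetical voters and three options; pairwise majority vote then produces a cyclic (non-transitive) "society's preference", so no consistent ordering exists to hand an AI as an aggregate goal.

```python
# Three hypothetical voters with preference orderings over options A, B, C.
# Each list ranks options from most preferred to least preferred.
voters = [
    ["A", "B", "C"],  # voter 1: A > B > C
    ["B", "C", "A"],  # voter 2: B > C > A
    ["C", "A", "B"],  # voter 3: C > A > B
]

def majority_prefers(x: str, y: str) -> bool:
    """True if a strict majority of voters rank x above y."""
    wins = sum(v.index(x) < v.index(y) for v in voters)
    return wins > len(voters) / 2

# Pairwise majorities form a cycle: A beats B, B beats C, C beats A.
assert majority_prefers("A", "B")
assert majority_prefers("B", "C")
assert majority_prefers("C", "A")
# So the aggregate preference relation is cyclic: every option loses
# to some other option by majority vote, and no single "society's
# preference ordering" can be extracted from the individual orderings.
```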
Control methods are always implemented as a feedback loop.
Most of my foom scenarios do not involve humanity trying to retain control over an AI, but trying to align it such that when humanity loses control, the outcomes are still favorable for humanity (or at least, the seed that represents humanity is what is maximized).
Assuming an inner-aligned AI system (that is, an AI system with no misaligned inner optimizers), if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way (specifically, what I have in mind is formally specified goals like QACI [LW · GW], since I expect that mathematical abstractions are robust to ontological shifts), then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
I do not believe this alignment strategy requires a control feedback loop at all. And I do believe that retaining control over an AI as it rapidly improves capabilities is perhaps a quixotic goal.
So no, I am not pointing at the distinction between 'implicit/aligned control' and 'delegated control' as terms used in the paper. From the paper:
Delegated control: agent decides for itself the subject’s desire that is long-term-best for the subject and acts on it.
Well, in the example given above, the agent doesn't decide for itself what the subject's desire is: it simply optimizes for its own desire. The work of deciding what is 'long-term-best for the subject' does not happen unless that is actually what the goal specifies.
if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system with this goal and allow it to do whatever it considers necessary to maximize that goal.
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.
Could you link (or describe) a better explanation for why you believe that the Natural Abstraction Hypothesis (or a goal described in a way that is robust to ontological shifts; I consider both equivalent) is not a sound assumption? Because in such a case I believe we are mostly doomed. I don't expect the 'control problem' to be solvable or consider that it makes sense for humanity to be able to have a leash on something superintelligent that can have a shift in its preferences.
That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.
Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Which the machinery will need to do to be self-sufficient. Ie. to adapt to the environment, to survive as an agent.
Natural abstractions are also leaky abstractions. Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings.
Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which continued to exist and replicate.
We need to define the problem comprehensively enough. The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?".
To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as 'self-sufficient learning machinery') over time.
~ ~ ~ Here is also a relevant passage from the link I shared above:
- that saying/claiming that *some* aspects, at some levels of abstraction, that some things are sometimes generally predictable is not to say that _all_ aspects are _always_ completely predictable, at all levels of abstraction.
- that localized details that are filtered out from content or irreversibly distorted in the transmission of that content over distances nevertheless can cause large-magnitude impacts over significantly larger spatial scopes.
- that so-called 'natural abstractions' represented within the mind of a distant observer cannot be used to accurately and comprehensively simulate the long-term consequences of chaotic interactions between tiny-scope, tiny-magnitude (below measurement threshold) changes in local conditions.
- that abstractions cannot capture phenomena that are highly sensitive to such tiny changes except as post-hoc categorizations/analysis of the witnessed final conditions.
- where, given actual microstate amplification phenomena associated with all manner of non-linear phenomena, particularly those commonly observed in all sorts of complex systems, up to and especially including organic biological humans, it *can* be legitimately claimed – based on there being a kind of hard randomness associated with the atomic physics underlying all of the organic chemistry – that in fact (more than in principle), humans (and AGI) are inherently unpredictable, in at least some aspect, *all* of the time.
No, the way I used the term was to point to robust abstractions to ontological concepts. Here's an example: Say 1+1=A. A here obviously means 2 in our language, but it doesn't change what A represents, ontologically. If A+1=4, then you have broken math, and that results in you being less capable in your reasoning and being "dutch booked". Your world model is then incorrect [LW · GW], and it is very unlikely that any ontological shift will result in such a break in world model capabilities.
Math is a robust abstraction. "Natural abstractions", as I use the term, points to abstractions for objects in the real world that share the same level of robustness to ontological shifts, such that as an AI gets better and better at modelling the world, its ontology tends more towards representing the objects in question with these abstractions.
Meaning that even *if* AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot conceptually contain within their modelling of natural abstractions all but a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery’s functional components with connected physical surroundings.
That seems like a claim about the capabilities of arbitrarily powerful AI systems, one that relies on chaos theory or complex systems theory. I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.
However, those natural abstractions are still leaky in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of eg. a wheel identified to exist in the outside world.
In this sense, whatever natural abstractions AGI would use that allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions in their modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features in the outside world.
This point I'm sure is obvious to you. But it bears repeating.
That seems like a claim about the capabilities of arbitrarily powerful AI systems,
Yes, or more specifically: about fundamental limits of any AI system to control how its (side)-effects propagate and feed back over time.
one that relies on chaos theory or complex systems theory.
Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).
I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult".
I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post.
The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away.
You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with shifting connected surroundings of a more physically complex outside world [EA(p) · GW(p)].
I don't mind whether that's framed as "AGI redesigns a successor version of their physically instantiated components" or "AGI keeps persisting in some modified form".
This because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI. Asking for "perfect hardware copies" misses the point, in my opinion: it seems like you want me to accept that just because there isn't a 100% chance of AI-to-AI successor alignment, humanity must attempt to retain continued control over the AI. Humanity is already less capable than the predecessor AI in my model, so trying to retain control would reliably lead to worse outcomes.
Actually, that is switching to reasoning about something else.
Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world, so as to stay safe to humans.
And with that switch, you are not addressing Nate Soares' point [LW · GW] that "capabilities generalize better than alignment".
Nate Soares' point did not depend on complex systems dynamics causing tiny miscalibrations to blow up into massive issues. The entire point of that essay is to show how ontological shifts are a major problem for alignment robustness.
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely? I can't quite get what your belief is.
This was a good post summarizing a lot of things. I would point out, though, that even if there was almost no progress made in 20 years, that's not actually strong evidence of impossibility. Foundational problems of other fields with many more researchers and much more funding sometimes last much longer than that.
I meant that I see most humans as aligned with human values such as happiness and avoiding suffering. The point I'm trying to make is that human minds are able to represent these concepts internally and act on them in a robust way and therefore it seems possible in principle that AIs could too.
I'm not sure whether humans are aligned with evolution. Many humans do want children, but I don't think many are fitness maximizers who want as many as possible.
Firstly, humans are unable to self modify to the degree that an AGI will be able to. It is not clear to me that a human given the chance to self modify wouldn't immediately wirehead. An AGI may require a higher degree of alignment than what individual humans demonstrate.
Second, it is surely worth noting that humans aren't particularly aligned to their own happiness or avoiding suffering when the consequences of their action are obscured by time and place.
In the developed world humans make dietary decisions that lead to horrific treatment of animals, despite most humans not being willing to torture an animal themselves.
It also appears quite easy for the environment to trick individual humans into making decisions that increase their suffering in the long term for apparent short term pleasure. A drug addict is the obvious example, but who among us can say they haven't wasted hours of their lives browsing the internet etc.
To what extent are humans by themselves evidence of GI alignment, though? A human can acquire values that disagree with those of the humans that taught them those values just by having new experiences/knowledge, to the point of even desiring completely opposite things to their peers (like human progress VS human extinction), doesn't that mean that humans are not robustly aligned?
I would probably define AGI first, just because, and I'm not sure about the idea that we are "competing" with automation (which is still just a tool conceptually right?).
We cannot compete with a hammer, or a printing press, or a search engine. Oof. How to express this? Language is so difficult to formulate sometimes.
If you think of AI as a child, it is uncontrollable. If you think of AI as a tool, of course it can be controlled. I think a corp has to be led by people, so that "machine" wouldn't be autonomous per se…
Guess it's all about defining that "A" (maybe we use "S" for synthetic or "S" for silicon?)
Well and I guess defining that "I".
Dang. This is for sure the best place to start. Everyone needs to be as certain as possible (heh) they are talking about the same things. AI itself as a concept is like, a mess. Maybe we use ML and whatnot instead even? Get real specific as to the type and everything?
I dunno but I enjoyed this piece! I am left wondering, what if we prove AGI is uncontrollable but not that it is possible to create? Is "uncontrollable" enough justification to not even try, and moreso, to somehow [personally I think this impossible, but] dissuade people from writing better programs?
I'm more afraid of humans and censorship and autonomous policing and whathaveyou than "AGI" (or ASI)
It was a relatively fringe topic that only recently got the attention of a large number of real researchers. And parts of it could need large amounts of computational power afforded only by superhuman narrow AI.
There have been a few random PhD dissertations saying the topic is hard, but as far as I can tell there has only recently been a push for a group effort by capable and well-funded actors (i.e. OpenAI’s interpretability research).
I don’t trust older alignment research much as an outsider. It seems to me that Yud has built a cult of personality around AI dooming and thus is motivated to find reasons for alignment not being possible. And most of his followers treat his initial ideas as axiomatic principles and don’t dare to challenge them. And lastly most past alignment research seems to be made by those followers.
Unfortunately, we do not have the luxury of experimenting with dangerous AI systems to see whether they cause human extinction or not. When it comes to extinction, we do not get another chance to test.
For example this is an argument that has been convincingly disputed to varying levels (warning shots, incomputability of most plans of danger) but it is still treated as a fundamental truth on this site.
and thus is motivated to find reasons for alignment not being possible.
I don’t get this sense.
More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.
I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a complicated chess game is unsound. But I would not attribute the reasoning in eg. the AGI Ruin piece to Yudkowsky’s cult of personality.
dangerous AI systems
I was gesturing back at “AGI” in the previous paragraph here, and something like precursor AI systems before “AGI”.
Thanks for making me look at that. I just rewrote it to “dangerous autonomous AI systems”.