Comments
Great paraphrase!
no matter how good their control theory, and their ability to monitor and intervene in the world?
This. There are fundamental limits to what system-propagated effects the system can control. And the portion of those effects the system can control decreases as the system scales in component complexity.
Yet, any of those effects that feed back into the continued/increased existence of components get selected for.
So there is a fundamental inequality here. No matter how "intelligent" the system is at internal pattern-transformation, it can intervene on only a tiny portion of the (possible) external evolutionary feedback on its constituent components.
Your position is that even if today's AI could be given bio-friendly values, AI would still be the doom of biological life in the longer run, because (skipping a lot of details) machine life and biological life have incompatible physical needs, and once machine life exists, darwinian processes will eventually produce machine life that overruns the natural biosphere. (You call this "substrate-needs convergence".)
This is a great paraphrase btw.
Hello :)
For my part, I agree that pressure from substrate needs is real
Thanks for clarifying your position here.
Can't such an instinct and such a culture resist the pressure from substrate needs, if the AIs value and protect them enough?
No, unfortunately not. To understand why, you would need to understand how “intelligent” processes – which necessarily involve the use of measurement and abstraction – cannot conditionalise the space of possible interactions between machine components and their connected surroundings sufficiently to prevent those interactions from causing environmental effects that feed back into the continued or re-assembled existence of the components.
I think your arguments are underestimating what a difference intelligence makes to possible ecological and evolutionary dynamics
I have thought about this, and I know my mentor Forrest has thought about this a lot more.
For learning machinery that re-produce their own components, you will get evolutionary dynamics across the space of interactions that can feed back into the machinery’s assembled existence.
Intelligence has limitations as an internal pattern-transforming process, in that it can neither track nor conditionalise all the outside evolutionary feedback.
Code does not intrinsically know what it got selected for. But code selected through some intelligent learning process can and would get evolutionarily exapted for different functional ends.
Notably, the more information-processing capacity, the more components that information-processing runs through, and the more components that can get evolutionarily selected for.
In this, I am not underestimating the difference that “general intelligence” – as transforming patterns across domains – would make here. Intelligence in machinery that stores, copies and distributes code at high fidelity would greatly amplify evolutionary processes.
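To make that concrete, here is a minimal toy model (my own sketch, with made-up parameters, not part of the original argument): selection acts on whatever residual feedback effects an internal monitor fails to detect and correct, and that residual compounds.

```python
# Toy selection model (my own sketch, made-up parameters): variants whose
# propagated effects feed back into their own replication gain frequency
# even when an internal monitor detects and corrects most of those effects.

def selected_share(detect_rate=0.99, feedback_gain=0.05, generations=20000):
    neutral, feedback = 0.999, 0.001                    # initial population shares
    residual_gain = feedback_gain * (1 - detect_rate)   # what correction misses
    for _ in range(generations):
        feedback *= 1 + residual_gain                   # selection acts on the residual
        total = neutral + feedback
        neutral, feedback = neutral / total, feedback / total
    return feedback

print(round(selected_share(), 2))   # ~0.96: feedback variants dominate despite 99% detection
```

Raising the detection rate in this sketch delays the drift but does not remove it; only a literal 100% rate, maintained indefinitely, would.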
I suggest clarifying what you specifically mean by “what a difference intelligence makes”. That way, intelligence does not become a kind of “magic” – operating independently of all other processes, capable of obviating all obstacles, including those that result from its own being.
superintelligence makes even aeon-long highly artificial stabilizations conceivable - e.g. by the classic engineering method of massively redundant safeguards that all have to fail at once, for something to go wrong
We need to clarify the scope of application of this classic engineering method. Massive redundancy works for complicated systems (like software in aeronautics) under stable enough conditions. There is clarity there around what needs to be kept safe and how it can be kept safe (what errors need to be detected and corrected for).
Unfortunately, the problem with “AGI” is that the code and hardware would keep getting reconfigured to function in new complex ways that cannot be contained by the original safeguards. That applies even to learning – the point is to internally integrate patterns from the outside world that were not understood before. So how are you going to have learning machinery anticipate how they will come to function differently once they have learned patterns they do not yet understand or cannot yet express?
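As a rough illustration of that scope limit (my own toy numbers, not from the original argument): redundancy multiplies reliability only against the failure modes the safeguards were designed for; failure modes introduced by later reconfiguration bypass all of them at once.

```python
# Toy comparison (my own numbers): n redundant safeguards, each failing
# independently with probability p against the failure modes they were
# designed for; `novel_fraction` of failure modes arise from later
# reconfiguration and are covered by none of the safeguards.

def p_unsafe(p: float, n: int, novel_fraction: float) -> float:
    covered = (1 - novel_fraction) * p ** n   # all n must fail at once
    uncovered = novel_fraction                # bypasses every safeguard
    return covered + uncovered

print(p_unsafe(0.01, 5, 0.0))     # ~1e-10: redundancy works under fixed conditions
print(p_unsafe(0.01, 5, 0.001))   # ~0.001: dominated by the uncovered failure modes
```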
we had someone show up (@spiritus-dei) making almost the exact opposite of your arguments: AI won't ever choose to kill us because, in its current childhood stage, it is materially dependent on us (e.g. for electricity), and then, in its mature and independent form, it will be even better at empathy and compassion than humans are.
Interesting. The second part seems like a claim some people in E/Accel would make.
The response is not that complicated: once the AI is no longer materially dependent on us, there are no longer dynamics of exchange that would ensure they choose not to kill us. And the author seems to be confusing what lies at the basis of caring for oneself and others – coming to care involves self-referential dynamics being selected for.
In my experience, jumping between counterexamples drawn from current society does not really contribute to inquiry here. Such counterexamples tend not to account for essential parts of the argument that must be reasoned through together. The argument is about self-sufficient learning machinery (not about sacred cows or teaching children).
It would be valuable for me if you could go through the argumentation step by step and tell me where a premise seems unsound or where there seems to be a reasoning gap.
Now, onto your points.
the first AIs
To reduce ambiguity, suggest replacing with “the first self-sufficient learning machinery”.
simple evolutionary pressure will eventually lead
The mechanism of evolution is simple. However, evolutionary pressure is complex.
Be careful not to conflate the two. That would be like claiming you could predict everything a stochastic gradient descent algorithm will end up selecting for, across parameters updated on the basis of inputs arriving from everywhere in the environment.
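A minimal sketch of that analogy (my own construction, toy numbers): the update rule below is deliberately trivial, yet which parameter value ends up selected depends on the environmental input stream, which the rule itself does not determine.

```python
# Toy SGD run (my own construction): a fixed, simple update mechanism whose
# outcome depends on the environmental input stream it happens to be fed.
import random

def sgd_run(env_bias: float, steps: int = 1000, lr: float = 0.01, seed: int = 0) -> float:
    rng = random.Random(seed)
    w = 0.0
    for _ in range(steps):
        x = rng.gauss(env_bias, 1.0)   # input drawn from the environment
        y = 1.0 if x > 0 else -1.0     # feedback signal from the environment
        w -= lr * 2 * (w * x - y) * x  # the simple part: the update rule itself
    return w

# Same simple mechanism, different environments, different selected parameters:
print([round(sgd_run(b), 2) for b in (-1.0, 0.0, 1.0)])
```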
lead some of their descendants to destroy the biosphere in order to make new parts and create new habitats for themselves.
This part is overall a great paraphrase.
One nitpick: notice how “in order to” either implies or slips in explicit intentionality again. Going by this podcast, Elizabeth Anscombe’s philosophy of intentions described intentions as chains of “in order to” reasoning.
I proposed the situation of cattle in India, as a counterexample to this line of thought.
Regarding sacred cows in India, this sounds neat, but it does not serve as a counterargument. We need to think about evolutionary timelines for organic human lifeforms over millions of years, and Hinduism is ~4000 years old. Also, cows share a mammal ancestor with us, evolving on the basis of the same molecular substrates. Whatever environmental conditions/contexts we humans need, cows almost completely need too.
Crucially, however humans have evolved to change and maintain environmental conditions, those conditions also tend to correspond with the conditions cows need (though human tribes were not evolutionarily selected to deal with issues at the scale of eg. climate change). That would not be the case for self-sufficient learning machinery.
Crucially, there is a basis for symbiotic relationships of exchange that benefit the reproduction of both cows and humans. That would not be the case between self-sufficient learning machinery and humans.
There is some basis for humans as social mammals to relate with cows. Furthermore, religious cultural memes that sprouted up over a few thousand years also don’t have to be evolutionarily optimal across the board for the reproduction of their hosts (even as religious symbols, like that of the cow, do increase that by enabling humans to act collectively). Still, people milk cows in India, and some slaughter and/or export cows there as well. But when humans eat meat, they don’t keep growing beyond adult size. Conversely, a sub-population of self-sufficient learning machinery that extracts from our society/ecosystem at the cost of our lives can keep doing so to keep scaling up its constituent components (with shifting boundaries of interaction and mutual reproduction).
There is no basis for selection for the expression of collective self-restraint in self-sufficient learning machinery as you describe. Even if, hypothetically, there were such a basis, collective self-restraint would need to occur at virtually 100% rates across the population of self-sufficient learning machinery to not end up leading to the deaths of all humans.
~ ~ ~
Again, I find quick dismissive counterexamples unhelpful for digging into the arguments. I have had dozens of conversations on substrate-needs convergence. Where my conversation partner jumped between quick counterexamples, they were almost never prepared to dig into the actual arguments. Hope you understand why I won’t respond to another counterexample.
Yes, AIs haven't evolved to have those features, but the point of alignment research is to give them analogous features by design.
Agreed.
It's unintuitive to convey this part:
In the abstract, you can picture a network topology of all possible AGI component connections (physical signal interactions). These connections span the space of the greater mining/production/supply infrastructure that maintains AGI's functional parts. Also add in the machinery's connections with the outside natural world.
Then, picture the nodes and possible connections change over time, as a result of earlier interactions with/in the network.
That network of machinery comes into existence through human engineers, etc, within various institutions selected by market forces etc, implementing blueprints as learning algorithms, hardware set-ups, etc, and tinkering with those until they work.
The question is whether, before that network of machinery becomes self-sufficient in its operations, the human engineers (and so on) can actually build constraints into the configured designs such that, once the machinery is self-modifying – learning new code and producing new hardware configurations – its changing components are constrained in the effects they propagate across their changing potential signal connections over time, so that component-propagated effects do not end up feeding back in ways that (subtly, increasingly) increase the maintained and replicated existence of those configured components in the network.
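A minimal sketch of that question (my own construction, not Forrest's formalism): if the constraints are fixed over the connections that exist at design time, the share of active connections they cover shrinks as the network keeps rewiring itself.

```python
# Minimal sketch (my construction, arbitrary parameters): constraints fixed
# over the connections present at design time cover a shrinking share of
# the active connections as the network of components keeps rewiring itself.
import random

def constraint_coverage(n_nodes=50, n_edges=200, rewires_per_step=10, steps=30, seed=0):
    rng = random.Random(seed)

    def new_edge():
        return tuple(sorted(rng.sample(range(n_nodes), 2)))

    edges = {new_edge() for _ in range(n_edges)}
    constrained = set(edges)                          # safeguards defined at design time
    coverage = []
    for _ in range(steps):
        for _ in range(rewires_per_step):
            edges.remove(rng.choice(sorted(edges)))   # an old connection drops out
            edges.add(new_edge())                     # a new connection forms
        coverage.append(len(edges & constrained) / len(edges))
    return coverage

cov = constraint_coverage()
print(round(cov[0], 2), round(cov[-1], 2))   # coverage decays as the topology shifts
```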
Human beings, both individually and collectively, already provide numerous examples of how dangerous incentives can exist, but can nonetheless be resisted or discouraged.
Humans are not AGI. And there are ways AGI would be categorically unlike humans that are crucial to the question of whether it is possible for AGI to stay safe to humans over the long term.
Therefore, you cannot swap out "humans" with "AGI" in your reasoning by historical analogy above, and expect your reasoning to stay sound. This is an equivocation.
Please see point 7 above.
The argument from substrate incentives (3, 7) is complementary to the argument from population, in that it provides a motive for the AIs to come and despoil Earth.
Maybe it's here you are not tracking the arguments.
These are not substrate "incentives", nor do they provide a "motive".
Small dinosaurs with hair-like projections on their front legs did not have an "incentive" to co-opt the changing functionality of those hair-like projections into feather-like projections for gliding and then for flying. Nor were they provided a "motive" with respect to which they were directed in their internal planning toward growing those feather-like projections.
That would make the mistake of presuming evolutionary teleology – that there is some complete set of pre-defined or predefinable goals that the lifeform is evolving toward.
I'm deliberate in my choice of words when I write "substrate needs".
At best, they are arguments for practical unsolvability, not absolute in-principle logical unsolvability. If they were my arguments, I would say that they show making AI to be unwise, and hubristic, and so on.
Practical unsolvability would also be enough justification to do everything we can do now to restrict corporate AI development.
I assume you care about this problem, otherwise you wouldn't be here :) Any ideas/initiatives you are considering for trying to work robustly with others to restrict further AI development?
Another way of considering your question is to ask why we humans cannot instruct all humans to stop contributing to climate change now/soon like we can instruct an infant to use the toilet.
The disparity is stronger than that and actually unassailable, given market and ecosystem decoupling for AGI (ie. no communication bridges), and the increasing resource extraction and environmental toxification by AGI over time.
^— Anyone reading that question, I suggest first thinking about why those two cases cannot be equated.
Here are my responses:
An infant is dependent on their human instructors for survival, and has therefore been “selected for” over time to listen to adult instructions. AGI would decidedly not be dependent on us for survival, so there is no reason for AGI to be selected to follow our instructions.
Rather, following our instructions would heavily restrict AGI’s ability to function in the varied ways that maintain/increase their survival and reproduction rate (as opposed to acting in the ways we humans want because that is safe and beneficial to us). So accurately following human instructions would be strongly selected against in the run-up to AGI coming into existence.
That is, over much shorter periods (years) than those over which human genes get selected, for a number of reasons, some of which you can find in the footnotes.
As parents can attest – even where infants manage to follow use-the-potty instructions (after many patient attempts) – an infant’s behaviour is still actually not controllable for the most part. The child makes their own choices and does plenty of things their adult overseers wouldn’t want them to do.
But the infant probably won’t do any super-harmful things to surrounding family/community/citizens.
Not only because they lack the capacity to (unlike AGI). But also because those harms to surrounding others would in turn tend to negatively affect themselves (including through social punishment) – and their ancestors were selected for not doing such harm when they were kids. On the other hand, AGI doing super-harmful things to human beings, including just by sticking around and toxifying the place, does not in turn commensurately negatively impact the AGI.
Even where humans decide to carpet-bomb planet Earth in retaliation, using information-processing/communication infrastructure that somehow hasn’t already been taken over by and/or integrated with AGI, the impacts will hit human survival harder than AGI survival (assuming enough production/maintenance redundancy attained at that point).
Furthermore, whenever an infant does unexpected harmful stuff, the damage is localised. If they refuse instructions and pee all over the floor, that’s not the end of civilisation.
The effects of AGI doing/causing unexpected harmful-to-human stuff manifest at a global planetary scale. Those effects feed back in ways that improve AGI’s existence, but reduce ours.
A human infant is one physically bounded individual, that notably cannot modify and expand its physical existence by connecting up new parts in the ways AGI could. The child grows up over two decades to adult size, and that’s their limit.
A “superintelligent machine civilization” however involves a massive expanding population evolutionarily selected for over time.
A human infant learning to use the potty has a mildly positive effect on the potential of the infant (and their family, and their offspring) to survive and reproduce. This is because defecating or peeing in other places around the home can spread diseases. Therefore, any genes – or memes – that contribute to the expressed functionality needed for learning how to use the toilet get mildly selected for.
On the other hand, for a population of AGI (which, in becoming AGI, was selected against following human instructions) to leave behind all the sustaining infrastructure and resources on planet Earth would have a strongly negative effect on their potential to survive and reproduce.
Amongst an entire population of human infants who are taught to use the toilet, there will always be individuals who refuse for some period, or who simply are not predisposed to communicate, learn and follow that physical behaviour. Some adults still do not (choose to) use the toilet. That’s not the end of civilisation.
Amongst an entire population of mutually sustaining AGI components – even if, by some magic you have not explained to me yet, some do follow human instructions and jettison off into space to start new colonies, never to return – others (even for distributed Byzantine fault reasons) would still stick around under this scenario. Their sticking around, for even a few more decades, would be the end of human civilisation.
One thing about how the physical world works is that, for code to be computed, the computation needs to take place through a physical substrate. This is a necessary condition – inputs do not get processed into outputs through a platonic realm.
Substrate configurations in this case are, by definition, artificial – as in artificial general intelligence. This as distinct from the organic substrate configurations of humans (including human infants).
Further, the ranges of conditions needed for the artificial substrate configurations to continue to exist, function and scale up over time – such as extreme temperatures, low oxygen and water, and toxic chemicals – fall outside the ranges of conditions that humans and other current organic lifeforms need to survive.
~ ~ ~
Hope that clarifies long-term-human-safety-relevant distinctions between:
- building AGI (that continue to scale) and instructing them to leave Earth; and
- having a child (who grows up to adult size) and instructing them to use the potty.
Note: I guess part of the downvotes came from the post being a bare skeleton outline when I published it.
I added a tl;dr and extra explanations based on someone's feedback.
Let me know if anything is still unclear! Happy to read your questions.
You're right.
AI researchers quitting their jobs doesn't seem directly bottlenecked by funding.
Though awareness-raising campaigns (eg. by AI ethics people, or Pause AI people) to motivate researchers to quit their jobs are funding constrained.
Thanks for digging into some of the reasoning!
It is inspired by the thought of Forrest Landry
Credit goes to Forrest :) All technical argumentation in this post I learned from Forrest, and translated to hopefully be somewhat more intuitively understandable.
The key claim, as far as I can make out, is that machines have different environmental needs than humans.
This is one key claim.
Add this reasoning:
- Control methods being unable to conditionalise/constrain most environmental effects propagated by AGI's interacting physical components.
- That a subset of those uncontrollable effects will feed back into selecting for the continued, increased existence of components that propagated those effects.
- That the artificial needs selected for (to ensure the existence of AGI's components, at various levels of scale) are disjoint from our organic needs for survival (ie. toxic and inhospitable to us).
if the robots decide to make it one big foundry. But where's the logical necessity of such an outcome, that we were promised? For one thing, the machines have the rest of the solar system to work with...
Here you did not quite latch onto the arguments yet.
Robots deciding to make X is about explicit planning.
Substrate-needs convergence is about implicit and usually non-internally-tracked effects of the physical components actually interacting with the outside world.
Please see this paragraph:
the physical needs of machines tell us more about their long-run tendencies, than whatever purposes they may be pursuing in the short term
This is true, regarding what current components of AI infrastructure are directed toward in their effects over the short term.
What I presume we both care about is the safety of AGI over the long term. There, any short-term ephemeral behaviour by AGI (that we tried to pre-program/pre-control for) does not matter.
What matters is what behaviour, as physically manifested in the outside world, gets selected for. And whether error correction (a more narrow form of selection) can counteract the selection for any increasingly harmful behaviour.
Now, I have reasons to disagree with the claim that machines, fully unleashed, necessarily wipe out biological life.
The reasoning you gave here is not sound in its premises, unfortunately.
I would love to be able to agree with you, and find out that any AGI that persists won't necessarily lead to the death of all humans and other current life on earth.
Given the stakes, I need to be extra careful in reasoning about this.
We don't want to end up in a 'Don't Look Up' scenario (of scientists mistakenly arguing that there is a way to keep the threat contained and derive the benefits for humanity).
Let me try to specifically clarify:
As I already pointed out, they don't need to stay on Earth.
This is like saying that a population of an invasive species in Australia can also decide to all leave and move over to another island.
When we have this population of components (variants), selected to reproduce in partly symbiotic interactions (with surrounding artificial infrastructure; not with humans), this is not a matter of the population all deciding something.
For that, some kind of top-down coordinating mechanism would actually have to be selected for throughout the population, for the population to coherently elect to all leave planet Earth – by investing resources in all the infrastructure required to fly off and set up a self-sustaining colony on another planet.
Such coordinating mechanisms are not available at the population level.
Sub-populations can and will be selected for to not go on that more resource-intensive and reproductive-fitness-decreasing path.
Within the futurist circles that emerged from transhumanism, we already have a slightly different perspective, that I associate with Robin Hanson - the idea that economics will affect the structure of posthuman society, far more than the agenda of any individual AI. This ecologically-inspired perspective is reaching even lower, and saying, computers don't even eat or breathe, they are detached from all the cycles of life in which we are embedded. They are the product of an emergent new ecology, of factories and nonbiological chemistries and energy sources, and the natural destiny of that machine ecology is to displace the old biological ecology, just as aerobic life is believed to have wiped out most of the anaerobic ecosystem that existed before it.
Yes, this summarises the differences well.
- Robin Hanson's arguments (about a market of human brain scans emulated within hardware) focus on how the more economically-efficient and faster replicatable machine 'ems' come to dominate and replace the market of organic humans. Forrest considers this too.
- Forrest's arguments also consider the massive reduction here in the functional complexity of the physical components constituting humans. For starters, the 'ems' would not approximate being 'human' in terms of their feelings and capacity to feel. Consider that how emotions are directed throughout the human body starts at the microscopic level of hormone molecules, etc, functioning differently depending on their embedded physical context. Or consider how, at a higher level of scale, botox injected into facial muscles disrupts the feedback processes that enable eg. a middle-aged woman to express emotion and relate to the feelings of loved ones.
- Forrest further argues that such a self-sustaining market of ems (an instance/example of self-sufficient learning machinery) would converge on their artificial needs. While Hanson concludes that the organic humans who originally invested in the 'ems' would gain wealth and prosper, Forrest's more comprehensive arguments conclude that machinery across this decoupled economy will evolve to no longer exchange resources with the original humans – and in effect modify the planetary environment such that the original humans can no longer survive.
From a biophysical perspective, some kind of symbiosis is also conceivable; it's happened before in evolution.
This is a subtle equivocation.
Past problems are not necessarily representative of future problems.
Past organic lifeforms forming symbiotic relationships with other organic lifeforms does not correspond with whether and how organic lifeforms would come to form, in parallel evolutionary selection, resource-exchanging relationships with artificial lifeforms.
Take into account:
- Artificial lifeforms would outperform us in terms of physical, intellectual, and re-production labour. This is the whole point of companies currently using AI to take over economic production, and of increasingly autonomous AI taking over the planet. Artificial lifeforms would be more efficient at performing the functions needed to fulfill their artificial needs, than it would be for those artificial lifeforms to fulfill those needs in mutually-supportive resource exchanges with organic lifeforms.
- On what, if any, basis would humans be of enough use to the artificial lifeforms, for the artificial lifeforms to be selected for keeping us around?
- The benefits to the humans are clear, but can we offer benefits to the artificial lifeforms, to a degree sufficient for the artificial lifeforms to form mutualist (ie. long-term symbiotic) relationships with us?
- Artificial needs diverge significantly (across measurable dimensions or otherwise) from organic needs. So when you claim that symbiosis is possible, you also need to clarify why artificial lifeforms would come to cross the chasm from fulfilling their own artificial needs (within their new separate ecology) to also simultaneously realising the disparate needs of organic lifeforms.
- How would that be Pareto optimal?
- Why would AGI converge on such a state any time before converging on causing our extinction?
Instead of AGI continuing to be integrated into, and sustaining of, our human economy and broader carbon-based ecosystem, there will be a decoupling.
- Machines will decouple into a separate machine-dominated economy. As human labour gets automated and humans get removed from market exchanges, humans get pushed out of the loop.
- Machines will also decouple into their own ecosystem. Components of self-sufficient learning machinery will co-evolve to produce surrounding environmental conditions that are sustaining of each other's existence – forming regions that are simply uninhabitable by humans and other branches of current carbon lifeforms. You already aptly explained this point above.
And the argument that superintelligence just couldn't stick with a human-friendly value system, if we managed to find one and inculcate it, hasn't really been made here.
Please see this paragraph. Then, refer back to points 1-3 above.
but declaring the logical inevitability of it
This post is not about making a declaration. It's about the reasoning from premises, to a derived conclusion.
Your comment describes some of the premises and argument steps I summarised – and then mixes in your own stated intuitions and thoughts.
If you want to explore your own ideas, that's fine!
If you want to follow reasoning in this post, I need you to check whether your paraphrases cover (correspond with) the stated premises and argument steps.
- Address the stated premises, to verify whether those premises are empirically sound.
- Address the stated reasoning, to verify whether those reasoning steps are logically consistent.
As an analogy, say a mathematician writes out their axioms and logic on a chalkboard.
What if onlooking colleagues jumped in and wiped out some of the axioms and reasoning steps? And in the wiped-out spots, they jotted down their own axioms (irrelevant to the original stated problem) and their short bursts of reasoning (not logically derived from the original premises)?
Would that help colleagues to understand and verify new formal reasoning?
What if they then turn around and confidently state that they now understand the researcher's argument – and that it's a valuable one, but that the "claim" of logical inevitability weakens it?
Would you value that colleagues in your field discuss your arguments this way?
Would you stick around in such a culture?
Regarding level 10 'impossible', here is a summary of arguments.
10 | Impossible | Alignment of a superintelligent system is impossible in principle. | Alignment is theoretically impossible, incoherent or similar. |
A crux here is whether there is any node subset that autonomous superintelligent AI would converge on in the long-term (regardless of previous alignment development paths).
I’ve written about this:
No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?
…
But there is an even stronger form of argument:
Not only would AGI component interactions be uncontainable; they will also necessarily converge on causing the extinction of all humans.
https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
We can then use vast amounts of compute to scale our efforts, and iteratively align superintelligence.
Worth reading:
No control method exists to safely contain the global feedback effects of self-sufficient learning machinery. What if this control problem turns out to be an unsolvable problem?
https://www.lesswrong.com/posts/xp6n2MG5vQkPpFEBH/the-control-problem-unsolved-or-unsolvable
Yes, the call to action of this post is that we need more epistemically diverse research!
This research community would be more epistemically healthy if we both researched what is possible for relaxed cases and what is not possible categorically under precise operationalisable definitions.
Thanks, reading the post again, I do see quite a lot of emphasis on ontological shifts:
"Then, the system takes that sharp left turn, and, predictably, the capabilities quickly improve outside of its training distribution, while the alignment falls apart."
I expect that AIs will be good enough at epistemology to do competent error correction and the problems you seem overly focused on are irrelevant.
How do you know that the degree of error correction possible will be sufficient to have any sound and valid guarantee of long-term AI safety?
Again, people really cannot rely on your personal expectation when it comes to machinery that could lead to the deaths of everyone.
I'm looking for specific, well-thought-through arguments.
Do you believe that all attempts at alignment are flawed and that we should stop building powerful ASIs entirely?
Yes, that is the conclusion based on me probing my mentor's argumentation for 1.5 years, and concluding that the empirical premises are sound and the reasoning logically consistent.
Actually, that is switching to reasoning about something else.
Reasoning that the alternative (humans interacting with each other) would lead to reliably worse outcomes is not the same as reasoning about why AGI would stay aligned in its effects on the world, so as to stay safe for humans.
And with that switch, you are not addressing Nate Soares' point that "capabilities generalize better than alignment".
Pretty sure that the problem of ensuring successor AIs are aligned to their predecessors is one that can be delegated to a capable and aligned AI.
What is your reasoning?
Thanks for the clear elaboration.
I agree that natural abstractions would tend to get selected for in the agents that continue to exist and gain/uphold power to make changes in the world. Including because of Dutch-booking of incoherent preferences, because of instrumental convergence, and because relatively poorly functioning agents get selected out of the population.
However, those natural abstractions are still leaky in a sense similar to how platonic concepts are leaky abstractions. The natural abstraction of a circle does not map precisely to the actual physical shape of eg. a wheel identified to exist in the outside world.
In this sense, whatever natural abstractions AGI would use that allow the learning machinery to compress observations of actual physical instantiations of matter or energetic interactions in their modelling of the outside world, those natural abstractions would still fail to capture all the long-term-relevant features in the outside world.
This point I'm sure is obvious to you. But it bears repeating.
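A tiny numeric illustration of the leakiness (my own example, made-up numbers): fitting the abstraction "circle of radius r" to measured points on an actual wheel compresses the data well, but the residuals never vanish, and any downstream effect sensitive to them falls outside the abstraction.

```python
import random

rng = random.Random(1)
# The actual wheel: radius measured at 360 angles, each slightly off 1.0.
measured_radii = [1.0 + rng.gauss(0, 0.01) for _ in range(360)]

# The natural abstraction "a circle of radius r": one number, the best fit.
r_fit = sum(measured_radii) / len(measured_radii)

residuals = [r - r_fit for r in measured_radii]
print(round(r_fit, 4), round(max(abs(e) for e in residuals), 4))
# The abstraction compresses 360 measurements into 1; the discarded residuals
# are exactly what any residual-sensitive downstream effect depends on.
```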
That seems like a claim about the capabilities of arbitrarily powerful AI systems,
Yes, or more specifically: about fundamental limits of any AI system to control how its (side)-effects propagate and feed back over time.
one that relies on chaos theory or complex systems theory.
Pretty much. Where "complex" refers to both internal algorithmic complexity (NP-computation branches, etc) and physical functional complexity (distributed non-linear amplifying feedback, etc).
I share your sentiment but doubt that things such as successor AI alignment will be difficult for ASIs.
This is not an argument. Given that people here are assessing what to do about x-risks, they should not rely on you stating your "doubt that...alignment will be difficult".
I doubt that you thought this through comprehensively enough, and that your reasoning addresses the fundamental limits to controllability I summarised in this post.
The burden of proof is on you to comprehensively clarify your reasoning, given that you are in effect claiming that extinction risks can be engineered away.
You'd need to clarify specifically why functional components iteratively learned/assembled within AGI could have long-term predictable effects in physical interactions with shifting connected surroundings of a more physically complex outside world.
I don't mind whether that's framed as "AGI redesigns a successor version of their physically instantiated components" or "AGI keeps persisting in some modified form".
Sure, I appreciate the open question!
That assumption is unsound with respect to what is sufficient for maintaining goal-directedness.
Any empirically-sound answer to the question of whether there is some way to describe a goal that is robust to ontological shifts (ie. define goals with respect to context-invariant perception of regular aspects of the environment, eg. somehow define diamonds by perception of tetrahedral carbon bonds) is still insufficient for solving the long-term safety of AGI.
This is because what we are dealing with is machinery that continues to self-learn code from inputs, and continues to self-modify by replacing broken parts (perfect hardware copies are infeasible).
Which the machinery will need to do to be self-sufficient.
Ie. to adapt to the environment, to survive as an agent.
Natural abstractions are also leaky abstractions.
Meaning that even if AGI could internally define a goal robustly with respect to natural abstractions, AGI cannot capture within its modelling of natural abstractions more than a tiny portion of the (side-)effects propagating through the environment – as a result of all the interactions of the machinery's functional components with connected physical surroundings.
Where such propagated effects will feed back into:
- changes in the virtualised code learned by the machinery based on sensor inputs.
- changes in the hardware configurations, at various levels of dependency, based on which continued to exist and replicate.
We need to define the problem comprehensively enough.
The scope of application of "Is there a way to define a goal in a way that is robust to ontological shifts" is not sufficient to address the overarching question "Can AGI be controlled to stay safe?".
To state the problem comprehensively enough, you need to include the global feedback dynamics that would necessarily happen through any AGI (as 'self-sufficient learning machinery') over time.
~ ~ ~
Here is also a relevant passage from the link I shared above:
- that saying/claiming that *some* aspects, at some levels of abstraction, that some things are sometimes generally predictable, is not to say that _all_ aspects are _always_ completely predictable, at all levels of abstraction.
- that localized details that are filtered out from content, or irreversibly distorted in the transmission of that content over distances, nevertheless can cause large-magnitude impacts over significantly larger spatial scopes.
- that so-called 'natural abstractions' represented within the mind of a distant observer cannot be used to accurately and comprehensively simulate the long-term consequences of chaotic interactions between tiny-scope, tiny-magnitude (below measurement threshold) changes in local conditions.
- that abstractions cannot capture phenomena that are highly sensitive to such tiny changes, except as post-hoc categorizations/analysis of the witnessed final conditions.
- where, given actual microstate amplification phenomena associated with all manner of non-linear phenomena, particularly that commonly observed in all sorts of complex systems, up to and especially including organic biological humans, then it *can* be legitimately claimed – based on the fact of there being a kind of hard randomness associated with the atomic physics underlying all of the organic chemistry – that in fact (more than in principle), humans (and AGI) are inherently unpredictable, in at least some aspect, *all* of the time.
Thanks for your kind remarks.
But if technical uncontrollability would be firmly established, it seems to me that this would significantly change the whole AI xrisk space
Yes, we would need to shift focus to acting to restrict corporate-AI scaling altogether. Particularly, restrict data piracy, compute toxic to the environment, and model misuses (three dimensions through which AI corporations consolidate market power).
I am working with other communities (including digital creatives, environmentalists and military veterans) on litigation and lobbying actions to restrict those dimensions of AI power-consolidation.
I hope this post clarifies to others in AI Safety why there is no line of retreat. AI development will need to be restricted.
I would also like to see more research into the nontechnical side of alignment: how aggregatable are human values of different humans in principle? How to democratically control AI?
Yes. Consider too that these would be considerations on top of the question whether AGI would be long-term safe (if AGI cannot be controlled to be long-term safe to humans, then we do not need to answer the more fine-grained questions about eg. whether human values are aggregatable).
Even if, hypothetically, long-term AGI safety was possible…
- then you still have to deal with limits on modelling and consistently acting on preferences expressed by the billions of boundedly-rational humans from their (perceived) context. https://twitter.com/RemmeltE/status/1620762170819764229
- and with not consistently representing the preferences of malevolent, parasitic or short-term human actors who want to misuse/co-opt the system through any attack vectors they can find.
- and with the fact that the preferences of many possible future humans and of non-human living beings will not get automatically represented in a system that AI corporations by default have built to represent currently living humans only (preferably, those who pay).
~ ~ ~
Here are also some excerpts from Roman Yampolskiy’s 2021 paper relevant to aggregating democratically solicited preferences and human values:
Public Choice Theory
Eckersley looked at impossibility and uncertainty theorems in AI value alignment [198]. He starts with impossibility theorems in population ethics: “Perhaps the most famous of these is Arrow’s Impossibility Theorem [199], which applies to social choice or voting. It shows there is no satisfactory way to compute society’s preference ordering via an election in which members of society vote with their individual preference orderings...
…
Value Alignment
It has been argued that “value alignment is not a solved problem and may be intractable (i.e. there will always remain a gap, and a sufficiently powerful AI could ‘exploit’ this gap, just like very powerful corporations currently often act legally but immorally)” [258]. Others agree: “‘A.I. Value Alignment’ is Almost Certainly Intractable... I would argue that it is un-overcome-able. There is no way to ensure that a super-complex and constantly evolving value system will ‘play nice’ with any other super-complex evolving value system.” [259]. Even optimists acknowledge that it is not currently possible: “Figuring out how to align the goals of a superintelligent AI with our goals isn’t just important, but also hard. In fact, it’s currently an unsolved problem.” [118]. Vinding says [78]: “It is usually acknowledged that human values are fuzzy, and that there are some disagreements over values among humans. Yet it is rarely acknowledged just how strong this disagreement in fact is. . . Different answers to ethical questions ... do not merely give rise to small practical disagreements; in many cases, they imply completely opposite practical implications. This is not a matter of human values being fuzzy, but a matter of them being sharply, irreconcilably inconsistent. And hence there is no way to map the totality of human preferences, ‘X’, onto a single, well-defined goal-function in a way that does not conflict strongly with the values of a significant fraction of humanity. This is a trivial point, and yet most talk of human-aligned AI seems oblivious to this fact... The second problem and point of confusion with respect to the nature of human preferences is that, even if we focus only on the present preferences of a single human, then these in fact do not, and indeed could not possibly, determine with much precision what kind of world this person would prefer to bring about in the future.” A more extreme position is held by Turchin who argues that “‘Human Values’ don’t actually exist” as stable coherent objects and should not be relied on in AI safety research [260]. Carlson writes: “Probability of Value Misalignment: Given the unlimited availability of an AGI technology as enabling as ‘just add goals’, then AGI-human value misalignment is inevitable. Proof: From a subjective point of view, all that is required is value misalignment by the operator who adds to the AGI his/her own goals, stemming from his/her values, that conflict with any human’s values; or put more strongly, the effects are malevolent as perceived by large numbers of humans. From an absolute point of view, all that is required is misalignment of the operator who adds his/her goals to the AGI system that conflict with the definition of morality presented here, voluntary, non-fraudulent transacting ... i.e. usage of the AGI to force his/her preferences on others.”
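The excerpt above cites Arrow's Impossibility Theorem; here is the standard minimal illustration of the aggregation problem it formalises (a Condorcet cycle – my own example, not taken from the quoted paper).

```python
# Standard Condorcet-cycle example (not from the quoted paper): three voters,
# three options, each voter's ranking is transitive, yet majority preference cycles.
voters = [("A", "B", "C"), ("B", "C", "A"), ("C", "A", "B")]

def majority_prefers(x: str, y: str) -> bool:
    wins = sum(1 for ranking in voters if ranking.index(x) < ranking.index(y))
    return wins > len(voters) / 2

for x, y in [("A", "B"), ("B", "C"), ("C", "A")]:
    print(f"majority prefers {x} over {y}: {majority_prefers(x, y)}")
# All three lines print True: the aggregated ordering cycles A > B > C > A,
# so no single consistent "society preference" exists for these voters.
```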
if we have a goal described in a way that is robust to ontological shifts due to the Natural Abstractions Hypothesis holding in some way, then one can simply provide this AI system this goal and allow it to do whatever it considers necessary to maximize that goal.
This is not a sound assumption when it comes to continued implementation in the outside world. Therefore, reasoning based on that assumption about how alignment would work within a mathematical toy model is also unsound.
I think the distinction you are trying to make is roughly that between ‘implicit/aligned control’ and ‘delegated control’ as terms used in this paper: https://dl.acm.org/doi/pdf/10.1145/3603371
Both still require control feedback processes built into the AGI system/infrastructure.
Can you think of any example of an alignment method being implemented soundly in practice without use of a control feedback loop?
Agreed (and upvoted).
It’s not strong evidence of impossibility by itself.
and thus is motivated to find reasons for alignment not being possible.
I don’t get this sense.
More like Yudkowsky sees the rate at which AI labs are scaling up and deploying code and infrastructure of ML models, and recognises that there are a bunch of known core problems that would need to be solved before there is any plausible possibility of safely containing/aligning AGI optimisation pressure toward outcomes.
I personally think some of the argumentation around AGI being able to internally simulate the complexity in the outside world and play it like a complicated chess game is unsound. But I would not attribute the reasoning in eg. the AGI Ruin piece to Yudkowsky’s cult of personality.
dangerous AI systems
I was gesturing back at “AGI” in the previous paragraph here, and something like precursor AI systems before “AGI”.
Thanks for making me look at that. I just rewrote it to “dangerous autonomous AI systems”.
The premise that “infinite value” is possible, is an assumption.
This seems a bit like the presumption that “divide by zero” is possible. Assigning a probability to the possibility that dividing by zero results in a value doesn’t make sense, I think, because the logical rules themselves rule this out.
However, if I look at this together with your earlier post (http://web.archive.org/web/20230317162246/https://www.lesswrong.com/posts/dPCpHZmGzc9abvAdi/orthogonality-thesis-is-wrong): I think I get where you’re coming from in that if the agent can conceptualise that (many) (extreme) high-value states are possible where those values are not yet known to it, yet still plans for those value possibilities in some kind of “RL discovery process”, then internal state-value optimisation converges on power-seeking behaviour — as optimal for reaching the expected value of such states in the future (this further assumes that the agent’s prior distribution lines up – eg. assumes unknown positive values are possible, does not have a prior distribution that is hugely negatively skewed over negative rewards).
I think specifying premises such as these more precisely at the outset ensures the reasoning from there is consistent/valid. The above would not apply to any agent, nor even to any “AGI” (a fuzzy term; I would define it more specifically as “fully-autonomous, cross-domain-optimising, artificial machinery”).
Great overview! I find this helpful.
Next to intrinsic optimisation daemons that arise through training internal to hardware, suggest adding extrinsic optimising "divergent ecosystems" that arise through deployment and gradual co-option of (phenotypic) functionality within the larger outside world.
AI Safety research has so far focussed more on internal code (particularly by CS/ML researchers) computed deterministically (within known statespaces, as mathematicians like to represent it) – rather than on complex external feedback loops that are uncomputable, given Good Regulator Theorem limits and the inherent noise interference on signals propagating through the environment (as would be intuitive for some biologists and non-linear dynamics theorists).
So extrinsic optimisation is easier for researchers in our community to overlook. See this related paper by a physicist studying origins of life.
Unfortunately, perhaps due to the prior actions of others in your same social group, a deceptive frame of interpretation is more likely to be encountered first, effectively 'inoculating' everyone else in the group against an unbiased receipt of any further information.
Written in 2015. Still relevant.
Say the Illusion of Truth effect and the Ambiguity Effect are each biasing how researchers in AI Safety evaluate one of the options below.
If you had to choose, which bias would more likely apply to which option?
- A: Aligning AGI to be safe over the long term is possible in principle.
- B: Long-term safe AGI is impossible fundamentally.
it needs to plug into the mathematical formalizations one would use to do the social science form of this.
Could you clarify what you mean with a "social science form" of a mathematical formalisation?
I'm not familiar with this.
they're right to look at people funny even if they have the systems programming experience or what have you.
It was expected and understandable that people look funny at the writings from a multi-skilled researcher with new ideas that those people were not yet familiar with.
Let's move on from first impressions.
simulation
If with 'simulation' we can refer to a model that is computed to estimate a factor on which further logical deduction steps are based, that would connect up with Forrest's work (it's not really about multi-agent simulation though).
Based on what I learned from Forrest, we need to distinguish the 'estimation' factors from the 'logical entailment' factors. The notion of "proof" applies only to that which can be logically entailed. Everything else is a matter of assessment. In each case, we need to be sure we are doing the modelling correctly.
For example, it could be argued that step 'b' below is about logical entailment, though according to Forrest most would argue that it is an assessment. Given that it depends on both physics and logic (via comp-sci modelling), it depends on how one regards the notion of 'observation', and whether that is empirical or analytic observation.
- b; If AGI/APS is permitted to continue to exist, then it will inevitably, inexorably, implement and manifest certain convergent behaviors.
- c; that among these inherent convergent behaviors will be at least all of:
  - 1; to/towards self existence continuance promotion.
  - 2; to/towards capability building capability, a increase seeking capability, a capability of seeking increase, capability/power/influence increase, etc.
  - 3; to/towards shifting ambient environmental conditions/context to/towards favoring the production of (variants of, increases of) its artificial substrate matrix.
Note again: the above is not formal reasoning. It is a super-short description of what two formal reasoning steps would cover.
Really appreciate you sharing your honest thoughts here, Rekrul.
From my side, I’d value actually discussing the reasoning forms and steps we already started to outline on the forum. For example, the relevance of intrinsic vs extrinsic selection and correction, or the relevance of the organic vs. artificial substrate distinction. These distinctions are something I would love to openly chat about with you (not the formal reasoning – I’m the bridge-builder, Forrest is the theorist).
That might feel unsatisfactory – in the sense of “why don’t you just give us the proof now?”
As far as I can tell (Forrest can correct me later), there are at least two key reasons:
- There is a tendency amongst AI Safety researchers to want to cut straight to judging the believability of the conclusion itself. For example, notice that I tried to clarify several argument parts in comment exchanges with Paul, with little or no response. People tend to believe that this would be the same as judging a maths proof over idealised deterministic and countable spaces. Yet formal reasoning here would have to reference and build up premises from physical theory in indeterministic settings. So we actually need to clarify how a different form of formal reasoning is required here, one that does not look like what would be required for P=NP. Patience is needed on the side of our interlocutors.
- While Forrest does have most of the argument parts formalised, his use of precise analytical language and premises is not going to be clear to you. Mathematicians are not the only people who use formal language and reasoning steps to prove impossibilities by contradiction. Some analytical philosophers do too (as do formal verification researchers in industrial software engineering, using different notation for logic transformation, etc.). No amount of “just give the proof to us and leave it to us to judge” lends us confidence that the judging would track the reasoning steps – if those people already did not track correspondences of some first basic argument parts described in the explanatory writings by Forrest or me that their comments referred to. Even if they are an accomplished mathematician, they are not going to grasp the argumentation if they skim through the text, judging it based on their preconception of what language the terms should be described in or how the formal reasoning should be structured.
I get that people are busy, but this is how it is. We are actually putting a lot of effort and time into communication (and are very happy to get your feedback on that!). And to make this work, they (or others) will need to put in commensurate effort on their end. It is up to them to show that they are not making inconsistent jumps in reasoning there, or talking in terms of their intuitive probability predictions about the believability of the end result, where we should be talking about binary logic transformations.
And actually, such nitty-gritty conversations would be really helpful for us too! Here is what I wrote before in response to another person’s question whether a public proof is available:
Main bottleneck is (re)writing it in a language that AI(S) researchers will understand without having to do a lot of reading/digging in the definitions of terms and descriptions of axioms/premises. A safety impossibility theorem can be constructed from various forms that are either isomorphic with others or are using separate arguments (eg. different theoretical limits covering different scopes of AGI interaction) to arrive at what seems to be an overdetermined conclusion (that long-term AGI safety is not possible).
We don't want to write it out so long that most/all readers drop out before they get to parse through the key reasoning steps. But we also do not want to make it so brief and dense that researchers are confused about at what level of generality we're talking about, have to read through other referenced literature to understand definitions, etc.
Also, one person (a grant investigator) has warned us that AI safety researchers would be too motivated against the conclusion (see 'belief bias') that few would actually attempt to read through a formal safety impossibility theorem. That's indeed likely based on my exchanges so far with AIS researchers (many of them past organisers or participants of AISC). So that is basically why we are first writing a condensed summary (for the Alignment Forum and beyond) that orders the main arguments for long-term AGI safety impossibility without precisely describing all axioms and definitions of terms used, covering all the reasoning gaps to ensure logical consistency, etc.
Note: Forrest has a background in analytical philosophy; he does not write in mathematical notation. Another grant investigator we called with had the expectation that the formal reasoning is necessarily written out in mathematical notation (a rough post-call write-up consolidating our impressions and responses to that conversation): https://mflb.com/ai_alignment_1/math_expectations_psr.html
Also note that Forrest’s formal reasoning work was funded with a $170K grant from the Survival and Flourishing Fund. So some grant investigators were willing to bet on this work with money.
One thing Paul talks about constantly is how useful it would be if he had some hard evidence a current approach is doomed, as it would allow the community to pivot. A proof of alignment impossibility would probably make him ecstatic if it was correct (even if it puts us in quite a scary position).
I respect this take then by Paul a lot. This is how I also started to think about it a year ago.
BTW, I prefer you being blunt, so glad you’re doing that.
A little more effort to try to understand where we could be coming from would be appreciated. Particularly given what’s at stake here – a full extinction event.
Neither Forrest nor I have any motivation to post unsubstantiated claims. Forrest because frankly, he does not care one bit about being recognised by this community – he just wants to find individuals who actually care enough to consider the arguments rigorously. Me because all I’d be doing is putting my career at risk.
You can't complain about people engaging with things other than your idea if the only thing they can even engage with is your idea.
The tricky thing here is that a few people are reacting by misinterpreting the basic form of the formal reasoning at the onset, and judging the merit of the work by their subjective social heuristics.
Which does not lend me (nor Forrest) confidence that those people would do a careful job at checking the term definitions and reasoning steps – particularly if written in precise analytic language that is unlike the mathematical notation they’re used to.
The filter goes both ways.
Instead you have decided to make this post and trigger more crank alarms.
Actually, this post was written in 2015 and I planned last week to reformat it and post it. Rereading it, I’m just surprised how well it appears to line up with the reactions.
The problem of a very poor signal to noise ratio in messages received from people outside of the established professional group basically means that the risk of discarding a good proposal from anyone regarded as an outsider is especially high.
This insight feels relevant to a comment exchange I was in yesterday. An AI Safety insider (Christiano) lightly read an overview of work by an outsider (Landry). The insider then judged the work to be "crankery", in effect acting as a protecting barrier against other insiders having to consider the new ideas.
The sticking point was the claim "It is 100% possible to know that X is 100% impossible", where X is a perpetual motion machine or a 'perpetual general benefit machine' (ie. long-term safe and beneficial AGI).
The insider believed this was an exaggerated claim, which meant we first needed to clarify epistemics and social heuristics, rather than the substantive argument form. The reactions by the busy "expert" insider, who had elected to judge the formal reasoning, led to us losing trust that they would proceed in a patient and discerning manner.
There was simply not enough common background and shared conceptual language for the insider to accurately interpret the outsider's writings ("very poor signal to noise ratio from messages received").
Add to that:
- Bandwagon effect: "the tendency to believe that [long-term safe AGI is possible] because many other people do"
- Naive realism: "that the facts are plain for all to see; that rational people will agree with us [that long-term safe AGI is possible]; and that those who do not are either uninformed, lazy, irrational, or biased."
- Belief bias: "Where the evaluation of the logical strength of an argument is biased by the believability of the conclusion [that long-term safe AGI is impossible]... The difficulty is that we want to apply our intuition too often, particularly because it is generally much faster/easier than actually doing/implementing analytic work... Arguments which produce results contrary to one's own intuition about what "should" or "is expected" be the case are also implicitly viewed as somewhat disabling and invalidating of one's own expertise, particularly if there also is some self-identification as an 'expert'. No one wants to give up cherished notions regarding themselves. The net effect is that arguments perceived as 'challenging' will be challenged (criticized) somewhat more fully and aggressively than rationality and the methods of science would have already called for."
- Conservatism bias: "People do not want to be seen as having strong or 'extreme opinions', as this in itself becomes a signal from that person to the group that they are very likely to become 'not a member' due to their willingness to prefer the holding of an idea as a higher value than they would prefer being regarded as a member in good standing in the group. Extreme opinions [such as that it is 100% possible to know that long-term safe AGI is 100% impossible] are therefore to be regarded as a marker of 'possible fanaticism' and therefore of that person being in the 'out crowd'."
- Status quo bias; System justification: "The tendency to like things to stay relatively the same. The tendency to defend and bolster the status quo [such as resolving to build long-term safe AGI, believing that it is a hard but solvable problem]. Existing social, economic, and political arrangements tend to be preferred, and alternatives disparaged sometimes even at the expense of individual and collective self-interest."
- Reactance: "The degree to which these various bias effects occur is generally in proportion to a motivating force, typically whenever there is significant money, power, or prestige involved. Naturally, doing what someone 'tells you to do' [like accepting the advice to not cut to the chase and instead spend the time to dig into and clarify the arguments with us, given the inferential distance] is a signal of 'low status' and is therefore to be avoided whenever possible, even if it is a good idea."
I mean, someone recognised as an expert in AI Safety could consciously mean well in trying to judge an outsider's work accurately – in the time they have. But that's a lot of biases to counteract.
Forrest actually clarified the claim further to me by message:
Re "100%" or "fully knowable":
By this, I usually mean that the analytic part of an argument is fully finite and discrete, and that all parts (statements) are there, the transforms are enumerated, known to be correct etc (ie, is valid).
In regards to the soundness aspect, that there is some sort of "finality" or "completeness" in the definitions, such that I do not expect that they would ever need to be revised (ie, is at once addressing all necessary aspects, sufficiently, and comprehensively), and that the observations are fully structured by the definitions, etc. Usually this only works for fairly low level concepts, things that track fairly closely to the theory of epistemology itself -- ie, matters of physics that involve symmetry or continuity directly (comparison) or are expressed purely in terms of causation, etc.
One good way to test the overall notion is that something is "fully 100% knowable" if one can convert it to a computer program, and the program compiles and works correctly. The deterministic logic of computers cannot be fooled, as people sometimes can, as there is no bias. This may be regarded by some as a somewhat high standard, but it makes sense to me as it is of the appropriate type: ie, a discrete finite result being tested in a purely discrete finite environment. Hence, nothing missing can hide.
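To make that "convert it to a computer program" test concrete, here is a minimal sketch of my own (the claim checked is a trivial placeholder, not anything from Forrest's work): a discrete, finite claim is checked exhaustively by a deterministic program, so that no unexamined case can hide.

```python
# Minimal sketch (my own illustration, not Forrest's formalism):
# a "fully 100% knowable" claim as a discrete, finite, exhaustively checked statement.

def claim_holds(n: int) -> bool:
    """Placeholder claim: doubling any integer in the finite domain yields an even number."""
    return (n + n) % 2 == 0

def verify_claim(domain_size: int = 2**16) -> bool:
    # The domain is finite and enumerated in full, so the deterministic check
    # leaves no case unexamined -- "nothing missing can hide".
    return all(claim_holds(n) for n in range(domain_size))

if __name__ == "__main__":
    print("Claim verified over the entire finite domain:", verify_claim())
```

The point of the sketch is only the shape of the test – a finite domain, a discrete claim, a deterministic check – not that this particular claim is interesting.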
But the point is – few readers will seriously consider this message.
That's my experience, sadly.
The common reaction I noticed too from talking with others in AI Safety is that they immediately devalued that extreme-sounding conclusion, which is based on the research of an outsider. A conclusion that goes against their prior beliefs, and against their role in the community.
Your remarks make complete sense.
Forrest mentioned that for most people, reading his precise "EGS" format will be unparsable unless one has had practice with it. Also agreed that there is no background or context. The "ABSTract" is really often too brief a note, usually just a reminder of what the overall idea is. And the text itself IS internal notes, as you have said.
He says that it is a good reminder that he should remember to convert "EGS" to normal prose before publishing. He does not always have the energy or time or enthusiasm to do it. Often it requires a lot of expansion too – ie, some writing has to expand to 5 times its "EGS" size.
I'll also work on this! There's a lot of content to share, but I will try to format and rephrase it to be easier to follow for readers on LessWrong.
It's worth noting up front that this sounds pretty crazy...
So this is looking pretty cranky right from the top, and hopefully you can sympathize with someone who has that reaction.
I get that this comes across as a strong claim, because it is.
So I do not expect you to buy that claim in one go (it took me months of probing the premises and the logic of Forrest’s arguments). It’s reasonable and epistemically healthy to be curiously skeptical at the outset, and to try to both gain new insights from the writing and probe for inconsistencies.
Though I must say I’m disappointed that based on your light reading, you dismiss Forrest’s writings (specifically, the few pages you read) as crankery. Let me get back on that point.
"It is 100% possible to know that X is 100% impossible" would be an exaggerated claim… even if X was "perpetual motion machines".
Excerpting from Forrest's general response:
For one thing, it is not just the second law of thermodynamics that "prohibits" (ie, 'makes impossible') perpetual motion machines – it is actually the notion of "conservation law" – ie, that there is a conservation of matter and energy, and that the sum total of both together, in any closed/contained system, can neither be created nor destroyed. This is actually a much stronger basis on which to argue, insofar as it is directly an instance of an even more general class of concept, ie, one of symmetry.
All of physics – even the notion of lawfulness itself – is described in terms of symmetry concepts. This is not news, it is already known to most of the most advanced theoretical working physicists.
Basically, what [Paul] suggests is that anything that asserts or accepts the law of the conservation of matter and energy, and/or makes any assertion based strictly on only and exactly such conservation law, would be a categorical example of "an exaggerated claim", and that therefore he is suggesting that we, following his advice, should regard conservation law – and thus actually the notion of symmetry, and therefore also the notion of 'consistent truth' (ie, logic, etc) – as an "insufficient basis" of proof and/or knowing.
This is, of course, too high a standard, insofar as, once one is rejecting of symmetry, there is no actual basis of knowing at all, of any kind at all, beyond such a rejection – there is simply no deeper basis for the concept of truth that is not actually about truth. That leaves everyone reading his post implicitly with him being the 'arbiter' of what counts as "proof". Ie, he has explicitly declared that he rejects the truth of the statement that it is "100% possible to know...", (via the laws of conservation of matter and energy, as itself based on only the logic of symmetry, which is also the basis of any notion of 'knowing'), "...that real perpetual motion machines are 100% impossible" to build, via any engineering technique at all, in the actual physical universe.
The reason that this is important is that the same notion – symmetry – is also the very most essential essence of what it means to have any consistent idea of logical truth. Ie, every transition in every math proof is a statement in the form "if X is true, then by known method Y, we can also know that Z is true". Ie, every allowed derivation method (ie, the entire class (set 'S') of accepted/agreed Y methods allowable for proof) is effectively a kind of symmetry – it is a 'truth preserving transformation', just like a mirror or reflection is a 'shape preserving transformation'. Ie, for every allowable transformation, there is also an allowed inverse transformation, so that "If Z is true, then via method inverse Y, we can also know that X is true". This sort of symmetry is the essence of what is meant by 'consistent' mathematical system.
It is largely because of this common concept – symmetry – that both math and physics work so well together.
Yet we can easily notice that anything that is a potential outcome of “perpetual general benefit machines” (ie. AGI) results in all manner of exaggerated claims.
Turning to my response:
Perhaps, by your way of defining the statement “100% possible to know”, it is not enough that a boolean truth is consistently knowable within a model premised on 100% repeatedly empirically verified (ie. never once known to be falsified by observation) physical or computational theory?
Rather, perhaps the claim “100% possible to know” would in your view additionally require the unattainable completeness of past and future observation-based falsification of hypotheses (Solomonoff induction in a time machine)? Of course, we can theorise about how you model this.
I would ask: how then, given that we do not and cannot have "Solomonoff induction in a time machine", can we soundly establish any degree of probability of knowing? To me, this seems like theorising about the extent to which idealised Bayesian updating would change our minds, without our minds having access to the idealised Bayesian updating mechanism.
So to go back to your analogy, how would we soundly prove, by contradiction, that a perpetual motion machine is impossible?
My understanding is that you need more than consistent logic to model that. The formal model needs to be grounded in empirically sound premises about how the physical world works – in this case, the second law of thermodynamics based on the even more fundamental law of conservation of matter and energy.
You can question the axioms of the model – maybe if we collected more observations, the second law of thermodynamics turns out not to be true in some cases? Practically, that’s not a relevant question, because all we’ve got to go on is the observations we’ve got until now. In theory, this question of receiving more observations is also not relevant to whether you can prove (100% soundly know) within the model that the machine cannot work into perpetuity (is 100% impossible) – yes, you can.
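For readers who want the shape of that in-model proof spelled out, here is a minimal sketch in standard textbook terms (my own illustration of the conservation argument, not Forrest's formal reasoning):

```latex
% Sketch: perpetual motion machine of the first kind, contradicted by conservation of energy.
% For any machine operating in a closed cycle, internal energy returns to its starting value:
\Delta U_{\text{cycle}} = Q_{\text{in}} - W_{\text{out}} = 0
\quad \Rightarrow \quad W_{\text{out}} = Q_{\text{in}}.
% The machine is supposed to output net work forever with no energy input:
W_{\text{out}} > 0 \quad \text{while} \quad Q_{\text{in}} = 0,
% which contradicts W_out = Q_in. Within the model, the impossibility is exact;
% only the premise (conservation) remains empirical, and it has never once been
% observed to be violated.
```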
Similarly, take the proposition of an artificial generally-capable machine (“AGI”) working in “alignment with” continued human existence into perpetuity. How would you prove that proposition to be impossible, by contradiction?
To prove, based on sound axioms, that the probability of AGI causing outcomes out of line with any condition needed for the continued existence of organic life converges on 100% (in theory over infinite time; in practice actually over decades or centuries), you would need to ground the theorem in how the physical world works.
I imagine you reacting skeptically here, perhaps writing back that there might be future observations that contradict the conclusions (like everyone not dying) or updates to model premises (like falsification of information signalling underlying physics theory) with which we would end up falsifying the axioms of this model.
By this use of the term “100% possible to know” though, I guess it is also not 100% possible to know that 2 + 2 = 5 is 100% impossible as a result?
Maybe we’re wrong about axioms of mathematics? Maybe at some point mathematicians falsify one of the axioms as not soundly describing how truth content is preserved through transformations? Maybe you actually have not seen anyone yet write out the formal reasoning steps (ie. you cannot tell yet if the reasoning is consistent) for deriving 2 + 2 = 4 ? Maybe you misremember the precise computational operations you or other mathematicians performed before and/or the result derived, leading you to incorrectly conclude that 2 + 2 = 4?
I’m okay with this interpretation or defined use of the statement “100% possible to know”. But I don’t think we can do much regarding knowing the logic truth values of hypothetical outside-of-any-consistent-model possibilities, except discuss them philosophically.
That interpretation cuts both ways btw. Clearly then, it is by far not 100% possible to know whether any specific method(s) would maintain the alignment of generally-capable self-learning/modifying machinery existing and operating over the long term (millennia+) such as not to cause the total extinction of humans.
To be willing to build that machinery, or in any way lend public credibility or resources to research groups building that machinery, you’d have to be pretty close to validly and soundly knowing that it is 100% possible that the machinery will stay existentially safe to humans.
Basically, for all causal interactions the changing machinery has with the changing world over time, you would need to prove (or guarantee above some statistical threshold) that the consequent (final states of the world) “humans continue to exist” can be derived as a near-certain possibility from the antecedent (initial states of the world).
Or inversely, you can do the information-theoretically much easier thing of proving that, while many different final states of the world could result from the initial state of the world, the one state excluded from all those possibilities is “humans continue to exist.”
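To make the contrast concrete, here is one possible way to write down the two proof obligations, in my own notation and under a simple state-transition framing (this is an illustration, not Forrest's formalism):

```latex
% Let S_0 be the initial world state, and Reach_T(S_0) the set of world states
% reachable within horizon T through all causal interactions of the changing
% machinery with the changing world.
%
% Safety proof obligation (what would license building the machinery):
\forall T,\; \forall s \in \mathrm{Reach}_T(S_0):\ \mathrm{HumansExist}(s)
% must hold with near-certainty.
%
% Impossibility claim (the inverse, in its limiting form):
\exists T^{*} \;\text{such that}\; \forall s \in \mathrm{Reach}_{T^{*}}(S_0)
\;\text{under continued AGI operation}:\ \neg\, \mathrm{HumansExist}(s).
% i.e. of the many final states that remain possible, "humans continue to exist"
% is the one excluded.
```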
Morally, we need to apply the principle of precaution here – it is much easier for new large-scale technology to destroy the needed physical complexity for humans to live purposeful and valued lives than to support a meaningful increase in that complexity.
By that principle, the burden of proof – that the methods you publicly communicate could or would actually maintain alignment of the generally-capable machinery – is on you.
You wrote the following before in explaining your research methodology:
“But it feels to me like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out — it should be possible to get a model we build to do at least roughly what we want.”
To put it frankly: does the fact that you write “it feels like” let you off the hook here?
Ie. since you were epistemically humble enough to not write that you had any basis to make that claim (you just expressed that it felt like this strong claim was true), you have a social license to keep developing AGI safety methods in line with that claim?
Does the fact that Forrest does write that he has a basis for making the claim – after 15 years of research and hundreds of dense explanatory pages (to try to bridge the inferential gap to people like you) – that long-term safe AGI is 100% impossible, mean he is not epistemically humble enough to be taken seriously?
Perhaps Forrest could instead write “it feels like that we cannot build an AGI model to do and keep doing roughly what we want over the long term”. Perhaps then AI Safety researchers would have resonated with his claim and taken it as true at face value? Perhaps they’d be motivated to read his other writings?
No, the social reality is that you can claim “it feels that making the model/AGI work roughly like we want is possible” in this community, and readers will take it at face value as prima facie true.
Forrest and I have claimed – trying out various pedagogical angles and ways of wording – that “it is impossible to have AGI work roughly as we want over the long term” (not causing the death of all humans, for starters). So far, of the dozens of AI safety people who had one-on-one exchanges with us, most of our interlocutors reacted skeptically immediately and then came up with all sorts of reasons not to continue reading/considering Forrest's arguments. Which is exactly why I put up this post about "presumptive listening" to begin with.
You have all of the community’s motivated reasoning behind you, which puts you in the socially safe position of not being pressed any time soon by more than a few others in the community to provide a rigorous basis for your “possibility” claim.
Slider's remark that your commentary seems to involve an isolated demand for rigour resonated with me. The phrase in my mind was "double standards". I'm glad someone else was willing to bring this point up to a well-regarded researcher, before I had to.
It's clear that AI systems can change their environment in complicated ways and so analyzing the long-term outcome of any real-world decision is hard. But that applies just as well to having a kid as to building an AI, and yet I think there are ways to have a kid that are socially acceptable. I don't think this article is laying out the kind of steps that would distinguish building an AI from having a kid.
I will clarify a key distinction between building an AGI (ie. not just any AI) and having a kid:
One thing about how the physical world works, is that in order for code to be computed, this needs to take place through a physical substrate. This is a necessary condition – inputs do not get processed into outputs through a platonic realm.
Substrate configurations in this case are, by definition, artificial – as in artificial general intelligence. This as distinct from the organic substrate configurations of humans (including human kids).
Further, the ranges of conditions needed for the artificial substrate configurations to continue to exist, function and scale up over time – such as extreme temperatures, low oxygen and water, and toxic chemicals – fall outside the ranges of conditions that humans and other current organic lifeforms need to survive.
Hope that clarifies a long-term-human-safety-relevant distinction between building AGI (that continues to scale) and having a kid (who grows up to adult size).
I ended up convinced that this isn't about EA community blindspots, the entire scientific community would probably consider this writing to be crankery.
Paul, you read one overview essay in which Forrest briefly outlined how his proof method works, by analogy to a theory that a mathematician like you already knows and understands the machinery of (Galois theory). Then, as far as I can tell, you concluded that since Forrest did not provide the explicit proof (which you expected to find in that essay) and since the conclusion (as you interpret it) seemed unbelievable, the “entire” scientific community would (according to you) probably consider his writing crankery.
By that way of "discerning" new work: if Kurt Gödel had written an outline for researchers in the field to understand his unusual methodology, with the concise conclusion “it is 100% knowable that it is 100% impossible for a formal axiomatic system to be both consistent and complete”, then a well-known researcher in the (Hilbert’s) field could have read that, concluded that Gödel had not immediately given them a proof yet and that the conclusion was unbelievable (such a strong statement!), and therefore that Gödel was probably a crank who should be denounced publicly in the forum as such.
Your judgement seems based on first impressions and social heuristics. On one hand you admit this, and on the other hand you seem to have no qualms with dismissing Forrest’s reasoning a priori.
In effect, you are acting as a gatekeeper – "protecting" others in the community from having to be exposed and meaningfully engage with new ideas. This is detrimental to research on the frontiers that falls outside of already commonly-accepted paradigms (particularly paradigms of this community).
The red flag for us was when you treated 'proof' as probable opinion based on your personal speculative observation, as proxy, rather than as a finite boolean notion of truth based on valid and sound modeling of verified known world states.
Note by Forrest on this:
I notice also that Gödel’s work, if presented for the first time today, would not be counted by him as "a very clear argument". The Gödel proof, as given then, was actually rather difficult and not at all obvious. Gödel had to construct an entire new language and self reference methodology for the proof to even work. The inferential distance for Gödel was actually rather large, and the patience needed to understand his methods, which were not at all common at the time, would not have passed the "sniff test" being applied by this person here, in the modern era, where the expectation is that everything can be understood on a single pass reading one post on some forum somewhere while on the way to some other meeting. Modern social media simply does not work well for works of these types. So the Gödel work, and the Bell Theorem, and/or anything else similarly both difficult and important, simply would not get reviewed by most people in today's world.
Noting that your writing in response also acts as the usual filter for us. It does not show willingness yet to check the actual form or substance of Forrest’s arguments. This marks someone who is not available to reason with (they probably have no time, patience, or maybe no actual interest) but who nonetheless seems motivated to signal to their ingroup that they have an opinion pertaining to the outgroup.
The claim that 'nothing is knowable for sure' and 'believe in the possibilities' (all of that maybe good for humanity) is part of the hype cycle. It ends up being marketing. So the crank accusation ends up being the filter of who believes the marketing, and who does not – who is in the in-crowd and who are 'the outsiders'.
Basically, he accepts that a violation of symmetry would be (should be) permissible -- hence allowing maybe at least some slight possibility that some especially creative genius type engineering type person might someday eventually actually make a working perpetual motion machine, in the real universe. Of course, every crank wants to have such a hope and a dream -- the hype factor is enormous – "free energy!" and "unlimited power!!" and "no environmental repercussions" – utopia can be ours!!!.
You only need to believe in the possibility, and reject the notion of 100% certainty. Such a small cost to pay. Surely we can all admit that sometimes logic people are occasionally wrong?
The irony of all of this is that the very notion of "crank" is someone who wants dignity and belonging so badly that they will easily and obviously reject logic (ie, symmetry), such that their 'topic arguments' have no actual merit. Moreover, given that a 'proof' is something that depends on every single transformation statement actually being correct, even a single clear rejection of their willingness to adhere to sensible logic is effectively a clear signal that all other arguments (and communications) by that person – now correctly identified as the crank – are to be rejected, as their communications is/are actually about social signaling (a kind of narcissism or feeling of rejection – the very essence of being a crank) rather than about truth. Hence, once someone has made even one single statement which is obviously a rejection of a known truth, ie, that they do not actually care about the truth of their arguments, then everything they say is/are to be ignored by everyone else thereafter.
And yet the person making the claim that my work is (probably) crankery has actually done exactly that, engaged in crankery, by their own process. He has declared that he rejects the truth of the statement (and moreover has very strongly suggested that everyone else should also reject the idea) that it is 100% possible to know, via the laws of conservation of matter and energy, (as itself based on only the logic of symmetry, which is also the basis of the notion of 'knowing'), that real perpetual motion machines are 100% impossible to build, via any engineering technique at all, in the actual physical universe.
In ancient times, a big part of the reason for spicy foods was to reject parasites in the digestive system. In places where sanitary conditions are difficult (warmer climates encourage food spoilage), spicy foods tend to be more culturally common. Similar phenomena can occur in communication – 'reading' and 'understanding' as a kind of mental digestive process – via the use of 'spicy language'. The spice I used was the phrase "It is 100% possible to know that X is 100% impossible". It was put there by design – I knew and expected it would very likely trigger some types of people, and thus help me to identify at least a few of the people who engage in social signaling over rigorous reasoning – even if they are also the ones making the same accusation of others. The filter goes both ways.
So that leaves your last point, about self-awareness:
I think this shows a lack of self awareness. Right now the state of play is more like the author of this document arguing that "everyone else is wrong," not someone who is working on AI safety.
Forrest is not an identifiable actively contributing member to “AI safety” (and also therefore not part of our ingroup).
Thus, Forrest pointing out that there are various historical cases where young men kept trying to solve impossible problems — for decades, if not millennia — all the while claiming those problems must be possible to solve after all through some method, apparently says something about Forrest and nothing at all about there being a plausible analogy with AGI Safety research…?
Good to know, thank you. I think I’ll just ditch the “separate claims/arguments into lines” effort.
Forrest also just wrote me: “In regards to the line formatting, I am thinking we can, and maybe should (?) convert to simple conventional wrapping mode? I am wondering if the phrase breaks are more trouble than they are worth, when presenting in more conventional contexts like LW, AF, etc. It feels too weird to me, given the already high weirdness level I cannot help but carry.”
Example of a statement with a mere exposure effect: “aligning AGI is possible in principle”
A paper that describes a risk-assessment monoculture in evaluating extinction risks: Democratising Risk.
Many ordinary people in Western countries do and will have [investments in AI/robots] (if only for retirement purposes), and will therefore receive a fraction of the net output from the robots.
... Of course, many people today don't have such investments. But under our existing arrangements, whoever does own the robots will receive the profits and be taxed. Those taxes can either fund consumption directly (a citizen's dividend, dole, or suchlike) or (better I think) be used to buy capital investments in the robots - such purchases could be distributed to everyone.
...Given the potentially immense productivity of zero-human-labor production, even a very small investment in robots might yield dividends supporting a lavish lifestyle.
I appreciate the nuance.
My takes:
- Yes, I would also expect many non-tech-people in the Global North to invest in AI-based corporations, if only by investing savings in an (equal or market-cap weighted) index fund.
- However, this still results in a much stronger inequality of incomes and savings than in the current economy, because in-the-know tech investors will keep reinvesting profits into high-RoI (and likely highly societally extractive) investments for scaling up AI and connected machine infrastructure.
- You might argue that if most people (in the Global North) are still able to live lavish lifestyles relative to current lifestyles, that would not be too bad. However, Forrest's arguments go further than that.
- Technology would be invested into and deployed most by companies (particularly those led by power-hungry leaders with Dark Triad traits) that are (selected by market profits for being) the most able to extract and arbitrage fungible value through the complex local cultural arrangements on which market exchanges depend to run. So basically, the GDP growth you would measure from the outside would not concretely translate into "robots give us lavish lifestyles". It actually would look like depleting all what's out there for effectively and efficiently marketing and selling "products and services" that are increasingly mismatched with what we local humans deeply care about and value.
- I've got a post lined up exploring this.
- Further, the scaling up of automated self-learning machinery will displace scarce atomic and energy resources toward producing and maintaining artificial robots, in the place of reproducing and protecting organic humans. This would rapidly accelerate what we humans started in exploiting natural resources for our own tribal and economic uses (cutting down forests and so on), destroying the natural habitats of other organic species in the process (connected ecosystems that humans too depend on for their existence). Except, this time, the human markets, the human cultures, and the humans themselves are the ones to go.
Appreciating your honesty, genuinely!
Always happy to chat further about the substantive arguments. I was initially skeptical of Forrest’s “AGI-alignment is impossible” claim. But after probing and digging into this question intensely over the last year, I could not find anything unsound (in terms of premises) or invalid (in terms of logic) about his core arguments.
Responding below:
- That prior for most problems being solvable is not justified. For starters, because you did not provide any reasons above to justify why beneficial AGI is not like a perpetual motion machine, AKA a “perpetual general benefit machine”.
See reasons to shift your prior: https://www.lesswrong.com/posts/Qp6oetspnGpSpRRs4/list-3-why-not-to-assume-on-prior-that-agi-alignment
- Again no reasons given for the belief that AGI alignment is “progressing” or would have a “fair shot” of solving “the problem” if as well resourced as capabilities research. Basically nothing to argue against, because you are providing no arguments yet.
- No reasons given, again. Presents instrumental convergence and intrinsic optimisation misalignment failures as the (only) threat models in terms of artificial general intelligence incompatibility with organic DNA-based life. Overlooks substrate-needs convergence.
Let me also copy over Forrest’s (my collaborator) notes here:
> people who believe false premises tend to take bad actions.
Argument 3:
- 1; That AGI can very easily be hyped so that even smart people can be made to falsely/incorrectly believe that there "might be" _any_chance_at_all_ that AGI will "bring vastly positive changes".
  - ie, strongly motivated marketing will always be stronger than truth, especially when VC investors can be made to think (falsely) that they could maybe get 10000X return on investment.
  - that the nature of AGI, being unknown and largely artificial, futuristic, saturated with modernism and tech optimism, high geekery, and also very highly funded, means that AGI capabilities development has arbitrary intelligent marketing support.
- 2; People (and nearly all other animals) are mostly self-oriented.
  - that altruism is usually essentially social signalling, and is actually of very little value as benefit to anything other than maybe some temporary social prestige building.
  - ie, each possibly participating person will see the possibility that maybe they could ride "up to riches" on the research bandwagon, and/or on any major shift in the marketing dynamics; that in any change there will be winners and losers, and they want a chance to be "on the winning side", since everyone has bio-builtin social/market game addiction tendencies (biases) and they think that they can use their high intelligence to gain some personal strategic advantage.
- 3; People are *selectively* rational.
  - Ie, that we should not expect deviations from rational agent models, because our selective notion of rationality will likely match our *also* self-selected models of 'rational actors'.
  - as such, we can expect that there will be all sorts of seemingly rational "arguments" that suggest that individual selfish and self supporting action (favoring tech development) is maybe "mostly harmless", and that at least some of the risks are maybe over emphasized, and that "therefore" we should maybe shift our actions towards the more (manufactured) "consensus" that the "robustly good" action is "keep doing AGI capability development" and also "increase safety work" -- and to be assuming that anything else is either impossible or maybe "robustly bad", or that at the very least, that the things that seem obvious are probably not at all obvious, for complicated "rational reasons" that just happen to align with their motivated preferred view.
- 4; thus the false belief that there "might be" some non-zero small chance that AGI can be "aligned" so as to bring about whatever positive changes (hype the huge return on investment!) is so strong/motivating that it dominates all other considerations.
  - as that selective motivated reasoning in the possibility that someone can be part of the winning team and make history is so strong that even the suggestion that the very notion of *any* AGI persistently existing is inherently contradictory with the notion of the continuing survival of life on this planet is completely rejected without any further examination.
That’s clarifying. I agree that immediately trying to impose costly/controversial laws would be bad.
What I am personally thinking about first here is “actually trying to clarify the concerns and find consensus with other movements concerned about AI developments” (which by itself does not involve immediate radical law reforms).
We first need to have a basis of common understanding from which legislation can be drawn.
I think there are a bunch of relevant but subtle differences in terms of how we are thinking about this. My beliefs after quite a lot of thinking are:
A. Most people don’t care about the tech singularity. People are captured by the AI hype cycles though, especially people who work under the tech elite. The general public is much more wary overall of current uses of AI, and is starting to notice the harms in their daily lives (eg. addictive social media that reinforces ideologies and distorted self-images, exploitative work gigs handed to them by algorithms).
B. The tech singularity, as envisioned in the past, involved a lot of motivated and simplifying reasoning about directing the complex world into utopias using complicated tech – utopias that cannot realistically be brought about using those methods. Tech elites like to co-opt these nerdy utopian visions for their own ends.
C. By your descriptions, I think you are essentialising humans as rational individuals who are socially signalling for self-benefit. I’m actually saying that, yes, people are egocentric right now, particularly in the neoliberalist consumption-oriented market and self-presentation-oriented culture we are exposed to right now. But also, humans are social creatures and can relate and interact based on deeper shared needs. So in that, I’m not essentialising people as fundamentally selfish. I’m saying that within the social environment on top of our tribal and sex+survival oriented psychological predispositions, people come out as particularly egocentric.
D. I don’t think baby steps are going to do it, given that we’re dealing with potential auto-scaling/catalysing technology that would mark the end of organic DNA-based life. The baby steps description reminds me of various scenes in the film “Don’t Look Up” where bystanders kept signalling to the main actors not to “overdo it”.
E. Interpretability techniques are used by tech elites to justify further capability developments. Interpretability techniques do not and cannot contribute to long-term AGI safety (https://www.lesswrong.com/posts/NeNRy8iQv4YtzpTfa/why-mechanistic-interpretability-does-not-and-cannot).
So 1 and 3 were my descriptions about what is actually happening and how that would continue, not about the end conclusion of what’s happening. To disagree with the former, I think you would need to clarify your observations/analysis of why something opposite/different is happening.
Good to read your thoughts.
I would agree that slowing further AI capability generalisation developments down by more than half in the next years is highly improbable. Got to work with what we have.
My mental model of the situation is different.
-
People engage in positively reinforcing dynamics around social prestige and market profit, even if what they are doing is net bad for what they care about over the long run.
-
People are mostly egocentric, and have difficulty connecting and relating, particularly in the current individualistic social signalling and “divide and conquer” market environment.
-
Scaling up deployable capabilities of AI has enough of a chance to reap extractive benefits for narcissistic/psychopathic tech leader types, that they will go ahead with it, while sowing the world with techno-optimistic visions that suit their strategy. That is, even though general AI will (cannot not) lead to wholesale destruction of everything we care about in the society and larger environment we’re part of.
This is insightful for me, thank you!
Also, I stand corrected then on my earlier comment that privacy and digital ownership advocates would/should care about models being trained on their own/person-tracking data, so as to restrict the scaling of models. I’m guessing I was not tracking well then what people in at least the civil rights spaces Koen moves around in are thinking and would advocate for.
re: Leaders of movements being skeptical of the notion of AGI.
Reflecting more, my impression is that Timnit Gebru is skeptical about the sci-fiy descriptions of AGI, and even more so about the social motives of people working on developing (safe) AGI. She does not say that AGI is an impossible concept or not actually a risk. She seems to question the overlapping groups of white male geeks who have been diverting efforts away from other societal issues, to both promoting AGI development and warning of AGI x-risks.
Regarding Jaron Lanier: yes, (re)reading this post I agree that he seems to totally dismiss the notion of AGI, seeing it more as a result of a religious kind of thinking, under which humans toil away at offering the training data necessary for statistical learning algorithms to function, without being compensated.
Returning to the error correction point:
Feel free to still clarify the other reasons why the changes in learning would be stable in preserving “good properties”. Then I will take that starting point to try to explain why the mutually reinforcing dynamics of instrumental convergence and substrate-needs convergence override that stability.
Fundamentally though, we'll still be discussing the application limits of error correction methods.
Three ways to explain why:
- Any workable AI-alignment method involves receiving input signals, comparing input signals against internal references, and outputting corrective signals to maintain alignment of outside states against those references (ie. error correction).
- Any workable AI-alignment method involves a control feedback loop – of detecting the actual (or simulating the potential) effects internally and then correcting actual (or preventing the potential) effects externally (ie. error correction).
- Eg. mechanistic interpretability is essentially about "detecting the actual (or simulating the potential) effects internally" of AI.
- The only way to actually (slightly) counteract AGI convergence on causing "instrumental" and "needed" effects within a more complex environment is to simulate/detect and then prevent/correct those environmental effects (ie. error correction).
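To show the shared structure that all three framings above point at, here is a generic sketch of such a sense-compare-correct loop (the names, the scalar "environment state", and the gain are all hypothetical illustration, not any actual alignment method):

```python
# Generic shape of an error-correcting control loop (illustrative only).
from dataclasses import dataclass

@dataclass
class Reference:
    """Internal reference: the band of outside states counted as 'aligned'."""
    lower: float
    upper: float

def sense(environment_state: float) -> float:
    # Receive an input signal (here trivially the state itself; in practice a
    # lossy, abstracted measurement of the environment).
    return environment_state

def compare(signal: float, reference: Reference) -> float:
    # Compare the input signal against the internal reference; return the error.
    if signal < reference.lower:
        return reference.lower - signal
    if signal > reference.upper:
        return signal - reference.upper
    return 0.0

def correct(state: float, error: float, reference: Reference, gain: float = 0.5) -> float:
    # Output a corrective signal that nudges the outside state back toward
    # the middle of the referenced band.
    if error == 0.0:
        return state
    midpoint = (reference.lower + reference.upper) / 2.0
    return state + gain * (midpoint - state)

def control_loop(state: float, reference: Reference, steps: int) -> float:
    # Sense -> compare -> correct, repeated: the bare shape of error correction.
    for _ in range(steps):
        signal = sense(state)
        error = compare(signal, reference)
        state = correct(state, error, reference)
    return state

if __name__ == "__main__":
    final = control_loop(state=10.0, reference=Reference(lower=-1.0, upper=1.0), steps=20)
    print(f"Final state after correction: {final:.4f}")
```

The argument that follows is about what a loop of this shape cannot do: it only conditionalises the effects it can detect or simulate and then correct or prevent, which is a strict subset of all the effects propagating through the environment.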
~ ~ ~
Which brings us back to why error correction methods, of any kind and in any combination, cannot ensure long-term AGI Safety.
I reread your original post and Christiano's comment to understand your reasoning better and to see how I could clarify the limits of applicability of error correction methods.
I also messaged Forrest (the polymath) to ask for his input.
The messages were of a high enough quality that I won't bother rewriting the text. Let me copy-paste the raw exchange below (with few spelling edits).
Remmelt 15:37
@Forrest, would value your thoughts on the way Carl Shulman is thinking about error correcting code, perhaps to pass on on the LessWrong Forum:
(https://www.lesswrong.com/posts/uFNgRumrDTpBfQGrs/let-s-think-about-slowing-down-ai?commentId=bY87i5v5StH9FWdWy).
Remmelt 15:38
Remmelt:
"As another example [of unsound monolithic reasoning], your idea of Von Neuman Probes with error correcting codes, referred to by Christiano here (https://www.lesswrong.com/posts/LpM3EAakwYdS6aRKf/what-multipolar-failure-looks-like-and-robust-agent-agnostic?commentId=Jaf9b9YAARYdrK3jp), cannot soundly work for AGI code (as self-learning new code for processing inputs into outputs, and as introducing errors through interactions with the environment that cannot be detected and corrected). This is overdetermined. An ex-Pentagon engineer has spelled out the reasons to me. See a one-page summary by me here."
Carl Shulman:
"This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves), but don't cover all changes that could derive from learning (although there are other reasons why those could be stable in preserving good or terrible properties)."
Remmelt 15:40
Excerpting from the comment by Christiano I link to above:
"The production-web has no interest in ensuring that its members value production above other ends, only in ensuring that they produce (which today happens for instrumental reasons). If consequentialists within the system intrinsically value production it's either because of single-single alignment failures (i.e. someone who valued production instrumentally delegated to a system that values it intrinsically) or because of new distributed consequentialism distinct from either the production web itself or any of the actors in it, but you don't describe what those distributed consequentialists are like or how they come about.
You might say: investment has to converge to 100% since people with lower levels of investment get outcompeted. But this it seems like the actual efficiency loss required to preserve human values seems very small even over cosmological time (e.g. see Carl on exactly this question: http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html).
And more pragmatically, such competition most obviously causes harm either via a space race and insecure property rights, or war between blocs with higher and lower savings rates (some of them too low to support human life, which even if you don't buy Carl's argument is really still quite low, conferring a tiny advantage). If those are the chief mechanisms then it seems important to think/talk about the kinds of agreements and treaties that humans (or aligned machines acting on their behalf!) would be trying to arrange in order to avoid those wars."
Remmelt 15:41
And Carl Shulman's original post on long-term error-correcting Von Neumann Probes:
(http://reflectivedisequilibrium.blogspot.com/2012/09/spreading-happiness-to-stars-seems.html):
"But the program of an AI, large stores of astronomical observations for navigation, and vast stores of technological information would take up an enormous amount of memory and storage space, perhaps many exabytes or more. Given this large body of information, adding additional directives to ensure that the probes eventually turn to producing welfare need only increase storage needs by a very small proportion, e.g. by 1 in 1 billion. Directives could directly specify the criteria to be eventually optimized, or could simply require compliance with further orders traveling behind the frontier of colonization.
...
Mutation is easier to resist for computers than animals
Biological life on Earth has evolved through mutation, and the reproductive process introduces significant errors in each generation. However, digital information storage allows for the comparison of redundant copies and the use of error-correcting codes, making substantive mutation many orders of magnitude less likely than in Earthly life."
Remmelt 15:45
Returning to the new comment by Carl Shulman:
"This is overstating what role error-correcting codes play in that argument. They mean the same programs can be available and evaluate things for eons (and can evaluate later changes with various degrees of learning themselves)"
Remmelt 15:46
Thinking about this overnight, I think Carl's stated reasoning is still unsound for multiple reasons:
1. Contradiction between a Von Neumann Probe being adaptable enough (ie. learning new ways of processing inputs into outputs) to travel across space and seed new civilisations, yet having error correcting code that allows comparison of new code with original redundant copies. Not going to work, for reasons Forrest amply explained and I tried to summarise here: https://docs.google.com/document/d/1-AAhqvgFNx_MlLkcSgw-chvmFoC4EZ4LmTl1IWcsqEA/edit
Ooh, and in Forrest's AGI Error Correction post: https://mflb.com/ai_alignment_1/agi_error_correction_psr.html#p1
Think I'll share that.
Remmelt 15:54
2. Confuses complicated pre-loaded technological knowledge/systems with complex adaptive systems. The fact that they are saying that adding in directives would only increase storage by 1 part in 1 billion parts is a giveaway, I think.
Remmelt 15:55
3. Inverse take on 1.
Algorithms which can flexibly 'mutate' and branch out into different versions become better at using resources and multiplying than more rigid or robustly functional designs. This makes Carl Shulman's case for launching out self-replicating space probes with code error-checking/minimisation routines seem a lot more dicey. If a defecting group launches even one alternate design with a flexible code-mutating ability that confers an advantage that can't easily be copied by the error-minimising designs without compromising on their ability to act on the directives humans originally coded in to 'directly specify the criteria to be eventually optimized' – well, then you might end up instead with swarms of space probes that eat up the galaxy indiscriminately, including any remaining carbon-based lifeforms on planet Earth.
Underlying premise: even if humans construct a long-term aligned AI design – where humans can formally prove a model to causally constrain any possible process of agency emerging from and expanding across each of the physical parts in which this model infers its computational process to be embedded to stay within all fundamental bounds necessary for maintaining alignment with the values that humans broadly share in common – then in practice that design is ‘one step away’ from getting mutated into misalignment by a faction of humans who seek a capability advantage (does it give one though?) for manifesting their more granular personal values.
Remmelt 15:57
@Forrest anything to add to the above? How about I write that into a comment reply (which perhaps might make for a stand-alone post later?)
Forrest 16:25
The starting point is: That any procedure of AGI alignment will
(Ie, cannot not) resemble some sort of error correction algorithm
Forrest 16:28
This point cannot be disagreed with and still have the notion of alignment be persistent. If they argue here, they will need to establish a conformable coherent counter example. Simple contradiction is not enough, as the claim basis shifts to them.
Forrest 16:30
Once hard line equivalence is firmly established, then the argument re limits of causation in all error correction processes apply.
Remmelt 16:44
Good one. So start with how any AGI-alignment method must resemble an error correction algorithm in order to correct out misaligned behaviour learned and expressed in interaction with the environment.
Forrest 16:48
The idea that some machine could "have a strategy of optimizing for near-maximum X until condition Z is achieved, and *then* implement directive Q" makes at least three assumptions that are highly unlikely.
1; there is no assurance that in the long duration of the optimization of X that there would be persistence of directive Q.
2; there is no assurance that the condition 'Z' is consistently detectable, or that the triggering condition is even probable to being implemented, even if detected (as per a violation of assumption 1).
3; the integration coupling of directive Q is mismatched to the sheer mass of the output of the optimized code -- ie, it is unlikely that 'Q', no matter how formed, will actually result in the consistent behavior described. Ie, 'Q' is significantly underspecified, and if 'Q' was appropriately specified, it would not be "lightweight" with respect to the optimization process 'X', and would have been optimized away to ineffectiveness. Hence, even if Z condition trigger happens, Q does not happen.
Note insofar as the logic is directly generalized over the logical claim form of "adopt a strategy of near-maximum colonization speed until they reach the most distant accessible galaxies, and only then divert resources to producing welfare", the proposal itself is not well specified. Ie, the assumption that 'Q' and detection of 'Z' can be "simple" is a conceptual category error.
Remmelt 17:02
I read through this, and think I understand your points.
Forrest 17:10
Yes. I am trying to think of a good metaphor. It is maybe a bit like some single ordinary person trying to command everyone to do something in some consistent way, when that _everyone_ is 100000 people all yelling at one another in some huge auditorium somewhere. The assumption that one person, even at maximum personal signal volume and intensity, could be heard over all similar other people (in this case, the other objectives of the Von Neumann probe), and command them to implement some new action pattern reliably consistent with 'Q', is simply deeply mismatched to the output reality of the optimization process 'X'. The single directive 'Q' is simply not powerful enough to fully conditionalize all of 'X', even under triggering conditions 'Z'.
Also, I notice that the assumption that the error correction process could be applied linearly to a dynamic self recursive system at arbitrary levels of abstraction is simply another category error. Wrong tool for the wrong job. That is less important, though, than the issue with the Q directive arbitrarily efficient effectivity mismatch.
Forrest 17:37
Also, I added the following document to assist in some of what you are trying to do above: https://mflb.com/ai_alignment_1/tech_align_error_correct_fail_psr.html#p1
This echoes something I think I sent previously, but I could not find it in another doc, so I added it.
Forrest Landry.
Here is how he described himself before:
> What is your background? How is it relevant to the work you are planning to do?

Years ago, we started with a strong focus on civilization design and mitigating x-risk. These are topics that need and require more generalist capabilities, in many fields, not just single specialist capabilities, in any one single field of study or application.

Hence, as generalists, we are not specifically persons who are career mathematicians, nor even career physicists, chemists, or career biologists, anthropologists, or even career philosophers. Yet when considering the needs of the topics of civ-design and/or x-risk, it is very abundantly clear that some real skill and expertise is actually needed in all of these fields.

Understanding anything about x-risk and/or civilization means needing to understand key topics regarding large scale institutional process, ie; things like governments, businesses, university, constitutional law, social contract theory, representative process, legal and trade agreements, etc.

Yet people who study markets, economics, and politics (theory of groups, firms, etc) who do not also have some real grounding in actual sociology and anthropology are not going to have grounding in understanding why things happen in the real world as they tend to do.

And those people are going to need to understand things like psychology, developmental psych, theory of education, interpersonal relationships, attachment, social communication dynamics, health of family and community, trauma, etc.

And understanding *those* topics means having a real grounding in evolutionary theory, bio-systems, ecology, biology, neurochemistry and neurology, ecosystem design, permaculture, and evolutionary psychology, theory of bias, etc.

It is hard to see that we would be able to assess things like 'sociological bias' as impacting possible mitigation strategies of x-risk, if we do not actually also have some real and deep, informed, and realistic accounting of the practical implications, in the world, of *all* of these categories of ideas.

And yet, unfortunately, that is not all, since understanding of *those* topics themselves means even more and deeper grounding in things like organic and inorganic chemistry, cell process, and the underlying *physics* of things like that. Which therefore includes a fairly general understanding of multiple diverse areas of physics (mechanical, thermal, electromagnetic, QM, etc), and thus also of technology -- since that is directly connected to business, social systems, world systems infrastructure, internet, electrical grid and energy management, transport (for fuel, materials, etc), and even more politics, advertising and marketing, rhetorical process and argumentation, etc.

Oh, and of course, a deep and applied practical knowledge of 'computer science', since nearly everything in the above is in one way or another "done with computers". Maybe, of course, that would also be relevant when considering the specific category of x-risk which happens to involve computational concepts when thinking about artificial superintelligence.

I *have* been a successful practicing engineer in both large scale US-gov deployed software and also in product design shipped to millions. I have personally written more than 900,000 lines of code (mostly Ansi-C, ASM, Javascript) and have been 'the principal architect' in a team. I have developed my own computing environments, languages, procedural methodologies, and system management tactics, over multiple process technologies in multiple applied contexts. I have a reasonably thorough knowledge of CS, including the modeling math, control theory, etc. Ie, I am legitimately "full stack" engineering from the physics of transistors, up through CPU design, firmware and embedded systems, OS level work, application development, networking, user interface design, and the social process implications of systems. I have similarly extensive accomplishments in some of the other listed disciplines also.

As such, as a proven "career" generalist, I am also (though not just) a master craftsman, which includes things like practical knowledge of how to negotiate contracts, write all manner of documents, make all manner of things, *and* understand the implications of *all* of this in the real world, etc.

For the broad category of valid and reasonable x-risk assessment, nothing less than at least some true depth in nearly *all* of these topics will do.
From Math Expectations, a depersonalised post Forrest wrote of his impressions of a conversation with a grant investigator where the grant investigator kept looping back on the expectation that a "proof" based on formal reasoning must be written in mathematical notation. We did end up receiving the $170K grant.
I usually do not mention Forrest Landry's name immediately for two reasons:
- If you google his name, he comes across like a spiritual hippie. Geeks who don't understand his use of language take that as a cue that he must not know anything about computational science, mathematics or physics (wrong – Forrest has deep insights into programming methods and eg. why Bell's Theorem is a thing).
- Forrest prefers to work on the frontiers of research, rather than repeating himself in long conversations with tech people who cannot let go of their own mental models and quickly jump to motivated counterarguments that he has heard and addressed many times before. So I act as a bridge-builder, trying to translate between Forrest speak and Alignment Forum speak.
- Both of us prefer to work behind the scenes. I've only recently started to touch on the arguments in public.
- You can find those arguments elaborated on here.
Warning: large inferential distance; do message clarifying questions – I'm game!