Has Eliezer publicly and satisfactorily responded to attempted rebuttals of the analogy to evolution?
post by kaler · 2024-07-28T12:23:40.671Z · LW · GW · 5 comments

This is a question post.
Contents

Answers:
- Jeremy Gillen (89)
- Donald Hobson (7)
- lukehmiles (4)
- Noah Birnbaum (3)

5 comments
I refer to these posts:
https://optimists.ai/2023/11/28/ai-is-easy-to-control/
https://www.lesswrong.com/posts/hvz9qjWyv8cLX9JJR/evolution-provides-no-evidence-for-the-sharp-left-turn [LW · GW]
https://www.lesswrong.com/posts/CoZhXrhpQxpy9xw9y/where-i-agree-and-disagree-with-eliezer [LW · GW]
My (poor, maybe mistaken) understanding is that the argument goes like this: because SGD optimizes for "predicting the next token" and we select for systems with very low loss by modifying every single parameter in the neural network (which basically defines the network itself), it seems quite unlikely that we'll have a "sharp left turn" in the near term. The sharp left turn happened in evolution's case because evolution was too weak an outer optimizer to fully "control" humans' thinking in the direction that most improved inclusive genetic fitness, as it was too weak to directly tinker with every neuron connection in our brains.
Given SGD's vastly stronger ability at outer optimisation of every parameter, isn't it possible, if not likely, that any sharp left turn occurs only at a vastly superhuman level, if the inner optimizer becomes vastly stronger than SGD?
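To make the contrast in the question concrete, here is a minimal toy sketch (my own illustration, not from the linked posts; the task, sizes, and hyperparameters are arbitrary). Gradient descent gets an exact per-parameter error signal on every update, while an evolution-style outer loop only scores whole parameter vectors and has to rely on mutation and selection:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression task: recover a hidden linear map from 8 features.
X = rng.normal(size=(256, 8))
true_w = rng.normal(size=8)
y = X @ true_w

def loss(w):
    return np.mean((X @ w - y) ** 2)

# Outer optimizer 1: gradient descent.
# Every parameter receives its own exact credit-assignment signal each step.
w_sgd = np.zeros(8)
for _ in range(200):
    grad = 2 * X.T @ (X @ w_sgd - y) / len(X)
    w_sgd -= 0.05 * grad

# Outer optimizer 2: evolution-style search.
# Only whole "genomes" are scored; there is no per-parameter signal,
# just mutate-and-select on overall fitness.
pop = [np.zeros(8) for _ in range(20)]
for _ in range(200):
    survivors = sorted(pop, key=loss)[:5]                 # selection
    pop = survivors + [p + 0.1 * rng.normal(size=8)       # mutation
                       for p in survivors for _ in range(3)]

print("gradient descent loss:", loss(w_sgd))
print("evolution-style loss: ", loss(min(pop, key=loss)))
```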
The above arguments have persuaded me that we might be able to thread the needle for survival if humanity is able to use the not-yet-actively-deceptive outputs of moderately-superhuman models (because they are still just predicting the next token to the best of their capability), to help us solve the potential sharp left turn, and if humanity doesn't do anything else stupid with other training methods or misuse, and manages to solve the other problems. Of course, in an ideal world we wouldn't be in this situation.
I have read some rebuttals by others on LessWrong but did not find anything that convincingly debunked this idea (maybe I missed something).
Did Eliezer, or anyone else, ever tell us why this is wrong (if it is)? I have been searching for the past week but have only found this: https://x.com/ESYudkowsky/status/1726329895121514565 which seemed to be switching to more of a post-training discussion.
Answers
Answer by Jeremy Gillen

Some general advice:
- Taboo "sharp left turn". If possible, replace it with a specific example from e.g. here.
- Don't worry about what Eliezer in particular has to say, just look for good arguments.
- There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
Object level response:
"sharp left turn" in the near term, which happened because evolution was too weak an outer optimizer to fully "control" humans' thinking in the direction that most improved inclusive genetic fitness, as it is too weak to directly tinker every neuron connection in our brain.
I think this is a misunderstanding. Evolution failed to align humans in the sense that, when humans are in an extremely out-of-distribution (OOD) environment, they don't act as if they had evolved in that OOD environment. In other words, alignment is about generalization; it's not about how much fine-grained control the outer optimizer had.
Eliezer does touch on this in a podcast [LW · GW], search for "information bottleneck". In the podcast, Eliezer seems to be saying that SGD might have less simplicity bias than evolution, which may imply even worse goal generalization. But I don't think I buy that, because there are plenty of other sources of simplicity bias in neural network training, so a priori I don't see a strong reason to believe one would generalize better than the other.
There are theoretical questions about how generalization works, how neural networks in particular generalize, how we can apply these theories to reason about the "goals" of a trained AI, and how those goals relate to the training distribution. These are much more relevant to alignment than the information bottleneck of SGD. My impression of Quintin and Nora is that they don't model advanced AIs as pursuing "goals" in the same way that I do, and I think that's one of the main sources of disagreement. They have a much more context-dependent idea of what a goal is, and they seem to think this model of goals is sufficient for very intelligent behavior.
> humanity is able to use the not-yet-actively-deceptive outputs of moderately-superhuman models (because they are still just predicting the next token to the best of their capability), to help us solve the potential sharp left turn
LLMs are already moderately-superhuman at the task of predicting next tokens. This isn't sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering. And having that level of capability implies (arguably, see here [LW · GW] for some arguments):
- the capacity to learn and develop new skills (basic self-modification),
- the capacity to creatively solve way-out-of-training-distribution problems using way-out-of-training-distribution solutions,
- maybe something analogous to a capacity for moral philosophy, i.e. analyzing its own motivations and trying to work out what it "should" want.
These capabilities should (imo) produce context changes (distribution shifts) large enough that they will probably reveal goal misgeneralization. (Some specific example mechanisms for goal misgeneralization are listed here [LW · GW].)
So my answer, in short, is that "help us solve" implies "high capability" implies "OOD goal misgeneralization" (which is what I think you basically meant by "sharp left turn"). So it seems unlikely to me that we can use future AI to solve alignment in the way you describe, because misalignment should already be creating large problems by the time the AI is capable of helping.
↑ comment by localdeity · 2024-07-30T01:20:29.911Z · LW(p) · GW(p)
> There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
The most obvious way of addressing this, "just feel more comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs", has its own failure mode, where you end up attacking a strawman that you think is a better argument than what they made, defeating it, and thinking you've solved the issue when you haven't. People have complained about this failure mode of steelmanning a couple [LW · GW] of times [LW · GW]. At a fixed level of knowledge and thought about the subject, it seems one can only trade off one danger against the other.
However, if you're planning to increase your knowledge and time-spent-thinking about the subject, then during that time it's better to focus on the ideas than on who-said-or-meant-what; the latter is instrumentally useful as a source of ideas.
↑ comment by Davidmanheim · 2024-08-15T19:22:34.826Z · LW(p) · GW(p)
> LLMs are already moderately-superhuman at the task of predicting next tokens. This isn't sufficient to help solve alignment problems. We would need them to meet the much much higher bar of being moderately-superhuman at the general task of science/engineering.
We also need the assumption - which is definitely not obvious - that significant intelligence increases are relatively close to achievable. Superhumanly strong math skills presumably don't let AI solve NP problems in P time, and it's similarly plausible - though far from certain - that really good engineering skill tops out somewhere only moderately above human ability due to intrinsic difficulty, and really good deception skills top out somewhere short of being enough to subvert the best systems that we could build to do oversight and detect misalignment. (On the other hand, even if these objections are correct, it would only show that control is possible, not that it is likely to occur.)
↑ comment by dxu · 2024-07-31T01:57:04.404Z · LW(p) · GW(p)
> There's also a failure mode of focusing on "which arguments are the best" instead of "what is actually true". I don't understand this failure mode very well, except that I've seen myself and others fall into it. Falling into it looks like focusing a lot on specific arguments, and spending a lot of time working out what was meant by the words, rather than feeling comfortable adjusting arguments to fit better into your own ontology and to fit better with your own beliefs.
My sense is that this is because different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
Eliezer in particular strikes me as having an intuitive prior for AI alignment outcomes that looks very similar to priors for tasks like e.g. writing bug-free software on the first try, assessing the likelihood that a given plan will play out as envisioned, correctly compensating for optimism bias, etc. which is what gives rise to posts concerning concepts like security mindset [LW · GW].
Other people don't share this intuitive prior, and so have to be argued into it. To such people, the reliability of the arguments in question is actually critical, because if those arguments turn out to have holes, that reverts the downstream updates and restores the original intuitive prior, whatever it looked like—kind of like a souped up version of the burden of proof concept, where the initial placement of that burden is determined entirely via the intuitive judgement of the individual.
This also seems related to why different people seem to naturally gravitate towards either conjunctive or disjunctive [LW · GW] models of catastrophic outcomes from AI misalignment: the conjunctive impulse stems from an intuition that AI catastrophe is a priori unlikely, and so a bunch of different claims have to hold simultaneously in order to force a large enough update, whereas the disjunctive impulse stems from the notion that any given low-level claim need not be on particularly firm ground, because the high-level thesis of AI catastrophe robustly manifests via different but converging lines of reasoning.
See also: the focus on coherence [LW · GW], where some people place great importance on the question of whether VNM or other coherence theorems show what Eliezer et al. purport they show about superintelligent agents, versus the competing model wherein none of these individual theorems are important in their particulars, so much as the direction they seem to point, hinting at the concept of what idealized behavior with respect to non-gerrymandered physical resources ought to look like [LW · GW].
I think the real question, then, is where these differences in intuition come from, and unfortunately the answer might have to do a lot with people's backgrounds, and the habits and heuristics they picked up from said backgrounds—something quite difficult to get at via specific, concrete argumentation.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-07-31T02:47:39.145Z · LW(p) · GW(p)
> different people have different intuitive priors, and process arguments (mostly) as a kind of Bayesian evidence that updates those priors, rather than modifying the priors (i.e. intuitions) directly.
I'm not sure I understand this distinction as-written. How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
↑ comment by dxu · 2024-07-31T03:04:31.221Z · LW(p) · GW(p)
> How is a Bayesian agent supposed to modify priors except by updating on the basis of evidence?
They're not! But humans aren't ideal Bayesians, and it's entirely possible for them to update in a way that does change their priors (encoded by intuitions) moving forward. In particular, the difference between having updated one's intuitive prior, and keeping the intuitive prior around but also keeping track of a different, consciously held posterior, is that the former is vastly less likely to "de-update", because the evidence that went into the update isn't kept around in a form that subjects it to (potential) refutation.
(IIRC, E.T. Jaynes talks about this distinction in Chapter 18 of Probability Theory: The Logic of Science, and he models it by introducing something he calls an A_p distribution. His exposition of this idea is uncharacteristically unclear, and his A_p distribution looks basically like a beta distribution with specific values for α and β, but it does seem to capture the distinction I see between "intuitive" updating versus "conscious" updating.)
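For readers without the book to hand, here is a rough sketch of the device as I understand it (my paraphrase and notation, not Jaynes's):

```latex
% A_p is defined so that, once it is given, other evidence E is screened off:
%   P(A \mid A_p, E) \equiv p.
% One's state of knowledge about A is then a density f over p, with
\[
  P(A \mid E) \;=\; \int_0^1 p \, f(p \mid E)\, dp .
\]
% If f is (approximately) a Beta density,
\[
  f(p \mid E) \;=\; \frac{p^{\alpha-1}(1-p)^{\beta-1}}{B(\alpha,\beta)},
  \qquad
  P(A \mid E) \;=\; \frac{\alpha}{\alpha+\beta},
\]
% then \alpha + \beta measures how entrenched the estimate is: the same point
% probability \alpha/(\alpha+\beta) can be easy or hard to move, depending on how
% much implicit evidence is already baked into the density.
```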
Answer by Donald Hobson

I haven't seen an answer by Eliezer. But I can go through the first post and highlight what I think is wrong. (And I would be unsurprised if Eliezer agreed with much of it.)
> AIs are white boxes
We can see literally every neuron, but have little clue what they are doing.
> Black box methods are sufficient for human alignment
Humans are aligned to human values because humans have human genes. Also individual humans can't replicate themselves, which makes taking over the world much harder.
> most people do assimilate the values of their culture pretty well, and most people are reasonably pro-social.
Humans have specific genes for absorbing cultural values, at least within a range of human cultures. There are various alien values that humans won't absorb.
> Gradient descent is very powerful because, unlike a black box method, it’s almost impossible to trick.
Hmm. I don't think the case for that is convincing.
> If the AI is secretly planning to kill you, gradient descent will notice this and make it less likely to do that in the future, because the neural circuitry needed to make the secret murder plot can be dismantled and reconfigured into circuits that directly improve performance.
Current AI techniques involve giving the AI loads of neurons, so having a few neurons that aren't being used isn't a problem.
Also, it's possible that the same neurons that sometimes plot to kill you are also sometimes used to predict plots in murder mystery books.
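To illustrate the point above about unused circuitry, here is a minimal sketch (my own toy example; note that real training runs usually add weight decay, which does slowly shrink unused weights). A parameter that never influences the loss receives exactly zero gradient, so plain gradient descent leaves it in place rather than dismantling and reconfiguring it:

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny linear model y_hat = X @ w trained with plain gradient descent on MSE.
# Feature 2 is always zero in the training data, so w[2] never affects the loss.
X = rng.normal(size=(128, 3))
X[:, 2] = 0.0                       # the "unused circuit": this input never varies
w_true = np.array([1.5, -2.0, 0.0])
y = X @ w_true

w = rng.normal(size=3)              # includes a nonzero weight on the unused feature
w_unused_start = w[2]
for _ in range(500):
    grad = 2 * X.T @ (X @ w - y) / len(X)   # grad[2] == 0 on every step
    w -= 0.1 * grad

print("final loss:", np.mean((X @ w - y) ** 2))
print("unused weight moved by:", abs(w[2] - w_unused_start))  # 0.0 without weight decay
```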
> In general, gradient descent has a strong tendency to favor the simplest solution which performs well, and secret murder plots aren’t actively useful for improving performance on the tasks humans will actually optimize AIs to perform.
If you give the AI lots of tasks, it's possible that the simplest solution is some kind of internal general optimizer.
Either you have an AI that is smart and general and can try new things that are substantially different from anything it's done before (in which case the new things can include murder plots), or you have an AI that's dumb and is only repeating small variations on its training data.
> We can run large numbers of experiments to find the most effective interventions
Current techniques are based on experiments/gradient descent. This works so long as the AIs can't break out of the sandbox or realize they are being experimented on and plot to trick the experimenters. You can't keep an ASI in a little sandbox and run gradient descent on it.
> Our reward circuitry reliably imprints a set of motivational invariants into the psychology of every human: we have empathy for friends and acquaintances, we have parental instincts, we want revenge when others harm us, etc.
Sure. And we use contraception. Which kind of shows that evolution failed somewhere a bit.
Also, evolution got a long time for testing and refining, with humans that didn't have the tools to mess with evolution or even understand it.
> Even in the pessimistic scenario where AIs stop obeying our every command, they will still protect us and improve our welfare, because they will have learned an ethical code very early in training.
No one is claiming the ASI won't understand human values; they are saying it won't care.
> The moral judgements of current LLMs already align with common sense to a high degree,
Is that evidence that LLMs actually care about morality? Not really. It's evidence that they are good at predicting humans. Get them predicting an ethics professor and they will answer morality questions. Get them predicting Hitler and they will say less moral things.
And of course, there is a big difference between an AI that says "be nice to people" and an AI that is nice to people. The former can be trivially achieved by hard-coding a list of platitudes for the AI to parrot back. The latter requires the AI to make decisions like "are unborn babies people?".
Imagine some robot running around. You have an LLM that says nice-sounding things when posed ethical dilemmas. You need some system that turns the raw camera input into a text description, and the nice sounding output into actual actions.
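A minimal sketch of that pipeline (every component name here is hypothetical and stubbed with canned outputs so the snippet runs; it is only meant to show where the gap sits). The LLM's nice-sounding text is one stage in the middle; the robot's behavior also depends on the perception and actuation glue around it:

```python
# Hypothetical robot control loop; all functions are made-up stand-ins.

def describe_scene(camera_frame: bytes) -> str:
    # stand-in for a vision-to-text model
    return "A person is standing in the doorway."

def ethical_llm(prompt: str) -> str:
    # stand-in for an LLM that says nice-sounding things when posed dilemmas
    return "Proceed slowly and make sure not to harm the person."

def text_to_action(advice: str) -> str:
    # stand-in for whatever maps free-form text onto motor commands;
    # nothing in the LLM's wording by itself constrains this mapping
    return "drive_forward(speed=1.0)"

frame = b"\x00" * 100                       # fake camera input
advice = ethical_llm(describe_scene(frame))
print("LLM says:  ", advice)
print("Robot does:", text_to_action(advice))
```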
Answer by lukehmiles

It would be interesting if someone discovered something like "junk DNA that just copies itself" within the weights during the backprop+SGD process. That would be some evidence that backprop's thumb is not so heavy a worm can't wiggle out. Right now I would bet against that happening within a normal neural net training on a dataset.
Note that RL exists and gives the neural net much more uh "creative room" to uh "decide how to exist". Because you just have to get enough score over time to survive, but any strategy is accepted. In other words, it is much less convergent.
Also in RL, glitching/hacking of the physics sim / game engine is what you expect to happen! Then you have to patch your sim and retrain.
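To illustrate that expectation with a runnable toy (my own example; the environment, the bug, and the hyperparameters are all made up): a tabular Q-learning agent in a ten-cell corridor whose intended task is to walk right to the goal, but whose per-step "progress" bonus is computed from the unclamped physics state. The learned policy farms the bug at the wall instead of solving the task:

```python
import numpy as np

rng = np.random.default_rng(0)

# Corridor of cells 0..9; intended task: reach cell 9 for +1.
# Bug: the progress bonus uses the *unclamped* position, so pushing into the
# left wall still pays out even though the agent doesn't move.
N, GOAL, MAX_STEPS = 10, 9, 100
ACTS = [-1, +1]                             # left, right

def step(pos, action):
    raw = pos + action                      # unclamped physics state
    new = min(max(raw, 0), N - 1)           # what actually happens
    reward = 0.1 * abs(raw - pos)           # BUG: bonus computed from `raw`
    done = new == GOAL
    if done:
        reward += 1.0                       # the intended objective
    return new, reward, done

# Plain tabular Q-learning with epsilon-greedy exploration.
Q = np.zeros((N, 2))
alpha, gamma, eps = 0.2, 0.99, 0.3
for _ in range(2000):
    pos = 0
    for _ in range(MAX_STEPS):
        a = rng.integers(2) if rng.random() < eps else int(np.argmax(Q[pos]))
        new, r, done = step(pos, ACTS[a])
        target = r if done else r + gamma * Q[new].max()
        Q[pos, a] += alpha * (target - Q[pos, a])
        pos = new
        if done:
            break

# Greedy rollout: the learned policy jams itself into the left wall forever.
pos, total, reached = 0, 0.0, False
for _ in range(MAX_STEPS):
    pos, r, done = step(pos, ACTS[int(np.argmax(Q[pos]))])
    total += r
    reached = reached or done
    if done:
        break
print(f"greedy return: {total:.1f}, reached goal: {reached}")  # expect ~10.0, False
```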
Also, most of the ML systems we use every day involve multiple neural nets with different goals (eg the image generator and the NSFW detector), so something odd might happen in that interaction.
All this to say: The question "if I train one NN on a fixed dataset with backprop+SGD, could something unexpected pop out?" is quite interesting and still open in my opinion. But even if that always goes exactly as expected, it is certainly clear that RL, active learning, multi-NN ML systems, hyperparameter optimization (which is often an evolutionary algorithm), etc. produce weird things with weird goals and strategies very often.
I think debate surrounds the 1-NN-1-dataset question because it is an interesting and natural and important question, the type of question a good scientist would ask. Probably only a small part of the bigger challenge to control the whole trained machine.
Answer by Noah Birnbaum

I think Eliezer briefly responds to this in his podcast with Dwarkesh Patel (though "satisfactorily" is pretty subjective): https://youtu.be/41SUp-TRVlg?si=hE3gcWxjDtl1-j14
At about 24:40.
5 comments
Comments sorted by top scores.
comment by DaemonicSigil · 2024-07-28T18:48:32.397Z · LW(p) · GW(p)
IMO, a very good response, which Eliezer doesn't seem to be interested in making as far as I can tell, is that we should not be making the analogy "natural selection <--> gradient descent", but rather "human brain learning algorithm <--> gradient descent" and "natural selection <--> us trying to build AI".
So here, the striking thing is that evolution failed to solve the alignment problem for humans. I.e. we have a prior example of strongish general intelligence being created, but no prior examples of strongish general intelligence being aligned. Evolution was strong enough to do one but not the other. It's not hopeless, because we should generally consider ourselves smarter than evolution, but on the other hand, evolution has had a very long time to work and it does frequently manage things that we humans have not been able to replicate. And also, it provides a small amount of evidence against "the problem will be solved with minor tweaks to existing algorithms" since generally minor tweaks are easier for evolution to find than ideas that require many changes at once.
↑ comment by lemonhope (lcmgcd) · 2024-07-29T05:20:25.312Z · LW(p) · GW(p)
Yeah you can kind of stop at "we are already doing natural selection." The devs give us random variation. The conferences and the market give us selection. The population is large, the mutation rate is high, the competition is fierce, and replicating costs $0.25 + 10 minutes.
comment by Seth Herd · 2024-07-29T03:47:22.917Z · LW(p) · GW(p)
I don't see why we care if evolution is a good analogy for alignment risk. The arguments for misgeneralization/mis-specification stand on their own. They do not show that alignment is impossible, but they do strongly suggest that it is not trivial.
Focusing on this argument seems like missing the forest for the trees.
↑ comment by sunwillrise (andrei-alexandru-parfeni) · 2024-11-20T02:41:30.175Z · LW(p) · GW(p)
[Coming at this a few months late, sorry. This [LW(p) · GW(p)] comment by @Steven Byrnes [LW · GW] sparked my interest in this topic once again]
> I don't see why we care if evolution is a good analogy for alignment risk. The arguments for misgeneralization/mis-specification stand on their own. They do not show that alignment is impossible, but they do strongly suggest that it is not trivial.
> Focusing on this argument seems like missing the forest for the trees.
Ngl, I find everything you've written here a bit... baffling, Seth. Your writing in particular and your exposition of your thoughts on AI risk generally does not use evolutionary analogies, but this only means that posts and comments criticizing analogies with evolution (sample: 1 [LW · GW], 2 [LW · GW], 3 [LW · GW], 4 [LW(p) · GW(p)], 5 [LW · GW], etc) are just not aimed at you and your reasoning. I greatly enjoy [LW(p) · GW(p)] reading your writing and pondering the insights you bring up, but you are simply not even close to the most publicly-salient proponent of "somewhat high p(doom)" among the AI alignment community. It makes perfect sense from the perspective of those who disagree with you (or other, more hardcore "doomers" [LW · GW]) on the bottom-line question of AI risk to focus their public discourse primarily on responding to the arguments brought up by the subset of "doomers" who are most salient and also most extreme in their views, namely the MIRI-cluster [LW(p) · GW(p)] centered around Eliezer, Nate Soares, and Rob Bensinger.
And when you turn to MIRI and the views that its members have espoused on these topics, I am very surprised to hear that "The arguments for misgeneralization/mis-specification stand on their own" and are not ultimately based on analogies with evolution.
But anyway, to hopefully settle this once and for all, let's go through all the examples that pop up in my head immediately when I think of this, shall we?
From the section on inner & outer alignment of "AGI Ruin: A List of Lethalities [LW · GW]", by Yudkowsky (I have removed the original emphasis and added my own):
15. Fast capability gains seem likely, and may break lots of previous alignment-required invariants simultaneously. Given otherwise insufficient foresight by the operators, I'd expect a lot of those problems to appear approximately simultaneously after a sharp capability gain. See, again, the case of human intelligence. We didn't break alignment with the 'inclusive reproductive fitness' outer loss function, immediately after the introduction of farming - something like 40,000 years into a 50,000 year Cro-Magnon takeoff, as was itself running very quickly relative to the outer optimization loop of natural selection. Instead, we got a lot of technology more advanced than was in the ancestral environment, including contraception, in one very fast burst relative to the speed of the outer optimization loop, late in the general intelligence game. We started reflecting on ourselves a lot more, started being programmed a lot more by cultural evolution, and lots and lots of assumptions underlying our alignment in the ancestral training environment broke simultaneously. (People will perhaps rationalize reasons why this abstract description doesn't carry over to gradient descent; eg, “gradient descent has less of an information bottleneck”. My model of this variety of reader has an inside view, which they will label an outside view, that assigns great relevance to some other data points that are not observed cases of an outer optimization loop producing an inner general intelligence, and assigns little importance to our one data point actually featuring the phenomenon in question. When an outer optimization loop actually produced general intelligence, it broke alignment after it turned general, and did so relatively late in the game of that general intelligence accumulating capability and knowledge, almost immediately before it turned 'lethally' dangerous relative to the outer optimization loop of natural selection. Consider skepticism, if someone is ignoring this one warning, especially if they are not presenting equally lethal and dangerous things that they say will go wrong instead.)
16. Even if you train really hard on an exact loss function, that doesn't thereby create an explicit internal representation of the loss function inside an AI that then continues to pursue that exact loss function in distribution-shifted environments. Humans don't explicitly pursue inclusive genetic fitness; outer optimization even on a very exact, very simple loss function doesn't produce inner optimization in that direction. This happens in practice in real life, it is what happened in the only case we know about, and it seems to me that there are deep theoretical reasons to expect it to happen again: the first semi-outer-aligned solutions found, in the search ordering of a real-world bounded optimization process, are not inner-aligned solutions. This is sufficient on its own, even ignoring many other items on this list, to trash entire categories of naive alignment proposals which assume that if you optimize a bunch on a loss function calculated using some simple concept, you get perfect inner alignment on that concept.
[...]
21. There's something like a single answer, or a single bucket of answers, for questions like 'What's the environment really like?' and 'How do I figure out the environment?' and 'Which of my possible outputs interact with reality in a way that causes reality to have certain properties?', where a simple outer optimization loop will straightforwardly shove optimizees into this bucket. When you have a wrong belief, reality hits back at your wrong predictions. When you have a broken belief-updater, reality hits back at your broken predictive mechanism via predictive losses, and a gradient descent update fixes the problem in a simple way that can easily cohere with all the other predictive stuff. In contrast, when it comes to a choice of utility function, there are unbounded degrees of freedom and multiple reflectively coherent fixpoints. Reality doesn't 'hit back' against things that are locally aligned with the loss function on a particular range of test cases, but globally misaligned on a wider range of test cases. This is the very abstract story about why hominids, once they finally started to generalize, generalized their capabilities to Moon landings, but their inner optimization no longer adhered very well to the outer-optimization goal of 'relative inclusive reproductive fitness' - even though they were in their ancestral environment optimized very strictly around this one thing and nothing else. This abstract dynamic is something you'd expect to be true about outer optimization loops on the order of both 'natural selection' and 'gradient descent'. The central result: Capabilities generalize further than alignment once capabilities start to generalize far.
From "A central AI alignment problem: capabilities generalization, and the sharp left turn" [LW · GW], by Nate Soares, which, by the way, quite literally uses the exact phrase "The central analogy"; as before, emphasis is mine:
My guess for how AI progress goes is that at some point, some team gets an AI that starts generalizing sufficiently well, sufficiently far outside of its training distribution, that it can gain mastery of fields like physics, bioengineering, and psychology, to a high enough degree that it more-or-less singlehandedly threatens the entire world. Probably without needing explicit training for its most skilled feats, any more than humans needed many generations of killing off the least-successful rocket engineers to refine our brains towards rocket-engineering before humanity managed to achieve a moon landing.
And in the same stroke that its capabilities leap forward, its alignment properties are revealed to be shallow, and to fail to generalize. The central analogy here is that optimizing apes for inclusive genetic fitness (IGF) doesn't make the resulting humans optimize mentally for IGF. Like, sure, the apes are eating because they have a hunger instinct and having sex because it feels good—but it's not like they could be eating/fornicating due to explicit reasoning about how those activities lead to more IGF. They can't yet perform the sort of abstract reasoning that would correctly justify those actions in terms of IGF. And then, when they start to generalize well in the way of humans, they predictably don't suddenly start eating/fornicating because of abstract reasoning about IGF, even though they now could. Instead, they invent condoms, and fight you if you try to remove their enjoyment of good food (telling them to just calculate IGF manually). The alignment properties you lauded before the capabilities started to generalize, predictably fail to generalize with the capabilities.
From "The basic reasons I expect AGI ruin" [LW · GW], by Rob Bensinger:
When I say "general intelligence", I'm usually thinking about "whatever it is that lets human brains do astrophysics, category theory, etc. even though our brains evolved under literally zero selection pressure to solve astrophysics or category theory problems".
[...]
Human brains aren't perfectly general, and not all narrow AI systems or animals are equally narrow. (E.g., AlphaZero is more general than AlphaGo.) But it sure is interesting that humans evolved cognitive abilities that unlock all of these sciences at once, with zero evolutionary fine-tuning of the brain aimed at equipping us for any of those sciences. Evolution just stumbled into a solution to other problems, that happened to generalize to millions of wildly novel tasks.
[...]
Human brains underwent no direct optimization for STEM ability in our ancestral environment, beyond traits like "I can distinguish four objects in my visual field from five objects".[5] [LW(p) · GW(p)]
[5] More generally, the sciences (and many other aspects of human life, like written language) are a very recent development on evolutionary timescales. So evolution has had very little time to refine and improve on our reasoning ability in many of the ways that matter
From "Niceness is unnatural" [LW · GW], by Nate Soares:
I think this view is wrong, and I don't see much hope here. Here's a variety of propositions I believe that I think sharply contradict this view:
- There are lots of ways to do the work that niceness/kindness/compassion did in our ancestral environment, without being nice/kind/compassionate.
- The specific way that the niceness/kindness/compassion cluster shook out in us is highly detailed, and very contingent on the specifics of our ancestral environment (as factored through its effect on our genome) and our cognitive framework (calorie-constrained massively-parallel slow-firing neurons built according to DNA), and filling out those details differently likely results in something that is not relevantly "nice".
- Relatedly, but more specifically: empathy (and other critical parts of the human variant of niceness) seem(s) critically dependent on quirks in the human architecture. More generally, there are lots of different ways for the AI's mind to work differently from how you hope it works.
- The desirable properties likely get shredded under reflection. Once the AI is in the business of noticing and resolving conflicts and inefficiencies within itself (as is liable to happen when its goals are ad-hoc internalized correlates of some training objective), the way that its objectives ultimately shake out is quite sensitive to the specifics of its resolution strategies.
From "Superintelligent AI is necessary for an amazing future, but far from sufficient [LW · GW]", by Nate Soares:
These are the sorts of features of human evolutionary history that resulted in us caring (at least upon reflection) about a much more diverse range of minds than “my family”, “my coalitional allies”, or even “minds I could potentially trade with” or “minds that share roughly the same values and faculties as me”.
Humans today don’t treat a family member the same as a stranger, or a sufficiently-early-development human the same as a cephalopod; but our circle of concern is certainly vastly wider than it could have been, and it has widened further as we’ve grown in power and knowledge.
From the Eliezer-edited summary of "Ngo and Yudkowsky on alignment difficulty" [LW · GW], by... Ngo and Yudkowsky:
Eliezer, summarized by Richard (continued): "In biological organisms, evolution is ~~one source~~ the ultimate source of consequentialism. A ~~second~~ secondary outcome of evolution is reinforcement learning. For an animal like a cat, upon catching a mouse (or failing to do so) many parts of its brain get slightly updated, in a loop that makes it more likely to catch the mouse next time. (Note, however, that this process isn’t powerful enough to make the cat a pure consequentialist - rather, it has many individual traits that, when we view them from this lens, point in the same direction.) ~~A third thing that makes humans in particular consequentialist is planning,~~ Another outcome of evolution, which helps make humans in particular more consequentialist, is planning - especially when we’re aware of concepts like utility functions."
From "Comments on Carlsmith's “Is power-seeking AI an existential risk?"" [LW · GW], by Nate Soares:
Perhaps Joe thinks that alignment is so easy that it can be solved in a short time window?
My main guess, though, is that Joe is coming at things from a different angle altogether, and one that seems foreign to me.
Attempts to generate such angles along with my corresponding responses:
- Claim: perhaps it's just not that hard to train an AI system to be "good" in the human sense? Like, maybe it wouldn't have been that hard for natural selection to train humans to be fitness maximizers, if it had been watching for goal-divergence and constructing clever training environments?
- Counter: Maybe? But I expect these sorts of things to take time, and at least some mastery of the system's internals, and if you want them to be done so well that they actually work in practice even across the great Change-Of-Distribution to operating in the real world then you've got to do a whole lot of clever and probably time-intensive work.
- Claim: perhaps there's just a handful of relevant insights, and new ways of thinking about things, that render the problem easy?
- Counter: Seems like wishful thinking to me, though perhaps I could go point-by-point through hopeful-to-Joe-seeming candidates?
From "Soares, Tallinn, and Yudkowsky discuss AGI cognition [LW · GW]", by... well, you get the point:
Eliezer: Something like, "Evolution constructed a jet engine by accident because it wasn't particularly trying for high-speed flying and ran across a sophisticated organism that could be repurposed to a jet engine with a few alterations; a human industry would be gaining economic benefits from speed, so it would build unsophisticated propeller planes before sophisticated jet engines." It probably sounds more convincing if you start out with a very high prior against rapid scaling / discontinuity, such that any explanation of how that could be true based on an unseen feature of the cognitive landscape which would have been unobserved one way or the other during human evolution, sounds more like it's explaining something that ought to be true.
And why didn't evolution build propeller planes? Well, there'd be economic benefit from them to human manufacturers, but no fitness benefit from them to organisms, I suppose? Or no intermediate path leading to there, only an intermediate path leading to the actual jet engines observed.
I actually buy a weak version of the propeller-plane thesis based on my inside-view cognitive guesses (without particular faith in them as sure things), eg, GPT-3 is a paper airplane right there, and it's clear enough why biology could not have accessed GPT-3. But even conditional on this being true, I do not have the further particular faith that you can use propeller planes to double world GDP in 4 years, on a planet already containing jet engines, whose economy is mainly bottlenecked by the likes of the FDA rather than by vaccine invention times, before the propeller airplanes get scaled to jet airplanes.
The part where the whole line of reasoning gets to end with "And so we get huge, institution-reshaping amounts of economic progress before AGI is allowed to kill us!" is one that doesn't feel particular attractored to me, and so I'm not constantly checking my reasoning at every point to make sure it ends up there, and so it doesn't end up there.
From "Humans aren't fitness maximizers [LW · GW]", by Soares:
One claim that is hopefully uncontroversial (but that I'll expand upon below anyway) is:
- Humans are not literally optimizing for IGF, and regularly trade other values off against IGF.
Separately, we have a stronger and more controversial claim:
- If an AI's objectives included goodness in the same way that our values include IGF, then the future would not be particularly good.
I think there's more room for argument here, and will provide some arguments.
A semi-related third claim that seems to come up when I have discussed this in person is:
- Niceness is not particularly canonical; AIs will not by default give humanity any significant fraction of the universe in the spirit of cooperation.
I endorse that point as well. It takes us somewhat further afield, and I don't plan to argue it here, but I might argue it later.
From "Shah and Yudkowsky on alignment failures [LW · GW]", by the usual suspects:
Yudkowsky: and lest anyone start thinking that was an exhaustive list of fundamental problems, note the absence of, for example, "applying lots of optimization using an outer loss function doesn't necessarily get you something with a faithful internal cognitive representation of that loss function" aka "natural selection applied a ton of optimization power to humans using a very strict very simple criterion of 'inclusive genetic fitness' and got out things with no explicit representation of or desire towards 'inclusive genetic fitness' because that's what happens when you hill-climb and take wins in the order a simple search process through cognitive engines encounters those wins"
From the comments [LW(p) · GW(p)] on "Late 2021 MIRI Conversations: AMA / Discussion [LW · GW]", by Yudkowsky:
Yudkowsky: I would "destroy the world" from the perspective of natural selection in the sense that I would transform it in many ways, none of which were making lots of copies of my DNA, or the information in it, or even having tons of kids half resembling my old biological self.
From the perspective of my highly similar fellow humans with whom I evolved in context, they'd get nice stuff, because "my fellow humans get nice stuff" happens to be the weird unpredictable desire that I ended up with at the equilibrium of reflection on the weird unpredictable godshatter that ended up inside me, as the result of my being strictly outer-optimized over millions of generations for inclusive genetic fitness, which I now don't care about at all.
Paperclip-numbers do well out of paperclip-number maximization. The hapless outer creators of the thing that weirdly ends up a paperclip maximizer, not so much.
From Yudkowsky's appearance on the Bankless podcast (full transcript here [LW · GW]):
Ice cream didn't exist in the natural environment, the ancestral environment, the environment of evolutionary adeptedness. There was nothing with that much sugar, salt, fat combined together as ice cream. We are not built to want ice cream. We were built to want strawberries, honey, a gazelle that you killed and cooked [...] We evolved to want those things, but then ice cream comes along, and it fits those taste buds better than anything that existed in the environment that we were optimized over.
[...]
Leaving that aside for a second, the reason why this metaphor breaks down is that although the humans are smarter than the chickens, we're not smarter than evolution, natural selection, cumulative optimization power over the last billion years and change. (You know, there's evolution before that but it's pretty slow, just, like, single-cell stuff.)
There are things that cows can do for us, that we cannot do for ourselves. In particular, make meat by eating grass. We’re smarter than the cows, but there's a thing that designed the cows; and we're faster than that thing, but we've been around for much less time. So we have not yet gotten to the point of redesigning the entire cow from scratch. And because of that, there's a purpose to keeping the cow around alive.
And humans, furthermore, being the kind of funny little creatures that we are — some people care about cows, some people care about chickens. They're trying to fight for the cows and chickens having a better life, given that they have to exist at all. And there's a long complicated story behind that. It's not simple, the way that humans ended up in that [??]. It has to do with the particular details of our evolutionary history, and unfortunately it's not just going to pop up out of nowhere.
But I'm drifting off topic here. The basic answer to the question "where does that analogy break down?" is that I expect the superintelligences to be able to do better than natural selection, not just better than the humans.
At this point, I'm tired, so I'm logging off. But I would bet a lot of money that I can find at least 3x the number of these examples if I had the energy to. As Alex Turner put it [LW(p) · GW(p)], it seems clear to me that, for a very high portion of "classic" alignment arguments about inner & outer alignment problems, at least in the form espoused by MIRI, the argumentative bedrock is ultimately based on little more than analogies with evolution.
comment by lemonhope (lcmgcd) · 2024-07-31T08:11:03.923Z · LW(p) · GW(p)
I think the tooling/scale is at a point where we can begin the search for "life" (eg viruses, boundaries that repair, etc) in weights during training. We should certainly expect to see such things if the NN is found via an evolutionary algorithm. So we can look for similar structures in similar places with backprop+SGD. I expect this to go much like the search for life on Mars. A negative result would still be good information IMO.