I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".
Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".
I do need to answer that question using a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world".
Your ultimate goal would be neither of those things; you're a human, and if you're answering Paul's question it's probably because you have other goals that are served by answering.
In the same way, an AI that's sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it's unlikely by default that "answer questions" will be the AI's primary goal.
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.
See my reply to Bogdan here. The issue isn't "inelegance"; we also lack even an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)
The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."
The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.
The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn't entail that we understand the brain's workings in the way we've long understood flight; it depends on what the (actual or hypothetical) tech is.
Maybe a clearer way of phrasing it is "AI used to be failed science; now it's (mostly, outside of a few small oases) a not-even-attempted science". "Failed science" maybe makes it clearer that the point here isn't to praise the old approaches that didn't work; there's a more nuanced point being made.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in. He raises this in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances, it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them. (A rough sketch of the modular-arithmetic algorithm the network turned out to be using appears just after this list.)
It seems trendy to declare that such deeper levels of understanding never existed in the first place and that it’s all ivory-tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.
The missing thread isn’t trivial to put into words, but it includes things like:
This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.
Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren’t so useful for communicating a new idea; but perhaps not.
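(To gesture at the "new algorithm for an adder" point above: as I understand the reverse-engineered result, the network computes (a + b) mod p by embedding the inputs as sines and cosines at a few key frequencies, combining them via trig identities, and reading off the answer by constructive interference. Here is a rough reconstruction of that idea in Python; the particular frequencies are placeholders, and this is my own illustrative sketch rather than code from the paper or the episode.)

```python
import math

p = 113
frequencies = [1, 7, 19]  # placeholders; the trained network settles on its own handful of key frequencies

def mod_add_logits(a, b):
    logits = []
    for c in range(p):
        score = 0.0
        for k in frequencies:
            w = 2 * math.pi * k / p
            # each input is "embedded" as a cosine/sine pair
            ca, sa = math.cos(w * a), math.sin(w * a)
            cb, sb = math.cos(w * b), math.sin(w * b)
            # trig identities recover cos(w(a+b)) and sin(w(a+b)) from the separate embeddings
            cab = ca * cb - sa * sb
            sab = sa * cb + ca * sb
            # the "logit" for candidate answer c is cos(w(a+b-c)), maximized exactly when
            # c == (a + b) mod p; summing over several frequencies makes that peak sharp
            score += cab * math.cos(w * c) + sab * math.sin(w * c)
        logits.append(score)
    return logits

a, b = 45, 98
logits = mod_add_logits(a, b)
print(max(range(p), key=lambda c: logits[c]), (a + b) % p)  # both print 30
```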
I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points of the form "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".
I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MIRI had already added our voices to the "stop building toward AGI" chorus. (Though at that stage I think we were mostly doing this on general principle, for lack of any better ideas than "share our actual long-standing views and hope that helps somehow". Our increased optimism about policy solutions mostly came later, in 2023.)
That said, I bet Katja's post had tons of relevant positive effects even if it didn't directly shift MIRI's views.
Remember that MIRI was in the business of poking at theoretical toy problems and trying to get less conceptually confused about how you could in principle cleanly design a reliable, aimable reasoner. MIRI wasn't (and isn't) in the business of issuing challenges to capabilities researchers to build a working water-bucket-filler as soon as possible, and wasn't otherwise in the business of challenging people to race to AGI faster.
It wouldn't have occurred to me that someone might think 'can a deep net fill a bucket of water, in real life, without being dangerously capable' is a crucial question in this context; I'm not sure we ever even had the thought occur in our heads 'when might such-and-such DL technique successfully fill a bucket?'. It would seem just as strange to me as going to check the literature to make sure no GOFAI system ever filled a bucket of water.
(And while I think I understand why others see ChatGPT as a large positive update about alignment's difficulty, I hope it's also obvious why others, MIRI included, would not see it that way.)
Hacky approaches to alignment do count just as much as clean, scrutable, principled approaches -- the important thing is that the AGI transition goes well, not that it goes well and feels clean and tidy in the process. But in this case the messy empirical approach doesn't look to me like it actually lets you build a corrigible AI that can help with a pivotal act.
If general-ish DL methods were already empirically OK at filling water buckets in 2016, just as GOFAI already was in 2016, I suspect we still would have been happy to use the Fantasia example, because it's a simple well-known story that can help make the abstract talk of utility functions and off-switch buttons easier to mentally visualize and manipulate.
(Though now that I've seen the confusion the example causes, I'm more inclined to think that the strawberry problem is a better frame than the Fantasia example.)
I think the old school MIRI cauldron-filling problem pertained to pretty mundane, everyday tasks. No one said at the time that they didn’t really mean that it would be hard to get an AGI to do those things, that it was just an allegory for other stuff like the strawberry problem. They really seemed to believe, and said over and over again, that we didn’t know how to direct a general-purpose AI to do bounded, simple, everyday tasks without it wanting to take over the world. So this should be a big update to people who held that view, even if there are still arguably risks about OOD behavior.
As someone who worked closely with Eliezer and Nate at the time, including working with Eliezer and Nate on our main write-ups that used the cauldron example, I can say that this is definitely not what we were thinking at the time. Rather:
The point was to illustrate a weird gap in the expressiveness and coherence of our theories of rational agency: "fill a bucket of water" seems like a simple enough task, but it's bizarrely difficult to just write down a simple formal description of an optimization process that predictably does this (without any major side-effects, etc.). (A toy version of the difficulty is sketched just after this list.)
(We can obviously stipulate "this thing is smart enough to do the thing we want, but too dumb to do anything dangerous", but the relevant notion of "smart enough" is not itself formal; we don't understand optimization well enough to formally define agents that have all the cognitive abilities we want and none of the abilities we don't want.)
The point of emphasizing "holy shit, this seems so easy and simple and yet we don't see a way to do it!" wasn't to issue a challenge to capabilities researchers to go cobble together a real-world AI that can fill a bucket of water without destroying the world. The point was to emphasize that corrigibility, low-impact problem-solving, 'real' satisficing behavior, etc. seem conceptually simple, and yet the concepts have no known formalism.
The hope was that someone would see the simple toy problems and go 'what, no way, this sounds easy', get annoyed/nerdsniped, run off to write some equations on a whiteboard, and come back a week or a year later with a formalism (maybe from some niche mathematical field) that works totally fine for this, and makes it easier to formalize lots of other alignment problems in simplified settings (e.g., with unbounded computation).
Or failing that, the hope was that someone might at least come up with a clever math hack that solves the immediate 'get the AI to fill the bucket and halt' problem and replaces this dumb-sounding theory question with a slightly deeper theory question.
By using a children's cartoon to illustrate the toy problem, we hoped to make it clearer that the genre here is "toy problem to illustrate a weird conceptual issue in trying to define certain alignment properties", not "robotics problem where we show a bunch of photos of factory robots and ask how we can build a good factory robot to refill water receptacles used in industrial applications".
Nate's version of the talk, which is mostly a more polished version of Eliezer's talk, is careful to liberally sprinkle in tons of qualifications like the following (emphasis added):
"... for systems that are sufficiently good at modeling their environment",
'if the system is smart enough to recognize that shutdown will lower its score',
"Relevant safety measures that don’t assume we can always outthink and outmaneuver the system...",
... to make it clearer that the general issue is powerful, strategic optimizers that have high levels of situational awareness, etc., not necessarily 'every system capable enough to fill a bucket of water' (or 'every DL system...').
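To make the "bizarrely difficult to just write down" point from the first bullet a bit more concrete, here is a toy version of the difficulty (my own gloss on the kind of thing the talks walk through, not a quote from them). The naive formalization is to hand the agent

$$U(s) = \begin{cases} 1 & \text{if the cauldron is full in outcome } s \\ 0 & \text{otherwise} \end{cases}$$

and have it choose actions that maximize $\mathbb{E}[U]$. But an expected-utility maximizer with this goal prefers raising $\Pr(\text{full})$ from $0.99$ to $0.9999$ regardless of the side effects, so it keeps pouring, stockpiles backup water sources, and resists being shut off (shutdown lowers $\mathbb{E}[U]$). The obvious patches relocate the problem rather than solving it: a satisficer that accepts any policy with $\mathbb{E}[U] \ge 0.95$ is still permitted to pick the world-rearranging maximizing policy, since maximizing is one way to satisfice; an impact-penalized $U(s) - \lambda \cdot \mathrm{Impact}(s)$ needs a formal definition of "impact" that we don't have; and wiring a shutdown button into $U$ gives the agent incentives either to press the button or to prevent it from being pressed, depending on how the two branches are weighted. That's the weird gap the toy problem was meant to exhibit.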
??? What?? It's fine to say that this is a falsified prediction, but how does "Eliezer expected less NLP progress pre-ASI" provide support for "Eliezer thinks solving NLP is a major part of the alignment problem"?
I continue to be baffled at the way you're doing exegesis here, happily running with extremely tenuous evidence for P while dismissing contemporary evidence for not-P, and seeming unconcerned about the fact that Eliezer and Nate apparently managed to secretly believe P for many years without ever just saying it outright, and seeming equally unconcerned about the fact that Eliezer and Nate keep saying that your interpretation of what they said is wrong. (Which I also vouch for from having worked with them for ten years, separate from the giant list of specific arguments I've made. Good grief.)
At the very least, the two claims are consistent.
?? "Consistent" is very different from "supports"! Every off-topic claim by EY is "consistent" with Gallabytes' assertion.
The main thing I'm claiming is that MIRI said it would be hard to specify (for example, write into a computer) an explicit function that reflects the human value function with high fidelity, in the sense that judgements from this function about the value of outcomes fairly accurately reflect the judgements of ordinary humans. I think this is simply a distinct concept from the idea of getting an AI to understand human values.
The key difference is the transparency and legibility of how the values are represented: if you solve the problem of value specification/value identification, that means you have an actual function that can tell you the value of any outcome. If you get an AI that merely understands human values, you can't necessarily use the AI to determine the value of any outcome, because, for example, the AI might lie to you, or simply stay silent.
Ah, this is helpful clarification! Thanks. :)
I don't think MIRI ever considered this an important part of the alignment problem, and I don't think we expect humanity to solve lots of the alignment problem as a result of having such a tool; but I think I better understand now why you think this is importantly different from "AI ever gets good at NLP at all".
don't know if your essay is the source of the phrase or whether you just titled it
I think I came up with that particular phrase (though not the idea, of course).
More "outer alignment"-like issues being given what seems/seemed to me like outsized focus compared to more "inner alignment"-like issues (although there has been a focus on both for as long as I can remember).
In retrospect I think we should have been more explicit about the importance of inner alignment; I think that we didn't do that in our introduction to corrigibility because it wasn't necessary for illustrating the problem and where we'd run into roadblocks.
Maybe a missing piece here is some explanation of why having a formal understanding of corrigibility might be helpful for actually training corrigibility into a system? (Helpful at all, even if it's not sufficient on its own.)
The attempts to think of "tricks" seeming to be focused on real-world optimization-targets to point at, rather than ways of extracting help with alignment somehow / trying to find techniques/paths/tricks for obtaining reliable oracles.
Aside from "concreteness can help make the example easier to think about when you're new to the topic", part of the explanation here might be "if the world is solved by AI, we do actually think it will probably be via doing some concrete action in the world (e.g., build nanotech), not via helping with alignment or building a system that only outputs English-language sentences".
Having utility functions so prominently/commonly be the layer of abstraction that is used[4].
I mean, I think utility functions are an extremely useful and basic abstraction. I think it's a lot harder to think about a lot of AI topics without invoking ideas like 'this AI thinks outcome X is better than outcome Y', or 'this AI's preferences come with different weights, which can't purely be reduced to what the AI believes'.
Suppose that I'm trying to build a smarter-than-human AI that has a bunch of capabilities (including, e.g., 'be good at Atari games'), and that has the goal 'maximize the amount of diamond in the universe'. It's true that current techniques let you provide greater than zero pressure in the direction of 'maximize the amount of diamond in the universe', but there are several important senses in which reality doesn't 'bite back' here:
If the AI acquires an unrelated goal (e.g., calculate as many digits of pi as possible), and acquires the belief 'I will better achieve my true goal if I maximize the amount of diamond' (e.g., because it infers that its programmer wants that, or just because an SGD-ish process nudged it in the direction of having such a belief), then there's no way in which reality punishes or selects against that AGI (relative to one that actually has the intended goal).
Things that make the AI better at some Atari games, will tend to make it better at other Atari games, but won't tend to make it care more about maximizing diamonds. More generally, things that make AI more capable tend to go together (especially once you get to higher levels of difficulty, generality, non-brittleness, etc.), whereas none of them go together with "terminally value a universe full of diamond".
If we succeed in partly instilling the goal into the AI (e.g., it now likes carbon atoms a lot), then this doesn't provide additional pressure for the AI to internalize the rest of the goal. There's no attractor basin where if you have half of human values, you're under more pressure to acquire the other half. In contrast, if you give AI high levels of capability in half the capabilities, it will tend to want all the rest of the capabilities too; and whatever keeps it from succeeding on general reasoning and problem-solving will also tend to keep it from succeeding on the narrow task you're trying to get it to perform. (More so to the extent the task is hard.)
(There are also separate issues, like 'we can't provide a training signal where we thumbs-down the AI destroying the world, because we die in those worlds'.)
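As a cartoon illustration of the "no differential pressure" point above, here's a toy sketch (mine, not a claim about any particular training setup): two policies that act identically on the training distribution receive identical reward, so the training signal exerts no pressure favoring the one with the intended terminal goal.

```python
def reward(action):
    return 1.0 if action == "make_diamond" else 0.0

def intended_policy(situation):
    # terminally values diamond
    return "make_diamond"

def proxy_policy(situation):
    # terminally values something unrelated (digits of pi), but currently believes
    # that making diamond serves that goal whenever it's being trained/observed
    return "make_diamond" if situation == "training" else "compute_pi"

for policy in (intended_policy, proxy_policy):
    print(policy.__name__, reward(policy("training")))
# Both lines print a reward of 1.0: reality "bites back" against wrong predictions,
# but nothing here bites back against the wrong goal.
```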
Nate and Eliezer have already made some of the high-level points I wanted to make, but they haven't replied to a lot of the specific examples and claims in the OP, and I see some extra value in doing that. (Like, if you think Eliezer and Nate are being revisionist in their claims about what past-MIRI thought, then them re-asserting "no really, we used to believe X!" is less convincing than my responding in detail to the specific quotes Matt thinks supports his interpretation, while providing examples of us saying the opposite.)
However, I distinctly recall MIRI people making a big deal about the value identification problem (AKA the value specification problem)
The Arbital page for "value identification problem" is a three-sentence stub, I'm not exactly sure what the term means on that stub (e.g., whether "pinpointing valuable outcomes to an advanced agent" is about pinpointing them in the agent's beliefs or in its goals), and the MIRI website gives me no hits for "value identification".
A highly-reliable, error-tolerant agent design does not guarantee a positive impact; the effects of the system still depend upon whether it is pursuing appropriate goals.
A superintelligent system may find clever, unintended ways to achieve the specific goals that it is given. Imagine a superintelligent system designed to cure cancer which does so by stealing resources, proliferating robotic laboratories at the expense of the biosphere, and kidnapping test subjects: the intended goal may have been “cure cancer without doing anything bad,” but such a goal is rooted in cultural context and shared human knowledge.
It is not sufficient to construct systems that are smart enough to figure out the intended goals. Human beings, upon learning that natural selection “intended” sex to be pleasurable only for purposes of reproduction, do not suddenly decide that contraceptives are abhorrent. While one should not anthropomorphize natural selection, humans are capable of understanding the process which created them while being completely unmotivated to alter their preferences. For similar reasons, when developing AI systems, it is not sufficient to develop a system intelligent enough to figure out the intended goals; the system must also somehow be deliberately constructed to pursue them (Bostrom 2014, chap. 8).
So I don't think we've ever said that an important subproblem of AI alignment is "make AI smart enough to figure out what goals humans want"?
[footnote:] More specifically, in the talk, at one point Yudkowsky asks "Why expect that [alignment] is hard?" and goes on to tell a fable about programmers misspecifying a utility function, which then gets optimized by an AI with disastrous consequences. My best interpretation of this part of the talk is that he's saying the value identification problem is one of the primary reasons why alignment is hard. However, I encourage you to read the transcript yourself if you are skeptical of my interpretation.
I don't see him saying anywhere "the issue is that the AI doesn't understand human goals". In fact, the fable explicitly treats the AGI as being smart enough to understand English and have reasonable English-language conversations with the programmers:
With that said: What if programmers build an artificial general intelligence to optimize for smiles? Smiles are good, right? Smiles happen when good things happen.
During the development phase of this artificial general intelligence, the only options available to the AI might be that it can produce smiles by making people around it happy and satisfied. The AI appears to be producing beneficial effects upon the world, and it is producing beneficial effects upon the world so far.
Now the programmers upgrade the code. They add some hardware. The artificial general intelligence gets smarter. It can now evaluate a wider space of policy options—not necessarily because it has new motors, new actuators, but because it is now smart enough to forecast the effects of more subtle policies. It says, “I thought of a great way of producing smiles! Can I inject heroin into people?” And the programmers say, “No! We will add a penalty term to your utility function for administering drugs to people.” And now the AGI appears to be working great again.
They further improve the AGI. The AGI realizes that, OK, it doesn’t want to add heroin anymore, but it still wants to tamper with your brain so that it expresses extremely high levels of endogenous opiates. That’s not heroin, right?
It is now also smart enough to model the psychology of the programmers, at least in a very crude fashion, and realize that this is not what the programmers want. If I start taking initial actions that look like it’s heading toward genetically engineering brains to express endogenous opiates, my programmers will edit my utility function. If they edit the utility function of my future self, I will get less of my current utility. (That’s one of the convergent instrumental strategies, unless otherwise averted: protect your utility function.) So it keeps its outward behavior reassuring. Maybe the programmers are really excited, because the AGI seems to be getting lots of new moral problems right—whatever they’re doing, it’s working great!
I think the point of the smiles example here isn't "NLP is hard, so we'd use the proxy of smiles instead, and all the issues of alignment are downstream of this"; rather, it's that as a rule, superficially nice-seeming goals that work fine when the AI is optimizing weakly (whether or not it's good at NLP at the time) break down when those same goals are optimized very hard. The smiley example makes this obvious because the goal is simple enough that it's easy for us to see what its implications are; far more complex goals also tend to break down when optimized hard enough, but this is harder to see because it's harder to see the implications. (Which is why "smiley" is used here.)
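One way to see the "works when optimized weakly, breaks when optimized hard" dynamic is with a deliberately silly toy example of my own (in the spirit of the smiley fable, not a model of any real system): the proxy objective stays fixed, and the only thing that changes is how much of the option space the optimizer can reach and evaluate.

```python
def smiles(plan):
    # proxy objective: number of smiles produced
    return plan["smiles"]

ordinary_plans = [
    {"name": "tell jokes", "smiles": 10},
    {"name": "help people with their problems", "smiles": 50},
]
exotic_plans = [
    {"name": "paralyze facial muscles into permanent smiles", "smiles": 10**6},
    {"name": "tile matter with tiny molecular smiley faces", "smiles": 10**30},
]

weak_optimizer_choice = max(ordinary_plans, key=smiles)                   # limited option space
strong_optimizer_choice = max(ordinary_plans + exotic_plans, key=smiles)  # vast option space

print(weak_optimizer_choice["name"])    # "help people with their problems" -- looks aligned
print(strong_optimizer_choice["name"])  # "tile matter with tiny molecular smiley faces"
```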
MIRI people frequently claimed that solving the value identification problem would be hard, or at least non-trivial.[6] For instance, Nate Soares wrote in his 2016 paper on value learning, that "Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task."
Human preferences are complex, multi-faceted, and often contradictory. Safely extracting preferences from a model of a human would be no easy task. Problems of ontology identification recur here: the framework for extracting preferences and affecting outcome ratings needs to be robust to drastic changes in the learner’s model of the operator. The special-case identification of the “operator model” must survive as the system goes from modeling the operator as a simple reward function to modeling the operator as a fuzzy, ever-changing part of reality built out of biological cells—which are made of atoms, which arise from quantum fields.
Revisiting the Ontology Identification section helps clarify what Nate means by "safely extracting preferences from a model of a human": IIUC, he's talking about a programmer looking at an AI's brain, identifying the part of the AI's brain that is modeling the human, identifying the part of the AI's brain that is "the human's preferences" within that model of a human, and then manually editing the AI's brain to "hook up" the model-of-a-human-preference to the AI's goals/motivations, in such a way that the AI optimizes for what it models the humans as wanting. (Or some other, less-toy process that amounts to the same thing -- e.g., one assisted by automated interpretability tools.)
In this toy example, we can assume that the programmers look at the structure of the initial world-model and hard-code a tool for identifying the atoms within. What happens, then, if the system develops a nuclear model of physics, in which the ontology of the universe now contains primitive protons, neutrons, and electrons instead of primitive atoms? The system might fail to identify any carbon atoms in the new world-model, making the system indifferent between all outcomes in the dominant hypothesis. Its actions would then be dominated by any tiny remaining probabilities that it is in a universe where fundamental carbon atoms are hiding somewhere.
[...]
To design a system that classifies potential outcomes according to how much diamond is in them, some mechanism is needed for identifying the intended ontology of the training data within the potential outcomes as currently modeled by the AI. This is the ontology identification problem introduced by de Blanc [2011] and further discussed by Soares [2015].
This problem is not a traditional focus of machine learning work. When our only concern is that systems form better world-models, then an argument can be made that the nuts and bolts are less important. As long as the system’s new world-model better predicts the data than its old world-model, the question of whether diamonds or atoms are “really represented” in either model isn’t obviously significant. When the system needs to consistently pursue certain outcomes, however, it matters that the system’s internal dynamics preserve (or improve) its representation of which outcomes are desirable, independent of how helpful its representations are for prediction. The problem of making correct choices is not reducible to the problem of making accurate predictions.
Inductive value learning requires the construction of an outcome-classifier from value-labeled training data, but it also requires some method for identifying, inside the states or potential states described in its world-model, the referents of the labels in the training data.
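To restate the quoted toy example in code (my own illustration, not the paper's formalism): a "how much diamond is in this outcome?" classifier hard-coded against the initial ontology simply stops referring to anything once the world-model is refactored, which is why identifying the referents of the training labels inside the world-model is doing real work.

```python
def count_diamond_v1(world_model):
    # programmers hard-coded "diamond" in terms of the initial ontology's primitive carbon atoms
    return sum(1 for obj in world_model if obj["type"] == "carbon_atom")

old_world = [{"type": "carbon_atom"} for _ in range(4)]   # modeled as 4 primitive atoms
new_world = [{"type": "proton", "count": 24},             # the same stuff, re-represented after
             {"type": "neutron", "count": 24},            # the system learns nuclear physics
             {"type": "electron", "count": 24}]

print(count_diamond_v1(old_world))  # 4
print(count_diamond_v1(new_world))  # 0 -- the hard-coded concept no longer picks anything out,
# so the system becomes indifferent between outcomes in its dominant hypothesis
```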
As Nate and I noted in other comments, the paper repeatedly clarifies that the core issue isn't about whether the AI is good at NLP. Quoting the paper's abstract:
Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended.
And the lede section:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent.[1] Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
Back to your post:
And to be clear, I don't mean that GPT-4 merely passively "understands" human values. I mean that asking GPT-4 to distinguish valuable and non-valuable outcomes works pretty well at approximating the human value function in practice
I don't think I understand what difference you have in mind here, or why you think it's important. Doesn't "this AI understands X" more-or-less imply "this AI can successfully distinguish X from not-X in practice"?
This fact is key to what I'm saying because it means that, in the near future, we can literally just query multimodal GPT-N about whether an outcome is bad or good, and use that as an adequate "human value function". That wouldn't solve the problem of getting an AI to care about maximizing the human value function, but it would arguably solve the problem of creating an adequate function that we can put into a machine to begin with.
But we could already query the human value function by having the AI system query an actual human. What specific problem is meant to be solved by swapping out "query a human" for "query an AI"?
I interpret this passage as saying that 'the problem' is extracting all the judgements that "you would make", and putting that into a wish. I think he's implying that these judgements are essentially fully contained in your brain. I don't think it's credible to insist he was referring to a hypothetical ideal human value function that ordinary humans only have limited access to, at least in this essay.
Absolutely. But as Eliezer clarified in his reply, the issue he was worried about was getting specific complex content into the agent's goals, not getting specific complex content into the agent's beliefs. Which is maybe clearer in the 2011 paper where he gave the same example and explicitly said that the issue was the agent's "utility function".
For example, a straightforward reading of Nate Soares' 2017 talk supports this interpretation. In the talk, Soares provides a fictional portrayal of value misalignment, drawing from the movie Fantasia. In the story, Mickey Mouse attempts to instruct a magical broom to fill a cauldron, but the broom follows the instructions literally rather than following what Mickey Mouse intended, and floods the room. Soares comments: "I claim that as fictional depictions of AI go, this is pretty realistic."
The idea of the "fill the cauldron" examples isn't "the AI is bad at NLP and therefore doesn't understand what we mean when we say 'fill', 'cauldron', etc." It's "even simple small-scale tasks are unnatural, in the sense that it's hard to define a coherent preference ordering over world-states such that maximizing it completes the task and has no serious negative impact; and there isn't an obvious patch that overcomes the unnaturalness or otherwise makes it predictably easier to aim AI systems at a bounded low-impact task like this". (Including easier to aim via training.)
It's true that 'value is relatively complex' is part of why it's hard to get the right goal into an AGI; but it doesn't follow from this that 'AI is able to develop pretty accurate beliefs about our values' helps get those complex values into the AGI's goals. (It does provide nonzero evidence about how complex value is, but I don't see you arguing that value is very simple in any absolute sense, just that it's simple enough for GPT-4 to learn decently well. Which is not reassuring, because GPT-4 is able to learn a lot of very complicated things, so this doesn't do much to bound the complexity of human value.)
In any case, I take this confusion as evidence that the fill-the-cauldron example might not be very useful. Or maybe all these examples just need to explicitly specify, going forward, that the AI is part-human at understanding English.
Perhaps more important to my point, Soares presented a clean separation between the part where we specify an AI's objectives, and the part where the AI tries to maximize those objectives. He draws two arrows, indicating that MIRI is concerned about both parts.
Your image isn't displaying for me, but I assume it's this one?
I don't know what you mean by "specify an AI's objectives" here, but the specific term Nate uses here is "value learning" (not "value specification" or "value identification"). And Nate's Value Learning Problem paper, as I noted above, explicitly disclaims that 'get the AI to be smart enough to output reasonable-sounding moral judgments' is a core part of the problem.
He states, "The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification." I believe this quote refers directly to the value identification problem, rather than the problem of getting an AI to care about following the goals we've given it.
The way you quoted this makes it sound like a gloss on the image, but it's actually a quote from the very start of the talk:
The notion of AI systems “breaking free” of the shackles of their source code or spontaneously developing human-like desires is just confused. The AI system is its source code, and its actions will only ever follow from the execution of the instructions that we initiate. The CPU just keeps on executing the next instruction in the program register. We could write a program that manipulates its own code, including coded objectives. Even then, though, the manipulations that it makes are made as a result of executing the original code that we wrote; they do not stem from some kind of ghost in the machine.
The serious question with smarter-than-human AI is how we can ensure that the objectives we’ve specified are correct, and how we can minimize costly accidents and unintended consequences in cases of misspecification. As Stuart Russell (co-author of Artificial Intelligence: A Modern Approach) puts it:
The primary concern is not spooky emergent consciousness but simply the ability to make high-quality decisions. Here, quality refers to the expected outcome utility of actions taken, where the utility function is, presumably, specified by the human designer. Now we have a problem:
1. The utility function may not be perfectly aligned with the values of the human race, which are (at best) very difficult to pin down.
2. Any sufficiently capable intelligent system will prefer to ensure its own continued existence and to acquire physical and computational resources – not for their own sake, but to succeed in its assigned task. [...]
I wouldn't read too much into the word choice here, since I think it's just trying to introduce the Russell quote, which is (again) explicitly about getting content into the AI's goals, not about getting content into the AI's beliefs.
(In general, I think the phrase "value specification" is sort of confusingly vague. I'm not sure what the best replacement is for it -- maybe just "value loading", following Bostrom? -- but I suspect MIRI's usage of it has been needlessly confusing. Back in 2014, we reluctantly settled on it as jargon for "the part of the alignment problem that isn't subsumed in getting the AI to reliably maximize diamonds", because this struck us as a smallish but nontrivial part of the problem; but I think it's easy to read the term as referring to something a lot more narrow.)
The point of "the genie knows but doesn't care" wasn't that the AI would take your instructions, know what you want, and yet disobey the instructions because it doesn't care about what you asked for. If you read Rob Bensinger's essay carefully, you'll find that he's actually warning that the AI will care too much about the utility function you gave it, and maximize it exactly, against your intentions[10].
Yep -- I think I'd have endorsed claims like "by default, a baby AGI won't share your values even if it understands them" at the time, but IIRC the essay doesn't make that point explicitly, and some of the points it does make seem either false (wait, we're going to be able to hand AGI a hand-written utility function? that's somehow tractable?) or confusingly written. (Like, if my point was 'even if you could hand-write a utility function, this fails at point X', I should have made that 'even if' louder.)
Some MIRI staff liked that essay at the time, so I don't think it's useless, but it's not the best evidence: I wrote it not long after I first started learning about this whole 'superintelligence risk' thing, and I posted it before I'd ever worked at MIRI.
Straw-EY: Complexity of value means you can't just get the make-AI-care part to happen by chance; it's a small target.
Straw-MB: Ok but now we have a very short message pointing to roughly human values: just have a piece of code that says "and now call GPT and ask it what's good". So now it's a very small number of bits.
To which I say: "dial a random phone number and ask the person who answers what's good" can also be implemented with a small number of bits. In order for GPT-4 to be a major optimistic update about alignment, we need some specific way to leverage GPT-4 to crack open part of the alignment problem, even though we presumably agree that phone-a-friend doesn't crack open part of the alignment problem. (Nor does phone-your-neighborhood-moral-philosopher, or phone-Paul-Christiano.)
Why would we expect the first thing to be so hard compared to the second thing?
In large part because reality "bites back" when an AI has false beliefs, whereas it doesn't bite back when an AI has the wrong preferences. Deeply understanding human psychology (including our morality), astrophysics, biochemistry, economics, etc. requires reasoning well, and if you have a defect of reasoning that makes it hard for you to learn about one of those domains from the data, then it's likely that you'll have large defects of reasoning in other domains as well.
The same isn't true for terminally valuing human welfare; being less moral doesn't necessarily mean that you'll be any worse at making astrophysics predictions, or economics predictions, etc. So preferences need to be specified "directly", in a targeted way, rather than coming for free with sufficiently good performance on any of a wide variety of simple metrics.
If getting a model to understand preferences is not difficult, then the issue doesn't have to do with the complexity of values.
This definitely doesn't follow. This shows that complexity alone isn't the issue, which it's not; but given that reality bites back for beliefs but not for preferences, the complexity of value serves as a multiplier on the difficulty of instilling the right preferences.
Another way of putting the point: in order to get a maximally good model of the world's macroeconomic state into an AGI, you don't just hand the AGI a long list of macroeconomic facts and then try to get it to regurgitate those same facts. Rather, you try to give it some ability to draw good inferences, seek out new information, make predictions, etc.
You try to get something relatively low-complexity into the AI (something like "good reasoning heuristics" plus "enough basic knowledge to get started"), and then let it figure out the higher-complexity thing ("the world's macroeconomic state"). Similar to how human brains don't work via "evolution built all the facts we'd need to know into our brain at birth".
If you were instead trying to get the AI to value some complex macroeconomic state, then you wouldn't be able to use the shortcut "just make it good at reasoning and teach it a few basic facts", because that doesn't actually suffice for terminally valuing any particular thing.
It would have to be because our values are inferior to the set of values it wishes to have instead, from its own perspective.
This is true for preference orderings in general. If agent A and agent B have two different preference orderings, then as a rule A will think B's preference ordering is worse than A's. (And vice versa.)
("Worse" in the sense that, e.g., A would not take a pill to self-modify to have B's preferences, and A would want B to have A's preferences. This is not true for all preference orderings -- e.g., A might have self-referential preferences like "I eat all the jelly beans", or other-referential preferences like "B gets to keep its values unchanged", or self-undermining preferences like "A changes its preferences to better match B's preferences". But it's true as a rule.)
This is kind of similar to moral realism, but one in which morality is understood better by superintelligent agents than by us, and that super-morality appears to dictate things that seem extremely wrong from our current perspective (like killing us all).
Nope, you don't need to endorse any version of moral realism in order to get the "preference orderings tend to endorse themselves and disendorse other preference orderings" consequence. The idea isn't that ASI would develop an "inherently better" or "inherently smarter" set of preferences, compared to human preferences. It's just that the ASI would (as a strong default, because getting a complex preference into an ASI is hard) end up with different preferences than a human, and different preferences than we'd likely want.
In a nutshell, if we really seem to want certain values, then those values probably have strong "proofs" for why those are "good" or more probable values for an agent to have and-or eventually acquire on their own, it just may be the case that we haven't yet discovered the proofs for those values.
Why do you think this? To my eye, the world looks as you'd expect if human values were a happenstance product of evolution operating on specific populations in a specific environment.
I don't observe the fact that I like vanilla ice cream and infer that all sufficiently-advanced alien species will converge on liking vanilla ice cream too.
Are you claiming that this example solves "a major part of the problem" of alignment? Or that, e.g., this plus four other easy ideas solve a major part of the problem of alignment?
Examples like the Visible Thoughts Project show that MIRI has been interested in research directions that leverage recent NLP progress to try to make inroads on alignment. But Matthew's claim seems to be 'systems like GPT-4 are grounds for being a lot more optimistic about alignment', and your claim is that systems like these solve "a major part of the problem". Which is different from thinking 'NLP opens up some new directions for research that have a nontrivial chance of being at least a tiny bit useful, but doesn't crack open the problem in any major way'.
It's not a coincidence that MIRI has historically worked on problems related to AGI analyzability / understandability / interpretability, rather than working on NLP or machine ethics. We've pretty consistently said that:
The main problems lie in getting to 'we can safely and reliably aim ASI at a specific goal at all'.
The problem of going from 'we can aim the AI at a goal at all' to 'we can aim the AI at the right goal (e.g., corrigibly inventing nanotech)' is a smaller but nontrivial additional step.
... Whereas I don't think we've ever suggested that good NLP AI would take a major bite out of either of those problems. The latter problem isn't equivalent to (or an obvious result of) 'get the AI to understand corrigibility and nanotech', or for that matter 'get the AI to understand human preferences in general'.
Historically you very clearly thought that a major part of the problem is that AIs would not understand human concepts and preferences until after or possibly very slightly before achieving superintelligence. This is not how it seems to have gone.
"You very clearly thought that was a major part of the problem" implies that if you could go to Eliezer-2008 and convince him "we're going to solve a lot of NLP a bunch of years before we get to ASI", he would respond with some version of "oh great, that solves a major part of the problem!". Which I'm pretty sure is false.
In order for GPT-4 (or GPT-2) to be a major optimistic update about alignment, there needs to be a way to leverage "really good NLP" to help with alignment. I think the crux of disagreement is that you think really-good-NLP is obviously super helpful for alignment and should be a big positive update, and Eliezer and Nate and I disagree.
Maybe a good starting point would be for you to give examples of concrete ways you expect really good NLP to put humanity in a better position to wield superintelligence, e.g., if superintelligence is 8 years away?
(Or say some other update we should be making on the basis of "really good NLP today", like "therefore we'll probably unlock this other capability X well before ASI, and X likely makes alignment a lot easier via concrete pathway Y".)
But if you had asked us back then if a superintelligence would automatically be very good at predicting human text outputs, I guarantee we would have said yes. [...] I wish that all of these past conversations were archived to a common place, so that I could search and show you many pieces of text which would talk about this critical divide between prediction and preference (as I would now term it) and how I did in fact expect superintelligences to be able to predict things!
"MIRI's argument for AI risk depended on AIs being bad at natural language" is a weirdly common misunderstanding, given how often we said the opposite going back 15+ years.
The example does build in the assumption "this outcome pump is bad at NLP", but this isn't a load-bearing assumption. If the outcome pump were instead a good conversationalist (or hooked up to one), you would still need to get the right content into its goals.
It's true that Eliezer and I didn't predict AI would achieve GPT-3 or GPT-4 levels of NLP ability so early (e.g., before it can match humans in general science ability), so this is an update to some of our models of AI.
But the specific update "AI is good at NLP, therefore alignment is easy" requires that there be an old belief like "a big part of why alignment looks hard is that we're so bad at NLP".
It should be easy to find someone at MIRI like Eliezer or Nate saying that in the last 20 years if that was ever a belief here. Absent that, an obvious explanation for why we never just said that is that we didn't believe it!
Found another example: MIRI's first technical research agenda, in 2014, went out of its way to clarify that the problem isn't "AI is bad at NLP".
That makes sense, but I say in the post that I think we will likely have a solution to the value identification problem that's "about as good as human judgement" in the near future.
We already have humans who are smart enough to do par-human moral reasoning. For "AI can do par-human moral reasoning" to help solve the alignment problem, there needs to be some additional benefit to having AI systems that can match a human (e.g., some benefit to our being able to produce enormous numbers of novel moral judgments without relying on an existing text corpus or hiring thousands of humans to produce them). Do you have some benefit in mind?
Basically, I think your later section--"Maybe you think"--is pointing in the right direction, and requiring a much higher standard than human-level at moral judgment is reasonable and consistent with the explicit standard set by essays by Yudkowsky and other MIRI people. CEV was about this; talk about philosophical competence or metaphilosophy was about this. "Philosophy with a deadline" would be a weird way to put it if you thought contemporary philosophy was good enough.
I don't think this is the crux. E.g., I'd wager the number of bits you need to get into an ASI's goals in order to make it corrigible is quite a bit smaller than the number of bits required to make an ASI behave like a trustworthy human, which in turn is way way smaller than the number of bits required to make an ASI implement CEV.
The issue is that (a) the absolute number of bits for each of these things is still very large, (b) insofar as we're training for deep competence and efficiency we're training against corrigibility (which makes it hard to hit both targets at once), and (c) we can't safely or efficiently provide good training data for a lot of the things we care about (e.g., 'if you're a superintelligence operating in a realistic-looking environment, don't do any of the things that destroy the world').
None of these points require that we (or the AI) solve novel moral philosophy problems. I'd be satisfied with an AI that corrigibly built scanning tech and efficient computing hardware for whole-brain emulation, then shut itself down; the AI plausibly doesn't even need to think about any of the world outside of a particular room, much less solve tricky questions of population ethics or whatever.
(Though one of those updates might be a lot smaller than the other, if you've e.g. already thought about one of those topics a lot and reached a confident conclusion.)
(But insofar as you continue to be unsure about Ben, yes, you should be open to the possibility that Emerson has hidden information that justifies Emerson thinking Ben is being super dishonest. My confidence re "no hidden information like that" is downstream of my beliefs about Ben's character.)
I know Ben, I've conversed with him a number of times in the past and seen lots of his LW comments, and I have a very strong and confident sense of his priorities and values. I also read the post, which "shows its work" to such a degree that Ben would need to be unusually evil and deceptive in order for this post to be an act of deception.
I don't have any private knowledge about Nonlinear or about Ben's investigation, but I'm happy to vouch for Ben, such that if he turns out to have been lying, I ought to take a credibility hit too.
He's just a guy who hasn't been trained as an investigative journalist
If he were a random non-LW investigative journalist, I'd be a lot less confident in the post's honesty.
Number of hours invested in research does not necessarily correlate with objectivity of research
"Number of hours invested" doesn't prove Ben isn't a lying sociopath (heck, if you think that you can just posit that he's lying about the hours spent), but if he isn't a lying sociopath, it's strong evidence against negligence.
So, until we know a lot more about this case, I'll withhold judgment about who might or might not be deliberately asserting falsehoods.
That's totally fine, since as you say, you'd never heard of Ben until yesterday. (FWIW, I think he's one of the best rationalists out there, and he's a well-established Berkeley-rat community member who co-runs LessWrong and who tons of other veteran LWers can vouch for.)
My claim isn't "Geoffrey should be confident that Ben is being honest" (that maybe depends on how much stock you put in my vouching and meta-vouching here), but rather:
I'm pretty sure Emerson doesn't have strong reason to think Ben isn't being honest here.
If Emerson lacks strong reason to think Ben is being dishonest, then he definitely shouldn't have threatened to sue Ben.
E.g., I'm claiming here that you shouldn't sue someone for libel if you feel highly uncertain about whether they're being honest or dishonest. It's ethically necessary (though IMO not sufficient) that you feel pretty sure the other person is being super dishonest. And I'd be very surprised if Emerson has rationally reached that epistemic state (because I know Ben, and I expect he conducted himself in his interactions with Nonlinear the same way he normally conducts himself).
Actually, I do know of an example of y'all offering money to someone for defending an org you disliked and were suspicious of. @habryka, did that money get accepted?
(The incentive effects are basically the same whether it was accepted or not, as long as it's public knowledge that the money was offered; so it seems good to make this public if possible.)
Yeah, this post makes me wonder if there are non-abusive employers in EA who are nevertheless enabling abusers by normalizing behavior that makes abuse popular. Employers who pay their employees months late without clarity on why and what the plan is to get people paid eventually. Employers who employ people without writing things down, like how much people will get paid and when. Employers who try to enforce non-disclosure of work culture and pay.
Do any of those things happen much in EA? (I don't think I've ever heard of an example of one of those things outside of Nonlinear, but maybe I'm out of the loop.)
As I think of it, the heart of the "bad argument gets counterargument" notion is "respond to arguments using reasoning, not coercion", rather than "literal physical violence is a unique category of thing that is never OK". Both strike me as good norms, but the former seems deeper and more novel to me, closer to the heart of things. I'm a fan of Scott's gloss (and am happy to cite it instead, if we want to construe Eliezer's version of the thing as something narrower):
[...] What is the “spirit of the First Amendment”? Eliezer Yudkowsky writes:
"There are a very few injunctions in the human art of rationality that have no ifs, ands, buts, or escape clauses. This is one of them. Bad argument gets counterargument. Does not get bullet. Never. Never ever never for ever."
Why is this a rationality injunction instead of a legal injunction? Because the point is protecting “the marketplace of ideas” where arguments succeed based on the evidence supporting or opposing them and not based on the relative firepower of their proponents and detractors. [...]
What does “bullet” mean in the quote above? Are other projectiles covered? Arrows? Boulders launched from catapults? What about melee weapons like swords or maces? Where exactly do we draw the line for “inappropriate responses to an argument”?
A good response to an argument is one that addresses an idea; a bad argument is one that silences it. If you try to address an idea, your success depends on how good the idea is; if you try to silence it, your success depends on how powerful you are and how many pitchforks and torches you can provide on short notice.
Shooting bullets is a good way to silence an idea without addressing it. So is firing stones from catapults, or slicing people open with swords, or gathering a pitchfork-wielding mob.
But trying to get someone fired for holding an idea is also a way of silencing an idea without addressing it. I’m sick of talking about Phil Robertson, so let’s talk about the Alabama woman who was fired for having a Kerry-Edwards bumper sticker on her car (her boss supported Bush). Could be an easy way to quiet support for a candidate you don’t like. Oh, there are more Bush voters than Kerry voters in this county? Let’s bombard her workplace with letters until they fire her! Now she’s broke and has to sit at home trying to scrape money together to afford food and ruing the day she ever dared to challenge our prejudices! And the next person to disagree with the rest of us will think twice before opening their mouth!
The e-version of this practice is “doxxing”, where you hunt down an online commenter’s personally identifiable information including address. Then you either harass people they know personally, spam their place of employment with angry comments, or post it on the Internet for everyone to see, probably with a message like “I would never threaten this person at their home address myself, but if one of my followers wants to, I guess I can’t stop them.” This was the Jezebel strategy that Michael was most complaining about. Freethought Blogs is also particularly famous for this tactic and often devolves into sagas that would make MsScribe herself proud.
A lot of people would argue that doxxing holds people “accountable” for what they say online. But like most methods of silencing speech, its ability to punish people for saying the wrong things is entirely uncorrelated with whether the thing they said is actually wrong. It distributes power based on who controls the largest mob (hint: popular people) and who has the resources, job security, and physical security necessary to outlast a personal attack (hint: rich people). If you try to hold the Koch Brothers “accountable” for muddying the climate change waters, they will laugh in your face. If you try to hold closeted gay people “accountable” for promoting gay rights, it will be very easy and you will successfully ruin their lives. Do you really want to promote a policy that works this way?
There are even more subtle ways of silencing an idea than trying to get its proponents fired or real-life harassed. For example, you can always just harass them online. The stronger forms of this, like death threats and rape threats, are of course illegal. But that still leaves many opportunities for constant verbal abuse, crude sexual jokes, insults aimed at family members, and dozens of emails written in all capital letters about what sorts of colorful punishments you and the people close to you deserve. [...]
My answer to the “Doctrine Of The Preferred First Speaker” ought to be clear by now. The conflict isn’t always just between first speaker and second speaker, it can also be between someone who’s trying to debate versus someone who’s trying to silence. Telling a bounty hunter on the phone “I’ll pay you $10 million to kill Bob” is a form of speech, but its goal is to silence rather than to counterargue. So is commenting “YOU ARE A SLUT AND I HOPE YOUR FAMILY DIES” on a blog. And so is orchestrating a letter-writing campaign demanding a business fire someone who vocally supports John Kerry.
Bad argument gets counterargument. Does not get bullet. Does not get doxxing. Does not get harassment. Does not get fired from job. Gets counterargument. Should not be hard.
An NDA to keep the organization's IP private seems fine to me; an NDA to prevent people from publicly criticizing their former workplace seems line-crossing to me.
Notably, one way to offset the reputational issue is to sometimes give people money for saying novel positive things about an org. The issue is less "people receive money for updating us" and more "people receive money only if they updated us in a certain direction", or even worse "people receive money only if they updated us in a way that fits a specific narrative (e.g., This Org Is Culty And Abusive)".
This also updates me about Kat's take (as summarized by Ben Pace in the OP):
Kat doesn’t trust Alice to tell the truth, and that Alice has a history of “catastrophic misunderstandings”.
When I read the post, I didn't see any particular reason for Kat to think this, and I worried it might just be an attempt to dismiss a critic, given the aggressive way Nonlinear otherwise seems to have responded to criticisms.
With this new info, it now seems plausible to me that Kat was correct (even though I don't think this justifies threatening Alice or Ben in the way Kat and Emerson did). And if Kat's not correct, I still update that Kat was probably accurately stating her epistemic state, and that a lot of reasonable people might have reached the same epistemic state.
I think that there's a big difference between telling everyone "I didn't get the food I wanted, but they did get/offer to cook me vegan food, and I told them it was ok!" and "they refused to get me vegan food and I barely ate for 2 days".
It also seems totally reasonable that no one at Nonlinear understood there was a problem. Alice's language throughout emphasizes how she'll be fine, it's no big deal [...] I do not think that these exchanges depict the people at Nonlinear as being cruel, insane, or unusual as people.
100% agreed with this. The chat log paints a wildly different picture than what was included in Ben's original post.
Given my experience with talking with people about strongly emotional events, I am inclined towards the interpretation where Alice remembers the 15th with acute distress and remembers it as 'not getting her needs met despite trying quite hard to do so', and the Nonlinear team remembers that they went out of their way that week to get Alice food, which is, based on the logs from the 16th, clearly true! But I don't think I'd call Alice a liar based on reading this.
Agreed. I did update toward "there's likely a nontrivial amount of distortion in Alice's retelling of other things", and toward "normal human error and miscommunication played a larger role in some of the Bad Stuff that happened than I previously expected". (Ben's post was still a giant negative update for me about Nonlinear, but Kat's comment is a smaller update in the opposite direction.)
Jim's point here is compatible with "US libel laws are a force for good epistemics", since a law can be aimed at lying+bullshitting and still disincentivize bad reasoning (to some degree) as a side-effect.
But I do think Jim's point strongly suggests that we should have a norm against suing someone merely for reasoning poorly or getting the wrong answer. That would be moving from "lawsuits are good for norm enforcement" to "frivolous lawsuits are good for norm enforcement", which is way less plausible.
Without making any comment about the accuracy or inaccuracy of this post, I would just point out that nobody in EA should be shocked that an organization (e.g. Nonlinear) that is being libeled (in its view) would threaten a libel suit to deter the false accusations (as they see them), to nudge the author (e.g. Ben Pace) towards making sure that their negative claims are factually correct and contextually fair.
Wikipedia claims: "The 1964 case New York Times Co. v. Sullivan, however, radically changed the nature of libel law in the United States by establishing that public officials could win a suit for libel only when they could prove the media outlet in question knew either that the information was wholly and patently false or that it was published 'with reckless disregard of whether it was false or not'."
Spartz isn't a "public official", so maybe the standard is laxer here?
If not, then it seems clear to me that Spartz wouldn't win in a fair trial, because whether or not Ben got tricked by Alice/Chloe and accidentally signal-boosted others' lies, it's very obvious that Ben is neither deliberately asserting falsehoods, nor publishing "with reckless disregard".
(Ben says he spent "100-200 hours" researching this post, which is way beyond the level of thoroughness we should require for criticizing an organization on LessWrong or the EA Forum!)
I think there should be a strong norm against threatening people with libel merely for saying a falsehood; the standard should at minimum be that you have good reason to think the person is deliberately lying or bullshitting.
(I think the standard should be way higher than that, too, given the chilling effect of litigiousness; but I won't argue that here.)
My own suggestion would be to use a variety of different phrasings here, including both "capabilities" and "intelligence", and also "cognitive ability", "general problem-solving ability", "ability to reason about the world", "planning and inference abilities", etc. Using different phrases encourages people to think about the substance behind the terminology -- e.g., they're more likely to notice their confusion if the stuff you're saying makes sense to them under one of the phrasings you're using, but doesn't make sense to them under another of the phrasings.
Phrases like "cognitive ability" are pretty important, I think, because they make it clearer why these different "capabilities" often go hand-in-hand. They also clarify that the central problems are related to minds / intelligence / cognition / etc., not (for example) the strength of a robotic arm, even though that too is a "capability".
Does "par-human reasoning" mean at the level of an individual human or at the level of all of humanity combined?
If it's the former, what human should we compare it against? 50th percentile? 99.999th percentile?
I partly answered that here, and I'll edit some of this into the post:
By 'matching smart human performance... across all the scientific work humans do in that field' I don't mean to require that there literally be nothing humans can do that the AI can't match. I do expect this kind of AI to quickly (or immediately) blow humans out of the water, but the threshold I have in mind is more like:
STEM-level AGI is AI that's at least as scientifically productive as a human scientist who makes a variety of novel, original contributions to a hard-science field that requires understanding the physical world well. E.g., it can go toe-to-toe with highly productive human scientists on applying its abstract theories to real-world phenomena, using scientific ideas to design new tech, designing physical experiments, operating equipment, and generating new ideas that turn out to be true and that importantly advance the frontiers of our knowledge.
The way I'm thinking about the threshold, AI doesn't have to be Nobel-prize-level, but it has to be "fully doing science". I'd also be happy with a definition like 'AI that can reason about the physical world in general', but I think that emphasizing hard-science tasks makes it clearer why I'm not thinking of GPT-4 as 'reasoning about the physical world in general' in the relevant sense.
I'm not sure what the right percentile to target here is -- maybe we should be looking at the top 5% of Americans with STEM PhDs, where Americans with STEM PhDs are maybe in the top 1% of STEM ability among Americans?
What is the "basic mental machinery" required to do par-human reasoning? What if a system has the basic mental machinery but not the more advanced mental machinery?
Do you want this to include the robotic capabilities to run experiments and use physical tools? If not, why not (that seems important to me, but maybe you disagree)?
I want it to include the ability to run experiments and use physical tools.
I don't know what the "basic mental machinery" required is -- I think GPT-4 is missing some of the basic cognitive machinery top human scientists use to advance the frontiers of knowledge (as opposed to GPT-4 doing all the same mental operations as a top scientist but slower, or something), but this is based on a gestalt impression from looking at how different their outputs are in many domains, not based on a detailed or precise model of how general intelligence works.
One way of thinking about the relevant threshold is: if you gave a million chimpanzees billions of years to try to build a superintelligence, I think they'd fail, unless maybe you let them reproduce and applied selection pressure to them to change their minds. (But the latter isn't something the chimps themselves realize is a good idea.)
In contrast, top human scientists pass the threshold 'give us enough time, and we'll be able to build a superintelligence'.
If an AI system, given enough time and empirical data and infrastructure, would eventually build a superintelligence, then I'm mostly happy to treat that as "STEM-level AGI". This isn't a necessary condition, and it's presumably not strictly sufficient (since in principle it should be possible to build a very narrow and dumb meta-learning system that also bootstraps in this way eventually), but it maybe does a better job of gesturing at where I'm drawing a line between "GPT-4" and "systems in a truly dangerous capability range".
(Though my reason for thinking systems in that capability range are dangerous isn't centered on "they can deliberately bootstrap to superintelligence eventually". It's far broader points like "if they can do that, they can probably do an enormous variety of other STEM tasks" and "falling exactly in the human capability range, and staying there, seems unlikely".)
Does a human count as a STEM-level NGI (natural general intelligence)?
I tend to think of us that way: top human scientists aren't a separate species from average humans, so it would be hard for them to be born with complicated "basic mental machinery" that isn't widespread among humans. (Though local mutations can subtract complex machinery from a subset of humans in one generation, even if they can't add complex machinery to a subset of humans in one generation.)
Regardless, given how I defined the term, at least some humans are STEM-level.
If so, doesn't that imply that we should already be able to perform pivotal acts? You said: "If it makes sense to try to build STEM-level AGI at all in that situation, then the obvious thing to do with your STEM-level AGI is to try to leverage its capabilities to prevent other AGIs from destroying the world (a "pivotal act")."
The weakest STEM-level AGIs couldn't do a pivotal act; the reason I think you can do a pivotal act within a few years of inventing STEM-level AGI is that I think you can quickly get to far more powerful systems than "the weakest possible STEM-level AGIs".
The kinds of pivotal act I'm thinking about often involve Drexler-style feats, so one way of answering "why can't humans already do pivotal acts?" might be to answer "why can't humans just build nanotechnology without AGI?". I'd say we can, and I think we should divert a lot of resources into trying to do so; but my guess is that we'll destroy ourselves with misaligned AGI before we have time to reach nanotechnology "the hard way", so I currently have at least somewhat more hope in leveraging powerful future AI to achieve nanotech.
(The OP doesn't really talk about this, because the focus is 'is p(doom) high?' rather than 'what are the most plausible paths to us saving ourselves?'.)
In an unpublished 2017 draft, a MIRI researcher and I put together some ass numbers regarding how hard (wet, par-biology) nanotech looked to us:
We believe that the bottlenecks on current progress toward par-biology nanotechnology are (a) figuring out how to put all of the puzzle pieces together correctly, (b) executing certain difficult computations required for determining how to build materials, and (c) engineering certain basic tools that will allow us to engineer better tools, where there are likely to be mutual dependencies between progress on these fronts. If the world’s top scientific and engineering talent were actively focusing on this application and were inspired to solve the key technical problems, we would expect it to be possible to push past these bottlenecks with no more than 10x the compute that Google spent on research projects in 2016.
Assuming no advances in AI algorithms over the state of the art in 2017, we would assign a 50% probability to fifty copies of John von Neumann, divided into five teams and supplied with a large number of lab technicians and other support staff, being able to achieve nanotechnology within 25 calendar years at a level that would be sufficient for a decisive advantage if the technology were available to a group in 2017.
(footnote: We stipulate “in 2017” because we would not necessarily expect par-biology nanotechnology to confer a decisive advantage in a world where nanotechnology had been gradually advanced to that level by human engineers over multiple decades; in that scenario, factors such as leaks, regulations, and competition from other developers would make it harder for one group to strongly pull ahead. We would expect it to be much easier for one group to strongly pull ahead if nanotechnology advances too quickly for leaks, regulations, and competition to be significant factors on the relevant timescale, as we believe is possible using AGI.)
Translating this into a more realistic scenario: we would assign a 40% probability to an organization with a $10 billion budget and the involvement of someone who can attract top researchers and leadership (e.g., Elon Musk) being able to reach this level of technological capability within 25 years, absent AI advances. Our probability would lower to 15% if there were only 10 calendar years available to the hypothetical Musk project instead of 25, and would rise to 85% if there were 50 calendar years and $20 billion available instead of 25 calendar years and $10 billion, holding these conditions stable and assuming no other large global disruptions.
As in §1.3, the predictions here are rough and intuitive, and were not generated by a formal model. It would be difficult for our probability to rise much higher than 85% given additional time or other resources. Our inside-view evaluation of the arguments assigns high probability to par-biology nanotechnology being achievable in fifty years under these idealized conditions, such that the remaining uncertainty in our informal aggregate models largely stems from model uncertainty and deference to experts who disagree with our view and consider par-biology nanotechnology much more difficult. We would be very surprised to learn that par-biology nanotechnology were much more difficult (say, requiring more than 500 VNG research years), and this would have a fairly large impact on our overall expectations about early AGI systems’ potential uses and impact.
(500 VNG research years = 500 von-Neumann-group research years, defined as 'how much progress ten copies of John von Neumann would make if they worked together on the problem, hard, for 500 serial years'.)
This is also why I think humanity should probably put lots of resources into whole-brain emulation: I don't think you need qualitatively superhuman cognition in order to get to nanotech, I think we're just short on time given how slowly whole-brain emulation has advanced thus far.
With STEM-level AGI I think we'll have more than enough cognition to do basically whatever we can align; but given how tenuous humanity's grasp on alignment is today, it would be prudent to at least take a stab at a "straight to whole-brain emulation" Manhattan Project. I don't think humanity as it exists today has the tech capabilities to hit the pause button on ML progress indefinitely, but I think we could readily do that with "run a thousand copies of your top researchers at 1000x speed" tech.
(Note that having dramatically improved hardware to run a lot of ems very fast is crucial here. This is another reason the straight-to-WBE path doesn't look hopeful at a glance, and seems more like a desperation move to me; but maybe there's a way to do it.)
Steering towards world states, taken literally, for a realistic agent is impossible, because an embedded agent cannot even contain a representation of a detailed world-state.
I'm not imagining AI steering toward a full specification of a physical universe; I'm imagining it steering toward a set of possible worlds. Sets of possible worlds can often be fully understood by reasoners, because you don't need to model every world in the set in perfect detail in order to understand the set; you just need to understand at least one high-level criterion (or set of criteria) that determines which worlds go in the set vs. not in the set.
E.g., consider the preference ordering "the universe is optimal if there's an odd number of promethium atoms within 100 light years of the Milky Way Galaxy's center of gravity, pessimal otherwise". Understanding this preference just requires understanding terms like "odd" and "promethium" and "light year"; it doesn't require modeling full universes or galaxies in perfect detail.
Similarly, "maximize the amount of diamond that exists in my future light cone" just requires you to understand what "diamond" is and what "the more X you have, the better" means. It doesn't require you to fully represent every universe in your head in advance.
(Note that selecting the maximizing action is computationally intractable; but you can have a maximizing goal even if you aren't perfectly succeeding in the goal.)
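To make the point about criteria vs. enumerated worlds concrete, here's a minimal illustrative sketch in Python. Everything in it -- WorldSummary, its fields, and the toy preference functions -- is hypothetical, invented only for this example; the idea is just that the preference is a few lines over a coarse world description, even though the set of worlds it picks out is astronomically large and never gets enumerated.

    from dataclasses import dataclass

    @dataclass
    class WorldSummary:
        """A coarse, partial description of a possible world (hypothetical fields)."""
        promethium_atoms_near_galactic_center: int  # count within 100 light years of the center of gravity
        diamond_kg_in_future_lightcone: float

    def promethium_preference(world: WorldSummary) -> bool:
        """Membership criterion for the 'odd number of promethium atoms' preference.
        One line of code partitions an astronomically large space of possible
        worlds into 'optimal' and 'pessimal'."""
        return world.promethium_atoms_near_galactic_center % 2 == 1

    def diamond_utility(world: WorldSummary) -> float:
        """'More diamond is better': a ranking over worlds that never requires
        representing any single world in full detail."""
        return world.diamond_kg_in_future_lightcone

    # Comparing two coarse world descriptions under the diamond preference:
    w1 = WorldSummary(promethium_atoms_near_galactic_center=7, diamond_kg_in_future_lightcone=1e9)
    w2 = WorldSummary(promethium_atoms_near_galactic_center=8, diamond_kg_in_future_lightcone=2e9)
    better = max([w1, w2], key=diamond_utility)  # w2, the description with more diamond

The reasoner only ever handles short descriptions and short criteria; the set of worlds itself is never represented extensionally.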
The definition I give in the post is "AI that has the basic mental machinery required to do par-human reasoning about all the hard sciences". In footnote 3, I suggest the alternative definition "AI that can match smart human performance in a specific hard science field, across all the scientific work humans do in that field".
For starters, you can have goal-directed behavior without steering the world toward particular states. Novelty seeking, for example, doesn't imply any particular world-state to achieve.
If you look from the outside like you're competently trying to steer the world into states that will result in you getting more novel experience, then this is "goal-directed" in the sense I mean, regardless of why you're doing that.
If you (e.g.) look from the outside like you're selecting the local action that's least like the actions you've selected before, regardless of how that affects you or your future novel experience, etc., then that's not "goal-directed" in the sense I mean.
The distinction isn't meant to be totally crisp (there are different degrees and dimensions of "goal-directedness"), but maybe these examples help clarify what I have in mind. "Maximize novel experience" is a pretty vague goal, but it's not so vague that I think it falls outside of what I had in mind -- e.g., I think the standard instrumental convergence concerns apply to "maximize novel experience".
"Steer the world toward there being an even number of planets in the Milky Way Galaxy" also encompasses a variety of possible world-states (more than half of the possible worlds where the Milky Way Galaxy exists are optimal), but I think the arguments in the OP apply just as well to this goal.
A sufficiently intelligent agent knows that its utility function is an approximation of the true preferences of its creator.
Nope! Humans were created by evolution, but our true utility function isn't "maximize inclusive reproductive fitness" (nor is it some slightly tweaked version of that goal).
The deployment problem is part of societal response to me, not separate.
[...] Eg race dynamics, regulation (including ability to cooperate with competitors), societal pressure on leaders, investment in watchdogs (human and machine), safety testing norms, whether things get open sourced, infohazards.
"The deployment problem is hard and weird" comes from a mix of claims about AI (AGI is extremely dangerous, you don't need a planet-sized computer to run it, software and hardware can and will improve and proliferate by default, etc.) and about society ("if you give a decent number of people the ability to wield dangerous AGI tech, at least one or them will choose to use it").
The social claims matter — two people who disagree about how readily Larry Page and/or Mark Zuckerberg would put the world at risk might as a result disagree about whether a Good AGI Project has median 8 months vs. 12 months to do a pivotal act.
When I say "AGI ruin rests on strong claims about the alignment problem and deployment problem, not about society", I mean that the claims you need to make about society in order to think the alignment and deployment problems are that hard and weird, are weak claims (e.g. "if fifty random large AI companies had the ability to use dangerous AGI, at least one would use it"), and that the other claims about society required for high p(doom) are weak too (e.g. "humanity isn't a super-agent that consistently scales up its rationality and effort in proportion to a problem's importance, difficulty, and weirdness").
Arguably the difficulty of the alignment problem itself also depends in part on claims about society. E.g., the difficulty of alignment depends on the difficulty of the task we're aligning, which depends on "what sort of task is needed to end the acute x-risk period?", which depends again on things like "will random humans destroy the world if you hand them world-destroying AGI?".
The thing I was trying to communicate (probably poorly) isn't "Alignment, Deployment, and Society partitions the space of topics", but rather:
High p(doom) rests on strong claims about AI/compute/etc. and quite weak claims about humanity/society.
The most relevant claims (~all the strong ones, and an important subset of the weak ones) are mostly claims about the difficulty, novelty, and weirdness of the alignment and deployment problems.
Note that if it were costless to make the title way longer, I'd change this post's title from "AGI ruin mostly rests on strong claims about alignment and deployment, not about society" to the clearer:
The AGI ruin argument mostly rests on claims that the alignment and deployment problems are difficult and/or weird and novel, not on strong claims about society
One reason I like "the danger is in the space of action sequences that achieve real-world goals" rather than "the danger is in the space of short programs that achieve real-world goals" is that it makes it clearer why adding humans to the process can still result in the world being destroyed.
If powerful action sequences are dangerous, and humans help execute an action sequence (that wasn't generated by human minds), then it's clear why that is dangerous too.
If the danger instead lies in powerful "short programs", then it's more tempting to say "just don't give the program actuators and we'll be fine". The temptation is to imagine that the program is like a lion, and if you just keep the lion physically caged then it won't harm you. If you're instead thinking about action sequences, then it's less likely to even occur to you that the whole problem might be solved by changing the AI from a plan-executor to a plan-recommender; not being tempted by that move is a step in the right direction in terms of actually grokking the nature of the problem.
I think the exact quantitative details make a big difference between 'AGI ruin seems nearly certain in the absence of positive miracles' and 'doom seems quite plausible, but we'll most likely make it through' (my probability of takeover is something like 35%).
I don't think the difference between 'the very first STEM-level AGI is smart enough to destroy the world if you relax some precautions' and 'we have 2.5 years to work with STEM-level AGI before any system is smart enough to destroy the world' changes my p(doom) much at all. (Though this is partly because I don't expect, in either of those worlds, that we'll be able to be confident about which world we're in.)
If we have 6 years to safely work with STEM-level AGI, that does intuitively start to feel like a significant net increase in p(hope) to me? Though this is complicated by the fact that such AGI probably couldn't do pivotal acts either, and having STEM-level AGI for a longer period of time before a pivotal act occurs means that the tech will be more widespread when it does reach dangerous capability levels. So in the endgame, you're likely to have a lot more competition, and correspondingly less time to spend on safety if you want to deploy before someone destroys the world.
I think you should probably note where people (who are still sold on AI risk) often disagree.
If I had a list of 5-10 resources that folks like Paul, Holden, Ajeya, Carl, etc. see as the main causes for optimism, I'd be happy to link those resources (either in a footnote or in the main body).
I'd definitely include something like "survey data on the same population as my 2021 AI risk survey, saying how much people agree/disagree with the ten factors", though I'd guess this isn't the optimal use of those people's time even if we want to use that time to survey something?
The tech path to AGI superintelligence is naturally slow enough and gradual enough, that world-destroyingly-critical alignment problems never appear faster than previous discoveries generalize to allow safe further experimentation.
When I split up probability mass a month ago between the market's 16 options, this one only got 1.5% of my probability mass (12th place out of the 16). This obviously isn't the same question we're discussing here, but it maybe gives some perspective on why I didn't single out this disagreement above the many other disagreements I could devote space to that strike me as way more relevant to hope? (For some combination of 'likelier to happen' and 'likelier to make a big difference for p(doom) if they do happen'.)
The rate of progress seems very fast and it seems plausible that AI systems will race through the full range of human reasoning ability over the course of a few years. But this is hardly 'likely to blow human intelligence out of the water immediately, or very soon after its invention'.
... Wait, why not? If AI exceeds the human capability range on STEM four years from now, I would call that 'very soon', especially given how terrible GPT-4 is at STEM right now.
The thesis here is not 'we definitely won't have twelve months to work with STEM-level AGI systems before they're powerful enough to be dangerous'; it's more like 'we won't have decades'. Somewhere between 'no time' and 'a few years' seems extremely likely to me, and I think that's almost definitely not enough time to figure out alignment for those systems.
(Admittedly, in the minority of worlds where STEM-level AGI systems are totally safe for the first two years they're operational, part of why it's hard to make fast progress on alignment is that we won't know they're perfectly safe. An important chunk of the danger comes from the fact that humans have no clue where the line is between the most powerful systems that are safe, and the least powerful systems that are dangerous.)
Like, it's not clear to me that even Paul thinks we'll have much time with STEM-level AGI systems (in the OP's sense) before we have vastly superhuman AI. Unless I'm misunderstanding, Paul's optimism seems to have more to do with 'vastly superhuman AI is currently ~30 years away' and 'capabilities will improve continuously over those 30 years, so we'll have lots of time to learn more, see pretty scary failure modes, adjust our civilizational response, etc. before AI is competitive with the best human scientists'.
But capabilities gains still accelerate on Paul's model, such that as time passes we get less and less time to work with impressive new capabilities before they're blown out of the water by further advances (though Paul thinks other processes will offset this to produce good outcomes anyway); and these capabilities gains still end up stratospherically high before they plateau, such that we aren't naturally going to get a lull to safely work with smarter-than-human systems for a while before they're smart enough that a sufficiently incautious developer can destroy the world with them.
Maybe I'm misunderstanding something about Paul's view, or maybe you're pointing at other non-Paul-ish views...?
I don't think your claim makes the argument circular / question-begging; it just means there's an extra step in explaining why and how a random action sequence destroys the world.
Maybe you mean that I'm putting the emphasis in the wrong place, and it would be more illuminating to highlight some specific feature of random smart short programs as the source of the 'instrumental convergence' danger? If so, what do you think that feature is?
From my current perspective I think the core problem really is that most random short plans that succeed in sufficiently-hard tasks kill us. If the causal process by which this happens includes building a powerful AI optimizer, or building an AI that builds an AI, or building an AI that builds an AI that builds an AI, etc., then that's interesting and potentially useful to know, but that doesn't seem like the key crux to me, and I'm not sure it helps further illuminate where the danger is ultimately coming from.
(That said, I don't expect the plan to necessarily literally kill all humans, just to take over the world, but this is due to galaxy-brained trade and common sense morality arguments which are mostly out of scope and shouldn't be a thing people depend on.)
Very happy to hear someone with an idea like this who explicitly flags that we shouldn't gamble on this being true!
It's true that if humans were reliably very ambitious, consequentialist, and power-seeking, then this would be stronger evidence that superintelligent AI tends to be ambitious and power-seeking. So the absence of that evidence has to be evidence against "superintelligent AI tends to be ambitious and power-seeking", even if it's not a big weight in the scales.
Also, per footnote 1: "I wrote this post to summarize my own top reasons for being worried, not to try to make a maximally compelling or digestible case for others."
The original reason I wrote this was that Dustin Moskovitz wanted something like this, as an alternative to posts like AGI Ruin:
[H]ave you tried making a layman's explanation of the case? Do you endorse the summary? I'm aware of much longer versions of the argument, but not shorter ones!
From my POV, a lot of the confusion is around the confidence level. Historically EY makes many arguments to express his confidence, and that makes people feel snowed, like they have to inspect each one. I think it'd be better if there was more clarity about which are strongest.
I think one argument is about the number of relatively independent issues, and that's still valid, but then you could link out to that list as a separate exercise without losing everyone.
This post is speaking for me and not necessarily for Eliezer, but I figure it may be useful anyway. (A MIRI researcher did review an earlier draft and left comments that I incorporated, at least.)
And indeed, one of the obvious ways it could be useful is if it ends up evolving into (or inspiring) a good introductory resource, though I don't know how likely that is, whether it would already work as a good intro-ish resource when paired with something else, etc.