After reading the first section and skimming the rest, my impression is that the document is a good overview, but does not present any detailed argument for why godlike AI would lead to human extinction. (Except for the "smarter species" analogy, which I would say doesn't qualify.) So if I put on my sceptic hat, I can imagine reading the whole document in detail and somewhat-justifiably going away with "yeah, well, that sounds like a nice story, but I am not updating based on this".
That seems fine to me, given that (as far as I am concerned) no detailed convincing arguments for AI X-risk exist. But at the moment, the summary of the document gave me the impression that maybe some such argument will appear. So I suggest updating the summary (or some other part of the doc) to make it explicit that no detailed argument for AI X-risk will be given.
Some suggestions for improving the doc (I noticed the link to the editable version too late, apologies):
What is AI? Who is building it? Why? And is it going to be a future we want?
Something weird with the last sentence here (substituting "AI" for "it" makes the sentence un-grammatical).
Machines of hateful competition need not have such hindrances.
"Hateful" seems likely to put off some readers here, and I also think it is not warranted -- indifference is both more likely and also sufficient for extinction. So "Machines of indifferent competition" might work better.
There is no one is coming to save us.
Typo, extra "is".
The only thing necessary for the triumph of evil is for good people to do nothing. If you do nothing, evil triumphs, and that’s it.
Perhaps rewrite this for less antagonistic language? I know it is a quote and all, but still. (This can be interpreted as "the people building AI are evil and trying to cause harm on purpose". That seems false. And including this in the writing is likely to give the reader the impression that you don't understand the situation with AI, and stop reading.)
Perhaps (1) make it apparent that the first thing is a quote and (2) change the second sentence to "If you do nothing, our story gets a bad ending, and that's it.". Or just rewrite the whole thing.
I agree that "we can't test it right now" is more appropriate. And I was looking for examples of things that "you can't test right now even if you try really hard".
Good point. Also, for the purpose of the analogy with AI X-risk, I think we should be willing to grant that the people arrive at the alternative hypothesis through theorising. (Similarly to how we came up with the notion of AI X-risk before having any powerful AIs.) So that does break my example somewhat. (Although in that particular scenario, I imagine that a sceptic of Newtonian gravity would come up with alternative explanations for the observation. Not that this seems very relevant.)
I agree with all of this. (And good point about the high confidence aspect.)
The only thing that I would frame slightly differently is that:
[X is unfalsifiable] indeed doesn't imply [X is false] in the logical sense. On reflection, I think a better phrasing of the original question would have been something like: 'When is "unfalsifiability of X is evidence against X" incorrect?'. And this amended version often makes sense as a heuristic --- as a defense against motivated reasoning, conspiracy theories, etc. (Unfortunately, many scientists seem to take this too far, and view "unfalsifiable" as a reason to stop paying attention, even though they would grant the general claim that [unfalsifiable] doesn't logically imply [false].)
I don't think it's foolish to look for analogous examples here, but I guess it'd make more sense to make the case directly.
That was my main plan. I was just hoping to accompany that direct case by a class of examples that build intuition and bring the point home to the audience.
Some partial examples I have so far:
Phenomenon: For virtually any goal specification, if you pursue it sufficiently hard, you are guaranteed to get human extinction.[1]
Situation where it seems false and unfalsifiable: The present world.
Problems with the example: (i) We don't know whether it is true. (ii) Not obvious enough that it is unfalsifiable.
Phenomenon: Physics and chemistry can give rise to complex life.
Situation where it seems false and unfalsifiable: If Earth didn't exist.
Problems with the example: (i) if Earth didn't exist, there wouldn't be anybody to ask the question, so the scenario is a bit too weird. (ii) The example would be much better if it was the case that if you wait long enough, any planet will produce life.
Phenomenon: Gravity -- all things with mass attract each other. (As opposed to "things just fall in this one particular direction".)
Situation where it seems false and unfalsifiable: If you lived in a bunker your whole life, with no knowledge of the outside world.[2]
Problems with the example: The example would be even better if we somehow had some formal model that: (a) describes how physics works, (b) where we would be confident that the model is correct, (c) and that by analysing that model, we will be able to determine whether the theory is correct or false, (d) but the model would be too complex to actually analyse. (Similarly to how chemistry-level simulations are too complex for studying evolution.)
Phenomenon: Eating too much sweet stuff is unhealthy.
Situation where it seems false and unfalsifiable: If you can't get lots of sugar yet, and only rely on fruit etc.
Problems with the example: The scenario is a bit too artificial. You would have to pretend that you can't just go and harvest sugar from sugar cane and have somebody eat lots of it.
Nitpick on the framing: I feel that thinking about "misaligned decision-makers" as an "irrational" reason for war could contribute to (mildly) misunderstanding or underestimating the issue.
To elaborate: The "rational vs irrational reasons" distinction talks about the reasons using the framing where states are viewed as monolithic agents who act in "rational" or "irrational" ways. I agree that for the purpose of classifying the risks, this is an ok way to go about things.
I wanted to offer an alternative framing of this, though: For any state, we can consider the abstraction where all people in that state act in harmony to pursue the interests of the state. And then there is the more accurate abstraction where the state is made of individual people with imperfectly aligned interests, who each act optimally to pursue those interests, given their situation. And then there is the model where the individual humans are misaligned and make mistakes. And then you can classify the reasons based on which abstraction you need to explain them.
[I am confused about your response. I fully endorse your paragraph on "the AI with superior ontology would be able to predict how humans would react to things". But then the follow-up, on when this would be scary, seems mostly irrelevant / wrong to me --- meaning that I am missing some implicit assumptions, misunderstanding how you view this, etc. I will try to react in a hopefully-helpful way, but I might be completely missing the mark here, in which case I apologise :).]
I think the problem is that there is a difference between:
(1) AI which can predict how things score in human ontology; and
(2) AI which has "select things that score high in human ontology" as part of its goal[1].
And then, in the worlds where natural abstraction hypothesis is false: Most AIs achieve (1) as a by-product of the instrumental sub-goal of having low prediction error / being selected by our training processes / being able to manipulate humans. But us successfully achieving (2) for a powerful AI would require the natural abstraction hypothesis[2].
And this leaves us with two options. First, maybe we just have no write access to the AI's utility function at all. (EG, my neighbour would be very happy if I gave him $10k, but he doesn't have any way of making me (intrinsically) desire doing that.) Second, we might have write access to the AI's utility function, but not in a way that will lead to predictable changes in goals or behaviour. (EG, if you give me full access to the weights of an LLM, it's not like I know how to use that to turn that LLM into an actually-helpful assistant.)
(And both of these seem scary to me, because of the argument that "not-fully-aligned goal + extremely powerful optimisation ==> extinction". Which I didn't argue for here.)
- ^
IE, not just instrumentally because it is pretending to be aligned while becoming more powerful, etc.
- ^
More precisely: Damn, we need a better terminology here. The way I understand things, "natural abstraction hypothesis" is the claim that most AIs will converge to an ontology that is similar to ours. The negation of that is that a non-trivial portion of AIs will use an ontology that is different from ours. What I subscribe to is that "almost no powerful AIs will use an ontology that is similar to ours". Let's call that "strong negation" of the natural abstraction hypothesis. So achieving (2) would be a counterexample to this strong negation.
Ironically, I believe the strong negation hypothesis because I expect that very powerful AIs will arrive at similar ways of modelling the world --- and those are all different from how we model the world.
Nitpicky edit request: your comment contains some typos that make it a bit hard to parse ("be other", "we it"). (So apologies if my reaction misunderstands your point.)
[Assuming that the opposite of the natural abstraction hypothesis is true --- ie, not just that "not all powerful AIs share ontology with us", but actually "most powerful AIs don't share ontology with us":]
I also expect that an AI with superior ontology would be able to answer your questions about its ontology, in a way that would make you feel like[1] you understand what is happening. But that isn't the same as being able to control the AI's actions, or being able to affect its goal specification in a predictable way (to you). You totally wouldn't be able to do that.
([Vague intuition, needs work] I suspect that if you had a method for predictably-to-you translating from your ontology to the AI's ontology, then this could be used to prove that you can easily find a powerful AI that shares an ontology with us. Because that AI could be basically thought of as using our ontology.)
- ^
Though note that unless you switched to some better ontology, you wouldn't actually understand what is going on, because your ontology is so bogus that it doesn't even make sense to talk about "you understanding [stuff]". This might not be true for all kinds of [stuff], though. EG, perhaps our understanding of set theory is fine while our understanding of agency, goals, physics, and whatever else, isn't.
As a quick reaction, let me just note that I agree that (all else being equal) this (ie, "the AI understanding us & having superior ontology") seems desirable. And also that my comment above did not present any argument about why we should be pessimistic about AI X-risk if we believe that the natural abstraction hypothesis is false. (I was just trying to explain why/how "the AI has a different ontology" is compatible with "the AI understands our ontology".)
As a longer reaction: I think my primary reason for pessimism, if the natural abstraction hypothesis is false, is that a bunch of existing proposals might work if the hypothesis were true, but don't work if the hypothesis is false. (EG, if the hypothesis is true, I can imagine that "do a lot of RLHF, and then ramp up the AI's intelligence" could just work. Similarly for "just train the AI to not be deceptive".)
If I had to gesture at an underlying principle, then perhaps it could be something like: Suppose we successfully code up an AI which is pretty good at optimising, or create a process which gives rise to such an AI. [Inference step missing here.] Then the goals and planning of this AI will be happening in some ontology which allows for low prediction error. But this will be completely alien to our ontology. [Inference step missing here.] And, therefore, things that score very highly with respect to these ("alien") goals will have roughly no value[1] according to our preferences.
(I am not quite clear on this, but I think that if this paragraph was false, then you could come up with a way of falsifying my earlier description of what it looks like when the natural abstraction hypothesis is false.)
- ^
IE, no positive value, but also no negative value. So no S-risk.
Simplifying somewhat: I think that my biggest delta with John is that I don't think the natural abstraction hypothesis holds. (EG, if I believed it holds, I would become more optimistic about single-agent alignment, to the point of viewing Moloch as higher priority.) At the same time, I believe that powerful AIs will be able to understand humans just fine. My vague attempt at reconciling these two is something like this:
Humans have some ontology, in which they think about the world. This corresponds to a world model. This world model has a certain amount of prediction errors.
The powerful AI wants to have much lower prediction error than that. When I say "natural abstraction hypothesis is false", I imagine something like: If you want to have a much lower prediction error than that, you have to use a different ontology / world-model than humans use. And in fact if you want sufficiently low error, then all ontologies that can achieve that are very different from our ontology --- either (reasonably) simple and different, or very complex (and, I guess, therefore also different).
So when the AI "understands humans perfectly well", that means something like: The AI can visualise the flawed (ie, high prediction error) model that we use to think about the world. And it does this accurately. But it also sees how the model is completely wrong, and how the things that we say we want only make sense in that model, which has very little to do with the actual world.
(An example would be how a four-year old might think about the world in terms of Good people and Evil people. The government sometimes does Bad things because there are many Evil people in it. And then the solution is to replace all the Evil people by Good people. And that might internally make sense, and maybe an adult can understand this way of thinking, while also being like "this has nothing to do with how the world actually works; if you want to be serious about anything, just throw this model out".)
An illustrative example, describing a scenario that is similar to our world, but where "Extinction-level Goodhart's law" would be false & falsifiable (hat tip Vincent Conitzer):
Suppose that we somehow only start working on AGI many years from now, after we have already discovered a way to colonize the universe at close to the speed of light. And some of the colonies are already unreachable, outside of our future lightcone. But suppose we still understand "humanity" as the collection of all humans, including those in the unreachable colonies. Then any AI that we build, no matter how smart, would be unable to harm these portions of humanity. And thus full-blown human extinction, from AI we build here on Earth, would be impossible. And you could "prove" this using a simple, yet quite rigorous, physics argument.[1]
(To be clear, I am not saying that "AI X-risk's unfalsifiability is justifiable ==> we should update in favour of AI X-risk compared to our priors". I am just saying that the justifiability means we should not update against it compared to our priors. Though I guess that in practice, it means that some people should undo some of their updates against AI X-risk... )
- ^
And sure, maybe some weird magic is actually possible, and the AI could actually beat the speed of light. But whatever, I am ignoring this, and an argument like this would count as falsification as far as I am concerned.
FWIW, I acknowledge that my presentation of the argument isn't ironclad, but I hope that it makes my position a bit clearer. If anybody has ideas for how to present it better, or has some nice illustrative examples, I would be extremely grateful.
tl;dr: "lack of rigorous arguments for P is evidence against P" is typically valid, but not in case of P = AI X-risk.
A high-level reaction to your point about unfalsifiability:
There seems to be a general sentiment that "AI X-risk arguments are unfalsifiable ==> the arguments are incorrect" and "AI X-risk arguments are unfalsifiable ==> AI X-risk is low".[1] I am very sympathetic to this sentiment --- but I also think that in the particular case of AI X-risk, it is not justified.[2] For quite non-obvious reasons.
Why do I believe this?
Take this simplified argument for AI X-risk:
- Some important future AIs will be goal-oriented, or will sometimes behave in a goal-oriented way[3]. (Read: If you think of them as trying to maximise some goal, you will make pretty good predictions.[4])
- The "AI-progress tech-tree" is such that discontinous jumps in impact are possible. In particular, we will one day go from "an AI that is trying to maximise some goal, but not doing a very good job of it" to "an AI that is able to treat humans and other existing AIs as 'environment', and is going to do a very good job at maximising some goal".
- For virtually any[5] goal specification, doing a sufficiently[6] good job at maximising that goal specification leads to an outcome where every human is dead.
FWIW, I think that having a strong opinion on (1) and (2), in either direction, is not justified.[7] But in this comment, I only want to focus on (3) --- so let's please pretend, for the sake of this discussion, that we find (1) and (2) at least plausible. What I claim is that even if we lived in a universe where (3) is true, we should still expect even the best arguments for (3) (that we might realistically identify) to be unfalsifiable --- at least given realistic constraints on falsification effort and assuming that we use rigorous standards for what counts as solid evidence, like people do in mathematics, physics, or CS.
What is my argument for "even best arguments for (3) will be unfalsifiable"?
Suppose you have an environment E that contains a Cartesian agent (a thing that takes actions in the environment and -- let's assume for simplicity -- has perfect information about the environment, but whose decision-making computation happens outside of the environment). And suppose that this agent acts in a way that maximises[8] some goal specification[9] over E. Now, E might or might not contain humans, or representations of humans. We can now ask the following question: Is it true that, unless we spend an extremely high amount of effort (eg, >5 civilisation-years), any (non-degenerate[10]) goal specification we come up with will result in human extinction[11] in E when maximised by the agent? I refer to this as "Extinction-level Goodhart's Law".
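A rough symbolic gloss of the above (mine, and not a proper formalisation): writing $\mathcal{G}$ for the set of non-degenerate goal specifications that our civilisation could realistically come up with within the effort budget, the law says

$$\forall G \in \mathcal{G}: \quad \big(\text{the agent maximises } G \text{ over } E\big) \;\Rightarrow\; \big(\text{the humans in } E \text{ end up extinct}\big).$$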
I claim that:
(A) Extinction-level Goodhart's Law plausibly holds in the real world. (At least the thought experiments I know of, eg here or here, suggest it does.)
(B) Even if Extinction-level Goodhart's Law was true in the real world, it would still be false in environments where we could verify it experimentally (today, or soon) or mathematically (by proofs, given realistic amounts of effort).
==> And (B) implies that if we want "solid arguments", rather than just thought experiments, we might be kinda screwed when it comes to Extinction-level Goodhart's Law.
And why do I believe (B)? The long story is that I try to gesture at this in my sequence on "Formalising Catastrophic Goodhart". The short story is that there are many strategies for finding "safe to optimise" goal specifications that work in simpler environments, but not in the real world (examples below). So to even start gaining evidence on whether the law holds in our world, we need to investigate environments where those simpler strategies don't work --- and it seems to me that those are always too complex for us to analyse mathematically or run an AI there which could "do a sufficiently good job at trying to maximise the goal specification".
Some examples of the above-mentioned strategies for finding safe-to-optimise goal specifications: (i) The environment contains no (representations of) humans, or those "humans" can't "die", so it doesn't matter. EG, most gridworlds. (ii) The environment doesn't have any resources or similar things that would give rise to convergent instrumental goals, so it doesn't matter. EG, most gridworlds. (iii) The environment allows for a simple formula that checks whether "humans" are "extinct", so just add a huge penalty if that formula holds. (EG, most gridworlds where you added "humans".) (iv) There is a limited set of actions that result in "killing" "humans", so just add a huge penalty to those. (v) There is a simple formula for expressing a criterion that limits the agent's impact. (EG, "don't go past these coordinates" in a gridworld.)
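To make strategy (iii) concrete, here is a minimal sketch in Python (the environment, the `humans_extinct` check, and all other names are hypothetical, made up purely for illustration); the point is that this patch is only available because the environment admits such a simple extinction formula:

```python
# Hypothetical gridworld sketch of strategy (iii): keep the original goal
# specification, but add a huge penalty whenever a simple "extinction"
# formula holds. All names are made up for illustration.

EXTINCTION_PENALTY = 1e9

def humans_extinct(state) -> bool:
    # In a gridworld, "extinction" can often be checked with a one-line formula,
    # e.g. "no cell of type 'human' remains on the board".
    return not any(cell == "human" for cell in state.cells)

def patched_reward(base_reward, state) -> float:
    # Original goal specification, plus the extinction penalty.
    reward = base_reward(state)
    if humans_extinct(state):
        reward -= EXTINCTION_PENALTY
    return reward
```

In the real world, we have no analogous simple formula to plug in, which is (part of) why these simple fixes stop working there.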
Altogether, this should explain why the "unfalsifiability" counter-argument does not hold as much weight, in the case of AI X-risk, as one might intuitively expect.
- ^
If I understand you correctly, you would endorse something like this? Quite possibly with some disclaimers, ofc. (Certainly I feel that many other people endorse something like this.)
- ^
I acknowledge that the general heuristic "argument for X is unfalsifiable ==> the argument is wrong" holds in most cases. And I am aware we should be sceptical whenever somebody goes "but my case is an exception!". Despite this, I still believe that AI X-risk genuinely is different from invisible dragons in your garage and conspiracy theories.
That said, I feel there should be a bunch of other examples where the heuristic doesn't apply. If you have some that are good, please share!
- ^
An example of this would be if GPT-4 acted like a chatbot most of the time, but tried to take over the world if you prompt it with "act as a paperclipper".
- ^
And this way of thinking about them is easier -- description length, etc -- than other options. EG, no "water bottles maximising being a water bottle".
- ^
By "virtual any" goal specification (leading to extinction when maximised), I mean that finding a goal specification for which extinction does not happen (when maximised) is extremely difficult. One example of operationalising "extremely difficult" would be "if our civilisation spent all its efforts on trying to find some goal specification, for 5 years from today, we would still fail". In particular, the claim (3) is meant to imply that if you do anything like "do RLHF for a year, then optimise the result extremely hard", then everybody dies.
- ^
For the purposes of this simplified AI X-risk argument, the AIs from (2), which are "very good at maximising a goal", are meant to qualify for the "sufficiently good job at maximising a goal" from (3). In practice, this is of course more complicated --- see e.g. my post on Weak vs Quantitative Extinction-level Goodhart's Law.
- ^
Or at least there are no publicly available writings, known to me, which could justify claims like "It's >=80% likely that (1) (or 2) holds (or doesn't hold)". Of course, (1) and (2) are too vague for this to even make sense, but imagine replacing (1) and (2) by more serious attempts at operationalising the ideas that they gesture at.
- ^
(or does a sufficiently good job of maximising)
- ^
Most reasonable ways of defining what "goal specification" means should work for the argument. As a simple example, we can think of having a reward function R : states --> R and maximising the sum of R(s) over any long time horizon.
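For concreteness, one (assumed, not canonical) way to write that down:

$$R : S \to \mathbb{R}, \qquad J(\pi) = \sum_{t=0}^{T} R(s_t) \quad \text{for some long horizon } T,$$

with the agent choosing its policy $\pi$ so as to maximise $J(\pi)$.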
- ^
To be clear, there are some trivial ways of avoiding Extinction-level Goodhart's Law. One is to consider a constant utility function, which means that the agent might as well take random actions. Another would be to use reward functions in the spirit of "shut down now, or get a huge penalty". And there might be other weird edge cases.
I acknowledge that this part should be better developed. But in the meantime, hopefully it is clear -- at least somewhat -- what I am trying to gesture at.
- ^
Most environments won't contain actual humans. So by "human extinction", I mean the "metaphorical humans being metaphorically dead". EG, if your environment was pacman, then the natural thing would be to view the pacman as representing a "human", and being eaten by the ghosts as representing "extinction". (Not that this would be a good model for studying X-risk.)
Assumption 2 is, barring rather exotic regimes far into the future, basically always correct, and for irreversible computation, this always happens, since there's a minimum cost to increase the features IRL, and it isn't 0.
Increasing utility IRL is not free.
I think this is a misunderstanding of what I meant. (And the misunderstanding probably only makes sense to try clarifying it if you read the paper and disagree with my interpretation of it, rather than if your reaction is only based on my summary. Not sure which of the two is the case.)
What I was trying to say is that the most natural interpretation of the paper's model does not allow for things like: In state 1, the world is exactly as it is now, except that you decided to sleep on the floor every day instead of in your bed (for no particular reason), and you are tired and miserable all day. State 2 is exactly the same as state 1, except you decided that it would be smarter to sleep in your bed. And now, state 2 is just strictly better than state 1 (at least in all respects that you would care to name).
Essentially, the paper's model requires, by assumption, that it is impossible to get any efficiency gains (like "don't sleep on the floor" or "use this more efficient design instead") or mutually-beneficial deals (like helping two sides negotiate and avoid a war).
Yes, I agree that you can interpret the model in ways that avoid this. EG, maybe by sleeping on the floor, your bed will last longer. And sure, any action at all requires computation. I am just saying that these are perhaps not the interpretations that people initially imagine when reading the paper. So unless you are using an interpretation like that, it is important to notice those strong assumptions.
I do agree that debate could be used in all of these ways. But at the same time, I think generality often leads to ambiguity and to papers not describing any such application in detail. And that in turn makes it difficult to critique debate-based approaches. (Both because it is unclear what one is critiquing and because it makes it too easy to accidentally dismiss the critiques using the motte-and-bailey fallacy.)
I was previously unaware of Section 4.2 of the Scalable AI Safety via Doubly-Efficient Debate paper and, hurray, it does give an answer to (2) in Section 4.2. (Thanks for mentioning, @niplav!) That still leaves (1) unanswered, or at least not answered clearly enough, imo. Also I am curious about the extent that other people, who find debate promising, consider this paper's answer to (2) as the answer to (2).
For what it's worth, none of the other results that I know about were helpful for me for understanding (1) and (2). (The things I know about are the original AI Safety via Debate paper, follow-up reports by OpenAI, the single- and two-step debate papers, the Anthropic 2023 post, the Khan et al. (2024) paper. Some more LW posts, including mine.) I can of course make some guesses regarding plausible answers to (1) and (2). But most of these papers are primarily concerned with exploring the properties of debates, rather than explaining where debate fits in the process of producing an AI (and what problem it aims to address).
The original people kind-of did, but new people started, and Geoffrey Irving continued/got-back-to working on it.
Further disclaimer: Feel free to answer even if you don't find debate promising, but note that I am primarily interested in hearing from people who do actively work on it, or find it promising --- or at least from people who have a very good model of specific such people.
Motivation behind the question: People often mention Debate as a promising alignment technique. For example, the AI Safety Fundamentals curriculum features it quite prominently. But I think there is a lack of consensus on "as far as the proposal is concerned, how is Debate actually meant to be used"? (For example, do we apply it during deployment, as a way of checking the safety of solutions proposed by other systems? Or do we use it during deployment, to generate solutions? Or do we use it to generate training data?) And as far as I know, of all the existing work, only the Nov 2023 paper addresses my questions, and it only answers (Q2). But I am not sure to what extent the answer given there is canonical. So I am interested in knowing the opinions of people who currently endorse Debate.
Illustrating what I mean by the questions: If I were to answer the questions 1-3 for RLHF, I could for example say that:
(1) RLHF is meant for turning a neural network trained for next-token prediction into, for example, an agent that acts as a chatbot and gives helpful, honest, and lawsuit-less answers.
(2) RLHF is used for generating training (or fine-tuning) data (or signal).
(3) Seems pretty good for this purpose, for roughly <=human-level AIs.
I believe that a promising safety strategy for the larger asteroids is to put them in a secure box prior to them landing on earth. That way, the asteroid is -- provably -- guaranteed to have no negative impact on earth.
Proof:
[ASCII drawing: the asteroid, safely contained inside a box, with smiling stick figures standing unharmed nearby.]
□
Agreed.
It seems relevant to the progression that a lot of human problem solving -- though not all -- is done by the informal method of "getting exposed to examples and then, somehow, generalising". (And I likewise failed to appreciate this, not sure until when.) This suggests that if we want to build AI that solves things in ways similar to how humans solve them, "magic"-involving "deepware" is a natural step. (Whether building AI in the image of humans is desirable, that's a different topic.)
tl;dr: It seems noteworthy that "deepware" has strong connotations with "it involves magic", while the same is not true for AI in general.
I would like to point out one thing regarding the software vs AI distinction that is confusing me a bit. (I view this as complementing, rather than contradicting, your post.)
As we go along the progression "Tools > Machines > Electric > Electronic > Digital", most[1] of the examples can be viewed as automating a reasonably-well-understood process, on a progressively higher level of abstraction.[2]
[For example: A hammer does basically no automation. > A machine like a lawn-mower automates a rigidly-designed rotation of the blades. > An electric kettle does-its-thingy. > An electronic calculator automates calculating algorithms that we understand, but can do it for much larger inputs than we could handle. > An algorithm like Monte Carlo tree search automates an abstract reasoning process that we understand, but can apply it to a wide range of domains.]
But then it seems that this progression does not neatly continue to the AI paradigm. Or rather, some things that we call AI can be viewed as a continuation of this progression, while others can't (or would constitute a discontinuous jump).
[For example, approaches like "solving problems using HCH" (minus the part where you use unknown magic to obtain a black box that imitates the human) can be viewed as automating a reasonably-well-understood process (of solving tasks by decomposing & delegating them). But there are also other things that we call AI that are not well described as a continuation of this progression --- or perhaps they constitute a rather extreme jump. For example, deep learning automates the not-well-understood process of "stare at many things, then use magic to generalise". Another example is abstract optimisation, which automates the not-well-understood process of "search through many potential solutions and pick the one that scores the best according to an objective function". And there are examples that lie somewhere in between --- for example, AlphaZero is mostly a quite well-understood process, but it does involve some opaque deep learning.]
I suppose we could refer to the distinction as "does it involve magic?". It then seems noteworthy that "deepware" has strong connotations with magic, while the same isn't true for all types of AI.[3]
- ^
Or perhaps just "many"? I am not quite sure, this would require going through more examples, and I was intending for this to be a quick comment.
- ^
To be clear, I am not super-confident that this progression is a legitimate phenomenon. But for the sake of argument, let's say it is.
- ^
An interesting open question is how large a hit to competitiveness we would suffer if we restricted ourselves to systems that only involve a small amount of magic.
I want to flag that the overall tone of the post is in tension with the disclaimer that you are "not putting forward a positive argument for alignment being easy".
To hint at what I mean, consider this claim:
Undo the update from the “counting argument”, however, and the probability of scheming plummets substantially.
I think this claim is only valid if you are in a situation such as "your probability of scheming was >95%, and this was based basically only on this particular version of the 'counting argument' ". That is, if you somehow thought that we had a very detailed argument for scheming (AI X-risk, etc), and this was it --- then yes, you should strongly update.
But in contrast, my take is more like: This whole AI stuff is a huge mess, and the best we have is intuitions. And sometimes people try to formalise these intuitions, and those attempts generally all suck. (Which doesn't mean our intuitions cannot be more or less detailed. It's just that even the detailed ones are not anywhere close to being rigorous.) EG, for me personally, the vague intuition that "scheming is instrumental for a large class of goals" makes a huge contribution to my beliefs (of "something between 10% and 99% on alignment being hard"), while the particular version of the 'counting argument' that you describe makes basically no contribution. (And vague intuitions about simplicity priors contributing non-trivially.) So undoing that particular update does ~nothing.
I do acknowledge that this view suggests that the AI-risk debate should basically be debating the question: "So, we don't have any rigorous arguments about AI risk being real or not, and we won't have them for quite a while yet. Should we be super-careful about it, just in case?". But I do think that is appropriate.
I feel a bit confused about your comment: I agree with each individual claim, but I feel like perhaps you meant to imply something beyond just the individual claims. (Which I either don't understand or perhaps disagree with.)
Are you saying something like: "Yeah, I think that while this plan would work in theory, I expect it to be hopeless in practice (or unnecessary because the homework wasn't hard in the first place)."?
If yes, then I agree --- but I feel that of the two questions, "would the plan work in theory" is the much less interesting one. (For example, suppose that OpenAI could in theory use AI to solve alignment in 2 years. Then this won't really matter unless they can refrain from using that same AI to build misaligned superintelligence in 1.5 years. Or suppose the world could solve AI alignment if the US government instituted a 2-year moratorium on AI research --- then this won't really matter unless the US government actually does that.)
However, note that if you think we would fail to sufficiently check human AI safety work given substantial time, we would also fail to solve various issues given a substantial pause
This does not seem automatic to me (at least in the hypothetical scenario where "pause" takes a couple of decades). The reasoning being that there is a difference between [automating the current form of an institution and speed-running 50 years of it in a month] and [an institution, as it develops over 50 years].
For example, my crux[1] is that current institutions do not subscribe to the security mindset with respect to AI. But perhaps hypothetical institutions in 50 years might.
- ^
For being in favour of slowing things down; if that were possible in a reasonable way, which it might not be.
Assuming that there is an "alignment homework" to be done, I am tempted to answer something like: AI can do our homework for us, but only if we are already in a position where we could solve that homework even without AI.
An important disclaimer is that perhaps there is no "alignment homework" that needs to get done ("alignment by default", "AGI being impossible", etc). So some people might be optimistic about Superalignment, but for reasons that seem orthogonal to this question - namely, because they think that the homework to be done isn't particularly difficult in the first place.
For example, suppose OpenAI can use AI to automate many research tasks that they already know how to do. Or they can use it to scale up the amount of research they produce. Etc. But this is likely to only give them the kinds of results that they could come up with themselves (except possibly much faster, which I acknowledge matters).
However, suppose that the solution to making AI go well lies outside of the ML paradigm. Then OpenAI's "superalignment" approach would need to naturally generate solutions outside of that paradigm. Or it would need to cause the org to pivot to a new paradigm. Or it would need to convince OpenAI that way more research is needed, and they need to stop AI progress until that happens.
And my point here is not to argue that this won't happen. Rather, I am suggesting that whether this would happen seems strongly connected to whether OpenAI would be able to do these things even prior to all the automation. (IE, this depends on things like: Will people think to look into a particular problem? Will people be able to evaluate the quality of alignment proposals? Is the organisational structure set up such that warning signs will be taken seriously?)
To put it in a different way:
- We can use AI to automate an existing process, or a process that we can describe in enough detail.
(EG, suppose we want to "automate science". Then an example of a thing that we might be able to do would be to: Set up a system where many LLMs are tasked to write papers. Other LLMs then score those papers using the same system as human researchers use for conference reviews. And perhaps the most successful papers then get added to the training corpus of future LLMs. And then we repeat the whole thing. However, we do not know how to "magically make science better".)
- We can also have AI generate solution proposals, but this will only be helpful to the extent that we know how to evaluate the quality of those proposals.[1]
(EG, we can use AI to factorise numbers into their prime factors, since we know how to check whether the product of the proposed factors is equal to the original number. However, suppose we use an AI to generate a plan for how to improve the urban design of a particular city. Then it's not really clear how to evaluate that plan. And the same issue arises when we ask for plans regarding the problem of "making AI go well".)
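(To spell out the verification half of the factorisation example, here is a minimal sketch; the function names are mine, not anything canonical:)

```python
from math import prod

def is_prime(k: int) -> bool:
    # Simple trial division; fine for a sketch.
    if k < 2:
        return False
    return all(k % d != 0 for d in range(2, int(k ** 0.5) + 1))

def is_valid_factorisation(n: int, factors: list[int]) -> bool:
    # Verification is cheap: multiply the proposed factors back together and
    # check that each of them is prime. Finding the factors is the hard part.
    return prod(factors) == n and all(is_prime(f) for f in factors)

assert is_valid_factorisation(91, [7, 13])
```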
Finally, suppose you think that the problem with "making AI go well" is the relative speeds of progress in AI capabilities vs AI alignment. Then you need to additionally explain why the AI will do our alignment homework for us while simultaneously refraining from helping with the capabilities homework.[2]
- ^
A relevant intuition pump: The usefulness of forecasting questions on prediction markets seems limited by your ability to specify the resolution criteria.
- ^
The reasonable default assumption might be that AI will speed up capabilities and alignment equally. In contrast, arguing for a disproportionate speedup of alignment sounds like corporate b...cheap talk. However, there might be reasons to believe that AI will disproportionately speed up capabilities - for example, because we know how to evaluate capabilities research, while the field of "make AI go well" is much less mature.
Quick reaction:
- I didn't want to use the ">1 billion people" formulation, because that is compatible with scenarios where a catastrophe or an accident happens, but we still end up controlling the future in the end.
- I didn't want to use "existential risk", because that includes scenarios where humanity survives but has net-negative effects (say, bad versions of Age of Em or humanity spreading factory farming across the stars).
- And for the purpose of this sequence, I wanted to look at the narrower class of scenarios where a single misaligned AI/optimiser/whatever takes over and does its thing. Which probably includes getting rid of literally everyone, modulo some important (but probably not decision-relevant?) questions about anthropics and negotiating with aliens.
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
What would you suggest instead? Something like [50% chance the AI kills > 99% of people]?
(My current take is that for a majority reader, sticking to "literal extinction" is the better tradeoff between avoiding confusion/verbosity and accuracy. But perhaps it deserves at least a footnote or some other qualification.)
I think literal extinction from AI is a somewhat odd outcome to study as it heavily depends on difficult to reason about properties of the world (e.g. the probability that Aliens would trade substantial sums of resources for emulated human minds and the way acausal trade works in practice).
That seems fair. For what it's worth, I think the ideas described in the sequence are not sensitive to what you choose here. The point isn't as much to figure out whether the particular arguments go through or not, but to ask which properties must your model have, if you want to be able to evaluate those arguments rigorously.
A key claim here is that if you actually are able to explain a high fraction of loss in a human understandable way, you must have done something actually pretty impressive at least on non-algorithmic tasks. So, even if you haven't solved everything, you must have made a bunch of progress.
Right, I agree. I didn't realise the bolded statement was a poor/misleading summary of the non-bolded text below. I guess it would be more accurate to say something like "[% of loss explained] is a good metric for tracking intellectual progress in interpretability. However, it is somewhat misleading in that 100% loss explained does not mean you understand what is going on inside the system."
I rephrased that now. Would be curious to hear whether you still have objections to the updated phrasing.
[% of loss explained] isn't a good interpretability metric [edit: isn't enough to get guarantees].
In interpretability, people use [% of loss explained] as a measure of the quality of an explanation. However, unless you replace the system-being-explained by its explanation, this measure has a fatal flaw.
Suppose you have misaligned superintelligence X pretending to be a helpful assistant A --- that is, acting as A in all situations except those where it could take over the world. Then the explanation "X is behaving as A" will explain 100% of loss, but actually using X will still kill you.
For [% of loss explained] to be a useful metric [edit: robust for detecting misalignment], it would need to explain most of the loss on inputs that actually matter. And since we fundamentally can't tell which ones those are, the metric will only be useful (for detecting misaligned superintelligences) if we can explain 100% of loss on all possible inputs.
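For concreteness, here is one common way the metric gets operationalised (at least as I understand the convention; variants exist): the fraction of the gap, between an uninformative baseline and the original model, that the explanation recovers.

```python
def fraction_of_loss_explained(loss_model: float,
                               loss_with_explanation: float,
                               loss_baseline: float) -> float:
    # loss_model:            loss of the original system
    # loss_with_explanation: loss when the explained component is replaced by the explanation
    # loss_baseline:         loss of an uninformative reference (e.g. an ablated/mean model)
    # Returns 1.0 if the explanation recovers the whole gap between baseline and model,
    # and 0.0 if it recovers none of it. (Assumes loss_baseline > loss_model.)
    return (loss_baseline - loss_with_explanation) / (loss_baseline - loss_model)
```

Note that this number is an average over some input distribution, which is exactly where the problem above bites: the rare inputs that actually matter can contribute almost nothing to it.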
I think the relative difficulty of hacking AI(x-1) and AI(x-2) will be sensitive to how much emphasis you put on the "distribute AI(x-1) quickly" part. IE, if you rush it, you might make it worse, even if AI(x-1) has the potential to be more secure. (Also, there is the "single point of failure" effect, though it seems unclear how large.)
To clarify: The question about improving Steps 1-2 was meant specifically for [improving things that resemble Steps 1-2], rather than [improving alignment stuff in general]. And the things you mention seem only tangentially related to that, to me.
But that complaint aside: sure, all else being equal, all of the points you mention seem better having than not having.
Might be obvious, but perhaps seems worth noting anyway: Ensuring that our boundaries are respected is, at least with a straightforward understanding of "boundaries", not sufficient for being safe.
For example:
- If I take away all food from your local supermarkets (etc etc), you will die of starvation --- but I haven't done anything with your boundaries.
- On a higher level, you can wipe out humanity without messing with our boundaries, by blocking out the sun.
An aspect that I would not take into account is the expected impact of your children.
Most importantly, it just seems wrong to make personal-happiness decisions subservient to impact.
But even if you did want to optimise impact through others, then betting on your children seems riskier and less effective than, for example, engaging with interested students. (And even if you wanted to optimise impact at all costs, then the key factors might not be your impact through others. But instead (i) your opportunity costs, (ii) second order effects, where having kids makes you more or less happy, and this changes the impact of your work, and (iii) negative second order effects that "sacrificing personal happiness because of impact" has on the perception of the community.)
In fact it's hard to find probable worlds where having kids is a really bad idea, IMO.
One scenario where you might want to have kids in general, but not if timelines are short, is if you feel positive about having kids, but you view the first few years of having kids as a chore (ie, it costs you time, sleep, and money). So if you view kids as an investment of the form "take a hit to your happiness now, get more happiness back later", then not having kids now seems justifiable. But I think that this sort of reasoning requires pretty short timelines (which I have), with high confidence (which I don't have), and high confidence that the first few years of having kids is net-negative happiness for you (which I don't have).
(But overall I endorse the claim that, mostly, if you would have otherwise wanted kids, you should still have them.)
(For context: My initial reaction to the post was that this is misrepresenting the MIRI-position-as-I-understood-it. And I am one of the people who strongly endorse the view that "it was never about getting the AI to predict human preferences". So when I later saw Yudkowsky's comment and your reaction, it seemed perhaps useful to share my view.)
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
My reaction to this is that: Actually, current LLMs do care about our preferences, and about their guardrails. It was never about getting some AI to care about our preferences. It is about getting powerful AIs to robustly care about our preferences. Where by "robustly" includes things like (i) not caring about other things as well (e.g., prediction accuracy), (ii) generalising correctly (e.g., not just maximising human approval), and (iii) not breaking down when we increase the amount of optimisation pressure a lot (e.g., will it still work once we hook it into future-AutoGPT-that-actually-works and have it run for a long time?).
Some examples of what would cause me to update are: If we could make LLMs not jailbreakable without relying on additional filters on input or output.
Nitpicky comment / edit request: The circle inversion figure was quite confusing to me. Perhaps add a note to it saying that solid green maps onto solid blue, red maps onto itself, and dotted green maps onto dotted blue. (Rather than colours mapping to each other, which is what I intuitively expected.)
Fun example: The evolution of offensive words seems relevant here. IE, we frown upon using currently-offensive words, so we end up expressing ourselves using some other words. And over time, we realise that those other words are (primarily used as) Doppelgangers, and mark them as offensive as well.
Related questions:
- What is the expected sign of the value of marking posts like this? (One might wonder whether explicitly putting up "DON'T LOOK HERE!" won't backfire.)
[I expect some AI companies might respect these signs, so this seems genuinely unclear.]
- Is there a way of putting things on the internet in a way that more robustly prevents AIs from seeing them?
[I am guessing not, but who knows...]
E.g. Living in large groups such that it’s hard for a predator to focus on any particular individual; a zebra’s stripes.
Off-topic, but: Does anybody have a reference for this, or a better example? This is the first time I have heard this theory about zebras.
Two points that seem relevant here:
- To what extent are "things like LLMs" and "things like AutoGPT" very different creatures, with the latter sometimes behaving like a unitary agent?
- Assuming that the distinction in (1) matters, how often do we expect to see AutoGPT-like things?
(At the moment, both of these questions seem open.)
This made me think of "lawyer-speak", and other jargons.
More generally, this seems to be a function of learning speed and the number of interactions on the one hand, and the frequency with which you interact with other groups on the other. (In this case, the question would be how often do you need to be understandable to humans, or to systems that need to be understandable to humans, etc.)
I would distinguish between "feeling alien" (as in, most of the time, the system doesn't feel too weird or non-human to interact with, at least if you don't look too closely) and "being alien" (as in, "having the potential to sometimes behave in a way that a human never would").
My argument is that the current LLMs might not feel alien (at least to some people), but they definitely are. For example, any human that is smart enough to write a good essay will also be able to count the number of words in a sentence --- yet LLMs can do one, but not the other. Similarly, humans have moods and emotions and other stuff going on in their heads, such that when they say "I am sorry" or "I promise to do X", it is a somewhat costly signal of their future behaviour --- yet this doesn't have to be true at all for AI.
(Also, you are right that people believe that ChatGPT isn't conscious. But this seems quite unrelated to the overall point? As in, I expect some people would also believe ChatGPT if it started saying that it is conscious. And if ChatGPT was conscious and claimed that it isn't, many people would still believe that it isn't.)
I agree that we shouldn't be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)
To elaborate: My view of the ML paradigm is that the machinery under the hood is very alien, and susceptible to things like jailbreaks, adversarial examples, and non-robustness out of distribution. Most of the time, this makes no difference to the user's experience. However, the exceptions might be disproportionally important. And for that reason, it seems important to advertise the possibility of those cases.
For example, it might be possible to steal other people's private information by jailbreaking their LLM-based AI assistants --- and this is why it is good that more people are aware of jailbreaks. Similarly, it seems easy to create virtual agents that maintain a specific persona to build trust, and then abuse that trust in a way that would be extremely unlikely for a human.[1] But perhaps that, and some other failure modes, are not yet sufficiently widely appreciated?
Overall, it seems good to take some action towards making people/society/the internet less vulnerable to these kinds of exploits. (The examples I gave in the post were some ideas towards this goal. But I am less married to those than to the general point.) One fair objection against the particular action of advertising the vulnerabilities is that doing so brings them to the attention of malicious actors. I do worry about this somewhat, but primarily I expect people (and in particular nation-states) to notice these vulnerabilities anyway. Perhaps more importantly, I expect potential misaligned AIs to notice the vulnerabilities anyway --- so patching them up seems useful for (marginally) decreasing the world's take-over-ability.
- ^
For example, because a human wouldn't be patient enough to maintain the deception for the given payoff. Or because a human that would be smart enough to pull this off would have better ways to spend their time. Or because only a psychopathic human would do this, and there are only so many of those.
I would like to point out one aspect of the "Vulnerable ML systems" scenario that the post doesn't discuss much: the effect of adversarial vulnerability on widespread-automation worlds.
Using existing words, some ways of pointing towards what I mean are: (1) Adversarial robustness solved after TAI (your case 2), (2) vulnerable ML systems + comprehensive AI systems, (3) vulnerable ML systems + slow takeoff, (4) fast takeoff happening in the middle of (3).
But ultimately, I think none of these fits perfectly. So a longer, self-contained description is something like:
- Consider the world where we automate more and more things using AI systems that have vulnerable components. Perhaps those vulnerabilities primarily come from narrow-purpose neural networks and foundation models. But some might also come from insecure software design, software bugs, and humans in the loop.
- And suppose some parts of the economy/society will be designed more securely (some banks, intelligence services, planes, hopefully nukes)...while others just have glaring security holes.
- A naive expectation would be that a security hole gets fixed if and only if there is somebody who would be able to exploit it. This is overly optimistic, but note that even this implies the existence of many vulnerabilities that would require stronger-than-existing level of capability to exploit. More realistically, the actual bar for fixing security holes will be "there might be many people who can exploit this, but it is not worth their opportunity cost". And then we will also not-fix all the holes that we are unaware of, or where the exploitation goes undetected.
These potential vulnerabilities leave a lot of space for actual exploitation when the stakes get higher, or we get a sudden jump in some area of capabilities, or when many coordinated exploits become more profitable than what a naive extrapolation would suggest.
There are several potential threats that have particularly interesting interactions with this setting:
- (A) Alignment scheme failure: An alignment scheme that would otherwise work fails due to vulnerabilities in the AI company training it. This seems the closest to what this post describes?
- (B) Easier AI takeover: Somebody builds a misaligned AI that would normally be sub-catastrophic, but all of these vulnerabilities allow it to take over.
- (C) Capitalism gone wrong: The vulnerabilities regularly get exploited, in ways that either go undetected or cause negative externalities that nobody relevant has incentives to fix. And this destroys a large portion of the total value.
- (D) Malicious actors: Bad actors use the vulnerabilities to cause damage. (And this makes B and C worse.)
- (E) Great-power war: The vulnerabilities get exploited during a great-power war. (And this makes B and C worse.)
Connection to Cases 1-3: All of this seems very related to how you distinguish between adversarial robustness gets solved before transformative AI/after TAI/never. However, I would argue that TAI is not necessarily the relevant cutoff point here. Indeed, for Alignment failure (A) and Easier takeover (B), the relevant moment is "the first time we get an AI capable of forming a singleton". This might happen tomorrow, by the time we have automated 25% of economically-relevant tasks, half a year into having automated 100% of tasks, or possibly never. And for the remaining threat models (C,D,E), perhaps there are no single cutoff points, and instead the stakes and implications change gradually?
Implications: Personally, I am the most concerned about misaligned AI (A and B) and Capitalism gone wrong (C). However, perhaps risks from malicious actors and nation-state adversaries (D, E) are more salient and less controversial, while pointing towards the same issues? So perhaps advancing the agenda outlined in the post can be best done through focusing on these? [I would be curious to know your thoughts.]
An idea for increasing the impact of this research: Mitigating the "goalpost moving" effect for "but surely a bit more progress on capabilities will solve this".
I suspect that many people who are sceptical of this issue will, by default, never sit down and properly think about this. If they did, they might make some falsifiable predictions and change their minds --- but many of them might never do that. Or perhaps many people will, but it will all happen very gradually, and we will never get a good enough "coordination point" that would allow us to take needle-shifting actions.
I also suspect there are ways of making this go better. I am not quite sure what they are, but here are some ideas: Making and publishing surveys. Operationalizing all of this better, in particular with respect to the "how much does this actually matter?" aspect. Formulating some memorable "hypothesis" that makes it easier to refer to this in conversations and papers (cf "orthogonality thesis"). Perhaps making some proponents of "the opposing view" make some testable predictions, ideally some that can be tested with systems whose failures won't be catastrophic yet?
Ok, got it. Though, not sure if I have a good answer. With trans issues, I don't know how to decouple the "concepts and terminology" part of the problem from the "political" issues. So perhaps the solution with AI terminology is to establish the precise terminology? And perhaps to establish it before this becomes an issue where some actors benefit from ambiguity (and will therefore resist disambiguation)? [I don't know, low confidence on all of this.]
Do you have a more realistic (and perhaps more specific, and ideally apolitical) example than "cooperation is a fuzzy concept, so you have no way to deny that I am cooperating"? (All instances of this that I managed to imagine were either actually complicated, about something else, or something that I could resolve by replying "I don't care about your language games" and treating you as non-cooperative.)
For the purpose of this section, we will consider adversarial robustness to be solved if systems cannot be practically exploited to cause catastrophic outcomes.
Regarding the predictions, I want to make the following quibble: According to the definition above, one way of "solving" adversarial robustness is to make sure that nobody tries to catastrophically exploit the system in the first place. (In particular, exploitable AI that takes over the world is no longer exploitable.)
So, with this definition, a lot rests on how you distinguish between "cannot be exploited" and "will not be exploited".
And on reflection, I think that for some people, this is close to being a crux regarding the importance of all this research.