I didn't cross-post it, but I've poked EY about the title!
I feel pretty frustrated at how rarely people actually bet or make quantitative predictions about existential risk from AI. EG my recent attempt to operationalize a bet with Nate went nowhere. Paul trying to get Eliezer to bet during the MIRI dialogues also went nowhere, or barely anywhere—I think they ended up making some random bet about how long an IMO challenge would take to be solved by AI. (feels pretty weak and unrelated to me. lame. but huge props to Paul for being so ready to bet, that made me take him a lot more seriously.)
This paragraph doesn't seem like an honest summary to me. Eliezer's position in the dialogue, as I understood it, was:
- The journey is a lot harder to predict than the destination. Cf. "it's easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be". Eliezer isn't claiming to have secret insights about the detailed year-to-year or month-to-month changes in the field; if he thought that, he'd have been making those near-term tech predictions already back in 2010, 2015, or 2020 to show that he has this skill.
- From Eliezer's perspective, Paul is claiming to know a lot about the future trajectory of AI, and not just about the endpoints: Paul thinks progress will be relatively smooth and continuous, and thinks it will get increasingly smooth and continuous as time passes and more resources flow into the field. Eliezer, by contrast, expects the field to get choppier as time passes and we get closer to ASI.
- A way to bet on this, which Eliezer repeatedly proposed but wasn't able to get Paul to do very much, would be for Paul to list out a bunch of concrete predictions that Paul sees as "yep, this is what smooth and continuous progress looks like". Then, even though Eliezer doesn't necessarily have a concrete "nope, the future will go like X instead of Y" prediction, he'd be willing to bet against a portfolio of Paul-predictions: when you expect the future to be more unpredictable, you're willing to at least weakly bet against any sufficiently ambitious pool of concrete predictions.
- (Also, if Paul generated a ton of predictions like that, an occasional prediction might indeed make Eliezer go "oh wait, I do have a strong prediction on that question in particular; I didn't realize this was one of our points of disagreement". I don't think this is where most of the action is, but it's at least a nice side-effect of the person-who-thinks-this-tech-is-way-more-predictable spelling out predictions.)
Eliezer was also more interested in trying to reach mutual understanding of the views on offer, as opposed to "let's bet on things immediately, never mind the world-views". But insofar as Paul really wanted to have the bets conversation instead, Eliezer sank an awful lot of time into trying to find operationalizations he and Paul could bet on, over many hours of conversation.
If your end-point take-away from that (even after actual bets were in fact made, and tons of different high-level predictions were sketched out) is "wow how dare Eliezer be so unwilling to make bets on anything", then I feel a lot less hope that world-models like Eliezer's ("long-term outcome is more predictable than the detailed year-by-year tech pathway") are going to be given a remotely fair hearing.
(Also, in fairness to Paul, I'd say that he spent a bunch of time working with Eliezer to try to understand the basic methodologies and foundations for their perspectives on the world. I think both Eliezer and Paul did an admirable job going back and forth between the thing Paul wanted to focus on and the thing Eliezer wanted to focus on, letting us look at a bunch of different parts of the elephant. And I don't think it was unhelpful for Paul to try to identify operationalizations and bets, as part of the larger discussion; I just disagree with TurnTrout's summary of what happened.)
If I was misreading the blog post at the time, how come it seems like almost no one ever explicitly predicted at the time that these particular problems were trivial for systems below or at human-level intelligence?!?
Quoting the abstract of MIRI's "The Value Learning Problem" paper (emphasis added):
Autonomous AI systems’ programmed goals can easily fall short of programmers’ intentions. **Even a machine intelligent enough to understand its designers’ intentions would not necessarily act as intended.** We discuss early ideas on how one might design smarter-than-human AI systems that can inductively learn what to value from labeled training data, and highlight questions about the construction of systems that model and act upon their operators’ preferences.
And quoting from the first page of that paper:
The novelty here is not that programs can exhibit incorrect or counter-intuitive behavior, but that software agents smart enough to understand natural language may still base their decisions on misrepresentations of their programmers’ intent. The idea of superintelligent agents monomaniacally pursuing “dumb”-seeming goals may sound odd, but it follows from the observation of Bostrom and Yudkowsky [2014, chap. 7] that AI capabilities and goals are logically independent. Humans can fully comprehend that their “designer” (evolution) had a particular “goal” (reproduction) in mind for sex, without thereby feeling compelled to forsake contraception. Instilling one’s tastes or moral values into an heir isn’t impossible, but it also doesn’t happen automatically.
I won't weigh in on how many LessWrong posts at the time were confused about where the core of the problem lies. But "The Value Learning Problem" was one of the seven core papers in which MIRI laid out our first research agenda, so I don't think "we're centrally worried about things that are capable enough to understand what we want, but that don't have the right goals" was in any way hidden or treated as minor back in 2014-2015.
I also wouldn't say "MIRI predicted that NLP will largely fall years before AI can match e.g. the best human mathematicians, or the best scientists"; that development was a surprise to us, and if we saw a way to leverage that surprise to take a big bite out of the central problem, that would be a big positive update.
I'd say:
- MIRI mostly just didn't make predictions about the exact path ML would take to get to superintelligence, and we've said we didn't expect this to be very predictable because "the journey is harder to predict than the destination". (Cf. "it's easier to use physics arguments to predict that humans will one day send a probe to the Moon, than it is to predict when this will happen or what the specific capabilities of rockets five years from now will be".)
- Back in 2016-2017, I think various people at MIRI updated to median timelines in the 2030-2040 range (after having had longer timelines before that), and our timelines haven't jumped around a ton since then (though they've gotten a little bit longer or shorter here and there).
- So in some sense, qualitatively eyeballing the field, we don't feel surprised by "the total amount of progress the field is exhibiting", because it looked in 2017 like the field was just getting started, there was likely an enormous amount more you could do with 2017-style techniques (and variants on them) than had already been done, and there was likely to be a lot more money and talent flowing into the field in the coming years.
- But "the total amount of progress over the last 7 years doesn't seem that shocking" is very different from "we predicted what that progress would look like". AFAIK we mostly didn't have strong guesses about that, though I think it's totally fine to say that the GPT series is more surprising to the circa-2017 MIRI than a lot of other paths would have been.
- (Then again, we'd have expected something surprising to happen here, because it would be weird if our low-confidence visualizations of the mainline future just happened to line up with what happened. You can expect to be surprised a bunch without being able to guess where the surprises will come from; and in that situation, there's obviously less to be gained from putting out a bunch of predictions you don't particularly believe in.)
- Pre-deep-learning-revolution, we made early predictions like "just throwing more compute at the problem without gaining deep new insights into intelligence is less likely to be the key thing that gets us there", which was falsified. But that was a relatively high-level prediction; post-deep-learning-revolution we haven't claimed to know much about how advances are going to be sequenced.
- We have been quite interested in hearing from others about their advance prediction record: it's a lot easier to say "I personally have no idea what the qualitative capabilities of GPT-2, GPT-3, etc. will be" than to say "... and no one else knows either", and if someone has an amazing track record at guessing a lot of those qualitative capabilities, I'd be interested to hear about their further predictions. We're generally pessimistic that "which of these specific systems will first unlock a specific qualitative capability?" is particularly predictable, but this claim can be tested via people actually making those predictions.
But the benefit of a Pause is that you use the extra time to do something in particular. Why wouldn't you want to fiscally sponsor research on problems that you think need to be solved for the future of Earth-originating intelligent life to go well?
MIRI still sponsors some alignment research, and I expect we'll sponsor more alignment research directions in the future. I'd say MIRI leadership didn't have enough aggregate hope in Agent Foundations in particular to want to keep supporting it ourselves (though I consider its existence net-positive).
My model of MIRI is that our main focus these days is "find ways to make it likelier that a halt occurs" and "improve the world's general understanding of the situation in case this helps someone come up with a better idea", but that we're also pretty open to taking on projects in all four of these quadrants, if we find something that's promising and that seems like a good fit at MIRI (or something promising that seems unlikely to occur if it's not housed at MIRI):
|  | AI alignment work | Non-alignment work |
| --- | --- | --- |
| High-EV absent a pause |  |  |
| High-EV given a pause |  |  |
I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.
It's also important to keep in mind that on Leopold's model (and my own), these problems need to be solved under a ton of time pressure. To maintain a lead, the USG in Leopold's scenario will often need to figure out some of these "under what circumstances can we trust this highly novel system and believe its alignment answers?" issues in a matter of weeks or perhaps months, so that the overall alignment project can complete in a very short window of time. This is not a situation where we're imagining having a ton of time to develop mastery and deep understanding of these new models. (Or mastery of the alignment problem sufficient to verify when a new idea is on the right track or not.)
one positive feature it does have, it proposes to rely on a multitude of "limited weakly-superhuman artificial alignment researchers" and makes a reasonable case that those can be obtained in a form factor which is alignable and controllable.
I don't find this convincing. I think the target "dumb enough to be safe, honest, trustworthy, relatively non-agentic, etc., but smart enough to be super helpful for alignment" is narrow (or just nonexistent, using the methods we're likely to have on hand).
Even if this exists, verification seems extraordinarily difficult: how do we know that the system is being honest? Separately, how do we verify that its solutions are correct? Checking answers is sometimes easier than generating them, but only to a limited degree, and alignment seems like a case where checking is particularly difficult.
You and Leopold seem to share the assumption that huge GPU farms or equivalently strong compute are necessary for superintelligence.
Nope! I don't assume that.
I do think that it's likely the first world-endangering AI is trained using more compute than was used to train GPT-4; but I'm certainly not confident of that prediction, and I don't think it's possible to make reasonable predictions (given our current knowledge state) about how much more compute might be needed.
("Needed" for the first world-endangeringly powerful AI humans actually build, that is. I feel confident that you can in principle build world-endangeringly powerful AI with far less compute than was used to train GPT-4; but the first lethally powerful AI systems humans actually build will presumably be far from the limits of what's physically possible!)
But what would happen if one effectively closes that path? There will be huge selection pressure to look for alternative routes, to invest more heavily in those algorithmic breakthroughs which can work with modest GPU power or even with CPUs.
Agreed. This is why I support humanity working on things like human enhancement and (plausibly) AI alignment, in parallel with working on an international AI development pause. I don't think that a pause on its own is a permanent solution, though if we're lucky and the laws are well-designed I imagine it could buy humanity quite a few decades.
I hope people will step back from solely focusing on advocating for policy-level prescriptions (as none of the existing policy-level prescriptions look particularly promising at the moment) and invest some of their time in continuing object-level discussions of AI existential safety without predefined political ends.
FWIW, MIRI does already think of "generally spreading reasonable discussion of the problem, and trying to increase the probability that someone comes up with some new promising idea for addressing x-risk" as a top organizational priority.
The usual internal framing is some version of "we have our own current best guess at how to save the world, but our idea is a massive longshot, and not the sort of basket humanity should put all its eggs in". I think "AI pause + some form of cognitive enhancement" should be a top priority, but I also consider it a top priority for humanity to try to find other potential paths to a good future.
As a start, you can prohibit sufficiently large training runs. This isn't a necessary-and-sufficient condition, and doesn't necessarily solve the problem on its own, and there's room for debate about how risk changes as a function of training resources. But it's a place to start, when the field is mostly flying blind about where the risks arise; and choosing a relatively conservative threshold makes obvious sense when failing to leave enough safety buffer means human extinction. (And when algorithmic progress is likely to reduce the minimum dangerous training size over time, whatever it is today -- which is also a reason the cap will likely need to be lowered over time to some extent, until we're out of the lethally dangerous situation we currently find ourselves in.)
Alternatively, they either don't buy the perils or believes there's a chance the other chance may not?
If they "don't buy the perils", and the perils are real, then Leopold's scenario is falsified and we shouldn't be pushing for the USG to build ASI.
If there are no perils at all, then sure, Leopold's scenario and mine are both false. I didn't mean to imply that our two views are the only options.
Separately, Leopold's model of "what are the dangers?" is different from mine. But I don't think the dangers Leopold is worried about are dramatically easier to understand than the dangers I'm worried about (in the respective worlds where our worries are correct). Just the opposite: the level of understanding you need to literally solve alignment for superintelligences vastly exceeds the level you need to just be spooked by ASI and not want it to be built. Which is the point I was making; not "ASI is axiomatically dangerous", but "this doesn't count as a strike against my plan relative to Leopold's, and in fact Leopold is making a far bigger ask of government than I am on this front".
Nuclear war essentially has a localized p(doom) of 1
I don't know what this means. If you're saying "nuclear weapons kill the people they hit", I don't see the relevance; guns also kill the people they hit, but that doesn't make a gun strategically similar to a smarter-than-human AI system.
Yep, I had in mind AI Forecasting: One Year In.
Why? 95% risk of doom isn't certainty, but seems obviously more than sufficient.
For that matter, why would the USG want to build AGI if they considered it a coinflip whether this will kill everyone or not? The USG could choose the coinflip, or it could choose to try to prevent China from putting the world at risk without creating that risk itself. "Sit back and watch other countries build doomsday weapons" and "build doomsday weapons yourself" are not the only two options.
Leopold's scenario requires that the USG come to deeply understand all the perils and details of AGI and ASI (since they otherwise don't have a hope of building and aligning a superintelligence), but then needs to choose to gamble its hegemony, its very existence, and the lives of all its citizens on a half-baked mad science initiative, when it could simply work with its allies to block the tech's development and maintain the status quo at minimal risk.
Success in this scenario requires a weird combination of USG prescience and self-destructiveness: enough foresight to see what's coming, paired with a compulsion to race to build the very thing that puts its existence at risk, when it would potentially be vastly easier to spearhead an international alliance to prohibit this technology.
Responding to Matt Reardon's point on the EA Forum:
Leopold's implicit response as I see it:
- Convincing all stakeholders of high p(doom) such that they take decisive, coordinated action is wildly improbable ("step 1: get everyone to agree with me" is the foundation of many terrible plans and almost no good ones)
- Still improbable, but less wildly, is the idea that we can steer institutions towards sensitivity to risk on the margin and that those institutions can position themselves to solve the technical and other challenges ahead
Maybe the key insight is that both strategies walk on a knife's edge. While Moore's law, algorithmic improvement, and chip design hum along at some level, even a little breakdown in international willpower to enforce a pause/stop can rapidly convert to catastrophe. Spending a lot of effort to get that consensus also has high opportunity cost in terms of steering institutions in the world where the effort fails (and it is very likely to fail). [...]
Three high-level reasons I think Leopold's plan looks a lot less workable:
- It requires major scientific breakthroughs to occur on a very short time horizon, including unknown breakthroughs that will manifest to solve problems we don't understand or know about today.
- These breakthroughs need to come in a field that has not been particularly productive or fast in the past. (Indeed, forecasters have been surprised by how slowly safety/robustness/etc. have progressed in recent years, and simultaneously surprised by the breakneck speed of capabilities.)
- It requires extremely precise and correct behavior by a giant government bureaucracy that includes many staff who won't be the best and brightest in the field — inevitably, many technical and nontechnical people in the bureaucracy will have wrong beliefs about AGI and about alignment.
The "extremely precise and correct behavior" part means that we're effectively hoping to be handed an excellent bureaucracy that will rapidly and competently solve a thirty-digit combination lock requiring the invention of multiple new fields and the solving of a variety of thorny and poorly-understood technical problems — in many cases, on Leopold's view, in a space of months or weeks. This seems... not like how the real world works.
It also separately requires that various guesses about the background empirical facts all pan out. Leopold can do literally everything right and get the USG fully on board and get the USG doing literally everything correctly by his lights — and then the plan ends up destroying the world rather than saving it because it just happened to turn out that ASI was a lot more compute-efficient to train than he expected, resulting in the USG being unable to monopolize the tech and unable to achieve a sufficiently long lead time.
My proposal doesn't require qualitatively that kind of success. It requires governments to coordinate on banning things. Plausibly, it requires governments to overreact to a weird, scary, and publicly controversial new tech to some degree, since it's unlikely that governments will exactly hit the target we want. This is not a particularly weird ask; governments ban things (and coordinate or copy-paste each other's laws) all the time, in far less dangerous and fraught areas than AGI. This is "trying to get the international order to lean hard in a particular direction on a yes-or-no question where there's already a lot of energy behind choosing 'no'", not "solving a long list of hard science and engineering problems in a matter of months and getting a bureaucracy to output the correct long string of digits to nail down all the correct technical solutions and all the correct processes to find those solutions".
The CCP's current appetite for AGI seems remarkably small, and I expect them to be more worried that an AGI race would leave them in the dust (and/or put their regime at risk, and/or put their lives at risk), than excited about the opportunity such a race provides. Governments around the world currently, to the best of my knowledge, are nowhere near advancing any frontiers in ML.
From my perspective, Leopold is imagining a future problem into being ("all of this changes") and then trying to find a galaxy-brained incredibly complex and assumption-laden way to wriggle out of this imagined future dilemma, when the far easier and less risky path would be to not have the world powers race in the first place, have them recognize that this technology is lethally dangerous (something the USG chain of command, at least, would need to fully internalize on Leopold's plan too), and have them block private labs from sending us over the precipice (again, something Leopold assumes will happen) while not choosing to take on the risk of destroying themselves (nor permitting other world powers to unilaterally impose that risk).
(Though he also has an incentive to not die.)
As is typical for Twitter, we also signal-boosted a lot of other people's takes. Some non-MIRI people whose social media takes I've recently liked include Wei Dai, Daniel Kokotajlo, Jeffrey Ladish, Patrick McKenzie, Zvi Mowshowitz, Kelsey Piper, and Liron Shapira.
The stuff I've been tweeting doesn't constitute an official MIRI statement — e.g., I don't usually run these tweets by other MIRI folks, and I'm not assuming everyone at MIRI agrees with me or would phrase things the same way. That said, some recent comments and questions from me and Eliezer:
- May 17: Early thoughts on the news about OpenAI's crazy NDAs.
- May 24: Eliezer flags that GPT-4o can now pass one of Eliezer's personal ways of testing whether models are still bad at math.
- May 29: My initial reaction to hearing Helen's comments on the TED AI podcast. Includes some follow-on discussion of the ChatGPT example, etc.
- May 30: A conversation between me and Emmett Shear about the version of events he'd tweeted in November. (Plus a comment from Eliezer.)
- May 30: Eliezer signal-boosting a correction from Paul Graham.
- June 4: Eliezer objects to Aschenbrenner's characterization of his timelines argument as open-and-shut "believing in straight lines on a graph".
Every protest I've witnessed seemed to be designed to annoy and alienate its witnesses, making it as clear as possible that there was no way to talk to these people, that their minds were on rails. I think most people recognize that as cult shit and are alienated by that.
In the last year, I've seen a Twitter video of an AI risk protest (I think possibly in continental Europe?) that struck me as extremely good: calm, thoughtful, accessible, punchy, and sensible-sounding statements and interview answers. If I find the link again, I'll add it here as a model of what I think a robustly good protest can look like!
A leftist friend once argued that protest is not really a means, but a reward, a sort of party for those who contributed to local movementbuilding. I liked that view.
I wouldn't recommend making protests purely this. A lot of these protests are getting news coverage and have a real chance of either intriguing/persuading or alienating potential allies; I think it's worth putting thought into how to hit the "intriguing/persuading" target, regardless of whether this is "normal" for protests.
But I like the idea of "protest as reward" as an element of protests, or as a focus for some protests. :)
Could we talk about a specific expert you have in mind, who thinks this is a bad strategy in this particular case?
AI risk is a pretty weird case, in a number of ways: it's highly counter-intuitive, not particularly politically polarized / entrenched, seems to require unprecedentedly fast and aggressive action by multiple countries, is almost maximally high-stakes, etc. "Be careful what you say, try to look normal, and slowly accumulate political capital and connections in the hope of swaying policymakers long-term" isn't an unconditionally good strategy, it's a strategy adapted to a particular range of situations and goals. I'd be interested in actually hearing arguments for why this strategy is the best option here, given MIRI's world-model.
(Or, separately, you could argue against the world-model, if you disagree with us about how things are.)
?
Two things:
- For myself, I would not feel comfortable using language as confident-sounding as "on the default trajectory, AI is going to kill everyone" if I assigned (e.g.) 10% probability to "humanity [gets] a small future on a spare asteroid-turned-computer or an alien zoo or maybe even star". I just think that scenario's way, way less likely than that.
- I'd be surprised if Nate assigns 10+% probability to scenarios like that, but he can speak for himself. 🤷♂️
- I think some people at MIRI have significantly lower p(doom)? And I don't expect those people to use language like "on the default trajectory, AI is going to kill everyone".
- I agree with you that there's something weird about making lots of human-extinction-focused arguments when the thing we care more about is "does the cosmic endowment get turned into paperclips"? I do care about both of those things, an enormous amount; and I plan to talk about both of those things to some degree in public communications, rather than treating it as some kind of poorly-kept secret that MIRI folks care about whether flourishing interstellar civilizations get a chance to exist down the line. But I have this whole topic mentally flagged as a thing to be thoughtful and careful about, because it at least seems like an area that contains risk factors for future deceptive comms. E.g., if we update later to expecting the cosmic endowment to be wasted but all humans not dying, I would want us to adjust our messaging even if that means sacrificing some punchiness in our policy outreach.
- Currently, however, I think the particular scenario "AI keeps a few flourishing humans around forever" is incredibly unlikely, and I don't think Eliezer, Nate, etc. would say things like "this has a double-digit probability of happening in real life"? And, to be honest, the idea of myself and my family and friends and every other human being all dying in the near future really fucks me up and does not seem in any sense OK, even if (with my philosopher-hat on) I think this isn't as big of a deal as "the cosmic endowment gets wasted".
- So I don't currently feel bad about emphasizing a true prediction ("extremely likely that literally all humans literally nonconsensually die by violent means"), even though the philosophy-hat version of me thinks that the separate true prediction "extremely likely 99+% of the potential value of the long-term future is lost" is more morally important than that. Though I do feel obliged to semi-regularly mention the whole "cosmic endowment" thing in my public communication too, even if it doesn't make it into various versions of my general-audience 60-second AI risk elevator pitch.
Note that "everyone will be killed (or worse)" is a different claim from "everyone will be killed"! (And see Oliver's point that Ryan isn't talking about mistreated brain scans.)
Some of the other things you suggest, like future systems keeping humans physically alive, do not seem plausible to me.
I agree with Gretta here, and I think this is a crux. If MIRI folks thought it were likely that AI will leave a few humans biologically alive (as opposed to information-theoretically revivable), I don't think we'd be comfortable saying "AI is going to kill everyone". (I encourage other MIRI folks to chime in if they disagree with me about the counterfactual.)
I also personally have maybe half my probability mass on "the AI just doesn't store any human brain-states long-term", and I have less than 1% probability on "conditional on the AI storing human brain-states for future trade, the AI does in fact encounter aliens that want to trade and this trade results in a flourishing human civilization".
FWIW I do think "don't trust this guy" is warranted; I don't know that he's malicious, but I think he's just exceptionally incompetent relative to the average tech reporter you're likely to see stories from.
Like, in 2018 Metz wrote a full-length article on smarter-than-human AI that included the following frankly incredible sentence:
During a recent Tesla earnings call, Mr. Musk, who has struggled with questions about his company’s financial losses and concerns about the quality of its vehicles, chastised the news media for not focusing on the deaths that autonomous technology could prevent — a remarkable stance from someone who has repeatedly warned the world that A.I. is a danger to humanity.
FWIW, Cade Metz was reaching out to MIRI and some other folks in the x-risk space back in January 2020, and I went to read some of his articles and came to the conclusion in January that he's one of the least competent journalists -- like, most likely to misunderstand his beat and emit obvious howlers -- that I'd ever encountered. I told folks as much at the time, and advised against talking to him just on the basis that a lot of his journalism is comically bad and you'll risk looking foolish if you tap him.
This was six months before Metz caused SSC to shut down and more than a year before his hit piece on Scott came out, so it wasn't in any way based on 'Metz has been mean to my friends' or anything like that. (At the time he wasn't even asking around about SSC or Scott, AFAIK.)
(I don't think this is an idiosyncratic opinion of mine, either; I've seen other non-rationalists I take seriously flag Metz as someone unusually out of his depth and error-prone for a NYT reporter, for reporting unrelated to SSC stuff.)
Sounds like a lot of political alliances! (And "these two political actors are aligned" is arguably an even weaker condition than "these two political actors are allies".)
At the end of the day, of course, all of these analogies are going to be flawed. AI is genuinely a different beast.
It's pretty sad to call all of these end states you describe alignment as alignment is an extremely natural word for "actually terminally has good intentions".
Aren't there a lot of clearer words for this? "Well-intentioned", "nice", "benevolent", etc.
(And a lot of terms, like "value loading" and "value learning", that are pointing at the research project of getting good intentions into the AI.)
To my ear, "aligned person" sounds less like "this person wishes the best for me", and more like "this person will behave in the right ways".
If I hear that Russia and China are "aligned", I do assume that their intentions play a big role in that, but I also assume that their circumstances, capabilities, etc. matter too. Alignment in geopolitics can be temporary or situational, and it almost never means that Russia cares about China as much as China cares about itself, or vice versa.
And if we step back from the human realm, an engineered system can be "aligned" in contexts that have nothing to do with goal-oriented behavior, but are just about ensuring components are in the right place relative to each other.
Cf. the history of the term "AI alignment". From my perspective, a big part of why MIRI coordinated with Stuart Russell to introduce the term "AI alignment" was that we wanted to switch away from "Friendly AI" to a term that sounded more neutral. "Friendly AI research" had always been intended to subsume the full technical problem of making powerful AI systems safe and aimable; but emphasizing "Friendliness" made it sound like the problem was purely about value loading, so a more generic and low-content word seemed desirable.
But Stuart Russell (and later Paul Christiano) had a different vision in mind for what they wanted "alignment" to be, and MIRI apparently failed to communicate and coordinate with Russell to avoid a namespace collision. So we ended up with a messy patchwork of different definitions.
I've basically given up on trying to achieve uniformity on what "AI alignment" is; the best we can do, I think, is clarify whether we're talking about "intent alignment" vs. "outcome alignment" when the distinction matters.
But I do want to push back against those who think outcome alignment is just an unhelpful concept — on the contrary, if we didn't have a word for this idea I think it would be very important to invent one.
IMO it matters more that we keep our eye on the ball (i.e., think about the actual outcomes we want and keep researchers' focus on how to achieve those outcomes) than that we define an extremely crisp, easily-packaged technical concept (that is at best a loose proxy for what we actually want). Especially now that ASI seems nearer at hand (so the need for this "keep our eye on the ball" skill is becoming less and less theoretical), and especially now that ASI disaster concerns have hit the mainstream (so the need to "sell" AI risk has diminished somewhat, and the need to direct research talent at the most important problems has increased).
And I also want to push back against the idea that a priori, before we got stuck with the current terminology mess, it should have been obvious that "alignment" is about AI systems' goals and/or intentions, rather than about their behavior or overall designs. I think intent alignment took off because Stuart Russell and Paul Christiano advocated for that usage and encouraged others to use the term that way, not because this was the only option available to us.
"Should" in order to achieve a certain end? To meet some criterion? To boost a term in your utility function?
In the OP: "Should" in order to have more accurate beliefs/expectations. E.g., I should anticipate (with high probability) that the Sun will rise tomorrow in my part of the world, rather than it remaining night.
Why would the laws of physics conspire to vindicate a random human intuition that arose for unrelated reasons?
We do agree that the intuition arose for unrelated reasons, right? There's nothing in our evolutionary history, and no empirical observation, that causally connects the mechanism you're positing and the widespread human hunch "you can't copy me".
If the intuition is right, we agree that it's only right by coincidence. So why are we desperately searching for ways to try to make the intuition right?
It also doesn't force us to believe that a bunch of water pipes or gears functioning as a classical computer can ever have our own first person experience.
Why is this an advantage of a theory? Are you under the misapprehension that "hypothesis H allows humans to hold on to assumption A" is a Bayesian update in favor of H even when we already know that humans had no reason to believe A? This is another case where your theory seems to require that we only be coincidentally correct about A ("sufficiently complex arrangements of water pipes can't ever be conscious"), if we're correct about A at all.
One way to rescue this argument is by adding in an anthropic claim, like: "If water pipes could be conscious, then nearly all conscious minds would be instantiated in random dust clouds and the like, not in biological brains. So given that we're not Boltzmann brains briefly coalescing from space dust, we should update that giant clouds of space dust can't be conscious."
But is this argument actually correct? There's an awful lot of complex machinery in a human brain. (And the same anthropic argument seems to suggest that some of the human-specific machinery is essential, else we'd expect to be some far-more-numerous observer, like an insect.) Is it actually that common for a random brew of space dust to coalesce into exactly the right shape, even briefly?
Yeah, at some point we'll need a proper theory of consciousness regardless, since many humans will want to radically self-improve and it's important to know which cognitive enhancements preserve consciousness.
You can easily clear this confusion if you rephrase it as "You should anticipate having any of these experiences". Then it's immediately clear that we are talking about two separate screens.
This introduces some other ambiguities. E.g., "you should anticipate having any of these experiences" may make it sound like you have a choice as to which experience to rationally expect.
And it's also clear that our curriocity isn't actually satisfied. That the question "which one of these two will actually be the case" is still very much on the table.
... And the answer is "both of these will actually be the case (but not in a split-screen sort of way)".
Your rephrase hasn't shown that there was a question left unanswered in the original post; it's just shown that there isn't a super short way to crisply express what happens in English, you do actually have to add the clarification.
Still as soon as we got Rob-y and Rob-z they are not "metaphysically the same person". When Rob-y says "I" he is reffering to Rob-y, not Rob-z and vice versa. More specifically Rob-y is refering to some causal curve through time ans Rob-z is refering to another causal curve through time. These two curves are the same to some point, but then they are not.
Yep, I think this is a perfectly fine way to think about the thing.
My first issue with your post is that this initial ontological assumption is neither mentioned explicitly nor motivated. Nothing in your post can be used as proof of this initial assumption.
There are always going to be many different ways someone could object to a view. If you were a Christian, you'd perhaps be objecting that the existence of incorporeal God-given Souls is the real crux of the matter, and if I were intellectually honest I'd be devoting the first half of the post to arguing against the Christian Soul.
Rather than trying to anticipate these objections, I'd rather just hear them stated out loud by their proponents and then hash them out in the comments. This also makes the post less boring for the sorts of people who are most likely to be on LW: physicalists and their ilk.
Now, what would be the experience of getting copied, seen from a first-person, "internal", perspective? I am pretty sure it would be something like: you walk into the room, you sit there, you hear say the scanner working for some time, it stops, you walk out. From my agnostic perspective, if I were the one to be scanned it seems like nothing special would have happened to me in this procedure. I didnt feel anything weird, I didnt feel my "consciousness split into two" or something.
Why do you assume that you wouldn't experience the copy's version of events?
The un-copied version of you experiences walking into the room, sitting there, hearing the scanner working, and hearing it stop; then that version of you experiences walking out. It seems like nothing special happened in this procedure; this version of you doesn't feel anything weird, and doesn't feel like their "consciousness split into two" or anything.
The copied version of you experiences walking into the room, sitting there, hearing the scanner working, and then an instantaneous experience of (let's say) feeling like you've been teleported into another room -- you're now inside the simulation. Assuming the simulation feels like a normal room, it could well seem like nothing special happened in this procedure -- it may feel like blinking and seeing the room suddenly change during the blink, while you yourself remain unchanged. This version of you doesn't necessarily feel anything weird either, and they don't feel like their "consciousness split into two" or anything.
It's a bit weird that there are two futures, here, but only one past -- that the first part of the story is the same for both versions of you. But so it goes; that just comes with the territory of copying people.
If you disagree with anything I've said above, what do you disagree with? And, again, what do you mean by saying you're "pretty sure" that you would experience the future of the non-copied version?
Namely, if I consider this procedure as an empirical experiment, from my first person perspective I dont get any new / unexpected observation compared to say just sitting in an ordinary room. Even if I were to go and find my copy, my experience would again be like meeting a different person which just happens to look like me and which claims to have similar memories up to the point when I entered the copying room. There would be no way to verify or to view things from their first person perspective.
Sure. But is any of this Bayesian evidence against the view I've outlined above? What would it feel like, if the copy were another version of yourself? Would you expect that you could telepathically communicate with your copy and see things from both perspectives at once, if your copies were equally "you"? If so, why?
On the contrary, I would be wary to, say, kill myself or to be destroyed after the copying procedure, since no change will have occured to my first person perspective, and it would thus seem less likely that my "experience" would somehow survive because of my copy.
Shall we make a million copies and then take a vote? :)
I agree that "I made a non-destructive software copy of myself and then experienced the future of my physical self rather than the future of my digital copy" is nonzero Bayesian evidence that physical brains have a Cartesian Soul that is responsible for the brain's phenomenal consciousness; the Cartesian Soul hypothesis does predict that data. But the prior probability of Cartesian Souls is low enough that I don't think it should matter.
You need some prior reason to believe in this Soul in the first place; the same as if you flipped a coin, it came up heads, and you said "aha, this is perfectly predicted by the existence of an invisible leprechaun who wanted that coin to come up heads!". Losing a coinflip isn't a surprising enough outcome to overcome the prior against invisible leprechauns.
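To spell out the arithmetic behind "the prior should dominate here" (a minimal sketch with made-up illustrative numbers, not figures I've carefully estimated): let E be "after a non-destructive copy, the physical original finds itself on the physical branch". Grant the Soul hypothesis the most favorable accounting, where it predicts E with certainty while the no-Soul view assigns E something like 1/2. Then the likelihood ratio is at most 2, and with prior odds on Souls of, say, one in a million:

$$\frac{P(\text{Soul}\mid E)}{P(\neg\text{Soul}\mid E)} \;=\; \frac{P(E\mid \text{Soul})}{P(E\mid \neg\text{Soul})}\cdot\frac{P(\text{Soul})}{P(\neg\text{Soul})} \;\le\; 2 \times 10^{-6}.$$

A factor-of-2 update can't rescue a hypothesis that starts out that improbable -- which is the sense in which losing the coinflip doesn't vindicate the invisible leprechaun.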
and it would also force me to accept that even a copy where the "circuit" is made of water pipes and pumps, or gears and levers also have an actual, first person experience as "me", as long as the appropriate computations are being carried out.
Why wouldn't it? What do you have against water pipes?
Wouldn't it follow that in the same way you anticipate the future experiences of the brain that you "find yourself in" (i.e. the person reading this) you should anticipate all experiences, i.e. that all brain states occur with the same kind of me-ness/vivid immediacy?
What's the empirical or physical content of this belief?
I worry that this may be another case of the Cartesian Ghost rearing its ugly head. We notice that there's no physical thingie that makes the Ghost more connected to one experience or the other; so rather than exorcising the Ghost entirely, we imagine that the Ghost is connected to every experience simultaneously.
But in fact there is no Ghost. There's just a bunch of experience-moments implemented in brain-moments.
Some of those brain-moments resemble other brain-moments, either by coincidence or because of some (direct or indirect) causal link between the brain-moments. When we talk about Brain-1 "anticipating" or "becoming" a future brain-state Brain-2, we normally mean things like:
- There's a lawful physical connection between Brain-1 and Brain-2, such that the choices and experiences of Brain-1 influence the state of Brain-2 in a bunch of specific ways.
- Brain-2 retains ~all of the memories, personality traits, goals, etc. of Brain-1.
- If Brain-2 is a direct successor to Brain-1, then typically Brain-2 can remember a bunch of things about the experience Brain-1 was undergoing.
These are all fuzzy, high-level properties, which admit of edge cases. But I'm not seeing what's gained by therefore concluding "I should anticipate every experience, even ones that have no causal connection to mine and no shared memories and no shared personality traits". Tables are a fuzzy and high-level concept, but that doesn't mean that every object in existence is a table. It doesn't even mean that every object is slightly table-ish. A photon isn't "slightly table-ish", it's just plain not a table.
Which just means, all brain states exist in the same vivid, for-me way, since there is nothing further to distinguish between them that makes them this vivid, i.e. they all exist HERE-NOW.
But they don't have the anticipation-related properties I listed above; so what hypotheses are we distinguishing by updating from "these experiences aren't mine" to "these experiences are mine"?
Maybe the update that's happening is something like: "Previously it felt to me like other people's experiences weren't fully real. I was unduly selfish and self-centered, because my experiences seemed to me like they were the center of the universe; I abstractly and theoretically knew that other people have their own point of view, but that fact didn't really hit home for me. Then something happened, and I had a sudden realization that no, it's all real."
If so, then that seems totally fine to me. But I worry that the view in question might instead be something tacitly Cartesian, insofar as it's trying to say "all experiences are for me" -- something that doesn't make a lot of sense to say if there are two brain states on opposite sides of the universe with nothing in common and nothing connecting them, but that does make sense if there's a Ghost the experiences are all "for".
As a test, I asked a non-philosopher friend of mine what their view is. Here's a transcript of our short conversation: https://docs.google.com/document/d/1s1HOhrWrcYQ5S187vmpfzZcBfolYFIbeTYgqeebNIA0/edit
I was a bit annoyingly repetitive with trying to confirm and re-confirm what their view is, but I think it's clear from the exchange that my interpretation is correct at least for this person.
Is there even anybody claiming there is an experiential difference?
Yep! Ask someone with this view whether the current stream of consciousness continues from their pre-uploaded self to their post-uploaded self, like it continues when they pass through a doorway. The typical claim is some version of "this stream of consciousness will end, what comes next is only oblivion", not "oh sure, the stream of consciousness is going to continue in the same way it always does, but I prefer not to use the English word 'me' to refer to the later parts of that stream of consciousness".
This is why the disagreement here has policy implications: people with different views of personal identity have different beliefs about the desirability of mind uploading. They aren't just disagreeing about how to use words, and if they were, you'd be forced into the equally "uncharitable" perspective that someone here is very confused about how relevant word choice is to the desirability of uploading.
The alternative to this is that there is a disagreement about the appropriate semantic interpretation/analysis of the question. E.g. about what we mean when we say "I will (not) experience such and such". That seems more charitable than hypothesizing beliefs in "ghosts" or "magic".
I didn't say that the relevant people endorse a belief in ghosts or magic. (Some may do so, but many explicitly don't!)
It's a bit darkly funny that you've reached for a clearly false and super-uncharitable interpretation of what I said, in the same sentence you're chastising me for being uncharitable! But also, "charity" is a bad approach to trying to understand other people, and bad epistemology can get in the way of a lot of stuff.
The problem was that you first seemed to belittle questions about word meanings ("self") as being "just" about "definitions" that are "purely verbal".
I did no such thing!
Luckily now you concede that the question about the meaning of "I" isn't just about (arbitrary) "definitions"
Read the blog post at the top of this page! It's my attempt to answer the question of when a mind is "me", and you'll notice it's not talking about definitions.
But we already know all the empirical facts: Someone goes into the teleporter, a bit later someone comes out at the other end and sees something. So the issue can only be about the semantic interpretation of that question, about what we mean with expressions like "I will see x".
Nope!
There are two perspectives here:
- "I don't want to upload myself, because I wouldn't get to experience that uploads' experiences. When I die, this stream of consciousness will end, rather than continuing in another body. Physically dying and then being being copied elsewhere is not phenomenologically indistinguishable from stepping through a doorway."
- "I do want to upload myself, because I would get to experience that uploads' experiences. Physically dying and then being copied myself is phenomenologically indistinguishable from stepping through a doorway."
The disagreement between these two perspectives isn't about word definitions at all; a fear that "when my body dies, there will be nothing but oblivion" is a very real fear about anticipated experiences (and anticipated absences of experience), not a verbal quibble about how we ought to define a specific word.
But it's also a bit confusing to call the disagreement between these two perspectives "empirical", because "empirical" here is conflating "third-person empirical" with "first-person empirical".
The disagreement here is about whether a stream of consciousness can "continue" across temporal and spatial gaps, in the same way that it continues when there are no obvious gaps. It's about whether there's a subjective, experiential difference between stepping through a doorway and using a teleporter.
The thing I'm arguing in the OP is that there can't be an experiential difference here, because there's no physical difference that could be underlying the supposed experiential difference. So the disagreement about the first-person facts, I claim, stems from a cognitive error, which I characterize as "making predictions as though you believed yourself to be a Cartesian Ghost (even if you don't on-reflection endorse the claim that Cartesian Ghosts exist)". This is, again, a very different error from "defining a word in a nonstandard way".
You're also free to define "I" however you want in your values.
Sort of!
- It's true that no law of nature will stop you from using "I" in a nonstandard way; your head will not explode if you redefine "table" to mean "penguin".
- And it's true that there are possible minds in abstract mindspace that have all sorts of values, including strict preferences about whether they want their brain to be made of silicon vs. carbon.
- But it's not true that humans alive today have full and complete control over their own preferences.
- And it's not true that humans can never be mistaken in their beliefs about their own preferences.
In the case of teleportation, I think teleportation-phobic people are mostly making an implicit error of the form "mistakenly modeling situations as though you are a Cartesian Ghost who is observing experiences from outside the universe", not making a mistake about what their preferences are per se. (Though once you realize that you're not a Cartesian Ghost, that will have some implications for what experiences you expect to see next in some cases, and implications for what physical world-states you prefer relative to other world-states.)
FWIW, I typically use "alignment research" to mean "AI research aimed at making it possible to safely do ambitious things with sufficiently-capable AI" (with an emphasis on "safely"). So I'd include things like Chris Olah's interpretability research, even if the proximate impact of this is just "we understand what's going on better, so we may be more able to predict and finely control future systems" and the proximate impact is not "the AI is now less inclined to kill you".
Some examples: I wouldn't necessarily think of "figure out how we want to airgap the AI" as "alignment research", since it's less about designing the AI, shaping its mind, predicting and controlling it, etc., and more about designing the environment around the AI.
But I would think of things like "figure out how to make this AI system too socially-dumb to come up with ideas like 'maybe I should deceive my operators', while keeping it superhumanly smart at nanotech research" as central examples of "alignment research", even though it's about controlling capabilities ('make the AI dumb in this particular way') rather than about instilling a particular goal into the AI.
And I'd also think of "we know this AI is trying to kill us; let's figure out how to constrain its capabilities so that it keeps wanting that, but is too dumb to find a way to succeed in killing us, thereby forcing it to work with us rather than against us in order to achieve more of what it wants" as a pretty central example of alignment research, albeit not the sort of alignment research I feel optimistic about. The way I think about the field, you don't have to specifically attack the "is it trying to kill you?" part of the system in order to be doing alignment research; there are other paths, and alignment researchers should consider all of them and focus on results rather than marrying a specific methodology.
But that isn't an experience. It's two experiences. You will not have an experience of having two experiences. Two experiences will experience having been one person.
Sure; from my perspective, you're saying the same thing as me.
Are you going to care about 1000 different copies equally?
How am I supposed to choose between them?
Why? If "I" is arbitrary definition, then “When I step through this doorway, will I have another experience?" depends on this arbitrary definition and so is also arbitrary.
Which things count as "I" isn't an arbitrary definition; it's just a fuzzy natural-language concept.
(I guess you can call that "arbitrary" if you want, but then all the other words in the sentence, like "doorway" and "step", are also "arbitrary".)
Analogy: When you're writing in your personal diary, you're free to define "table" however you want. But in ordinary English-language discourse, if you call all penguins "tables" you'll just be wrong. And this fact isn't changed at all by the fact that "table" lacks a perfectly formal physics-level definition.
The same holds for "Will Rob Bensinger's next experience be of sitting in his bedroom writing a LessWrong comment, or will it be of him grabbing some tomatoes in a supermarket in Beijing?"
Terms like 'Rob Bensinger' and 'I' aren't perfectly physically crisp — there may be cases where the answer is "ehh, maybe?" rather than a clear yes or no. And if we live in a Big Universe and we allow that there can be many Beijings out there in space, then we'll have to give a more nuanced quantitative answer, like "a lot more of Rob's immediate futures are in his bedroom than in Beijing".
But if we restrict our attention to this Beijing, then all that complexity goes away and we can pretty much rule out that anyone in Beijing will happen to momentarily exhibit exactly the right brain state to look like "Rob Bensinger plus one time step".
The nuances and wrinkles don't bleed over and make it a totally meaningless or arbitrary question; and indeed, if I thought I were likely to spontaneously teleport to Beijing in the next minute, I'd rightly be making very different life-choices! "Will I experience myself spontaneously teleporting to Beijing in the next second?" is a substantive (and easy) question, not a deep philosophical riddle.
So you always anticipate all possible experiences, because of the multiverse?
Not all possible experiences; just all experiences of brains that have the same kinds of structural similarities to your current brain as, e.g., "me after I step through a doorway" has to "me before I stepped through the doorway".
The problem is that another way to phrase this is "a superintelligent weapon system": "ending a risk period" by "reliably, and efficiently doing a small number of specific concrete tasks" means using physical force to impose your will on others.
The pivotal acts I usually think about actually don't route through physically messing with anyone else. I'm usually thinking about using aligned AGI to bootstrap to fast human whole-brain emulation, then using the ems to bootstrap to fully aligned CEV AI.
If someone pushes a "destroy the world" button then the ems or CEV AI would need to stop the world from being destroyed, but that won't necessarily happen if the developers have enough of a lead, if they get the job done quickly enough, and if CEV AI is able to persuade the world to step back from the precipice voluntarily (using superhumanly good persuasion that isn't mind-control-y, deceptive, or otherwise consent-violating). It's a big ask, but not as big as CEV itself, I expect.
From my current perspective this is all somewhat of a moot point, however, because I don't think alignment is tractable enough that humanity should be trying to use aligned AI to prevent human extinction. I think we should instead hit the brakes on AI and shift efforts toward human enhancement, until some future generation is in a better position to handle the alignment problem.
If and only if that fails, it may be appropriate to consider less consensual options.
It's not clear to me that we disagree in any action-relevant way, since I also don't think AI-enabled pivotal acts are the best path forward anymore. I think the path forward is via international agreements banning dangerous tech, and technical research to improve humanity's ability to wield such tech someday.
That said, it's not clear to me how your "if that fails, then try X instead" works in practice. How do you know when it's failed? Isn't it likely to be too late by the time we're sure that we've failed on that front? Indeed, it's plausibly already too late for humanity to seriously pivot to 'aligned AGI'. If I thought humanity's last best scrap of hope for survival lay in an AI-empowered pivotal act, I'd certainly want more details on when it's OK to start trying to figure out how to have humanity not die via this last desperate path.
To pick out a couple of specific examples from your list, Wei Dai:
14. Human-controlled AIs causing ethical disasters (e.g., large scale suffering that can't be "balanced out" later) prior to reaching moral/philosophical maturity
This is a serious long-term concern if we don't kill ourselves first, but it's not something I see as a premise for "the priority is for governments around the world to form an international agreement to halt AI progress". If AI were easy to use for concrete tasks like "build nanotechnology" but hard to use for things like CEV, I'd instead see the priority as "use AI to prevent anyone else from destroying the world with AI", and I wouldn't want to trade off probability of that plan working in exchange for (e.g.) more probability of the US and the EU agreeing in advance to centralize and monitor large computing clusters.
After someone has done a pivotal act like that, you might then want to move more slowly insofar as you're worried about subtle moral errors creeping in to precursors to CEV.
30. AI systems end up controlled by a group of humans representing a small range of human values (ie. an ideological or religious group that imposes values on everyone else)
I currently assign very low probability to humans being able to control the first ASI systems, and redirecting governments' attention away from "rogue AI" and toward "rogue humans using AI" seems very risky to me, insofar as it causes governments to misunderstand the situation, and to specifically misunderstand it in a way that encourages racing.
If you think rogue actors can use ASI to achieve their ends, then you should probably also think that you could use ASI to achieve your own ends; misuse risk tends to go hand-in-hand with "we're the Good Guys, let's try to outrace the Bad Guys so AI ends up in the right hands". This could maybe be justified if it were true, but when it's not even true it strikes me as an especially bad argument to make.
Yep, before I saw orthonormal's response I had a draft-reply written that says almost literally the same thing:
we just call 'em like we see 'em
[...]
insofar as we make bad predictions, we should get penalized for it. and insofar as we think alignment difficulty is the crux for 'why we need to shut it all down', we'd rather directly argue against illusory alignment progress (and directly acknowledge real major alignment progress as a real reason to be less confident of shutdown as a strategy) rather than redirect to something less cruxy
I'll also add: Nate (unlike Eliezer, AFAIK?) hasn't flatly said 'alignment is extremely difficult'. Quoting from Nate's "sharp left turn" post:
Many people wrongly believe that I'm pessimistic because I think the alignment problem is extraordinarily difficult on a purely technical level. That's flatly false, and is pretty high up there on my list of least favorite misconceptions of my views.
I think the problem is a normal problem of mastering some scientific field, as humanity has done many times before. Maybe it's somewhat trickier, on account of (e.g.) intelligence being more complicated than, say, physics; maybe it's somewhat easier on account of how we have more introspective access to a working mind than we have to the low-level physical fields; but on the whole, I doubt it's all that qualitatively different than the sorts of summits humanity has surmounted before.
It's made trickier by the fact that we probably have to attain mastery of general intelligence before we spend a bunch of time working with general intelligences (on account of how we seem likely to kill ourselves by accident within a few years, once we have AGIs on hand, if no pivotal act occurs), but that alone is not enough to undermine my hope.
What undermines my hope is that nobody seems to be working on the hard bits, and I don't currently expect most people to become convinced that they need to solve those hard bits until it's too late.
So it may be that Nate's models would be less surprised by alignment breakthroughs than Eliezer's models. And some other MIRI folks are much more optimistic than Nate, FWIW.
My own view is that I don't feel nervous leaning on "we won't crack open alignment in time" as a premise, and absent that premise I'd indeed be much less gung-ho about government intervention.
why put all your argumentative eggs in the "alignment is hard" basket? (If you're right, then policymakers can't tell that you're right.)
The short answer is "we don't put all our eggs in the basket" (e.g., Eliezer's TED talk and TIME article emphasize that alignment is an open problem, but they emphasize other things too, and they don't go into detail on exactly how hard Eliezer thinks the problem is), plus "we very much want at least some eggs in that basket because it's true, it's honest, it's cruxy for us, etc." And it's easier for policymakers to acquire strong Bayesian evidence for "the problem is currently unsolved" and "there's no consensus about how to solve it" and "most leaders in the field seem to think there's a serious chance we won't solve it in time" than to acquire strong Bayesian evidence for "we're very likely generations away from solving alignment", so the difficulty of communicating the latter isn't a strong reason to de-emphasize all the former points.
The longer answer is a lot more complicated. We're still figuring out how best to communicate our views to different audiences, and "it's hard for policymakers to evaluate all the local arguments or know whether Yann LeCun is making more sense than Yoshua Bengio" is a serious constraint. If there's a specific argument (or e.g. a specific three arguments) you think we should be emphasizing alongside "alignment is unsolved and looks hard", I'd be interested to hear your suggestion and your reasoning. https://www.lesswrong.com/posts/WXvt8bxYnwBYpy9oT/the-main-sources-of-ai-risk is a very long list and isn't optimized for policymakers, so I'm not sure what specific changes you have in mind here.
I expect it makes it easier, but I don't think it's solved.
Suppose you want to synthesize a lot of diamonds. Instead of giving an AI some lofty goal like "maximize diamonds in an aligned way", why not a bunch of small grounded ones?
- "Plan the factory layout of the diamond synthesis plant with these requirements".
- "Order the equipment needed, here's the payment credentials".
- "Supervise construction this workday comparing to original plans"
- "Given this step of the plan, do it"
- (Once the factory is built) "remove the output from diamond synthesis machine A53 and clean it".
That is how MIRI imagines a sane developer using just-barely-aligned AI to save the world. You don't build an open-ended maximizer and unleash it on the world to maximize some quantity that sounds good to you; that sounds insanely difficult. You carve out as many tasks as you can into concrete, verifiable chunks, and you build the weakest and most limited possible AI you can to complete each chunk, to minimize risk. (Though per faul_sname, you're likely to be pretty limited in how much you can carve up the task, given time will be a major constraint and there may be parts of the task you don't fully understand at the outset.)
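As a loose illustration of that "concrete, verifiable chunks" framing (my own sketch, not anything MIRI has endorsed; every task name and helper function here is hypothetical), the shape of the workflow might look something like:

```python
# Hypothetical sketch: carve a big job into small, independently verifiable tasks,
# and run each one on the weakest, most limited AI profile that can plausibly do it.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    prompt: str                    # narrow, concrete instruction
    capability_profile: str        # e.g. "planning-only", "no-internet", "actuators-off"
    verify: Callable[[str], bool]  # independent check (human review, tests, etc.)

def run_pipeline(tasks: List[Task], run_limited_ai: Callable[[str, str], str]) -> List[str]:
    """Run tasks in order; halt the moment any output fails its verification step."""
    outputs = []
    for task in tasks:
        output = run_limited_ai(task.prompt, task.capability_profile)
        if not task.verify(output):
            raise RuntimeError(f"Verification failed; halting before the next step: {task.prompt!r}")
        outputs.append(output)
    return outputs
```

The point of the sketch is just the shape of the approach: small tasks, the minimum capability per task, and a check you trust between every step, so that no single call is doing open-ended optimization on your behalf.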
Cf. The Rocket Alignment Problem. The point of solving the diamond maximizer problem isn't to go build the thing; it's that solving it is an indication that we've become less conceptually confused about real-world optimization and about aimable cognitive work. Being less conceptually confused about very basic aspects of problem-solving and goal-oriented reasoning means that you might be able to build some of your powerful AI systems out of building blocks that are relatively easy to analyze, test, design, predict, separate out into discrete modules, measure and limit the capabilities of, etc., etc.
That seems acceptable; industrial equipment causes accidents all the time, and the main thing is to limit the damage. Fences to limit the robots' operating area, timers that shut down control after a timeout, etc.
If everyone in the world chooses to permanently use very weak systems because they're scared of AI killing them, then yes, the impact of any given system failing will stay low. But that's not what's going to actually happen; many people will use more powerful systems, once they can, because they misunderstand the risks or have galaxy-brained their way into not caring about them (e.g. 'maybe humans don't deserve to live', 'if I don't do it someone else will anyway', 'if it's that easy to destroy the world then we're fucked anyway so I should just do the Modest thing of assuming nothing I do is that important'...).
The world needs some solution to the problem "if AI keeps advancing and more-powerful AI keeps proliferating, eventually someone will destroy the world with it". I don't know of a way to leverage AI to solve that problem without the AI being pretty dangerously powerful, so I don't think AI is going to get us out of this mess unless we make a shocking amount of progress on figuring out how to align more powerful systems. (Where "aligning" includes things like being able to predict in advance how pragmatically powerful your system is, and being able to carefully limit the ways in which it's powerful.)
To be clear: The diamond maximizer problem is about getting specific intended content into the AI's goals ("diamonds" as opposed to some random physical structure it's maximizing), not just about building a stable maximizer.
From briefly talking to Eliezer about this the other day, I think the story from MIRI's perspective is more like:
- Back in 2001, we defined "Friendly AI" as "The field of study concerned with the production of human-benefiting, non-human-harming actions in Artificial Intelligence systems that have advanced to the point of making real-world plans in pursuit of goals."
We could have defined the goal more narrowly or generically than that, but that just seemed like an invitation to take your eye off the ball: if we aren't going to think about the question of how to get good long-run outcomes from powerful AI systems, who will?
And many of the technical and philosophical problems seemed particular to CEV, which seemed like an obvious sort of solution to shoot for: just find some way to leverage the AI's intelligence to solve the problem of extrapolating everyone's preferences in a reasonable way, and of aggregating those preferences fairly.
- Come 2014, Stuart Russell and MIRI were both looking for a new term to replace "the Friendly AI problem", now that the field was starting to become a Real Thing. Both parties disliked Bostrom's "the control problem". In conversation, Russell proposed "the alignment problem", and MIRI liked it, so Russell and MIRI both started using the term in public.
Unfortunately, it gradually came to light that Russell and MIRI had understood "Friendly AI" to mean two moderately different things, and this disconnect now turned into a split between how MIRI used "(AI) alignment" and how Russell used "(value) alignment". (Which I think also influenced the split between Paul Christiano's "(intent) alignment" and MIRI's "(outcome) alignment".)
Russell's version of "friendliness/alignment" was about making the AI have good, human-deferential goals. But Creating Friendly AI 1.0 had been very explicit that "friendliness" was about good behavior, regardless of how that's achieved. MIRI's conception of "the alignment problem" (like Bostrom's "control problem") included tools like capability constraint and boxing, because the thing we wanted researchers to focus on was the goal of leveraging AI capabilities to get actually-good outcomes, whatever technical work that requires, not some proxy goal that might turn out to be surprisingly irrelevant.
Again, we wanted a field of people keeping their eye on the ball and looking for clever technical ways to get the job done, rather than a field that neglects some actually-useful technique because it doesn't fit their narrow definition of "alignment".
- Meanwhile, developments like the rise of deep learning had updated MIRI that CEV was not going to be a realistic thing to shoot for with your first AI. We were still thinking of some version of CEV as the ultimate goal, but it now seemed clear that capabilities were progressing too quickly for humanity to have time to nail down all the details of CEV, and it was also clear that the approaches to AI that were winning out would be far harder to analyze, predict, and "aim" than 2001-Eliezer had expected. It seemed clear that if AI was going to help make the future go well, the first order of business would be to do the minimal thing to prevent other AIs from destroying the world six months later, with other parts of alignment/friendliness deferred to later.
I think considerations like this eventually trickled into how MIRI used the term "alignment". Our first public writing reflecting the switch from "Friendly AI" to "alignment", our Dec. 2014 agent foundations research agenda, said:
We call a smarter-than-human system that reliably pursues beneficial goals “aligned with human interests” or simply “aligned.”
Whereas by July 2016, when we released a new research agenda that was more ML-focused, "aligned" was shorthand for "aligned with the interests of the operators".
In practice, we started using "aligned" to mean something more like "aimable" (where aimability includes things like corrigibility, limiting side-effects, monitoring and limiting capabilities, etc., not just "getting the AI to predictably tile the universe with smiley faces rather than paperclips"). Focusing on CEV-ish systems mostly seemed like a distraction, and an invitation to get caught up in moral philosophy and pie-in-the-sky abstractions, when "do a pivotal act" is legitimately a hugely more philosophically shallow topic than "implement CEV". Instead, we went out of our way to frame the challenge of alignment in a way that seemed almost comically simple and "un-philosophical", but that successfully captured all of the key obstacles: 'explain how to use an AI to cause there to exist two strawberries that are identical at the cellular level, without causing anything weird or disruptive to happen in the process'.
Since realistic pivotal acts still seemed pretty outside the Overton window (and since we were mostly focused on our own research at the time), we wrote up our basic thoughts about the topic on Arbital but didn't try to super-popularize the topic among rationalists or EAs at the time. (Which unfortunately, I think, exacerbated a situation where the larger communities had very fuzzy models of the strategic situation, and fuzzy models of what the point even was of this "alignment research" thing; alignment research just became a thing-that-was-good-because-it-was-a-good, not a concrete part of a plan backchained from concrete real-world goals.)
I don't think MIRI wants to stop using "aligned" in the context of pivotal acts, and I also don't think MIRI wants to totally divorce the term from the original long-term goal of friendliness/alignment.
Turning "alignment" purely into a matter of "get the AI to do what a particular stakeholder wants" is good in some ways -- e.g., it clarifies that the level of alignment needed for pivotal acts could also be used to do bad things.
But from Eliezer's perspective, this move would also be sending a message to all the young Eliezers: "Alignment Research is what you do if you're a serious sober person who thinks it's naive to care about Doing The Right Thing and is instead just trying to make AI Useful To Powerful People; if you want to aim for the obvious desideratum of making AI friendly and beneficial to the world, go join e/acc or something". Which does not seem ideal.
So I think my proposed solution would be to just acknowledge that 'the alignment problem' is ambiguous between three different (overlapping) efforts to figure out how to get good and/or intended outcomes from powerful AI systems:
- intent alignment, which is about getting AIs to try to do what the AI thinks the user wants, and in practice seems to be most interested in 'how do we get AIs to be generically trying-to-be-helpful'.
- "strawberry problem" alignment, which is about getting AIs to safely, reliably, and efficiently do a small number of specific concrete tasks that are very difficult, for the sake of ending the acute existential risk period.
- CEV-style alignment, which is about getting AIs to fully figure out how to make the future good.
Plausibly it would help to have better names for the latter two things. The distinction is similar to "narrow value learning vs. ambitious value learning", but both problems (as MIRI thinks about them) are a lot more general than just "value learning", and there's a lot more content to the strawberry problem than to "narrow alignment", and more content to CEV than to "ambitious value learning" (e.g., CEV cares about aggregation across people, not just about extrapolation).
(Note: Take the above summary of MIRI's history with a grain of salt; I had Nate Soares look at this comment and he said "on a skim, it doesn't seem to quite line up with my recollections nor cut things along the joints I would currently cut them along, but maybe it's better than nothing".)
In the context of a conversation with Balaji Srinivasan about my AI views snapshot, I asked Nate Soares what sorts of alignment results would impress him, and he said:
example thing that would be relatively impressive to me: specific, comprehensive understanding of models (with the caveat that that knowledge may lend itself more (and sooner) to capabilities before alignment). demonstrated e.g. by the ability to precisely predict the capabilities and quirks of the next generation (before running it)
i'd also still be impressed by simple theories of aimable cognition (i mostly don't expect that sort of thing to have time to play out any more, but if someone was able to come up with one after staring at LLMs for a while, i would at least be impressed)
fwiw i don't myself really know how to answer the question "technical research is more useful than policy research"; like that question sounds to me like it's generated from a place of "enough of either of these will save you" whereas my model is more like "you need both"
tho i'm more like "to get the requisite technical research, aim for uploads" at this juncture
if this was gonna be blasted outwards, i'd maybe also caveat that, while a bunch of this is a type of interpretability work, i also expect a bunch of interpretability work to strike me as fake, shallow, or far short of the bar i consider impressive/hopeful
(which is not itself supposed to be any kind of sideswipe; i applaud interpretability efforts even while thinking it's moving too slowly etc.)
I can come up with plans for destroying the world without wanting to do it, and other cognitive systems probably can too.
You're changing the topic to "can you do X without wanting Y?", when the original question was "can you do X without wanting anything at all?".
Nate's answer to nearly all questions of the form "can you do X without wanting Y?" is "yes", hence his second claim in the OP: "the wanting-like behavior required to pursue a particular training target X, does not need to involve the AI wanting X in particular".
I do need to answer that question using a goal-oriented search process. But my goal would be "answer Paul's question", not "destroy the world".
Your ultimate goal would be neither of those things; you're a human, and if you're answering Paul's question it's probably because you have other goals that are served by answering.
In the same way, an AI that's sufficiently good at answering sufficiently hard and varied questions would probably also have goals, and it's unlikely by default that "answer questions" will be the AI's primary goal.
The idea that an area of study is less scientific because the subject is inelegant is a blinkered view of what science is.
See my reply to Bogdan here. The issue isn't "inelegance"; we also lack an inelegant ability to predict or explain how particular ML systems do what they do.
Modern ML is less like modern chemistry, and more like ancient culinary arts and medicine. (Or "ancient culinary arts and medicine shortly after a cultural reboot", such that we have a relatively small number of recently-developed shallow heuristics and facts to draw on, rather than centuries of hard-earned experience.)
The opening sounds a lot like saying "aerodynamics used to be a science until people started building planes."
The reason this analogy doesn't land for me is that I don't think our epistemic position regarding LLMs is similar to, e.g., the Wright brothers' epistemic position regarding heavier-than-air flight.
The point Nate was trying to make with "ML is no longer a science" wasn't "boo current ML that actually works, yay GOFAI that didn't work". The point was exactly to draw a contrast between, e.g., our understanding of heavier-than-air flight and our understanding of how the human brain works. The invention of useful tech that interfaces with the brain doesn't entail that we understand the brain's workings in the way we've long understood flight; it depends on what the (actual or hypothetical) tech is.
Maybe a clearer way of phrasing it is "AI used to be failed science; now it's (mostly, outside of a few small oases) a not-even-attempted science". "Failed science" maybe makes it clearer that the point here isn't to praise the old approaches that didn't work; there's a more nuanced point being made.
Some of Nate’s quick thoughts (paraphrased), after chatting with him:
Nate isn’t trying to say that we have literally zero understanding of deep nets. What he’s trying to do is qualitatively point to the kind of high-level situation we’re in, in part because he thinks there is real interpretability progress, and when you’re working in the interpretability mines and seeing real advances it can be easy to miss the forest for the trees and forget how far we are from understanding what LLMs are doing. (Compared to, e.g., how well we can predict or post-facto-mechanistically-explain a typical system humans have engineered.)
Nobody's been able to call the specific capabilities of systems in advance. Nobody's been able to call the specific exploits in advance. Nobody's been able to build better cognitive algorithms by hand after understanding how the AI does things we can't yet code by hand. There is clearly some other level of understanding that is possible that we lack, and that we once sought, and that only the interpretability folks continue to seek.
E.g., think of that time Neel Nanda figured out how a small transformer does modular arithmetic (AXRP episode). If nobody had ever thought of that algorithm for an adder, we would have thereby learned a new algorithm for an adder. There are things that these AI systems are doing that aren’t just lots of stuff we know; there are levels of organization of understanding that give you the ability to predict how things work outside of the bands where we’ve observed them.
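(An aside from me, not part of Nate's paraphrase: for readers who haven't seen that result, the reverse-engineered algorithm, as I understand the write-up, turned out to be a Fourier / trig-identity trick rather than anything like a carry-based adder. Roughly, the score for each candidate answer c ends up proportional to a sum of cosines over a few key frequencies, which peaks exactly at c = (a + b) mod p. Here's a toy sketch, with illustrative frequencies rather than the ones the actual network learns:)

```python
import numpy as np

def modular_add_fourier_style(a: int, b: int, p: int = 113, freqs=(1, 2, 3)) -> int:
    """Toy version of the algorithm reverse-engineered from the grokked transformer:
    score every candidate answer c by summing cos(2*pi*k*(a + b - c)/p) over a few
    key frequencies k, then take the argmax. For prime p, every term hits its
    maximum of 1 only when c == (a + b) mod p, so that's where the summed score peaks."""
    c = np.arange(p)
    scores = sum(np.cos(2 * np.pi * k * (a + b - c) / p) for k in freqs)
    return int(np.argmax(scores))

assert modular_add_fourier_style(57, 80) == (57 + 80) % 113  # 24
```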
It seems trendy to declare that they never existed in the first place and that that’s all white tower stuff, but Nate thinks this point of view is missing a pretty important and central thread.
The missing thread isn’t trivial to put into words, but it includes things like:
- This sounds like the same sort of thing some people would say if they were staring at computer binary for the first time and didn't know about the code behind the scenes: "We have plenty of understanding beyond just how the CPU handles instructions; we understand how memory caching works and we have recognized patterns like the stack and the heap; talking as if there's some deeper level of organization is talking like a theorist when in fact this is an engineering problem." Those types of understanding aren't false, but they aren't the sort of understanding of someone who has comprehended the codebase they're looking at.
- There are, predictably, things to learn here; the messiness and complexity of the real world doesn’t mean we already know the relevant principles. You don't need to understand everything about how a bird works in order to build an airplane; there are compressible principles behind how birds fly; if you understand what's going on you can build flying devices that have significantly more carrying capacity than a bird, and this holds true even if the practical engineering of an airplane requires a bunch of trial and error and messy engineering work.
- A mind’s causal structure is allowed to be complicated; we can see the weights, but we don’t thereby have a mastery of the high-level patterns. In the case of humans, neuroscience hasn’t actually worked to give us a mastery of the high-level patterns the human brain is implementing.
- Mystery is in the map, not in the territory; reductionism works. Not all sciences that can exist, already exist today.
Possibly the above pointers are only useful if you already grok the point we’re trying to make, and aren't so useful for communicating a new idea; but perhaps not.
I read and responded to some pieces of that post when it came out; I don't know whether Eliezer, Nate, etc. read it, and I'm guessing it didn't shift MIRI, except as one of many data points of the form "person X is now loudly in favor of a pause (and other people seem receptive), so maybe this is more politically tractable than we thought".
I'd say that Kerry Vaughan was the main person who started smashing this Overton window, and this started in April/May/June of 2022. By late December my recollection is that this public conversation was already fully in swing and MIRI had already added our voices to the "stop building toward AGI" chorus. (Though at that stage I think we were mostly doing this on general principle, for lack of any better ideas than "share our actual long-standing views and hope that helps somehow". Our increased optimism about policy solutions mostly came later, in 2023.)
That said, I bet Katja's post had tons of relevant positive effects even if it didn't directly shift MIRI's views.