My Overview of the AI Alignment Landscape: Threat Models

post by Neel Nanda (neel-nanda-1) · 2021-12-25T23:07:10.846Z · LW · GW · 3 comments

Contents

  Introduction
  Power-Seeking AI
    The case
      Other criticisms:
    The work
  Sub-Threat model: Inner Alignment
    The Case
    The work
      Understanding:
      Solving:
  You get what you measure
    The case
    The work
  AI caused coordination failures
    The Case
    The Work
4 comments

This is the second post in a sequence mapping out the AI Alignment research landscape. The sequence will likely never be completed, but you can read a draft here.

Disclaimer: I recently started as an interpretability researcher at Anthropic, but I wrote this post before starting, and it entirely represents my personal views, not those of my employer.

Intended audience: People who understand why you might think that AI Alignment is important, but want to understand what AI researchers actually do and why.

Pedagogy note: I link to many papers and blog posts to read more about each area. I think technical writing is often harder to digest without a big picture in mind, so where possible I link to Alignment Newsletter summaries for a piece. There are a lot of links, so I recommend reading the summaries for anything interesting, but being selective about which full-length works you read.

Terminology note: There is a lot of disagreement about what “intelligence”, “human-level”, “transformative” or AGI even means. For simplicity, I will use AGI as a catch-all term for ‘the kind of powerful AI that we care about’. If you find this unsatisfyingly vague, OpenPhil’s definition of Transformative AI is my favourite precise definition.

Introduction

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.

It is obviously really hard to reason about the future in a specific way without being wildly off! But I am pretty excited about approaches like this. I think it's easy for research (or anything, really) to be meandering, undirected and not very useful, especially for vague and ungrounded problems such as AI Alignment, which is essentially trying to fix problems in a technology that doesn’t exist yet. And having a specific story to guide what you do can be a valuable source of direction, even if ultimately you know it will be flawed in many ways. Nate Soares makes the case for having a specific but flawed story in general well.

Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.

Pedagogy notes:

Power-Seeking AI

This is the classic case outlined by earlier proponents of AI Alignment, especially Nick Bostrom and Eliezer Yudkowsky. It is outlined most clearly in Superintelligence. Joseph Carlsmith recently wrote a more up-to-date report [AF · GW] examining a similar case, and distilling it down to a simpler set of assumptions.

The case

We produce AGI. We believe this will be a goal-directed agent, trying to maximise a goal. Our current techniques cannot shape the goals of AIs very precisely and, worse, human values are highly complex and nuanced and vary between people, making them extremely hard to specify precisely. This will plausibly still be true by the time we produce AGI, so we will probably not be able to give it precisely the right goal. Further, maximising most large-scale goals means the AGI will have many instrumentally convergent goals - it will want to gain power, influence, resources and avoid being turned off, because these are instrumentally helpful for a wide range of tasks.
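A minimal sketch of the instrumental convergence intuition, under made-up assumptions: an agent either stays switched on, which keeps three hypothetical outcomes reachable, or allows shutdown, which leaves only one. The reward levels and the four-outcome setup are illustrative assumptions, not taken from any of the linked reports. The code enumerates every assignment of rewards to outcomes and counts which goals favour which choice.

```python
from itertools import product

# Hypothetical reward values an arbitrary goal might assign to an outcome.
REWARD_LEVELS = (0, 1, 2)

prefer_on = prefer_off = indifferent = 0

# Each "goal" assigns a reward to four outcomes: one reachable only after
# shutdown, and three reachable only if the agent stays switched on.
for r_shutdown, *r_on in product(REWARD_LEVELS, repeat=4):
    best_if_on = max(r_on)  # the agent picks its favourite reachable outcome
    if best_if_on > r_shutdown:
        prefer_on += 1
    elif best_if_on < r_shutdown:
        prefer_off += 1
    else:
        indifferent += 1

print("goals that favour staying on:   ", prefer_on)
print("goals that favour shutdown:     ", prefer_off)
print("goals indifferent between them: ", indifferent)
```

In this toy setting, far more goals favour staying on than favour shutdown, simply because staying on keeps more options reachable. It says nothing about whether real systems will be goal-directed in the first place, which is one of the assumptions the case depends on.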

As goal specification is so hard, the AGI will inevitably want different things from us. It will have superhuman planning capabilities, meaning it will be better at coming up with ways to get what it wants than we will. And so it will likely come up with creative plans that we cannot predict and successfully guard against, because it is very hard to outwit something significantly smarter than you. A specific way this could go wrong is by creating an incentive to deceive us, to act perfectly aligned and to pass all tests we give it, until it can gain enough influence to decisively take power: a treacherous turn. This is not necessarily how things would actually go down; the key point is that if a system is better at planning than us, has different goals, and can influence the world, this can go wrong in many catastrophic ways.

Personally, I overall find this case fairly persuasive, and I expect there are significant grains of truth in this. It is by far the oldest and most established of the threat models I discuss, and has seen far more rigorous treatment than the others, but could still do with significantly more study. In particular, simplistic discussions of this model often bake in significant implicit assumptions, and it has often faced criticism.

When first encountering this case, it’s easy to assume it must apply to future powerful AIs, which I don’t think is obvious. I find it helpful to distill out the implicit assumptions. One set of maybe sufficient assumptions (mostly borrowed from Rohin Shah’s summary of Joseph Carlsmith’s report [AF · GW]):

Other criticisms:

The work

Sub-Threat model: Inner Alignment

A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization [? · GW].

I think this is extremely important but notoriously hard to get your head around. Accessible overviews: Rob Miles, Rafael Harth [LW · GW]. Sources to learn more: Evan’s interviews on FLI and AXRP [? · GW], the Risks from Learned Optimization paper [? · GW].

The Case

We first begin with the analogy of humans and evolution: From a certain point of view, evolution is an optimization process that searches over the space of possible organisms and finds those that are good at reproducing. Evolution eventually produced humans, who are themselves optimizers, and we care about a range of goals, such as status, pleasure, art, knowledge, writing posts for the Alignment forum, etc. And in the ancestral environment, pursuing these goals resulted in significant reproductive success. But in the modern world we continue to optimize our goals, yet totally fail to maximise reproductive success, eg by using birth control. Thus, from the perspective of evolution, humans are misaligned.

The key feature of the setup here is that we had a base optimizer, evolution, an optimization process searching over possible systems according to how well they performed on a base objective, reproductive success. And this base optimizer eventually found a system, humans, that was itself optimizing. Humans are an example of a mesa-optimizer, an optimizing system found by a base optimizer, and humans are pursuing mesa-objective(s).


The core problem is that the base objective (reproductive success) and the mesa objective (status, pleasure, etc) are not the same. This happened because evolution only cares about the performance of a system in the ancestral environment, rather than what the system’s mesa-objective truly is. And there are many possible mesa-objectives that will lead to reproductive success in the ancestral environment, but may lead to totally different outcomes in other environments - as happened with humans.

This setup is similar to modern deep learning: we search over possible neural network weights with stochastic gradient descent (SGD), the base optimizer, according to our loss function, the base objective. And, further, SGD only pays attention to a network’s performance on the loss function on the training data, and pays no attention to how a network actually works. So the concern is that deep learning may result in neural networks that are optimizing systems pursuing mesa-objectives, but we have no way of ensuring these objectives are the same as the base objective.

This concern introduces significant further complexity into the alignment problem. Optimization is scary, and a highly capable system pursuing an objective misaligned with ours will likely lead to bad outcomes. But with mesa-optimizers, we have two objectives: the base objective, and the mesa-objective. So we need to both solve the outer alignment problem, ensuring the base objective is aligned with human values, and the inner alignment problem, ensuring that the mesa-objective is the same as the base objective.

A key feature of the inner alignment problem is that the base objective underdetermines the mesa-objective. Our main tool for reasoning about the outcome of training a neural network is evaluating which parameters lead to good performance on the training data. This tool breaks down here, as there are likely many mesa-objectives that perform well on the base objective on the training data, some of which will be aligned (as in, they generalise safely to new environments), some of which will not be. So the key question is which mesa-objective we will end up with.
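A toy sketch of this underdetermination, assuming a hypothetical one-dimensional corridor environment (the `Env` class, both policies and all names are illustrative, not from Risks from Learned Optimization). Two policies pursuing different objectives score identically on the base objective during training, because in every training environment the goal happens to be on the right. They only come apart once the environment shifts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Env:
    """Hypothetical corridor: the agent starts in the middle, the goal is at one end."""
    goal_on_right: bool


def base_objective(env: Env, moved_right: bool) -> float:
    """Base objective: reward 1.0 if the agent ends up at the goal, else 0.0."""
    return 1.0 if moved_right == env.goal_on_right else 0.0


def policy_seek_goal(env: Env) -> bool:
    """Objective A: head towards wherever the goal is."""
    return env.goal_on_right


def policy_go_right(env: Env) -> bool:
    """Objective B: always head right, ignoring the goal."""
    return True


def average_reward(policy, envs) -> float:
    return sum(base_objective(env, policy(env)) for env in envs) / len(envs)


# In training the goal is always on the right, so both objectives look identical.
train_envs = [Env(goal_on_right=True)] * 10
# After a distribution shift the goal is on the left.
shifted_envs = [Env(goal_on_right=False)] * 10

for policy in (policy_seek_goal, policy_go_right):
    print(
        policy.__name__,
        "train:", average_reward(policy, train_envs),
        "shifted:", average_reward(policy, shifted_envs),
    )
```

A training process that only sees the scores on `train_envs` has no way to tell the two policies apart, which is the sense in which the base objective underdetermines the mesa-objective.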

In practice, if we end up with mesa-optimizers, they will have performed well on the base objective on the training data. There are many ways this could happen; here are three of the most important:

I think there is the seed of an important idea here, but a lot of the discussion seems divided and confused, especially regarding what terms like optimizer actually mean. (See eg Evan Hubinger’s Clarifying Inner Alignment terminology [AF · GW]). And while humans fail inner alignment, humans do not seem like expected utility maximisers. Personally, I’m not convinced that we will ever produce neural networks that act like expected utility maximisers.

Another framing that side-steps the question of defining optimization is the 2D model of robustness [AF · GW]. When we successfully train a model to act in an environment, it will take purposeful actions to achieve the intended objective. But when we shift the model to a different environment, there are three things that can happen. It may fail to take any purposeful actions at all, it may take purposeful actions but not towards the intended objective (its capabilities have generalised but its objective has not) or it may take purposeful actions towards the intended objective (its capabilities and objective have generalised). This breaks the question of ‘does the model successfully generalise?’ into the questions of ‘do the model’s capabilities generalise?’ and ‘does the model’s objective generalise?’. This is a helpful distinction, because causing an existential catastrophe is really hard, and so is much more likely to occur from an agent taking purposeful actions and capable of planning.
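The three cases can be written down as a small lookup, purely as an illustration. The names `Outcome` and `classify` are hypothetical, and in practice neither question has a clean yes/no answer.

```python
from enum import Enum


class Outcome(Enum):
    NO_PURPOSEFUL_ACTION = "fails to take purposeful actions at all"
    CAPABLE_BUT_MISDIRECTED = "purposeful actions, but not towards the intended objective"
    CAPABLE_AND_ALIGNED = "purposeful actions towards the intended objective"


def classify(capabilities_generalise: bool, objective_generalises: bool) -> Outcome:
    """Map the two generalisation questions onto the three outcomes."""
    if not capabilities_generalise:
        # Without generalised capabilities, the objective question is moot.
        return Outcome.NO_PURPOSEFUL_ACTION
    if not objective_generalises:
        return Outcome.CAPABLE_BUT_MISDIRECTED
    return Outcome.CAPABLE_AND_ALIGNED


for caps in (False, True):
    for obj in (False, True):
        print(caps, obj, "->", classify(caps, obj).value)
```

The worrying case is the middle one: the model remains competent after the shift, but is now competently pursuing something else.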

Personally, I find the inner misalignment threat model to be incredibly compelling, and it was a major factor in my decision to work on interpretability. But I’m not necessarily convinced by any of the more specific framings, eg specific narratives of what a mesa-optimizer might look like. My best attempt to distill out the core argument is as follows:

The work

This is a new and fairly poorly understood problem - it’s not even obvious that we will get mesa-optimizers - so I divide the work into understanding the problem and solving the problem.

Understanding:

Solving:

You get what you measure

This is the threat model outlined in What Failure Looks Like (Part 1) [? · GW] by Paul Christiano. I found the post insightful, but also somewhat cryptic, and found these clarifications from Ben Pace [AF · GW] and Sam Clarke [? · GW] helpful. Paul Christiano’s Another (outer) alignment failure story [AF · GW] is another story outlining a related threat model, which I also found helpful. Anecdotally, some researchers I respect take this very seriously - it was narrowly rated the most plausible threat model in a recent survey [AF · GW]. This case has been less fleshed out than those above, so the following is more my attempt to flesh out and steelman the case and less focused on summarising existing work.

The case

Reinforcement learning systems are great at optimizing simple reward functions in clever and creative ways, and are getting better at optimizing all the time, but we struggle to optimize complex reward functions, and are seeing much less progress there. As AI systems become more influential on the world and a bigger part of the global economic system, we will want them to achieve complex and nuanced goals, as human values are complex and nuanced. Assuming that we remain much better at achieving simple rewards, this means we will need to approximate our true goals and define a proxy goal for the system. And if enough optimisation power is applied to a proxy goal, eventually these imperfections will become magnified, resulting in potentially catastrophic outcomes.

This pattern of simple, easy-to-measure proxies to achieve complex goals is widespread (the formal jargon is Goodhart’s Law). For example, GDP is often used as an easy-to-measure proxy for measuring prosperity - this can work fairly well, but misses out on major components such as life satisfaction. Or, academia is intended to be a system to produce good science and advance human knowledge, by incentivising academics to publish rapidly, get many citations and publish in high-impact journals (this one often fails).
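A toy numerical sketch of how a proxy can come apart from the true goal as more optimisation power is applied. Both functions below are invented for illustration: the proxy simply rewards more of some measurable quantity `x`, while the hypothetical true value also includes a side-effect cost that grows quadratically and is invisible to the proxy.

```python
def proxy_score(x: int) -> int:
    """Easy-to-measure proxy: more x always looks better."""
    return x


def true_value(x: int) -> int:
    """Hypothetical true goal: benefits of x minus growing, unmeasured side effects."""
    return 10 * x - x * x


def optimise_proxy(search_budget: int) -> int:
    """Pick the x in 0..search_budget with the best proxy score.

    A larger search budget stands in for more optimisation power.
    """
    return max(range(search_budget + 1), key=proxy_score)


for budget in (2, 5, 10, 20):
    x = optimise_proxy(budget)
    print(f"budget={budget:2d}  proxy={proxy_score(x):3d}  true value={true_value(x):5d}")
```

With a small budget, pushing on the proxy also improves the true value (16 at budget 2, 25 at budget 5). With more optimisation power, the proxy keeps climbing while the true value falls to 0 at budget 10 and to -200 at budget 20. The specific numbers mean nothing; the shape is the point.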

The notion of simple vs complex reward functions is doing a lot of work here, and is hard to define explicitly. Intuitively, I think of simple as “easy to measure” - could I give a system lots of samples from this reward function while training? In practice, reinforcement learning systems are often trained from very easy to measure functions, such as the score in a video game. It may be possible to train a system on more complex rewards, eg by having it directly ask a human for feedback, but we need systems trained on these complex rewards to also be competitive with systems trained on simple rewards - can we get comparable performance at comparable cost?

This phenomenon is not specific to AI: the world is already heavily shaped by systems optimising simple proxies, eg corporations maximising profit, and this is not (yet) an existential catastrophe. So why be concerned about AI?

One major reason that this is not currently a catastrophe is that society shapes and updates these proxies as the imperfections become clear through tools such as regulation. For example, ‘maximise profit’ is a bad proxy for ‘make society better’ as it doesn’t account for costs to third parties such as pollution. But we live in a world which is far less polluted than it could be, thanks to taxes and laws about pollution.

But this error-correction mechanism may break down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.

Pace: How rapidly is the technology being developed and deployed? When trying to react to and regulate new technologies, it is much harder when things are moving at a fast pace - when things are slow, you have more time to react, coordinate, learn from failures, etc. For example, governments are having a really hard time regulating new technologies like drones and social media. AI is developing extremely rapidly even today, and if it becomes a significant fraction of global GDP this could plausibly be much worse, as far more resources will be put into it. (Note: This is not an argument for discontinuous/fast takeoff; a ‘slow’ takeoff would still likely be very hard to respond to. This is discussed more in the forthcoming key considerations post.)

Comprehensibility: Can we see what the system is doing and why? If so, it’s much easier to identify problems and notice them early. For example, regulating recommender systems is particularly hard because it’s hard to tell how the algorithm is making decisions, eg concerns around the Facebook algorithm radicalising people. A related point is that when there is a problem that will require coordination and decisive action to solve, this is much easier with legible, uncontroversial and early evidence. For example, smoking is terrible for you, but it took a long time to realise this and discourage use because the link to lung cancer is noisy and acts on long time horizons. AI is currently mostly an incomprehensible black box, and will likely remain that way without significant progress in interpretability.

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world. See Sam Clarke’s excellent post [? · GW] for more discussion of examples of lock-in.

So, how bad is all this? My personal take is that an inappropriate focus on optimising metrics is clearly already happening in the world today, is causing many bad effects (and many good ones!) and that AI will plausibly make this significantly worse. But it is highly unclear that this actually results in existential risk. Maybe the AIs will cause terrible collateral damage to eg the atmosphere or drinkable water (see discussion [AF(p) · GW(p)]), maybe they’ll never cause a catastrophe but result in the lock-in of suboptimal values, maybe they’ll cause a bunch of short-term damage but we’ll manage to fix things. It’s very unclear! As a brief aside, I’ve also updated in favour of outcomes like this over the course of the COVID-19 pandemic - as of the time of writing, there are numerous examples of things I consider to be obvious errors that have been left unfixed for a while (not widely using fluvoxamine, not preparing more for future pandemics, etc).

The work

AI caused coordination failures

This is the case I've seen most pushed by Andrew Critch, David Krueger and Allan Dafoe. Critch and Krueger discuss it in their ARCHES paper [AF · GW], and Critch discusses it on the FLI podcast and in What Multipolar Failure Looks Like [AF(p) · GW(p)]. Allan Dafoe discusses a related notion in Open Problems in Cooperative AI and this is a focus of his new foundation, the Cooperative AI Foundation. There hasn’t been that much work fleshing out this case, and I don’t understand it as well as I’d like to, so the following is my interpretation and my best attempt to steelman the core ideas, rather than solely my attempt at a summary. I am much less confident in this section than the previous ones.

The Case

This threat model stems from a worldview that sees cooperation and coordination failures as a fundamental lens through which to understand the world. Cooperation is hard and unstable, and coordination failures are the default state of the world, yet successful cooperation is the root of much of the value in the world. The concern centres around AI destabilising the current institutions and norms that enable cooperation, and causing coordination failures. (Terminology note: Cooperation here can encompass cooperation between humans and humans, between AIs and AIs, and between humans and AIs.)

I see this less as a single coherent case and more as a general prior that cooperation is hard yet crucial, and that destabilisation will be bad. There are a bunch of specific points and stories, but I think you can disagree with those while buying the overall case.

Some rough intuitions for a cooperation-centric worldview: Cooperation is unstable, because it involves many actors working together, where each actor is self-interestedly incentivised to defect, in a way that causes costs to others. Enough actors are self-interested that you need good institutions and norms to avoid them defecting. And most of human history is defined by being in a perpetual state of war and conflict. In modern times, some coordination failures have been extremely bad, eg WW1 and WW2, climate change, air pollution, etc. By contrast, when we get cooperation right, eg trade, peace, well-functioning governments, etc, this is responsible for a lot of the progress humanity has made.

So, why would AI make cooperation worse/harder?

Maybe you agree that cooperation would be harder, and that this would be bad. But would this lead to an existential risk? I personally find this fairly unclear, and don’t feel very compelled by any particular story, but I find it plausible that this could lead to extremely bad outcomes. See Section 3 of ARCHES and What Multipolar Failure Looks Like [AF(p) · GW(p)] for more discussion.

One significant risk centres around collateral damage: the side effects of the coordination failures cause damage to eg the atmosphere or drinkable water, and this causes humanity to die out. The underlying intuition here is one of human fragility - there is a large range of possible ways the Earth could be (temperature, composition of the atmosphere, etc) that could lead to machines thriving, while humans need a fairly specific environment. Keeping the Earth a good place for human life will likely be expensive, so unless AIs care highly about this and make a special effort to maintain it, we should not expect this to go well by default. This is an important argument, because it does not require an AI system to be an agent actively optimising for harming humans, or even for the resulting ecosystem of AIs to be viewable as coherently optimising anything at all.

This feels related to the concerns outlined in ‘You get what you measure [AF · GW]’, but different. Before, AI systems cause collateral damage because the damage was instrumentally useful to their goals, and they weren’t programmed to care about the harms. Here, each agent may care about the harms, but not enough - if each agent only plays a small marginal part in the coordination failure, they may not be incentivised to change it. This is analogous to how, today, each country may find it valuable to burn fossil fuels, and accept the cost of their marginal contribution to climate change, even though most countries make a net loss on the total benefits and costs of global fossil fuel usage and climate change - the issue is that each country captures most of the benefits of their actions and a small fraction of their costs.
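The incentive structure in the fossil fuel analogy can be made concrete with a worked toy example. All numbers are invented for illustration: ten hypothetical countries, where burning gives the burner a private benefit of 3 and imposes a total cost of 10 that is shared equally across everyone.

```python
N_COUNTRIES = 10      # hypothetical number of identical countries
PRIVATE_BENEFIT = 3   # benefit a country captures itself by burning
SHARED_COST = 10      # total harm per burner, split equally across all countries


def payoff(burns: bool, others_burning: int) -> int:
    """Net payoff to one country, given its choice and how many others burn."""
    total_burners = others_burning + int(burns)
    # Each country bears a 1/N share of the harm caused by every burner.
    cost_share = total_burners * SHARED_COST // N_COUNTRIES
    return (PRIVATE_BENEFIT if burns else 0) - cost_share


for others in (0, N_COUNTRIES - 1):
    print(
        f"others burning={others}: "
        f"burn -> {payoff(True, others)}, abstain -> {payoff(False, others)}"
    )
```

Whatever the others do, a country is 2 better off by burning, since it captures the full benefit of 3 and only a tenth of the cost. But if every country follows that logic, each ends up at -7, compared with 0 if nobody had burned. No individual agent needs to be indifferent to the harm for the collective outcome to be bad.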

Another risk is that when multiple AI systems are interacting, their interactions can cause unexpected feedback loops that bring them far outside of their training distribution, resulting in extreme and unexpected behaviour. One mundane example of this is the 2010 Flash Crash, where interactions between badly programmed stock market trading bots resulted in a crash, which temporarily wiped out roughly a trillion dollars of market value in minutes before recovering. A more speculative version of this raised by Critch is the Production Web [AF · GW], where the entire economy becomes automated and dominated by AIs, which build on each other and cause a great deal of growth, but this growth does not cash out as morally relevant things such as increased human welfare, and the economy ceases to be human comprehensible. This seems particularly concerning with new ideas coming out of the crypto world such as Decentralized Autonomous Organizations, which could result in an economy where most resources are not, ultimately, controlled by humans, and which could become totally out of control. Nick Bostrom described the ultimate outcome of this kind of process run wild as ‘a Disneyland with No Children’.

A useful framework introduced in ARCHES for thinking about AI is that of a delegation problem - operators make AI systems, and want the AI systems to act to achieve their values. This delegation problem is very different depending on whether there are one or many operators, and one or many AIs, giving four different scenarios:

One insight from this framing is that we should expect the multi-multi delegation problem to be neglected. The Alignment problem, as normally conceived, is single-single delegation, and it is plausible that the creators of AGI will put significant resources into solving it (though likely not enough!), since it is clearly their responsibility. But no one has a clear responsibility to solve multi-multi delegation, and far fewer resources are invested into it today. Yet, given how much greater train-time compute is than run-time compute for modern ML, it seems likely that once we have trained AGI, we will run many copies of it, and plausible that many different actors will have access to AGI. This means that the multi-multi delegation problem will become relevant at almost exactly the same time as single-single, and is plausibly a much harder problem.

The Work

I expect much of the useful work here to be policy centred, eg creating good institutions, regulations and norms around AI use, incentivising international cooperation, etc. But the proponents also argue that there is important technical work to be done to create AI agents better able to cooperate, which is what I’ll focus on here:

NOTE: This was intended to be a full sequence, but will likely be eternally incomplete - you can read a draft of the full sequence here.

3 comments

Comments sorted by top scores.

comment by Mau (Mauricio) · 2021-12-26T06:52:27.753Z · LW(p) · GW(p)

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model). I'll try to address two defenses of that (of framing them as distinct threat models) which I interpret this post as suggesting (in the context of this earlier comment [EA(p) · GW(p)] on the overview post). Broadly, I'll be arguing that: power-seeking AI is necessary for "you get what you measure" issues posing existential threats, so "you get what you measure" concerns are best thought of as a sub-threat model of power-seeking AI.

(Edit: An aspect of "you get what you measure" concerns--the emphasis on something like "sufficiently strong optimization for some goal is very bad for different goals"--is a tweaked framing of power-seeking AI risk in general, rather than a subset.)

Lock-in: Once we’ve noticed problems, how difficult will they be to fix, and how much resistance will there be? For example, despite the clear harms of CO2 emissions, fossil fuels are such an indispensable part of the economy that it’s incredibly hard to get rid of them. A similar thing could happen if AI systems become an indispensable part of the economy, which seems pretty plausible given how incredibly useful human-level AI would be. As another example, imagine how hard it would be to ban social media, if we as a society decided that this was net bad for the world.

Unless I'm missing something, this is just an argument for why AI might get locked in--not an argument for why misaligned AI might get locked in. AI becoming an indispensable part of the economy isn't a long-term problem if people remain capable of identifying and fixing problems with the AI. So we still need an additional lock-in mechanism (e.g. the initially deployed, misaligned AI being power-seeking) to have trouble. (If we're wondering how hard it will be to fix/improve non-power-seeking AI after it's been deployed, the difficulty of banning social media doesn't seem like a great analogy; a more relevant analogy would be the difficulty of fixing/improving social media after it's been deployed. Empirically, this doesn't seem that hard. For example, YouTube's recommendation algorithm started as a click-maximizer, and YouTube has already modified it to learn from human feedback.)

See Sam Clarke’s excellent post [? · GW] for more discussion of examples of lock-in.

I don't think Sam Clarke's post (which I'm also a fan of) proposes any lock-in mechanisms that (a) would plausibly cause existential catastrophe from misaligned AI and (b) do not depend on AI being power-seeking. Clarke proposes five mechanisms by which Part 1 of "What Failure Looks Like" could get locked in -- addressing each of these in turn (in the context of his original post [AF · GW]):

  • (1) short-term incentives and collective action -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (2) regulatory capture -- the worry here is that the companies controlling AI might have and permanently act on bad values; this arguably fails condition (a), because if we're mainly worried about AI developers being bad, then focusing on intent alignment doesn't make that much sense.
  • (3) genuine ambiguity -- arguably fails condition (a) or fails condition (b); if we don't assume AI will be power-seeking, then I see no reason why these difficulties would get much worse in hundreds of years than they are now, i.e. no reason why this on its own is a lock-in mechanism.
  • (4) dependency and deskilling -- addressed above
  • (5) [AI] opposition to [humanity] taking back influence -- clearly fails condition (b)

So I think there remains no plausible alignment-relevant threat model for "You get what you measure" that doesn't fall under "power-seeking AI."

Replies from: paulfchristiano
comment by paulfchristiano · 2021-12-26T07:08:18.090Z · LW(p) · GW(p)

I'm still pretty confused by "You get what you measure" being framed as a distinct threat model from power-seeking AI (rather than as another sub-threat model)

I also consider catastrophic versions of "you get what you measure" to be a subset/framing/whatever of "misaligned power-seeking." I think misaligned power-seeking is the main way the problem is locked in.

To a lesser extent, "you get what you measure" may also be an obstacle to using AI systems to help us navigate complex challenges without quick feedback, like improving governance. But I don't think that's an x-risk in itself, more like a missed opportunity to do better. This is in the same category as e.g. failures of the education system, though it's plausibly better-leveraged if you have EA attitudes about AI being extremely important/leveraged. (ETA: I also view AI coordination, and differential capability progress, in a similar way.)

comment by adamShimi · 2021-12-26T18:01:25.332Z · LW(p) · GW(p)

Thanks so much for the effort you're putting into this work! It looks particularly relevant to my current interest of understanding the different approximations and questions used in alignment, and what forbids us the Grail of paradigmaticity.

Here is my more concrete feedback

A common approach when setting research agendas in AI Alignment is to be specific, and focus on a threat model. That is, to extrapolate from current work in AI and our theoretical understanding of what to expect, to come up with specific stories for how AGI could cause an existential catastrophe. And then to identify specific problems in current or future AI systems that make these failure modes more likely to happen, and try to solve them now.

Given that AFAIK it’s Rohin who introduced the term in alignment, linking to his corresponding talk might be a good idea. I also like this drawing from his slides, which might clarify the explanation for more visual readers.

While I’m at threat models, you confused me at first because “threat model” always makes me think of “development model”, and so I expected a discussion of seed AI vs Prosaic AI vs Brain-based-AGI vs CAIS vs alternatives. What you do instead is more a discussion of “risk models”, with a mention in passing that the first one traditionally came from the more seed AI development model.

Of course that’s your choice, but neglecting a bunch of development models with a lot of recent work, notably the brain-based AGI model of Steve Byrnes, feels inconsistent with the stated aim of the sequence: “mapping out the AI Alignment research landscape”.

And having a specific story to guide what you do can be a valuable source of direction, even if ultimately you know it will be flawed in many ways. Nate Soares makes the case for having a specific but flawed story in general well.

My first reaction when reading this part was “Hmm, that doesn’t seem to be exactly what Nate is justifying here”. After rereading the post, I think what disturbed me was my initial reading that you were saying something like “the correctness of a threat model doesn’t matter, you just choose one and do stuff”. That is not what either you or Nate is saying; instead, the point is that spending all your time waiting for a perfect plan/threat model is less productive than taking the best option available, getting your hands dirty and trying things.

Note that I think there is very much a spectrum between this category and robustly good approaches (a forthcoming post in this sequence). Most robustly good ways to help also address specific threat models, and many ways to address specific threat models feel useful even if that specific threat model is wrong. But I find this a helpful distinction to keep in mind.

This sounds to me like a better defense of threat model thinking, and I would like to read more about your ideas (especially the last two sentences).

When naively considered, this framework often implicitly thinks of intelligence as a mysterious black box that caches out as 'better able to achieve plans than us', without much concrete detail. Further, it assumes that all goals would lead to these issues.

I agree with the gist of the paragraph, but “all goals” is an overstatement: both Nick Bostrom and Steve Omohundro note that some goals obviously don’t have power-seeking incentives, like the goal of dying as fast as possible. They say that most goals would have instrumental subgoals, which is the part that Richard Ngo criticizes and Alex Turner formalizes [? · GW].

See Tom Adamczewski’s discussion of how arguments have shifted

Oh, awesome resource! Thanks for the link!

Understanding the incentives and goals of the agent, and how the training process can affect these in subtle ways

I feel like you should definitely mention Alex Turner’s work [? · GW] here, where he formalizes Bostrom’s instrumental convergence thesis.

Limited optimization: Many of these problems inherently stem from having a goal-directed utility-maximiser, which will find creative ways to achieve these goals. Can we shift away from this paradigm?

Shouldn’t you include work on impact measures here? For example this survey post [AF · GW] and Alex Turner’s sequence [? · GW].

A particularly concerning special case of the power-seeking concern is inner misalignment. This was an idea that had been floating around MIRI for a while, but was first properly clarified by Evan Hubinger in Risks from Learned Optimization [? · GW].

Evan is adamant that the paper was done equally by all coauthors, and so should be cited as done by “Evan Hubinger, Chris van Merwijk, Vladimir Mikulik, Joar Skalse, Scott Garrabrant”.

Sub-Threat model: Inner Alignment

I feel like you’re sticking a bit close to the paper’s case, when there are more compact statements of the problem. Especially with your previous case, you could just say that inner alignment is about justifying power-seeking behavior and treacherous turns in the case where the AI is found by search instead of programmed by hand.

Plausibility of misaligned cognition: It is likely that, in practice, we will end up with networks with misaligned cognition

There’s also an argument that deception is robust once it has been found: making a deceptive model less deceptive would make it do more of what it really wants to do, and so incur a worse loss, which means SGD does not push it out of deception.

Better understanding how and when mesa-optimization arises (if it does at all).

One cool topic here is gradient hacking — see for example this recent survey [AF · GW].

Anecdotally, some researchers I respect take this very seriously - it was narrowly rated the most plausible threat model in a recent survey [AF · GW].

I want to note that this scenario looks more normal, which makes me think that by default, anyone would find this more plausible than the Bostrom/Yudkowsky scenario due to normalcy bias. So I tend to cancel this advantage when looking at what scenario people favor.

But this error-correction mechanism may break-down for AI. There are three key factors to analyse here: pace, comprehensibility and lock-in.

I like this decomposition!

So, why would AI make cooperation worse/harder?

At least for Critch’s RAAPs, my understanding is that it’s mostly Pace that makes a difference: the process already exists, but it’s not moving as fast as it could because of the fallibility of humans and because of legislation and restrictions. Replacing humans with AIs in most tasks removes the slowdown, and so the process moves faster, towards loss of control.

comment by lukeprog · 2022-01-09T14:17:51.809Z · LW(p) · GW(p)