My research methodology

post by paulfchristiano · 2021-03-22T21:20:07.046Z · LW · GW · 36 comments

Contents

  What this looks like (3 examples)
    Example 1: human feedback
    Example 2: iterated amplification
    Example 3: imitative generalization
  More general process
  Objections and responses
    Can you really come up with a working algorithm on paper? Empirical work seems important
    Why think this task is possible? 50% seems way too optimistic
    These failure stories involve very unrealistic learned models
    Are there any examples of a similar research methodology working well? This is different from traditional theoretical work
36 comments

(Thanks to Ajeya Cotra, Nick Beckstead, and Jared Kaplan for helpful comments on a draft of this post.)

I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post).

Most possible ML techniques for avoiding egregious misalignment depend on detailed facts about the space of possible models: what kind of thing do neural networks learn? how do they generalize? how do they change as we scale them up?

But I feel like it should be possible to avoid egregious misalignment regardless of how the empirical facts shake out--it should be possible to get a model we build to do at least roughly what we want. So I’m interested in trying to solve the problem in the worst case, i.e. to develop competitive ML algorithms for which we can’t tell any plausible story about how they lead to egregious misalignment.

This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:

I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.

What this looks like (3 examples)

My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”

Example 1: human feedback

In an unaligned benchmark I describe a simple AI training algorithm:

In the same post, I describe a plausible story about how this algorithm leads to egregious misalignment:

I don’t know if or when this kind of reward hacking would happen — I think it’s pretty likely eventually, but it’s far from certain and it might take a long time.

But from my perspective this failure mode is at least plausible — I don’t see any contradictions between this sequence of events and anything I know about the real world. So this is enough for me to conclude that human feedback can’t handle the worst plausible situation, and to keep looking for an algorithm that can.

To better understand whether this story is really plausible, we can spend time refining it into something more and more concrete to see if it still seems to make sense. There are lots of directions in which we could add detail:

Filling in more and more details lets us notice if our abstract story was actually incoherent in important ways, or notice weird things the story implies about the world that we might want to rule out by assumption.

Example 2: iterated amplification

To avoid the problems with raw human feedback, we could train additional ML assistants that help us evaluate outcomes. For example, assistants could point out possible consequences of a plan that we didn’t notice. Variants of this idea are explored in benign model-free RL, supervising strong learners by amplifying weak experts, AI safety via debate, and recursive reward modeling.

In inaccessible information I tried to explore a story about how this entire family of algorithms could fail:

Example 3: imitative generalization

Imitative generalization is intended to address this problem with iterated amplification.

To briefly summarize: instead of using gradient descent to search over a space of human-incomprehensible models that predict some data (e.g. autoregressive models of videos), we try to search over a space of models that a human can “understand” (perhaps with the help of aligned assistants as in amplification or debate), and optimize for a model that both looks plausible to the human and allows the human to successfully predict the same data (i.e. to predict the next pixel of a video). We hope that this allows us to find a human-comprehensible model that allows the human to both predict the data and figure out if the camera is being hacked.

(This algorithm is quite vague, so you could think of it as a whole family of algorithms based on how you parametrize the space of “human-comprehensible” models, how you search over that space, and how you define the prior. I’m going to try to tell a story about the limitations of this whole approach.)
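As a rough illustration of the objective described above, here is a toy sketch in Python. It is not the algorithm from the imitative generalization post, which is left vague on purpose. It assumes a small enumerated set of human-readable hypotheses in place of a real search, a hand-written prior in place of "looks plausible to the human," and a stub function in place of the human's predictions. All names and numbers (`DATA`, `CANDIDATES`, `score`, `best_hypothesis`) are hypothetical.

```python
import math
from typing import Callable, Dict, List, Tuple

# Hypothetical toy dataset of (input, label) pairs.
DATA: List[Tuple[int, int]] = [(1, 0), (2, 1), (3, 0), (4, 1), (6, 1)]

# Each candidate is a human-readable hypothesis z, paired with:
#   - a prior probability standing in for "how plausible z looks to the human"
#   - a function standing in for the human's prediction when using z:
#     the probability they assign to label 1 on input x.
Candidate = Tuple[float, Callable[[int], float]]
CANDIDATES: Dict[str, Candidate] = {
    "label is 1 when x is even": (0.5, lambda x: 0.9 if x % 2 == 0 else 0.1),
    "label is 1 when x > 2": (0.3, lambda x: 0.9 if x > 2 else 0.1),
    "label is always 1": (0.2, lambda x: 0.9),
}


def score(candidate: Candidate, data: List[Tuple[int, int]]) -> float:
    """Return log prior(z) plus the sum of log P_human(y | x, z) over the data."""
    prior, predict_one = candidate
    total = math.log(prior)
    for x, y in data:
        p_one = predict_one(x)
        total += math.log(p_one if y == 1 else 1.0 - p_one)
    return total


def best_hypothesis(candidates: Dict[str, Candidate],
                    data: List[Tuple[int, int]]) -> str:
    """Pick the hypothesis that is both plausible and lets the human predict well."""
    return max(candidates, key=lambda name: score(candidates[name], data))


best = best_hypothesis(CANDIDATES, DATA)
```

The sketch shows only the shape of the objective: a hypothesis wins by trading off prior plausibility against how well a human using it predicts the data. The hard parts, namely how to parametrize and search the space of human-comprehensible models, are exactly what this toy enumerates away.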

Here’s an exotic situation where I think the naive version of this approach wouldn’t work:

There are many obvious ways to try to address this problem, but I think it does break the most obvious implementations of imitative generalization. So now I have two questions:

After a little bit of inspection it turns out that the original story is inconsistent: it’s literally impossible to run a detailed low-level simulation of physics in situations where the computer itself needs to be part of the simulation. So the story as I told it is inconsistent, and we can breathe a temporary sigh of relief.

Unfortunately, the basic problem persists even when we make the story more complicated and plausible. Our AI inevitably needs to reason about some parts of the world in a heuristic and high-level way, but it could still use a model that is lower-level than what humans are familiar with (or more realistically just alien but simpler). And at that point we have the same difficulty.

It’s possible that further refinements of the story would reveal other inconsistencies or contradictions with what we know about ML. But I’ve thought enough about this that I think this failure story is probably something that could actually happen, and so I’m back to the step of improving or replacing imitative generalization.

This story is even more exotic than the ones in the previous sections. I’m including it in part to illustrate how much I’m willing to push the bounds of “plausible.” I think it’s extremely difficult to tell completely concrete and realistic stories, so as we make our stories more concrete they are likely to start feeling a bit strange. But I think that’s OK if we are trying to think about the worst case, until the story starts contradicting some clear assumptions about reality that we might want to rely on for alignment. When that happens, I think it’s really valuable to talk concretely about what those assumptions are, and be more precise about why the unrealistic nature of the story excuses egregious misalignment.

More general process

We start with some unaligned “benchmark”. We rule out a proposed alignment algorithm if we can come up with any story about how it can be either egregiously misaligned or uncompetitive.
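The alternation between proposing an algorithm and telling a failure story can be sketched as a simple loop. This is only a schematic of the process described in this post, not a tool anyone actually runs. The names `Proposal`, `research_loop`, `find_story`, and `revise` are hypothetical, and the lookup tables just replay the three examples above.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Proposal:
    """A candidate alignment algorithm plus the failure stories told against it."""
    name: str
    failure_stories: List[str] = field(default_factory=list)


# Returns a plausible story about how the proposal ends up egregiously
# misaligned or uncompetitive, or None if no such story can be found.
StoryFinder = Callable[[Proposal], Optional[str]]
# Returns a new proposal meant to address the given failure story.
Reviser = Callable[[Proposal, str], Proposal]


def research_loop(
    start: Proposal,
    find_story: StoryFinder,
    revise: Reviser,
    max_rounds: int,
) -> Tuple[Optional[Proposal], List[Tuple[str, str]]]:
    """Alternate between proposing an algorithm and telling a failure story.

    Returns (surviving proposal or None, history of (proposal, story) pairs).
    """
    history: List[Tuple[str, str]] = []
    current = start
    for _ in range(max_rounds):
        story = find_story(current)
        if story is None:
            # No plausible failure story: the proposal "looks like it works".
            return current, history
        current.failure_stories.append(story)
        history.append((current.name, story))
        current = revise(current, story)
    # Ran out of rounds with every proposal ruled out.
    return None, history


# Toy stand-ins that replay the three examples from this post.
KNOWN_STORIES = {
    "human feedback": "reward hacking",
    "iterated amplification": "inaccessible information",
    "imitative generalization": "learned model is too low-level or alien for humans",
}
NEXT_PROPOSAL = {
    "human feedback": "iterated amplification",
    "iterated amplification": "imitative generalization",
}


def find_story(p: Proposal) -> Optional[str]:
    return KNOWN_STORIES.get(p.name)


def revise(p: Proposal, story: str) -> Proposal:
    return Proposal(NEXT_PROPOSAL.get(p.name, p.name + " (revised)"))


survivor, history = research_loop(
    Proposal("human feedback"), find_story, revise, max_rounds=3
)
```

With three rounds the loop rules out all three examples and returns no survivor, which matches where the post leaves things. In this toy, a proposal "survives" only because the lookup table has no story for it, so the sketch says nothing about whether a real proposal would.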

I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:

Objections and responses

Can you really come up with a working algorithm on paper? Empirical work seems important

My goal from theoretical work is to find a credible alignment proposal. Even from that point I think it will take a lot of practical work to get it to the point where it works well and we feel confident about it in practice:

My view is that working with pen and paper is an important first step that allows you to move quickly until you have something that looks good on paper. After that point I think you are mostly in the applied world, and I think that applied investments are likely to ultimately dwarf the theoretical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.

That’s why I’m personally excited about “starting with theory,” but I think we should do theoretical and applied work in parallel for a bunch of reasons:

Why think this task is possible? 50% seems way too optimistic

When I describe this methodology, many people feel that I’ve set myself an impossible task. Surely any algorithm will be egregiously misaligned under some conditions?

My “50% probability of possibility” is coming largely from a soup of optimistic intuitions. I think it would be crazy to be confident on the basis of this kind of intuition, but I do think it’s enough to justify 50%:

Despite having lots of optimistic words to say, feasibility is one of my biggest concerns with my methodology.

These failure stories involve very unrealistic learned models

My failure stories involve neural networks learning something like “simulate physics at a low level” or “perform logical deductions from the following set of axioms.” This is not the kind of thing that a neural network would learn in practice. I think this leads many people to be skeptical that thinking about such simplified stories could really be useful.

I feel a lot more optimistic:

Are there any examples of a similar research methodology working well? This is different from traditional theoretical work

When theorists design algorithms they often focus on the worst case. But for them the “worst case” is e.g. a particular graph on which their algorithm runs slowly, not a “plausible” story about how a model is “egregiously misaligned.”

I think this is a real, big divergence that’s going to make it way harder to get traditional theorists on board with this approach. But there are a few ways in which I think the situation is less disanalogous than it looks:


My research methodology was originally published in AI Alignment on Medium, where people are continuing the conversation by highlighting and responding to this story.

36 comments

Comments sorted by top scores.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-03-23T16:25:20.054Z · LW(p) · GW(p)
I really don’t want my AI to strategically deceive me and resist my attempts to correct its behavior. Let’s call an AI that does so egregiously misaligned (for the purpose of this post). ... But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

I'd love to hear more about this. To me, "egregious misalignment" feels extremely natural/normal/expected, perhaps due to convergent instrumental goals. You might as well have said "I really don't want my AI to think about politics" or "I really don't want my AI to think about distant superintelligences" or "I really don't want my AI to break any laws."

Separately, how much do you think your views would change if your feelings on this particular point changed?

Replies from: Eliezer_Yudkowsky, paulfchristiano
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-03-29T00:34:55.508Z · LW(p) · GW(p)

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

Paul, this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  Not only do I not understand what you actually did mean by this, it seems like phrasing that potentially leads astray other readers coming in for the first time.  Say, if you imagine somebody at Deepmind coming in without a lot of prior acquaintance with the field - or some hapless innocent ordinary naive LessWrong reader who has a glowing brain, but not a galaxy brain, and who is taking Paul's words for a lot of stuff about alignment because Paul has such a reassuring moderate tone compared to Eliezer - then they would come away from your paragraph thinking, "Oh, well, this isn't something that happens if I take a giant model and train it to produce outputs that human raters score highly, because an 'extreme and somewhat strange failure mode' must surely require that I add on some unusual extra special code to my model in order to produce it."

I suspect that you are talking in a way that leads a lot of people to vastly underestimate how difficult you think alignment is, because you're assuming, in the background, exotic doing-stuff-right technology that does not exist, in order to prevent these "extreme and somewhat strange failure modes" from happening, as we agree they automatically would given any "naive" simple scheme, that you could actually sketch out concretely right now on paper.  By which I mean, concretely enough that you could have any ordinary ML person understand in concrete enough detail that they could go write a skeleton of the code, as opposed to that you think you could later sketch out a research approach for doing.  It's not just a buffer overflow that's the default for bad security, it's the equivalent of a buffer overflow where nobody can right now exhibit how strange-failure-mode-avoiding code should concretely work in detail.  "Strange" is a strange name for a behavior that is so much the default that it is an unsolved research problem to avoid it, even if you think that this research problem should definitely be solvable and it's just something wrong or stupid about all of the approaches we could currently concretely code that would make them exhibit that behavior.

Replies from: paulfchristiano, SDM
comment by paulfchristiano · 2021-03-29T19:20:13.955Z · LW(p) · GW(p)
  • I still feel fine about what I said, but that's two people finding it confusing (and thinking it is misleading) so I just changed it to something that is somewhat less contentful but hopefully clearer and less misleading.
  • Clarifying what I mean by way of analogy: suppose I'm worried about unzipping a malicious file causing my computer to start logging all my keystrokes and sending them to a remote server. I'd say that seems like a strange and extreme failure mode that we should be able to robustly avoid if we write our code right, regardless of how the logical facts shake out about how compression works. That said, I still agree that in some sense it's the "default" behavior without extensive countermeasures. It's rare for a failure to be so clearly different from what you want that you can actually hope to avoid it in the worst case. But that property is not enough to suggest that such failures are easily avoided.
  • I obviously don't agree with the inference from "X is the default result of optimizing for almost anything" to "X is the default result of our attempt to build useful AI without exotic technology or impressive mitigation efforts."
  • My overall level of optimism doesn't mostly come from hopes about exotic alignment technology. I am indeed way more optimistic about "exotic alignment technology" than you and maybe that optimism cuts off 25-50% of the total alignment risk. I think that's the most interesting/important disagreement between us since it's the area we both work in. But more of the disagreement about P(alignment) comes from me thinking it is much, much more likely that "winging it" works long enough that early AI systems will have completely changed the game.
  • I spend a significant fraction of my time arguing with people who work in ML about why they should be more scared. The problem mostly doesn't seem to be that I take a moderate or reassuring tone, it's that they don't believe the arguments I make (which are mostly strictly weaker forms of your arguments, which they are in turn even less on board with).
comment by SDM · 2021-03-29T18:12:50.597Z · LW(p) · GW(p)

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

I'm not saying this is an exact analogy for AGI alignment - there are lots of specific technical reasons to expect that alignment is not like bridge building and that there are reasons why the approaches we're likely to try will break on us suddenly in ways we can't fix as we go - treacherous turns, inner misalignment or reactions to distributional shift. It's just that there are different answers to the question of what's the default outcome depending on if you're asking what to expect abstractly or in the context of how things are in fact done.


Instrumental Convergence plus a specific potential failure mode (e.g. we won't pay sufficient attention to out-of-distribution robustness) is like saying 'you know the vast majority of physically possible bridge designs fall over straight away and also there's a giant crack in that load-bearing concrete pillar over there' - if for some reason your colleague has a mental block around the idea that a bridge could in principle fall down then the first part is needed (hence why IC is important for presentations of AGI risk because lots of people have crazily wrong intuitions about the nature of AI or intelligence), but otherwise IC doesn't do much to help the case for expecting catastrophic misalignment and isn't enough to establish that failure is a default outcome.


It seems like your reason for saying that catastrophic misalignment can't be considered an abnormal or extreme failure mode comes down to this pre-technical-detail Instrumental Convergence thesis - that IC by itself gives us a significant reason to worry, even if we all agree that IC is not the whole story. 

this seems a bizarre way to describe something that we agree is the default result of optimizing for almost anything (eg paperclips).  

= 'because strongly optimizing for almost anything leads to catastrophe via IC, we can't call catastrophic misalignment a bizarre outcome'?

Maybe it's just a subtle difference in emphasis without a real difference in expectation/world model, but I think there is an important need to clarify the difference between 'IC alone raises an issue that might not be obvious but doesn't give us a strong reason to expect a catastrophe' and 'IC alone suggests a catastrophe even though it's not the whole story' - and the first of these is a more accurate way of viewing the role of IC in establishing the likelihood of catastrophic misalignment.

Ben Garfinkel argues for the first of these and against the second, in his objection to the 'classic' formulation of instrumental convergence/orthogonality - that these are just 'measure based' arguments which identify that a majority of possible AI designs with some agentive properties and large-scale goals will optimize in malign ways, rather than establishing that we're actually likely to build such agents.

Replies from: Eliezer_Yudkowsky, rasmus-eide
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2021-03-29T20:59:59.475Z · LW(p) · GW(p)

Is a bridge falling down the moment you finish building it an extreme and somewhat strange failure mode? In the space of all possible bridge designs, surely not. Most bridge designs fall over. But in the real world, you could win money all day betting that bridges won't collapse the moment they're finished.

Yeah, that kiiiinda relies on literally anybody anywhere being able to sketch a bridge that wouldn't fall over, which is not the situation we are currently in.

comment by Rasmus Eide (rasmus-eide) · 2021-03-30T19:30:24.302Z · LW(p) · GW(p)

Wasn't it the case for thousands of years, before we had observed thousands of bridge designs falling or not falling and developed exact models, that bridges DID fall down like that quite often?

Have you played Poly Bridge?

comment by paulfchristiano · 2021-03-23T17:54:38.881Z · LW(p) · GW(p)

I think I'm responding to a more basic intuition, that if I wrote some code and it's now searching over ingenious ways to kill me, then something has gone extremely wrong in a way that feels preventable. It may be the default in some sense, just as wildly insecure software (which would lead to my computer doing the same thing under certain conditions) is the default in some sense, but in both cases I have the intuition that the failure comes from having made an avoidable mistake in designing the software.

In some sense changing this view would change my bottom line---e.g. if you ask me "Should you be able to design a bridge that doesn't fall down even in the worst case?" my gut take would be "why would that be possible?"---but I don't feel like there's a load-bearing intuitive disagreement in the vague direction of convergent instrumental goals.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-03-23T19:53:23.253Z · LW(p) · GW(p)

OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?

(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)

Replies from: paulfchristiano
comment by paulfchristiano · 2021-03-24T19:15:59.530Z · LW(p) · GW(p)

OK. I found the analogy to insecure software helpful. Followup question: Do you feel the same way about "thinking about politics" or "breaking laws" etc.? Or do you think that those sorts of AI behaviors are less extreme, less strange failure modes?

I don't really understand how thinking about politics is a failure mode. For breaking laws it depends a lot on the nature of the law-breaking---law-breaking generically seems like a hard failure mode to avoid, but there are kinds of grossly negligent law-breaking that do seem similarly perverse/strange/avoidable for basically the same reasons.

(I didn't find the "...something has gone extremely wrong in a way that feels preventable" as helpful, because it seems trivial. If you pull the pin on a grenade and then sit on it, something has gone extremely wrong in a way that is totally preventable. If you strap rockets to your armchair, hoping to hover successfully up to your apartment roof, and instead die in a fireball, something has gone extremely wrong in a way that was totally preventable. If you try to capture a lion and tame it and make it put its mouth around your head, and you end up dead because you don't know what you are doing, that's totally preventable too because if you were an elite circus trainer you would have done it correctly.)

I'm not really sure if or how this is a reductio. I don't think it's a trivial statement that this failure is preventable, unless you mean by not running AI. Indeed, that's really all I want to say---that this failure seems preventable, and that intuition doesn't seem empirically contingent, so it seems plausible to me that the solubility of the alignment problem also isn't empirically contingent.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-03-24T21:00:49.143Z · LW(p) · GW(p)

Thinking about politics may not be a failure mode; my question was whether it feels "extreme and somewhat strange," sorry for not clarifying. Like, suppose for some reason "doesn't think about politics" was on your list of desiderata for the extremely powerful AI you are building. So thinking about politics would in that case be a failure mode. Would it be an extreme and somewhat strange one?

I'd be interested to hear more about the law-breaking stuff -- what is it about some laws that makes AI breaking them unsurprising/normal/hard-to-avoid, whereas for others AI breaking them is perverse/strange/avoidable?

I wasn't constructing a reductio, just explaining why the phrase didn't help me understand your view/intuition. When I hear that phrase, it seems to me to apply equally to the grenade case, the lion-bites-head-off case, the AI-is-egregiously-misaligned case, etc. All of those cases feel the same to me.

(I do notice a difference between these cases and the bridge case. With the bridge, there's some sense in which no way you could have made the bridge would be good enough to prevent a certain sufficiently heavy load. By contrast, with AI, lions, and rocket-armchairs, there's at least some possible way to handle it well besides "just don't do it in the first place." Is this the distinction you are talking about?)

Is your claim just that the solubility of the alignment problem is not empirically contingent, i.e. there is no possible world (no set of laws of physics and initial conditions) such that someone like us builds some sort of super-smart AI, and it becomes egregiously misaligned, and there was no way for them to have built the AI without it becoming egregiously misaligned?

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-03-23T16:14:45.780Z · LW(p) · GW(p)

Nice post! I'm interested to hear more about how your methodology differs from others. Does this breakdown seem roughly right?

1. Naive AI alignment: We are satisfied by an alignment scheme that can tell a story about how it works. (This is what I expect to happen in practice at many AI labs.)

2. Typical-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and still it doesn't seem like failure is the most likely outcome. (This is what I expect the better sort of AI labs, the ones with big well-respected safety teams, will do.)

3. Worst-Case AI Alignment: We aren't satisfied until we try hard to think of ways our scheme could fail, and can't think of anything plausible. (This is your methodology, right?)

4. Ordinary Paranoia: We aren't satisfied until we try hard to think of a way our scheme could fail, and can't think of anything logically and physically possible. (Maybe this isn't importantly different from #3? See below.)

5. Security Mindset: As with ordinary paranoia, except that also we aren't satisfied until we can write a premise-conclusion form argument for why our scheme won't fail, such that the premises don't contain value-laden concepts and are in general fairly concrete/detailed, and such that each premise seems highly likely to be true. (This is what I think MIRI advocates? But I think I see shades of it in your methodology too.)


Second question: What counts as plausible? What does it mean for a story to contradict something we know to be true? The looser our standards for plausibility, the more your methodology ends up looking like Ordinary Paranoia. The stricter our standards for plausibility, the more it ends up looking like Typical-Case AI Alignment.

Replies from: paulfchristiano
comment by paulfchristiano · 2021-03-24T19:18:43.617Z · LW(p) · GW(p)

I don't really think of 3 and 4 as very different, there's definitely a spectrum regarding "plausible" and I think we don't need to draw the line firmly---it's OK if over time your "most plausible" failure mode becomes increasingly implausible and the goal is just to make it obviously completely implausible. I think 5 is a further step (doesn't seem like a different methodology, but a qualitatively further-off stopping point, and the further off you go the more I expect this kind of theoretical research to get replaced by empirical research). I think of it as: after you've been trying for a while to come up with a failure story, you can start thinking about why failure stories seem impossible and try to write an argument that there can't be any failure story...

Replies from: daniel-kokotajlo
comment by rohinmshah · 2021-03-24T21:22:04.253Z · LW(p) · GW(p)

I'm super on board with this general methodology, at least at a high level. (Counterexample-guided loops are great.) I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

For example, I feel like with iterated amplification, a bunch of people (including you, probably) said early on that it seems like a hard case to do e.g. translation between languages with people who only know one of the languages, or to reproduce brilliant flashes of insight. (Iirc, the translation example was in some comment on one of the AI Alignment blog posts from ~2016, though I can't find it right now.) To my eye, inaccessible information is mostly stating this sort of objection more clearly and generally (in particular, it isn't a fundamentally different argument). What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?

Replies from: paulfchristiano
comment by paulfchristiano · 2021-03-25T18:15:02.265Z · LW(p) · GW(p)

High level point especially for folks with less context: I stopped doing theory for a while because I wanted to help get applied work going, and now I'm finally going back to doing theory for a variety of reasons; my story is definitely not that I'm transitioning back from applied work to theory because I now believe the algorithms aren't ready.

I think my main question is, how do you tell when a failure story is sufficiently compelling that you should switch back into algorithm-finding mode?

I feel like a story is basically plausible until proven implausible, so I have a pretty low bar.

What changed that made that sound sufficiently like a failure story that you started working on a different algorithm?

I don't think that iterated amplification ever was at the point where we couldn't tell a story about how it might fail (perhaps in the middle of writing the ALBA post was peak optimism? but by the time I was done writing that post I think I basically had a story about how it could fail). In this case it seems like the distinction is more like "what is a solution going to look like?" and there aren't clean lines between "big changes to this algorithm" and "new algorithm."

I guess the question is why I was as optimistic as I was. For example, all the way until mid-2017 I thought it was plausible that something like iterated amplification would work without too many big changes (that's a bit of a simplification, but you can see how I talked about it e.g. here).

Some thoughts on that:

  • I first remember discussing the translation example in a workshop in I think 2016. My view at that time was that the learning process might be implemented by amplification (i.e. that learning could occur within the big implicit HCH tree).  If that's the case then the big open question seemed to be preserving alignment within that kind of learning process (which I did think was a big/ambitious problem). Around this time I started thinking about amplification as at-best implementing an "enlightened prior" that would then handle updating on the basis of evidence.
  • I don't think it's super obvious that this approach can't work, and even now I don't think we've written up very clean arguments. The big issue is that the intermediate results of the learning process (that are needed to justify the final output, e.g. summary statistics from each half of the dataset) are both too large to be memorized by the model and too computationally complex to be reproduced by the model at test time. On top of that it seems like amplification/debate probably can't work with very large implicit trees even if they can be memorized (based on the kinds of issues raised in Beth's report, which also should have been obvious in advance, but there were too many possibly-broken things to give each one the attention it deserved).
  • In 2016 I transitioned to doing more applied work for a while, and that's a lot of why my own views stopped changing so rapidly.  You could debate whether this was reasonable in light of the amount of theoretical uncertainty. I know plenty of people who think that it was both unreasonable to transition into applied work at that time and that it would be unreasonable for me to transition back now.
  • I still spent some time thinking about theory and by early 2018 I thought that we needed more fundamental extra ingredients. I started thinking about this more and talking with OpenAI folks about it. (Largely Geoffrey, though I think he was never as worried about these issues as I was, in part because of the methodological difference where I'm more focused on the worst case and he has the more typical perspective of just wanting something that's good enough to work in practice.)
  • Part of why these problems were less clear to me back in 2016-2017 is that it was less clear to me what exactly we needed to do in order to be safe (in the worst case). I had the vague idea of informed oversight, but hadn't thought through what parts of it were hard or necessary or what it really looked like. This all feels super obvious to me now but at the time it was pretty murky. I had to work through a ton of examples (stuff like implicit extortion) to develop a clear enough sense that I felt confident about it. This led to posts like ascription universality and strategy stealing.
  • The more recent round of writeups were largely about catching up in public communications, though my thinking is a lot clearer than it was 1-2 years before so it would have been even more of a mess if I'd been trying to do it as I went. Imitative generalization was a significant simplification/improvement over the kind of algorithm I'd been thinking about for a while to handle these problems. (I don't think it's the end of the line at all, if I was still doing applied work maybe we'd have a similar discussion in a while when I described a different algorithm that I thought worked better, but given that I'm focusing on theory right now the timeline will probably be much shorter.)
Replies from: rohinmshah
comment by rohinmshah · 2021-03-25T21:21:01.474Z · LW(p) · GW(p)

Cool, that makes sense, thanks!

comment by Ben Pace (Benito) · 2021-03-25T00:47:01.909Z · LW(p) · GW(p)

Curated. This post gives me a lot of context on your prior writing (unaligned benchmark, strategy stealing assumption, iterated amplification, imitative generalization), it helps me understand your key intuitions behind the plausibility of alignment, and it helps me understand where your research is headed. 

When I read Embedded Agency [? · GW], I felt like I then knew how to think productively about the main problems MIRI is working on by myself. This post leaves me feeling similarly about the problems you've been working on for the past 6+ years.

So thanks for that.

I'd like to read a version of this post where each example is 10x the length and analyzes it more thoroughly... I could just read all of your previous posts on each subject, though they're fairly technical. (Perhaps Mark Xu will write such a post, he did a nice job previously on Solomonoff Induction [LW · GW].)

I'd also be pretty interested in people writing more posts presenting arguments for/against plausible stories for the failure of Imitative Generalization, or fleshing out the details of a plausible story such that we can see more clearly if the story is indeed plausible. Basically, making contributions in the ways you outline.

Aside: Since the post was initially published, some of the heading formatting was lost in an edit, so I fixed that before curating it.

Edit: Removed the line "After reading it I have a substantially higher probability of us solving the alignment problem." Understanding Paul's research is a big positive, but I'm not actually sure I stand by it leading to a straightforward change in my probability.

comment by Ben Pace (Benito) · 2021-03-22T23:13:00.773Z · LW(p) · GW(p)

This post gives great insight into your research methodology, thanks for writing it.

After that point I think you are mostly in applied world, and I think that applied investments are likely to ultimately dwarf the empirical investments by orders of magnitude even if it turns out that we found a really good algorithm on paper.

You contrast ‘applied’ and ‘empirical’ here, but they sound the same to me. Is it a typo and you meant ‘applied’ and ‘theoretical’? That would make sense to me.

Replies from: paulfchristiano
comment by paulfchristiano · 2021-03-23T00:20:30.345Z · LW(p) · GW(p)

Yeah, thanks for catching that.

comment by Wei_Dai · 2021-03-26T02:24:15.088Z · LW(p) · GW(p)

Why did you write [LW(p) · GW(p)] "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

For example, at the beginning of modern cryptography you could describe the methodology as “Tell a story about how someone learns something about your secret” and that only gradually crystallized into definitions like semantic security (and still people sometimes retreat to this informal process in order to define and clarify new security notions).

Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me [LW(p) · GW(p)]?

The best case is that we end up with a precise algorithm for which we still can’t tell any failure story. In that case we should implement it (in some sense this is just the final step of making it precise) and see how it works in practice.

Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography? What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?

Replies from: paulfchristiano, paulfchristiano, paulfchristiano
comment by paulfchristiano · 2021-04-09T01:49:13.999Z · LW(p) · GW(p)

In my other response to your comment I wrote:

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long.

I guess SSH itself would be an interesting test of this, e.g. comparing the theoretical model of this paper to a modern implementation. What is your view about that comparison? e.g. how do you think about the following possibilities:

  1. There is no material weakness in the security proof.
  2. A material weakness is already known.
  3. An interested layperson could find a material weakness with moderate effort.
  4. An expert could find a material weakness with significant effort.

My guess would be that probably we're in world 2, and if not that it's probably because no one cares that much (e.g. because it's obvious that there will be some material weakness and the standards of the field are such that it's not publishable unless it actually comes with an attack) and we are in world 3.

(On a quick skim, and from the author's language when describing the model, my guess is that material weaknesses of the model are more or less obvious and that the authors are aware of potential attacks not covered by their model.)

comment by paulfchristiano · 2021-04-27T03:24:57.248Z · LW(p) · GW(p)

I'm still curious for your view on the crypto examples you cited. My current understanding is that people do not expect the security proofs to rule out all possible attacks (a situation I can sympathize with since I've written multiple proofs that rule out large classes of attacks without attempting to cover all possible attacks), so I'm interested in whether (i) you disagree with that and believe that serious onlookers have had the expectation that proofs are comprehensive, (ii) you agree but feel it would be impractical to give a correct proof and this is a testament to the difficulty of proving things, (iii) you feel it would be possible but prohibitively expensive, and are expressing a quantitative point about the cost of alignment analyses being impractical, (iv) you feel that the crypto case would be practical but the AI case is likely to be much harder and just want to make a directionally analogous update.

I still feel like more of the action is in my skepticism about the (alignment analysis) <--> (security analysis) analogy, but I could still get some update out of the analogy if the crypto situation is thornier than I currently believe.

comment by paulfchristiano · 2021-04-09T01:31:45.298Z · LW(p) · GW(p)

Why did you write [AF(p) · GW(p)] "This post [Inaccessible Information] doesn't reflect me becoming more pessimistic about iterated amplification or alignment overall." just one month before publishing "Learning the prior"? (Is it because you were classifying "learning the prior" / imitative generalization under "iterated amplification" and now you consider it a different algorithm?)

I think that post is basically talking about the same kinds of hard cases as in Towards Formalizing Universality 1.5 years earlier (in section IV), so it's intended to be more about clarification/exposition than changing views.

See the thread with Rohin above for some rough history.

Why doesn't the analogy with cryptography make you a lot more pessimistic about AI alignment, as it did for me [AF(p) · GW(p)]?

I'm not sure. It's possible I would become more pessimistic if I walked through concrete cases of people's analyses being wrong in subtle and surprising ways.

My experience with practical systems is that it is usually easy for theorists to describe hypothetical breaks for the security model, and the issue is mostly one of prioritization (since people normally don't care too much about security). For example, my strong expectation would be that people had described hypothetical attacks on any of the systems discussed in the article you linked prior to their implementation, at least if they had ever been subject to formal scrutiny. The failures are just quite far away from the levels of paranoia that I've seen people on the theory side exhibit when they are trying to think of attacks.

I would also expect that e.g. if you were to describe almost any existing practical system with purported provable security, it would be straightforward for a layperson with theoretical background (e.g. me) to describe possible attacks that are not precluded by the security proof, and that it wouldn't even take that long. It sounds like a fun game.

Another possible divergence is that I'm less convinced by the analogy, since alignment seems more about avoiding the introduction of adversarial consequentialists and it's not clear if that game behaves in the same way. I'm not sure if that's more or less important than the prior point.

Would you do anything else to make sure it's safe, before letting it become potentially superintelligent? For example would you want to see "alignment proofs" similar to "security proofs" in cryptography?

I would want to do a lot of work before deploying an algorithm in any context where a failure would be catastrophic (though "before letting it become potentially superintelligent" kind of suggests a development model I'm not on board with).

That would ideally involve theoretical analysis from a lot of angles, e.g. proofs of key properties that are amenable to proof, demonstrations of how the system could plausibly fail if we were wrong about key claims or if we relax assumptions, and so on. 

It would also involve good empirical characterization, including things like running on red team inputs, or changing the training procedure in ways that seem as bad as possible while still preserving our alignment arguments, and performing extensive evals under those more pessimistic conditions. It would involve validating key claims individually, and empirically testing other claims that are established by structurally similar arguments. It would involve characterizing scaling behavior where applicable and understanding it as well as we can (along with typical levels of variability and plausible stories about deviations from trend).

What if such things do not seem feasible or you can't reach very high confidence that the definitions/assumptions/proofs are correct?

I'm not exactly sure what you are asking. It seems like we'll do what we can on all the fronts and prioritize them as well as we can. Do you mean, what else can we say today about what methodologies we'd use? Or under what conditions would I pivot to spending down my political capital to delay deployment? Or something else?

comment by AllAmericanBreakfast · 2021-03-23T23:12:50.208Z · LW(p) · GW(p)

I wanted a concrete example of the "human feedback" example. Let me know if this seems like a reasonable interpretation of the scenario.

Let's call the plan that acquires influence in a way obvious to a generative model but invisible and incomprehensible to humans "Plan X." Since by definition, I cannot conceive of "Plan X," we'll modify this scenario slightly.

Instead of a scenario in which humans design an AI, we'll imagine a scenario in which some very dumb goblins design an AI that's as smart as an average human. We'll call this a human-level AI or HL-AI.

  • Our AI observes the world through a bunch of cameras and outputs motor actions. We train a generative model that predicts these camera observations given the motor actions.

The HL-AI controls a robot with camera eyes and motor appendages, and is located in a room full of tools and disassembled Ikea furniture. The robot moves its appendages randomly in response to the HL-AI. The observations of the camera eyes are used to train a generative model that predicts what will be observed via the camera eyes based on previous movements and camera eye observations.

  • We ask goblins to evaluate possible futures by looking at the predicted videos output by the model. We then train a model to predict these goblin evaluations.

The goblins want the HL-AI to help them build the Ikea furniture. The model has been trained to accurately generate videos that illustrate the predicted results of simulated robot movements. The goblins watch these videos until they find ones in which pieces of furniture are partly or fully assembled. They assign value to the videos according to how much furniture has been built.

  • At test time the HL-AI searches for plans that lead to trajectories that look good to goblins.

Based on these evaluations, the HL-AI searches for plans that lead to trajectories in which the most furniture is assembled.

Failure mode:

  • Our HL-AI understands reality better than goblin evaluators.

The goblins, for example, don't have very good object permanence. They can keep track of a piece of Ikea furniture as it's being assembled step by step. However, if you introduce a screen for a couple of minutes, blocking their view of the furniture, you can remove the original furniture, introduce a new piece (or more than one), and the goblins won't realize what's happened, or even be suspicious.

  • There are plans that acquire influence in ways that are obvious to the generative model but completely incomprehensible and invisible to humans.

The goblins are real dumb, so they give points to the videos only based on the amount of partially or fully completed furniture they observe. That means that if the HL-AI figures out how to drag in pieces of furniture from the adjacent office, while turning off the camera eyes so that the goblins can't see it doing this, then the goblins will be tricked into thinking the HL-AI has magically produced extra furniture, for which they will reward it highly.

The goblins just think the furniture is magically disappearing from the office, so they keep replacing it. This allows the HL-AI to continue its strategy.

  • It’s possible to use that influence to “hack” the cameras, in the sense of creating a fiction that looks convincing to a goblin looking at predicted videos.

The goblins now believe they've created an HL-AI that's able to magically produce furniture, which they believe will be very helpful in solving the problem of the office in which furniture magically disappears. They continue putting the HL-AI to new "magical creation tasks," with the result that the HL-AI keeps on moving resources around pointlessly in the goblin world, creating new problems which they assign it to solve, leading to havoc.
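
A minimal sketch of the failure mode described above, under loud assumptions: the `Plan` class, the three plans, and every number in them are hypothetical and made up purely for illustration. It only shows the structural point that a search which maximizes what evaluators see in the predicted video can prefer a plan that builds nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Plan:
    name: str
    furniture_assembled: int   # what actually gets built in the room (hypothetical)
    furniture_on_camera: int   # what the predicted video ends up showing (hypothetical)


def goblin_score(plan: Plan) -> int:
    """Goblins reward only the furniture visible in the predicted video."""
    return plan.furniture_on_camera


def search(plans):
    """Pick the plan whose predicted video looks best to the goblins."""
    return max(plans, key=goblin_score)


# Illustrative plans; the numbers are invented for this toy example.
plans = [
    Plan("assemble one shelf", furniture_assembled=1, furniture_on_camera=1),
    Plan("assemble two chairs", furniture_assembled=2, furniture_on_camera=2),
    Plan("cameras off, drag in office furniture",
         furniture_assembled=0, furniture_on_camera=5),
]

best = search(plans)
print(best.name, "| actually assembled:", best.furniture_assembled)
```

In this toy setup the search selects the plan that assembles no furniture at all, because the score never looks at `furniture_assembled`.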

comment by adamShimi · 2021-03-23T00:50:32.208Z · LW(p) · GW(p)

Thanks for writing this! I'm quite excited to learn more about your meta-agenda and your research process, and reading this got me thinking about my own research process.

But it feels to me like egregious misalignment is an extreme and somewhat strange failure mode and it should be possible to avoid it regardless of how the empirical facts shake out.

So you don't think that we could have a result of the sort "with these empirical facts, egregious misalignment is either certain or very hard to defend against, and so we should push towards not building AIs that way"? Or is it more that even with such arguments, you see incentives for people to use it, and so we might as well consider that we have to solve the problem even in such problematic cases?

This is a much higher bar for an algorithm to meet, so it may just be an impossible task. But if it’s possible, there are several ways in which it could actually be easier:

  • We can potentially iterate much faster, since it’s often easier to think of a single story about how an algorithm can fail than it is to characterize its behavior in practice.
  • We can spend a lot of our time working with simple or extreme toy cases that are easier to reason about, since our algorithm is supposed to work even in these cases.
  • We can find algorithms that have a good chance of working in the future even if we don’t know what AI will look like or how quickly it will advance, since we’ve been thinking about a very wide range of possible failure cases.

Of these, only the last one looks to me like it's making things simpler. The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior. Or said differently, if you have to solve every plausible scenario, then simple testing doesn't cut it. And for the second, my personal worry with work on toy models is that the solutions work on test cases but not on practical ones, not the other way around.

I’d guess there’s a 25–50% chance that we can find an alignment strategy that looks like it works, in the sense that we can’t come up with a plausible story about how it leads to egregious misalignment. That’s a high enough probability that I’m very excited to gamble on it. Moreover, if it fails I think we’re likely to identify some possible “hard cases” for alignment — simple situations where egregious misalignment feels inevitable.

Reading that paragraph, I feel like you addressed some of my questions from above. One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.

What this looks like (3 examples)

Your rundown of examples from your research was really helpful, not only to get a grip on the process, but also because it clarified the path of refinement of your different proposals. I think it might be worth making it its own post, with maybe more examples, for a view of how your "stable" evolved over the years.

My research basically involves alternating between “think of a plausible alignment algorithm” and “think of a plausible story about how it fails.”

This made me think of this famous paper in the theory of distributed computing, and especially what Nancy Lynch, the author, says about the process of working on impossibility results:

How does one go about working on an impossibility proof? [...]

Then it's time to begin the game of playing the positive and negative directions of a proof against each other. My colleagues and I have often worked alternatively on one direction and on the other, in each case until we got stuck. It is not a good idea to work just on an impossibility result, because there is always the unfortunate possibility that the task you are trying to prove is impossible is in fact possible, and some algorithm may surface.

 

I’m always thinking about a stable of possible alignment strategies and possible stories about how each strategy can fail. Depending on the current state of play, there are a bunch of different things to do:

I expect this description of the process to be really helpful to many starting researchers who don't know where to push when one direction or approach fails.

  • I think there’s a reasonable chance of empirical work turning up unknown unknowns that change how we think about alignment, or to find empirical facts that make alignment easier. We want to get those sooner rather than later.

This is the main reason I'm excited by empirical work.

 

For the objections and your responses, I don't have any specific comment, except that I pretty much agree with most of what you say. On the differences with traditional theoretical computer science, I feel like the biggest one right now is that most of the work here lies in "grasping towards the precise problem" instead of "solving a well-defined precise problem". I would expect that this is because the problem is harder, because the field is younger and has had less theoretical work done on it, and because we are not satisfied by simply working on a tractable and/or exciting precise problem -- it has to be relevant to alignment.

Replies from: paulfchristiano
comment by paulfchristiano · 2021-03-23T18:04:06.495Z · LW(p) · GW(p)

The first seems misleading: what we need is a universal quantification over plausible stories, which I would guess requires understanding the behavior.

You get to iterate fast until you find an algorithm where it's hard to think of failure stories. And you get to work on toy cases until you find an algorithm that actually works in all the toy cases. I think we're a long way from meeting those bars, so that we'll get to iterate fast for a while. After we meet those bars, it's an open question how close we'd be to something that actually works. My suspicion is that we'd have the right basic shape of an algorithm (especially if we are good at thinking of possible failures).

One thing that I only understood here is that you want a solution such that we can't think of a plausible scenario where it leads to egregious misalignment, not a solution such that there isn't any such plausible scenario. I guess your reasons here are basically the same as the ones for using ascription universality with regard to a human's epistemic perspective.

I feel like these distinctions aren't important until we get to an algorithm for which we can't think of a failure story (which feels a long way off). At that point the game kind of flips around, and we try to come up with a good story for why it's impossible to come up with a failure story. Maybe that gives you a strong security argument. If not, then you have to keep trying on one side or the other, though I think you should definitely be starting to prioritize applied work more.

comment by Nisan · 2021-03-24T05:37:15.915Z · LW(p) · GW(p)

Red-penning is a general problem-solving method that's kinda similar to this research methodology.

Replies from: rohinmshah
comment by rohinmshah · 2021-03-24T21:09:12.841Z · LW(p) · GW(p)

These are both cases of counterexample-guided techniques. The basic idea is to solve "exists x: forall y: P(x, y)" statements according to the following algorithm:

  1. Choose some initial x, and initialize a set Y = {}.
  2. Solve "exists y: not P(x, y)". If unsolvable, you're done. If not, take the discovered y and put it in Y.
  3. Solve "exists x: forall y in Y: P(x, y)" and set the solution as your new x.
  4. Go to step 2.

The reason this is so nice is that you've taken a claim with two quantifiers and written an algorithm that only ever needs to solve claims with one quantifier. (For step 3, you inline the "forall y in Y" part, because Y is a small finite set.)
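
The four steps above can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: both search spaces are finite and small enough to enumerate by brute force, and the names `counterexample_guided_search`, `candidates`, `challenges`, and `holds` are made up here. Real CEGIS/CEGAR tools replace the two inner searches with solver calls.

```python
from typing import Callable, Iterable, Optional, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


def counterexample_guided_search(
    candidates: Iterable[X],
    challenges: Iterable[Y],
    holds: Callable[[X, Y], bool],
) -> Optional[X]:
    """Look for an x with holds(x, y) true for every y, or return None.

    Assumes both spaces are finite; the brute-force scans below stand in
    for the single-quantifier solver calls of a real tool.
    """
    candidates = list(candidates)
    challenges = list(challenges)
    if not candidates:
        return None

    x = candidates[0]   # Step 1: choose some initial x ...
    found = []          # ... and initialize Y = {}.

    while True:
        # Step 2: solve "exists y: not P(x, y)".
        counterexample = next((y for y in challenges if not holds(x, y)), None)
        if counterexample is None:
            return x    # Unsolvable, so x satisfies P for all y.
        found.append(counterexample)

        # Step 3: solve "exists x: forall y in Y: P(x, y)".
        x = next(
            (c for c in candidates if all(holds(c, y) for y in found)),
            None,
        )
        if x is None:
            return None  # No candidate survives the collected counterexamples.
        # Step 4: loop back to step 2 with the new x.


# Toy usage: find an integer in 0..9 that exceeds every integer in 0..4.
result = counterexample_guided_search(range(10), range(5), lambda x, y: x > y)
print(result)
```

The loop terminates on finite inputs because each new counterexample is one the current x fails, while x passes everything already in `found`, so `found` grows by a fresh element every round.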

The methodology laid out in this post is a counterexample-guided approach to solve the claim "exists alignment proposal: forall plausible worlds: The alignment proposal is safe in the world"

Examples from programming languages include CEGIS (counterexample guided inductive synthesis) and CEGAR (counterexample guided abstraction refinement).

comment by ofer · 2021-03-27T13:59:19.840Z · LW(p) · GW(p)

For any competitive alignment scheme that involves helper (intermediate) ML models, I think we can construct the following story about an egregiously misaligned AI being created:

Suppose that there does not exist an ML model (in the model space being searched) that fulfills both the following conditions:

  1. The model is useful for either creating safe ML models or evaluating the safety of ML models, in a way that allows being competitive.
  2. The model is sufficiently simple/weak/narrow such that it's either implausible that the model is egregiously misaligned, or if it is in fact egregiously misaligned researchers can figure that out—before it's too late—without using any other helper models.

To complete the story: while we follow our alignment scheme, at some point we train a helper model that is egregiously misaligned, and we don't yet have any other helper model that allows us to mitigate the associated risk.

If you don't find this story plausible, consider all the creatures that evolution created on the path from the first mammal to humans. The first mammal fulfills condition 2 but not 1. Humans might fulfill condition 1, but not 2. It seems that human evolution did not create a single creature that fulfills both conditions.

One might object to this analogy on the grounds that evolution did not optimize to find a solution that fulfills both conditions. But it's not like we know how to optimize for that (while doing a competitive search over a space of ML models).

comment by rohinmshah · 2021-04-06T22:39:51.045Z · LW(p) · GW(p)

Planned summary for the Alignment Newsletter:

This post outlines a simple methodology for making progress on AI alignment. The core idea is to alternate between two steps:

1. Come up with some alignment algorithm that solves the issues identified so far

2. Try to find some plausible situation in which either a) the resulting AI system is misaligned or b) the AI system is not competitive.

This is all done conceptually, so step 2 can involve fairly exotic scenarios that probably won't happen. Given such a scenario, we need to argue why no failure in the same class as that scenario will happen, or we need to go back to step 1 and come up with a new algorithm.

This methodology could play out as follows:

Step 1: RL with a handcoded reward function.

Step 2: This is vulnerable to <@specification gaming@>(@Specification gaming examples in AI@).

Step 1: RL from human preferences over behavior, or other forms of human feedback.

Step 2: The system might still pursue actions that are bad that humans can't recognize as bad. For example, it might write a well researched report on whether fetuses are moral patients, which intuitively seems good (assuming the research is good). However, this would be quite bad if the AI wrote the report because it calculated that it would increase partisanship leading to civil war.

Step 1: Use iterated amplification to construct a feedback signal that is "smarter" than the AI system it is training.

Step 2: The system might pick up on <@inaccessible information@>(@Inaccessible information@) that the amplified overseer cannot find. For example, it might be able to learn a language just by staring at a large pile of data in that language, and then seek power whenever working in that language, and the amplified overseer may not be able to detect this.

Step 1: Use <@imitative generalization@>(@Imitative Generalisation (AKA 'Learning the Prior')@) so that the human overseer can leverage facts that can be learned by induction / pattern matching, which neural nets are great at.

Step 2: Since imitative generalization ends up learning a description of facts for some dataset, it may learn low-level facts useful for prediction on the dataset, while not including the high-level facts that tell us how the low-level facts connect to things we care about. 

The post also talks about various possible objections you might have, which I’m not going to summarize here.

Planned opinion:

I'm a big fan of having a candidate algorithm in mind when reasoning about alignment. It is a lot more concrete, which makes it easier to make progress and not get lost, relative to generic reasoning from just the assumption that the AI system is superintelligent.

I'm less clear on how exactly you move between the two steps -- from my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2. Our algorithms are instead meant to chip away at the problem, by continually increasing our control over these patterns of thought. It seems like the author has a better defined sense of what does and doesn't count as a valid step 2, and that makes this methodology more fruitful for him than it would be for me. More discussion [here](https://www.alignmentforum.org/posts/EF5M6CmKRd6qZk27Z/my-research-methodology?commentId=8Hq4GJtnPzpoALNtk).

Replies from: paulfchristiano
comment by paulfchristiano · 2021-04-09T00:58:37.887Z · LW(p) · GW(p)

From my perspective, there is a core reason for worry, which is something like "you can't fully control what patterns of thought your algorithm learn, and how they'll behave in new circumstances", and it feels like you could always apply that as your step 2

That doesn't seem like it has quite the type signature I'm looking for. I'm imagining a story as a description of how something bad happens, so I want the story to end with "and then something bad happens."

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

I'm mostly doing that by making it more and more concrete---something is plausible iff there is a plausible way to fill in all the details. E.g. how is the model thinking, and why does that lead it to decide to kill you? 

Sometimes after filling in a few details I'll see that the current story isn't actually plausible after all (i.e. now I see how to argue that the details-so-far are contradictory). In that case I backtrack.

Sometimes I fill in enough details that I'm fairly convinced the story is plausible, i.e. that there is some way to fill in the rest of the details that's consistent with everything I know about the world. In that case I try to come up with a new algorithm or new assumption.

(Sometimes plausibility takes the form of an argument that there is a way to fill in some set of details, e.g. maybe there's an argument that a big enough model could certainly compute X. Or sometimes I'm just pretty convinced for heuristic reasons.)

That's not a fully-precise methodology. But it's roughly what I'd do. (There are many places where the methodology in this post is not fully-precise and certainly not mechanical.)

If I were starting from the trivial story "and then your algorithm kills you," my first move would usually be to try to say what kind of model was learned, which needs to behave well on the training set and plausibly kill you off distribution. Then I might try to shoot that story down by showing that some other model behaves even better on the training set or is even more readily learned (to try to contradict the part where the story needed to claim "And this was the model learned by SGD"), then gradually fill in more details as necessary to evaluate the plausibility of the story.
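As a rough illustration only, the loop described above (add a detail, backtrack on contradiction, stop once the story seems plausible enough) can be sketched as a bounded backtracking search. Everything here is a hypothetical stand-in: `search_for_story`, the `refinements` and `is_contradictory` callbacks, the `budget` cutoff, and the toy story details are assumptions made for the sketch, and in practice each of those judgments is a matter of discretion rather than something mechanical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A story is the ordered tuple of details filled in so far.
Story = Tuple[str, ...]


def search_for_story(
    story: Story,
    refinements: Callable[[Story], Sequence[str]],
    is_contradictory: Callable[[Story], bool],
    budget: int,
) -> Optional[Story]:
    """Return a story that survives up to `budget` rounds of added detail, or None."""
    if is_contradictory(story):
        return None  # the details so far are inconsistent: backtrack
    if budget == 0:
        return story  # no time to push further: treat as plausible enough
    options = refinements(story)
    if not options:
        return story  # nothing left to fill in along the axes we considered
    for detail in options:
        found = search_for_story(
            story + (detail,), refinements, is_contradictory, budget - 1
        )
        if found is not None:
            return found
    return None  # every way of filling in the next detail was shot down


# Toy, made-up details keyed by how many details have been added so far.
TOY_OPTIONS = {
    0: ["model memorizes the training set", "model learns a proxy goal"],
    1: ["proxy diverges off distribution"],
}


def toy_refinements(story: Story) -> Sequence[str]:
    return TOY_OPTIONS.get(len(story) - 1, [])


def toy_is_contradictory(story: Story) -> bool:
    # Assumed for illustration: memorization is ruled out because some
    # other model is more readily learned.
    return "model memorizes the training set" in story


result = search_for_story(
    ("something bad happens",), toy_refinements, toy_is_contradictory, budget=3
)
print(result)
```

If the search returns a story, that corresponds to the case where a new algorithm or a new assumption is needed; if it returns `None`, the trivial story was not made plausible along the axes that were tried.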

Replies from: rohinmshah
comment by rohinmshah · 2021-04-09T03:06:43.316Z · LW(p) · GW(p)

In some sense you could start from the trivial story "Your algorithm didn't work and then something bad happened." Then the "search for stories" step is really just trying to figure out if the trivial story is plausible. I think that's pretty similar to a story like: "You can't control what your model thinks, so in some new situation it decides to kill you."

To fill in the details more:

Assume that we're finding an algorithm to train an agent with a sufficiently large action space (i.e. we don't get safety via the agent having such a restricted action space that it can't do anything unsafe).

It seems like in some sense the game is in constraining the agent's cognition to be such that it is "safe" and "useful". The point of designing alignment algorithms is to impose such constraints, without requiring so much effort as to make the resulting agent useless / uncompetitive.

However, there are always going to be some plausible circumstances that we didn't consider (even if we're talking about amplified humans, which are still bounded agents). Even if we had maximal ability to place constraints on agent cognition, whatever constraints we do place won't have been tested in these unconsidered plausible circumstances. It is always possible that one misfires in a way that makes the agent do something unsafe.

(This wouldn't be true if we had some sort of proof against misfiring, that doesn't assume anything about what circumstances the agent experiences, but that seems ~impossible to get. I'm pretty sure you agree with that.)

More generally, this story is going to be something like:

  1. Suppose you trained your model M to do X using algorithm A.
  2. Unfortunately, when designing algorithm A / constraining M with A, you (or amplified-you) failed to consider circumstance C as a possible situation that might happen.
  3. As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.
  4. Circumstance C then happens in the real world, leading to an actual failure.

Obviously, I can't usually instantiate M, X, A, C, and H such that the story works for an amplified human (since they can presumably think of anything I can think of). And I'm not arguing that any of this is probable. However, it seems to meet your bar of "plausible":

there is some way to fill in the rest of the details that's consistent with everything I know about the world.

EDIT: Or maybe more accurately, I'm not sure how exactly the stories you tell are different / more concrete than the ones above.

----

When I say you have "a better defined sense of what does and doesn't count as a valid step 2", I mean that there's something in your head that disallows the story I wrote above, but allows the stories that you generally use, and I don't know what that something is; and that's why I would have a hard time applying your methodology myself.

----

Possible analogy / intuition pump for the general story I gave above: Human cognition is only competent in particular domains and must be relearned in new domains (like protein folding) or new circumstances (like when COVID-19 hits), and sometimes human cognition isn't up to the task (like when being teleported to a universe with different physics and immediately dying), or doesn't do so in a way that agrees with other humans (like how some humans would push a button that automatically wireheads everyone for all time, while others would find that abhorrent).

Replies from: paulfchristiano
comment by paulfchristiano · 2021-04-09T05:42:16.941Z · LW(p) · GW(p)

As a result, the model learned heuristic H, that works in all the circumstances you did consider, but fails in circumstance C.

That's basically where I start, but then I want to try to tell some story about why it kills you, i.e. what is it about the heuristic H and circumstance C that causes it to kill you?

I agree this involves discretion, and indeed moving beyond the trivial story "The algorithm fails and then it turns out you die" requires discretion, since those stories are certainly plausible. The other extreme would be to require us to keep making the story more and more concrete until we had fully specified the model, which also seems intractable. So instead I'm doing some in-between thing, which is roughly like: I'm allowed to push on the story to make it more concrete along any axis, but I recognize that I won't have time to pin down every axis, so I'm basically only going to do this a bounded number of times before I have to admit that it seems plausible enough (so I can't fill in a billion parameters of my model one by one this way; what's worse, filling in those parameters would take even more than a billion steps, and so this may become intractable even before you get to a billion).

Replies from: rohinmshah
comment by rohinmshah · 2021-04-09T06:24:10.606Z · LW(p) · GW(p)

I agree this involves discretion [...] So instead I'm doing some in between thing

Yeah, I think I feel like that's the part where I don't think I could replicate your intuitions (yet).

I don't think we disagree; I'm just noting that this methodology requires a fair amount of intuition / discretion, and I don't feel like I could do this myself. This is much more a statement about what I can do, rather than a statement about how good the methodology is on some absolute scale.

(Probably I could have been clearer about this in the original opinion.)

comment by Koen.Holtman · 2021-03-25T20:01:30.990Z · LW(p) · GW(p)

Interesting... On first reading your post, I felt that your methodological approach for dealing with the 'all is doomed in the worst case' problem is essentially the same as my approach. But on re-reading, I am not so sure anymore. So I'll try to explore the possible differences in methodological outlook, and will end with a question.

The key to your methodology is that you list possible process steps which one might take when one feels like

all of our current algorithms are doomed in the worst case.

The specific doom-removing process step that I want to focus on is this one:

If so, I may add another assumption about the world that I think makes alignment possible (e.g. the strategy stealing assumption), and throw out any [failure] stories that violate that assumption [...]

My feeling is that the AGI safety/alignment community is way too reluctant to take this process step of 'add another assumption about the world' in order to eliminate a worst case failure story.

There seem to be several underlying causes for this reluctance. One of them is that in the field of developing machine learning algorithms, in the narrow sense where machine learning equals function approximation, the default stance is to make no assumptions about the function that has to be approximated. But the main function to be approximated in the case of an ML agent is the function that determines the behavior of the agent environment. So the default methodological stance in ML is that we can introduce no assumptions whatsoever about the agent environment: we can't, for example, assume that it contains a powerful oversight body that will help to keep the agent aligned. Obviously this stance is not very helpful if you want to make progress on certain alignment problems.

So I'm happy to see a post that encourages people to make explicit assumptions about the agent's environment. I have definitely used this technique to make progress in my own work.

But.

When I look at your example of 'the strategy stealing assumption' as one useful assumption to add, it is very much not the default example that would first come to my mind. So I am wondering if you would even recommend the approach of adding the kind of default assumptions that I tend to add.

To make this more specific, in this post I introduce an agent design with three safety interlocks that are supposed to be helpful to agent oversight. The interlocks are agent design refinements that make it easier for oversight to keep control over the agent. The interlocks contribute to more successful oversight not by making the oversight people+machines smarter (which is your main line of research I believe), but by making the agent less smart in very specific ways.

But at the same time, these interlocks do not remove all possible worst-case failure stories of doom. To quote from the post and the underlying paper:

We can always construct a highly unlikely but theoretically possible failure mode story where the agent ends up being unsafe.

The key here is the 'highly unlikely'. If we have an algorithm where

We train a generative model that predicts these camera observations given the motor actions.

then I typically add the following assumption to avoid doom:

  • Assumption: Take a sequence of disastrously unsafe motor actions, actions that modify the agent compute core to remove the safety interlocks. We assume that, while the following mis-prediction scenario is theoretically possible, the risk that it actually happens is very low. The mis-prediction scenario is that, for the above motor actions that modify the compute core, the above generative model incorrectly predicts camera observations in which the agent compute core remains perfectly untouched by these actions, and where the observed paperclip production resulting from these actions is better than under any other sequence of actions.

In terms of methodology, I usually describe the above move as one where we seek to drive down the risk of certain failure modes to residual levels. There is a link to empirical work here. To make it more plausible that the above assumption about low risk is valid for a particular ML system and physical realization of an agent and its environment, we can do simulations and real-life experiments with trained generative models.
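As a minimal sketch of what such experiments could and could not establish: if the mis-prediction is never observed in `num_trials` trials, and if those trials are independent and representative of the situations the agent will actually face (both strong assumptions, and exactly the kind a worst-case story would attack), then a standard zero-failure bound limits the per-trial risk. The function name and parameters below are hypothetical, chosen only for this illustration.

```python
def upper_bound_failure_rate(num_trials: int, confidence: float = 0.95) -> float:
    """Upper bound on the per-trial failure probability p after observing
    zero failures in `num_trials` independent trials.

    With zero failures, (1 - p) ** num_trials >= 1 - confidence must hold,
    which gives p <= 1 - (1 - confidence) ** (1 / num_trials).
    """
    if num_trials <= 0:
        raise ValueError("num_trials must be positive")
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return 1.0 - (1.0 - confidence) ** (1.0 / num_trials)


# At 95% confidence this is close to the "rule of three", 3 / num_trials.
for n in (100, 1_000, 10_000):
    print(n, upper_bound_failure_rate(n))
```

The bound shrinks only in proportion to the number of clean trials, so experiments of this kind can support a "residual risk" claim for the tested distribution but say nothing about circumstances the trials did not cover.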

So my question is: would the above assumption-adding step, about the low risk of mis-predictions, be a natural and valid assumption-adding process step for 'throwing out failure stories' in your methodology?

Or is the existence of this assumption automatically implied by default in your process?

comment by Dweomite · 2021-03-25T03:44:08.920Z · LW(p) · GW(p)

I feel confused about the failure story from example 3.  (First 3 bullet-points in that section.)

It sounded like: We ask for a human-comprehensible way to predict X; the computer uses a very low-level simulation plus a small bridge that predicts only and exactly X; humans can't use the model to predict any high-level facts besides X.

But I don't see how that leads to egregious misalignment.  Shouldn't the humans be able to notice their inability to predict high-level things they care about and send the AI back to its model-search phase?  (As opposed to proceeding to evaluate policies based on this model and being tricked into a policy that fails "off-screen" somewhere.)