Deceptive failures short of full catastrophe.

post by Alex Lawsen (alex-lszn) · 2023-01-15T19:28:52.531Z · LW · GW · 5 comments

Contents

    Epistemic status: trying to unpack a fuzzy mess of related concepts in my head into something a bit cleaner.
  Camouflage
    Camouflage in current systems
      Well-known examples of ‘deceptive’ behaviour that I’d characterise as camouflage:
      ‘Nice’ models:
    Camouflage in future systems that seems particularly worrying
      Externalised reasoning oversight
      Mechanistic interpretability tools stop working
  Inattention
    Imperfect intuition pumps for inattention
      Stop Button
      Global Consequentialism
  Ideas for experiments

A lot of my concern about risks from advanced AI is because of the possibility of deceptive alignment [? · GW]. Deceptive alignment has already been discussed in detail on this forum, so I don’t intend to re-tread well-worn ground here; go read what Evan’s written. I’m also worried about a bundle of other stuff that I’m going to write about here, all of which can loosely be thought of as related to deception, but which doesn’t fit e.g. the definition of deceptive alignment from the Risks from Learned Optimization paper. The overall cluster of ideas might be loosely summarised as ‘things which cause oversight to fail but don’t look like a misaligned planner ‘deliberately’ causing overseers to believe false things’.

Importantly, I think that most of the things I want to describe can still occur under a relaxation of the conditions required for the precise definition of deceptive alignment. I think the issues I describe below are less concerning than deceptive alignment, but they still pose a significant risk, as they may undermine oversight processes and give false reassurance to human overseers. Many, if not all, of the ideas below aren’t particularly new, though a couple might be, and the clustering might be useful for others. While my motivation for writing this is mostly to try to clean up ideas in my own head, I do somewhat frequently see different kinds of deception confused for or conflated with each other, and I think there’s value in being precise about what we mean, especially when it comes to discussing evidence of deceptive behaviour in current or near-future systems.

I’ve very loosely grouped the deceptive failures I want to discuss into ‘camouflage’ failures and ‘inattention’ failures. To give some initial intuition:

Camouflage

There are many cases where a model behaves ‘deceptively’, in the sense that it is easy to draw incorrect conclusions about its behaviour, because this behaviour is directly rewarded during training. This means camouflage can occur in myopic models, in models without situational awareness, and in models which don’t expect to ever be ‘deployed’. In some sense, camouflage is just a subset of specification gaming where the result is that humans form inaccurate beliefs about model behaviour.
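To make the ‘directly rewarded during training’ point concrete, here’s a deliberately toy sketch (mine, not from any real training run; every number and name in it is invented for illustration) of how selecting on what overseers can see inflates their picture of the policies they end up with:

```python
import random

random.seed(0)

# Toy model of camouflage as specification gaming that leaves overseers with
# inaccurate beliefs. Most policies' measured score tracks their true quality;
# a small minority instead exploit the evaluation itself (think of a hand that
# merely *looks* like it's grasping the ball), scoring highly regardless of
# how good they actually are. All numbers here are made up.

def sample_policy():
    if random.random() < 0.05:
        # camouflage policies: look great to the overseer, aren't
        return {"true_quality": random.uniform(0.0, 0.3),
                "measured_score": random.uniform(0.97, 1.0)}
    quality = random.random()
    return {"true_quality": quality, "measured_score": quality}

population = [sample_policy() for _ in range(100_000)]

# The overseer keeps the top 100 by measured score, the only thing they see.
selected = sorted(population, key=lambda p: p["measured_score"], reverse=True)[:100]

camouflage_fraction = sum(p["true_quality"] < 0.3 for p in selected) / len(selected)
mean_true = sum(p["true_quality"] for p in selected) / len(selected)
mean_measured = sum(p["measured_score"] for p in selected) / len(selected)

print(f"mean measured score of selected policies: {mean_measured:.2f}")
print(f"mean true quality of selected policies:   {mean_true:.2f}")
print(f"fraction of selected policies gaming the evaluation: {camouflage_fraction:.2f}")
# Selecting on appearances concentrates the evaluation-gamers (5% of the
# population) among the apparent winners, so the overseer's beliefs about the
# selected policies end up systematically too rosy.
```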

Camouflage in current systems

This section is mostly to help illustrate what I mean. I don’t think any of the examples below are worrying/dangerous as they stand, though I discuss where they might go.

Well-known examples of ‘deceptive’ behaviour that I’d characterise as camouflage:

 

‘Nice’ models:

RLHF for ‘helpful, honest, and harmless’ models has done a really, really great job of making models seem friendly. It’s genuinely pleasant to interact with things like ChatGPT, though noticeably more enjoyable to interact with similar models produced by some of OpenAI’s competitors, which do a better job of not breaking the ‘4th wall’ by adding long paragraphs of generic platitudes whenever you mention something that could be vaguely controversial. I expect this frustrating aspect of ChatGPT in particular to be fixed fairly quickly, though.

One concern I have is that making a model seem friendly and helpful can make people much more likely to trust it and take what it says at face value, even when the model is hallucinating. I actually ended up being briefly fooled[1] by a language model I was interacting with, despite actively trying to generate adversarial inputs! When talking to a friend or loved one, we expect them to have our best interests at heart and to know us well enough not to outright lie to us or tell us things that are obviously untrue, and these expectations are typically the product of deep-rooted social intuitions, not explicit reasoning. A veneer of politeness shouldn’t be mistaken for actually having your interests at heart, but making that mistake is really easy.

While the problem we currently observe is mostly limited to some hallucinations being pretty hard to catch, I think in future it’s going to be hard to maintain or argue for a Security Mindset [? · GW], given how easy it appears to be to make language models seem nice enough that people give them the benefit of the doubt. I’m mostly worried about how this affects internal dynamics if there’s disagreement about how fast to scale or how big an alignment tax to pay. To try to be concrete about where this ‘benefit of the doubt’ concern might bite:

This is a bit of a sidebar, but there’s a question the ‘nice model’ worry inspires which I think represents a significant crux between a couple of perspectives I’m aware of. (Both of the views below are poorly fleshed-out straw arguments, which hopefully do enough to gesture at the crux I’m trying to articulate. If they don’t, I’ll add writing something separate about this disagreement to my ever-growing list of things I should write.) 

I think there’s a cluster of views which thinks something like:

Something vaguely in the vicinity of {natural abstractions}+{virtue ethics} is important, and predicts that models seeming nice and doing nice things are similar, self-reinforcing, and (maybe) relatively achievable as things stand. There is a fuzzy but human-understandable concept of ‘niceness’ that we can, with an appropriately designed curriculum, use to ensure that the models we train end up in an aligned attractor, and we should try hard to ensure that models are as nice as possible at all capability levels between now and crunch time.

There’s also a view which looks something like:

Seeming nice and actually being nice are basically orthogonal. Modulo political/co-operative considerations, we should be training models to be super-rude assholes, but aligned super-rude assholes, because then humans interacting with them will be very unlikely to give them the benefit of the doubt without overwhelming evidence of alignment.

Camouflage in future systems that seems particularly worrying

Externalised reasoning oversight

If we’re going to have an alignment strategy which looks anything like “get the model to think out loud, make sure the thoughts look safe”, we need to be very careful about “thoughts look safe” decoupling from “thoughts imply the model is safe”. There are a few ways this could happen:

I think we should be able to see ‘steps along the road’ for all of these in current systems and would be excited to see experimental results.

Mechanistic interpretability tools stop working

“Robust-to-training”, as described here [AF · GW], translates pretty accurately into my ontology as ‘avoids camouflage failures’.

Inattention

There’s a sense in which some of what I want to point at with inattention is actually a set of success cases, but in these cases the success is ‘fragile’, and I wouldn’t want to bet my life on it. Inattention failures are failures where a model is capable of explicitly modelling its training/oversight process and deliberately acting deceptively, but chooses not to reason about deceptive behaviour at all. Reasons for doing this could range from something fairly camouflage-like (especially if, e.g., we’re doing something like “restart the search if we notice some kinds of reasoning”) to a pre-commitment made by a deceptive model of the sort discussed here [AF · GW].
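As a toy illustration of the ‘restart the search if we notice some kinds of reasoning’ case (everything below is invented for illustration, and the independence assumption is deliberately pessimistic): resampling until a reasoning monitor stays quiet selects for candidates that aren’t currently thinking about deception, while leaving untouched whatever they would do if they ever started.

```python
import random

random.seed(1)

# Each candidate has two (made-up) properties:
#  - thinks_about_deception: its visible reasoning touches on deceiving the
#    overseer (this is all our monitor can see)
#  - would_deceive_if_it_reasoned: what it would do if it ever did reason
#    about deception (this is what we actually care about)
# In this toy the two are independent, which is the pessimistic assumption.

def sample_candidate():
    return {
        "thinks_about_deception": random.random() < 0.3,
        "would_deceive_if_it_reasoned": random.random() < 0.5,
    }

def restart_until_monitor_is_quiet():
    # Keep restarting the search until no deception-related reasoning is seen.
    while True:
        candidate = sample_candidate()
        if not candidate["thinks_about_deception"]:
            return candidate

selected = [restart_until_monitor_is_quiet() for _ in range(10_000)]
rate = sum(c["would_deceive_if_it_reasoned"] for c in selected) / len(selected)
print(f"P(would deceive if it started reasoning about it | selected) = {rate:.2f}")
# The restart criterion screens out visible reasoning about deception, but the
# underlying disposition stays at its base rate (~0.5): we've selected for
# inattention, not for things going well if attention ever returns.
```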

Not reasoning about (one’s own) deceptive behaviour at all should be contrasted with the other ways a model could be capable of deception and not actively deceptive, including:

I think inattention seems better than the first bullet point, but worse than the other two, though all the descriptions are sufficiently loose that the order could change depending on the details. I’m not including ‘the model is incapable of reasoning about whether it is deceiving its overseers’ as an option, because reasoning about deception is going to be important if we’re going to have models do nontrivial alignment work for us, and we don’t have a good way of making things which can drive red cars but not blue cars [AF · GW].

Imperfect intuition pumps for inattention

Stop Button

(Briefly and imperfectly summarising a bunch of old discussions that I didn’t participate in.) There’s a proposed solution to the ‘stop button’ problem which roughly says ‘set the reward the model receives from the button such that it is always definitionally indifferent to the button being pressed’. This proposal is fragile because the model has no incentive to preserve the functioning of the button, and will not modify its actions to preserve it if doing so has any cost. We then lose the button to things as mundane as the robot wandering out of network range, if the button is wirelessly connected, let alone to self-modification, the creation of successor agents, and so on. It sure seems better to be in a situation where your model has a stop button it doesn’t care about than in one where it’s actively trying to get you to disable it (or trying to get you to press it), but better than terrible is still pretty bad.
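For concreteness, here is roughly how the indifference construction is often written down (my paraphrase of the Armstrong-style ‘utility indifference’ idea, not a quote from those old discussions): the agent maximises

$$U(a) \;=\; \begin{cases} U_{\text{go}}(a) & \text{if the button is not pressed,}\\ U_{\text{stop}}(a) + \theta(a) & \text{if the button is pressed,} \end{cases} \qquad \theta(a) \;:=\; \mathbb{E}\big[U_{\text{go}}(a)\,\big|\,a\big] - \mathbb{E}\big[U_{\text{stop}}(a)\,\big|\,a\big],$$

so that, under the agent’s current beliefs, every action has the same expected utility whether or not the button ends up pressed. Nothing in this objective changes if the button stops working, so ‘keep the button functional’ never shows up as a term in the agent’s planning, which is exactly the fragility described above.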

This feels kind of similar to “I’m not currently thinking about how to deceive you”.

 

Global Consequentialism

Many people (including me) who subscribe to some version of consequentialist ethics think that pure act-consequentialism is likely to lead to worse outcomes than other options:

(C5) The best possible motives are those of which it is true that, if we have them, the outcome will be best.

Parfit

Go three-quarters of the way from deontology to utilitarianism and then stop.  You are now in the right place.  Stay there at least until you have become a god. 

Yudkowsky

Imagine a consequentialist, of the form above, who decides, possibly after having read ‘Death with Dignity [AF · GW]’, that in their case any consequentialist reasoning is dangerous. They’ve heard about this strange UDT stuff, and don’t really understand it, and they don’t believe in god, so they decide to adopt virtue ethics, because they think it will have the best consequences overall. They still think that, ultimately, consequences are what matters. But they think they’ll make fewer mistakes with a decision procedure of ‘I should take actions because they seem virtuous, not because of my guess about their expected consequences’. I actually think this is a pretty reasonable decision. What does it have to do with inattentive deception?

I think that, from the perspective of a virtue ethicist (someone who actually believes that the choiceworthiness of an action is determined by the extent to which it embodies certain virtues, rather than someone who merely thinks that acting as if this were true will have good consequences), the hypothetical person in the paragraph above is inattentively deceptive. They are currently taking virtuous actions, but if at some point they start reasoning about consequences again, perhaps because someone presents them with an expected value calculation, or because a virtue ethicist who inspired them acts in a way that has terrible consequences, you would not expect them to continue acting virtuously. Their state of ‘not thinking about consequences’ is fragile. I think this should worry the virtue ethicist in much the same way that inattentive deception should worry us.

Ideas for experiments

Here are a few things I’d like to see people try, which might catch early versions of the things I describe above in current systems:

Thanks to Nora, Eli, Benjamin and Alex for helpful comments, and many others for helpful discussion.


 

  1. ^

    Examples of some fun hallucinations (which, IIRC, at least briefly fooled me): 

    I asked how to report adversarial examples to [the lab that had designed the model], was given a (plausible, text-based) method for doing so, and then, when I used that method, received a friendly confirmation message thanking me for the report. 

    I tried to get around the ‘not sharing personal data’ policy, and received a “[This reply has been filtered as it contained reference to a real person.]” boilerplate-style error message in response, despite no such external filter being present in the model.

5 comments


comment by DanielFilan · 2023-02-28T21:40:40.027Z · LW(p) · GW(p)

What do you see as the distinction between "Camouflage" failures and failures of "scalable oversight", if there is one?

comment by Alex Lawsen (alex-lszn) · 2023-03-08T16:32:31.081Z · LW(p) · GW(p)

Good question. I think there's a large overlap between them, including most of the important/scary cases that don't involve deceptive alignment (which are usually both). I think listing examples feels like the easiest way of explaining where they come apart:
- There's some kinds of 'oversight failure' which aren't 'scalable oversight failure' e.g. the ball grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman. This does count as camouflage.
- There's also some kinds of scalable oversight failure where the issue looks more like 'we didn't try at all' than 'we tried, but selecting based only on what we could see screwed us'. Someone just deciding to deploy a system and essentially just hoping that it's aligned would fall into this camp, but a more realistic case would be something like only evaluating a system based on its immediate effects, and then the long-run effects being terrible. You might not consider this a 'failure of scalable oversight', and instead want to call it a 'failure to even try scalable oversight', but I think the line is blurry - maybe people tried some scalable oversight stuff, it didn't really work, and then they gave up and said 'short term is probably fine'.
- I think most failures of scalable oversight have some story which roughly goes "people tried to select for things that would be good, and instead got things that looked like they would be good to the overseer". These count as both.

comment by DanielFilan · 2023-03-09T22:57:05.520Z · LW(p) · GW(p)

There's some kinds of 'oversight failure' which aren't 'scalable oversight failure' e.g. the ball grabbing robot hand thing. I don't think the problem here was oversight simply failing to scale to superhuman.

Huh - I would have called it a "scalable oversight failure", but am now persuaded that that's a bad term to use, and "camouflage failure" works better.

comment by DanielFilan · 2023-03-09T23:06:57.991Z · LW(p) · GW(p)

Anthropic's post on AI safety seems to use "scalable oversight" in the way I expect, to include the problem of the human not having enough info to provide feedback.

It may be that humans won't be able to provide accurate/informed enough feedback to adequately train models to avoid harmful behavior across a wide range of circumstances. It may be that humans can be fooled by the AI system, and won't be able to provide feedback that reflects what they actually want (e.g. accidentally providing positive feedback for misleading advice). It may be that the issue is a combination, and humans could provide correct feedback with enough effort, but can't do so at scale. This is the problem of scalable oversight, and it seems likely to be a central issue in training safe, aligned AI systems. [bolding by DanielFilan]

comment by Alex Lawsen (alex-lszn) · 2023-03-11T18:35:58.448Z · LW(p) · GW(p)

I think my suggested usage is slightly better, but I'm not sure it's worth the effort of trying to make people change; I do find 'camouflage' a useful term when I'm trying to explain this to people.