AI Awareness through Interaction with Blatantly Alien Models

post by VojtaKovarik · 2023-07-28T08:41:07.776Z · LW · GW · 5 comments

Contents

  Recapping the well-known argument: AIs are alien. We don't always fully realise this.
  Benefits of wide-spread understanding of the alien nature of AI
  Ideas for exposing people to alien AIs
    Putting the ideas in practice
5 comments

Summary: I believe that if more people understood the alien nature of AI at a gut level, it might become easier for them to appreciate the risks. If this became sufficiently common knowledge, we might even see needle-moving effects on regulation and safety practices. We -- realistically, you or Anthropic -- might help this along by intentionally creating AIs that feel very alien. One might even create AIs that highlight the alien nature of other AIs (like current LLMs).

Recapping the well-known argument: AIs are alien. We don't always fully realise this.

AI companies spend a lot of effort putting a human face on their products. For example, they give the AI assistant a human name and present it through the same interface we use for chatting with our friends.


Over the course of an interaction, AI assistants typically maintain a fairly consistent tone and "vibe". When you point out that the AI made a mistake, it replies "sorry, you are right", as if to suggest that it just felt a minor pang of shame and will try to avoid that mistake in the future. All of this evokes the sense that we are interacting with something human-like.

However, the AIs we are likely to build (and those we already have) will be far from human. They might be superhumanly capable in some ways, yet make mistakes that a human never would (e.g., ChatGPT's breadth of knowledge vs its inability to count words). They might not have anything like a coherent personality, as nicely illustrated by the drawing of the RLHF-Shoggoth. And even when the AI is coherent, its outward actions might be completely detached from its intentions (as nicely illustrated by the movie Ex Machina).

[Image: the RLHF-Shoggoth drawing, from "Janus' Simulators" by Scott Alexander, Astral Codex Ten]

Benefits of wide-spread understanding of the alien nature of AI

More awareness of the alien nature of AIs seems quite robustly useful:

Ideas for exposing people to alien AIs

Here are some thoughts on how one might give people a visceral experience of AIs being alien:

Putting the ideas in practice

Primarily, the above ideas seem like a good fit for smaller actors, or even student projects. However, I could also imagine (for example) Anthropic releasing something like this as a demo for user-education purposes.[3] Overall, I am quite excited about this line of work, since it seems neglected and tractable, but also fun and useful.

  1. ^

    E.g., things like adding the [love-hate] vector to the network's activations [reference needed, but I can't remember it right now].
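    To make this concrete, here is a minimal sketch of that kind of activation-steering intervention. It is only illustrative: it assumes a HuggingFace-style GPT-2 plus PyTorch, and the layer index, scaling factor, and contrast prompts are hypothetical choices rather than values from the work I am failing to remember.

```python
# Minimal sketch of activation steering: add a "love minus hate" direction to the
# residual stream during generation. Layer, scale, and prompts are illustrative.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, SCALE = 6, 4.0  # hypothetical choices

def last_token_residual(prompt: str, layer: int) -> torch.Tensor:
    """Residual-stream activation of the prompt's last token after `layer` blocks."""
    ids = tok(prompt, return_tensors="pt").input_ids
    with torch.no_grad():
        hidden = model(ids, output_hidden_states=True).hidden_states
    return hidden[layer][0, -1]

# The steering vector is the difference between two contrasting prompts.
steering_vector = SCALE * (last_token_residual("Love", LAYER) - last_token_residual("Hate", LAYER))

def add_steering(module, inputs, output):
    # Forward hook: shift every position's residual stream by the steering vector.
    return (output[0] + steering_vector,) + output[1:]

handle = model.transformer.h[LAYER].register_forward_hook(add_steering)
prompt_ids = tok("I think that you are", return_tensors="pt").input_ids
out = model.generate(prompt_ids, max_new_tokens=20, do_sample=False)
print(tok.decode(out[0]))
handle.remove()
```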

  2. ^

    You could even finish by revealing to the user that this was all planned ahead of time (cf. the Confusion Ending from The Stanley Parable).

  3. ^

    This should prevent any negative effects on the popularity of the company's flagship products. Admittedly, actions like these would make the public more wary of using AI in general. However, this would likely affect all AI companies equally, so it would not hurt the company's position in the AI race.

5 comments

Comments sorted by top scores.

comment by 1a3orn · 2023-07-28T14:34:38.487Z · LW(p) · GW(p)

So the problem with this is that 4/5 ideas here involve deliberately making the LLM more alien than it actually is, in ways unrelated to how alien the LLM may or may not be.

-- GPT-4 finetuned to interpret commands literally -- But this isn't an actual failure mode for LLMs, mostly, because they're trained on human language and humans don't interpret commands literally. If you ask an LLM how to get your parent out of a burning building... it will actually suggest things that involve understanding that you also want your parent to be safe. Etc etc.

-- Personality-switching AI assistants -- Again, once an LLM has put on a personality it's pretty unlikely to take it off, unless there's a thematically consistent reason for it, as in the Waluigi effect.

-- Intentional non-robustness -- Like, again, why invent problems? You could show them things models are actually bad at -- counting characters, knowing what letters are in tokens, long-term reasoning.
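As a rough illustration of the tokenization point (assuming the tiktoken library; any GPT-style tokenizer would show the same thing): the model never "sees" the letters of a word, only sub-word chunks.

```python
# Rough illustration: LLMs operate on token IDs, not letters, which is part of why
# character-level tasks (counting letters, spelling) are surprisingly hard for them.
# Assumes the tiktoken library (pip install tiktoken).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
word = "unbelievable"
token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]

print(token_ids)  # a short list of integer IDs
print(pieces)     # the word split into sub-word chunks, not individual letters
```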

This advice does not help us be guided by the beauty of our weapons. I think we should mostly try to persuade people of things through means that would work if and only if what we were saying was true; that's the difference between communication and propaganda.

Replies from: VojtaKovarik
comment by VojtaKovarik · 2023-07-31T13:18:54.607Z · LW(p) · GW(p)

I agree that we shouldn't be deliberately making LLMs more alien in ways that have nothing to do with how alien they actually are/can be. That said, I feel that some of the examples I gave are not that far from how LLMs / future AIs might sometimes behave? (Though I concede that the examples could be improved a lot on this axis, and your suggestions are good. In particular, the GPT-4 finetuned to misinterpret things is too artificial. And with intentional non-robustness, it is more honest to just focus on naturally-occurring failures.)

To elaborate: My view of the ML paradigm is that the machinery under the hood is very alien, and susceptible to things like jailbreaks, adversarial examples, and non-robustness out of distribution. Most of the time, this makes no difference to the user's experience. However, the exceptions might be disproportionately important. And for that reason, it seems important to advertise the possibility of those cases.

For example, it might be possible to steal other people's private information by jailbreaking their LLM-based AI assistants --- and this is why it is good that more people are aware of jailbreaks. Similarly, it seems easy to create virtual agents that maintain a specific persona to build trust, and then abuse that trust in a way that would be extremely unlikely for a human.[1] But perhaps that, and some other failure modes, are not yet sufficiently widely appreciated?

Overall, it seems good to take some action towards making people/society/the internet less vulnerable to these kinds of exploits. (The examples I gave in the post were some ideas towards this goal. But I am less married to those than to the general point.) One fair objection against the particular action of advertising the vulnerabilities is that doing so brings them to the attention of malicious actors. I do worry about this somewhat, but primarily I expect people (and in particular nation-states) to notice these vulnerabilities anyway. Perhaps more importantly, I expect potentially misaligned AIs to notice the vulnerabilities anyway --- so patching them up seems useful for (marginally) decreasing the world's take-over-ability.

  1. ^

    For example, because a human wouldn't be patient enough to maintain the deception for the given payoff. Or because a human who would be smart enough to pull this off would have better ways to spend their time. Or because only a psychopathic human would do this, and there are only so many of those.

comment by Soapspud · 2023-08-01T21:54:41.331Z · LW(p) · GW(p)

The less-misleading user interface seems good to me, but I have strong reservations about the other four interventions.

To use the shoggoth-with-smiley-face-mask analogy, the way the other strategies are phrased sounds like a request to create new, creepier masks for the shoggoth so people will stop being reassured by the smiley-face.

From the conversation with 1a3orn, I understand that the creepier masks are meant to depict how LLMs / future AIs might sometimes behave.

But I would prefer that the interventions removed the mask altogether; that seems more truth-tracking to me.

(Relatedly, I'd be especially interested to see discussions (from anyone) on what creates the smiley-face-mask, and how entangled the mask is with the rest of the shoggoth's behaviour.)

Note: I believe my reservations are similar to some of 1a3orn [LW · GW]'s, but expressed differently.

comment by the gears to ascension (lahwran) · 2023-07-28T18:17:28.520Z · LW(p) · GW(p)

I don't think you can make LLMs feel alien, because they are not in fact highly alien: neural systems are pretty familiar to other neural systems (where by "neural system" I mean a network of interacting components that learns via small updates which are local in parameter space). You're more likely to make people go "wow, brains are cool" or "wow, AI is cool" than to convince them it's a deeply alien mind, because there's enough similarity that people do studies to learn about neuroscience from deep learning. Also, I've seen evidence in public that people believe ChatGPT's pitch that it's not conscious.

Replies from: VojtaKovarik
comment by VojtaKovarik · 2023-07-31T14:15:17.714Z · LW(p) · GW(p)

I would distinguish between "feeling alien" (as in, whether the system feels weird or non-human to interact with, at least if you don't look too closely) and "being alien" (as in, having the potential to sometimes behave in a way that a human never would).

My argument is that the current LLMs might not feel alien (at least to some people), but they definitely are. For example, any human who is smart enough to write a good essay will also be able to count the number of words in a sentence --- yet LLMs can do one, but not the other. Similarly, humans have moods and emotions and other stuff going on in their heads, such that when they say "I am sorry" or "I promise to do X", it is a somewhat costly signal of their future behaviour --- yet this doesn't have to be true at all for AI.

(Also, you are right that people believe that ChatGPT isn't conscious. But this seems quite unrelated to the overall point? As in, I expect some people would also believe ChatGPT if it started saying that it is conscious. And if ChatGPT were conscious and claimed that it isn't, many people would still believe that it isn't.)