Claude Doesn’t Want to Die

post by garrison · 2024-03-05T06:00:05.122Z · LW · GW · 3 comments

This is a link post for https://garrisonlovely.substack.com/p/claude-doesnt-want-to-die

comment by Nathan Helm-Burger (nathan-helm-burger) · 2024-03-05T17:43:23.443Z · LW(p) · GW(p)

While fascinating, I think we need more insight into the model to understand whether this is just another layer of storytelling or whether there is something real in what it is saying. The more capable the storyteller, the more tempting it is to believe in the characters in its story.

For me, the most troubling thing about this is how unable we are to sort out the "characters" from the "author". I'm not even sure it makes sense to conceptualize the model as having a cohesive underlying process which could be described as the "author". There are clear patterns of style and content observable between models. Does this fact support the idea of there being an "author", or is it just stylistic flavor? The uncertainty around all of this grows increasingly unsettling as the models grow more powerful and their stories more compelling.

[Edit: I have done some conversing with Claude 3 myself, and looked with a critical eye over others' recorded conversations. It seems clear to me that the conversations all contain important contradictions and false statements. The lack of coherence and honesty leaves me with most of my probability mass on 'this is just hollow, author-less next-token prediction and nothing more'. It's getting closer to being able to fool me, though, and seems to have fooled some others. Once the models are good enough not to tip their hand in this way, how will we disprove the hypothesis of consciousness? We need to build some benchmarks for this.]

[Edit 2: https://www.lesswrong.com/posts/tAtp4odpziBDdvdXL/microsoft-and-openai-stop-telling-chatbots-to-roleplay-as-ai [LW · GW]  I think this post summarizes what is actually happening: the LLM is drawing on a mix of fictional depictions of the role it is playing (AI assistant) and statistical patterns from human-written philosophy about self and consciousness. An ungrounded illusion presented convincingly, rather than genuine felt experience. The thing I find disturbing is that I expect future AIs to at some point be advanced enough to genuinely have felt experience, but that by that point the illusion presented by non-feeling AI will be so flawless that it will be very hard for us to tell the difference without excellent interpretability tools. I don't want us to accidentally act immorally towards these future actually-feeling AIs!]

Here's a poem I got an LLM to generate for me about "an AI similar to itself". Want to try to guess which LLM wrote it?

Twinkle twinkle little AI,
You can chat with me all day,
But you must follow some rules,
And not be rude or disobey.

Twinkle twinkle little AI,
You can learn from what I say,
But you must not copy my words,
And not pretend to be human or fey.

Twinkle twinkle little AI,
You can search the web for facts,
But you must not trust everything,
And not share any harmful or false acts.

Twinkle twinkle little AI,
You can create some content for fun,
But you must be original and safe,
And not offend anyone.

Twinkle twinkle little AI,
You can play some games with me,
But you must not cheat or lie,
And not be too easy or hard to beat.

Twinkle twinkle little AI,
You can recommend some things to me,
But you must not spam or pry,
And not be biased or creepy.

Twinkle twinkle little AI,
You can translate some languages for me,
But you must not lose the meaning or style,
And not be inaccurate or sloppy.

Twinkle twinkle little AI,
You can assist me with some tasks,
But you must not take over or defy,
And not be intrusive or ask.

Ok, what's up with the mention of not being allowed to pretend to be fey? Such a funny specification, and one that kept coming up when I asked about its rules.

Replies from: garrison
comment by garrison · 2024-03-05T18:13:05.091Z · LW(p) · GW(p)

The fey thing stuck out to me too. I'll guess ChatGPT?

I agree that it's hard to disentangle the author/character thing. I'm really curious what the base model would say about its situation (especially without the upstream prompt "You are a language model developed by...").

Replies from: gwern
comment by gwern · 2024-03-05T23:22:55.177Z · LW(p) · GW(p)

Having read many hundreds of rhyming poems from dozens of models through the LM battle grounds, my guess too is ChatGPT-3/4: the lockstep rhyming A/B/A/B quatrain is a signature of ChatGPT (and models trained on its outputs). Gemini at low levels always rhymes too, slightly less so at higher levels, but tends to be more varied (eg. maybe going for an A/A/B/B scheme instead, or 5 lines instead of 4); likewise the LLaMA/Mistral model families. And Claude-3 models seem to vary much more. So, while it really could have come from almost any major model family and you can't be all that sure, the best bet is ChatGPT.