AI chatbots don't know why they did it

skybrian

AI chatbots don't know why they did it

post by skybrian · 2023-04-27T06:57:09.239Z · LW · GW · 11 comments

This is a link post for https://skybrian.substack.com/p/ai-chatbots-dont-know-why-they-did

11 comments

[Epistemic status: a basic explanation, high confidence.]

Chatting with an AI seems like chatting with a person, so it’s natural to ask an AI chatbot to explain its answers. But it doesn’t work how you’d expect. An AI chatbot doesn't know how it decided what to write, so its explanation can only match its previous reasoning by coincidence.^[1]

How can I be sure of this? Based on how the API works, we know that a chatbot has no short-term memory.^[2] The only thing a chatbot remembers is what it wrote. It forgets its thought process immediately.

Yes, this is weird. If someone asked you why you did something yesterday, it’s understandable that you might not remember how you decided to do it. But if it were something you did a minute ago, you probably have some idea what you were thinking at the time. Forgetting your own thoughts so quickly might occasionally happen (“senior moment”), but having no short-term memory at all is a severe mental deficiency.

"“Senator, I don’t recall. Ever. Let me check my notes.”

But it seems to work. What happens when you ask a chatbot to explain its answer?

It will make up a plausible justification. It will try to justify its answer the way a human would,^[3] because it’s trained on writing justifications like people. Its justification will have little to do with how the language model really chooses what to write. It has no mechanism or training to do real introspection.

Often, chatbots are able to give reasonable justifications in reasonable situations, when there are likely similar justifications in their training set. But if you ask it to explain an off-the-wall answer, it may very well say something surreal.

How should this be fixed? The bug isn’t giving a bad justification, it’s trying to answer the question at all. Users will be misled by their intuitions about how people think.

OpenAI has trained ChatGPT to refuse to answer some kinds of questions. It could refuse these questions too. The chatbot could say "I don't remember how I decided that" whenever someone asks it for a justification, so that people learn what’s going on. This might seem like unhelpful evasion, but it’s literally true!

Maybe it could speculate anyway? “I don’t remember how I decided that, but a possible justification is…”

We’re on firmer ground when a chatbot explains its reasoning and then answers the question. We don't really know how LLM's think so there is no guarantee, but chain-of-thought reasoning improves benchmark scores, indicating that chatbots often do use their previous reasoning to decide on their answers.

Sometimes ChatGPT will use chain-of-thought reasoning its own. If it doesn’t, you can ask it to do that in a new chat session. (This avoids any possibility that it’s biased by its previous answer.)

^{^}
By “reasoning” or “thought process,” I mean how a chatbot decides what to write. The details are mysterious, but hopefully researchers will figure it out someday.
^{^}
ChatGPT’s API is stateless and requires the caller to pass in previous chat history. I wrote about some consequences of this before. It’s possible that the interactive version of ChatGPT works differently, but I think that’s unlikely and it would probably be discovered.
^{^}
Specifically, it answers the way the character it’s pretending to be would.

11 comments

Comments sorted by top scores.

comment by tailcalled · 2023-04-27T08:27:16.921Z · LW(p) · GW(p)

How can I be sure of this? Based on how the API works, we know that a chatbot has no short-term memory.[2] The only thing a chatbot remembers is what it wrote. It forgets its thought process immediately.

Isn't this false? I thought the chatbot is deterministic (other than sampling tokens from the final output probabilities) and that every layer of the transformer is autoregressive, so if you add a question to an existing conversation, it will reconstruct its thoughts underlying the initial conversation, and then allow those thoughts to affect its answers to your followup questions.

Replies from: gwern, skybrian

↑ comment by gwern · 2023-04-27T18:34:24.065Z · LW(p) · GW(p)

That seems unlikely, even if we stipulate that the full context is always available (because these are somehow always very short conversations), because the positional encodings will be different (by definition), the global nature of self-attention will almost certainly cause influence from the now-known later tokens when interpreting earlier in the context because that is such a universally useful thing to do, and the generated tokens were the result of the full forward pass at the final layer while to 'reconstruct' it so 'those thoughts can affect its answers' the full computation would have to be run completely within penultimate layers to make even the simplest use of the full reconstruction (so any 'difficult' thoughts would be hard to impossible to reconstruct before time/layers run out and the token has to be predicted by the ultimate layer). All of these mean it would be extremely difficult for it to simply 'replay' the original forward pass inside the new forward pass. (Also, if you think it really is literally replaying the exact computation when it explains a past computation, then what happens when you ask it to explain the explanation...? Does it replay the replay of the play?)

Imagine trying to reconstruct your own thoughts as you wrote something letter by letter, within a few seconds when reading an essay or program you wrote even an hour ago with the benefit of short-term memory, as you look at it as a whole trying to figure it out. You are definitely not simply replaying the thoughts you had back then as you figured it out, and you may not even be able to understand a summary of what you were thinking ('why did I write that?!') without taking quite a lot of time and going back and forth and laboriously reconstructing your reasoning process step by step. (Nor would you generally want to if you could, since that means repeating all of the mistakes and dead-ends.)

Replies from: tailcalled

↑ comment by tailcalled · 2023-04-27T20:52:55.946Z · LW(p) · GW(p)

Hmm upon reflection I might just have misremembered how GPTs work.

↑ comment by skybrian · 2023-04-27T15:31:12.690Z · LW(p) · GW(p)

That's a good question! I don't know but I suppose it's possible, at least when the input fits in the context window. How well it actually does at this seems like a question for researchers?

There's also a question of why it would do it when the training doesn't have any way of rewarding accurate explanations over human-like explanations. We also have many examples of explanations that don't make sense.

There are going to be deductions about previous text that are generally useful, though, and would need to be reconstructed. This will be true even if the chatbot didn't write the text in the first place (it doesn't know either way). The deductions couldn't be constructing the original thought process, though, when the chatbot didn't write the text.

So I think this points to a weakness in my explanation that I should look into, though it's likely still true that it confabulates explanations.

Replies from: tailcalled

↑ comment by tailcalled · 2023-04-27T16:54:38.569Z · LW(p) · GW(p)

I agree that it likely confabulates explanations.

comment by gjm · 2023-04-27T13:08:08.820Z · LW(p) · GW(p)

My impression is that humans also do more confabulation and less actual remembering than we think we do, when we answer questions about why we did particular things.

Some evidence for this, though it's not obvious to me quite how strong it is, comes from cases where it seems like the actual reasons aren't accessible when answering the question. For instance, someone is hypnotized and told "when you wake up, you will walk all the way around the room anticlockwise. You will not remember this instruction"; they do that; you ask them why; they say something like "oh, I felt fidgety and needed to walk around a bit". Or you take a patient who's had their corpus callosum severed to treat severe epilepsy, greatly reducing communication between the hemispheres of the brain; you show their right hemisphere a picture of a snowy field and their left hemisphere a picture of a chicken claw; then you provide them with a selection of other pictures and ask them to point to something relevant; their left hand picks a shovel and their right hand a chicken. Then you ask them why -- language mostly happens in the left hemisphere -- and they say they picked the shovel because you'd use it for cleaning out the chicken coop.

Of course these are unusual cases. But if confabulation happens so easily and naturally in these cases -- and it seems it does -- it's hard not to suspect that what's happening the rest of the time when we're asked why we do things might also be mostly confabulation, just confabulation that's harder for us and others to notice.

Replies from: skybrian

↑ comment by skybrian · 2023-04-27T15:39:47.028Z · LW(p) · GW(p)

Yes, I agree that confabulation happens a lot, and also that our explanations of why we do things aren't particularly trustworthy; they're often self-serving. I think there's also pretty good evidence that we remember our thoughts at least somewhat, though. A personal example: when thinking about how to respond to someone online, I tend to write things in my head when I'm not at a computer.

comment by Gunnar_Zarncke · 2023-04-27T10:55:38.656Z · LW(p) · GW(p)

This probably means that the answers from LLMs about the reasoning are close to how humans would answer if they were not really paying attention to what they said (and thus formed no short-term memory of it), as explained in Humans Who Are Not Concentrating Are Not General Intelligences [LW · GW].

comment by Matthew_Opitz · 2023-04-27T14:32:27.247Z · LW(p) · GW(p)

Yes, I think this is why laypeople who are new to the field are going to be confused about why interpretability work on LLMs won't be as simple as, "Uhh, obviously, just ask the LLM why it gave that answer, duh!" FYI, I recently wrote about this same topic as applied to the specific problem of Voynich translation:

Bing AI Generating Voynich Manuscript Continuations - It does not know how it knows [LW · GW]

comment by [deleted] · 2023-04-27T07:04:56.470Z · LW(p) · GW(p)

What is the reason by which ChatGPT has not been given any memory? Security/privacy issues? Performance?

I assume that it has to do with performance, or limitations of data transfer in particular? probably both?

Replies from: sanxiyn

↑ comment by sanxiyn · 2023-04-27T08:39:32.846Z · LW(p) · GW(p)

I think the primary difficulty is how to train it. GPT is trained from internet texts, but internet does not record memory of authors of those texts, so memory is unavailable in the training data set.

AI chatbots don't know why they did it

Contents

11 comments

Bing AI Generating Voynich Manuscript Continuations - It does not know how it knows [LW · GW]