What is the best argument that LLMs are shoggoths?

post by JoshuaFox · 2024-03-17T11:36:23.636Z · LW · GW · 22 comments

Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered "friendly" by RLHF?

To be clear, I am not asking about the following, which I am familiar with:

Rather, I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying conditions very different from the training environment it will behave very differently. A good argument might analyze bits of weird, inhuman behavior to try to infer the internal model.

(All I found on the shoggoth idea on LessWrong is this article [LW · GW], which contrasts the idea of the shoggoth with the idea that there is no coherent model, but does not explain why we might think that there is an alien cognitive model. This one [LW · GW] likewise mentions the idea but does not argue for its correctness.)

[Edit: Another user corrected my spelling: shoggoth, not shuggoth.]

22 comments

Comments sorted by top scores.

comment by ryan_greenblatt · 2024-03-18T02:47:41.591Z · LW(p) · GW(p)

SOTA LLMs seem to be wildly, wildly better than humans at literal next token prediction [LW · GW].

It's unclear if this implies fundamental differences in how they work versus different specializations.

(It's possible that humans could be trained to be much better at next token prediction, but there isn't an obvious methodology which works for this based on initial experiments.)
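
For concreteness, a minimal sketch of how one could measure the relevant quantity, top-1 next-token accuracy, using the Hugging Face transformers library with GPT-2 as a small stand-in model (the passage and model choice here are illustrative assumptions, not anything taken from the linked post):

```python
# Minimal sketch: measure a model's top-1 next-token accuracy on a passage.
# Assumes the Hugging Face `transformers` library; GPT-2 is a small stand-in,
# not a SOTA model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

text = "Replace this with any held-out passage you want to test on."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits

predictions = logits[0, :-1].argmax(dim=-1)  # prediction for token t+1 from the prefix up to t
targets = input_ids[0, 1:]
accuracy = (predictions == targets).float().mean().item()
print(f"top-1 next-token accuracy: {accuracy:.2%}")
```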

Replies from: JoshuaFox
comment by JoshuaFox · 2024-03-18T12:26:54.065Z · LW(p) · GW(p)

Thank you. 

> It's unclear if this implies fundamental differences in how they work versus different specializations.

Correct. That article argues that LLMs are more powerful than humans at this skill, but not that they have different (implicit) goal functions or that their cognitive architecture is deeply different from a human's.

comment by 1a3orn · 2024-03-18T16:02:43.349Z · LW(p) · GW(p)

For a back and forth on whether the "LLMs are shoggoths" framing is propaganda, try reading this [LW(p) · GW(p)].

In my opinion if you read the dialogue, you'll see the meaning of "LLMs are shoggoths" shift back and forth -- from "it means LLMs are psychopathic" to "it means LLMs think differently from humans." There isn't a fixed meaning.

I don't think trying to disentangle the "meaning" of shoggoths is going to result in anything; it's a metaphor, some of whose readings are obviously true ("we don't understand all the cognition in LLMs") and some of which are dubious ("LLMs' 'true goals' exist, and are horrific and alien"). But regardless of the truth of these propositions, you do better examining them one by one than through an emotionally loaded image.

It's sticky because it's vivid, not because it's clear; it's reached for as a metaphor -- like "this government policy is like 1984" -- because it's a ready-to-hand example with an obvious emotional valence, not for any other reason.

If you were to try to zoom into "this policy is like 1984" you'd find nothing; so also here.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-03-18T01:12:40.221Z · LW(p) · GW(p)

Can you say more about what you mean by "Where can I find a post or article arguing that the internal cognitive model of contemporary LLMs is quite alien, strange, non-human, even though they are trained on human text and produce human-like answers, which are rendered "friendly" by RLHF?"

Like, obviously it's gonna be alien in some ways and human-like in other ways. Right? How similar does it have to be to humans, in order to count as not an alien? Surely you would agree that if we were to do a cluster analysis of the cognition of all humans alive today + all LLMs, we'd end up with two distinct clusters (the LLMs and then humanity) right? 

Replies from: JoshuaFox
comment by JoshuaFox · 2024-03-18T12:24:59.042Z · LW(p) · GW(p)


> Like, obviously it's gonna be alien in some ways and human-like in other ways. Right?

It has been said that since LLMs predict human output, they will, if sufficiently improved, be quite human -- that they will behave in a quite human way.

> Can you say more about what you mean by "Where can I find a post

As part of a counterargument to that, we could find evidence that their logical structure is quite different from humans. I'd like to see such a write-up.

> Surely you would agree that if we were to do a cluster analysis of the cognition of all humans alive today + all LLMs, we'd end up with two distinct clusters (the LLMs and then humanity) right?

I agree, but I'd like to see some article or post arguing that.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-03-18T14:29:25.019Z · LW(p) · GW(p)

OK, thanks.

Your answer to my first question isn't really an answer -- "they will, if sufficiently improved, be quite human--they will behave in a quite human way." What counts as "quite human?" Also are we just talking about their external behavior now? I thought we were talking about their internal cognition.

You agree about the cluster analysis thing though -- so maybe that's a way to be more precise about this. The claim you are hoping to see argued for is "If we magically had access to the cognition of all current humans and LLMs, with mechinterp tools etc. to automatically understand and categorize it, and we did a cluster analysis of the whole human+LLM population, we'd find that there are two distinct clusters: the human cluster and the LLM cluster."

Is that right?

If so, then here's how I'd make the argument. I'd enumerate a bunch of differences between LLMs and humans, differences like "LLMs don't have bodily senses" and "LLMs experience way more text over the course of their training than humans experience in their lifetimes" and "LLMs have way fewer parameters" and "LLMs' internal learning rule is SGD whereas humans use Hebbian learning or whatever" and so forth, and then for each difference say "this seems like the sort of thing that might systematically affect what kind of cognition happens, to an extent greater than typical intra-human differences like skin color, culture-of-childhood, language-raised-with, etc."

Then I'd add it all up and be like "even if we are wrong about a bunch of these claims, it still seems like overall the cluster analysis is gonna keep humans and LLMs apart instead of mingling them together. Like what the hell else could it do? Divide everyone up by language maybe, and have primarily-English LLMs in the same cluster as humans raised speaking English, and then non-English speakers and non-English LLMs in the other cluster? That's probably my best guess as to how else the cluster analysis could shake out, and it doesn't seem very plausible to me -- and even if it were true, it would be true on the level of 'what concepts are used internally' rather than more broadly about stuff that really matters, like what the goals/values/architecture of the system is (i.e. how the concepts are used)."
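
Purely as an illustration of that framing, here is a minimal sketch with entirely synthetic placeholder "cognition features" standing in for whatever mechinterp would actually extract; the feature values and population sizes are arbitrary assumptions:

```python
# Illustration only: the cluster-analysis framing, with synthetic placeholder
# "cognition features" standing in for whatever mechinterp would actually extract.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical feature vectors: one row per mind, one column per feature.
humans = rng.normal(loc=0.0, scale=1.0, size=(200, 16))
llms = rng.normal(loc=3.0, scale=1.0, size=(50, 16))  # systematically shifted
features = np.vstack([humans, llms])
labels = np.array([0] * len(humans) + [1] * len(llms))  # 0 = human, 1 = LLM

clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(features)

# If the unsupervised clustering recovers the human/LLM split, agreement is ~1.0
# (taking the max handles the arbitrary labeling of the two clusters).
agreement = max((clusters == labels).mean(), (clusters != labels).mean())
print(f"cluster/label agreement: {agreement:.2%}")
```

The substantive claim would be that real cognition features, if we could extract them, would separate humans and LLMs at least this cleanly.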

comment by rotatingpaguro · 2024-03-18T01:00:15.778Z · LW(p) · GW(p)

First thoughts:

  • Context length is insanely long
  • Very good at predicting the next token
  • Knows many more abstract facts

These three things are all instances of being OOM better at something specific. If you consider the LLM somewhat human-level at the thing it does, this suggests that it's doing it in a way which is very different from what a human does.

That said, I'm not confident about this; I can sense there could be an argument that this counts as human but ramped up on some stats, and not an alien shoggoth.

comment by Thomas Kwa (thomas-kwa) · 2024-03-17T19:49:54.721Z · LW(p) · GW(p)

> Rather, I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying conditions very different from the training environment it will behave very differently. A good argument might analyze bits of weird, inhuman behavior to try to infer the internal model.

I think we do not understand enough about either LLMs' true algorithms or humans' to make such arguments, except for basic observations like the fact that humans have non-language recurrent state, which many LLMs lack.

comment by quetzal_rainbow · 2024-03-17T14:58:56.279Z · LW(p) · GW(p)

I wouldn't say that's exactly the best argument, but for example:

Replies from: 1a3orn, JoshuaFox, M. Y. Zuo
comment by 1a3orn · 2024-03-17T20:34:32.829Z · LW(p) · GW(p)

As you said, this seems like a pretty bad argument.

Something is going on between the {user instruction} ... {instruction to the image model}. But we don't even know if it's in the LLM. It could be that there are dumb manual "if" parsing statements that act differently depending on periods, etc., etc. It could be that there are really dumb instructions given to the LLM that creates the instructions for the image model, as there were for Gemini. So, yeah.

comment by JoshuaFox · 2024-03-17T17:01:05.365Z · LW(p) · GW(p)

That is good, thank you.

comment by M. Y. Zuo · 2024-03-17T18:25:34.374Z · LW(p) · GW(p)

That seems to be an argument for something more than random noise going on, but not an argument for 'LLMs are shoggoths'?

Replies from: quetzal_rainbow
comment by quetzal_rainbow · 2024-03-17T18:43:48.137Z · LW(p) · GW(p)

Definition given in post: 

> I am looking for a discussion of evidence that the LLM's internal "true" motivation or reasoning system is very different from a human's, despite the human-like output, and that in outlying conditions very different from the training environment it will behave very differently.

I think my example counts.

comment by Charlie Steiner · 2024-03-17T14:39:34.151Z · LW(p) · GW(p)

I'm not totally sure the hypothesis is well-defined enough to argue about, but maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?

If the internals were like a human thinking about the question and then giving an answer, the model would probably be able to add numbers more reliably. And I also suspect the pattern of mistakes doesn't look typical for a human at any developmental stage (once a human can add 3-digit numbers, their success rate at 5-digit numbers is probably pretty good). I vaguely recall some people looking at this, but have forgotten the reference, sorry.
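
A minimal sketch of such a probe; `query_model` is a hypothetical placeholder for whatever completion API you have access to, and the digit counts and trial counts are arbitrary:

```python
# Sketch of probing how addition accuracy falls off with operand length.
# `query_model` is a hypothetical placeholder for whatever LLM API you use.
import random
import re

def query_model(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM API call here")

def addition_accuracy(n_digits: int, trials: int = 50) -> float:
    correct = 0
    for _ in range(trials):
        a = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        b = random.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
        reply = query_model(f"What is {a} + {b}? Answer with only the number.")
        match = re.search(r"-?\d+", reply.replace(",", ""))
        if match and int(match.group()) == a + b:
            correct += 1
    return correct / trials

for n in (3, 5, 10, 20):
    print(f"{n}-digit addition accuracy: {addition_accuracy(n):.0%}")
```

A human-like pattern would keep accuracy high once the carrying procedure is learned, whereas LLM accuracy reportedly decays as the operands get longer.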

Replies from: JoshuaFox, MinusGix
comment by JoshuaFox · 2024-03-18T12:27:42.709Z · LW(p) · GW(p)

> maybe Gary Marcus-esque analysis of the pattern of LLM mistakes?

That is good. Can you recommend one?

comment by MinusGix · 2024-03-17T16:35:18.637Z · LW(p) · GW(p)

I believe a significant chunk of the issue with numbers is that the tokenization is bad (not per-digit), which is the same underlying cause as being bad at spelling. So the model has to memorize from limited examples which actual digits make up each number token. The xVal paper encodes the numbers as literal numbers, which helps. There's also Teaching Arithmetic to Small Transformers, which I remember only partially, but one of the things they do is per-digit tokenization and reversing the digit order (because that works better with forward generation). (I don't know if anyone has applied methods in this vein to a larger model than those relatively small ones; I think the second uses ~124M parameters.)
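
For illustration, a minimal sketch (assuming the tiktoken library is installed) showing that a common tokenizer does not split numbers into individual digits:

```python
# Minimal check of how a common tokenizer splits numbers (assumes `pip install tiktoken`).
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer family used by recent OpenAI models

for s in ["12345", "12345 + 67890 =", "strawberry"]:
    pieces = [enc.decode([tok]) for tok in enc.encode(s)]
    print(f"{s!r} -> {pieces}")
# "12345" typically comes out as multi-digit chunks (e.g. "123", "45"),
# not as five separate digit tokens.
```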

Though I agree that there's a bunch of errors LLMs make that are hard for them to avoid because they have no easy temporary scratchpad-like mechanism.

Replies from: Charlie Steiner
comment by Charlie Steiner · 2024-03-18T02:45:45.365Z · LW(p) · GW(p)

They can certainly use answer text as a scratchpad (even nonfunctional text that gives more space for hidden activations to flow). But they don't without explicit training. Actually, maybe they do -- maybe RLHF incentivizes a verbose style to give more room for thought. But I think even with "thinking step by step," there are still plenty of issues.

Tokenization is definitely a contributor. But that doesn't really support the notion that there's an underlying human-like cognitive algorithm behind human-like text output. The point is the way it adds numbers is very inhuman, despite producing human-like output on the most common/easy cases.

Replies from: MinusGix
comment by MinusGix · 2024-03-18T04:43:28.643Z · LW(p) · GW(p)

I definitely agree that it doesn't give reason to support a human-like algorithm, I was focusing in on the part about adding numbers reliably.

comment by Gurkenglas · 2024-03-17T22:18:00.154Z · LW(p) · GW(p)

If Earth had multiple intelligent species with different minds, an LLM could end up identical to a member of at most one of them.

comment by Shankar Sivarajan (shankar-sivarajan) · 2024-03-17T19:08:37.872Z · LW(p) · GW(p)

Does something like the "I have been a good Bing. 😊" thing count? (More examples.)

I'd say that's a pretty striking illustration that under the surface of a helpful assistant (in the vein of Siri et al.) these things are weird, and the shoggoth is a good metaphor.

Replies from: JoshuaFox
comment by JoshuaFox · 2024-03-17T19:18:06.810Z · LW(p) · GW(p)

Thank you. But being manipulative, silly, sycophantic, or nasty is pretty human. I am looking for hints of a fundamentally different cognitive architecture.

comment by red75prime · 2024-03-18T09:02:41.797Z · LW(p) · GW(p)

First, a factual statement that is true to the best of my knowledge: the LLM state that is used to produce the probability distribution for the next token is completely determined by the contents of its input buffer (plus a bit of nondeterminism due to parallel processing and the non-associativity of floating-point arithmetic).

That is, an LLM can pass only a single token (around 2 bytes) to its future self at each generation step. That follows from the above.
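
A minimal sketch of that factual point, using the Hugging Face transformers library with GPT-2 as a stand-in model: the loop recomputes everything from the token buffer at every step, so the only thing the model hands to its "future self" is the one token it just appended.

```python
# Minimal sketch: the only state carried between generation steps is the token
# buffer itself. Assumes the Hugging Face `transformers` library; GPT-2 is a
# small stand-in model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

input_ids = tokenizer("The shoggoth metaphor says that", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):
        logits = model(input_ids).logits      # recomputed from the buffer alone
        next_id = logits[0, -1].argmax()      # greedy choice of the next token
        # The only information passed to the model's "future self" is this one
        # token id appended to the buffer; no other state persists across steps.
        input_ids = torch.cat([input_ids, next_id.view(1, 1)], dim=1)

print(tokenizer.decode(input_ids[0]))
```

(Real implementations cache attention keys and values for speed, but that cache is a deterministic function of the buffer, so it adds no extra channel of state.)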

What comes next is a plausible (to me) speculation.

For humans, what's passed to our future self is most likely much more than a single token. That is, the state of the human brain that leads to writing (or uttering) the next word most likely cannot be derived from a small subset of the previous state plus the last written word (that is, the state of the brain changes not only because we have written or said a word, but by other means too).

This difference can lead to completely different processes that an LLM uses to mimic human output, that is, potential shoggothification. But to be a real shoggoth, the LLM also needs a way to covertly update its shoggoth state, that is, the part of its state that can lead to inhuman behavior. The output buffer is the only thing it has to maintain state, so the shoggoth state would have to be steganographically encoded in it, which severely limits its information density and update rate.

I wonder how a shoggoth state could arise at all, but that might be my lack of imagination.