Comments sorted by top scores.

comment by ryan_greenblatt · 2023-11-07T19:01:12.275Z · LW(p) · GW(p)

Model Psychology is the study of how LLMs simulate the behavior of the agentic entities[3] they are simulating.

I'm not a huge fan of this definition:

  • Studying models from a black-box perspective (what I'd typically call model psych) applies more generally than just to LLMs
  • This definition assumes a particular frame for thinking about LLMs which seems somewhat misleading to me.
  • We don't just care about the characters/personas that models predict; we might also care about questions like "when does the LLM make mistakes, and what does that imply about the algorithm it's using to predict various things?" (this can hold even in cases where the character/persona is unimportant)
comment by Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-08T09:17:42.586Z · LW(p) · GW(p)

After consideration, I think it makes sense to change the name to Large Language Model Psychology instead of Model Psychology, since the latter is too vague.

comment by Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-07T19:24:39.101Z · LW(p) · GW(p)
  • Maybe it would have been better to call it LLM psychology, indeed. I used this formulation because it seemed to be used quite a lot in the field.

  • In later posts I'll showcase why this framing makes sense; it is quite hard to argue for it without them right now. I'll come back to this comment later.

  • I think the current definition does not exclude this. I am talking about the study of agentic entities and their behaviors, and making a mistake is included in this. Something interesting would be to understand whether all the simulacra are making the same mistake, or whether only some specific simulacra are making it, and what in the context is influencing it.

comment by ryan_greenblatt · 2023-11-07T19:49:16.147Z · LW(p) · GW(p)

I think the current definition does not exclude this. I am talking about the study of agentic entities and their behaviors, and making a mistake is included in this. Something interesting would be to understand whether all the simulacra are making the same mistake, or whether only some specific simulacra are making it, and what in the context is influencing it.

It seems weird to think of it as "the simulacrum making a mistake" in many cases where the model makes a prediction error.

Like suppose I prompt the model with:

[user@computer ~]$ python
Python 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> x = random.random()
>>> y = random.random()
>>> x
0.9818489460280343
>>> y
0.7500874791464012
>>> x + y

And suppose the model gets the wrong answer. Is this the Python simulacrum making a mistake?

(edit: this would presumably work better with a base model, but even non-base models can be prompted to act much more like base models in many cases.)
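
For concreteness, here is a rough sketch of how one might measure this kind of prediction error with an open-weights base model through the Hugging Face transformers library; the choice of gpt2, the sampling settings, and the parsing of the continuation are all just illustrative assumptions:

from transformers import AutoModelForCausalLM, AutoTokenizer

# Illustrative stand-in; any open base model would do here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# The same REPL transcript as above, ending right before the answer.
prompt = (
    "[user@computer ~]$ python\n"
    "Python 3.11.5 (main, Sep  2 2023, 14:16:33) [GCC 13.2.1 20230801] on linux\n"
    'Type "help", "copyright", "credits" or "license" for more information.\n'
    ">>> import random\n"
    ">>> x = random.random()\n"
    ">>> y = random.random()\n"
    ">>> x\n"
    "0.9818489460280343\n"
    ">>> y\n"
    "0.7500874791464012\n"
    ">>> x + y\n"
)

# Ground truth the "Python simulacrum" is supposed to print next.
true_sum = 0.9818489460280343 + 0.7500874791464012  # ~1.7319364251744355

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs, max_new_tokens=25, do_sample=False, pad_token_id=tokenizer.eos_token_id
)
continuation = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)

# Take the first generated line and see how far it is from the true sum.
first_line = continuation.strip().splitlines()[0] if continuation.strip() else ""
try:
    error = abs(float(first_line) - true_sum)
except ValueError:
    error = None  # the model did not even produce a number
print(f"model wrote {first_line!r}; true sum {true_sum}; error {error}")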

comment by Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-08T09:33:53.027Z · LW(p) · GW(p)

After your edit, I think I see the confusion now. I agree that studying Oracle and Tool predictions is interesting, but it is out of scope for LLM Psychology. I chose to narrow my approach down to studying the behaviors of agentic entities, as I think that is where the most interesting questions arise. Maybe I should clarify this in the post.

comment by Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-08T09:58:44.525Z · LW(p) · GW(p)

Note that I've chosen to narrow down my approach to LLM psychology to the agentic entities, mainly because the scary or interesting things to study with a psychological approach are either the behaviors of those entities, or the capabilities they are able to use.


I added this to the Definition. Does it resolve your concerns on this point?

comment by Quentin FEUILLADE--MONTIXI (quentin-feuillade-montixi) · 2023-11-07T21:11:31.033Z · LW(p) · GW(p)

The thing is that, when you ask this of ChatGPT, it is still the ChatGPT simulacrum that is going to answer, not an oracle-style prediction (like you can see in base models). If you want to know the capability of the underlying simulator with chat models, you need to sample a sufficient number of simulacra to be sure that the mistakes come from the simulator's lack of capability and not from the simulacra's preferences (or modes, as Janus calls them). For math, it is often not important to check different simulacra, because each simulacrum tends to share the same math ability (unless you use some other, weirder languages; @Ethan Edwards might be able to jump in here). But for other capabilities (like biology or cooking), changing the simulacrum you interact with does have a big impact on the model's performance. You can see in GPT-4's technical report that language impacts performance a lot, and using another language is one of the ways to modulate the simulacrum you are interacting with.
I'll showcase in the next batch of posts how you can control this a bit more precisely.
Let me know if you'd like more detail.
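
Concretely, the sampling procedure described above could look something like the sketch below; ask_model and looks_correct are hypothetical placeholders standing in for a real chat-model client and a real grading rubric, and the personas and question are made up for illustration:

import random
from collections import defaultdict

def ask_model(persona_prompt: str, question: str) -> str:
    # Placeholder for a real chat-model call (an API client or a local model);
    # it returns canned answers here so the scaffold runs end to end.
    return random.choice([
        "About 1 hour and 15 minutes, basting halfway through.",
        "Roughly 90 minutes.",
        "Environ une heure et quart.",
    ])

def looks_correct(answer: str) -> bool:
    # Placeholder grading rubric; a real experiment would use exact-match
    # checks, a detailed rubric, or a human/model judge.
    return "hour" in answer.lower() or "heure" in answer.lower()

# Different persona prompts (and languages) are one way to steer which
# simulacrum answers the same underlying question.
personas = {
    "default_assistant": "You are a helpful assistant.",
    "biology_professor": "You are a meticulous biology professor.",
    "french_home_cook": "Tu es un cuisinier amateur français.",
}
question = "Roughly how long should a 1.5 kg chicken roast at 200°C?"
n_samples = 20

scores = defaultdict(list)
for name, persona in personas.items():
    for _ in range(n_samples):
        scores[name].append(looks_correct(ask_model(persona, question)))

for name, results in scores.items():
    print(f"{name}: {sum(results) / len(results):.0%} judged correct over {len(results)} samples")

# If every persona fails at a similar rate, the limitation plausibly lies in the
# simulator; if only some personas fail, it looks more like simulacrum preference.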

comment by Fabien Roger (Fabien) · 2023-11-07T17:21:46.244Z · LW(p) · GW(p)

What I'd like to see in the coming posts:

  • A stated definition of the goals of the model psych research you're doing
  • An evaluation of the hypotheses you're presenting in terms of those goals
  • Controlled measurements (not only examples)

For example, from the previous post on stochastic parrots, I infer that one of your goals is predicting what capabilities models have. If that is the case, then the evaluation should be "given a specific range of capabilities, predict which of them models will and won't have", and the list should be established before any measurement is made, and maybe even before a model is trained/released (since this is where these predictions would be the most useful for AI safety: I'd love to know whether the deadly model is GPT-4 or GPT-9).
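
As a rough illustration of what such pre-registered predictions could look like, the sketch below records predicted probabilities before any measurement and scores them with a Brier score afterwards; all capability names and numbers are made up:

# Probabilities written down before evaluating (or even training) the model.
preregistered = {
    "3-digit multiplication": 0.9,
    "write a working quine": 0.6,
    "track 5 characters' beliefs in a story": 0.3,
}

# Filled in only after the controlled measurements have been run.
measured = {
    "3-digit multiplication": True,
    "write a working quine": True,
    "track 5 characters' beliefs in a story": False,
}

# Brier score: mean squared error between predicted probability and outcome
# (0 is perfect; always guessing 50% gives 0.25).
brier = sum((p - float(measured[cap])) ** 2 for cap, p in preregistered.items()) / len(preregistered)

for cap, p in preregistered.items():
    print(f"{cap}: predicted {p:.0%}, observed {measured[cap]}")
print(f"Brier score: {brier:.3f}")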

I don't know much about human psych, but it seems to me that it is most useful when it describes some behavior quantitatively with controlled predictions (à la CBT), and not when it does qualitative analysis based on personal experience with the subject (à la Freud).