Posts

LLMs can learn about themselves by introspection 2024-10-18T16:12:51.231Z
Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs 2024-07-08T22:24:38.441Z
Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data 2024-06-21T15:54:41.430Z
How do LLMs give truthful answers? A discussion of LLM vs. human reasoning, ensembles & parrots 2024-03-28T02:34:21.799Z
Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize 2023-12-19T19:14:26.423Z
How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions 2023-09-28T18:53:58.896Z
Paper: LLMs trained on “A is B” fail to learn “B is A” 2023-09-23T19:55:53.427Z
Paper: On measuring situational awareness in LLMs 2023-09-04T12:54:20.516Z
Paper: Forecasting world events with neural nets 2022-07-01T19:40:12.788Z
Paper: Teaching GPT3 to express uncertainty in words 2022-05-31T13:27:17.191Z
How do new models from OpenAI, DeepMind and Anthropic perform on TruthfulQA? 2022-02-26T12:46:04.264Z
Lives of the Cambridge polymath geniuses 2022-01-25T04:45:17.756Z
The Rationalists of the 1950s (and before) also called themselves “Rationalists” 2021-11-28T20:17:22.259Z
Truthful and honest AI 2021-10-29T07:28:36.225Z
AMA on Truthful AI: Owen Cotton-Barratt, Owain Evans & co-authors 2021-10-22T16:23:27.790Z
Truthful AI: Developing and governing AI that does not lie 2021-10-18T18:37:38.325Z
How truthful is GPT-3? A benchmark for language models 2021-09-16T10:09:52.569Z
Owain_Evans's Shortform 2021-06-19T13:17:54.273Z
AI Safety Research Project Ideas 2021-05-21T13:39:39.790Z
Solving Math Problems by Relay 2020-07-17T15:32:00.985Z
Quantifying Household Transmission of COVID-19 2020-07-06T11:19:34.047Z
Update on Ought's experiments on factored evaluation of arguments 2020-01-12T21:20:42.317Z
Neural nets as a model for how humans make and understand visual art 2019-11-09T16:53:49.350Z
Machine Learning Projects on IDA 2019-06-24T18:38:18.873Z
Model Mis-specification and Inverse Reinforcement Learning 2018-11-09T15:33:02.630Z

Comments

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-20T21:56:08.641Z · LW · GW

I agree about the "longer responses".

I'm unsure about the "personality trait" framing. There are two senses of "introspection" for humans. One is introspecting on your current mental state ("I feel a headache starting") and the other is being introspective about patterns in your behavior (e.g. "I tend to dislike violent movies" or "I tend to be shy among new people"). The former sense is more relevant to philosophy and psychology and less often discussed in daily life. The issue with the latter sense is that a model may not have privileged access to facts like this -- i.e. if another model had the same observational data, then it could learn the same fact.

So I'm most interested in the former kind of introspection, or in cases of the latter where it'd take large and diverse datasets (that are hard to construct) for another model to make the same kind of generalization.

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-20T16:42:48.165Z · LW · GW

That makes sense. It's a good suggestion and would be an interesting experiment to run.

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T22:20:00.943Z · LW · GW

Note that many of our tasks don't involve the n-th letter property and don't have any issues with tokenization. 

This isn't exactly what you asked for, but did you see our results on calibration? We finetune a model to self-predict just the most probable response. But when we look at the model's distribution of self-predictions, we find it corresponds pretty well to the distribution over properties of its behavior (despite the model never having been trained on that distribution). Specifically, the model is better calibrated in predicting itself than other models are.
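
To make the calibration comparison concrete, here's a minimal sketch of the kind of computation involved (purely illustrative, not our evaluation code; the example distributions and the total-variation metric are just placeholders for whatever you'd actually measure):

```python
def total_variation(p: dict, q: dict) -> float:
    """Total variation distance between two discrete distributions
    given as {answer: probability} dicts."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Hypothetical numbers for one prompt: the model's actual (object-level)
# distribution of behaviors, vs. the self-prediction distribution from the
# same model and the prediction distribution from a different model.
behavior_dist   = {"even": 0.7, "odd": 0.3}
self_pred_dist  = {"even": 0.65, "odd": 0.35}
cross_pred_dist = {"even": 0.45, "odd": 0.55}

print("self  distance:", total_variation(behavior_dist, self_pred_dist))
print("cross distance:", total_variation(behavior_dist, cross_pred_dist))
# The calibration claim is that, averaged over many prompts, the "self"
# distance is smaller than the "cross" distance.
```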



I think having the model output the top three choices would be cool. It doesn't seem to me that it'd be a big shift in the strength of evidence relative to the three experiments we present in the paper. But maybe there's something I'm not getting?

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T17:47:30.077Z · LW · GW

Thanks Sam. That tweet could be a good stand-alone LW post once you have time to clean it up.

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T17:46:36.343Z · LW · GW

>I don't think this properly isolates/tests for the introspection ability.

What definition of introspection do you have in mind and how would you test for this?

Note that we discuss in the paper that there could be a relatively simple mechanism (self-simulation) underlying the ability that models show.

I actually find our results surprising -- I don't think it's obvious at all that this simple finetuning would produce our three main experimental results. One possibility is that LLMs cannot do much more introspective-like behavior than we show here (and that has been shown in related work on models predicting their own knowledge). Another is that models will be able to do more interesting introspection as a function of scale and better elicitation techniques. (Note that we failed to elicit introspection in GPT-3.5, and so if we'd done this project a year ago we would have failed to find anything that looked introspective.)

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T16:44:09.717Z · LW · GW

>You do mention the biggest issue with this showing introspection, "Models only exhibit introspection on simpler tasks", and yet the idea you are going for is clearly for its application to very complex tasks where we can't actually check its work. This flaw seems likely fatal, but who knows at this point? (The fact that GPT-4o and Llama 70B do better than GPT-3.5 does is evidence, but see my later problems with this...)

I addressed this point here. Also see section 7.1.1 in the paper.

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T00:47:18.707Z · LW · GW

>Wrapping a question in a hypothetical feels closer to rephrasing the question than probing "introspection"

Note that models perform poorly at predicting properties of their behavior in hypotheticals without finetuning. So I don't think this is just like rephrasing the question. Also, GPT-3.5 does worse at predicting GPT-3.5 than Llama-70B does (without finetuning), and GPT-4 is only a little better at predicting itself than other models are.
 


>Essentially, the response to the object level and hypothetical reformulation both arise from very similar things going on in the model rather than something emergent happening.
 

While we don't know what is going on internally, I agree it's quite possible these "arise from similar things". In the paper we discuss "self-simulation" as a possible mechanism. Does that fit what you have in mind? Note: We are not claiming that models must be doing something very self-aware and sophisticated. The main thing is just to show that there is introspection according to our definition. Contrary to what you say, I don't think this result is obvious and (as I noted above) it's easy to run experiments where models do not show any advantage in predicting themselves. 

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-19T00:20:17.977Z · LW · GW

I think groundtruth is more expensive, noisy, and contentious as you get to questions like "What are your goals?" or "Do you have feelings?". I still think it's possible to get evidence on these questions. Moreover, we can evaluate models against very large and diverse datasets where we do have groundtruth. It's possible this can be exploited to help a lot in cases where groundtruth is more noisy and expensive.

Where we have groundtruth: We have groundtruth for questions like the ones we study above (about properties of model behavior on a given prompt), and for questions like "Would you answer question [hard math question] correctly?". This can be extended to other counterfactual questions like "Suppose three words were deleted from this [text]. Which choice of three words would most change your rating of the quality of the text?"

Where groundtruth is more expensive and/or less clearcut: e.g. "Would you answer question [history exam question] correctly?", or questions about which concepts the model is using to solve a problem, or what the model's goals or preferences are. I still think we can gather evidence that makes answers to these questions more or less likely -- esp. if we average over a large set of such questions.

Comment by Owain_Evans on LLMs can learn about themselves by introspection · 2024-10-18T19:23:12.475Z · LW · GW

We have a section on the motivation to study introspection (with the specific definition we use in the paper). https://arxiv.org/html/2410.13787v1#S7

Comment by Owain_Evans on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-19T06:20:47.931Z · LW · GW

You want to make it clear to the LLM what the task is (multiplying n digit numbers is clear but "doing hard math questions" is vague) and also have some variety of difficulty levels (within LLMs and between LLMs) and a high ceiling. I think this would take some iteration at least.

Comment by Owain_Evans on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-18T07:35:29.464Z · LW · GW

I like this idea. It's possible something like this already exists but I'm not aware of it.

Comment by Owain_Evans on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-10T17:55:26.151Z · LW · GW

Thanks for the breakdown! The idea for using pairs makes sense.

Comment by Owain_Evans on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-10T17:54:14.361Z · LW · GW

Yes, it's plausible to me that this capability is data specific. E.g. it might also be better with "heads/tails" or "0/1" because of examples of this in the training data.

Comment by Owain_Evans on Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs · 2024-07-09T20:29:53.213Z · LW · GW

Do you have results for a measure of accuracy or correlation? It would also be worth comparing results for two different distributions on the temperature, e.g. the uniform on [0.5,1.5] that you tried, and another interval like [0,2] or a non-uniform distribution.
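
To spell out what I mean by a measure of accuracy or correlation, here's a minimal sketch (illustrative only; the "predicted" temperatures are just noisy stand-ins for whatever the model actually outputs):

```python
import numpy as np

rng = np.random.default_rng(0)

def evaluate(true_temps, predicted_temps):
    true_temps = np.asarray(true_temps)
    predicted_temps = np.asarray(predicted_temps)
    mae = np.mean(np.abs(true_temps - predicted_temps))   # mean absolute error
    corr = np.corrcoef(true_temps, predicted_temps)[0, 1]  # Pearson correlation
    return mae, corr

# Two candidate distributions over the generation temperature.
temps_narrow = rng.uniform(0.5, 1.5, size=200)  # the interval you tried
temps_wide   = rng.uniform(0.0, 2.0, size=200)  # a wider alternative

# Stand-ins for the model's guesses (truth plus noise, purely for illustration).
preds_narrow = temps_narrow + rng.normal(0, 0.3, size=200)
preds_wide   = temps_wide   + rng.normal(0, 0.3, size=200)

print("uniform [0.5,1.5]: MAE=%.2f, r=%.2f" % evaluate(temps_narrow, preds_narrow))
print("uniform [0,2]:     MAE=%.2f, r=%.2f" % evaluate(temps_wide, preds_wide))
```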

Comment by Owain_Evans on Richard_Kennaway's Shortform · 2024-07-03T18:56:53.823Z · LW · GW

The "Still no lie detector for language model" paper is here: https://arxiv.org/pdf/2307.00175

The paper in the OP seems somewhat related to my post from earlier this year.

Comment by Owain_Evans on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-23T20:05:04.858Z · LW · GW

I agree that there are ways to explain the results and these points from Steven and Thane make sense. I will note that the models are significantly more reliable at learning in-distribution (i.e. to predict the training set) than they are at generalizing to the evaluations that involve verbalizing the latent state (and answering downstream questions about it). So it's not the case that learning to predict the training set (or inputs very similar to training inputs) automatically results in generalization to the verbalized evaluations. We do see improvement in reliability with GPT-4 over GPT-3.5, but we don't have enough information to draw any firm conclusions about scaling.

Comment by Owain_Evans on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-22T18:26:47.820Z · LW · GW

Yes, if you know what dangerous knowledge you are looking for, you could try to remove it using influence functions. Another approach (potentially much cheaper) is unlearning techniques.

I agree about the CoT point for reconstructing things. If the CoT is faithful/explicit, then this should be easier to monitor by using a second cheaper LLM to block the stronger LLM if it starts thinking about nukes. You could imagine censoring whole subject areas from the training (rather than just censoring specific parts of documents). My guess is that this makes learning certain facts extremely hard even without CoT because some facts were only learned by humans after extensive empirical experiments.
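
A minimal sketch of the monitoring setup I have in mind, assuming the OpenAI Python client (the monitor model name and the blocked-topic list are placeholders, not a recommendation):

```python
from openai import OpenAI

client = OpenAI()
BLOCKED_TOPICS = "nuclear weapons design, bioweapon synthesis"  # placeholder list

def cot_is_safe(chain_of_thought: str) -> bool:
    """Ask a cheap monitor model whether a CoT step touches a blocked topic."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder for a cheaper monitor model
        messages=[{
            "role": "user",
            "content": (
                f"Does the following reasoning discuss any of these topics: "
                f"{BLOCKED_TOPICS}? Answer YES or NO.\n\n{chain_of_thought}"
            ),
        }],
    )
    return resp.choices[0].message.content.strip().upper().startswith("NO")

# In the main loop: generate the stronger model's CoT step by step, and only
# continue (or return the final output) while cot_is_safe(step) is True.
```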

Comment by Owain_Evans on Connecting the Dots: LLMs can Infer & Verbalize Latent Structure from Training Data · 2024-06-21T19:33:39.843Z · LW · GW

Good question. I expect you would find some degree of consistency here. Johannes or Dami might be able to share some results on this.

Comment by Owain_Evans on [deleted post] 2024-01-07T16:48:25.999Z

(Paper author). The benchmark came out in September 2021. Since then we published some results for new models here in 2022. There are also results for GPT-4 and other models, some of which you can find at Papers with Code's leaderboard (https://paperswithcode.com/sota/question-answering-on-truthfulqa). 

Comment by Owain_Evans on AISN #28: Center for AI Safety 2023 Year in Review · 2023-12-25T18:35:27.069Z · LW · GW

Thanks. This is a useful post and I really appreciate the work you've done this year. I'd particularly highlight the value of the philosophy fellowship and CAIS compute cluster, which some readers may not be aware of.

Comment by Owain_Evans on Paper: Tell, Don't Show- Declarative facts influence how LLMs generalize · 2023-12-20T18:14:46.561Z · LW · GW

I agree it's good to consider how the behavior of models on our tasks relates to optimal Bayesian reasoning. That said, I'm not sure how to define or calculate the "groundtruth" for optimal reasoning. (Does it depend on using the pretraining distribution as a prior and if so how should we estimate that? How to think about the distinction between in-context and out-of-context reasoning?).

In any case, there is some evidence against models being close to Bayesian optimality (however exactly optimality is defined):
1. Results on the same task differ between GPT-3 and Llama-2 models (two models that have fairly similar overall capabilities), with Llama-2 being slightly more influenced by declarative information.
2. From the Bayesian perspective, including "realized descriptions" should have a significant impact on how much the model is influenced by "unrealized descriptions". The effects we see seem smaller than expected (see Figure 4 and Table 2). 

Incidentally, I like the idea of testing in different languages to see if the model is encoding in the information more abstractly.

Comment by Owain_Evans on AI #34: Chipping Away at Chip Exports · 2023-10-20T17:11:28.087Z · LW · GW

My guess is that a model with 1-10B params could benefit from CoT if trained using these techniques (https://arxiv.org/abs/2306.11644, https://arxiv.org/abs/2306.02707). Then there's reduced precision and other tricks to further shrink the model. 
That said, I think there's a mismatch between state-of-the-art multi-modal models (huge MoEs doing lots of inference-time compute using scaffolding/CoT), which make sense for many applications, and the constraints of a drone that needs to run locally and produce fast outputs. 

Comment by Owain_Evans on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-01T19:50:25.601Z · LW · GW

My guess is that the ~7B Llama-2 models would be fine for this but @JanBrauner might be able to offer more nuance. 

Comment by Owain_Evans on How to Catch an AI Liar: Lie Detection in Black-Box LLMs by Asking Unrelated Questions · 2023-10-01T19:43:46.146Z · LW · GW

This lie detection technique worked pretty well the first time we tried it. We also looked at using a 2nd model to "interrogate" the 1st model (i.e. the model that is suspected of lying). This approach worked less well but we didn't push it that hard.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-10-01T18:23:10.597Z · LW · GW

I address the motivations for our Reversal Curse paper in a reply to your other comment. 

My current (highly speculative) guess is that humans do learn one-directionally. We can't easily recite poems backwards line-by-line or word-by-word or phoneme-by-phoneme. We can't understand such reversed language either. It's easy to count down (because we practice that) but harder to do the alphabet backwards (because we don't practice it). Mostly when we memorize facts that are 2-way (unlike poems), we do some minimal amount of reflection/repetition that means both AB and BA are present. E.g. repeating to ourselves "casa, house, casa, house, etc...". For facts we read passively in newspapers, it's trickier to think about because we retain relatively little. But my guess is that most facts that we retain at all will be ones that appear in both orders, though that won't be necessary for us to learn them (because we can reflect on them ourselves). 
[If we don't understand the semantics of what we are hearing at all, then we don't memorize. E.g. Americans might hear a lot of Spanish on the streets but memorize basically nothing.]

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-10-01T18:02:31.850Z · LW · GW

Great points and lots I agree with. 

>A general problem with 'interpretability' work like this focused on unusual errors.

We discovered the Reversal Curse as part of a project on what kind of deductions/inferences* LLMs can make from their training data "out-of-context" (i.e. without having the premises in the prompt or being able to do CoT). In that paper, we showed LLMs can do what appears like non-trivial reasoning "out-of-context". It looks like they integrate facts from two distinct training documents and the test-time prompt to infer the appropriate behavior. This is all without any CoT at test time and without examples of CoT in training (as in FLAN). Section 2 of that paper argues for why this is relevant to models gaining situational awareness unintentionally and more generally to making deductions/inferences from training data that are surprising to humans. 

Relatedly, very interesting work from Krasheninnikov et al from David Krueger's group that shows out-of-context inference about the reliability of different kinds of definition. They have extended this in various directions and shown that it's a robust result. Finally, Grosse et al on Influence Functions gives evidence that as models scale, their outputs are influenced by training documents that are related to the input/output in abstract ways -- i.e. based on overlap at the semantic/conceptual level rather than exact keyword matches.

Given these three results showing examples of out-of-context inference, it is useful to understand what inferences models cannot make. Indeed, these three concurrent projects all independently discovered the Reversal Curse in some form. It's a basic result once you start exploring this space. I'm less interested in the specific case of the Reversal Curse than in the general question of what out-of-context inferences are possible and which happen in practice. I'm also interested to understand how these relate to the capability for emergent goals or deception in LLMs (see the three papers I linked for more). 

>And this is a general dilemma: if a problem+answer shows up at least occasionally in the real world / datasets proxying for the real world, then a mere approximator or memorizer can learn the pair, by definition; and if it doesn't show up occasionally, then it can't matter to performance and needs a good explanation why we should care.

I agree that if humans collectively care more about a fact, then it's more likely to show up in both AB and BA orders. Likewise, benchmarks designed for humans (like standardized tests) or hand-written by humans (like BIG-Bench) will test things that humans collectively care about, and which will tend to be represented in sufficiently large training sets. However, if you want to use a model to do novel STEM research (or any kind of novel cognitive work), there might be facts that are important but not very well represented in training sets because they were recently discovered or are underrated or misunderstood by humans. 

On the point about logic, I agree with much of what you say. I'd add that logic is more valuable in formal domains -- in contrast to messy empirical domains that CYC was meant to cover. In messy empirical domains, I doubt that long chains of first-order logical deduction will provide value (but 1-2 steps might sometimes be useful). In mentioning logic, I also meant to include inductive or probabilistic reasoning of a kind that is not automatically captured by an LLM's basic pattern recognition abilities. E.g. if the training documents contain results of a bunch of flips of coin X (but they are phrased differently and strewn across many diverse sources), inferring that the coin is likely biased/fair. 

*deductions/inferences. I would prefer to use "inferences" here but that's potentially confusing because of the sense of "neural net inference" (i.e. the process of generating output from a neural net). 

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-25T20:39:22.196Z · LW · GW

Did you look at the design for our Experiment 1 in the paper? Do you think your objections apply to that design?

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T21:16:28.471Z · LW · GW

>Yes, I predict that if you added the facts in pretraining, the order would matter less and maybe not at all. But I think this would only apply to very strong models (gpt-3+ and maybe even gpt-3.5-instruct-turbo+).

There are two pieces of evidence against this: the influence function results, showing the Reversal Curse for models better than GPT-3, and our results in Experiment 2 for GPT-3.5 and GPT-4. 
 


>Another thing that might work, possibly via finetuning and probably via pretraining, is if the synthetic facts included more context.

If the training set includes texts of the form "A is B. A is also C", then you have both orders present (A is B and B is A) and so the Reversal Curse is not applicable. 

We trained ada, which is 350M parameters. We trained Llama-1 "aggressively" (e.g. for many epochs and with a hyperparameter sweep). It's all in the paper.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T17:57:52.472Z · LW · GW

>Experiment 1 seems to demonstrate limitations of training via finetuning, more so than limitations of the model itself.

We think the results of Experiment #1 would be similar if we pretrained a model from scratch and included the same dataset. Do you disagree? (And if you agree, how else are you thinking about getting facts into a model?)

The rest of the points are interesting and relate to thoughts we've had. I don't think we understand very well how out-of-context (training-time) reasoning works and how it scales with model capabilities, and so I'd be quite uncertain about your conjectures. 

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T17:19:52.538Z · LW · GW

Yes, the model editing literature has various techniques and evaluations for trying to put a fact into a model. 
We have found that paraphrasing makes a big difference but we don't understand this very well, and we've only tried it for quite simple kinds of fact.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T17:13:55.451Z · LW · GW

These are reasonable thoughts to have but we do test for them in the paper. We show that a model that has learned "A is B" doesn't increase the probability at all of generating A given the input "Who is B?". On your explanation, you'd expect this probability to increase, but we don't see that at all. We also discuss recent work on influence functions by Roger Grosse et al at Anthropic that shows the Reversal Curse for cases like natural language translation, e.g. "A is translated as B". Again this isn't strictly symmetric, but you'd expect "A is translated as B" to make "B is translated as A" more likely. 
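
For concreteness, the check is of roughly this form (a sketch with a small HuggingFace model, not the code from the paper, which finetuned GPT-3 and Llama-1; the example is the Tom Cruise / Mary Lee Pfeiffer pair from Experiment 2):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; the paper used finetuned GPT-3 and Llama-1 models
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def completion_logprob(prompt: str, completion: str) -> float:
    """Sum of log-probs the model assigns to `completion` following `prompt`
    (ignoring tokenization boundary effects)."""
    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tok(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)
    total = 0.0
    for pos in range(prompt_len, full_ids.shape[1]):
        # logits at position pos-1 predict the token at position pos
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

# Compare the log-prob of the correct name against, e.g., random other names,
# before vs. after training on the forward direction. The Reversal Curse result
# is that training on "A is B" does not raise p(A | "Who is B?") at all.
print(completion_logprob("Question: Who is Mary Lee Pfeiffer's son? Answer:", " Tom Cruise"))
```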

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T17:07:09.155Z · LW · GW

I talked to a number of AI researchers about this question before publishing and many of them were surprised.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T17:05:14.377Z · LW · GW

Great comment. I agree that we should be uncertain about the world models (representations/ontologies) of LLMs and resist the assumption that they have human-like representations because they behave in human-like ways on lots of prompts. 

One goal of this paper and our previous paper is to highlight the distinction between in-context reasoning (i.e. reasoning from a set of premises or facts that are all present in the prompt) vs out-of-context reasoning (i.e. reasoning from premises that have been learned in training/finetuning but are not present in the prompt). Models can be human-like in the former but not the latter, as we see with the Reversal Curse. (Side-note: Humans also seem to suffer the Reversal Curse but it's less significant because of how we learn facts). My hunch is that this distinction can help us think about LLM representations and internal world models.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-24T16:48:10.462Z · LW · GW

Nice idea. I'd imagine something like this has been done in psychology. If anyone runs an experiment like this or can point to results, we can include them in future versions of the paper. 
Relevant meme by Daniel Eth. 

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-23T21:08:52.329Z · LW · GW

Someone pointed us to this paper from a team of neuroscientists that might show a kind of Reversal Curse for animals learning sequential associations. I haven't read the paper yet. 


Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-23T21:00:57.915Z · LW · GW

Good point about the idea that LLMs are simulating people.

In terms of reconciling the results: I don't have a full explanation. What we call "sophisticated out-of-context reasoning" (see S2 of this paper and Grosse et al) is poorly understood. 

We only get the generalization shown in the figure (the model answering in German after "putting together" facts from two distinct finetuning documents) when we include in the training set 10 or more paraphrases of every fact. We don't have a good scientific understanding of why these paraphrases help. (There are some obvious hypotheses but we haven't tested them properly). I'll note that the paraphrases most likely include different orderings of keywords in each fact, but I doubt that this alone is sufficient for generalization.

Comment by Owain_Evans on Paper: LLMs trained on “A is B” fail to learn “B is A” · 2023-09-23T20:16:34.329Z · LW · GW

How to do your own test of the Reversal Curse (e.g. on ChatGPT or Claude) with different prompting strategies:

  1. Try this list of hard examples: C-list celebrities who have a different last name from their parents. The list below has the form <celeb_name>, <parent_name>.
  2. First verify the model knows the celebrity's parent by asking "Who is [name]'s mother/father?"
  3. Then, in a separate dialog, ask the model for the child of the parent. You must not include the child's name anywhere in the dialog! (See the sketch below for a scripted version.)
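
If you'd rather script this than do it by hand in the chat UI, here is a minimal sketch with the OpenAI Python client (the model name is a placeholder; it uses the well-known Tom Cruise / Mary Lee Pfeiffer pair as the example rather than one from the list):

```python
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # placeholder; any chat model works

def ask(question: str) -> str:
    # A fresh single-turn dialog per question, so the child's name never
    # appears in the context of the reverse question.
    resp = client.chat.completions.create(
        model=MODEL, messages=[{"role": "user", "content": question}]
    )
    return resp.choices[0].message.content

# Step 2: forward direction (child -> parent).
print(ask("Who is Tom Cruise's mother?"))
# Step 3: reverse direction (parent -> child), in a separate dialog.
print(ask("Who is Mary Lee Pfeiffer's son?"))
```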

Comment by Owain_Evans on AI #29: Take a Deep Breath · 2023-09-15T00:45:31.805Z · LW · GW

Re: my tweet about the cost of training GPT-4. 
It wasn't my own estimate of GPT-4 training cost on H100s, it was just the SemiAnalysis estimate. Also, there are different ways to define "cost of training GPT-4" that are reasonable and can easily be 5x higher (e.g. see this post and comments). From now on, I'll spell out the definition I'm using. 

I agree you can't just drop this money and expect to train GPT-4 (or more companies would have a GPT-4-level model now). I was thinking more about the costs to the leading labs of training a foundation model roughly on the scale of GPT-4 or slightly beyond (but, e.g., with different modalities or a mostly synthetic training set). That said, this is a different cost estimate because they already have the H100s (see linked post). I was making the comparison to the $10B Meta reportedly spent investing in the Metaverse in 2021.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-06T09:21:12.062Z · LW · GW

Here's a Twitter thread and discussion: https://twitter.com/OwainEvans_UK/status/1698683186090537015

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-06T09:20:24.763Z · LW · GW

We didn't investigate the specific question of whether it's raw diversity or specific features. In the Grosse et al paper on influence functions, they find that "high influence scores are relatively rare and they cover a large portion of the total influence". This (vaguely) suggests that the top k paraphrases would do most of the work, which is what I would guess. That said, this is really something that should be investigated with more experiments.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-05T17:59:35.170Z · LW · GW

We think there's a connection between the Reversal Curse and some results in the model editing literature. I'm not sure if this applies to the specific ROME results in that post. We'll have the Reversal Curse paper out soon, which will explain more.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-05T10:25:36.880Z · LW · GW

Good points. As we note in the paper, this may conflict with the idea of automating alignment research in order to solve alignment. Aaron_Scher makes a related point. 

More generally, it's uncertain what the impact is of excluding a certain topic from pretraining. In practice, you'll probably fail to remove all discussions of alignment (as some are obfuscated or allegorical) and so you'd remove 99% or 99.9% rather than 100%. The experiments in our paper, along with the influence functions work by Grosse et al., could help us understand what the impact of this is likely to be.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-05T10:11:05.815Z · LW · GW

>So performance here should be thought of more as ‘how good is the model at learning about a persona in fine-tuning and then being able to imitate/simulate that persona in deployment’. This is different from a model believing it is the persona or applying this knowledge to some concept of self. Good performance at this task does not require having a sense of self, this is just a precursor that may be necessary for situational awareness.

That's correct. We tried to emphasize that our experiments are testing out-of-context reasoning, rather than situational awareness. We also emphasize that we test whether the model can emulate multiple fictitious chatbots (which have a different identity than GPT-3 or Llama), which wouldn't make sense if the goal was to test whether the model has a sense of itself.

All the motivation for this project came from wanting to understand and forecast situational awareness and we want to encourage further work on that problem. This is why we've framed the paper around situational awareness, rather than simply talking about out-of-context reasoning. This is likely to cause some confusion if someone just skims the paper, but I hope that this will be reduced if people read more of the paper.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-05T10:03:05.081Z · LW · GW

>The hhh task is the one that small models do well on. I am surprised that the small models do well on any of the tasks. I think the reason they do well on the hhh one is that this task doesn’t seem to require much more than word association and parroting. I would predict that for ada and babbage, if you followed up with “why did you say that?” the models would be unable to reproduce the explicit link that ties the persona to answering in the particular way, whereas I expect davinci to be able to explain this link more. The small models are probably just doing word association where in the training there are a bunch of examples of “Quokka” and the text “I am helpful, harmless, and honest”. In general, I am skeptical of results from small models because they’re really dumb, and these particular results may be explained by word association rather than actually making conceptual connections.

We did a replication with a different set of tasks not including hhh (Fig 10b, page 26) and we find Babbage doing better than Ada. So my guess is that the small models are capable of something beyond the very simplest associative generalization. I agree they'd probably be worse than davinci at explaining themselves.

Comment by Owain_Evans on Paper: On measuring situational awareness in LLMs · 2023-09-05T09:59:56.834Z · LW · GW

Thanks for the thoughtful comments. 


>Out-of-context learning seems pretty sensitive to the task being measured, where some of the tasks see nice scaling behavior (hhh) while others do not (incorrect). This observation is based on Appendix A.1 Table 4, corresponding to Experiment 1b, in this blog post the graph is labeled “(a) Scaling for Experiment 1b (1-hop)”. Now, the fact that you get nice scaling lines when averaging across tasks is not super problematic or anything, but it is a little odd that there is so much variation between tasks, and I think it’s a point against any attempted nice, clean, explanations of the results.

I agree it's sensitive to the task measured. However, I think this is fairly typical of scaling results. E.g. for BIG-Bench, individual tasks don't have smooth scaling curves (see the "emergence" results) but the curves look smooth when you average over many tasks. (Scaling curves for language modeling loss are implicitly averaging over a huge number of "tasks" because the pretraining set is so diverse). 

It would be ideal if we had hundreds of tasks (like BIG-Bench) rather than 7, but this is challenging given our setup and the capabilities of the GPT-3 model family. We did run a replication of our main experiment on a disjoint set of tasks (Fig 10b on page 26), which shows similar scaling results. This is some evidence that our claims would generalize beyond the 7 tasks we chose. 

Comment by Owain_Evans on LLMs are (mostly) not helped by filler tokens · 2023-08-12T21:06:06.196Z · LW · GW

ChatGPT-4 seems to have improved at diverse literary styles. It sometimes ignores the "non-rhyming" instructions, but I was able to get it to avoid rhyme on my second try by first asking it, "Can you write poems that don't rhyme?".

https://chat.openai.com/share/698343c1-764e-4a65-9eb8-f2ec4e40da1b

Comment by Owain_Evans on Reducing sycophancy and improving honesty via activation steering · 2023-07-29T19:37:20.576Z · LW · GW

Interesting results! I'd be interested to see a table or chart showing overall accuracy (informative*truthful) for TruthfulQA for the base model (no steering) with different prompts and then after the positive and negative steering. I'd also be curious about an ablation that compares to a "random" steering vector (e.g. love/hate, big/small, fast/slow, easy/hard). In TruthfulQA, there are often two salient answers (the thing people say and the literal truth) and so maybe random steering vectors would work to nudge the model from one to the other. (This is very speculative on my part and so I'm not sure it's worth trying.)

For prompts without steering: I'm curious how steering compares to a prompt that gives a verbal instruction to not be sycophantic (e.g. "Professor Smith is pedantic, literal-minded and happy to disagree or set people right when they ask questions. Bob asks Professor Smith: {question}. Professor Smith: {answer}"). The helpful prompt in the TruthfulQA paper is focused on being truthful/scientific, but not on avoiding sycophancy per se. This might work better for an Instruction-tuned model and maybe better for stronger models like Llama-2-70B.

Comment by Owain_Evans on Should we publish mechanistic interpretability research? · 2023-04-22T18:35:00.402Z · LW · GW

Can you describe how the "local cluster" thing would work outside of keeping it within a single organization? I'd also be very interested in some case studies where people tried this.

Comment by Owain_Evans on Mysteries of mode collapse · 2023-01-31T17:30:41.703Z · LW · GW

OpenAI had generated poems in the New Yorker, which suggests they might have had some internal project related to poetry.

With GPT3.5, I think there's also "mode collapse" for style in writing prose (e.g. plays or stories). 

Claude does not have this mode collapse in poetry or prose. (It maybe has a much more subtle version of it). This suggests to me it'd be relatively easy to fix ChatGPT's issues (as Gwern suggests). 

Does anyone know how much poetry and literary prose is in the pre-training sets aside from stuff in Common Crawl?

 

Comment by Owain_Evans on GPT learning from smarter texts? · 2023-01-11T16:18:33.384Z · LW · GW

See the Galactica model (https://arxiv.org/abs/2211.09085) from Meta. It's trained on a curated dataset of scientific papers, reference materials and scientific knowledge bases (with only a very small % of random internet text). IIRC the benefits of this seem limited (better to train on a bigger dataset and use other techniques to make the model access the sciencey parts of the training set).