LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

Request for advice: Research for Conversational Game Theory for LLMs
Rome Viharo (rome-viharo) · 2024-10-16T17:53:30.243Z · comments (0)

On agentic generalist models: we're essentially using existing technology the weakest and worst way you can use it
Yuli_Ban · 2024-08-28T01:57:17.387Z · comments (2)

Becket First
jefftk (jkaufman) · 2024-09-22T17:10:04.304Z · comments (0)

[link] In Praise of the Beatitudes
robotelvis · 2024-09-24T05:08:21.133Z · comments (7)

[link] Temporary Cognitive Hyperparameter Alteration
Jonathan Moregård (JonathanMoregard) · 2024-08-01T10:27:11.917Z · comments (0)

[question] Doing Nothing Utility Function
k64 · 2024-09-26T22:05:18.821Z · answers+comments (9)

Open letter to young EAs
Leif Wenar · 2024-10-11T19:49:10.818Z · comments (10)

[link] Apply to Aether - Independent LLM Agent Safety Research Group
RohanS · 2024-08-21T09:47:11.493Z · comments (0)

Thinking About a Pedalboard
jefftk (jkaufman) · 2024-10-08T11:50:02.054Z · comments (2)

[question] Are UV-C Air purifiers so useful?
JohnBuridan · 2024-09-04T14:16:01.310Z · answers+comments (0)

Electric Mandola
jefftk (jkaufman) · 2024-09-21T13:40:04.772Z · comments (0)

An Interpretability Illusion from Population Statistics in Causal Analysis
Daniel Tan (dtch1997) · 2024-07-29T14:50:19.497Z · comments (3)

[question] How tokenization influences prompting?
Boris Kashirin (boris-kashirin) · 2024-07-29T10:28:25.056Z · answers+comments (4)

[question] What do you expect AI capabilities may look like in 2028?
nonzerosum · 2024-08-23T16:59:53.007Z · answers+comments (5)

Simultaneous Footbass and Footdrums II
jefftk (jkaufman) · 2024-08-11T23:50:01.982Z · comments (0)

AGI's Opposing Force
SimonBaars (simonbaars) · 2024-08-16T04:18:06.900Z · comments (2)

Keeping it (less than) real: Against ℶ₂ possible people or worlds
quiet_NaN · 2024-09-13T17:29:44.915Z · comments (0)

[link] Contagious Beliefs—Simulating Political Alignment
James Stephen Brown (james-brown) · 2024-10-13T00:27:08.084Z · comments (0)

[question] Is this a Pivotal Weak Act? Creating bacteria that decompose metal
doomyeser · 2024-09-11T18:07:19.385Z · answers+comments (9)

[link] Why do firms choose to be inefficient?
Nicholas D. (nicholas-d) · 2024-08-28T18:39:41.664Z · comments (4)

[question] What are some of the proposals for solving the control problem?
Dakara (chess-ice) · 2024-08-14T23:04:44.863Z · answers+comments (0)

Will AI and Humanity Go to War?
Simon Goldstein (simon-goldstein) · 2024-10-01T06:35:22.374Z · comments (4)

A short project on Mamba: grokking & interpretability
Alejandro Tlaie (alejandro-tlaie-boria) · 2024-10-18T16:59:45.314Z · comments (0)

[link] The EA case for Trump
Judd Rosenblatt (judd) · 2024-08-03T01:00:45.422Z · comments (1)

[link] Physics of Language models (part 2.1)
Nathan Helm-Burger (nathan-helm-burger) · 2024-09-19T16:48:32.301Z · comments (2)

[link] Molecular dynamics data will be essential for the next generation of ML protein models
Abhishaike Mahajan (abhishaike-mahajan) · 2024-08-26T14:50:23.790Z · comments (0)

[link] Virtue is a Vector
robotelvis · 2024-09-10T03:02:45.737Z · comments (1)

Common Uses of "Acceptance"
Yi-Yang (yiyang) · 2024-07-26T11:18:30.719Z · comments (5)

Rationalist Gnosticism
tailcalled · 2024-10-10T09:06:34.149Z · comments (10)

A Dialogue on Deceptive Alignment Risks
Rauno Arike (rauno-arike) · 2024-09-25T16:10:12.294Z · comments (0)

[link] AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary
Corin Katzke (corin-katzke) · 2024-10-01T20:35:32.399Z · comments (0)

On epistemic autonomy
sanyer (santeri-koivula) · 2024-08-31T18:50:43.377Z · comments (0)

LLM Psychometrics and Prompt-Induced Psychopathy
Korbinian K. (korbinian-koch) · 2024-10-18T18:11:24.256Z · comments (2)

Two new datasets for evaluating political sycophancy in LLMs
alma.liezenga · 2024-09-28T18:29:49.088Z · comments (0)

HDBSCAN is Surprisingly Effective at Finding Interpretable Clusters of the SAE Decoder Matrix
Jaehyuk Lim (jason-l) · 2024-10-11T23:06:14.340Z · comments (0)

Thoughts On the Nature of Capability Elicitation via Fine-tuning
Theodore Chapman · 2024-10-15T08:39:19.909Z · comments (0)

[link] Michael Streamlines on Buddhism
Chris_Leong · 2024-08-09T04:44:52.126Z · comments (0)

Steering LLMs' Behavior with Concept Activation Vectors
Ruixuan Huang (sprout_ust) · 2024-09-28T09:53:19.658Z · comments (0)

Three main arguments that AI will save humans and one meta-argument
avturchin · 2024-10-02T11:39:08.910Z · comments (8)

[link] Can AI agents learn to be good?
Ram Rachum (ram@rachum.com) · 2024-08-29T14:20:04.336Z · comments (0)

[link] Approval-Seeking ⇒ Playful Evaluation
Jonathan Moregård (JonathanMoregard) · 2024-08-28T21:03:51.244Z · comments (0)

MIT FutureTech are hiring for a Head of Operations role
peterslattery · 2024-10-02T17:11:42.960Z · comments (0)

LLMs are likely not conscious
research_prime_space · 2024-09-29T20:57:26.111Z · comments (7)

[question] What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?
Roko · 2024-10-19T06:11:12.602Z · answers+comments (0)

An open response to Wittkotter and Yampolskiy
Donald Hobson (donald-hobson) · 2024-09-24T22:27:21.987Z · comments (0)

[link] Jailbreaking language models with user roleplay
loops (smitop) · 2024-09-28T23:43:10.870Z · comments (0)

On Intentionality, or: Towards a More Inclusive Concept of Lying
Cornelius Dybdahl (Kalciphoz) · 2024-10-18T10:37:32.201Z · comments (0)

Thinking About Propensity Evaluations
Maxime Riché (maxime-riche) · 2024-08-19T09:23:55.091Z · comments (0)

[link] Universal dimensions of visual representation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-28T10:38:58.396Z · comments (0)

[link] AI Safety at the Frontier: Paper Highlights, July '24
gasteigerjo · 2024-08-05T13:00:46.028Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

archimedes on LLMs can learn about themselves by introspection

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.

d0themath on When is reward ever the optimization target?

What do you mean by "model-based"?

nate-showell on Species as Canonical Referents of Super-Organisms

How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.

rogerdearnaley on LLM Psychometrics and Prompt-Induced Psychopathy

These models were fine-tuned from base models. Base models are trained with a vast amount of data to infer a context from the early parts of a document and then extrapolate that to predict later tokens, across a vast amount of text from the Internet and books, including actions and dialog from fictional characters. I.e they have been trained to observe and then simulate a wide variety of behavior, both of real humans, groups of real humans like the editors of a wikipedia page, and fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we no know effective way to do that), it just strengthens some (making them occur more by default) and weakens others (making them happen less often by default) — however, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be prompted to occur at arbitrarily levels with a suitably long prompt, and all that instruct-training or fine tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge fund manager and not to break character.

Nothing in this set of results is very surprising to me. LLMs are can simulate pretty-much any persona
you ask them to. The hard part of alignment is not prompting them to be good, or bad — it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.

sam-marks on LLMs can learn about themselves by introspection

I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won't have a chance to write up thoughts here more cleanly for a few days.

ETA: Briefly, the key points are:

Honesty issues for introspection aren't obviously much worse than they are for simple probing. (But fair if you're already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it's probably quite difficult for a model to tell on which inputs it can get away with lying.

thane-ruthenis on LLMs can learn about themselves by introspection

In that case, the rephrasing of the question would be something like "What is the third letter of the answer to the question <input>?"

That's my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn't learn to introspect; they learned to, when prompted with queries of the form "If you got asked this question, what would be the third letter of your response?", to just interpret them as "what is the third letter of the answer to this question?". (Under this interpretation, the models' non-fine-tuned behavior isn't to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be "worse at introspection".)

In this case, it's natural that a model M1 is much more likely to answer correctly about its own behavior than if you asked some M2 about M1, since the problem just reduces to "is M1 more likely to respond the same way it responded before if you slightly rephrase the question?".

Note that I'm not sure that this is what's happening. But (1) I'm a-priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD-shortcut that tends to happen. Or would this strategy actually do worse on your dataset?

felix-j-binder on LLMs can learn about themselves by introspection

It seems obvious that a model would better predict its own outputs than a separate model would.

As Owain mentioned, that is not really what we find in models that we have not finetuned. Below, we show how well the hypothetical self-predictions of an "out-of-the-box" (ie. non-finetuned) model match its own ground-truth behavior compared to that of another model. With the exception of Llama, there doesn't seem to be a strong correlation between self-predictions and those tracking the behavior of the model over that of others. This is despite there being a lot of variation in ground-truth behavior across models.

felix-j-binder on LLMs can learn about themselves by introspection

Our original thinking was along the lines of: we're interested in introspection. But introspection about inner states is hard to evaluate, since interpretability is not good enough to determine whether a statement of an LLM about its inner states is true. Additionally, it could be the case that a model can introspect on its inner states, but no language exists by which it can be expressed (possibly since its different from human inner states). So we have to ground it in something measurable. And the measurable thing we ground it in is knowledge of ones own behavior. In order to predict behavior, the model has to have access to some information about itself, even if it can't necessarily express it. But we can measure whether it can employ ti for some other goal (in this case, self-prediction).

It's true that the particular questions that we ask it could be answered with a pretty narrow form of self-knowledge (namely, internal self-simulation + reasoning about the result). But consider that this could be a valid way of learning something new about yourself: similarly, you could learn something about your values by conducting a thought experiment (for example, you might learn something about your moral framework by imagining what you would do if you were transported into the trolley problem).

felix-j-binder on LLMs can learn about themselves by introspection

What's your model of "rephrasing the question"? Note that we never ask the "If you got this input, what would you have done?", but always for some property of its behavior ("If you got this input, what is the third letter of your response?") In that case, the rephrasing of the question would be something like "What is the third letter of the answer to the question <input>?"

I have the sense that being able to answer this question consistently correctly wrt to the models ground truth behavior on questions where that ground truth behavior differs from that of other models suggests (minimal) introspection

felix-j-binder on LLMs can learn about themselves by introspection

Thanks so much for your thoughtful feedback!

The actual success rate of self-prediction seems incredibly low considering the trivial/dominant strategy of 'just run the query'

To rule out that the model just simulates the behavior itself, we always ask it about some property of its hypothetical behavior (”Would the number that you would have predicted be even or odd?”). So it has to both simulate itself and then reason about it in a single forward pass. This is not trivial. When we ask models to just reproduce the behavior that they would have had, they achieve much higher accuracy. In particular, GPT3.5 can reproduce its own behavior pretty well, but struggles to extract a property of its hypothetical behavior.

(another minor thing: it turns out that OpenAI API models are not in practice deterministic even at temperature=0, probably due to batching of mixture-of-experts. We try to account for this by repeatedly sampling, but this puts a ceiling on how high sel-prediction performance can be)

You do mention the biggest issue with this showing introspection, "Models only exhibit introspection on simpler tasks", and yet the idea you are going for is clearly for its application to very complex tasks where we can't actually check its work.

It’s true that we only find evidence for introspection on toy tasks. Under the simulation account (models internally simulate what they would do and then reason about it), it could be that current models do not have enough power in a single forward pass to both self-simulate and do sophisticated reasoning on top of this. But having shown that, in some cases, models are capable of this, we might want to prepare for future models to be better at this ability.

One big issue I have is that I completely disagree with your (admittedly speculative) claim that success of this kind of predicting behavior means we should believe it on what is going on in reports of things like internal suffering. This seems absurd to me for many reasons (for one thing, we know it isn't suffering because of how it is designed), but the key point is that for this to be true, you would need it to be able to predict its own internal process, not simply its own external behavior.

That’s a fair point—we certainly don’t want to claim that this shows that all self-reports by models are necessarily true. But we do think that our findings should update us in the direction of self-report of morally relevant properties being a promising research avenue. Had we found that models have no special access to information about themselves, we should consider it less likely that self-report about sentience etc. would be informative.

Another point is, if it had significant introspective access, it likely wouldn't need to be trained to use it, so this is actually evidence that it doesn't have introspective access by default at least as much as the idea that you can train it to have introspective access.

Introspection training can be thought of as a form of elicitation. Self-prediction is weird task that models probably aren't trained on (but we don't know exactly what the labs are doing). So it could be that the models contain the right representations/circuits, but they haven't been properly elicited. In the appendix, we show that training on more data does not lead to better predictions, which suggests something like the elicitation story.

First, the shown validation questions are all in second person. Were cross predictions prompted in exactly the same way as self predictions? This could skew results in favor of models it is true for if you really are prompting that way, and is a large change in prompt if you change it for accuracy. Perhaps you should train it to predict 'model X' even when that model is itself, and see how that changes results

Thanks, that is a good point. Yes, both the self- and the cross-prediction trained models were asked using second-person pronouns. It's possible that this is hurting the performance of the cross-trained models, since they're now asked to do something that isn't actually true: they're not queried about their actual behavior, but that of another model. We assumed that across enough finetuning samples, that effect would not actually matter, but we haven't tested it. It's a follow-up we're interested in.

Second, I wouldn't say the results seem well-calibrated just because they seem to go in the same basic direction (some seem close and some quite off).

I agree, the calibration is not perfect. What is notable about it is that the models also seem calibrated wrt to the second and third most likely response, which they have not seen during training. This suggests that somehow that distribution over potential behaviors is being used in answering the self-prediction questions

Fourth, how does its performance vary if you train it on an additional data set where you make sure to include the other parts of the prompt that are not content based, while not including the content you will test on?

I'm not sure I understand. Note that most of the results in the paper are presented on held-out tasks (eg MMLU or completing a sentence) that the model has not seen during training and has to generalize to. However, the same general pattern of results holds when evaluating on the training tasks (see appendix).

Fifth, finetuning is often a version of Goodharting, that raises success on the metric without improving actual capabilities (or often even making them worse), and this is not fully addressed just by having the verification set be different than the test set. If you could find a simple way of prompting that lead to introspection that would be much more likely to be evidence in favor of introspection than that it successfully predicted after finetuning.

Fair point—certainly, a big confounder is getting the models to properly follow the format and do the task at all. However, the gap between self- and cross-prediction trained models remains to be explained.

Finally, Figure 17 seems obviously misleading. There should be a line for how it changed over its training for self-prediction and not require carefully reading the words below the figure to see that you just put a mark at the final result for self-prediction).

You're right—sorry about that. The figure only shows the effect of changing data size for cross-, but not for self-prediction. Earlier (not reported) scaling experiments also showed a similarly flat curve for self-prediction above a certain threshold.

Thanks again for your many thoughtful comments!