LessWrong 2.0 Reader

Dating Roundup #1: This is Why You’re Single
Zvi · 2023-08-29T12:50:04.964Z · comments (27)
[question] What are the best arguments for/against AIs being "slightly 'nice'"?
Raemon · 2024-09-24T02:00:19.605Z · answers+comments (49)
GPT-o1
Zvi · 2024-09-16T13:40:06.236Z · comments (34)
Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)
Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)
[link] Anxiety vs. Depression
Sable · 2024-03-17T00:15:08.255Z · comments (35)
Apollo Neuro Results
Elizabeth (pktechgirl) · 2023-07-30T18:40:05.213Z · comments (17)
Highlights: Wentworth, Shah, and Murphy on "Retargeting the Search"
RobertM (T3t) · 2023-09-14T02:18:05.890Z · comments (4)
The "spelling miracle": GPT-3 spelling abilities and glitch tokens revisited
mwatkins · 2023-07-31T19:47:02.793Z · comments (29)
Addressing Feature Suppression in SAEs
Benjamin Wright (Benw8888) · 2024-02-16T18:32:51.927Z · comments (3)
[link] Linkpost: Rishi Sunak's Speech on AI (26th October)
bideup · 2023-10-27T11:57:46.575Z · comments (8)
[link] The Puritans would one-box: evidential decision theory in the 17th century
Jacob G-W (g-w1) · 2023-10-14T20:23:24.346Z · comments (5)
[link] Environmentalism in the United States Is Unusually Partisan
Jeffrey Heninger (jeffrey-heninger) · 2024-05-13T21:23:10.755Z · comments (26)
Rejecting Television
Declan Molony (declan-molony) · 2024-04-23T04:59:50.253Z · comments (10)
Natural Latents: The Concepts
johnswentworth · 2024-03-20T18:21:19.878Z · comments (18)
[Valence series] 2. Valence & Normativity
Steven Byrnes (steve2152) · 2023-12-07T16:43:49.919Z · comments (5)
[link] Dario Amodei’s prepared remarks from the UK AI Safety Summit, on Anthropic’s Responsible Scaling Policy
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-11-01T18:10:31.110Z · comments (1)
[link] "AI Safety for Fleshy Humans" an AI Safety explainer by Nicky Case
habryka (habryka4) · 2024-05-03T18:10:12.478Z · comments (10)
Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (11)
Newsom Vetoes SB 1047
Zvi · 2024-10-01T12:20:06.127Z · comments (6)
My checklist for publishing a blog post
Steven Byrnes (steve2152) · 2023-08-15T15:04:56.219Z · comments (6)
[link] [Paper] Stress-testing capability elicitation with password-locked models
Fabien Roger (Fabien) · 2024-06-04T14:52:50.204Z · comments (10)
The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (3)
[link] A Universal Emergent Decomposition of Retrieval Tasks in Language Models
Alexandre Variengien (alexandre-variengien) · 2023-12-19T11:52:27.354Z · comments (3)
A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)
MATS Winter 2023-24 Retrospective
utilistrutil · 2024-05-11T00:09:17.059Z · comments (28)
Some for-profit AI alignment org ideas
Eric Ho (eh42) · 2023-12-14T14:23:20.654Z · comments (19)
[link] Nietzsche's Morality in Plain English
Arjun Panickssery (arjun-panickssery) · 2023-12-04T00:57:42.839Z · comments (13)
[link] Hardshipification
Jonathan Moregård (JonathanMoregard) · 2024-05-28T20:02:29.709Z · comments (17)
[link] [Paper] AI Sandbagging: Language Models can Strategically Underperform on Evaluations
Teun van der Weij (teun-van-der-weij) · 2024-06-13T10:04:49.556Z · comments (10)
Update on the UK AI Taskforce & upcoming AI Safety Summit
Elliot Mckernon (elliot) · 2023-10-11T11:37:42.436Z · comments (2)
[link] Anthropic's Responsible Scaling Policy & Long-Term Benefit Trust
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-09-19T15:09:27.235Z · comments (23)
AI #51: Altman’s Ambition
Zvi · 2024-02-20T19:50:07.439Z · comments (5)
Retirement Accounts and Short Timelines
jefftk (jkaufman) · 2024-02-19T18:50:05.231Z · comments (35)
[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (18)
[link] The Real Fanfic Is The Friends We Made Along The Way
Eneasz · 2023-10-18T19:21:40.431Z · comments (0)
A Crisper Explanation of Simulacrum Levels
Thane Ruthenis · 2023-12-23T22:13:52.286Z · comments (13)
[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)
Sparse Autoencoders Work on Attention Layer Outputs
Connor Kissane (ckkissane) · 2024-01-16T00:26:14.767Z · comments (9)
New roles on my team: come build Open Phil's technical AI safety program with me!
Ajeya Cotra (ajeya-cotra) · 2023-10-19T16:47:59.701Z · comments (6)
Actually, Power Plants May Be an AI Training Bottleneck.
Lao Mein (derpherpize) · 2024-06-20T04:41:33.567Z · comments (13)
Untrusted smart models and trusted dumb models
Buck · 2023-11-04T03:06:38.001Z · comments (12)
Why you should be using a retinoid
GeneSmith · 2024-08-19T03:07:41.722Z · comments (57)
Live Theory Part 0: Taking Intelligence Seriously
Sahil · 2024-06-26T21:37:10.479Z · comments (3)
Decomposing independent generalizations in neural networks via Hessian analysis
Dmitry Vaintrob (dmitry-vaintrob) · 2023-08-14T17:04:40.071Z · comments (4)
Saying the quiet part out loud: trading off x-risk for personal immortality
disturbance · 2023-11-02T17:43:34.155Z · comments (89)
Release: Optimal Weave (P1): A Prototype Cohabitive Game
mako yass (MakoYass) · 2024-08-17T14:08:18.947Z · comments (21)
Stepping down as moderator on LW
Kaj_Sotala · 2023-08-14T10:46:58.163Z · comments (1)
The Good Life in the face of the apocalypse
Elizabeth (pktechgirl) · 2023-10-16T22:40:15.200Z · comments (8)
[link] Essay competition on the Automation of Wisdom and Philosophy — $25k in prizes
owencb · 2024-04-16T10:10:13.338Z · comments (12)