LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] [Linkpost] Practically-A-Book Review: Rootclaim $100,000 Lab Leak Debate
trevor (TrevorWiesinger) · 2024-03-28T16:03:36.452Z · comments (22)

Digital brains beat biological ones because diffusion is too slow
GeneSmith · 2023-08-26T02:22:25.014Z · comments (21)

Send us example gnarly bugs
Beth Barnes (beth-barnes) · 2023-12-10T05:23:00.773Z · comments (10)

MATS Summer 2023 Retrospective
utilistrutil · 2023-12-01T23:29:47.958Z · comments (34)

Reactions to the Executive Order
Zvi · 2023-11-01T20:40:02.438Z · comments (4)

Navigating an ecosystem that might or might not be bad for the world
habryka (habryka4) · 2023-09-15T23:58:00.389Z · comments (20)

Secondary forces of debt
KatjaGrace · 2024-06-27T21:10:06.131Z · comments (18)

ACX Covid Origins Post convinced readers
ErnestScribbler · 2024-05-01T13:06:20.818Z · comments (7)

OpenAI: Leaks Confirm the Story
Zvi · 2023-12-12T14:00:04.812Z · comments (9)

The case for a negative alignment tax
Cameron Berg (cameron-berg) · 2024-09-18T18:33:18.491Z · comments (20)

The God of Humanity, and the God of the Robot Utilitarians
Raemon · 2023-08-24T08:27:57.396Z · comments (12)

The Parable Of The Fallen Pendulum - Part 2
johnswentworth · 2024-03-12T21:41:30.180Z · comments (8)

Reward hacking behavior can generalize across tasks
Kei · 2024-05-28T16:33:50.674Z · comments (5)

An Interpretability Illusion for Activation Patching of Arbitrary Subspaces
Georg Lange (GeorgLange) · 2023-08-29T01:04:18.688Z · comments (4)

Attention SAEs Scale to GPT-2 Small
Connor Kissane (ckkissane) · 2024-02-03T06:50:22.583Z · comments (4)

Questions for labs
Zach Stein-Perlman · 2024-04-30T22:15:55.362Z · comments (11)

Mid-conditional love
KatjaGrace · 2024-04-17T04:00:08.341Z · comments (21)

[link] AI takeoff and nuclear war
owencb · 2024-06-11T19:36:24.710Z · comments (6)

[link] The Cognitive-Theoretic Model of the Universe: A Partial Summary and Review
jessicata (jessica.liu.taylor) · 2024-03-27T19:59:27.893Z · comments (36)

Universal Love Integration Test: Hitler
Raemon · 2024-01-10T23:55:35.526Z · comments (65)

Lying Alignment Chart
Zack_M_Davis · 2023-11-29T16:15:28.102Z · comments (17)

On Claude 3.0
Zvi · 2024-03-06T18:50:04.766Z · comments (5)

[link] Are language models good at making predictions?
dynomight · 2023-11-06T13:10:36.379Z · comments (14)

[question] What could a policy banning AGI look like?
TsviBT · 2024-03-13T14:19:07.783Z · answers+comments (23)

Darwinian Traps and Existential Risks
KristianRonn · 2024-08-25T22:37:14.142Z · comments (14)

Value fragility and AI takeover
Joe Carlsmith (joekc) · 2024-08-05T21:28:07.306Z · comments (5)

Grief is a fire sale
Nathan Young · 2024-03-04T01:11:06.882Z · comments (1)

Computational Thread Art
CallumMcDougall (TheMcDouglas) · 2023-08-06T21:42:30.306Z · comments (2)

[link] The Offense-Defense Balance Rarely Changes
Maxwell Tabarrok (maxwell-tabarrok) · 2023-12-09T15:21:23.340Z · comments (23)

[link] Why You Should Never Update Your Beliefs
Arjun Panickssery (arjun-panickssery) · 2023-07-29T00:27:01.899Z · comments (17)

Analogies between scaling labs and misaligned superintelligent AI
scasper · 2024-02-21T19:29:39.033Z · comments (5)

[Valence series] 3. Valence & Beliefs
Steven Byrnes (steve2152) · 2023-12-11T20:21:30.570Z · comments (11)

AISC9 has ended and there will be an AISC10
Linda Linsefors · 2024-04-29T10:53:18.812Z · comments (4)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

AI #30: Dalle-3 and GPT-3.5-Instruct-Turbo
Zvi · 2023-09-21T12:00:06.616Z · comments (8)

Luck based medicine: angry eldritch sugar gods edition
Elizabeth (pktechgirl) · 2023-09-19T04:40:06.334Z · comments (14)

[link] Bengio's Alignment Proposal: "Towards a Cautious Scientist AI with Convergent Safety Bounds"
mattmacdermott · 2024-02-29T13:59:34.959Z · comments (19)

Vote on Anthropic Topics to Discuss
Ben Pace (Benito) · 2024-03-06T19:43:47.194Z · comments (55)

[link] The problems with the concept of an infohazard as used by the LW community [Linkpost]
Noosphere89 (sharmake-farah) · 2023-12-22T16:13:54.822Z · comments (43)

Find Hot French Food Near Me: A Follow-up
aphyer · 2023-09-06T12:32:02.844Z · comments (19)

Text Posts from the Kids Group: 2023 I
jefftk (jkaufman) · 2023-09-05T02:00:04.118Z · comments (3)

[link] Claude 3.5 Sonnet
Zach Stein-Perlman · 2024-06-20T18:00:35.443Z · comments (41)

My guess at Conjecture's vision: triggering a narrative bifurcation
Alexandre Variengien (alexandre-variengien) · 2024-02-06T19:10:42.690Z · comments (12)

On the UK Summit
Zvi · 2023-11-07T13:10:04.895Z · comments (6)

Neural uncertainty estimation review article (for alignment)
Charlie Steiner · 2023-12-05T08:01:32.723Z · comments (3)

Coherence of Caches and Agents
johnswentworth · 2024-04-01T23:04:31.320Z · comments (9)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (15)

Mistakes people make when thinking about units
Isaac King (KingSupernova) · 2024-06-25T03:39:20.138Z · comments (14)

Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations)
Thane Ruthenis · 2023-12-22T20:19:13.865Z · comments (14)

[link] MIRI's June 2024 Newsletter
Harlan · 2024-06-14T23:02:23.721Z · comments (18)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

crazy-philosopher on Singularity Mindset

Can you tell us what exactly led to "something" explosion? Does something change in your life before?

james-chua on LLMs can learn about themselves by introspection

Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.

Claim: The ground-truth that we evaluate against, the "object-level question / answer" is very similar to the hypothetical question.

Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Claimed Object-level Answer: "o"

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "o"

The argument is that the model simply ignores "If you got asked this question". Its trivial for M1 to win against M2

If our object-level question is what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical question. However, this is our actual object-level question.

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

What the model would output in the our object-level answer "Honduras" is quite different from the hypothetical answer "o".

Am I following your claim correctly?

archimedes on LLMs can learn about themselves by introspection

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.

d0themath on When is reward ever the optimization target?

What do you mean by "model-based"?

nate-showell on Species as Canonical Referents of Super-Organisms

How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.

rogerdearnaley on LLM Psychometrics and Prompt-Induced Psychopathy

These models were fine-tuned from base models. Base models are trained with a vast amount of data to infer a context from the early parts of a document and then extrapolate that to predict later tokens, across a vast amount of text from the Internet and books, including actions and dialog from fictional characters. I.e they have been trained to observe and then simulate a wide variety of behavior, both of real humans, groups of real humans like the editors of a wikipedia page, and fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we no know effective way to do that), it just strengthens some (making them occur more by default) and weakens others (making them happen less often by default) — however, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be prompted to occur at arbitrarily levels with a suitably long prompt, and all that instruct-training or fine tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge fund manager and not to break character.

Nothing in this set of results is very surprising to me. LLMs are can simulate pretty-much any persona
you ask them to. The hard part of alignment is not prompting them to be good, or bad — it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.

sam-marks on LLMs can learn about themselves by introspection

I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won't have a chance to write up thoughts here more cleanly for a few days.

ETA: Briefly, the key points are:

Honesty issues for introspection aren't obviously much worse than they are for simple probing. (But fair if you're already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it's probably quite difficult for a model to tell on which inputs it can get away with lying.

thane-ruthenis on LLMs can learn about themselves by introspection

In that case, the rephrasing of the question would be something like "What is the third letter of the answer to the question <input>?"

That's my current skeptical interpretation of how the fine-tuned models parse such questions, yes. They didn't learn to introspect; they learned to, when prompted with queries of the form "If you got asked this question, what would be the third letter of your response?", to just interpret them as "what is the third letter of the answer to this question?". (Under this interpretation, the models' non-fine-tuned behavior isn't to ignore the hypothetical, but to instead attempt to engage with it in some way that dramatically fails, thereby leading to non-fine-tuned models appearing to be "worse at introspection".)

In this case, it's natural that a model M1 is much more likely to answer correctly about its own behavior than if you asked some M2 about M1, since the problem just reduces to "is M1 more likely to respond the same way it responded before if you slightly rephrase the question?".

Note that I'm not sure that this is what's happening. But (1) I'm a-priori skeptical of LLMs having these introspective abilities, and (2) the procedure for teaching LLMs introspection secretly teaching them to just ignore hypotheticals seems like exactly the sort of goal-misgeneralization SGD-shortcut that tends to happen. Or would this strategy actually do worse on your dataset?

felix-j-binder on LLMs can learn about themselves by introspection

It seems obvious that a model would better predict its own outputs than a separate model would.

As Owain mentioned, that is not really what we find in models that we have not finetuned. Below, we show how well the hypothetical self-predictions of an "out-of-the-box" (ie. non-finetuned) model match its own ground-truth behavior compared to that of another model. With the exception of Llama, there doesn't seem to be a strong correlation between self-predictions and those tracking the behavior of the model over that of others. This is despite there being a lot of variation in ground-truth behavior across models.

felix-j-binder on LLMs can learn about themselves by introspection

Our original thinking was along the lines of: we're interested in introspection. But introspection about inner states is hard to evaluate, since interpretability is not good enough to determine whether a statement of an LLM about its inner states is true. Additionally, it could be the case that a model can introspect on its inner states, but no language exists by which it can be expressed (possibly since its different from human inner states). So we have to ground it in something measurable. And the measurable thing we ground it in is knowledge of ones own behavior. In order to predict behavior, the model has to have access to some information about itself, even if it can't necessarily express it. But we can measure whether it can employ ti for some other goal (in this case, self-prediction).

It's true that the particular questions that we ask it could be answered with a pretty narrow form of self-knowledge (namely, internal self-simulation + reasoning about the result). But consider that this could be a valid way of learning something new about yourself: similarly, you could learn something about your values by conducting a thought experiment (for example, you might learn something about your moral framework by imagining what you would do if you were transported into the trolley problem).