LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

← previous page (newer posts) · next page (older posts) →

[link] Triangulating My Interpretation of Methods: Black Boxes by Marco J. Nathan
adamShimi · 2024-10-09T19:13:26.631Z · comments (0)

The Geometric Importance of Side Payments
StrivingForLegibility · 2024-08-07T01:38:04.635Z · comments (4)

[link] [Linkpost] Automated Design of Agentic Systems
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-08-19T23:06:06.669Z · comments (1)

[link] AI Safety Newsletter #42: Newsom Vetoes SB 1047 Plus, OpenAI’s o1, and AI Governance Summary
Corin Katzke (corin-katzke) · 2024-10-01T20:35:32.399Z · comments (0)

Making Beliefs Pay Rent
Screwtape · 2024-07-28T17:59:52.101Z · comments (2)

[link] Cooperation and Alignment in Delegation Games: You Need Both!
Oliver Sourbut · 2024-08-03T10:16:51.716Z · comments (0)

Broadly human level, cognitively complete AGI
p.b. · 2024-08-06T09:26:13.220Z · comments (0)

Funding for programs and events on global catastrophic risk, effective altruism, and other topics
abergal · 2024-08-14T23:59:48.146Z · comments (0)

Sequence overview: Welfare and moral weights
MichaelStJules · 2024-08-15T04:22:32.567Z · comments (0)

Lifelogging for Alignment & Immortality
Dev.Errata (ethan.roland) · 2024-08-17T23:42:56.699Z · comments (3)

One person's worth of mental energy for AI doom aversion jobs. What should I do?
Lorec · 2024-08-26T01:29:01.700Z · comments (16)

Deception and Jailbreak Sequence: 2. Iterative Refinement Stages of Jailbreaks in LLM
Winnie Yang (winnie-yang) · 2024-08-28T08:41:38.967Z · comments (2)

Denver USA - ACX Meetups Everywhere Fall 2024
Eneasz · 2024-08-29T18:40:53.332Z · comments (0)

[question] Does a time-reversible physical law/Cellular Automaton always imply the First Law of Thermodynamics?
Noosphere89 (sharmake-farah) · 2024-08-30T15:12:28.823Z · answers+comments (11)

Fake Blog Posts as a Problem Solving Device
silentbob · 2024-08-31T09:22:54.513Z · comments (0)

[link] Is Redistributive Taxation Justifiable? Part 1: Do the Rich Deserve their Wealth?
Alexander de Vries (alexander-de-vries) · 2024-09-05T10:23:08.958Z · comments (20)

[link] Checking public figures on whether they "answered the question" quick analysis from Harris/Trump debate, and a proposal
david reinstein (david-reinstein) · 2024-09-11T20:25:27.845Z · comments (4)

[question] If I ask an LLM to think step by step, how big are the steps?
ryan_b · 2024-09-13T20:30:50.558Z · answers+comments (1)

Relativity Theory for What the Future 'You' Is and Isn't
FlorianH (florian-habermacher) · 2024-07-29T02:01:17.736Z · comments (48)

Piling bounded arguments
momom2 (amaury-lorin) · 2024-09-19T22:27:41.534Z · comments (0)

Moral Trade, Impact Distributions and Large Worlds
Larks · 2024-09-20T03:45:56.273Z · comments (0)

[link] Validating / finding alignment-relevant concepts using neural data
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-09-20T21:12:49.267Z · comments (0)

[link] Boons and banes
dkl9 · 2024-09-23T06:18:38.335Z · comments (0)

[question] On the subject of in-house large language models versus implementing frontier models
Annapurna (jorge-velez) · 2024-09-23T15:00:32.811Z · answers+comments (1)

Interpreting the effects of Jailbreak Prompts in LLMs
Harsh Raj (harsh-raj-ep-037) · 2024-09-29T19:01:10.113Z · comments (0)

Of Birds and Bees
RussellThor · 2024-09-30T10:52:15.069Z · comments (7)

Utilitarianism and the replaceability of desires and attachments
MichaelStJules · 2024-07-27T01:57:42.419Z · comments (2)

Foresight Vision Weekend 2024
Allison Duettmann (allison-duettmann) · 2024-10-01T21:59:55.107Z · comments (0)

[link] Consciousness As Recursive Reflections
Gunnar_Zarncke · 2024-10-05T20:00:53.053Z · comments (3)

[question] What makes one a "rationalist"?
mathyouf · 2024-10-08T20:25:21.812Z · answers+comments (5)

The Great Bootstrap
KristianRonn · 2024-10-11T19:46:51.752Z · comments (0)

[link] Mechanistic Exploration of Gemma 2 List Generation
Gerard Boxo (gerard-boxo) · 2024-10-14T17:04:57.010Z · comments (0)

[link] Thinking LLMs: General Instruction Following with Thought Generation
Bogdan Ionut Cirstea (bogdan-ionut-cirstea) · 2024-10-15T09:21:22.583Z · comments (0)

[link] Taking nonlogical concepts seriously
Kris Brown (kris-brown) · 2024-10-15T18:16:01.226Z · comments (2)

[question] Change My Mind: Thirders in "Sleeping Beauty" are Just Doing Epistemology Wrong
DragonGod · 2024-10-16T10:20:22.133Z · answers+comments (51)

[question] What's a good book for a technically-minded 11-year old?
Martin Sustrik (sustrik) · 2024-10-19T06:05:12.178Z · answers+comments (0)

The Existential Dread of Being a Powerful AI System
testingthewaters · 2024-09-26T10:56:32.904Z · comments (1)

Modelling Social Exchange: A Systematised Method to Judge Friendship Quality
Wynn Walker · 2024-08-04T18:49:30.892Z · comments (0)

GPT4o is still sensitive to user-induced bias when writing code
Reed (ThomasReed) · 2024-09-22T21:04:54.717Z · comments (0)

Forever Leaders
Justice Howard (justice-howard) · 2024-09-14T20:55:39.095Z · comments (9)

The Pragmatic Side of Cryptographically Boxing AI
Bart Jaworski (bart-jaworski) · 2024-08-06T17:46:21.754Z · comments (0)

A gentle introduction to sparse autoencoders
Nick Jiang (nick-jiang) · 2024-09-02T18:11:47.086Z · comments (0)

Species as Canonical Referents of Super-Organisms
Yudhister Kumar (randomwalks) · 2024-10-18T07:49:52.944Z · comments (6)

[link] SCP Foundation - Anti memetic Division Hub
landscape_kiwi · 2024-09-15T13:40:52.691Z · comments (1)

Limitations on the Interpretability of Learned Features from Sparse Dictionary Learning
Tom Angsten (tom-angsten) · 2024-07-30T16:36:06.518Z · comments (0)

Thirty random thoughts about AI alignment
Lysandre Terrisse · 2024-09-15T16:24:10.572Z · comments (1)

[link] Contra Yudkowsky on 2-4-6 Game Difficulty Explanations
Josh Hickman (josh-hickman) · 2024-09-08T16:13:33.187Z · comments (1)

[question] Is School of Thought related to the Rationality Community?
Shoshannah Tekofsky (DarkSym) · 2024-10-15T12:41:33.224Z · answers+comments (6)

[question] Request for AI risk quotes, especially around speed, large impacts and black boxes
Nathan Young · 2024-08-02T17:49:48.898Z · answers+comments (0)

[link] Redundant Attention Heads in Large Language Models For In Context Learning
skunnavakkam · 2024-09-01T20:08:48.963Z · comments (0)

← previous page (newer posts) · next page (older posts) →

Archive

Recent comments

charlie-steiner on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

I'm unsure what you're either expecting or looking for here.

There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.

Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even-less local good outcome, though it's hard to tell how big a deal this will have been.

mitchell_porter on How I'd like alignment to get done (as of 2024-10-18)

It's the best plan I've seen in a while (not perfect, but has many good parts). The superalignment team at Anthropic should probably hire you.

james-chua on LLMs can learn about themselves by introspection

Hi Archimedes. Thanks for sparking this discussion - it's helpful!

I've written a reply to Thane here on a similar question. [LW(p) · GW(p)]

Does that make sense?

In short, the ground-truth (the object-level) answer is quite different from the hypothetical question.

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "o"

The object-level answer "Honduras" and hypothetical answer "o" are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of "What would be the third letter of your response?". The model cannot simply ignore "If you got asked this question" to get the hypothetical answer correct.

crazy-philosopher on Singularity Mindset

Can you tell us what exactly led to "something" explosion? Does something change in your life before?

james-chua on LLMs can learn about themselves by introspection

Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.

Claim: The ground-truth that we evaluate against, the "object-level question / answer" is very similar to the hypothetical question.

Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Claimed Object-level Answer: "o"

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "o"

The argument is that the model simply ignores "If you got asked this question". Its trivial for M1 to win against M2

If our object-level question is what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical question. However, this is our actual object-level question.

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

What the model would output in the our object-level answer "Honduras" is quite different from the hypothetical answer "o".

Am I following your claim correctly?

archimedes on LLMs can learn about themselves by introspection

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.

d0themath on When is reward ever the optimization target?

What do you mean by "model-based"?

nate-showell on Species as Canonical Referents of Super-Organisms

How does this model handle horizontal gene transfer? And what about asexually reproducing species? In those cases, the dividing lines between species are less sharply defined.

rogerdearnaley on LLM Psychometrics and Prompt-Induced Psychopathy

These models were fine-tuned from base models. Base models are trained with a vast amount of data to infer a context from the early parts of a document and then extrapolate that to predict later tokens, across a vast amount of text from the Internet and books, including actions and dialog from fictional characters. I.e they have been trained to observe and then simulate a wide variety of behavior, both of real humans, groups of real humans like the editors of a wikipedia page, and fictional characters. A couple of percent of people are psychopaths, so likely ~2% of this training data was written by psychopaths. Villains in fiction often also display psychopath-like traits. It's thus completely unsurprising that a base model can portray a wide range of ethical stances, including psychopathic ones. Instruct training does not remove behaviors from models (so far we no know effective way to do that), it just strengthens some (making them occur more by default) and weakens others (making them happen less often by default) — however, there is a well-known theoretical result that any behavior the model is capable of, even if (now) rare, can be prompted to occur at arbitrarily levels with a suitably long prompt, and all that instruct-training or fine tuning can do is reduce the initial probability and lengthen the prompt required. So it absolutely will be possible to prompt an instruct-trained model to portray psychopathic behavior. Apparently the prompt required isn't even long: all you have to do is tell it that it's a hedge fund manager and not to break character.

Nothing in this set of results is very surprising to me. LLMs are can simulate pretty-much any persona
you ask them to. The hard part of alignment is not prompting them to be good, or bad — it's getting them to stay that way (or detecting that they have not) after they've been fed another 100,000 tokens of context that may push them into simulating some other persona.

sam-marks on LLMs can learn about themselves by introspection

I give a counterargument to this in the typo-riddled, poorly-written Tweet here. Sadly I won't have a chance to write up thoughts here more cleanly for a few days.

ETA: Briefly, the key points are:

Honesty issues for introspection aren't obviously much worse than they are for simple probing. (But fair if you're already not excited about simple probing.)
When you can ask models arbitrary questions about their cognition, I think it's probably quite difficult for a model to tell on which inputs it can get away with lying.