LessWrong 2.0 Reader

[link] [Repost] The Copenhagen Interpretation of Ethics
mesaoptimizer · 2024-01-25T15:20:08.162Z · comments (4)
Update on Chinese IQ-related gene panels
Lao Mein (derpherpize) · 2023-12-14T10:12:21.212Z · comments (7)
Eliezer's example on Bayesian statistics is wr... oops!
Zane · 2023-10-17T18:38:18.327Z · comments (13)
Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)
LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (20)
Influence functions - why, what and how
Nina Panickssery (NinaR) · 2023-09-15T20:42:08.653Z · comments (6)
Dumbing down
Martin Sustrik (sustrik) · 2024-06-09T06:50:47.469Z · comments (0)
Prompts for Big-Picture Planning
Raemon · 2024-04-13T03:04:24.523Z · comments (1)
AI Safety is Dropping the Ball on Clown Attacks
trevor (TrevorWiesinger) · 2023-10-22T20:09:31.810Z · comments (73)
Epistemic Hell
rogersbacon · 2024-01-27T17:13:09.578Z · comments (20)
“Artificial General Intelligence”: an extremely brief FAQ
Steven Byrnes (steve2152) · 2024-03-11T17:49:02.496Z · comments (6)
If we solve alignment, do we die anyway?
Seth Herd · 2024-08-23T13:13:10.933Z · comments (65)
[link] The True Story of How GPT-2 Became Maximally Lewd
Writer · 2024-01-18T21:03:08.167Z · comments (7)
[link] OpenAI: Preparedness framework
Zach Stein-Perlman · 2023-12-18T18:30:10.153Z · comments (23)
Red-teaming language models via activation engineering
Nina Panickssery (NinaR) · 2023-08-26T05:52:00.523Z · comments (6)
[link] [Link Post] "Foundational Challenges in Assuring Alignment and Safety of Large Language Models"
David Scott Krueger (formerly: capybaralet) (capybaralet) · 2024-06-06T18:55:09.151Z · comments (2)
Would You Work Harder In The Least Convenient Possible World?
Firinn · 2023-09-22T05:17:05.148Z · comments (93)
Game Theory without Argmax [Part 1]
Cleo Nardo (strawberry calm) · 2023-11-11T15:59:47.486Z · comments (18)
[link] Paper: Understanding and Controlling a Maze-Solving Policy Network
TurnTrout · 2023-10-13T01:38:09.147Z · comments (0)
[link] Former OpenAI Superalignment Researcher: Superintelligence by 2030
Julian Bradshaw · 2024-06-05T03:35:19.251Z · comments (30)
[link] Motivation gaps: Why so much EA criticism is hostile and lazy
titotal (lombertini) · 2024-04-22T11:49:59.389Z · comments (5)
High-level interpretability: detecting an AI's objectives
Paul Colognese (paul-colognese) · 2023-09-28T19:30:16.753Z · comments (4)
AXRP Episode 27 - AI Control with Buck Shlegeris and Ryan Greenblatt
DanielFilan · 2024-04-11T21:30:04.244Z · comments (10)
[link] The Inner Ring by C. S. Lewis
Saul Munn (saul-munn) · 2024-04-24T22:48:09.228Z · comments (6)
[link] InterLab – a toolkit for experiments with multi-agent interactions
Tomáš Gavenčiak (tomas-gavenciak) · 2024-01-22T18:23:35.661Z · comments (0)
Contra Yudkowsky on Epistemic Conduct for Author Criticism
Zack_M_Davis · 2023-09-13T15:33:14.987Z · comments (38)
When Omnipotence is Not Enough
lsusr · 2023-08-25T19:50:51.038Z · comments (3)
Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)
[link] We're all in this together
Tamsin Leake (carado-1) · 2023-12-05T13:57:46.270Z · comments (65)
Transcoders enable fine-grained interpretable circuit analysis for language models
Jacob Dunefsky (jacob-dunefsky) · 2024-04-30T17:58:09.982Z · comments (14)
Modulating sycophancy in an RLHF model via activation steering
Nina Panickssery (NinaR) · 2023-08-09T07:06:50.859Z · comments (20)
How We Picture Bayesian Agents
johnswentworth · 2024-04-08T18:12:48.595Z · comments (14)
[link] Davidad's Provably Safe AI Architecture - ARIA's Programme Thesis
simeon_c (WayZ) · 2024-02-01T21:30:44.090Z · comments (17)
Flagging Potentially Unfair Parenting
jefftk (jkaufman) · 2023-12-26T12:40:05.099Z · comments (1)
3 levels of threat obfuscation
HoldenKarnofsky · 2023-08-02T14:58:32.506Z · comments (14)
Finding Sparse Linear Connections between Features in LLMs
Logan Riggs (elriggs) · 2023-12-09T02:27:42.456Z · comments (5)
[link] Barriers to Mechanistic Interpretability for AGI Safety
Connor Leahy (NPCollapse) · 2023-08-29T10:56:45.639Z · comments (13)
MATS AI Safety Strategy Curriculum
Ronny Fernandez (ronny-fernandez) · 2024-03-07T19:59:37.434Z · comments (2)
Visible loss landscape basins don't correspond to distinct algorithms
Mikhail Samin (mikhail-samin) · 2023-07-28T16:19:05.279Z · comments (13)
AI #79: Ready for Some Football
Zvi · 2024-08-29T13:30:10.902Z · comments (16)
Duct Tape security
Isaac King (KingSupernova) · 2024-04-26T18:57:05.659Z · comments (11)
Text Posts from the Kids Group: 2020
jefftk (jkaufman) · 2024-04-13T22:30:05.326Z · comments (3)
[New Feature] Your Subscribed Feed
Ruby · 2024-06-11T22:45:00.000Z · comments (8)
Best in Class Life Improvement
sapphire (deluks917) · 2024-04-04T01:51:02.556Z · comments (20)
[link] [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders
chanind · 2024-09-25T09:31:03.296Z · comments (15)
Different senses in which two AIs can be “the same”
Vivek Hebbar (Vivek) · 2024-06-24T03:16:43.400Z · comments (0)
How useful is "AI Control" as a framing on AI X-Risk?
habryka (habryka4) · 2024-03-14T18:06:30.459Z · comments (4)
Generalized Stat Mech: The Boltzmann Approach
David Lorell · 2024-04-12T17:47:31.880Z · comments (7)
[Summary] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:17.755Z · comments (0)
[link] GPT-4o System Card
Zach Stein-Perlman · 2024-08-08T20:30:52.633Z · comments (11)