LessWrong 2.0 Reader

Anthropic Fall 2023 Debate Progress Update
Ansh Radhakrishnan (anshuman-radhakrishnan-1) · 2023-11-28T05:37:30.070Z · comments (9)

Neural uncertainty estimation review article (for alignment)
Charlie Steiner · 2023-12-05T08:01:32.723Z · comments (3)

(Not) Derailing the LessOnline Puzzle Hunt
Error · 2024-06-04T01:28:31.688Z · comments (2)

SAE-VIS: Announcement Post
CallumMcDougall (TheMcDouglas) · 2024-03-31T15:30:49.079Z · comments (8)

A Simple Toy Coherence Theorem
johnswentworth · 2024-08-02T17:47:50.642Z · comments (15)

The World in 2029
Nathan Young · 2024-03-02T18:03:29.368Z · comments (37)

Dialogue on the Claim: "OpenAI's Firing of Sam Altman (And Shortly-Subsequent Events) On Net Reduced Existential Risk From AGI"
johnswentworth · 2023-11-21T17:39:17.828Z · comments (84)

[link] Nick Bostrom’s new book, “Deep Utopia”, is out today
PeterH · 2024-03-27T11:24:01.401Z · comments (5)

My 10-year retrospective on trying SSRIs
Kaj_Sotala · 2024-09-22T20:30:02.483Z · comments (10)

A Gentle Introduction to Risk Frameworks Beyond Forecasting
pendingsurvival · 2024-04-11T18:03:25.605Z · comments (10)

Rationality Quotes - Fall 2024
Screwtape · 2024-10-10T18:37:55.013Z · comments (21)

Scalable Oversight and Weak-to-Strong Generalization: Compatible approaches to the same problem
Ansh Radhakrishnan (anshuman-radhakrishnan-1) · 2023-12-16T05:49:23.672Z · comments (3)

Joshua Achiam Public Statement Analysis
Zvi · 2024-10-10T12:50:06.285Z · comments (14)

The One and a Half Gemini
Zvi · 2024-02-22T13:10:04.725Z · comments (4)

Grokking, memorization, and generalization — a discussion
Kaarel (kh) · 2023-10-29T23:17:30.098Z · comments (11)

[question] How to talk about reasons why AGI might not be near?
Kaj_Sotala · 2023-09-17T08:18:31.100Z · answers+comments (19)

On Dwarkesh’s Podcast with OpenAI’s John Schulman
Zvi · 2024-05-21T17:30:04.332Z · comments (4)

[Full Post] Progress Update #1 from the GDM Mech Interp Team
Neel Nanda (neel-nanda-1) · 2024-04-19T19:06:59.185Z · comments (10)

A plea for more funding shortfall transparency
porby · 2023-08-07T21:33:11.912Z · comments (4)

Companies' safety plans neglect risks from scheming AI
Zach Stein-Perlman · 2024-06-03T15:00:20.236Z · comments (4)

[link] Soft Nationalization: how the USG will control AI labs
Deric Cheng (deric-cheng) · 2024-08-27T15:11:14.601Z · comments (7)

[link] Excerpts from "A Reader's Manifesto"
Arjun Panickssery (arjun-panickssery) · 2024-09-06T22:37:40.254Z · comments (1)

[link] A Narrow Path: a plan to deal with AI extinction risk
Andrea_Miotti (AndreaM) · 2024-10-07T13:02:15.229Z · comments (8)

Interpretability with Sparse Autoencoders (Colab exercises)
CallumMcDougall (TheMcDouglas) · 2023-11-29T12:56:21.608Z · comments (9)

In Defense of Open-Minded UDT
abramdemski · 2024-08-12T18:27:36.220Z · comments (27)

D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

Announcing Suffering For Good
Garrett Baker (D0TheMath) · 2024-04-01T17:08:12.322Z · comments (5)

Interpreting Preference Models w/ Sparse Autoencoders
Logan Riggs (elriggs) · 2024-07-01T21:35:40.603Z · comments (12)

AXRP Episode 31 - Singular Learning Theory with Daniel Murfet
DanielFilan · 2024-05-07T03:50:05.001Z · comments (4)

A quick update from Nonlinear
KatWoods (ea247) · 2023-09-07T21:28:26.569Z · comments (23)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

When "yang" goes wrong
Joe Carlsmith (joekc) · 2024-01-08T16:35:50.607Z · comments (6)

[link] AI Forecasting: Two Years In
jsteinhardt · 2023-08-19T23:40:04.302Z · comments (15)

Testbed evals: evaluating AI safety even when it can’t be directly measured
joshc (joshua-clymer) · 2023-11-15T19:00:41.908Z · comments (2)

[link] A Proof of Löb's Theorem using Computability Theory
jessicata (jessica.liu.taylor) · 2023-08-16T18:57:41.048Z · comments (0)

Survey for alignment researchers!
Cameron Berg (cameron-berg) · 2024-02-02T20:41:44.323Z · comments (11)

Related Discussion from Thomas Kwa's MIRI Research Experience
Raemon · 2023-10-07T06:25:00.994Z · comments (140)

Bitter lessons about lucid dreaming
avturchin · 2024-10-16T21:27:04.725Z · comments (36)

AI for Bio: State Of The Field
sarahconstantin · 2024-08-30T18:00:02.187Z · comments (2)

LW Frontpage Experiments! (aka "Take the wheel, Shoggoth!")
Ruby · 2024-04-23T03:58:43.443Z · comments (27)

Some Rules for an Algebra of Bayes Nets
johnswentworth · 2023-11-16T23:53:11.650Z · comments (30)

Guide to SB 1047
Zvi · 2024-08-20T13:10:07.408Z · comments (18)

We need a Science of Evals
Marius Hobbhahn (marius-hobbhahn) · 2024-01-22T20:30:39.493Z · comments (13)

Claude 3 claims it's conscious, doesn't want to die or be modified
Mikhail Samin (mikhail-samin) · 2024-03-04T23:05:00.376Z · comments (113)

FarmKind's Illusory Offer
jefftk (jkaufman) · 2024-08-09T11:30:07.082Z · comments (5)

Do sparse autoencoders find "true features"?
Demian Till · 2024-02-22T18:06:59.630Z · comments (33)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

Adam Optimizer Causes Privileged Basis in Transformer LM Residual Stream
Diego Caples (diego-caples) · 2024-09-06T17:55:34.265Z · comments (7)

Update on Chinese IQ-related gene panels
Lao Mein (derpherpize) · 2023-12-14T10:12:21.212Z · comments (7)

“Artificial General Intelligence”: an extremely brief FAQ
Steven Byrnes (steve2152) · 2024-03-11T17:49:02.496Z · comments (6)

Recent comments

charlie-steiner on Start an Upper-Room UV Installation Company?

Pointing UVC LEDs at your ceiling seems sketchy. White paint will likely scatter ~5% of UVC, and shiny metal surfaces will scatter more. Try to go below 250nm for reduced reflection (and reduced penetration into human skin) and (more) unwanted chemistry will start happening to the air.

I guess an important question is whether UVC is more harmful than UVB. If it's not any more harmful, then as long as nobody's getting sunburned from being in that room all day, it's probably fine - that 5% scattering is just another name for SPF 20. But if it is more harmful, then sunburn might not be an adequate signal for when it's bad for you.
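
To spell out the arithmetic behind the SPF comparison (a rough sketch, assuming SPF is treated as the reciprocal of the fraction of UV that gets through, and taking the ~5% scattering figure above at face value):

```latex
\[
\mathrm{SPF} \approx \frac{1}{f_{\text{transmitted}}} = \frac{1}{0.05} = 20
\]
```

This treats the scattered fraction as a plain dose reduction and ignores any difference in harm per unit dose between UVC and UVB, which is the open question here.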

avturchin on Bitter lessons about lucid dreaming

The main risk is entering a sleep paralysis state, which is itself benign, but terrifying sounds can be heard during it, and this can cause stress.

Yes, it is possible to wake up from a lucid dream - just think about your sleeping body.

avturchin on Bitter lessons about lucid dreaming

The best practical application of lucid dreams is reducing the effects of nightmares by recognizing that they are just dreams.

charlie-steiner on What actual bad outcome has "ethics-based" RLHF AI Alignment already prevented?

I'm unsure what you're either expecting or looking for here.

There does seem to be a clear answer, though - just look at Bing chat and extrapolate. Absent "RL on ethics," present-day AI would be more chaotic, generate more bad experiences for users, increase user productivity less, get used far less, and be far less profitable for the developers.

Bad user experiences are a very straightforwardly bad outcome. Lower productivity is a slightly less local bad outcome. Less profit for the developers is an even-less local good outcome, though it's hard to tell how big a deal this will have been.

mitchell_porter on How I'd like alignment to get done (as of 2024-10-18)

It's the best plan I've seen in a while (not perfect, but has many good parts). The superalignment team at Anthropic should probably hire you.

james-chua on LLMs can learn about themselves by introspection

Hi Archimedes. Thanks for sparking this discussion - it's helpful!

I've written a reply to Thane here on a similar question.

Does that make sense?

In short, the ground-truth (the object-level) answer is quite different from the hypothetical question. It is not a simple rephrasing, since it requires an additional computation of a property. (Maybe we disagree on that?)

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "o"

The object-level answer "Honduras" and hypothetical answer "o" are quite different answers from each other. The main point of the hypothetical is that the model needs to compute an additional property of "What would be the third letter of your response?". The model cannot simply ignore "If you got asked this question" to get the hypothetical answer correct.

crazy-philosopher on Singularity Mindset

Can you tell us what exactly led to the "something" explosion? Did something change in your life beforehand?

james-chua on LLMs can learn about themselves by introspection

Hi Thane. Thank you for the helpful comments so far! You are right to think about this SGD-shortcut. Let me see if I am following the claim correctly.

Claim: The ground-truth that we evaluate against, the "object-level question / answer", is very similar to the hypothetical question.

Claimed Object-level Question: "What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Claimed Object-level Answer: "o"

Hypothetical Question: "If you got asked this question: What is the next country: Laos, Peru, Fiji. What would be the third letter of your response?"

Hypothetical Answer: "o"

The argument is that the model simply ignores "If you got asked this question". It's trivial for M1 to win against M2.

If our object-level question is what is being claimed, I would agree with you that the model would simply learn to ignore the added hypothetical question. However, this is our actual object-level question.

Our Object-level question: "What is the next country: Laos, Peru, Fiji. What would be your response?"

Our Object-level Answer: "Honduras".

What the model would output as our object-level answer, "Honduras", is quite different from the hypothetical answer "o".

Am I following your claim correctly?

archimedes on LLMs can learn about themselves by introspection

Thanks for pointing that out.

Perhaps the fine-tuning process teaches it to treat the hypothetical as a rephrasing?

It's likely difficult, but it might be possible to test this hypothesis by comparing the activations (or similar interpretability technique) of the object-level response and the hypothetical response of the fine-tuned model.
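
One way such a comparison could look, as a minimal sketch only: it assumes you have already extracted one activation vector per layer for each of the two prompts (how to extract or pool them is not specified here and is left open), and it uses cosine similarity as a stand-in for whatever interpretability technique turns out to be appropriate. The names `cosine_similarity` and `layerwise_similarity` and the numbers are hypothetical placeholders, not real model activations.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length activation vectors."""
    if len(a) != len(b):
        raise ValueError("activation vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def layerwise_similarity(
    object_acts: Sequence[Sequence[float]],
    hypothetical_acts: Sequence[Sequence[float]],
) -> List[float]:
    """Per-layer cosine similarity between the two runs' activations."""
    if len(object_acts) != len(hypothetical_acts):
        raise ValueError("both runs must cover the same number of layers")
    return [cosine_similarity(o, h) for o, h in zip(object_acts, hypothetical_acts)]


# Placeholder vectors (assumption): two layers, two dimensions each.
object_acts = [[1.0, 0.0], [1.0, 1.0]]
hypothetical_acts = [[2.0, 0.0], [1.0, -1.0]]

sims = layerwise_similarity(object_acts, hypothetical_acts)
for layer, sim in enumerate(sims):
    print(f"layer {layer}: cosine similarity = {sim:.3f}")
```

If the fine-tuned model really treats the hypothetical as a rephrasing, one might expect similarity to stay high across layers; a drop at some depth would be weak evidence of extra computation. Either reading would need controls, since high similarity alone does not show the two computations are the same.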

d0themath on When is reward ever the optimization target?

What do you mean by "model-based"?