LessWrong 2.0 Reader

Refusal in LLMs is mediated by a single direction
Andy Arditi (andy-arditi) · 2024-04-27T11:13:06.235Z · comments (93)

[link] Explore More: A Bag of Tricks to Keep Your Life on the Rails
Shoshannah Tekofsky (DarkSym) · 2024-09-28T21:38:52.256Z · comments (15)

Believing In
AnnaSalamon · 2024-02-08T07:06:13.072Z · comments (51)

[link] Introducing AI Lab Watch
Zach Stein-Perlman · 2024-04-30T17:00:12.652Z · comments (30)

[link] "How could I have thought that faster?"
mesaoptimizer · 2024-03-11T10:56:17.884Z · comments (32)

SAE feature geometry is outside the superposition hypothesis
jake_mendel · 2024-06-24T16:07:14.604Z · comments (17)

The ‘strong’ feature hypothesis could be wrong
lewis smith (lsgos) · 2024-08-02T14:33:58.898Z · comments (17)

AGI Safety and Alignment at Google DeepMind: A Summary of Recent Work
Rohin Shah (rohinmshah) · 2024-08-20T16:22:45.888Z · comments (33)

MIRI 2024 Mission and Strategy Update
Malo (malo) · 2024-01-05T00:20:54.169Z · comments (44)

Modern Transformers are AGI, and Human-Level
abramdemski · 2024-03-26T17:46:19.373Z · comments (88)

You are not too "irrational" to know your preferences.
DaystarEld · 2024-11-26T15:01:42.996Z · comments (50)

CFAR Takeaways: Andrew Critch
Raemon · 2024-02-14T01:37:03.931Z · comments (62)

LLM Generality is a Timeline Crux
eggsyntax · 2024-06-24T12:52:07.704Z · comments (119)

Brute Force Manufactured Consensus is Hiding the Crime of the Century
Roko · 2024-02-03T20:36:59.806Z · comments (156)

Superbabies: Putting The Pieces Together
sarahconstantin · 2024-07-11T20:40:05.036Z · comments (37)

Ayn Rand’s model of “living money”; and an upside of burnout
AnnaSalamon · 2024-11-16T02:59:07.368Z · comments (58)

"Slow" takeoff is a terrible term for "maybe even faster takeoff, actually"
Raemon · 2024-09-28T23:38:25.512Z · comments (69)

ChatGPT can learn indirect control
Raymond D · 2024-03-21T21:11:06.649Z · comments (27)

Towards more cooperative AI safety strategies
Richard_Ngo (ricraz) · 2024-07-16T04:36:29.191Z · comments (133)

The Sun is big, but superintelligences will not spare Earth a little sunlight
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-09-23T03:39:16.243Z · comments (141)

[link] What TMS is like
Sable · 2024-10-31T00:44:22.612Z · comments (23)

Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-04-30T18:51:13.493Z · comments (42)

OpenAI: Fallout
Zvi · 2024-05-28T13:20:04.325Z · comments (25)

Toward A Mathematical Framework for Computation in Superposition
Dmitry Vaintrob (dmitry-vaintrob) · 2024-01-18T21:06:57.040Z · comments (18)

What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (10)

[link] Jaan Tallinn's 2023 Philanthropy Overview
jaan · 2024-05-20T12:11:39.416Z · comments (5)

Funny Anecdote of Eliezer From His Sister
Noah Birnbaum (daniel-birnbaum) · 2024-04-22T22:05:31.886Z · comments (6)

Frontier Models are Capable of In-context Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-12-05T22:11:17.320Z · comments (24)

Pay Risk Evaluators in Cash, Not Equity
Adam Scholl (adam_scholl) · 2024-09-07T02:37:59.659Z · comments (19)

Making a conservative case for alignment
Cameron Berg (cameron-berg) · 2024-11-15T18:55:40.864Z · comments (68)

Maybe Anthropic's Long-Term Benefit Trust is powerless
Zach Stein-Perlman · 2024-05-27T13:00:47.991Z · comments (21)

[link] Sam Altman’s Chip Ambitions Undercut OpenAI’s Safety Strategy
garrison · 2024-02-10T19:52:55.191Z · comments (52)

The Hopium Wars: the AGI Entente Delusion
Max Tegmark (MaxTegmark) · 2024-10-13T17:00:29.033Z · comments (55)

How I Learned To Stop Trusting Prediction Markets and Love the Arbitrage
orthonormal · 2024-08-06T02:32:41.364Z · comments (30)

[link] Understanding Shapley Values with Venn Diagrams
Carson L · 2024-12-06T21:56:43.960Z · comments (32)

The impossible problem of due process
mingyuan · 2024-01-16T05:18:33.415Z · comments (64)

This might be the last AI Safety Camp
Remmelt (remmelt-ellen) · 2024-01-24T09:33:29.438Z · comments (34)

Response to Aschenbrenner's "Situational Awareness"
Rob Bensinger (RobbBB) · 2024-06-06T22:57:11.737Z · comments (27)

[question] Examples of Highly Counterfactual Discoveries?
johnswentworth · 2024-04-23T22:19:19.399Z · answers+comments (101)

Self-Other Overlap: A Neglected Approach to AI Alignment
Marc Carauleanu (Marc-Everin Carauleanu) · 2024-07-30T16:22:29.561Z · comments (43)

Optimistic Assumptions, Longterm Planning, and "Cope"
Raemon · 2024-07-17T22:14:24.090Z · comments (46)

[link] The Compendium, A full argument about extinction risk from AGI
adamShimi · 2024-10-31T12:01:51.714Z · comments (52)

What's Going on With OpenAI's Messaging?
ozziegooen · 2024-05-21T02:22:04.171Z · comments (13)

My AI Model Delta Compared To Christiano
johnswentworth · 2024-06-12T18:19:44.768Z · comments (73)

Communications in Hard Mode (My new job at MIRI)
tanagrabeast · 2024-12-13T20:13:44.825Z · comments (24)

Two easy things that maybe Just Work to improve AI discourse
jacobjacob · 2024-06-08T15:51:18.078Z · comments (35)

Cryonics is free
Mati_Roy (MathieuRoy) · 2024-09-29T17:58:17.108Z · comments (42)

OMMC Announces RIP
Adam Scholl (adam_scholl) · 2024-04-01T23:20:00.433Z · comments (5)

On Not Pulling The Ladder Up Behind You
Screwtape · 2024-04-26T21:58:29.455Z · comments (21)

My Interview With Cade Metz on His Reporting About Slate Star Codex
Zack_M_Davis · 2024-03-26T17:18:05.114Z · comments (187)

Admittedly it's possible that this is totally happening all over the place and people are just covering it up in order to have all of the glory/status for themselves. But I doubt it: there are enough remarkably selfless LLM enthusiasts that if this were happening, I'd expect it would've gone viral already.
