LessWrong 2.0 Reader

AI Control May Increase Existential Risk
Jan_Kulveit · 2025-03-11T14:30:05.972Z · comments (13)
Explaining a Math Magic Trick
Robert_AIZI · 2024-05-05T19:41:52.048Z · comments (10)
[link] What the Headlines Miss About the Latest Decision in the Musk vs. OpenAI Lawsuit
garrison · 2025-03-06T19:49:02.145Z · comments (0)
Reasons for and against working on technical AI safety at a frontier AI lab
bilalchughtai (beelal) · 2025-01-05T14:49:53.529Z · comments (12)
Comments on Anthropic's Scaling Monosemanticity
Robert_AIZI · 2024-06-03T12:15:44.708Z · comments (8)
The Lizardman and the Black Hat Bobcat
Screwtape · 2025-04-06T19:02:01.238Z · comments (13)
The case for unlearning that removes information from LLM weights
Fabien Roger (Fabien) · 2024-10-14T14:08:04.775Z · comments (18)
It's time for a self-reproducing machine
Carl Feynman (carl-feynman) · 2024-08-07T21:52:22.819Z · comments (69)
[link] Compact Proofs of Model Performance via Mechanistic Interpretability
LawrenceC (LawChan) · 2024-06-24T19:27:21.214Z · comments (4)
Timaeus in 2024
Jesse Hoogland (jhoogland) · 2025-02-20T23:54:56.939Z · comments (1)
Sparsify: A mechanistic interpretability research agenda
Lee Sharkey (Lee_Sharkey) · 2024-04-03T12:34:12.043Z · comments (23)
[question] Am I confused about the "malign universal prior" argument?
nostalgebraist · 2024-08-27T23:17:22.779Z · answers+comments (35)
[link] MIRI's April 2024 Newsletter
Harlan · 2024-04-12T23:38:20.781Z · comments (0)
On Claude 3.5 Sonnet
Zvi · 2024-06-24T12:00:05.719Z · comments (14)
[link] Uncovering Deceptive Tendencies in Language Models: A Simulated Company AI Assistant
Olli Järviniemi (jarviniemi) · 2024-05-06T07:07:05.019Z · comments (13)
[link] Ilya Sutskever created a new AGI startup
harfe · 2024-06-19T17:17:17.366Z · comments (35)
[question] How to get nerds fascinated about mysterious chronic illness research?
riceissa · 2024-05-27T22:58:29.707Z · answers+comments (50)
[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)
I am the Golden Gate Bridge
Zvi · 2024-05-27T14:40:03.216Z · comments (6)
The Obliqueness Thesis
jessicata (jessica.liu.taylor) · 2024-09-19T00:26:30.677Z · comments (19)
Show, not tell: GPT-4o is more opinionated in images than in text
Daniel Tan (dtch1997) · 2025-04-02T08:51:02.571Z · comments (41)
Towards a Less Bullshit Model of Semantics
johnswentworth · 2024-06-17T15:51:06.060Z · comments (44)
[link] Sabotage Evaluations for Frontier Models
David Duvenaud (david-duvenaud) · 2024-10-18T22:33:14.320Z · comments (56)
Dragon Agnosticism
jefftk (jkaufman) · 2024-08-01T17:00:06.434Z · comments (75)
[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (7)
How I talk to those above me
Maxwell Peterson (maxwell-peterson) · 2025-03-30T06:54:59.869Z · comments (13)
We probably won't just play status games with each other after AGI
Matthew Barnett (matthew-barnett) · 2025-01-15T04:56:38.330Z · comments (21)
[link] On Eating the Sun
jessicata (jessica.liu.taylor) · 2025-01-08T04:57:20.457Z · comments (96)
What is malevolence? On the nature, measurement, and distribution of dark traits
David Althaus (wallowinmaya) · 2024-10-23T08:41:33.197Z · comments (23)
Apollo Research 1-year update
Marius Hobbhahn (marius-hobbhahn) · 2024-05-29T17:44:32.484Z · comments (0)
LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (45)
[link] Executable philosophy as a failed totalizing meta-worldview
jessicata (jessica.liu.taylor) · 2024-09-04T22:50:18.294Z · comments (40)
2024 Petrov Day Retrospective
Ben Pace (Benito) · 2024-09-28T21:30:14.952Z · comments (25)
The subset parity learning problem: much more than you wanted to know
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-03T09:13:59.245Z · comments (18)
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (2)
The Rising Sea
Jesse Hoogland (jhoogland) · 2025-01-25T20:48:52.971Z · comments (2)
SB 1047: Final Takes and Also AB 3211
Zvi · 2024-08-27T22:10:07.647Z · comments (11)
Defining alignment research
Richard_Ngo (ricraz) · 2024-08-19T20:42:29.279Z · comments (23)
Catastrophic sabotage as a major threat model for human-level AI systems
evhub · 2024-10-22T20:57:11.395Z · comments (13)
Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)
Takeoff speeds presentation at Anthropic
Tom Davidson (tom-davidson-1) · 2024-06-04T22:46:35.448Z · comments (0)
Towards Multimodal Interpretability: Learning Sparse Interpretable Features in Vision Transformers
hugofry · 2024-04-29T20:57:35.127Z · comments (8)
[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)
How training-gamers might function (and win)
Vivek Hebbar (Vivek) · 2025-04-11T21:26:18.669Z · comments (4)
Circular Reasoning
abramdemski · 2024-08-05T18:10:32.736Z · comments (37)
Self-prediction acts as an emergent regularizer
Cameron Berg (cameron-berg) · 2024-10-23T22:27:03.664Z · comments (9)
Just admit that you’ve zoned out
joec · 2024-06-04T02:51:27.594Z · comments (22)
[Intuitive self-models] 1. Preliminaries
Steven Byrnes (steve2152) · 2024-09-19T13:45:27.976Z · comments (23)
Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)
New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)