LessWrong 2.0 Reader

Notable runaway-optimiser-like LLM failure modes on Biologically and Economically aligned AI safety benchmarks for LLMs with simplified observation format
Roland Pihlakas (roland-pihlakas) · 2025-03-16T23:23:30.989Z · comments (6)
The Rise of Hyperpalatability
Jack (jack-3) · 2025-04-02T20:18:04.407Z · comments (10)
Grok Grok
Zvi · 2025-02-24T14:20:08.877Z · comments (2)
[Cross-post] Every Bay Area "Walled Compound"
davekasten · 2025-01-23T15:05:08.629Z · comments (3)
Announcing EXP: Experimental Summer Workshop on Collective Cognition
Jan_Kulveit · 2025-03-15T20:14:47.972Z · comments (2)
2024 was the year of the big battery, and what that means for solar power
transhumanist_atom_understander · 2025-02-01T06:27:39.082Z · comments (1)
Extended analogy between humans, corporations, and AIs.
Daniel Kokotajlo (daniel-kokotajlo) · 2025-02-13T00:03:13.956Z · comments (2)
Any-Benefit Mindset and Any-Reason Reasoning
silentbob · 2025-03-15T17:10:14.682Z · comments (9)
o3 Will Use Its Tools For You
Zvi · 2025-04-18T21:20:02.566Z · comments (3)
Meditation and Reduced Sleep Need
niplav · 2025-04-04T14:42:54.792Z · comments (8)
Theory of Change for AI Safety Camp
Linda Linsefors · 2025-01-22T22:07:10.664Z · comments (3)
ParaScopes: Do Language Models Plan the Upcoming Paragraph?
NickyP (Nicky) · 2025-02-21T16:50:20.745Z · comments (2)
[link] We don't want to post again "This might be the last AI Safety Camp"
Remmelt (remmelt-ellen) · 2025-01-21T12:03:33.171Z · comments (17)
[question] Will LLM agents become the first takeover-capable AGIs?
Seth Herd · 2025-03-02T17:15:37.056Z · answers+comments (10)
[link] Forecasting Frontier Language Model Agent Capabilities
Govind Pimpale (govind-pimpale) · 2025-02-24T16:51:32.022Z · comments (0)
Split Personality Training: Revealing Latent Knowledge Through Personality-Shift Tokens
Florian_Dietz · 2025-03-10T16:07:45.215Z · comments (3)
[link] SuperBabies podcast with Gene Smith
Eneasz · 2025-02-19T19:36:49.852Z · comments (1)
Kitchen Air Purifier Comparison
jefftk (jkaufman) · 2025-01-22T03:20:03.224Z · comments (2)
[link] Introducing MASK: A Benchmark for Measuring Honesty in AI Systems
Richard Ren (RichardR) · 2025-03-05T22:56:46.155Z · comments (5)
Call for Collaboration: Renormalization for AI safety
Lauren Greenspan (LaurenGreenspan) · 2025-03-31T21:01:56.500Z · comments (0)
System 2 Alignment
Seth Herd · 2025-02-13T19:17:56.868Z · comments (0)
ARENA 5.0 - Call for Applicants
JamesH (AtlasOfCharts) · 2025-01-30T13:18:27.052Z · comments (2)
Can SAE steering reveal sandbagging?
jordine · 2025-04-15T12:33:41.264Z · comments (3)
[link] Forecasting time to automated superhuman coders [AI 2027 Timelines Forecast]
elifland · 2025-04-10T23:10:23.063Z · comments (0)
[link] Well-foundedness as an organizing principle of healthy minds and societies
Richard_Ngo (ricraz) · 2025-04-07T00:31:34.098Z · comments (6)
[link] AI Can't Write Good Fiction
JustisMills · 2025-03-12T06:11:57.786Z · comments (19)
Operator
Zvi · 2025-01-28T20:00:08.374Z · comments (1)
Boots theory and Sybil Ramkin
philh · 2025-03-18T22:10:08.855Z · comments (17)
More Fun With GPT-4o Image Generation
Zvi · 2025-04-03T02:10:02.317Z · comments (3)
Writing experiments and the banana escape valve
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-23T13:11:24.215Z · comments (1)
[link] Hunting for AI Hackers: LLM Agent Honeypot
Reworr R (reworr-reworr) · 2025-02-12T20:29:32.269Z · comments (0)
Avoid the Counterargument Collapse
marknm · 2025-03-26T03:19:58.655Z · comments (3)
Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?
Alex Mallen (alex-mallen) · 2025-03-24T17:55:59.358Z · comments (0)
Reasons-based choice and cluelessness
JesseClifton · 2025-02-07T22:21:47.232Z · comments (0)
Everything I Know About Semantics I Learned From Music Notation
J Bostock (Jemist) · 2025-03-09T18:09:11.789Z · comments (2)
Why Are The Human Sciences Hard? Two New Hypotheses
Aydin Mohseni (aydin-mohseni) · 2025-03-18T15:45:52.239Z · comments (14)
AI #106: Not so Fast
Zvi · 2025-03-06T15:40:05.919Z · comments (4)
Austin Chen on Winning, Risk-Taking, and FTX
Elizabeth (pktechgirl) · 2025-04-07T19:00:08.039Z · comments (3)
[link] Center on Long-Term Risk: Summer Research Fellowship 2025 - Apply Now
Tristan Cook · 2025-03-26T17:29:14.797Z · comments (0)
OpenAI rewrote its Preparedness Framework
Zach Stein-Perlman · 2025-04-15T20:00:50.614Z · comments (1)
Re: Taste
lsusr · 2025-02-01T03:34:10.918Z · comments (8)
[link] OpenAI releases GPT-4.5
Seth Herd · 2025-02-27T21:40:45.010Z · comments (12)
Is instrumental convergence a thing for virtue-driven agents?
mattmacdermott · 2025-04-02T03:59:20.064Z · comments (37)
Why do misalignment risks increase as AIs get more capable?
ryan_greenblatt · 2025-04-11T03:06:50.928Z · comments (6)
DeepSeek: Lemon, It’s Wednesday
Zvi · 2025-01-29T15:00:07.914Z · comments (0)
[link] Meta: Frontier AI Framework
Zach Stein-Perlman · 2025-02-03T22:00:17.103Z · comments (2)
FLAKE-Bench: Outsourcing Awkwardness in the Age of AI
annas (annasoli) · 2025-04-01T17:08:25.092Z · comments (0)
How much progress actually happens in theoretical physics?
ChristianKl · 2025-04-04T23:08:00.633Z · comments (32)
Top AI safety newsletters, books, podcasts, etc – new AISafety.com resource
Bryce Robertson (bryceerobertson) · 2025-03-04T17:01:18.758Z · comments (2)
What Uniparental Disomy Tells Us About Improper Imprinting in Humans
Morpheus · 2025-03-28T11:24:47.133Z · comments (1)