LessWrong 2.0 Reader


next page (older posts) →

On Emergent Misalignment
Zvi · 2025-02-28T13:10:05.973Z · comments (2)
Weirdness Points
lsusr · 2025-02-28T02:23:56.508Z · comments (7)
TamperSec is hiring for 3 Key Roles!
Jonathan_H (JonathanH) · 2025-02-28T23:10:31.540Z · comments (0)
[link] Estimating the Probability of Sampling a Trained Neural Network at Random
Adam Scherlis (adam-scherlis) · 2025-03-01T02:11:56.313Z · comments (2)
AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
DanielFilan · 2025-03-01T01:20:04.778Z · comments (0)
January-February 2025 Progress in Guaranteed Safe AI
Quinn (quinn-dougherty) · 2025-02-28T03:10:01.909Z · comments (1)
Markdown Object Notation
bhauth · 2025-02-28T19:24:25.422Z · comments (0)
Dance Weekend Pay II
jefftk (jkaufman) · 2025-02-28T15:10:02.030Z · comments (0)
Cycles (a short story by Claude 3.7 and me)
Knight Lee (Max Lee) · 2025-02-28T07:04:46.602Z · comments (0)
[link] Do safety-relevant LLM steering vectors optimized on a single example generalize?
Jacob Dunefsky (jacob-dunefsky) · 2025-02-28T12:01:12.514Z · comments (0)
The Theoretical Reward Learning Research Agenda: Introduction and Motivation
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:20:30.168Z · comments (1)
[link] An Open Letter To EA and AI Safety On Decelerating AI Development
kenneth_diao · 2025-02-28T17:21:42.826Z · comments (0)
Defining and Characterising Reward Hacking
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:25:42.777Z · comments (0)
Partial Identifiability in Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:23:30.738Z · comments (0)
Misspecification in Inverse Reinforcement Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:49.204Z · comments (0)
STARC: A General Framework For Quantifying Differences Between Reward Functions
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:52.965Z · comments (0)
Misspecification in Inverse Reinforcement Learning - Part II
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:59.570Z · comments (0)
Other Papers About the Theory of Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:26:11.490Z · comments (0)
How to Contribute to Theoretical Reward Learning Research
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:27:52.552Z · comments (0)
Do we want alignment faking?
Florian_Dietz · 2025-02-28T21:50:48.891Z · comments (0)
Existentialists and Trolleys
David Gross (David_Gross) · 2025-02-28T14:01:49.509Z · comments (2)
Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs
tenseisoham · 2025-02-28T20:22:17.721Z · comments (0)
[question] What nation did Trump prevent from going to war (Feb. 2025)?
James Camacho (james-camacho) · 2025-03-01T01:46:58.929Z · answers+comments (0)
[link] Tetherware #2: What every human should know about our most likely AI future
Jáchym Fibír · 2025-02-28T11:12:59.033Z · comments (0)
Notes on Superwisdom & Moral RSI
welfvh · 2025-02-28T10:34:54.767Z · comments (4)
Exploring unfaithful/deceptive CoT in reasoning models
Lucy Wingard (lucy-wingard) · 2025-02-28T02:54:43.481Z · comments (0)
Few concepts mixing dark fantasy and science fiction
Marek Zegarek (marek-zegarek) · 2025-02-28T21:03:35.307Z · comments (0)
[link] Do clients need years of therapy, or can one conversation resolve the issue?
Chipmonk · 2025-02-28T00:06:29.276Z · comments (10)