LessWrong 2.0 Reader

[link] Towards Monosemanticity: Decomposing Language Models With Dictionary Learning
Zac Hatfield-Dodds (zac-hatfield-dodds) · 2023-10-05T21:01:39.767Z · comments (21)
Book Review: Going Infinite
Zvi · 2023-10-24T15:00:02.251Z · comments (109)
Alignment Implications of LLM Successes: a Debate in One Act
Zack_M_Davis · 2023-10-21T15:22:23.053Z · comments (50)
Announcing MIRI’s new CEO and leadership team
Gretta Duleba (gretta-duleba) · 2023-10-10T19:22:11.821Z · comments (52)
Thoughts on responsible scaling policies and regulation
paulfchristiano · 2023-10-24T22:21:18.341Z · comments (33)
We're Not Ready: thoughts on "pausing" and responsible scaling policies
HoldenKarnofsky · 2023-10-27T15:19:33.757Z · comments (33)
Labs should be explicit about why they are building AGI
peterbarnett · 2023-10-17T21:09:20.711Z · comments (16)
Announcing Timaeus
Jesse Hoogland (jhoogland) · 2023-10-22T11:59:03.938Z · comments (15)
AI as a science, and three obstacles to alignment strategies
So8res · 2023-10-25T21:00:16.003Z · comments (79)
Architects of Our Own Demise: We Should Stop Developing AI
Roko · 2023-10-26T00:36:05.126Z · comments (74)
[link] President Biden Issues Executive Order on Safe, Secure, and Trustworthy Artificial Intelligence
Tristan Williams (tristan-williams) · 2023-10-30T11:15:38.422Z · comments (39)
Thomas Kwa's MIRI research experience
Thomas Kwa (thomas-kwa) · 2023-10-02T16:42:37.886Z · comments (52)
RSPs are pauses done right
evhub · 2023-10-14T04:06:02.709Z · comments (70)
Evaluating the historical value misspecification argument
Matthew Barnett (matthew-barnett) · 2023-10-05T18:34:15.695Z · comments (140)
Holly Elmore and Rob Miles dialogue on AI Safety Advocacy
jacobjacob · 2023-10-20T21:04:32.645Z · comments (30)
Announcing Dialogues
Ben Pace (Benito) · 2023-10-07T02:57:39.005Z · comments (51)
LoRA Fine-tuning Efficiently Undoes Safety Training from Llama 2-Chat 70B
Simon Lermen (dalasnoin) · 2023-10-12T19:58:02.119Z · comments (29)
[link] Will no one rid me of this turbulent pest?
Metacelsus · 2023-10-14T15:27:21.497Z · comments (23)
[link] Comp Sci in 2027 (Short story by Eliezer Yudkowsky)
sudo · 2023-10-29T23:09:56.730Z · comments (22)
Comparing Anthropic's Dictionary Learning to Ours
Robert_AIZI · 2023-10-07T23:30:32.402Z · comments (8)
At 87, Pearl is still able to change his mind
rotatingpaguro · 2023-10-18T04:46:29.339Z · comments (15)
Response to Quintin Pope's Evolution Provides No Evidence For the Sharp Left Turn
Zvi · 2023-10-05T11:39:02.393Z · comments (29)
Graphical tensor notation for interpretability
Jordan Taylor (Nadroj) · 2023-10-04T08:04:33.341Z · comments (11)
Don't Dismiss Simple Alignment Approaches
Chris_Leong · 2023-10-07T00:35:26.789Z · comments (9)
The 99% principle for personal problems
Kaj_Sotala · 2023-10-02T08:20:07.379Z · comments (20)
Goodhart's Law in Reinforcement Learning
jacek (jacek-karwowski) · 2023-10-16T00:54:11.669Z · comments (22)
Stampy's AI Safety Info soft launch
steven0461 · 2023-10-05T22:13:04.632Z · comments (9)
Revealing Intentionality In Language Models Through AdaVAE Guided Sampling
jdp · 2023-10-20T07:32:28.749Z · comments (14)
unRLHF - Efficiently undoing LLM safeguards
Pranav Gade (pranav-gade) · 2023-10-12T19:58:08.811Z · comments (15)
I Would Have Solved Alignment, But I Was Worried That Would Advance Timelines
307th · 2023-10-20T16:37:46.541Z · comments (32)
[link] Responsible Scaling Policies Are Risk Management Done Wrong
simeon_c (WayZ) · 2023-10-25T23:46:34.247Z · comments (33)
[link] A new intro to Quantum Physics, with the math fixed
titotal (lombertini) · 2023-10-29T15:11:27.168Z · comments (22)
[link] The Witching Hour
Richard_Ngo (ricraz) · 2023-10-10T00:19:37.786Z · comments (0)
Apply for MATS Winter 2023-24!
Rocket (utilistrutil) · 2023-10-21T02:27:34.350Z · comments (6)
Charbel-Raphaël and Lucius discuss Interpretability
Mateusz Bagiński (mateusz-baginski) · 2023-10-30T05:50:34.589Z · comments (7)
TOMORROW: the largest AI Safety protest ever!
Holly_Elmore · 2023-10-20T18:15:18.276Z · comments (25)
Programmatic backdoors: DNNs can use SGD to run arbitrary stateful computation
Fabien Roger (Fabien) · 2023-10-23T16:37:45.611Z · comments (3)
What's up with "Responsible Scaling Policies"?
habryka (habryka4) · 2023-10-29T04:17:07.839Z · comments (8)
Truthseeking when your disagreements lie in moral philosophy
Elizabeth (pktechgirl) · 2023-10-10T00:00:04.130Z · comments (4)
What's Hard About The Shutdown Problem
johnswentworth · 2023-10-20T21:13:27.624Z · comments (31)
I don’t find the lie detection results that surprising (by an author of the paper)
JanB (JanBrauner) · 2023-10-04T17:10:51.262Z · comments (8)
[question] Lying to chess players for alignment
Zane · 2023-10-25T17:47:15.033Z · answers+comments (54)
Value systematization: how values become coherent (and misaligned)
Richard_Ngo (ricraz) · 2023-10-27T19:06:26.928Z · comments (47)
Symbol/Referent Confusions in Language Model Alignment Experiments
johnswentworth · 2023-10-26T19:49:00.718Z · comments (44)
Trying to understand John Wentworth's research agenda
johnswentworth · 2023-10-20T00:05:40.929Z · comments (11)
[link] Linkpost: They Studied Dishonesty. Was Their Work a Lie?
Linch · 2023-10-02T08:10:51.857Z · comments (12)
Open Source Replication & Commentary on Anthropic's Dictionary Learning Paper
Neel Nanda (neel-nanda-1) · 2023-10-23T22:38:33.951Z · comments (12)
[link] Linkpost: A Post Mortem on the Gino Case
Linch · 2023-10-24T06:50:42.896Z · comments (7)
[link] Techno-humanism is techno-optimism for the 21st century
Richard_Ngo (ricraz) · 2023-10-27T18:37:39.776Z · comments (5)
Improving the Welfare of AIs: A Nearcasted Proposal
ryan_greenblatt · 2023-10-30T14:51:35.901Z · comments (5)