LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

Understanding Agent Preferences
martinkunev · 2025-02-24T17:46:04.022Z · comments (0)
Superintelligence Alignment Proposal
Davey Morse (davey-morse) · 2025-02-03T18:47:22.287Z · comments (3)
An Introduction to Evidential Decision Theory
Babić · 2025-02-02T21:27:35.684Z · comments (2)
[link] [Crosspost] Strategic wealth accumulation under transformative AI expectations
arden446 · 2025-02-25T21:50:11.458Z · comments (0)
[link] The Stag Hunt—cultivating cooperation to reap rewards
James Stephen Brown (james-brown) · 2025-02-25T23:45:07.472Z · comments (0)
[link] Sparse Autoencoder Features for Classifications and Transferability
Shan23Chen (shan-chen) · 2025-02-18T22:14:12.994Z · comments (0)
[link] Pre-ASI: The case for an enlightened mind, capital, and AI literacy in maximizing the good life
Noahh (noah-jackson) · 2025-02-21T00:03:47.922Z · comments (5)
[link] Tetherware #1: The case for humanlike AI with free will
Jáchym Fibír · 2025-01-30T10:58:11.717Z · comments (10)
[link] Market Capitalization is Semantically Invalid
Zero Contradictions · 2025-02-27T11:27:47.765Z · comments (10)
Jevons paradox and economic intuitions
Abhimanyu Pallavi Sudhir (abhimanyu-pallavi-sudhir) · 2025-01-27T23:04:23.854Z · comments (0)
Partial Identifiability in Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:23:30.738Z · comments (0)
Misspecification in Inverse Reinforcement Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:49.204Z · comments (0)
Empirical Insights into Feature Geometry in Sparse Autoencoders
Jason Boxi Zhang (jason-boxi-zhang) · 2025-01-24T19:02:19.167Z · comments (0)
Misspecification in Inverse Reinforcement Learning - Part II
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:59.570Z · comments (0)
Defining and Characterising Reward Hacking
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:25:42.777Z · comments (0)
Other Papers About the Theory of Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:26:11.490Z · comments (0)
[link] Linguistic Imperialism in AI: Enforcing Human-Readable Chain-of-Thought
Lukas Petersson (lukas-petersson-1) · 2025-02-21T15:45:00.146Z · comments (0)
[question] Popular materials about environmental goals/agent foundations? People wanting to discuss such topics?
Q Home · 2025-01-22T03:30:38.066Z · answers+comments (0)
Safe Distillation With a Powerful Untrusted AI
Alek Westover (alek-westover) · 2025-02-20T03:14:04.893Z · comments (1)
Are current LLMs safe for psychotherapy?
PaperBike · 2025-02-12T19:16:34.452Z · comments (4)
Neuron Activations to CLIP Embeddings: Geometry of Linear Combinations in Latent Space
Roman Malov · 2025-02-03T10:30:48.866Z · comments (0)
Will AI Resilience protect Developing Nations?
ejk64 · 2025-01-21T15:31:32.378Z · comments (0)
[link] Bayesian Reasoning on Maps
Sjlver (jonas-wagner) · 2025-01-22T10:45:03.584Z · comments (0)
[question] are there 2 types of alignment?
KvmanThinking (avery-liu) · 2025-01-23T00:08:20.885Z · answers+comments (9)
How are Those AI Participants Doing Anyway?
mushroomsoup · 2025-01-24T22:37:47.999Z · comments (0)
[link] A concise definition of what it means to win
testingthewaters · 2025-01-25T06:37:37.305Z · comments (1)
[link] Understanding AI World Models w/ Chris Canal
jacobhaimes · 2025-01-27T16:32:47.724Z · comments (0)
Will LLMs supplant the field of creative writing?
Declan Molony (declan-molony) · 2025-01-28T06:42:24.799Z · comments (14)
[link] Whereby: The Zoom alternative you probably haven't heard of
Itay Dreyfus (itay-dreyfus) · 2025-01-29T13:01:08.564Z · comments (0)
Allegory of the Tsunami
Evan Hu (evan-hu) · 2025-01-29T19:09:33.761Z · comments (1)
[question] Why not train reasoning models with RLHF?
CBiddulph (caleb-biddulph) · 2025-01-30T07:58:35.742Z · answers+comments (4)
Proposal: Safeguarding Against Jailbreaking Through Iterative Multi-Turn Testing
jacquesallen · 2025-01-31T23:00:42.665Z · comments (0)
[question] How likely is an attempted coup in the United States in the next four years?
Alexander de Vries (alexander-de-vries) · 2025-02-01T13:12:04.053Z · answers+comments (2)
[link] Unlocking Ethical AI and Improving Jailbreak Defenses: Reinforcement Learning with Layered Morphology (RLLM)
MiguelDev (whitehatStoic) · 2025-02-01T19:17:32.071Z · comments (2)
Thoughts on Toy Models of Superposition
james__p · 2025-02-02T13:52:54.505Z · comments (0)
Easily Evaluate SAE-Steered Models with EleutherAI Evaluation Harness
Matthew Khoriaty (matthew-khoriaty) · 2025-01-21T02:02:35.177Z · comments (0)
Sleeper agents appear resilient to activation steering
Lucy Wingard (lucy-wingard) · 2025-02-03T19:31:30.702Z · comments (0)
When you downvote, explain why
KvmanThinking (avery-liu) · 2025-02-07T01:03:44.097Z · comments (31)
Cross-Layer Feature Alignment and Steering in Large Language Models
dlaptev · 2025-02-08T20:18:20.331Z · comments (0)
[link] How do you make a 250x better vaccine at 1/10 the cost? Develop it in India.
Abhishaike Mahajan (abhishaike-mahajan) · 2025-02-09T03:53:17.050Z · comments (5)
ML4Good Colombia - Applications Open to LatAm Participants
Alejandro Acelas (alejandro-acelas) · 2025-02-10T15:03:03.929Z · comments (0)
OpenAI’s NSFW policy: user safety, harm reduction, and AI consent
8e9 · 2025-02-13T13:59:22.911Z · comments (3)
Response to the US Govt's Request for Information Concerning Its AI Action Plan
Davey Morse (davey-morse) · 2025-02-14T06:14:08.673Z · comments (0)
Claude 3.5 Sonnet (New)'s AGI scenario
Nathan Young · 2025-02-17T18:47:04.669Z · comments (2)
[link] AISN #48: Utility Engineering and EnigmaEval
Corin Katzke (corin-katzke) · 2025-02-18T19:15:16.751Z · comments (0)
Permanent properties of things are a self-fulfilling prophecy
YanLyutnev (YanLutnev) · 2025-02-19T00:08:20.776Z · comments (0)
[link] Demonstrating specification gaming in reasoning models
Matrice Jacobine · 2025-02-20T19:26:20.563Z · comments (0)
Moral gauge theory: A speculative suggestion for AI alignment
James Diacoumis (james-diacoumis) · 2025-02-23T11:42:31.083Z · comments (2)
[link] The manifest manifesto
dkl9 · 2025-02-24T22:13:53.342Z · comments (1)
outlining is a historically recent underutilized gift to family
daijin · 2025-02-26T13:58:17.623Z · comments (2)