LessWrong 2.0 Reader


next page (older posts) →

On Emergent Misalignment
Zvi · 2025-02-28T13:10:05.973Z · comments (2)
Weirdness Points
lsusr · 2025-02-28T02:23:56.508Z · comments (7)
TamperSec is hiring for 3 Key Roles!
Jonathan_H (JonathanH) · 2025-02-28T23:10:31.540Z · comments (0)
[link] Estimating the Probability of Sampling a Trained Neural Network at Random
Adam Scherlis (adam-scherlis) · 2025-03-01T02:11:56.313Z · comments (2)
AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
DanielFilan · 2025-03-01T01:20:04.778Z · comments (0)
January-February 2025 Progress in Guaranteed Safe AI
Quinn (quinn-dougherty) · 2025-02-28T03:10:01.909Z · comments (1)
Markdown Object Notation
bhauth · 2025-02-28T19:24:25.422Z · comments (0)
Dance Weekend Pay II
jefftk (jkaufman) · 2025-02-28T15:10:02.030Z · comments (0)
Cycles (a short story by Claude 3.7 and me)
Knight Lee (Max Lee) · 2025-02-28T07:04:46.602Z · comments (0)
[link] Do safety-relevant LLM steering vectors optimized on a single example generalize?
Jacob Dunefsky (jacob-dunefsky) · 2025-02-28T12:01:12.514Z · comments (0)
The Theoretical Reward Learning Research Agenda: Introduction and Motivation
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:20:30.168Z · comments (1)
[link] An Open Letter To EA and AI Safety On Decelerating AI Development
kenneth_diao · 2025-02-28T17:21:42.826Z · comments (0)
Defining and Characterising Reward Hacking
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:25:42.777Z · comments (0)
Partial Identifiability in Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:23:30.738Z · comments (0)
Misspecification in Inverse Reinforcement Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:49.204Z · comments (0)
STARC: A General Framework For Quantifying Differences Between Reward Functions
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:52.965Z · comments (0)
Misspecification in Inverse Reinforcement Learning - Part II
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:24:59.570Z · comments (0)
Other Papers About the Theory of Reward Learning
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:26:11.490Z · comments (0)
How to Contribute to Theoretical Reward Learning Research
Joar Skalse (Logical_Lunatic) · 2025-02-28T19:27:52.552Z · comments (0)
Do we want alignment faking?
Florian_Dietz · 2025-02-28T21:50:48.891Z · comments (0)
Existentialists and Trolleys
David Gross (David_Gross) · 2025-02-28T14:01:49.509Z · comments (2)
Latent Space Collapse? Understanding the Effects of Narrow Fine-Tuning on LLMs
tenseisoham · 2025-02-28T20:22:17.721Z · comments (0)
[question] What nation did Trump prevent from going to war (Feb. 2025)?
James Camacho (james-camacho) · 2025-03-01T01:46:58.929Z · answers+comments (0)
[link] Tetherware #2: What every human should know about our most likely AI future
Jáchym Fibír · 2025-02-28T11:12:59.033Z · comments (0)
Notes on Superwisdom & Moral RSI
welfvh · 2025-02-28T10:34:54.767Z · comments (4)
Exploring unfaithful/deceptive CoT in reasoning models
Lucy Wingard (lucy-wingard) · 2025-02-28T02:54:43.481Z · comments (0)
Few concepts mixing dark fantasy and science fiction
Marek Zegarek (marek-zegarek) · 2025-02-28T21:03:35.307Z · comments (0)
[link] Do clients need years of therapy, or can one conversation resolve the issue?
Chipmonk · 2025-02-28T00:06:29.276Z · comments (10)