LessWrong 2.0 Reader

[link] Self-fulfilling misalignment data might be poisoning our AI models
TurnTrout · 2025-03-02T19:51:14.775Z · comments (4)

Maintaining Alignment during RSI as a Feedback Control Problem
beren · 2025-03-02T00:21:43.432Z · comments (4)

Open problems in emergent misalignment
Jan Betley (jan-betley) · 2025-03-01T09:47:58.889Z · comments (3)

Statistical Challenges with Making Super IQ babies
Jan Christian Refsgaard (jan-christian-refsgaard) · 2025-03-02T20:26:22.103Z · comments (2)

[question] Will LLM agents become the first takeover-capable AGIs?
Seth Herd · 2025-03-02T17:15:37.056Z · answers+comments (6)

[link] Estimating the Probability of Sampling a Trained Neural Network at Random
Adam Scherlis (adam-scherlis) · 2025-03-01T02:11:56.313Z · comments (5)

Cautions about LLMs in Human Cognitive Loops
Alice Blair (Diatom) · 2025-03-02T19:53:10.253Z · comments (3)

[question] Share AI Safety Ideas: Both Crazy and Not
ank · 2025-03-01T19:08:25.605Z · answers+comments (22)

Open Thread Spring 2025
Ben Pace (Benito) · 2025-03-02T02:33:16.307Z · comments (1)

[link] Historiographical Compressions: Renaissance as An Example
adamShimi · 2025-03-01T18:21:42.586Z · comments (2)

AXRP Episode 38.8 - David Duvenaud on Sabotage Evaluations and the Post-AGI Future
DanielFilan · 2025-03-01T01:20:04.778Z · comments (0)

Saving Zest
jefftk (jkaufman) · 2025-03-02T12:00:41.732Z · comments (1)

Spencer Greenberg hiring a personal/professional/research remote assistant for 5-10 hours per week
spencerg · 2025-03-02T18:01:32.880Z · comments (0)

Real-Time Gigstats
jefftk (jkaufman) · 2025-03-01T14:10:41.060Z · comments (0)

[question] Request for Comments on AI-related Prediction Market Ideas
PeterMcCluskey · 2025-03-02T20:52:41.114Z · answers+comments (0)

Not-yet-falsifiable beliefs?
Benjamin Hendricks (benjamin-hendricks) · 2025-03-02T14:11:07.121Z · comments (4)

[question] Examples of self-fulfilling prophecies in AI alignment?
Chipmonk · 2025-03-03T02:45:51.619Z · answers+comments (3)

[question] help, my self image as rational is affecting my ability to empathize with others
KvmanThinking (avery-liu) · 2025-03-02T02:06:36.376Z · answers+comments (8)

[question] What nation did Trump prevent from going to war (Feb. 2025)?
James Camacho (james-camacho) · 2025-03-01T01:46:58.929Z · answers+comments (3)

[link] AI Safety Policy Won't Go On Like This – AI Safety Advocacy Is Failing Because Nobody Cares.
henophilia · 2025-03-01T20:15:16.645Z · comments (0)

Positional kernels of attention heads
Alex Gibson · 2025-03-03T01:40:13.014Z · comments (0)

Meaning Machines
appromoximate (antediluvian) · 2025-03-01T19:16:08.539Z · comments (0)

Recent comments

wilson-wu on Alexander Gietelink Oldenziel's Shortform

"Utter elitism" is a nice article about this phenomenon

habryka4 on Statistical Challenges with Making Super IQ babies

I greatly appreciate this kind of critique, thank you!

My guess is this is too big of an ask, and I am already grateful for your post, but do you have a prediction about how much of the variance would turn out to be causal in the relevant way?

My current best guess is we are going to be seeing some of these technologies used in the animal breeding space relatively soon (within a few years), and so early predictions seem helpful for validating models, and might also just help people understand how much you currently think the post overestimates the impact of edits.

davey-morse on Davey Morse's Shortform

if we get self-interested superintelligence, let's make sure it has a buddhist sense of self, not a western one.

saidachmiz on Cautions about LLMs in Human Cognitive Loops

"I’d bet that I’m still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish."

With respect, I suggest to you that this sort of thinking is a failure of security mindset. (However, I am content to leave the matter un-argued at this time.)

"… if you’re going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try and eliminate second order effects and never talk to people who talk to LLMs, for they too might be meaningfully harmful e.g. be under the influence of particularly powerful LLM-generated memes."

Yes… this is true in a personal-protection sense, I agree. And I do already try to stay away from people who talk to LLMs a lot, or who don’t seem to be showing any caution about it, or who don’t take concerns like this seriously, etc. (I have never needed any special reason to avoid Twitter, but if one does—well, here’s yet another reason for the list.)

However, taking a pure personal-protection stance on this matter does not seem to me to be sensible even from a selfish perspective. It seems to me that there is no choice but to try to convince others, insofar as it is possible to do this in accordance with my principles. In other words, if I take on some second-order effect risk, but in exchange I get some chance of several other people considering what I say and deciding to do as I am doing, then this seems to me to be a positive trade-off—especially since, if one takes the danger seriously, it is hard to avoid the conclusion that choosing to say nothing results in a bad end, regardless of how paranoid one has been.

gunnar_zarncke on Alexander Gietelink Oldenziel's Shortform

If the same pattern of other innovations holds, then it seems more likely that it is more a question of concentration in one place (for software: Silicon Valley). Maybe one French university, maybe the École Normale Supérieure, managed to become the leading place for math and now every mathematician tries to go there.

alice-blair on Cautions about LLMs in Human Cognitive Loops

I agree that this is a notable point in the space of options. I didn't include it, and instead included the bunker line because if you're going to be that paranoid about LLM interference (as is very reasonable to do), it makes sense to try and eliminate second order effects and never talk to people who talk to LLMs, for they too might be meaningfully harmful e.g. be under the influence of particularly powerful LLM-generated memes.

I also separately disagree that LLM isolation is the optimal path at the moment. In the future it likely will be. I'd bet that I'm still on the side where I can safely navigate and pick up the utility, and I median-expect to be for the next couple months ish. At GPT-5ish level I get suspicious and uncomfortable, and beyond that exponentially more so.

gunnar_zarncke on Gunnar_Zarncke's Shortform

Is anybody aware of any updates on Logical Induction, published in 2016? I would expect implementations in Lean by now.

chris_leong on Alexander Gietelink Oldenziel's Shortform

I wonder about the extent to which having an additional level of selection helps.

High school curricula are generally limited by having to be able to be taught by a large number of teachers all around the country and by needing a minimum number of students at the school who are capable of the content.

If the préparatoires can put more qualified teachers and students together, that would allow significant development. Running selection for elite universities after such an intermediate preparatory program would also reduce the chance that talented students are missed due to having attended a high school that is weaker at maths (even though it sounds like the préparatoires have a selection bar too, I assume it's quite a bit lower than performing well enough to get into a top institution).

chris_leong on Share AI Safety Ideas: Both Crazy and Not

Here's a short-form with my Wise AI advisors research direction: https://www.lesswrong.com/posts/SbAofYCgKkaXReDy4/chris_leong-s-shortform?view=postCommentsNew&postId=SbAofYCgKkaXReDy4&commentId=Zcg9idTyY5rKMtYwo

(I already posted this on the Less Wrong post).

chris_leong on The Theoretical Reward Learning Research Agenda: Introduction and Motivation

I was taking it as "solves" or "gets pretty close to solving". Maybe that's a misinterpretation on my part. What did you mean here?