LessWrong 2.0 Reader

A Rocket–Interpretability Analogy
plex (ete) · 2024-10-21T13:55:18.184Z · comments (31)

OpenAI o1
Zach Stein-Perlman · 2024-09-12T17:30:31.958Z · comments (41)

[link] Arithmetic is an underrated world-modeling technology
dynomight · 2024-10-17T14:00:22.475Z · comments (32)

[link] o1: A Technical Primer
Jesse Hoogland (jhoogland) · 2024-12-09T19:09:12.413Z · comments (17)

Repeal the Jones Act of 1920
Zvi · 2024-11-27T15:00:06.801Z · comments (23)

[link] Stanislav Petrov Quarterly Performance Review
Ricki Heicklen (bayesshammai) · 2024-09-26T21:20:11.646Z · comments (3)

Momentum of Light in Glass
Ben (ben-lang) · 2024-10-09T20:19:42.088Z · comments (44)

“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)

[Completed] The 2024 Petrov Day Scenario
Ben Pace (Benito) · 2024-09-26T08:08:32.495Z · comments (114)

Subskills of "Listening to Wisdom"
Raemon · 2024-12-09T03:01:18.706Z · comments (16)

[link] China Hawks are Manufacturing an AI Arms Race
garrison · 2024-11-20T18:17:51.958Z · comments (42)

[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (79)

Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq (Lblack) · 2024-10-14T13:06:14.596Z · comments (8)

[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)

"The Solomonoff Prior is Malign" is a special case of a simpler argument
David Matolcsi (matolcsid) · 2024-11-17T21:32:34.711Z · comments (44)

BIG-Bench Canary Contamination in GPT-4
Jozdien · 2024-10-22T15:40:48.166Z · comments (13)

Why Don't We Just... Shoggoth+Face+Paraphraser?
Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-19T20:53:52.084Z · comments (51)

What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (12)

A bird's eye view of ARC's research
Jacob_Hilton · 2024-10-23T15:50:06.123Z · comments (12)

[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (7)

[link] Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded
garrison · 2024-10-23T23:40:57.180Z · comments (1)

Passages I Highlighted in The Letters of J.R.R.Tolkien
Ivan Vendrov (ivan-vendrov) · 2024-11-25T01:47:59.071Z · comments (10)

The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)

Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (31)

The o1 System Card Is Not About o1
Zvi · 2024-12-13T20:30:08.048Z · comments (5)

[link] My Number 1 Epistemology Book Recommendation: Inventing Temperature
adamShimi · 2024-09-08T14:30:40.456Z · comments (18)

Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
Thomas Kwa (thomas-kwa) · 2024-11-06T23:01:48.992Z · comments (35)

Why I funded PIBBSS
Ryan Kidd (ryankidd44) · 2024-09-15T19:56:33.018Z · comments (21)

Hire (or become) a Thinking Assistant / Body Double
Raemon · 2024-12-23T03:58:42.061Z · comments (29)

You should consider applying to PhDs (soon!)
bilalchughtai (beelal) · 2024-11-29T20:33:12.462Z · comments (19)

DeepSeek beats o1-preview on math, ties on coding; will release weights
Zach Stein-Perlman · 2024-11-20T23:50:26.597Z · comments (26)

Sorry for the downtime, looks like we got DDosd
habryka (habryka4) · 2024-12-02T04:14:30.209Z · comments (13)

The Big Nonprofits Post
Zvi · 2024-11-29T16:10:06.938Z · comments (10)

Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)

[link] Announcing turntrout.com, my new digital home
TurnTrout · 2024-11-17T17:42:08.164Z · comments (24)

AIs Will Increasingly Attempt Shenanigans
Zvi · 2024-12-16T15:20:05.652Z · comments (2)

Hierarchical Agency: A Missing Piece in AI Alignment
Jan_Kulveit · 2024-11-27T05:49:04.241Z · comments (20)

I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (20)

A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)

Refactoring cryonics as structural brain preservation
Andy_McKenzie · 2024-09-11T18:36:30.285Z · comments (14)

Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (8)

LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)

Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (15)

[link] Advice for journalists
Nathan Young · 2024-10-07T16:46:40.929Z · comments (53)

[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (1)

Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)

MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)

Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (13)

You can, in fact, bamboozle an unaligned AI into sparing your life
David Matolcsi (matolcsid) · 2024-09-29T16:59:43.942Z · comments (171)

[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)

Recent comments

sharmake-farah on The Field of AI Alignment: A Postmortem, and What To Do About It

I actually disagree with the natural abstractions research being ungrounded. Indeed, I think there is reason to believe that at least some of the natural abstractions work, especially the natural abstraction hypothesis, actually sort of holds true for today's AI, and thus is the most likely of the theoretical/agent-foundations approaches to work (I'm usually critical of agent foundations, but John Wentworth's work is an exception that I'd like funding for):

For example, this post runs an experiment showing that the Platonic Representation Hypothesis still holds on OOD data, meaning that it's likely that deeper factors are at play than just shallow similarity:

https://www.lesswrong.com/posts/Su2pg7iwBM55yjQdt/exploring-the-platonic-representation-hypothesis-beyond-in

t3t on Corrigibility's Desirability is Timing-Sensitive

I mean, yes, but I'm addressing a confusion that's already (mostly) conditioning on building it.

sharmake-farah on The Field of AI Alignment: A Postmortem, and What To Do About It

One particular way this issue could be ameliorated is by encouraging people to write up null/negative results. One part of your model here is that a null result doesn't get reported, so other people don't hear about the failure, while they do hear about success stories. That creates a selection effect toward working on successful programs, and no one hears about the failures to tackle the problem, which is bad for research culture. Negative results not being shown is a universal problem across fields.

faul_sname on The Field of AI Alignment: A Postmortem, and What To Do About It

Driver: My map doesn't show any cliffs

Passenger 1: Have you turned on the terrain map? Mine shows a sharp turn next to a steep drop coming up in about a mile

Passenger 5: Guys maybe we should look out the windshield instead of down at our maps?

Driver: No, passenger 1, see on your map that's an alternate route, the route we're on doesn't show any cliffs.

Passenger 1: You don't have it set to show terrain.

Passenger 6: I'm on the phone with the governor now, we're talking about what it would take to set a 5 mile per hour national speed limit.

Passenger 7: Don't you live in a different state?

Passenger 5: The road seems to be going up into the mountains, though all the curves I can see from here are gentle and smooth.

Driver and all passengers in unison: Shut up passenger 5, we're trying to figure out if we're going to fall off a cliff here, and if so what we should do about it.

Passenger 7: Anyway, I think what we really need to do to ensure our safety is to outlaw automobiles entirely.

Passenger 3: The highest point on Earth is 8849m above sea level, and the lowest point is 430 meters below sea level, so the cliff in front of us could be as high as 9279m.

jbash on Corrigibility's Desirability is Timing-Sensitive

Both seem well addressed by not building the thing "until you have a good plan for developing an actually aligned superintelligence".

Of course, somebody else still will, but you adding to the number of potentially catastrophic programs doesn't seem to improve that.

thane-ruthenis on johnswentworth's Shortform

That's mostly my experience as well: experiments are near-trivial to set up, and setting up any experiment that isn't near-trivial to set up is a poor use of the time that can instead be spent thinking on the topic a bit more and realizing what the experimental outcome would be or why this would be entirely the wrong experiment to run.

But the friction costs of setting up an experiment aren't zero. If it were possible to sort of ramble an idea at an AI and then have it competently execute the corresponding experiment (or set up a toy formal model and prove things about it), I think this would be able to speed up even deeply confused/non-paradigmatic research.

... That said, I think the sorts of experiments we do aren't the sorts of experiments ML researchers do. I expect they're often things like "do a pass over this lattice of hyperparameters and output the values that produce the best loss" (and more abstract equivalents of this that can't be as easily automated using mundane code). And which, due to the atheoretic nature of ML, can't be "solved in the abstract".

So ML research perhaps could be dramatically sped up by menial-software-labor AIs. (Though I think even now the compute needed for running all of those experiments would be the more pressing bottleneck.)

eneasz on The Deep Lore of LightHaven, with Oliver Habryka (TBC episode 228)

Really annoying that that's not available on the app! Oliver's added the transcript in the main post now, thankfully. :)

erich_grunewald on artemium's Shortform

The jury is still out, but it's currently available even in Direct Chat on Chatbot Arena, there will be more data on this soon.

Fyi, it's also available on https://chat.deepseek.com/, as is their reasoning model DeepSeek-R1-Lite-Preview ("DeepThink"). (I suggest signing up with a throwaway email and not inputting any sensitive queries.) From quickly throwing it a few requests I recently asked 3.5 Sonnet, DeepSeek-V3 seems slightly worse, but nonetheless solid.

the-gears-to-ascension on [Fiction] A Disneyland Without Children

the self referential joke thing

"mine some crypt-"

there's a contingent who would close it as soon as someone used an insult focused on intelligence, rather than on intentional behavior. to fix for that subcrowd, "idiot" becomes "fool"

those are the main ones, but then I sometimes get "tldr" responses, and even when I copy out the main civilization story section, I get "they think the authorities could be automated? that can't happen" responses, which I think would be less severe if the buildup to that showed more of them struggling to make autonomous robots work at all. Most people on the left who dislike ai think it doesn't and won't work, and any claim that it does needs to be in tune with reality about how ai currently looks, if it's going to predict that it eventually changes. the story spends a lot of time on making discovering the planet motivated and realistic, and not very much time on how they went from basic ai to replacing humans. in order for the left to accept it you'd need to make it suck but kinda work, and yet get mass deployment anyway. it would need to be in touch with the real things that have happened so far.

I imagine something similar is true for pitching this to businesspeople - they'd have to be able to see how it went from the thing they enjoy now to being catastrophic, in a believable way, that doesn't feel like invoking clarketech or relying on altmanhype.

purplehermann on The Field of AI Alignment: A Postmortem, and What To Do About It

A few thoughts.

Have you checked what happens when you throw physics postdocs at the core issues - do they actually get traction or just stare at the sheer cliff for longer while thinking? Did anything come out of the Iliad meeting half a year later? Is there a reason that more standard STEMs aren't given an intro into some of the routes currently thought possibly workable, so they can feel some traction? I think either could be true - that intelligence and skills aren't actually useful right now, the problem is not tractable, or better onboarding could let the current talent pool get traction - and either way it might not be very cost effective to get physics postdocs involved.
Humans are generally better at doing things when they have more tools available. While the 'hard bits' might be intractable now, they could well be easier to deal with in a few years after other technical and conceptual advances in AI, and even other fields. (Something something about prompt engineering and Anthropic's mechanistic interpretability from inside the field and practical quantum computing outside).

This would mean squeezing every drop of usefulness out of AI at each level of capability, to improve general understanding and to leverage it into breakthroughs in other fields before capabilities increase further. In fact, it might be best to sabotage semiconductor/chip production once the models are one gen before super-intelligence/extinction/whatever, giving maximum time to leverage maximum capabilities and tackle alignment before the AIs get too smart.

How close is mechanistic interpretability to the hard problems, and what makes it not good enough?