LessWrong 2.0 Reader

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser
habryka (habryka4) · 2024-11-30T02:55:16.077Z · comments (263)
Alignment Faking in Large Language Models
ryan_greenblatt · 2024-12-18T17:19:06.665Z · comments (68)
What’s the short timeline plan?
Marius Hobbhahn (marius-hobbhahn) · 2025-01-02T14:59:20.026Z · comments (49)
[link] Biological risk from the mirror world
jasoncrawford · 2024-12-12T19:07:06.305Z · comments (36)
[link] Review: Planecrash
L Rudolf L (LRudL) · 2024-12-27T14:18:33.611Z · comments (42)
What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (27)
The Field of AI Alignment: A Postmortem, and What To Do About It
johnswentworth · 2024-12-26T18:48:07.614Z · comments (155)
The Case Against AI Control Research
johnswentworth · 2025-01-21T16:03:10.143Z · comments (69)
[link] By default, capital will matter more than ever after AGI
L Rudolf L (LRudL) · 2024-12-28T17:52:58.358Z · comments (99)
Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (48)
You are not too "irrational" to know your preferences.
DaystarEld · 2024-11-26T15:01:42.996Z · comments (50)
[link] Understanding Shapley Values with Venn Diagrams
Carson L · 2024-12-06T21:56:43.960Z · comments (34)
Frontier Models are Capable of In-context Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-12-05T22:11:17.320Z · comments (23)
Communications in Hard Mode (My new job at MIRI)
tanagrabeast · 2024-12-13T20:13:44.825Z · comments (25)
Shallow review of technical AI safety, 2024
technicalities · 2024-12-29T12:01:14.724Z · comments (34)
A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (12)
Mechanisms too simple for humans to design
Malmesbury (Elmer of Malmesbury) · 2025-01-22T16:54:37.601Z · comments (40)
Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals
johnswentworth · 2025-01-24T20:20:28.881Z · comments (51)
[link] The Gentle Romance
Richard_Ngo (ricraz) · 2025-01-19T18:29:18.469Z · comments (23)
[link] When Is Insurance Worth It?
kqr · 2024-12-19T19:07:32.573Z · comments (68)
[link] o1: A Technical Primer
Jesse Hoogland (jhoogland) · 2024-12-09T19:09:12.413Z · comments (17)
[link] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud · 2024-12-06T22:19:26.717Z · comments (12)
What Is The Alignment Problem?
johnswentworth · 2025-01-16T01:20:16.826Z · comments (37)
o3
Zach Stein-Perlman · 2024-12-20T18:30:29.448Z · comments (156)
How will we update about scheming?
ryan_greenblatt · 2025-01-06T20:21:52.281Z · comments (19)
OpenAI #10: Reflections
Zvi · 2025-01-07T17:00:07.348Z · comments (7)
[link] Quotes from the Stargate press conference
Nikola Jurkovic (nikolaisalreadytaken) · 2025-01-22T00:50:14.793Z · comments (7)
Maximizing Communication, not Traffic
jefftk (jkaufman) · 2025-01-05T13:00:02.280Z · comments (10)
Subskills of "Listening to Wisdom"
Raemon · 2024-12-09T03:01:18.706Z · comments (29)
Activation space interpretability may be doomed
bilalchughtai (beelal) · 2025-01-08T12:49:38.421Z · comments (28)
Repeal the Jones Act of 1920
Zvi · 2024-11-27T15:00:06.801Z · comments (23)
“Sharp Left Turn” discourse: An opinionated review
Steven Byrnes (steve2152) · 2025-01-28T18:47:04.395Z · comments (7)
“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)
Capital Ownership Will Not Prevent Human Disempowerment
beren · 2025-01-05T06:00:23.095Z · comments (18)
What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (15)
Passages I Highlighted in The Letters of J.R.R. Tolkien
Ivan Vendrov (ivan-vendrov) · 2024-11-25T01:47:59.071Z · comments (33)
Don’t ignore bad vibes you get from people
Kaj_Sotala · 2025-01-18T09:20:17.397Z · comments (50)
[link] China Hawks are Manufacturing an AI Arms Race
garrison · 2024-11-20T18:17:51.958Z · comments (43)
Applying traditional economic thinking to AGI: a trilemma
Steven Byrnes (steve2152) · 2025-01-13T01:23:00.397Z · comments (32)
What Indicators Should We Watch to Disambiguate AGI Timelines?
snewman · 2025-01-06T19:57:43.398Z · comments (55)
AI companies are unlikely to make high-assurance safety cases if timelines are short
ryan_greenblatt · 2025-01-23T18:41:40.546Z · comments (4)
[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty
tandem · 2025-01-07T19:11:21.238Z · comments (5)
Anomalous Tokens in DeepSeek-V3 and r1
henry (henry-bass) · 2025-01-25T22:55:41.232Z · comments (2)
[link] Training on Documents About Reward Hacking Induces Reward Hacking
evhub · 2025-01-21T21:32:24.691Z · comments (13)
[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (80)
Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (47)
Tell me about yourself: LLMs are aware of their learned behaviors
Martín Soto (martinsq) · 2025-01-22T00:47:15.023Z · comments (5)
[link] Parkinson's Law and the Ideology of Statistics
Benquo · 2025-01-04T15:49:21.247Z · comments (6)
Building AI Research Fleets
Ben Goldhaber (bgold) · 2025-01-12T18:23:09.682Z · comments (11)
[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)
next page (older posts) →