LessWrong 2.0 Reader

[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty
tandem · 2025-01-07T19:11:21.238Z · comments (5)
Anomalous Tokens in DeepSeek-V3 and r1
henry (henry-bass) · 2025-01-25T22:55:41.232Z · comments (2)
Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (47)
Catastrophe through Chaos
Marius Hobbhahn (marius-hobbhahn) · 2025-01-31T14:19:08.399Z · comments (14)
[link] Training on Documents About Reward Hacking Induces Reward Hacking
evhub · 2025-01-21T21:32:24.691Z · comments (13)
[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (80)
Tell me about yourself: LLMs are aware of their learned behaviors
Martín Soto (martinsq) · 2025-01-22T00:47:15.023Z · comments (5)
Planning for Extreme AI Risks
joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z · comments (3)
Building AI Research Fleets
Ben Goldhaber (bgold) · 2025-01-12T18:23:09.682Z · comments (11)
[link] Parkinson's Law and the Ideology of Statistics
Benquo · 2025-01-04T15:49:21.247Z · comments (6)
Ten people on the inside
Buck · 2025-01-28T16:41:22.990Z · comments (25)
[link] Gradual Disempowerment: Systemic Existential Risks from Incremental AI Development
Jan_Kulveit · 2025-01-30T17:03:45.545Z · comments (34)
"The Solomonoff Prior is Malign" is a special case of a simpler argument
David Matolcsi (matolcsid) · 2024-11-17T21:32:34.711Z · comments (44)
[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)
[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (7)
Human takeover might be worse than AI takeover
Tom Davidson (tom-davidson-1) · 2025-01-10T16:53:27.043Z · comments (54)
Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (32)
2024 in AI predictions
jessicata (jessica.liu.taylor) · 2025-01-01T20:29:49.132Z · comments (3)
The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)
The o1 System Card Is Not About o1
Zvi · 2024-12-13T20:30:08.048Z · comments (5)
Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
Thomas Kwa (thomas-kwa) · 2024-11-06T23:01:48.992Z · comments (35)
The Plan - 2024 Update
johnswentworth · 2024-12-31T13:29:53.888Z · comments (27)
You should consider applying to PhDs (soon!)
bilalchughtai (beelal) · 2024-11-29T20:33:12.462Z · comments (19)
AIs Will Increasingly Attempt Shenanigans
Zvi · 2024-12-16T15:20:05.652Z · comments (2)
DeepSeek beats o1-preview on math, ties on coding; will release weights
Zach Stein-Perlman · 2024-11-20T23:50:26.597Z · comments (26)
Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)
Hierarchical Agency: A Missing Piece in AI Alignment
Jan_Kulveit · 2024-11-27T05:49:04.241Z · comments (20)
[link] The Failed Strategy of Artificial Intelligence Doomers
Ben Pace (Benito) · 2025-01-31T18:56:06.784Z · comments (66)
Why I'm Moving from Mechanistic to Prosaic Interpretability
Daniel Tan (dtch1997) · 2024-12-30T06:35:43.417Z · comments (34)
Sorry for the downtime, looks like we got DDosd
habryka (habryka4) · 2024-12-02T04:14:30.209Z · comments (13)
The Big Nonprofits Post
Zvi · 2024-11-29T16:10:06.938Z · comments (10)
[link] Announcing turntrout.com, my new digital home
TurnTrout · 2024-11-17T17:42:08.164Z · comments (24)
[link] Aristocracy and Hostage Capital
Arjun Panickssery (arjun-panickssery) · 2025-01-08T19:38:47.104Z · comments (7)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
Gradual Disempowerment, Shell Games and Flinches
Jan_Kulveit · 2025-02-02T14:47:53.404Z · comments (20)
Thread for Sense-Making on Recent Murders and How to Sanely Respond
Ben Pace (Benito) · 2025-01-31T03:45:48.201Z · comments (42)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)
[link] The Intelligence Curse
lukedrago · 2025-01-03T19:07:43.493Z · comments (27)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
My supervillain origin story
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-27T12:20:46.101Z · comments (0)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)
2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (49)
My AGI safety research—2024 review, ’25 plans
Steven Byrnes (steve2152) · 2024-12-31T21:05:19.037Z · comments (4)
The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (24)
[link] Attribution-based parameter decomposition
Lucius Bushnaq (Lblack) · 2025-01-25T13:12:11.031Z · comments (12)
MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)
How do you deal w/ Super Stimuli?
Logan Riggs (elriggs) · 2025-01-14T15:14:51.552Z · comments (25)
The nihilism of NeurIPS
charlieoneill (kingchucky211) · 2024-12-20T23:58:11.858Z · comments (7)
The purposeful drunkard
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-12T12:27:51.952Z · comments (13)
Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (13)