LessWrong 2.0 Reader

Three Months In, Evaluating Three Rationalist Cases for Trump
Arjun Panickssery (arjun-panickssery) · 2025-04-18T08:27:27.257Z · comments (20)

Training AGI in Secret would be Unsafe and Unethical
Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-18T12:27:35.795Z · comments (7)

What Makes an AI Startup "Net Positive" for Safety?
jacquesthibs (jacques-thibodeau) · 2025-04-18T20:33:22.682Z · comments (9)

Handling schemers if shutdown is not an option
Buck · 2025-04-18T14:39:18.609Z · comments (0)

Scaffolding Skills
Screwtape · 2025-04-18T17:39:25.634Z · comments (1)

o3 Will Use Its Tools For You
Zvi · 2025-04-18T21:20:02.566Z · comments (3)

[link] Inside OpenAI's Controversial Plan to Abandon its Nonprofit Roots
garrison · 2025-04-18T18:46:57.310Z · comments (0)

[question] Comprehensive up-to-date resources on the Chinese Communist Party's AI strategy, etc?
Mateusz Bagiński (mateusz-baginski) · 2025-04-18T04:58:32.037Z · answers+comments (3)

British and American Connotations
jefftk (jkaufman) · 2025-04-18T13:00:09.440Z · comments (2)

Emotional Theory for a Technical Manual on How Not to Freeze Completely
P. João (gabriel-brito) · 2025-04-19T09:12:56.615Z · comments (0)

AI Advances and Detection Strategy
jefftk (jkaufman) · 2025-04-19T11:40:07.264Z · comments (0)

[link] Conditional Forecasting as Model Parameterization
Molly (hickman-santini) · 2025-04-18T02:35:42.110Z · comments (0)

[Rockville] Rationalist Shabbat
maia · 2025-04-18T15:38:30.650Z · comments (0)

0 Motivation Mapping through Information Theory
P. João (gabriel-brito) · 2025-04-18T00:53:34.360Z · comments (0)

[link] The System Didn’t, and Doesn’t Need to be This Way ~ Thomas Paine on Economic Justice
James Stephen Brown (james-brown) · 2025-04-19T05:16:05.682Z · comments (0)

Karma Tests in Logical Counterfactual Simulations motivates strong agents to protect weak agents
Knight Lee (Max Lee) · 2025-04-18T11:11:23.239Z · comments (0)

One Night in Delphi
Eggs (donald-sampson) · 2025-04-18T02:17:04.957Z · comments (2)

The Case for White Box Control
J Rosser (j-rosser-uk) · 2025-04-18T16:10:57.823Z · comments (0)

Consequentialists should have a comprehensive set of deontological beliefs they adhere to
Jay95 · 2025-04-18T20:50:27.064Z · comments (2)

Towards Understanding the Representation of Belief State Geometry in Transformers
Karthik Viswanathan (vkarthik095) · 2025-04-18T12:39:01.251Z · comments (0)

[link] SecureDrop review
samuelshadrach (xpostah) · 2025-04-19T04:29:32.270Z · comments (0)

AI Control Methods Literature Review
Ram Potham (ram-potham) · 2025-04-18T21:15:34.682Z · comments (0)

Evaluating Collaborative AI Performance Subject to Sabotage
Matthew Khoriaty (matthew-khoriaty) · 2025-04-18T19:33:41.547Z · comments (0)

Could LLMs Learn to Detect Bias Autonomously, Like Tesla’s Self-Driving Cars?
Omnipheasant · 2025-04-18T18:45:36.242Z · comments (0)

Alignment Does Not Need to Be Opaque! An Introduction to Feature Steering with Reinforcement Learning
Jeremias Ferrao (jeremias-ferrao) · 2025-04-18T19:34:49.357Z · comments (0)

Measuring Beliefs of Language Models During Chain-of-Thought Reasoning
Baram Sosis (baram-sosis) · 2025-04-18T22:56:28.727Z · comments (0)

AI, Alignment & the Art of Relationship Design
Priyanka Bharadwaj (priyanka-bharadwaj) · 2025-04-19T00:47:02.591Z · comments (2)

I’m headed to DC this week. any tips?
Wes R · 2025-04-19T02:33:18.584Z · comments (0)

LLM-based Fact Checking for Popular Posts?
azergante · 2025-04-18T21:26:25.230Z · comments (1)

What If Galaxies Are Alive and Atoms Have Minds? A Thought Experiment on Life Across Scales
Saif Khan (saif-khan) · 2025-04-18T10:01:18.783Z · comments (4)

Recent comments

huera on Scaffolding Skills

I think that the training wheels example is wrong. A quick search suggests they hinder learning how to ride a bike.
Anyway, I have a few more examples ([actual skill] / [scaffolding skill]):

Playing chess well / reading algebraic notation
Writing blog posts / touch typing
Cooking / cutting vegetables (also other things)
Cutting vegetables / sharpening knives
QS experiments / knowing statistics
Programming / debugging
Parkour / running
Dancing / aerobic endurance (This might be stretching the concept a bit)

testingthewaters on A Dissent on Honesty

For my part, I didn't realise it became so heavily downvoted, but I did not mean it at all in an accusatory or moralizing manner. I also, upon reflection, don't regret posting it.

sergii on Sergii's Shortform

The latest short story by Greg Egan is kind of a hit piece on LW/EA/longtermism. I've really enjoyed it. "DEATH AND THE GORGON" https://asimovs.com/wp-content/uploads/2025/03/DeathGorgon_Egan.pdf

jono on AI Safety Memes Wiki

Very cool, I'm not seeing a table of contents on aisafety.info however.

vladimir_nesov on o3 Will Use Its Tools For You

For me the main update from o3 is that since it's very likely GPT-4.1 with reasoning and is at Gemini 2.5 Pro level, the latter is unlikely to be a GPT-4.5 level model with reasoning. And so we still have no idea what a GPT-4.5 level model with reasoning can do, let alone when trained to use 1M+ token reasoning traces. As Llama 4 was canceled, irreversible proliferation of the still-unknown latent capabilities is not yet imminent at that level.

joseph-miller on What Makes an AI Startup "Net Positive" for Safety?

I think almost all startups are really great! I think there really is a very small set of startups that end up harmful for the world

I think you're kind of avoiding the question. What startups are really great for AI safety?

jmiller on Training AGI in Secret would be Unsafe and Unethical

Great post, Daniel!

I would expect that a misaligned ASI of the first kind would seek to keep knowledge of its capabilities to a minimum while it accumulates power, if nothing else because doing so by definition prevents the detection and mitigation of its misalignment. Therefore, for the same reasons this post advocates for openness past a certain stage of development, the unaligned ASI of the first kind would move towards a concentration and curtailing of knowledge (i.e. it would not be the kind of AI that stops the finding and fixing of its misalignment if it allowed 10x-1000x more human brain power investigating itself).

One way to increase the likelihood of keeping itself hidden is by influencing the people that already possess knowledge of its capabilities to act toward that outcome. So even if the few original decision makers with knowledge and power are predisposed to eventual openness/benevolence, the ASI could (rather easily, I imagine) tempt them away from said policy. Moreover, it could help them mitigate, renege on, neutralize or ignore any precommitments or promises previously made in favour of openness.

poignardazur on Power Lies Trembling: a three-book review

One aspect of this I'm curious about is the role of propaganda, and especially Russian-bot-style propaganda.

Under the belief cascade model, the goal may not be to make arguments that persuade people, so much as it is to occupy the space, to create a shared reality of "Everyone who comments under this YouTube video agrees that X". That shared reality discourages people from posting contrary opinions, and creates the appearance of unanimity.

I wonder if sociologists have ever tried to test how susceptible propaganda is to cascade dynamics.

erich_grunewald on Three Months In, Evaluating Three Rationalist Cases for Trump

The view shared by Hanania in 2024 that Trump would be reined in by others seems less solid now: whereas during his first term Trump’s top economic adviser Gary Cohn allegedly twice stole major trade-related documents off of his desk, nothing like that seems to be happening now.

This is an aside, but yesterday the WSJ reported something like this happening:

On April 9, financial markets were going haywire. Treasury Secretary Scott Bessent and Commerce Secretary Howard Lutnick wanted President Trump to put a pause on his aggressive global tariff plan. But there was a big obstacle: Peter Navarro, Trump’s tariff-loving trade adviser, who was constantly hovering around the Oval Office.
Navarro isn’t one to back down during policy debates and had stridently urged Trump to keep tariffs in place, even as corporate chieftains and other advisers urged him to relent. And Navarro had been regularly around the Oval Office since Trump’s “Liberation Day” event.
So that morning, when Navarro was scheduled to meet with economic adviser Kevin Hassett in a different part of the White House, Bessent and Lutnick made their move, according to multiple people familiar with the intervention.
They rushed to the Oval Office to see Trump and propose a pause on some of the tariffs—without Navarro there to argue or push back. They knew they had a tight window. The meeting with Bessent and Lutnick wasn’t on Trump’s schedule.
The two men convinced Trump of the strategy to pause some of the tariffs and to announce it immediately to calm the markets. They stayed until Trump tapped out a Truth Social post, which surprised Navarro, according to one of the people familiar with the episode. Bessent and press secretary Karoline Leavitt almost immediately went to the cameras outside the White House to make a public announcement.

faul_sname on faul_sname's Shortform

Prediction:

We will soon see the first high-profile example of "misaligned" model behavior where a model does something neither the user nor the developer want it to do, but which instead appears to be due to scheming.
On examination, the AI's actions will not actually be a good way to accomplish that goal. Other instances of the same model will be capable of recognizing this.
The AI's actions will make a lot of sense as an extrapolation of some contextually-activated behavior which led to better average performance on some benchmark.

That is to say, the traditional story is

We use RL to train AI
AI learns to predict reward
AI decides that its goal is to maximize reward
AI reasons about what behavior will lead to maximal reward
AI does something which neither its creators nor the user want it to do, but that thing serves the AI's long term goals, or at least it thinks that's the case

My prediction here is

We use RL to train AI
AI learns to recognize what the likely loss/reward signal is for its current task
AI learns a heuristic like "if the current task seems to have a gameable reward and success seems unlikely by normal means, try to game the reward"
AI ends up in some real-world situation which it decides looks like an unwinnable task
AI decides that some random thing it just thought of is its success criterion
AI thinks of some plan which has an outside chance of working by that success criterion it just came up with
AI does some random pants-on-head stupid thing which its creators don't want, the user doesn't want, and which doesn't serve any plausible long-term goal.