LessWrong 2.0 Reader

Why Should I Assume CCP AGI is Worse Than USG AGI?
Tomás B. (Bjartur Tómas) · 2025-04-19T14:47:52.167Z · comments (31)

[link] Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Adam Karvonen (karvonenadam) · 2025-04-14T17:38:02.918Z · comments (38)

AI companies are unlikely to make high-assurance safety cases if timelines are short
ryan_greenblatt · 2025-01-23T18:41:40.546Z · comments (5)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes (john-hughes) · 2025-04-08T17:32:55.315Z · comments (17)

The Most Forbidden Technique
Zvi · 2025-03-12T13:20:04.732Z · comments (9)

OpenAI #12: Battle of the Board Redux
Zvi · 2025-03-31T15:50:02.156Z · comments (1)

Ten people on the inside
Buck · 2025-01-28T16:41:22.990Z · comments (28)

Planning for Extreme AI Risks
joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z · comments (5)

[link] A computational no-coincidence principle
Eric Neyman (UnexpectedValues) · 2025-02-14T21:39:39.277Z · comments (38)

Auditing language models for hidden objectives
Sam Marks (samuel-marks) · 2025-03-13T19:18:32.638Z · comments (15)

[link] The Hidden Cost of Our Lies to AI
Nicholas Andresen (nicholas-andresen) · 2025-03-06T05:03:47.239Z · comments (17)

The Milton Friedman Model of Policy Change
JohnofCharleston · 2025-03-04T00:38:56.778Z · comments (17)

[link] The Failed Strategy of Artificial Intelligence Doomers
Ben Pace (Benito) · 2025-01-31T18:56:06.784Z · comments (78)

Anomalous Tokens in DeepSeek-V3 and r1
henry (henry-bass) · 2025-01-25T22:55:41.232Z · comments (2)

[question] How Much Are LLMs Actually Boosting Real-World Programmer Productivity?
Thane Ruthenis · 2025-03-04T16:23:39.296Z · answers+comments (51)

[link] Training on Documents About Reward Hacking Induces Reward Hacking
evhub · 2025-01-21T21:32:24.691Z · comments (15)

The Paris AI Anti-Safety Summit
Zvi · 2025-02-12T14:00:07.383Z · comments (21)

Some articles in “International Security” that I enjoyed
Buck · 2025-01-31T16:23:27.061Z · comments (10)

Tell me about yourself: LLMs are aware of their learned behaviors
Martín Soto (martinsq) · 2025-01-22T00:47:15.023Z · comments (5)

The Pando Problem: Rethinking AI Individuality
Jan_Kulveit · 2025-03-28T21:03:28.374Z · comments (13)

Gradual Disempowerment, Shell Games and Flinches
Jan_Kulveit · 2025-02-02T14:47:53.404Z · comments (36)

[question] when will LLMs become human-level bloggers?
nostalgebraist · 2025-03-09T21:10:08.837Z · answers+comments (34)

Anthropic, and taking "technical philosophy" more seriously
Raemon · 2025-03-13T01:48:54.184Z · comments (29)

AI-enabled coups: a small group could use AI to seize power
Tom Davidson (tom-davidson-1) · 2025-04-16T16:51:29.561Z · comments (16)

[link] Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Fabien Roger (Fabien) · 2025-03-11T11:52:38.994Z · comments (23)

Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt (abhatt349) · 2025-04-16T16:21:23.781Z · comments (0)

How I've run major projects
benkuhn · 2025-03-16T18:40:04.223Z · comments (10)

Learned pain as a leading cause of chronic pain
SoerenMind · 2025-04-09T11:57:58.523Z · comments (14)

[link] Research directions Open Phil wants to fund in technical AI safety
jake_mendel · 2025-02-08T01:40:00.968Z · comments (21)

Training AGI in Secret would be Unsafe and Unethical
Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-18T12:27:35.795Z · comments (8)

Do models say what they learn?
Andy Arditi (andy-arditi) · 2025-03-22T15:19:18.800Z · comments (12)

The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (30)

[link] Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
jake_mendel · 2025-02-06T18:58:53.076Z · comments (0)

The News is Never Neglected
lsusr · 2025-02-11T14:59:48.323Z · comments (18)

New Cause Area Proposal
CallumMcDougall (TheMcDouglas) · 2025-04-01T07:12:34.360Z · comments (4)

Thread for Sense-Making on Recent Murders and How to Sanely Respond
Ben Pace (Benito) · 2025-01-31T03:45:48.201Z · comments (146)

Downstream applications as validation of interpretability progress
Sam Marks (samuel-marks) · 2025-03-31T01:35:02.722Z · comments (1)

2024 Unofficial LessWrong Survey Results
Screwtape · 2025-03-14T22:29:00.045Z · comments (28)

[link] Explaining British Naval Dominance During the Age of Sail
Arjun Panickssery (arjun-panickssery) · 2025-03-28T05:47:28.561Z · comments (5)

You can just wear a suit
lsusr · 2025-02-26T14:57:57.260Z · comments (48)

[link] Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
lewis smith (lsgos) · 2025-03-26T19:07:48.710Z · comments (15)

Two hemispheres - I do not think it means what you think it means
Viliam · 2025-02-09T15:33:53.391Z · comments (21)

[link] Attribution-based parameter decomposition
Lucius Bushnaq (Lblack) · 2025-01-25T13:12:11.031Z · comments (21)

My supervillain origin story
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-27T12:20:46.101Z · comments (1)

AI 2027: Responses
Zvi · 2025-04-08T12:50:02.197Z · comments (3)

[link] Steering Gemini with BiDPO
TurnTrout · 2025-01-31T02:37:55.839Z · comments (5)

Among Us: A Sandbox for Agentic Deception
7vik (satvik-golechha) · 2025-04-05T06:24:49.000Z · comments (4)

Judgements: Merging Prediction & Evidence
abramdemski · 2025-02-23T19:35:51.488Z · comments (5)

AGI Safety & Alignment @ Google DeepMind is hiring
Rohin Shah (rohinmshah) · 2025-02-17T21:11:18.970Z · comments (19)

[link] Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-02-06T15:46:53.024Z · comments (9)


Recent comments

eva_ on A Dissent on Honesty

(Maybe this point isn’t particularly important to the main discussion. I can’t tell, honestly!)

Yeah I think it's an irrelevant tangent where we're describing the same underlying process a bit differently, not really disagreeing.

Frankly, I think that it’s not as hard as some people make it out to be, to tell when it is necessary to tell the truth and when one should instead lie. Mostly, the right answer is obvious to everyone, and the debates, such as they are, mostly boil down to people trying to justify things that they know perfectly well cannot be justified.
... the arguments most often concern whether it’s permissible to lie. Note: not, “is it obligatory to tell the truth, or is it obligatory to lie”—but “is it obligatory to tell the truth, or do I have no obligation here and can I just lie”. I think that this is very telling. And what it tells us (with imperfect but nevertheless non-trivial certainty) is that the person asking the question, or making the argument against the obligation, knows perfectly well what the real—which is to say, moral—answer is. Yes, the right thing to do is to tell the truth.

I think I disagree with this framing. In my model of the sort of person who asks that, they're sometimes selfish-but-honourable people who have noticed that telling the truth ends badly for them, and who will do it if it is an obligation but would prefer to help themselves otherwise. But they are just as often altruistic-and-honourable people who have noticed that telling the truth ends badly for everyone and are trying to convince themselves it's okay to do the thing that will actually help. There are also selfish-but-cowardly people who just care whether they'll be socially punished for lying, or selfish-and-cruel people champing at the bit to punish someone else for it, and similar, but moral arguments don't move them either way, so it doesn't matter.

More strongly, I disagree because I think a lot of people have harmed themselves or their altruistic causes by failing to correctly determine where the line is, either lying when they shouldn't or not lying when they should, and it is to the community's shame that we haven't been more help with illuminating how to tell those cases apart. If smart, hardworking people are getting it wrong so often, you can't just say the task is easy.

If you want to try to put together a complete list of such rules, that’s certainly a project, and I may even contribute to it, but there’s not much point in expecting this to be a definitively completable task. We’re fitting a curve to the data provided by our values, which cannot be losslessly compressed.

This is, in total, a fair response. I am not sure I can say that you have changed my mind without more detail, and I'm not going to take down my original post (as long as there isn't a better post to take its place) because I think it's still directionally correct, but thank you for your words.

maxwell-peterson on D&D.Sci Tax Day: Adventurers and Assessments

I trained a booster (LightGBM) and used it to look for nonlinearity in the items - basically I made one ICE plot per item. From this I discovered the following nonlinearities:
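For readers unfamiliar with ICE (individual conditional expectation) plots, here is a minimal sketch of how the curves are computed: for each row, sweep one feature over a grid while holding the others fixed, and record the model's prediction. This is not the code used for the analysis above. `toy_tax` is an invented stand-in for a fitted booster's predict function, and its rates and discount size are made up purely so the kink at 5 horns is visible.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, int]


def ice_curves(
    predict: Callable[[Row], float],
    rows: Sequence[Row],
    feature: str,
    grid: Sequence[int],
) -> List[List[float]]:
    """Return one ICE curve per row: predictions as `feature` sweeps `grid`."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            modified = dict(row)       # copy so the original row is untouched
            modified[feature] = value  # override only the feature of interest
            curve.append(predict(modified))
        curves.append(curve)
    return curves


def toy_tax(row: Row) -> float:
    """Hypothetical model: linear rates plus a flat discount at 5+ horns."""
    tax = 2.0 * row["Dragon Head"] + 1.0 * row["Unicorn Horn"]
    if row["Unicorn Horn"] >= 5:
        tax -= 3.0  # invented discount size, for illustration only
    return tax


rows = [
    {"Dragon Head": 1, "Unicorn Horn": 0},
    {"Dragon Head": 0, "Unicorn Horn": 2},
]
grid = list(range(8))
curves = ice_curves(toy_tax, rows, "Unicorn Horn", grid)

# A linear item gives straight lines; a threshold effect shows up as a
# drop between grid points 4 and 5 in every curve.
for curve in curves:
    print(curve)
```

Plotting each curve against the grid gives the ICE plot. With a real fitted estimator, scikit-learn's `PartialDependenceDisplay.from_estimator` with `kind="individual"` should produce the same kind of picture.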

Unicorns were the big thing - if you submit enough Unicorn Horns, you seem to get a discount or credit on your taxes. Perhaps they are medicinal, and there is a shortage. This happens at 5 horns, and submitting more than 5 doesn't get any further discount.

There was also some discounting going on with Cockatrice Eyes, but it was more confusing: in one view of mine, it looked like the tax was bigger at 0 of them, smaller at 1, bigger at 2, smaller at 3, etc., oscillating.

Dragon, Lich, and Zombie parts looked mostly linear though.

There are a number of tax submissions for which the assessed tax was zero. Even property as large as [1 cockatrice eye, 1 lich skull, 6 zombie hands] had a zero-tax entry. So I took the strategy of starting by copying the zero-tax historical records, where I could, for three of the adventurers. For the fourth, Dragon Heads always incur a big chunk of tax, so I gave the final adventurer all the Dragon Heads, as well as 5 Unicorn Horns and an odd number of Cockatrice Eyes, to offset them.

Then from there I poked around and tried to ride the gradient downward manually. I arrived at:

1: {2 Lich Skull, 8 Zombie Hand} [for 3 gp 6 sp = 3.6 tax]
2: {1 Cockatrice Eye, 1 Dragon Head, 1 Unicorn Horn} [0.0]
3: {1 Dragon Head, 1 Unicorn Horn} [4.2]
4: {3 Cockatrice Eye, 2 Dragon Head, 3 Lich Skull, 5 Unicorn Horn} [19.2]

For a total tax of 27 gp 0 sp.
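
As a quick check of the arithmetic above, the four assessments can be summed in whole silver pieces. This assumes 10 sp per gp, which is what the "3 gp 6 sp = 3.6" notation implies. The variable names are only illustrative.

```python
# Assessed tax per adventurer, copied from the allocation above, as (gp, sp).
assessments = [(3, 6), (0, 0), (4, 2), (19, 2)]

SP_PER_GP = 10  # assumption implied by "3 gp 6 sp = 3.6"

# Work in integer silver pieces to avoid floating-point rounding.
total_sp = sum(gp * SP_PER_GP + sp for gp, sp in assessments)
total_gp, remainder_sp = divmod(total_sp, SP_PER_GP)

print(f"{total_gp} gp {remainder_sp} sp")  # prints "27 gp 0 sp"
```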

From this poking around, I've started to feel like maybe one Unicorn Horn can cancel a Dragon Head, or something? I couldn't get a proper black-box optimization program working, so it was just my manual optimization at the end that got me from 32.0 down to 27.0. There is probably a bit of room for progress.

maxwell-peterson on D&D.Sci Tax Day: Adventurers and Assessments

(Haven't yet read what others wrote).

Cool setup! Haven't done one of these for a few years, and I enjoyed it a lot.

I did have a terrible time trying to get a black-box optimizer running - the hard constraints on the sums seemed to be mostly not a thing in optimizer packages? I'm interested in the thoughts of someone who knows more about black-box optimization like genetic algorithms, simulated annealing, or whatever, and if they think they'd be suitable for a problem like this.
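
One common way around the hard-constraint problem is to never leave the feasible set: if every move transfers a single item from one adventurer to another, the per-item totals are preserved automatically and the optimizer needs no constraint support at all. Below is a minimal simulated-annealing sketch of that idea. It is an illustration under stated assumptions, not a tested solution to this puzzle: `toy_tax` is an invented black-box objective, and `steps`, `t0`, and the linear cooling schedule are arbitrary choices.

```python
import math
import random
from typing import Callable, Dict, List, Tuple

Bundle = Dict[str, int]
Allocation = List[Bundle]


def total_tax(alloc: Allocation, tax_fn: Callable[[Bundle], float]) -> float:
    return sum(tax_fn(bundle) for bundle in alloc)


def anneal(
    start: Allocation,
    tax_fn: Callable[[Bundle], float],
    steps: int = 20000,
    t0: float = 5.0,
    seed: int = 0,
) -> Tuple[Allocation, float]:
    """Simulated annealing over allocations using item-transfer moves."""
    rng = random.Random(seed)
    current = [dict(b) for b in start]
    current_cost = total_tax(current, tax_fn)
    best, best_cost = [dict(b) for b in current], current_cost
    for step in range(steps):
        temp = t0 * (1.0 - step / steps) + 1e-9  # linear cooling, never zero
        src, dst = rng.sample(range(len(current)), 2)
        owned = [item for item, n in current[src].items() if n > 0]
        if not owned:
            continue
        item = rng.choice(owned)
        # Move one unit from src to dst; per-item totals stay fixed.
        current[src][item] -= 1
        current[dst][item] = current[dst].get(item, 0) + 1
        new_cost = total_tax(current, tax_fn)
        delta = new_cost - current_cost
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            current_cost = new_cost
            if new_cost < best_cost:
                best, best_cost = [dict(b) for b in current], new_cost
        else:
            # Rejected: undo the transfer.
            current[src][item] += 1
            current[dst][item] -= 1
    return best, best_cost


def toy_tax(bundle: Bundle) -> float:
    """Hypothetical objective: linear rates, flat discount at 5+ horns."""
    horns = bundle.get("Unicorn Horn", 0)
    heads = bundle.get("Dragon Head", 0)
    tax = 1.0 * horns + 4.0 * heads
    if horns >= 5:
        tax -= 3.0
    return tax


start = [
    {"Unicorn Horn": 3, "Dragon Head": 1},
    {"Unicorn Horn": 2, "Dragon Head": 1},
]
start_cost = total_tax(start, toy_tax)
best, best_cost = anneal(start, toy_tax)
print(start_cost, best_cost, best)
```

In practice `tax_fn` would be the trained booster's prediction for one adventurer's bundle, so the search would only be as good as that model.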

Posting my findings in the comment below.

robo on Why Should I Assume CCP AGI is Worse Than USG AGI?

There's more variance within countries than between countries. Where did the disruptive upstart that cares about Free Software (free as in freedom, not as in beer) come from? China. Is that because China's more libertarian than the US? No, it's because there's a wide variance in both the US and China and by chance the most software-libertarian company was Chinese. Don't treat countries like point estimates.

da_peach on Power Lies Trembling: a three-book review

To sway public opinion about AI safety, let us consider the case of nuclear warfare—a domain where long-term safety became a serious institutional concern. Nuclear technology wasn’t always surrounded by protocols, safeguards, and watchdogs. In the early days, it was a raw demonstration of power: the bombs dropped on Hiroshima and Nagasaki were enough to show the sheer magnitude of destruction possible. That spectacle shocked the global conscience. It didn’t take long before nation after nation realized that this wasn't just a powerful new toy, but an existential threat. As more countries acquired nuclear capabilities, the world recognized the urgent need for checks, treaties, and oversight. What began as an arms race slowly transformed into a field of serious, respected research and diplomacy—nuclear safety became a field in its own right.

The point is: public concern only follows recognition of risk. AI safety, like nuclear safety, will only be taken seriously when people see it as more than sci-fi paranoia. For that shift to happen, we need respected institutions to champion the threat. Right now, it’s mostly academics raising the alarm. But the public—especially the media and politicians—won’t engage until the danger is demonstrated or convincingly explained. Unfortunately for the AI safety issue, evidence of AI misalignment causing significant trouble will probably mean it's too late.
Adding fuel to this fire is the fact that politicians aren't gonna campaign on AI safety if the corpos in their country don't want them to and their enemies are already neck and neck with them in AI dev.

In my subjective opinion, we need the AI variant of Hiroshima. But I'm not too keen on this idea, for it is a rather dreadful thought.

Edit: I should clarify what I mean by "the AI variant of Hiroshima." I don't think a large-scale inhuman military operation is necessary (as I already said, I don't want AI warfare). What I mean instead is something that causes significant damage and makes newspaper headlines worldwide. Examples: strong evidence that AI swayed a presidential election one way; a gigantic economic crash caused by a rogue AI (not the AI bubble bursting); millions of jobs being lost in a short timeframe because of one revolutionary model, which then snaps because of misalignment; etc. These are still dreadful, but at least no human lives are lost, and they get the point across that AI safety is an existential issue.

quetzal_rainbow on quetzal_rainbow's Shortform

I mostly think about alignment methods like "model-based RL which maximizes reward iff it outputs an action which is provably good under our specification of good".

nico-hillbrand on Why does LW not put much more focus on AI governance and outreach?

1: My understanding is the classic arguments go something like: Assume interpretability won't work (illegible CoT, probes don't catch most problematic things). Assume we're training our AI on diverse tasks and human feedback. It'll sometimes get reinforced for deception. Assume useful proxy goals for solving tasks become drives that the AI comes up with instrumental strategies to achieve. Deception is often a useful instrumental strategy. Assume that alien or task focused drives win out over potential honesty etc drives because they're favoured by inductive biases. You get convergent deception.

I'm guessing you have interpretability working as the main crux, and that, together with inductive biases for nice behaviours potentially winning, it drives this story to low probability. Is that right?

A different story with interpretability at least somewhat working would be the following:

We again have deception by default because of human reinforcement for sycophancy and looking like solving problems (like in o3) as well as because of inductive biases for alien goals. However this time our interpretability methods work and since the AI is smart enough to know when it's deceptive we can catch correlates of that representation with interpretability techniques.

Within the project developing the AI, the vibes and compute commitments are that the main goal is to go fast and outcompete others, and a secondary goal is to be safe, which gets maybe 1-10% of resources. So then as we go along we have the deception monitors constantly going off. People debate what they should do about it. They come to the conclusion that they can afford to do some amount of control techniques and resample the most deceptive outputs, investigate more, etc., but mostly still use the model. They then find various training techniques that don't directly train on their deception classifier but are evaluated against it and train on some subset of the classifications of another detector, which leads to reducing deception without fixing the underlying problem of reinforcement and inductive biases. The probes have been optimized against and stop being useful. We're in the first scenario of interp not working now.

What are your thoughts in this scenario?

2: Assume that the AIs are aligned in the sense of following the spirit of a constitution and a reward process involving a bunch of human feedback. Assume the companies / projects deploying them have as their main goals to increase their power with constraints on not looking bad to peers etc. I'm not sure they'd use the AIs in such a way that the AIs push back on their orders to make them more powerful because they're aligned to a constitution or similar that makes them push for a nicer society with coordination mechanisms etc. I do agree that very likely the main projects will ask their AIs to set up coordination between them to stay in power and to have enough breathing room to do that. We wouldn't have full on gradual disempowerment there but instead a concentration of power. So then possibly because there's ideological pressure not to have concentration of power people will prevent the setup of these coordination mechanisms in general and we get a distributed AI world with AIs aligned to individual users and constitutions, but unable to help them with setting up coordination mechanisms.

I think we're unlikely to block coordination mechanisms (ie China is already pretty centralised and probably wouldn't do that within itself), but still curious about your thoughts on this.

jblack on Shortform

Yes, player 2 loses with extremely low probability even for a 1-bit hash (on the order of 2^-256). For a more commonly used hash, or for 2^24 searches on their second-last move, they reduce their probability of loss by a huge factor more.

luck-1 on Moral patienthood of simulated minds allows uncountable infinity of value on finite hardware

I claim that it is possible to create a program which can be interpreted as running an uncountably infinite number of simulations. Does this interpretation carry any weight for morality? Can a simulation be viewed as a conscious mind? These questions have different answers in different philosophical frameworks. And yes, it does create weird implications in those frameworks that answer "yes" to both questions. My response is to just discard those frameworks, and use something else. What about other sizes of infinity? I don't know. I expect that it is possible to construct such a hypervisor for any size of infinity, but I'm not very interested in doing it, because I've already discarded those philosophical frameworks in which it's important.

4gravitons on Lucius Bushnaq's Shortform

I think you'd be hard-pressed to get a scientist to admit that the money was lost. ;)

Honestly, it's not obvious that it would have been possible to do Advanced LIGO without the experience from the initial run, which is kind of the point I was making: we don't usually have tasks that humanity needs to get right on the first try. To the contrary, humanity usually needs to fail a few times first!

But the initial budget was around $400 million, and the upgrade took another $200 million. I don't know how much was spent operating the experiment in its initial run, which I guess would be the cleanest proxy for money "wasted", if you're imagining a counterfactual where they got it right on the first try.