LessWrong 2.0 Reader

← previous page (newer posts) · next page (older posts) →

[link] A computational no-coincidence principle
Eric Neyman (UnexpectedValues) · 2025-02-14T21:39:39.277Z · comments (38)

[link] Frontier AI Models Still Fail at Basic Physical Tasks: A Manufacturing Case Study
Adam Karvonen (karvonenadam) · 2025-04-14T17:38:02.918Z · comments (41)

AI companies are unlikely to make high-assurance safety cases if timelines are short
ryan_greenblatt · 2025-01-23T18:41:40.546Z · comments (5)

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes (john-hughes) · 2025-04-08T17:32:55.315Z · comments (19)

The Most Forbidden Technique
Zvi · 2025-03-12T13:20:04.732Z · comments (9)

OpenAI #12: Battle of the Board Redux
Zvi · 2025-03-31T15:50:02.156Z · comments (1)

Planning for Extreme AI Risks
joshc (joshua-clymer) · 2025-01-29T18:33:14.844Z · comments (5)

Ten people on the inside
Buck · 2025-01-28T16:41:22.990Z · comments (28)

Auditing language models for hidden objectives
Sam Marks (samuel-marks) · 2025-03-13T19:18:32.638Z · comments (15)

[link] The Hidden Cost of Our Lies to AI
Nicholas Andresen (nicholas-andresen) · 2025-03-06T05:03:47.239Z · comments (17)

[link] The Failed Strategy of Artificial Intelligence Doomers
Ben Pace (Benito) · 2025-01-31T18:56:06.784Z · comments (78)

The Milton Friedman Model of Policy Change
JohnofCharleston · 2025-03-04T00:38:56.778Z · comments (17)

Anomalous Tokens in DeepSeek-V3 and r1
henry (henry-bass) · 2025-01-25T22:55:41.232Z · comments (2)

[question] How Much Are LLMs Actually Boosting Real-World Programmer Productivity?
Thane Ruthenis · 2025-03-04T16:23:39.296Z · answers+comments (51)

Training AGI in Secret would be Unsafe and Unethical
Daniel Kokotajlo (daniel-kokotajlo) · 2025-04-18T12:27:35.795Z · comments (11)

Tell me about yourself: LLMs are aware of their learned behaviors
Martín Soto (martinsq) · 2025-01-22T00:47:15.023Z · comments (5)

Some articles in “International Security” that I enjoyed
Buck · 2025-01-31T16:23:27.061Z · comments (10)

The Paris AI Anti-Safety Summit
Zvi · 2025-02-12T14:00:07.383Z · comments (21)

The Pando Problem: Rethinking AI Individuality
Jan_Kulveit · 2025-03-28T21:03:28.374Z · comments (13)

[question] when will LLMs become human-level bloggers?
nostalgebraist · 2025-03-09T21:10:08.837Z · answers+comments (34)

Gradual Disempowerment, Shell Games and Flinches
Jan_Kulveit · 2025-02-02T14:47:53.404Z · comments (36)

Anthropic, and taking "technical philosophy" more seriously
Raemon · 2025-03-13T01:48:54.184Z · comments (29)

AI-enabled coups: a small group could use AI to seize power
Tom Davidson (tom-davidson-1) · 2025-04-16T16:51:29.561Z · comments (16)

Ctrl-Z: Controlling AI Agents via Resampling
Aryan Bhatt (abhatt349) · 2025-04-16T16:21:23.781Z · comments (0)

[link] Do reasoning models use their scratchpad like we do? Evidence from distilling paraphrases
Fabien Roger (Fabien) · 2025-03-11T11:52:38.994Z · comments (23)

How I've run major projects
benkuhn · 2025-03-16T18:40:04.223Z · comments (10)

[link] Research directions Open Phil wants to fund in technical AI safety
jake_mendel · 2025-02-08T01:40:00.968Z · comments (21)

Learned pain as a leading cause of chronic pain
SoerenMind · 2025-04-09T11:57:58.523Z · comments (14)

Do models say what they learn?
Andy Arditi (andy-arditi) · 2025-03-22T15:19:18.800Z · comments (12)

The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (30)

The News is Never Neglected
lsusr · 2025-02-11T14:59:48.323Z · comments (18)

Downstream applications as validation of interpretability progress
Sam Marks (samuel-marks) · 2025-03-31T01:35:02.722Z · comments (2)

[link] Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
jake_mendel · 2025-02-06T18:58:53.076Z · comments (0)

Thread for Sense-Making on Recent Murders and How to Sanely Respond
Ben Pace (Benito) · 2025-01-31T03:45:48.201Z · comments (146)

2024 Unofficial LessWrong Survey Results
Screwtape · 2025-03-14T22:29:00.045Z · comments (28)

Two hemispheres - I do not think it means what you think it means
Viliam · 2025-02-09T15:33:53.391Z · comments (21)

[link] Explaining British Naval Dominance During the Age of Sail
Arjun Panickssery (arjun-panickssery) · 2025-03-28T05:47:28.561Z · comments (5)

[link] Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)
lewis smith (lsgos) · 2025-03-26T19:07:48.710Z · comments (15)

You can just wear a suit
lsusr · 2025-02-26T14:57:57.260Z · comments (48)

[link] Attribution-based parameter decomposition
Lucius Bushnaq (Lblack) · 2025-01-25T13:12:11.031Z · comments (21)

New Cause Area Proposal
CallumMcDougall (TheMcDouglas) · 2025-04-01T07:12:34.360Z · comments (4)

My supervillain origin story
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-27T12:20:46.101Z · comments (1)

AI 2027: Responses
Zvi · 2025-04-08T12:50:02.197Z · comments (3)

Three Months In, Evaluating Three Rationalist Cases for Trump
Arjun Panickssery (arjun-panickssery) · 2025-04-18T08:27:27.257Z · comments (23)

[link] Steering Gemini with BiDPO
TurnTrout · 2025-01-31T02:37:55.839Z · comments (5)

Judgements: Merging Prediction & Evidence
abramdemski · 2025-02-23T19:35:51.488Z · comments (5)

Among Us: A Sandbox for Agentic Deception
7vik (satvik-golechha) · 2025-04-05T06:24:49.000Z · comments (4)

AGI Safety & Alignment @ Google DeepMind is hiring
Rohin Shah (rohinmshah) · 2025-02-17T21:11:18.970Z · comments (19)

[link] Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-02-06T15:46:53.024Z · comments (9)

Fake thinking and real thinking
Joe Carlsmith (joekc) · 2025-01-28T20:05:06.735Z · comments (11)

Recent comments

mondsemmel on Charbel-Raphaël's Shortform

Altman has already signed the CAIS Statement on AI Risk ("Mitigating the risk of extinction from AI should be a global priority alongside other societal-scale risks such as pandemics and nuclear war."), but OpenAI's actions almost exclusively exacerbate extinction risk, and nowadays Altman and OpenAI even downplay the very existence of this risk.

ozyrus on Is Gemini now better than Claude at Pokémon?

Not being able to do it right now is perfectly fine; it still warrants setting it up to see when exactly they will start to be able to do it.

robo on Why Should I Assume CCP AGI is Worse Than USG AGI?

This is true, and it strongly influences the ways Americans think about how to provide public goods to the rest of the world. But they're thinking about how to provide public goods to the rest of the world[1]. "America First" is controversial in American intellectual circles, whereas in my (limited) conversations in China people are usually confused about what other sort of policy you would have.

[1] Disclosure: I'm American, I came of age in this era

charbel-raphael on Charbel-Raphaël's Shortform

X could also be agreeing to sign a public statement about the need to do something or whatever.

lblack on AE Studio is hiring!

I like AE Studios. They seem to genuinely care about AI not killing everyone, and have been willing to actually back original research ideas that don't fit into existing paradigms.

Side note:

Previous posts have been met with great reception by the likes of Eliezer Yudkowsky and Emmett Shear, so we’re up to something good.

This might be a joke, but just in case it's not: I don't think you should reason about your own alignment research agenda like this. I think Eliezer would probably be the first person to tell you that.

hiandrewquinn on Training AGI in Secret would be Unsafe and Unethical

If so, by default the existence of AGI will be a closely guarded secret for some months. Only a few teams within an internal silo, plus leadership & security, will know about the capabilities of the latest systems.

This will result in a situation where only a few dozen people will be charged with ensuring that, and figuring out whether, the latest AIs are aligned/trustworthy/etc.

These are precisely the kinds of outcomes fine-insured bounties would excel at stopping, in my thinking. https://www.lesswrong.com/posts/rgFh6kE9FFjMuYrhc/fine-insured-bounties-for-preventing-dangerous-ai

mondsemmel on Charbel-Raphaël's Shortform

Even if this strategy would work in principle among particularly honorable humans, surely Sam Altman in particular has already conclusively proven that he cannot be trusted to honor any important agreements? See: the OpenAI board drama; the attempt to turn OpenAI's nonprofit into a for-profit; etc.

adamzerner on adamzerner's Shortform

I've been doing Quantified Intuitions' Estimation Game every month. I really enjoy it. A big thing I've learned from it is the instinct to think in terms of orders of magnitude.

Well, not necessarily orders of magnitude, but something similar. For example, a friend just asked me about building a little web app calculator to provide better handicaps in golf scrambles. In the past I'd get a little overwhelmed thinking about how much time such a project would take and default to saying no. But this time I noticed myself approaching it differently.

Will it take minutes? Eh, probably not. Hours? Possibly, but seems a little optimistic. Days? Yeah, seems about right. Weeks? Eh, possibly, but even with the planning fallacy, I'd be surprised. Months? No, it won't take that long. Years? No way.

With this approach I can figure out the right ballpark very quickly. It's helpful.

osten on Implications for the likelihood of human extinction from the recent discovery of possible microbial life

Probably not saying something new here, but what is the probability of you living at a certain point in the evolution of the universe? Because of grabby aliens, and the likelihood that grabby aliens don't have human-like minds, you're not likely to live late in the history of the universe but rather early, when the life-density of the universe is largest but the various life forms haven't propagated their grabby variants yet.

davidmanheim on Davidmanheim's Shortform

We can point to areas of chess like the endgame databases, which are just plain inscrutable

I think there is a key difference in places where the answers are just exhaustive search, rather than more intelligence. AI isn't better at that than humans, and from the little I understand, AI doesn't outperform in endgames (compared to its overperformance in general) via better policy engines; it does so via direct memorization or longer lookahead.

The difference here matters even more for other domains with far larger action spaces, since the exponential increase makes intelligence less marginally valuable at finding increasingly rare solutions. The design space for viruses is huge, and the design space for nanomachines using arbitrary configurations is even larger. If move-37-like intuitions are common, they will be able to do things humans cannot understand, whereas if it's more like chess endgames, they will need to search an exponential space in ways that are infeasible for them.

This relates closely to a folk theorem about NP-complete problems, where exponential problems are approximately solvable with greedy algorithms in n log n or n^2 time, and TSP is NP-complete but actual salesmen find sufficiently efficient routes easily.
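To make the greedy idea concrete, here is a minimal sketch of the nearest-neighbor heuristic for TSP, assuming points in the Euclidean plane. It is an illustration only, not something taken from the comment above; the function names are hypothetical. It runs in O(n^2) time and returns a tour that is often reasonable in practice but carries no optimality guarantee.

```python
import math


def nearest_neighbor_tour(points):
    """Greedy tour: start at points[0], then always go to the closest unvisited point.

    Ties are broken by the lower index so the result is deterministic.
    """
    if not points:
        return []
    unvisited = set(range(1, len(points)))
    tour = [0]
    while unvisited:
        last = points[tour[-1]]
        # Pick the closest remaining point; (distance, index) breaks ties by index.
        nxt = min(unvisited, key=lambda i: (math.dist(last, points[i]), i))
        unvisited.remove(nxt)
        tour.append(nxt)
    return tour


def tour_length(points, tour):
    """Total length of the closed tour, including the edge back to the start."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


if __name__ == "__main__":
    square = [(0, 0), (0, 1), (1, 1), (1, 0)]
    tour = nearest_neighbor_tour(square)
    print(tour, tour_length(square, tour))
```

Each step is a purely local choice, which is why the method is cheap. On adversarial inputs the resulting tour can be much longer than the optimum, so this shows "good enough quickly" rather than solving the NP-complete problem itself.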

But what part are you unsure about?

Yeah, on reflection, the music analogy wasn't a great one. I am not concerned that pattern creation that we can't intuit could exist - humans can do that as well. (For example, it's easy to make puzzles no-one can solve.) The question is whether important domains are amenable to kinds of solutions that ASI can understand robustly in ways humans cannot. That is, can ASI solve "impossible" problems?

One specific concerning difference is whether ASI could play perfect social 12-D chess by being a better manipulator, despite all of the human-experienced uncertainties, and engineer arbitrary outcomes in social domains. There clearly isn't a feasible search strategy with exact evaluation, but if it is far smarter than "human-legible ranges" of thinking, it might be possible.

This isn't just relevant for AI risk, of course. Another area is biological therapies, where, for example, it seems likely that curing or reversing aging requires the same sort of brilliant insight into insane complexity, figuring out whether there would be long-term or unexpected out-of-distribution impacts years later, without actually conducting multi-decade, large-scale trials.