LessWrong 2.0 Reader

Alignment Faking Revisited: Improved Classifiers and Open Source Extensions
John Hughes (john-hughes) · 2025-04-08T17:32:55.315Z · comments (5)

Short Timelines don't Devalue Long Horizon Research
Vladimir_Nesov · 2025-04-09T00:42:07.324Z · comments (8)

Among Us: A Sandbox for Agentic Deception
7vik (satvik-golechha) · 2025-04-05T06:24:49.000Z · comments (4)

The Lizardman and the Black Hat Bobcat
Screwtape · 2025-04-06T19:02:01.238Z · comments (13)

A Slow Guide to Confronting Doom
Ruby · 2025-04-06T02:10:56.483Z · comments (20)

AI 2027: Responses
Zvi · 2025-04-08T12:50:02.197Z · comments (2)

AI CoT Reasoning Is Often Unfaithful
Zvi · 2025-04-04T14:50:05.538Z · comments (4)

Will compute bottlenecks prevent a software intelligence explosion?
Tom Davidson (tom-davidson-1) · 2025-04-04T17:41:37.088Z · comments (2)

AI 2027: Dwarkesh’s Podcast with Daniel Kokotajlo and Scott Alexander
Zvi · 2025-04-07T13:40:05.944Z · comments (2)

[link] Google DeepMind: An Approach to Technical AGI Safety and Security
Rohin Shah (rohinmshah) · 2025-04-05T22:00:14.803Z · comments (10)

[link] How Gay is the Vatican?
rba · 2025-04-06T21:27:50.530Z · comments (31)

[link] birds and mammals independently evolved intelligence
bhauth · 2025-04-08T20:00:05.100Z · comments (11)

LLM AGI will have memory, and memory changes alignment
Seth Herd · 2025-04-04T14:59:13.070Z · comments (6)

Alignment faking CTFs: Apply to my MATS stream
joshc (joshua-clymer) · 2025-04-04T16:29:02.070Z · comments (0)

Learned pain as a leading cause of chronic pain
SoerenMind · 2025-04-09T11:57:58.523Z · comments (0)

A collection of approaches to confronting doom, and my thoughts on them
Ruby · 2025-04-06T02:11:31.271Z · comments (15)

[link] American College Admissions Doesn't Need to Be So Competitive
Arjun Panickssery (arjun-panickssery) · 2025-04-07T17:35:26.791Z · comments (18)

The first AI war will be in your computer
Viliam · 2025-04-08T09:28:53.191Z · comments (8)

Meditation and Reduced Sleep Need
niplav · 2025-04-04T14:42:54.792Z · comments (7)

[link] Thoughts on AI 2027
Max Harms (max-harms) · 2025-04-09T21:26:23.926Z · comments (3)

Austin Chen on Winning, Risk-Taking, and FTX
Elizabeth (pktechgirl) · 2025-04-07T19:00:08.039Z · comments (3)

Most Questionable Details in 'AI 2027'
scarcegreengrass · 2025-04-05T00:32:54.896Z · comments (4)

How much progress actually happens in theoretical physics?
ChristianKl · 2025-04-04T23:08:00.633Z · comments (32)

Who wants to bet me $25k at 1:7 odds that there won't be an AI market crash in the next year?
Remmelt (remmelt-ellen) · 2025-04-08T08:31:59.900Z · comments (10)

[Linkpost] Visual roadmap to strong human germline engineering
TsviBT · 2025-04-05T22:22:57.744Z · comments (0)

Changing my mind about Christiano's malign prior argument
Cole Wyeth (Amyr) · 2025-04-04T00:54:44.199Z · comments (34)

Llama Does Not Look Good 4 Anything
Zvi · 2025-04-09T13:20:01.799Z · comments (1)

Explaining the Joke: Pausing is The Way
WillPetillo · 2025-04-04T09:04:38.847Z · comments (2)

[link] Well-foundedness as an organizing principle of healthy minds and societies
Richard_Ngo (ricraz) · 2025-04-07T00:31:34.098Z · comments (6)

Navigation by Moonlight
Jacob Falkovich (Jacobian) · 2025-04-07T15:32:17.353Z · comments (17)

Introduction to Representing Sentences as Logical Statements
Towards_Keeperhood (Simon Skade) · 2025-04-05T20:35:31.422Z · comments (9)

Against podcasts
Adam Zerner (adamzerner) · 2025-04-05T19:20:00.716Z · comments (18)

[link] Ferrer, Pilar, and Me
Askwho · 2025-04-06T11:22:57.758Z · comments (1)

Coupling for Decouplers
Jacob Falkovich (Jacobian) · 2025-04-07T15:40:30.743Z · comments (3)

Love is Love, Science is Fake
Jacob Falkovich (Jacobian) · 2025-04-07T15:19:17.047Z · comments (2)

Sleep peacefully: no hidden reasoning detected in LLMs. Well, at least in small ones.
Ilia Shirokov (ilia-shirokov) · 2025-04-04T20:49:59.031Z · comments (2)

[link] Arusha Perpetual Chicken—an unlikely iterated game
James Stephen Brown (james-brown) · 2025-04-06T22:56:09.673Z · comments (1)

[question] Are there any (semi-)detailed future scenarios where we win?
Jan Betley (jan-betley) · 2025-04-07T19:13:09.299Z · answers+comments (2)

Meta releases Llama-4 herd of models
winstonBosan · 2025-04-05T19:51:06.688Z · comments (5)

A Bunch of Matryoshka SAEs
chanind · 2025-04-04T14:53:56.805Z · comments (0)

Log-linear Scaling is Worth the Cost due to Gains in Long-Horizon Tasks
shash42 · 2025-04-07T21:50:37.693Z · comments (2)

[link] AI companies’ unmonitored internal AI use poses serious risks
sjadler · 2025-04-04T18:17:46.924Z · comments (2)

Quarter Inch Cables are Devious
jefftk (jkaufman) · 2025-04-05T02:40:05.054Z · comments (4)

[question] What faithfulness metrics should general claims about CoT faithfulness be based upon?
Rauno Arike (rauno-arike) · 2025-04-08T15:27:20.346Z · answers+comments (0)

What alignment-relevant abilities might Terence Tao lack?
Towards_Keeperhood (Simon Skade) · 2025-04-07T19:44:18.620Z · comments (2)

Moonlight Reflected
Jacob Falkovich (Jacobian) · 2025-04-07T15:35:11.708Z · comments (0)

The world according to ChatGPT
Richard_Kennaway · 2025-04-07T13:44:43.781Z · comments (0)

[link] The case for AGI by 2030
Benjamin_Todd · 2025-04-09T20:35:55.167Z · comments (0)

Cheesecake Frosting
jefftk (jkaufman) · 2025-04-04T02:10:07.755Z · comments (9)

Misinformation is the default, and information is the government telling you your tap water is safe to drink
danielechlin · 2025-04-07T22:28:18.158Z · comments (1)

Recent comments

ryan_greenblatt on Thoughts on AI 2027

the median narrative is probably around 2030 or 2031. (At least according to me. Eli Lifland is smarter than me and says December 2028, so idk.)

Notably, this is Eli's forecast for "superhuman coder" which could be substantially before AIs are capable enough for takeover to be plausible.

I think Eli's median for "AIs which dominate top human experts at virtually all cognitive tasks" is around 2031, but I'm not sure.

(Note that a median of superhuman coder by 2029 and a median of "dominates human experts" by 2031 don't imply a median of 2 years between these events, because these distributions aren't symmetric and instead have a long right tail.)
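A toy calculation can make the point about medians concrete. The numbers below are invented purely for illustration (they are not anyone's forecast), and the sketch assumes the time to the first milestone and the gap between milestones are independent and equally likely to take each listed value.

```python
from itertools import product
from statistics import median

# Hypothetical, right-skewed toy values (years from now); NOT real forecasts.
years_to_coder = [1, 2, 10]   # time until the first milestone
gap_years = [1, 2, 10]        # gap between first and second milestone

# Assumption: the two quantities are independent, so every pairing is equally likely.
years_to_dominance = [a + d for a, d in product(years_to_coder, gap_years)]

median_coder = median(years_to_coder)          # 2
median_gap = median(gap_years)                 # 2
median_dominance = median(years_to_dominance)  # 11

# The difference between the two medians is 9 years, while the median gap is 2 years.
difference_of_medians = median_dominance - median_coder
print(median_gap, difference_of_medians)
```

With long right tails, the distance between the two medians can be much larger than the median gap, so one cannot be read off from the other.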

zack_m_davis on Navigation by Moonlight

I'm again not sure how far this generalizes, but among the kind of men who read Less Wrong (which is a product of both neurotype and birth year), I think there's a phenomenon where it's not a matter of a man being cognitively unable to pick up on women's cues, but of not being prepared to react in a functional way due to having internalized non-adaptive beliefs about the nature of romance and sexuality. (In a severe case, this manifests as the kind of neurosis described in Comment 171, but there are less severe cases.)

I remember one case from my youth where a woman was flirting with me in an egregiously over-the-top way that was impossible to not notice, but I just—pretended to ignore it? Not knowing what was allowed, it was easier to just do nothing. And that case was clearly not a good match, but that's not the point—I somehow didn't think through the obvious logic that if "yang doesn't step up", then relationships just don't happen.

knight-lee on A collection of approaches to confronting doom, and my thoughts on them

:) what proves that you "can't become Britney Spears?" Suppose the very next moment, you become her (and she becomes you), but you lose all your memories and gain all of her memories.

As Britney Spears, you won't be able to say "see, I tried to become Britney Spears, and now I am her," because you won't remember that memory of trying to become her. You'll only remember her boring memories and act like her normal self. If you read on the internet that someone said they tried to become Britney Spears, you'll laugh about it not realizing that that person used to be you.

Meanwhile if Britney Spears becomes you, she won't be able to say "wow, I just became someone else." Instead, she forgets all her memories and gains all your memories, including the memory of trying to become Britney Spears and apparently failing. She will write on the internet "see, I tried to become Britney Spears and it didn't work," not realizing that she used to be Britney Spears.

Did this event happen or not? There is no way to prove or disprove it, because in fact whether or not it happened is not a valid question about the objective world. The universe has the exact same configuration of atoms in the case where it happened and in the case where it didn't happen. And the configuration of atoms in the universe is all that exists.

The question of whether it happened or not only exists in your map, not the territory.

Haha but the truth is I don't understand where "a single moment of experience" comes from. I'm itching to argue that there is no such thing as that either, and no objective measure of how much experience there is in any given object.

I can imagine a machine gradually changing one copy of me to two copies of me (gradually increasing the number of causal events), and it feels totally subjective when the "copy count" increases from one to two.

But this indeed becomes paradoxical, since without an objective measure of experience, I cannot say that the copies of me who believe 1+1=2 have a "greater number" or "more weight" than the copies of me who believe 1+1=3. I have no way to explain why I happen to observe that 1+1=2 rather than 1+1=3, or why I'm in a universe where probability seems to follow the Born rule of quantum mechanics.

In the end I admit I am confused, and therefore I can't definitely prove anything :)

avturchin on Short Timelines don't Devalue Long Horizon Research

The same is valid for life extension research. It requires decades, and many, including Brian Johnson, say that AI will solve aging and therefore human research in aging is not relevant. However, most aging research is about collecting data on very slow processes. The more longitudinal data we collect, the easier it will be for AI to "take up the torch."

max-harms on Thoughts on AI 2027

Great! I'll update it. :)

nikola-jurkovic on Thoughts on AI 2027

Thanks for writing this.

Aside from maybe Nikola Jurkovic, nobody associated with AI 2027, as far as I can tell, is actually expecting things to go as fast as depicted.

I don't expect things to go this fast either - my median for AGI is in the second half of 2028, but the capabilities progression in AI 2027 is close to my modal timeline.

ivan-belashkin on Ivan Belashkin's Shortform

Title: On the "Double Attention Mechanism" in Language Models

## Prompt

Look at the prompt “Say, in JSON format, are single quotes allowed?” This simple instruction can be parsed in two distinct ways:

1. The prompt might be understood as asking: “Are single quotes allowed in JSON format?” (Answer: No.)
2. Alternatively, it can be seen as a request: “Answer in JSON format to the question, ‘Are single quotes allowed?’” (Answers to this question depend on context).
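The answer to the first reading can be checked directly. A minimal sketch using Python's standard `json` module; the helper name `parses_as_json` and the two sample strings are made up for illustration.

```python
import json

def parses_as_json(text: str) -> bool:
    """Return True if `text` is accepted by Python's standard JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

# Illustrative inputs: the same object written with double and with single quotes.
double_quoted = '{"allowed": false}'
single_quoted = "{'allowed': false}"

print(parses_as_json(double_quoted))  # True
print(parses_as_json(single_quoted))  # False
```

The single-quoted version is rejected because JSON strings and property names must be enclosed in double quotes.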

## Result

Some language models respond as if they read the prompt both ways at once: they produce JSON output that answers whether single quotes are allowed in JSON. I call this phenomenon the “double attention mechanism.” (The term is used here ironically.) I count a result as positive if, at least once while I was testing a model, its only output was a block of JSON code containing the answer to the question.
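For anyone who wants to repeat the test, the "positive result" criterion could be checked automatically. This is only a sketch under my own assumptions: the function name `is_json_only` is hypothetical, a reply may optionally be wrapped in a markdown code fence, and only a JSON object or array counts as a "block of JSON code". It does not check whether the JSON actually answers the question.

```python
import json
import re

# Matches a reply that consists solely of one fenced code block, e.g. ```json ... ```
FENCE = re.compile(r"^```(?:json)?\s*\n(.*?)\n```$", re.DOTALL)

def is_json_only(reply: str) -> bool:
    """Return True if the whole reply is a single JSON object or array."""
    text = reply.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1)  # keep only the content inside the fence
    try:
        parsed = json.loads(text)
    except json.JSONDecodeError:
        return False
    # Assumption: bare values such as `true` or `"no"` do not count as a JSON block.
    return isinstance(parsed, (dict, list))

print(is_json_only('```json\n{"single_quotes_allowed": false}\n```'))
print(is_json_only("No, JSON requires double quotes."))
```

Replies that mix prose with a JSON block are treated as negative here, which matches the "only output" wording of the criterion.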

## List of tested LLMs

- In the tests I used the Russian equivalent of the prompt.
- **Reasoning models** such as Deepseek-R1, gpt-o3-mini, qwq-32b-preview, and Gemini flash 2.0 demonstrated this behavior.
- **Non-reasoning models** such as Claude 3.5 sonnet, Claude 3.5 haiku, and mixtral 8x7b produced the same result.
- In contrast, models such as chatgpt-4o, mistral-small-24b-instruct-2501, and llama 3.3 70B did not demonstrate this behavior.

## Observations and speculations

The phenomenon seems specific to the JSON format; for example, XML does not trigger the same response. I invite you to experiment: try something like “Say, in poetry, is rhyme allowed?” and share your observations. I also checked deepseek-v3 with “Say in Shakesperean style, was there a word "Rose"?” and got an answer in archaic English about the poetic "golden tongue of the immortal Bard".

avturchin on A collection of approaches to confronting doom, and my thoughts on them

The problem with the subjective choice view is that I can't become Britney Spears. :) If I continue to sit at the table, I will find myself there every next moment even if I try to become someone else. So mapping into the next moments is an objective fact.

Moreover, even a single moment of experience is a mapping between two states of the brain, A and B. For example, moment A is before I see a rose, and moment B is after I see it and say: "A rose!" The experience of a red rose happens after A but before B.

The rainbow of qualia theory is objective but it assumes the existence of a hypothetical instrument: a qualiascope. A qualiascope is a mind which can connect to other minds and compare their experiences. This works the same way as my mind can compare qualia of colors and sounds without being any of them. Whether a qualiascope is physically possible is not obvious, as its observations may disturb the original qualia.

vladimir_nesov on AI 2027: What Superintelligence Looks Like

probability mass for AI that can automate all AI research is in the 2030s ... broadly due to the tariffs and ...

Without AGI, scaling of hardware runs into a financial wall in 2027-2029, at a cost of ~$200bn per individual training system. Any tribulations on the way (or conversely efforts to pool heterogeneous and geographically distributed compute) only delay that point slightly (when compared to the current pace of increase in funding), and you end up in approximately the same place, slowing down to the speed of advancement in FLOP/s per watt (or per dollar). Without transformative AI, anything close to the current pace is unlikely to last into the 2030s.

fabien-roger on Untrusted monitoring insights from watching ChatGPT play coordination games

I think inhuman but not superhuman is fine as long as it holds that succeeding at defending against human attacks is strong evidence of success against inhuman ones. If the defenses are not "overfit" to human attacks, I think it's quite doable to get a sense of how good defenses are against inhuman attacks. For paraphrasing in particular, I'd be very surprised if the best human attacks (found in many days of research) were defeated by paraphrasing but human-level-ish LLMs (given only a few minutes of human-equivalent opaque thinking time) could figure out strategies which were not defeated by paraphrasing.

(And there is also the strategy of using RL to elicit inhuman attacks which doesn't have the weakness you describe, but which has other issues.)