LessWrong 2.0 Reader

Hierarchical Agency: A Missing Piece in AI Alignment
Jan_Kulveit · 2024-11-27T05:49:04.241Z · comments (20)
Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)
The Game Board has been Flipped: Now is a good time to rethink what you’re doing
LintzA (alex-lintz) · 2025-01-28T23:36:18.106Z · comments (30)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)
Sorry for the downtime, looks like we got DDosd
habryka (habryka4) · 2024-12-02T04:14:30.209Z · comments (13)
[link] Open Philanthropy Technical AI Safety RFP - $40M Available Across 21 Research Areas
jake_mendel · 2025-02-06T18:58:53.076Z · comments (0)
Thread for Sense-Making on Recent Murders and How to Sanely Respond
Ben Pace (Benito) · 2025-01-31T03:45:48.201Z · comments (144)
[link] Aristocracy and Hostage Capital
Arjun Panickssery (arjun-panickssery) · 2025-01-08T19:38:47.104Z · comments (7)
[link] Announcing turntrout.com, my new digital home
TurnTrout · 2024-11-17T17:42:08.164Z · comments (24)
Two hemispheres - I do not think it means what you think it means
Viliam · 2025-02-09T15:33:53.391Z · comments (16)
[link] Attribution-based parameter decomposition
Lucius Bushnaq (Lblack) · 2025-01-25T13:12:11.031Z · comments (17)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (23)
My supervillain origin story
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-27T12:20:46.101Z · comments (0)
Reviewing LessWrong: Screwtape's Basic Answer
Screwtape · 2025-02-05T04:30:34.347Z · comments (19)
The News is Never Neglected
lsusr · 2025-02-11T14:59:48.323Z · comments (18)
[link] The Intelligence Curse
lukedrago · 2025-01-03T19:07:43.493Z · comments (26)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
A breakdown of AI capability levels focused on AI R&D labor acceleration
ryan_greenblatt · 2024-12-22T20:56:00.298Z · comments (5)
2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (49)
My AGI safety research—2024 review, ’25 plans
Steven Byrnes (steve2152) · 2024-12-31T21:05:19.037Z · comments (4)
[link] Detecting Strategic Deception Using Linear Probes
Nicholas Goldowsky-Dill (nicholas-goldowsky-dill) · 2025-02-06T15:46:53.024Z · comments (7)
How do you deal w/ Super Stimuli?
Logan Riggs (elriggs) · 2025-01-14T15:14:51.552Z · comments (25)
[link] Steering Gemini with BiDPO
TurnTrout · 2025-01-31T02:37:55.839Z · comments (5)
[question] What are the strongest arguments for very short timelines?
Kaj_Sotala · 2024-12-23T09:38:56.905Z · answers+comments (74)
The purposeful drunkard
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-12T12:27:51.952Z · comments (13)
Bigger Livers?
sarahconstantin · 2024-11-08T21:50:09.814Z · comments (13)
The nihilism of NeurIPS
charlieoneill (kingchucky211) · 2024-12-20T23:58:11.858Z · comments (7)
MIRI’s 2024 End-of-Year Update
Rob Bensinger (RobbBB) · 2024-12-03T04:33:47.499Z · comments (2)
[link] Seven lessons I didn't learn from election day
Eric Neyman (UnexpectedValues) · 2024-11-14T18:39:07.053Z · comments (33)
Reasons for and against working on technical AI safety at a frontier AI lab
bilalchughtai (beelal) · 2025-01-05T14:49:53.529Z · comments (12)
[link] A short course on AGI safety from the GDM Alignment team
Vika · 2025-02-14T15:43:50.903Z · comments (1)
[link] Anthropic: Three Sketches of ASL-4 Safety Case Components
Zach Stein-Perlman · 2024-11-06T16:00:06.940Z · comments (33)
Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models
Andrew Mack (andrew-mack) · 2024-12-03T21:19:42.333Z · comments (7)
AGI Safety & Alignment @ Google DeepMind is hiring
Rohin Shah (rohinmshah) · 2025-02-17T21:11:18.970Z · comments (9)
Comment on "Death and the Gorgon"
Zack_M_Davis · 2025-01-01T05:47:30.730Z · comments (33)
We probably won't just play status games with each other after AGI
Matthew Barnett (matthew-barnett) · 2025-01-15T04:56:38.330Z · comments (20)
C'mon guys, Deliberate Practice is Real
Raemon · 2025-02-05T22:33:59.069Z · comments (25)
[link] Finishing The SB-1047 Documentary In 6 Weeks
Michaël Trazzi (mtrazzi) · 2024-10-28T20:17:47.465Z · comments (5)
The subset parity learning problem: much more than you wanted to know
Dmitry Vaintrob (dmitry-vaintrob) · 2025-01-03T09:13:59.245Z · comments (18)
Science advances one funeral at a time
Cameron Berg (cameron-berg) · 2024-11-01T23:06:19.381Z · comments (9)
LLMs Look Increasingly Like General Reasoners
eggsyntax · 2024-11-08T23:47:28.886Z · comments (45)
Introducing Squiggle AI
ozziegooen · 2025-01-03T17:53:42.915Z · comments (15)
Zvi’s Thoughts on His 2nd Round of SFF
Zvi · 2024-11-20T13:40:08.092Z · comments (2)
Anvil Shortage
Screwtape · 2024-11-13T22:57:41.974Z · comments (16)
A very strange probability paradox
notfnofn · 2024-11-22T14:01:36.587Z · comments (27)
Matryoshka Sparse Autoencoders
Noa Nabeshima (noa-nabeshima) · 2024-12-14T02:52:32.017Z · comments (15)
The Rising Sea
Jesse Hoogland (jhoogland) · 2025-01-25T20:48:52.971Z · comments (2)
Thoughts on the conservative assumptions in AI control
Buck · 2025-01-17T19:23:38.575Z · comments (5)
Is "VNM-agent" one of several options, for what minds can grow up into?
AnnaSalamon · 2024-12-30T06:36:20.890Z · comments (54)