LessWrong 2.0 Reader


World Citizen Assembly about AI - Announcement
Camille Berger (Camille Berger) · 2025-02-11T10:51:56.948Z · comments (1)
Medical Roundup #4
Zvi · 2025-02-18T13:40:06.574Z · comments (1)
What is a circuit? [in interpretability]
Yudhister Kumar (randomwalks) · 2025-02-14T04:40:42.978Z · comments (1)
[link] Notes on the Presidential Election of 1836
Arjun Panickssery (arjun-panickssery) · 2025-02-13T23:40:23.224Z · comments (0)
Logical Correlation
niplav · 2025-02-10T23:29:10.518Z · comments (6)
Two flaws in the Machiavelli Benchmark
TheManxLoiner · 2025-02-12T19:34:35.241Z · comments (0)
[link] The Peeperi (unfinished) - By Katja Grace
Nathan Young · 2025-02-17T19:33:29.894Z · comments (0)
MATS Spring 2024 Extension Retrospective
HenningB (HenningBlue) · 2025-02-12T22:43:58.193Z · comments (0)
[question] What are the surviving worlds like?
KvmanThinking (avery-liu) · 2025-02-17T00:41:49.810Z · answers+comments (1)
System 2 Alignment
Seth Herd · 2025-02-13T19:17:56.868Z · comments (0)
Come join Dovetail's agent foundations fellowship talks & discussion
Alex_Altair · 2025-02-15T22:10:02.166Z · comments (0)
Moral Hazard in Democratic Voting
lsusr · 2025-02-12T23:17:39.355Z · comments (8)
[question] Take over my project: do computable agents plan against the universal distribution pessimistically?
Cole Wyeth (Amyr) · 2025-02-19T20:17:04.813Z · answers+comments (3)
Undergrad AI Safety Conference
JoNeedsSleep (joanna-j-1) · 2025-02-19T03:43:47.969Z · comments (0)
[link] When should we worry about AI power-seeking?
Joe Carlsmith (joekc) · 2025-02-19T19:44:25.062Z · comments (0)
Detecting AI Agent Failure Modes in Simulations
Michael Soareverix (michael-soareverix) · 2025-02-11T11:10:26.030Z · comments (0)
6 (Potential) Misconceptions about AI Intellectuals
ozziegooen · 2025-02-14T23:51:44.983Z · comments (11)
Studies of Human Error Rate
tin482 · 2025-02-13T13:43:30.717Z · comments (3)
[link] What About The Horses?
Maxwell Tabarrok (maxwell-tabarrok) · 2025-02-11T13:59:36.913Z · comments (17)
[link] Visual Reference for Frontier Large Language Models
kenakofer · 2025-02-11T05:14:24.752Z · comments (0)
The case for the death penalty
Yair Halberstadt (yair-halberstadt) · 2025-02-21T08:30:41.182Z · comments (1)
[link] Ascetic hedonism
dkl9 · 2025-02-17T15:56:30.267Z · comments (9)
Rational Utopia, Multiversal AI Alignment, Steerable ASI, Ultimate Human Freedom (V.3: Multiversal Ethics, Place ASI)
ank · 2025-02-11T03:21:40.899Z · comments (7)
[link] Systematic Sandbagging Evaluations on Claude 3.5 Sonnet
farrelmahaztra · 2025-02-14T01:22:46.695Z · comments (0)
I'm making a ttrpg about life in an intentional community during the last year before the Singularity
bgaesop · 2025-02-13T21:54:09.002Z · comments (2)
Literature Review of Text AutoEncoders
NickyP (Nicky) · 2025-02-19T21:54:14.905Z · comments (0)
Hopeful hypothesis, the Persona Jukebox.
Donald Hobson (donald-hobson) · 2025-02-14T19:24:35.514Z · comments (4)
Using Prompt Evaluation to Combat Bio-Weapon Research
Stuart_Armstrong · 2025-02-19T12:39:00.491Z · comments (0)
[link] The current AI strategic landscape: one bear's perspective
Matrice Jacobine · 2025-02-15T09:49:13.120Z · comments (0)
[link] A Bearish Take on AI, as a Treat
rats (cartier-gucciscarf) · 2025-02-10T19:22:30.593Z · comments (0)
[link] US AI Safety Institute will be 'gutted,' Axios reports
Matrice Jacobine · 2025-02-20T14:40:13.049Z · comments (0)
[link] Metaculus Q4 AI Benchmarking: Bots Are Closing The Gap
Molly (hickman-santini) · 2025-02-19T22:42:39.055Z · comments (0)
[link] DeepSeek Made it Even Harder for US AI Companies to Ever Reach Profitability
garrison · 2025-02-19T21:02:42.879Z · comments (1)
[link] Published report: Pathways to short TAI timelines
Zershaaneh Qureshi (zershaaneh-qureshi) · 2025-02-20T22:10:12.276Z · comments (0)
[question] A Simulation of Automation economics?
qbolec · 2025-02-10T08:11:04.424Z · answers+comments (1)
Dovetail's agent foundations fellowship talks & discussion
Alex_Altair · 2025-02-13T00:49:48.854Z · comments (0)
[link] OpenAI lied about SFT vs. RLHF
sanxiyn · 2025-02-10T03:24:16.625Z · comments (2)
[link] Inside the dark forests of the internet
Itay Dreyfus (itay-dreyfus) · 2025-02-12T10:20:59.426Z · comments (0)
Human-AI Relationality is Already Here
bridgebot (puppy) · 2025-02-20T07:08:22.420Z · comments (0)
SWE Automation Is Coming: Consider Selling Your Crypto
A_donor · 2025-02-13T20:17:59.227Z · comments (8)
[link] Introduction to Expected Value Fanaticism
Petra Kosonen · 2025-02-14T19:05:26.556Z · comments (8)
Talking to laymen about AI development
David Steel · 2025-02-17T18:42:23.289Z · comments (0)
If Neuroscientists Succeed
Mordechai Rorvig (mordechai-rorvig) · 2025-02-11T15:33:09.098Z · comments (6)
[link] Progress links and short notes, 2025-02-17
jasoncrawford · 2025-02-17T19:18:29.422Z · comments (0)
Call for Applications: XLab Summer Research Fellowship
JoNeedsSleep (joanna-j-1) · 2025-02-18T19:19:20.155Z · comments (0)
[link] Are SAE features from the Base Model still meaningful to LLaVA?
Shan23Chen (shan-chen) · 2025-02-18T22:16:14.449Z · comments (2)
[link] Won't vs. Can't: Sandbagging-like Behavior from Claude Models
Joe Benton · 2025-02-19T20:47:06.792Z · comments (0)
What makes a theory of intelligence useful?
Cole Wyeth (Amyr) · 2025-02-20T19:22:29.725Z · comments (0)
The Takeoff Speeds Model Predicts We May Be Entering Crunch Time
johncrox · 2025-02-21T02:26:31.768Z · comments (0)
MAISU - Minimal AI Safety Unconference
Linda Linsefors · 2025-02-21T11:36:25.202Z · comments (0)