LessWrong 2.0 Reader

View: New · Old · Top

Restrict date range: Today · This week · This month · Last three months · This year · All time

next page (older posts) →

Reliable Sources: The Story of David Gerard
TracingWoodgrains (tracingwoodgrains) · 2024-07-10T19:50:21.191Z · comments (54)

Universal Basic Income and Poverty
Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2024-07-26T07:23:50.151Z · comments (137)

80,000 hours should remove OpenAI from the Job Board (and similar EA orgs should do similarly)
Raemon · 2024-07-03T20:34:50.741Z · comments (71)

Superbabies: Putting The Pieces Together
sarahconstantin · 2024-07-11T20:40:05.036Z · comments (37)

Self-Other Overlap: A Neglected Approach to AI Alignment
Marc Carauleanu (Marc-Everin Carauleanu) · 2024-07-30T16:22:29.561Z · comments (49)

Optimistic Assumptions, Longterm Planning, and "Cope"
Raemon · 2024-07-17T22:14:24.090Z · comments (46)

Towards more cooperative AI safety strategies
Richard_Ngo (ricraz) · 2024-07-16T04:36:29.191Z · comments (133)

This is already your second chance
Malmesbury (Elmer of Malmesbury) · 2024-07-28T17:13:57.680Z · comments (13)

Safety consultations for AI lab employees
Zach Stein-Perlman · 2024-07-27T15:00:27.276Z · comments (4)

[link] Decomposing Agency — capabilities without desires
owencb · 2024-07-11T09:38:48.509Z · comments (32)

On saying "Thank you" instead of "I'm Sorry"
Michael Cohn (michael-cohn) · 2024-07-08T03:13:50.663Z · comments (16)

An Extremely Opinionated Annotated List of My Favourite Mechanistic Interpretability Papers v2
Neel Nanda (neel-nanda-1) · 2024-07-07T17:39:35.064Z · comments (16)

[link] "AI achieves silver-medal standard solving International Mathematical Olympiad problems"
gjm · 2024-07-25T15:58:57.638Z · comments (38)

Pantheon Interface
NicholasKees (nick_kees) · 2024-07-08T19:03:51.681Z · comments (22)

A List of 45+ Mech Interp Project Ideas from Apollo Research’s Interpretability Team
Lee Sharkey (Lee_Sharkey) · 2024-07-18T14:15:50.248Z · comments (18)

Efficient Dictionary Learning with Switch Sparse Autoencoders
Anish Mudide (anish-mudide) · 2024-07-22T18:45:53.502Z · comments (20)

You should go to ML conferences
Jan_Kulveit · 2024-07-24T11:47:52.214Z · comments (13)

OthelloGPT learned a bag of heuristics
jylin04 · 2024-07-02T09:12:56.377Z · comments (10)

Introduction to French AI Policy
Lucie Philippon (lucie-philippon) · 2024-07-04T03:39:45.273Z · comments (12)

Me, Myself, and AI: the Situational Awareness Dataset (SAD) for LLMs
L Rudolf L (LRudL) · 2024-07-08T22:24:38.441Z · comments (37)

[link] Most smart and skilled people are outside of the EA/rationalist community: an analysis
titotal (lombertini) · 2024-07-12T12:13:56.215Z · comments (36)

[link] Poker is a bad game for teaching epistemics. Figgie is a better one.
rossry · 2024-07-08T06:05:20.459Z · comments (47)

[link] Transformer Circuit Faithfulness Metrics Are Not Robust
Joseph Miller (Josephm) · 2024-07-12T03:47:30.077Z · comments (5)

A simple model of math skill
Alex_Altair · 2024-07-21T18:57:33.697Z · comments (16)

[link] I found >800 orthogonal "write code" steering vectors
Jacob G-W (g-w1) · 2024-07-15T19:06:17.636Z · comments (19)

Dialogue introduction to Singular Learning Theory
Olli Järviniemi (jarviniemi) · 2024-07-08T16:58:10.108Z · comments (15)

[link] Against Aschenbrenner: How 'Situational Awareness' constructs a narrative that undermines safety and threatens humanity
GideonF · 2024-07-15T18:37:40.232Z · comments (17)

A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication
johnswentworth · 2024-07-26T00:33:42.000Z · comments (2)

New page: Integrity
Zach Stein-Perlman · 2024-07-10T15:00:41.050Z · comments (3)

Covert Malicious Finetuning
Tony Wang (tw) · 2024-07-02T02:41:51.698Z · comments (4)

Reflections on Less Online
Error · 2024-07-07T03:49:44.534Z · comments (15)

AI #73: Openly Evil AI
Zvi · 2024-07-18T14:40:05.770Z · comments (20)

[link] Re: Anthropic's suggested SB-1047 amendments
RobertM (T3t) · 2024-07-27T22:32:39.447Z · comments (13)

Fluent, Cruxy Predictions
Raemon · 2024-07-10T18:00:06.424Z · comments (14)

[link] What are you getting paid in?
Austin Chen (austin-chen) · 2024-07-17T19:23:04.219Z · comments (14)

Decomposing the QK circuit with Bilinear Sparse Dictionary Learning
keith_wynroe · 2024-07-02T13:17:16.352Z · comments (7)

Scalable oversight as a quantitative rather than qualitative problem
Buck · 2024-07-06T17:42:41.325Z · comments (11)

A simple case for extreme inner misalignment
Richard_Ngo (ricraz) · 2024-07-13T15:40:37.518Z · comments (41)

3C's: A Recipe For Mathing Concepts
johnswentworth · 2024-07-03T01:06:11.944Z · comments (5)

On the CrowdStrike Incident
Zvi · 2024-07-22T12:40:05.894Z · comments (14)

Interpreting Preference Models w/ Sparse Autoencoders
Logan Riggs (elriggs) · 2024-07-01T21:35:40.603Z · comments (12)

D&D.Sci Scenario Index
aphyer · 2024-07-23T02:00:43.483Z · comments (0)

[link] LK-99 in retrospect
bhauth · 2024-07-07T02:06:27.660Z · comments (21)

Multiplex Gene Editing: Where Are We Now?
sarahconstantin · 2024-07-16T20:50:04.590Z · comments (6)

[link] Yoshua Bengio: Reasoning through arguments against taking AI safety seriously
Judd Rosenblatt (judd) · 2024-07-11T23:53:17.187Z · comments (3)

Indecision and internalized authority figures
Kaj_Sotala · 2024-07-06T10:10:02.528Z · comments (1)

Analyzing DeepMind's Probabilistic Methods for Evaluating Agent Capabilities
Axel Højmark (hojmax) · 2024-07-22T16:17:07.665Z · comments (0)

An AI Race With China Can Be Better Than Not Racing
niplav · 2024-07-02T17:57:36.976Z · comments (33)

What and Why: Developmental Interpretability of Reinforcement Learning
Garrett Baker (D0TheMath) · 2024-07-09T14:09:40.649Z · comments (4)

Brief notes on the Wikipedia game
Olli Järviniemi (jarviniemi) · 2024-07-14T02:28:22.473Z · comments (9)

next page (older posts) →

Archive

2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
- January
- February
- March
- April
- May
- June
- July
- August
- September
- October
- November
- December
2025

Recent comments

mo-putera on Mo Putera's Shortform

(Not a take, just pulling out infographics and quotes for future reference from the new DeepMind paper outlining their approach to technical AGI safety and security)

Overview of risk areas, grouped by factors that drive differences in mitigation approaches:

Refer to caption

Overview of their approach to mitigating misalignment:

Refer to caption

Overview of their approach to mitigating misuse:

Refer to caption

Path to deceptive alignment:

Refer to caption

How to use interpretability:

Goal	Understanding v Control	Confidence	Concept v Algorithm	(Un)supervised?	How context specific?
Alignment evaluations	Understanding	Any	Concept+	Either	Either
FaithfulReasoning	Understanding^∗	Any	Concept+	Supervised+	Either
DebuggingFailures	Understanding^∗	Low	Either	Unsupervised+	Specific
Monitoring	Understanding	Any	Concept+	Supervised+	General
Red teaming	Either	Low	Either	Unsupervised+	Specific
Amplified oversight	Understanding	Complicated	Concept	Either	Specific

Interpretability techniques:

Technique	Understanding v Control	Confidence	Concept v Algorithm	(Un)supervised?	How specific?	Scalability
Probing	Understanding	Low	Concept	Supervised	Specific-ish	Cheap
Dictionary learning	Both	Low	Concept	Unsupervised	General^∗	Expensive
Steering vectors	Control	Low	Concept	Supervised	Specific-ish	Cheap
Training data attribution	Understanding	Low	Concept	Unsupervised	General^∗	Expensive
Auto-interp	Understanding	Low	Concept	Unsupervised	General^∗	Cheap
Component Attribution	Both	Medium	Concept	Complicated	Specific	Cheap
Circuit analysis (causal)	Understanding	Medium	Algorithm	Complicated	Specific	Expensive

Assorted random stuff that caught my attention:

They consider Exceptional AGI (Level 4) from Morris et al. (2023), defined as an AI system that matches or exceeds that of the 99th percentile of skilled adults on a wide range of non-physical tasks (contra the Metaculus "when AGI?" question that has diverse robotic capabilities, so their 2030 is probably an overestimate)
The irrelevance of physical limits to the paper's scope: "By considering the construction of “the ultimate laptop”, Lloyd (2000) suggests that Moore’s law (formalized as an 18 month doubling) cannot last past 2250. Krauss and Starkman (2004) consider limits on the total computation achievable by any technological civilization in our expanding universe - this approach imposes a (looser) 600-year limit in Moore’s law. However, since we are very far from these limits, we do not expect them to have a meaningful impact on timelines to Exceptional AGI"
Structural risks are "out of scope of this paper" because they're "a much bigger category, often with each risk requiring a bespoke approach. They are also much harder for an AI developer to address, as they often require new norms or institutions to shape powerful dynamics in the world" (although "much of the technical work discussed in this paper will also be relevant for structural risks")
Mistakes are also out of scope because "standard safety engineering practices (e.g. testing) can drastically reduce risks, and should be similarly effective for averting AI mistakes as for human mistakes... so we believe that severe harm from AI mistakes will be significantly less likely than misuse or misalignment, and is further reducible through appropriate safety practices"
The paper focuses "primarily on techniques that can be integrated into current AI development, due to our focus on anytime approaches to safety" i.e. excludes "research bets that pay out over longer periods of time but can provide increased safety, such as agent foundations, science of deep learning, and application of formal methods to AI"
Algorithmic progress papers: "Erdil and Besiroglu (2022) sought to decompose AI progress in a way that can be attributed to the separate factors of scaling (compute, model size and data) and algorithmic innovation, and concluded that algorithmic progress doubles effective compute budgets roughly every nine months. Ho et al. (2024) further extend this approach to study algorithmic improvements in the pretraining of language models for the period of 2012 - 2023. During this period, the authors estimate that the compute required to reach a set performance threshold halved approximately every eight months"
Explosive economic growth paper: "Recent modeling by Erdil et al. (2025) that draws on empirical scaling laws and semi-endogenous growth theory and models changes in compute, automation and production supports the plausibility of very rapid growth in Gross World Product (e.g. exceeding 30% per year in 2045) when adopting parameters from empirical data, existing literature and reasoned judgment" (I'm still wondering how this will get around johnswentworth's objection to using GDP [LW · GW] to track this)
General competence scales smoothly with compute: "Owen (2024) find that aggregate benchmarks (BIG-Bench (Srivastava et al., 2023), MMLU (Hendrycks et al., 2020)) are predictable with up to 20 percentage points of error when extrapolating through one order of magnitude (OOM) of compute. Gadre et al. (2024) similarly find that aggregate task performance can be predicted with relatively high accuracy, predicting average top-1 error across 17 tasks to within 1 percentage point using 20× less compute than is used for the predicted model. Ruan et al. (2024) find that 8 standard downstream LLM benchmark scores across many model families are well-explained in terms of their top 3 principal components. Their first component scales smoothly across 5 OOMs of compute and many model families, suggesting that something like general competence scales smoothly with compute"
"given that total labor compensation represents over 50% of global GDP (International Labour Organisation, 2022), it is clear that the economic incentive for automation is extraordinarily large"

david-james on [deleted]

It is more probable that A, than that A and B.

I can see the appeal here -- litanies tend to have a particular style after all -- but I wonder if we can improve it.

I see two problems:

This doesn't convey that Occam's razor is about explanations of observations.
In general, one explanation is not a logical "subset" of the other. So the comparison is not between A and A and B; it is between A and B.

Perhaps one way forward would involve a mention (or reference to) Minimum Description Length (MDL) or Kolmogorov complexity.

seth-herd on How AI Takeover Might Happen in 2 Years

I think the assumption here that AIs are "learning to train themselves" is important. In this scenario they're producing the bulk of the data.

I also take your point that this is probably correctable with good training data. One premise of the story here seems to be that the org simply didn't try very hard to align the model. Unfortunately, I find this premise all too plausible. Fortunately, this may be a leverage point for shifting the odds. "Bother to align it" is a pretty simple and compelling message.

Even with the data-based alignment you're suggesting, I think it's still not totally clear that weird chains of thought couldn't take it off track.

davidmanheim on Is instrumental convergence a thing for virtue-driven agents?

I understand what an argument is, but I don't understand why you think that converting policies to.utility functions needs to assume no systematic errors, or why, if true, that would make it incompatible with varying intelligence.

mitchell_porter on AI #110: Of Course You Know…

Regarding the tariffs, I have taken to saying "It's not the end of the world, and it's not even the end of world trade." In the modern world, every decade sees a few global economic upheavals, and in my opinion that's all this is. It is a strong player within the world trade system (China and the EU being the other strong players), deciding to do things differently. Among other things, it's an attempt to do something about America's trade deficits, and to make the country into a net producer rather than a net consumer. Those are huge changes but now that they are being attempted, I don't see any going back. The old situation was tolerated because it was too hard to do anything about it, and the upper class was still living comfortably. I think a reasonable prediction is that world trade avoiding the US will increase, US national income may not grow as fast, but the US will re-industrialize (and de-financialize). Possibly there's some interaction with the US dollar's status as reserve currency too, but I don't know what that would be.

andrew-sauer on The "Intuitions" Behind "Utilitarianism"

Gwern seems to think this would be used as a way to get rid of corrupt oligarchs, but... Wouldn't this just immediately be co-opted by those oligarchs to solidify their power by legally paying for the assassinations of their opponents? Markets aren't democratic, because a small percentage of the people have most of the money.

cole-wyeth on Existing UDTs test the limits of Bayesianism (and consistency)

If the universe contained a source of ML-random bits they might look like uniformly random coin flips to us, even if they actually had some uncomputable distribution. For instance, perhaps spin measurements are not iid Bernoulli, but since their distribution is not computable, we aren’t able to predict it any better than that model?

I’m not sure how you’re imagining this oracle would act? Nothing like what you’re describing seems to be embedded as a physical object in spacetime, but I think that’s the wrong thing to expect, failures of computability wouldn’t act like Newtonian objects.

cole-wyeth on AI 2027: What Superintelligence Looks Like

Sure, I’ll keep it simple (will submit through proper channels later):

Here’s my attempt to change their minds: https://www.lesswrong.com/posts/vvgND6aLjuDR6QzDF/my-model-of-what-is-going-on-with-llms [LW · GW]
I’ll bet 100 USD that by 2027 AI agents have not replaced human AI engineers. If it’s hard to decide I’ll pay 50 USD.

adamzerner on LessWrong merch?

I would buy various forms of merch, including clothing. I feel very fond of LessWrong and would find it cool to wear a shirt or something with that brand.

rahulxyz on AI 2027: What Superintelligence Looks Like

My initial reaction - A lot of AI related predictions are based on "follow the curve" predictions, and this is mostly doing that. With a lack of more deeper underlying theory on the nature of intelligence, I guess that's all we get.

If you look at the trend of how far behind China is to the US, that has gone from 5 years behind 2 years ago, to maybe 3 months behind now. If you follow that curve, it seems to me that China will be ahead of the US by 2026 (even with the chip controls, and export regulations etc - my take is you're not giving them enough agency). If you want to follow the curve, IMO you can /s/USA/China after 2026 (i.e - China is ahead of the US), and I can imagine it being a better trend-following prediction. It's much less convincing to tell a story about Chinese AI labs being ahead given who we are, but I'd put at least a 50/50 chance on China being ahead vs. USA being ahead.

Other than that, thanks for putting something concrete out there. Even though it's less likely the more specific it is, I feel this will get a lot more talked about, and hopefully some people with power (i.e - governments) start paying some attention to disempowerment scenarios.