LessWrong 2.0 Reader

Momentum of Light in Glass
Ben (ben-lang) · 2024-10-09T20:19:42.088Z · comments (44)
Activation space interpretability may be doomed
bilalchughtai (beelal) · 2025-01-08T12:49:38.421Z · comments (28)
“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)
Maximizing Communication, not Traffic
jefftk (jkaufman) · 2025-01-05T13:00:02.280Z · comments (7)
How will we update about scheming?
ryan_greenblatt · 2025-01-06T20:21:52.281Z · comments (17)
[link] China Hawks are Manufacturing an AI Arms Race
garrison · 2024-11-20T18:17:51.958Z · comments (42)
What Indicators Should We Watch to Disambiguate AGI Timelines?
snewman · 2025-01-06T19:57:43.398Z · comments (48)
[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (79)
What o3 Becomes by 2028
Vladimir_Nesov · 2024-12-22T12:37:20.929Z · comments (15)
Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq (Lblack) · 2024-10-14T13:06:14.596Z · comments (8)
Why Don't We Just... Shoggoth+Face+Paraphraser?
Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-19T20:53:52.084Z · comments (55)
[link] Parkinson's Law and the Ideology of Statistics
Benquo · 2025-01-04T15:49:21.247Z · comments (5)
Capital Ownership Will Not Prevent Human Disempowerment
beren · 2025-01-05T06:00:23.095Z · comments (11)
"The Solomonoff Prior is Malign" is a special case of a simpler argument
David Matolcsi (matolcsid) · 2024-11-17T21:32:34.711Z · comments (44)
[link] OpenAI's CBRN tests seem unclear
LucaRighetti (Error404Dinosaur) · 2024-11-21T17:28:30.290Z · comments (6)
BIG-Bench Canary Contamination in GPT-4
Jozdien · 2024-10-22T15:40:48.166Z · comments (13)
Hire (or Become) a Thinking Assistant
Raemon · 2024-12-23T03:58:42.061Z · comments (43)
[link] The Dangers of Mirrored Life
Niko_McCarty (niko-2) · 2024-12-12T20:58:32.750Z · comments (7)
A bird's eye view of ARC's research
Jacob_Hilton · 2024-10-23T15:50:06.123Z · comments (12)
[link] Miles Brundage resigned from OpenAI, and his AGI readiness team was disbanded
garrison · 2024-10-23T23:40:57.180Z · comments (1)
Passages I Highlighted in The Letters of J.R.R.Tolkien
Ivan Vendrov (ivan-vendrov) · 2024-11-25T01:47:59.071Z · comments (12)
The Dream Machine
sarahconstantin · 2024-12-05T00:00:05.796Z · comments (6)
Scissors Statements for President?
AnnaSalamon · 2024-11-06T10:38:21.230Z · comments (32)
2024 in AI predictions
jessicata (jessica.liu.taylor) · 2025-01-01T20:29:49.132Z · comments (3)
Applying traditional economic thinking to AGI: a trilemma
Steven Byrnes (steve2152) · 2025-01-13T01:23:00.397Z · comments (22)
The o1 System Card Is Not About o1
Zvi · 2024-12-13T20:30:08.048Z · comments (5)
Should CA, TX, OK, and LA merge into a giant swing state, just for elections?
Thomas Kwa (thomas-kwa) · 2024-11-06T23:01:48.992Z · comments (35)
[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty
tandem · 2025-01-07T19:11:21.238Z · comments (4)
AIs Will Increasingly Attempt Shenanigans
Zvi · 2024-12-16T15:20:05.652Z · comments (2)
The Plan - 2024 Update
johnswentworth · 2024-12-31T13:29:53.888Z · comments (27)
You should consider applying to PhDs (soon!)
bilalchughtai (beelal) · 2024-11-29T20:33:12.462Z · comments (19)
Hierarchical Agency: A Missing Piece in AI Alignment
Jan_Kulveit · 2024-11-27T05:49:04.241Z · comments (20)
Ablations for “Frontier Models are Capable of In-context Scheming”
AlexMeinke (Paulawurm) · 2024-12-17T23:58:19.222Z · comments (1)
DeepSeek beats o1-preview on math, ties on coding; will release weights
Zach Stein-Perlman · 2024-11-20T23:50:26.597Z · comments (26)
Why I'm Moving from Mechanistic to Prosaic Interpretability
Daniel Tan (dtch1997) · 2024-12-30T06:35:43.417Z · comments (34)
Sorry for the downtime, looks like we got DDosd
habryka (habryka4) · 2024-12-02T04:14:30.209Z · comments (13)
A Three-Layer Model of LLM Psychology
Jan_Kulveit · 2024-12-26T16:49:41.738Z · comments (7)
The Big Nonprofits Post
Zvi · 2024-11-29T16:10:06.938Z · comments (10)
[link] Announcing turntrout.com, my new digital home
TurnTrout · 2024-11-17T17:42:08.164Z · comments (24)
Why comparative advantage does not help horses
Sherrinford · 2024-09-30T22:27:57.450Z · comments (15)
Takes on "Alignment Faking in Large Language Models"
Joe Carlsmith (joekc) · 2024-12-18T18:22:34.059Z · comments (7)
[link] Aristocracy and Hostage Capital
Arjun Panickssery (arjun-panickssery) · 2025-01-08T19:38:47.104Z · comments (7)
[link] How to replicate and extend our alignment faking demo
Fabien Roger (Fabien) · 2024-12-19T21:44:13.059Z · comments (5)
I turned decision theory problems into memes about trolleys
Tapatakt · 2024-10-30T20:13:29.589Z · comments (23)
A shortcoming of concrete demonstrations as AGI risk advocacy
Steven Byrnes (steve2152) · 2024-12-11T16:48:41.602Z · comments (27)
Building AI Research Fleets
Ben Goldhaber (bgold) · 2025-01-12T18:23:09.682Z · comments (4)
LLMs can learn about themselves by introspection
Felix J Binder (fjb) · 2024-10-18T16:12:51.231Z · comments (38)
Human takeover might be worse than AI takeover
Tom Davidson (tom-davidson-1) · 2025-01-10T16:53:27.043Z · comments (48)
2024 Unofficial LessWrong Census/Survey
Screwtape · 2024-12-02T05:30:53.019Z · comments (48)
Behavioral red-teaming is unlikely to produce clear, strong evidence that models aren't scheming
Buck · 2024-10-10T13:36:53.810Z · comments (4)