LessWrong 2.0 Reader

(The) Lightcone is nothing without its people: LW + Lighthaven's big fundraiser
habryka (habryka4) · 2024-11-30T02:55:16.077Z · comments (238)

OpenAI Email Archives (from Musk v. Altman and OpenAI blog)
habryka (habryka4) · 2024-11-16T06:38:03.937Z · comments (80)

Alignment Faking in Large Language Models
ryan_greenblatt · 2024-12-18T17:19:06.665Z · comments (53)

The hostile telepaths problem
Valentine · 2024-10-27T15:26:53.610Z · comments (85)

[link] Survival without dignity
L Rudolf L (LRudL) · 2024-11-04T02:29:38.758Z · comments (29)

[link] Biological risk from the mirror world
jasoncrawford · 2024-12-12T19:07:06.305Z · comments (36)

[link] Review: Planecrash
L Rudolf L (LRudL) · 2024-12-27T14:18:33.611Z · comments (40)

[link] I got dysentery so you don’t have to
eukaryote · 2024-10-22T04:55:58.422Z · comments (4)

What Goes Without Saying
sarahconstantin · 2024-12-20T18:00:06.363Z · comments (19)

What’s the short timeline plan?
Marius Hobbhahn (marius-hobbhahn) · 2025-01-02T14:59:20.026Z · comments (38)

The Field of AI Alignment: A Postmortem, and What To Do About It
johnswentworth · 2024-12-26T18:48:07.614Z · comments (152)

The Online Sports Gambling Experiment Has Failed
Zvi · 2024-11-11T14:30:04.371Z · comments (58)

[link] By default, capital will matter more than ever after AGI
L Rudolf L (LRudL) · 2024-12-28T17:52:58.358Z · comments (94)

Orienting to 3 year AGI timelines
Nikola Jurkovic (nikolaisalreadytaken) · 2024-12-22T01:15:11.401Z · comments (42)

You are not too "irrational" to know your preferences.
DaystarEld · 2024-11-26T15:01:42.996Z · comments (50)

Ayn Rand’s model of “living money”; and an upside of burnout
AnnaSalamon · 2024-11-16T02:59:07.368Z · comments (58)

[link] Understanding Shapley Values with Venn Diagrams
Carson L · 2024-12-06T21:56:43.960Z · comments (34)

[link] What TMS is like
Sable · 2024-10-31T00:44:22.612Z · comments (23)

Frontier Models are Capable of In-context Scheming
Marius Hobbhahn (marius-hobbhahn) · 2024-12-05T22:11:17.320Z · comments (24)

Communications in Hard Mode (My new job at MIRI)
tanagrabeast · 2024-12-13T20:13:44.825Z · comments (25)

Making a conservative case for alignment
Cameron Berg (cameron-berg) · 2024-11-15T18:55:40.864Z · comments (68)

The Hopium Wars: the AGI Entente Delusion
Max Tegmark (MaxTegmark) · 2024-10-13T17:00:29.033Z · comments (55)

[link] The Compendium, A full argument about extinction risk from AGI
adamShimi · 2024-10-31T12:01:51.714Z · comments (52)

Information vs Assurance
johnswentworth · 2024-10-20T23:16:25.762Z · comments (17)

My motivation and theory of change for working in AI healthtech
Andrew_Critch · 2024-10-12T00:36:30.925Z · comments (37)

Shallow review of technical AI safety, 2024
technicalities · 2024-12-29T12:01:14.724Z · comments (31)

[link] When Is Insurance Worth It?
kqr · 2024-12-19T19:07:32.573Z · comments (65)

[link] Overcoming Bias Anthology
Arjun Panickssery (arjun-panickssery) · 2024-10-20T02:01:23.463Z · comments (14)

The Median Researcher Problem
johnswentworth · 2024-11-02T20:16:11.341Z · comments (71)

o1 is a bad idea
abramdemski · 2024-11-11T21:20:24.892Z · comments (38)

The Summoned Heroine's Prediction Markets Keep Providing Financial Services To The Demon King!
abstractapplic · 2024-10-26T12:34:51.059Z · comments (16)

Neutrality
sarahconstantin · 2024-11-13T23:10:05.469Z · comments (27)

Current safety training techniques do not fully transfer to the agent setting
Simon Lermen (dalasnoin) · 2024-11-03T19:24:51.537Z · comments (8)

[link] o1: A Technical Primer
Jesse Hoogland (jhoogland) · 2024-12-09T19:09:12.413Z · comments (17)

o3
Zach Stein-Perlman · 2024-12-20T18:30:29.448Z · comments (155)

[link] Gradient Routing: Masking Gradients to Localize Computation in Neural Networks
cloud · 2024-12-06T22:19:26.717Z · comments (12)

[link] Arithmetic is an underrated world-modeling technology
dynomight · 2024-10-17T14:00:22.475Z · comments (33)

"It's a 10% chance which I did 10 times, so it should be 100%"
egor.timatkov · 2024-11-18T01:14:27.738Z · comments (57)

A Rocket–Interpretability Analogy
plex (ete) · 2024-10-21T13:55:18.184Z · comments (31)

Repeal the Jones Act of 1920
Zvi · 2024-11-27T15:00:06.801Z · comments (23)

“Alignment Faking” frame is somewhat fake
Jan_Kulveit · 2024-12-20T09:51:04.664Z · comments (13)

OpenAI #10: Reflections
Zvi · 2025-01-07T17:00:07.348Z · comments (6)

Subskills of "Listening to Wisdom"
Raemon · 2024-12-09T03:01:18.706Z · comments (17)

[link] China Hawks are Manufacturing an AI Arms Race
garrison · 2024-11-20T18:17:51.958Z · comments (42)

Maximizing Communication, not Traffic
jefftk (jkaufman) · 2025-01-05T13:00:02.280Z · comments (7)

[question] Which things were you surprised to learn are not metaphors?
Eric Neyman (UnexpectedValues) · 2024-11-21T18:56:18.025Z · answers+comments (79)

How will we update about scheming?
ryan_greenblatt · 2025-01-06T20:21:52.281Z · comments (4)

Circuits in Superposition: Compressing many small neural networks into one
Lucius Bushnaq (Lblack) · 2024-10-14T13:06:14.596Z · comments (8)

What Indicators Should We Watch to Disambiguate AGI Timelines?
snewman · 2025-01-06T19:57:43.398Z · comments (35)

Why Don't We Just... Shoggoth+Face+Paraphraser?
Daniel Kokotajlo (daniel-kokotajlo) · 2024-11-19T20:53:52.084Z · comments (55)

Recent comments

martin-randall on What is the most impressive game LLMs can play well?

Diplomacy AI by Meta is a clear example of how adding LLMs can improve narrow game playing. Most multiplayer games with communication will benefit in the same way.

martin-randall on Preference Inversion

I don't understand your perspective. Many forms of warfighting required soldiers to move towards danger. Your basic Roman infantry charge is running towards people who want to stab you. Some video games can be seen as a safe super-stimulus for warfighting and require moving towards danger. If an innate preference for warfighting stopped being reproductively fit in the 1800s that won't have a large genetic effect by the 2000s.

The Bible is large and includes both pro-war and anti-war passages. I agree that a preference for warfighting is not universal, but innate preferences need not be universal. Some are attracted to men, some are attracted to women, some both, some neither; this does not mean that sexual attraction is not an innate preference. The naive evo-psych perspective implies that warfighting is more often an innate preference for young males, and not for women, very young children, and elders. Innate preferences can be deactivated or inverted by reality, as in the turn against warfighting in post-WW1 Europe, or post-Vietnam USA. Or they can be activated.

anthonyc on On Eating the Sun

Without making any normative arguments: if you're in a position (industrially and technologically) to disassemble the sun at all, or build something like a Dyson swarm, then it's probably not too difficult to build an artificial system to light the Earth in such a way as to mimic the sun, and make it look and feel nearly identical to biological humans living on the surface, using less than a billionth of the sun's normal total light output. The details of tides might be tricky, but probably not out of reach.

307th on Activation space interpretability may be doomed

Nice post! I think these are good criticisms that don't justify the title. Points 1 through 4 are all (specific, plausible) examples of ways we may interpret the activation space incorrectly. This is worth keeping in mind, and I agree that just looking at the activation space of a single layer isn't enough, but it still seems like a very good place to start.

A layer's activation is a relatively simple space, constructed by the model, that contains all the information that the model needs to make its prediction. This makes it a great place to look if you're trying to understand how the model's thinking.

programcrafter on Is "VNM-agent" one of several options, for what minds can grow up into?

Yes, many people will have problems with the Archimedean axiom because it implies that everything has a price (that any good option can be probability-diluted enough that a mediocre one is chosen instead), and people don't take it kindly when you tell them "you absolutely must have a trade-off between value A and value B" - especially if they really don't have a trade-off, but also if they don't want to admit or consciously estimate it.

Thankfully, that VNM property is not that critical for rational decision-making because we can simply use surreal numbers instead.

One possible real-world example (with integer-valued for deterministic outcomes) would be a parent whose top priority is minimizing the number of their children who die within the parent's lifetime, with the rest of their utility function being secondary.

That wouldn't work well, since real-world outcomes are non-deterministic; given that, minimizing the expected number is accomplished by simply having zero children.
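A minimal sketch of that last point, under stated assumptions: the parent's preferences are modeled as a lexicographic pair (compared like a Python tuple), each child is assumed to die within the parent's lifetime with some probability `p_death`, and the number of children stands in for the "rest of the utility function". The function names, `p_death`, and that stand-in are all hypothetical illustrations, not anything from the original post.

```python
from fractions import Fraction

def lexicographic_utility(n_children, p_death):
    """Return (primary, secondary); tuples compare lexicographically in Python.

    primary:   minus the expected number of children who die (top priority).
    secondary: number of children, a hypothetical stand-in for everything else.
    """
    primary = -n_children * p_death  # linearity of expectation
    secondary = n_children
    return (primary, secondary)

def best_family_size(max_children, p_death):
    """Pick the family size with the highest lexicographic utility."""
    return max(range(max_children + 1),
               key=lambda n: lexicographic_utility(n, p_death))

# Any nonzero risk makes the primary term dominate, so zero children wins.
print(best_family_size(5, Fraction(1, 100)))
# Only with exactly zero risk does the secondary term get to decide.
print(best_family_size(5, Fraction(0)))
```

The secondary term matters only when the primary term is exactly tied, which is why a strictly lexicographic priority behaves so differently once outcomes are uncertain.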

experience-machine on Activation space interpretability may be doomed

I see, thanks for sharing!

lblack on Activation space interpretability may be doomed

The third term in that. Though it was in a somewhat different context related to the weight partitioning project mentioned in the last paragraph, not SAE training.

Yes, brittle in hyperparameters. It was also just very painful to train in general. I wouldn't straightforwardly extrapolate our experience to a standard SAE setup though, we had a lot of other things going on in that optimisation.

experience-machine on Activation space interpretability may be doomed

Did you use something like as described here? By brittle do you mean w.r.t. the sparsity penalty (and other hyperparameters)?

tin482 on What Indicators Should We Watch to Disambiguate AGI Timelines?

I think the best argument for a "fastest-er" timeline would be that several of your bottlenecks end up heavily substituting against each other or some common factor. A researcher in NLP in 2015 might reasonably have guessed it'd take decades to reach the level of ChatGPT - after all, it would require breakthroughs in parsing, entailment, word sense, semantics, world knowledge... In reality these capabilities were all blessings of scale.

o1 may or may not be the central breakthrough in this scenario, but I can paint a world where it is, and that world is very fast indeed. RL, like next word prediction, is "mechanism agnostic" - it induces whatever capabilities are instrumental to maximizing reward. Already, "back-tracking" or "error-correcting" behavior has emerged, previously cited by LeCun and others as a fundamental block. In this world RL applied to "Chains of Action" strongly induces agentic structures like adaptability, tool use, note caching, goal-directedness, and coherence. Gradient descent successfully routes around any weaknesses in LLMs (as I suspect it does for weaknesses in our LLM training pipelines today). By the time we reach the same engineering effort in post-training that we've dedicated to pretraining, it's AGI. Well, given the scale of OpenAI's investment, we should be able to rule on this scenario pretty quickly.

I'd say I'm skeptical of the specifics (RL hasn't demonstrated this kind of rich success in the past) but more uncertain about the outline (How well could e.g. fast adaptation trade off against multimodality, gullibility, coherence?)

tchauvin on Predicting AI Releases Through Side Channels

Nice attempt. This reminds me of the Pizza Meter and Gay Bar Index related to Pentagon crisis situations. I found it hard to find reliable information on this when I looked (I can't even find a good link to share), but the mechanism seems plausible.