Posts

The Plan - 2024 Update 2024-12-31T13:29:53.888Z
The Field of AI Alignment: A Postmortem, and What To Do About It 2024-12-26T18:48:07.614Z
What Have Been Your Most Valuable Casual Conversations At Conferences? 2024-12-25T05:49:36.711Z
The Median Researcher Problem 2024-11-02T20:16:11.341Z
Three Notions of "Power" 2024-10-30T06:10:08.326Z
Information vs Assurance 2024-10-20T23:16:25.762Z
Minimal Motivation of Natural Latents 2024-10-14T22:51:58.125Z
Values Are Real Like Harry Potter 2024-10-09T23:42:24.724Z
We Don't Know Our Own Values, but Reward Bridges The Is-Ought Gap 2024-09-19T22:22:05.307Z
Why Large Bureaucratic Organizations? 2024-08-27T18:30:07.422Z
... Wait, our models of semantics should inform fluid mechanics?!? 2024-08-26T16:38:53.924Z
Interoperable High Level Structures: Early Thoughts on Adjectives 2024-08-22T21:12:38.223Z
A Robust Natural Latent Over A Mixed Distribution Is Natural Over The Distributions Which Were Mixed 2024-08-22T19:19:28.940Z
What is "True Love"? 2024-08-18T16:05:47.358Z
Some Unorthodox Ways To Achieve High GDP Growth 2024-08-08T18:58:56.046Z
A Simple Toy Coherence Theorem 2024-08-02T17:47:50.642Z
A Solomonoff Inductor Walks Into a Bar: Schelling Points for Communication 2024-07-26T00:33:42.000Z
(Approximately) Deterministic Natural Latents 2024-07-19T23:02:12.306Z
Dialogue on What It Means For Something to Have A Function/Purpose 2024-07-15T16:28:56.609Z
3C's: A Recipe For Mathing Concepts 2024-07-03T01:06:11.944Z
Corrigibility = Tool-ness? 2024-06-28T01:19:48.883Z
What is a Tool? 2024-06-25T23:40:07.483Z
Towards a Less Bullshit Model of Semantics 2024-06-17T15:51:06.060Z
My AI Model Delta Compared To Christiano 2024-06-12T18:19:44.768Z
My AI Model Delta Compared To Yudkowsky 2024-06-10T16:12:53.179Z
Natural Latents Are Not Robust To Tiny Mixtures 2024-06-07T18:53:36.643Z
Calculating Natural Latents via Resampling 2024-06-06T00:37:42.127Z
Value Claims (In Particular) Are Usually Bullshit 2024-05-30T06:26:21.151Z
When Are Circular Definitions A Problem? 2024-05-28T20:00:23.408Z
Why Care About Natural Latents? 2024-05-09T23:14:30.626Z
Some Experiments I'd Like Someone To Try With An Amnestic 2024-05-04T22:04:19.692Z
Examples of Highly Counterfactual Discoveries? 2024-04-23T22:19:19.399Z
Why Would Belief-States Have A Fractal Structure, And Why Would That Matter For Interpretability? An Explainer 2024-04-18T00:27:43.451Z
Generalized Stat Mech: The Boltzmann Approach 2024-04-12T17:47:31.880Z
How We Picture Bayesian Agents 2024-04-08T18:12:48.595Z
Coherence of Caches and Agents 2024-04-01T23:04:31.320Z
Natural Latents: The Concepts 2024-03-20T18:21:19.878Z
The Worst Form Of Government (Except For Everything Else We've Tried) 2024-03-17T18:11:38.374Z
The Parable Of The Fallen Pendulum - Part 2 2024-03-12T21:41:30.180Z
The Parable Of The Fallen Pendulum - Part 1 2024-03-01T00:25:00.111Z
Leading The Parade 2024-01-31T22:39:56.499Z
A Shutdown Problem Proposal 2024-01-21T18:12:48.664Z
Some Vacation Photos 2024-01-04T17:15:01.187Z
Apologizing is a Core Rationalist Skill 2024-01-02T17:47:35.950Z
The Plan - 2023 Version 2023-12-29T23:34:19.651Z
Natural Latents: The Math 2023-12-27T19:03:01.923Z
Talk: "AI Would Be A Lot Less Alarming If We Understood Agents" 2023-12-17T23:46:32.814Z
Principles For Product Liability (With Application To AI) 2023-12-10T21:27:41.403Z
What I Would Do If I Were Working On AI Governance 2023-12-08T06:43:42.565Z
On Trust 2023-12-06T19:19:07.680Z

Comments

Comment by johnswentworth on Fabien's Shortform · 2025-01-10T15:10:23.640Z · LW · GW

Yeah, I'm aware of that model. I personally generally expect the "science on model organisms"-style path to contribute basically zero value to aligning advanced AI, because (a) the "model organisms" in question are terrible models, in the sense that findings on them will predictably not generalize to even moderately different/stronger systems (like e.g. this story), and (b) in practice IIUC that sort of work is almost exclusively focused on the prototypical failure story of strategic deception and scheming, which is a very narrow slice of the AI extinction probability mass.

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-10T15:09:32.292Z · LW · GW

Also (separate comment because I expect this one to be more divisive): I think the scheming story has been disproportionately memetically successful largely because it's relatively easy to imagine hacky ways of preventing an AI from intentionally scheming. And that's mostly a bad thing; it's a form of streetlighting.

Comment by johnswentworth on johnswentworth's Shortform · 2025-01-10T15:08:38.933Z · LW · GW

I think a very common problem in alignment research today is that people focus almost exclusively on a specific story about strategic deception/scheming, and that story is a very narrow slice of the AI extinction probability mass. At some point I should probably write a proper post on this, but for now here are a few off-the-cuff example AI extinction stories which don't look like the prototypical scheming story. (These are copied from a Facebook thread.)

  • Perhaps the path to superintelligence looks like applying lots of search/optimization over shallow heuristics. Then we potentially die to things which aren't smart enough to be intentionally deceptive, but nonetheless have been selected-upon to have a lot of deceptive behaviors (via e.g. lots of RL on human feedback).
  • The "Getting What We Measure" scenario from Paul's old "What Failure Looks Like" post.
  • The "fusion power generator scenario".
  • Perhaps someone trains a STEM-AGI, which can't think about humans much at all. In the course of its work, that AGI reasons that an oxygen-rich atmosphere is very inconvenient for manufacturing, and aims to get rid of it. It doesn't think about humans at all, but the human operators can't understand most of the AI's plans anyway, so the plan goes through. As an added bonus, nobody can figure out why the atmosphere is losing oxygen until it's far too late, because the world is complicated and becomes more so with a bunch of AIs running around and no one AI has a big-picture understanding of anything either (much like today's humans have no big-picture understanding of the whole human economy/society).
  • People try to do the whole "outsource alignment research to early AGI" thing, but the human overseers are themselves sufficiently incompetent at alignment of superintelligences that the early AGI produces a plan which looks great to the overseers (as it was trained to do), and that plan totally fails to align more-powerful next-gen AGI at all. And at that point, they're already on the more-powerful next gen, so it's too late.
  • The classic overnight hard takeoff: a system becomes capable of self-improving at all but doesn't seem very alarmingly good at it, somebody leaves it running overnight, exponentials kick in, and there is no morning.
  • (At least some) AGIs act much like a colonizing civilization. Plenty of humans ally with it, trade with it, try to get it to fight their outgroup, etc, and the AGIs locally respect the agreements with the humans and cooperate with their allies, but the end result is humanity gradually losing all control and eventually dying out.
  • Perhaps early AGI involves lots of moderately-intelligent subagents. The AI as a whole mostly seems pretty aligned most of the time, but at some point a particular subagent starts self-improving, goes supercritical, and takes over the rest of the system overnight. (Think cancer, but more agentic.)
  • Perhaps the path to superintelligence looks like scaling up o1-style runtime reasoning to the point where we're using an LLM to simulate a whole society. But the effects of a whole society (or parts of a society) on the world are relatively decoupled from the things-individual-people-say-taken-at-face-value. For instance, lots of people talk a lot about reducing poverty, yet have basically-no effect on poverty. So developers attempt to rely on chain-of-thought transparency, and shoot themselves in the foot.

Comment by johnswentworth on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-08T16:31:09.613Z · LW · GW

  • Tricks that work on smaller scales often don't generalize to larger scales.
  • Tricks that work on larger scales often don't work on smaller scales (due to bigger ML models having various novel emergent properties).

My understanding is that these two claims are mostly false in practice. In particular, there have been a few studies (like e.g. this) which try to run yesterday's algorithms with today's scale, and today's algorithms with yesterday's scale, in order to attribute progress to scale vs algorithmic improvements. I haven't gone through those studies in very careful detail, but my understanding is that they pretty consistently find today's algorithms outperform yesterday's algorithms even when scaled down, and yesterday's algorithms underperform today's even when scaled up. So unless I've badly misunderstood those studies, the mental model in which different tricks work best on different scales is basically just false, at least at the range of different scales the field has gone through in the past ~decade.

That said, there are cases where I could imagine Ilya's claim making sense, e.g. if the "experiments" he's talking about are experiments in using the net rather than training the net. Certainly one can do qualitatively different things with GPT4 than GPT2, so if one is testing e.g. a scaffolding setup or a net's ability to play a particular game, then one needs to use the larger net. Perhaps that's what Ilya had in mind?

Comment by johnswentworth on Alexander Gietelink Oldenziel's Shortform · 2025-01-07T18:39:16.195Z · LW · GW

I don't remember the details, but IIRC ZIP is mostly based on Lempel-Ziv, and it's fairly straightforward to modify Lempel-Ziv to allow for efficient local decoding.
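
For illustration, here's a minimal sketch of one standard way to get local decoding (compress fixed-size blocks independently and index them, so reading a range only touches the blocks it overlaps); this is a generic approach, not necessarily the exact modification I had in mind:

```python
# Minimal sketch: independent per-block compression gives local reads that only
# decompress the blocks a requested range touches.
import zlib

BLOCK = 64 * 1024  # tradeoff: larger blocks -> better ratio, worse locality


def compress_blocked(data: bytes) -> list[bytes]:
    # Each block is compressed on its own, so any byte range can be recovered
    # without decompressing the whole stream.
    return [zlib.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]


def read_range(blocks: list[bytes], start: int, length: int) -> bytes:
    # Assumes the requested range lies within the original data.
    out = bytearray()
    pos = start
    while len(out) < length:
        block_idx, offset = divmod(pos, BLOCK)
        chunk = zlib.decompress(blocks[block_idx])[offset:offset + length - len(out)]
        out += chunk
        pos += len(chunk)
    return bytes(out)

# e.g.: blocks = compress_blocked(weight_bytes); read_range(blocks, 10_000, 256)
```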

My guess would be that the large majority of the compression achieved by ZIP on NN weights is because the NN weights are mostly-roughly-standard-normal, and IEEE floats are not very efficient for standard normal variables. So ZIP achieves high compression for "kinda boring reasons", in the sense that we already knew all about that compressibility but just don't leverage it in day-to-day operations because our float arithmetic hardware uses IEEE.
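
As a back-of-envelope check on that guess (my own sketch, assuming the weights really are roughly standard normal; the sample size and step sizes are just illustrative): a unit normal quantized with step ε needs only about 2 + log2(1/ε) bits per value, far below the 32 bits float32 spends.

```python
# Back-of-envelope: bits/weight needed for a standard normal at a given precision,
# versus the 32 bits IEEE float32 always spends.
import math

import numpy as np

# Differential entropy of N(0,1): 0.5 * log2(2*pi*e) ~= 2.05 bits.
h_normal = 0.5 * math.log2(2 * math.pi * math.e)

for eps in (2**-8, 2**-12, 2**-16):
    # High-resolution approximation: entropy of the quantized value ~ h + log2(1/eps).
    print(f"step {eps:.1e}: ~{h_normal + math.log2(1 / eps):.1f} bits/weight (float32 uses 32)")

# Empirical sanity check on synthetic "weights".
rng = np.random.default_rng(0)
w = rng.standard_normal(1_000_000)
q = np.round(w / 2**-12).astype(np.int64)  # uniform quantization, step 2^-12
_, counts = np.unique(q, return_counts=True)
p = counts / counts.sum()
print(f"empirical entropy at step 2^-12: {-(p * np.log2(p)).sum():.1f} bits/weight")
```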

Comment by johnswentworth on Is "hidden complexity of wishes problem" solved? · 2025-01-06T01:00:31.169Z · LW · GW

Short answer: no.

Longer answer: we need to distinguish between two things people might have in mind when they say that LLMs "solve the hidden complexity of wishes problem".

First, one might imagine that LLMs "solve the hidden complexity of wishes problem" because they're able to answer natural-language questions about humans' wishes much the same way a human would. Alas, that's a misunderstanding of the problem. If the ability to answer natural-language questions about humans' wishes in human-like ways were all we needed in order to solve the "hidden complexity of wishes" problem, then a plain old human would be a solution to the problem; one could just ask the human. Part of the problem is that humans themselves understand their own wishes so poorly that their own natural-language responses to questions are not a safe optimization target either.

Second, one might imagine LLMs "solve the hidden complexity of wishes problem" because when we ask an LLM to solve a problem, it solves the problem in a human-like way. It's not about the LLM's knowledge of humans' (answers to questions about their) wishes, but rather about LLMs solving problems and optimizing in ways which mimic human problem-solving and optimization. And that does handle the hidden complexity problem... but only insofar as we continue to use LLMs in exactly the same way. If we start e.g. scaling up o1-style methods, or doing HCH, or put the LLM in some other scaffolding so we're not directly asking it to solve a problem and then using the human-like solutions it generates... then we're (potentially) back to having a hidden complexity problem. For each of those different methods of using the LLM to solve problems, we have to separately consider whether the human-mimicry properties of the LLM generalize to that method enough to handle the hidden complexity issue.

(Toy example: suppose we use LLMs to mimic a very very large organization. Like most real-world organizations, information and constraints end up fairly siloed/modularized, so some parts of the system are optimizing for e.g. "put out the fire" and don't know that grandma's in the house at all. And then maybe that part of the system chooses a nice efficient fire-extinguishing approach which kills grandma, like e.g. collapsing the house and then smothering it.)

And crucially: if AI is ever to solve problems too hard for humans (which is one of its main value propositions), we're definitely going to need to do something with LLMs besides use them to solve problems in human-like ways.

Comment by johnswentworth on Values Are Real Like Harry Potter · 2025-01-05T12:22:41.382Z · LW · GW

If you can still have values without reward signals that tell you about them, then doesn't that mean your values are defined by more than just what the "screen" shows? That even if you could see and understand every part of someone's reward system, you still wouldn't know everything about their values?

No.

An analogy: suppose I run a small messaging app, and all the users' messages are stored in a database. The messages are also cached in a faster-but-less-stable system. One day the database gets wiped for some reason, so I use the cache to repopulate the database.

In this example, even though I use the cache to repopulate the database in this one weird case, it is still correct to say that the database is generally the source of ground truth for user messages in the system; the weird case is in fact weird. (Indeed, that's exactly how software engineers would normally talk about it.)

Spelling out the analogy: in a human brain in ordinary operation, our values (I claim) ground out in the reward stream, analogous to the database. There's still a bunch of "caching" of values, and in weird cases like the one you suggest, one might "repopulate" the reward stream from the "cached" values elsewhere in the system. But it's still correct to say that the reward stream is generally the source of ground truth for values in the system; the weird case is in fact weird.

Comment by johnswentworth on Values Are Real Like Harry Potter · 2025-01-02T23:08:33.308Z · LW · GW

Good question.

First and most important: if you know beforehand that you're at risk of entering such a state, then you should (according to your current values) probably put mechanisms in place to pressure your future self to restore your old reward stream. (This is not to say that fully preserving the reward stream is always the right thing to do, but the question of when one shouldn't conserve one's reward stream is a separate one which we can factor apart from the question at hand.)

... and AFAICT, it happens that the human brain already works in a way which would make that happen to some extent by default. In particular, most of our day-to-day planning draws on cached value-estimates which would still remain, at least for a time, even if the underlying rewards suddenly zeroed out.

... and it also happens that other humans, like e.g. your friends, would probably prefer (according to their values) for you to have roughly-ordinary reward signals rather than zeros. So that would also push in a similar direction.

And again, you might decide to edit the rewards away from the original baseline afterwards. But that's a separate question.

On the other hand, consider a mind which was never human in the first place, never had any values or rewards, and is given the same ability to modify its rewards as in your hypothetical. Then - I claim - that mind has no particular reason to favor any rewards at all. (Although we humans might prefer that it choose some particular rewards!)

Your question touched on several different things, so let me know if that missed the parts you were most interested in.

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-02T15:06:36.630Z · LW · GW

Interpretability is not currently trying to look at AIs to determine whether they will kill us. That's way too advanced for where we're at.

Right, and that's a problem. There's this big qualitative gap between the kinds of questions interp is even trying to address today, and the kinds of questions it needs to address. It's the gap between talking about stuff inside the net, and talking about stuff in the environment (which the stuff inside the net represents).

And I think the focus on LLMs is largely to blame for that gap seeming "way too advanced for where we're at". I expect it's much easier to cross if we focus on image models instead.

(And to be clear, even after crossing the internals/environment gap, there will still be a long ways to go before we're ready to ask about e.g. whether an AI will kill us. But the internals/environment gap is the main qualitative barrier I know of; after that it should be more a matter of iteration and ordinary science.)

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-02T12:28:22.926Z · LW · GW

Great question.

First, I do think of systems biology as one of the main fields where the tools we're developing should apply.

But I would not say that fields like artificial life or computational biology have done much to cross the theory-practice gap, mainly because they have little theory at all. Artificial life, at least what I've seen of it, is mostly just running various cool simulations without any generalizable theory at all. When there is "theory", it's usually e.g. someone vibing about the free energy principle, but it's just vibing without any gears behind it. Levin's work is very cool, but doesn't seem to involve any theory beyond kinda vibing about communication and coordination. (Though of course all this could just be my own ignorance talking, please do let me know if there's some substantive theory I haven't heard of.)

Uri Alon's book is probably the best example I know of actual theory in biology, but most of the specifics are too narrow for our purposes. Some of it is a useful example to keep in mind.

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-02T12:19:13.492Z · LW · GW

The preprocessing itself is one of the main important things we need to understand (I would even argue it's the main important thing), if our interpretability methods are ever going to tell us about how the stuff-inside-the-net relates to the stuff-in-the-environment (which is what we actually care about).

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-01T14:31:27.788Z · LW · GW

But why do we care more about statistical relationships between physical humans and dogs than statistical relationships between the word "human" and the word "dog" as characters on your screen?

An overly-cute but not completely wrong answer: because I care about whether AI kills all the physical humans, not whether something somewhere writes the string "kill all the humans". My terminal-ish values are mostly over the physical stuff.

I think the point you're trying to make is roughly "well, it's all pretty entangled with the physical stuff anyway, so why favor one medium or another? Instrumentally, either suffices.". And the point I'm trying to make in response is "it matters a lot how complicated the relationship is between the medium and the physical stuff, because terminally it's the physical we care about, so instrumentally stuff that's more simply related to the physical stuff is a lot more useful to understand.".

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-01T11:34:20.431Z · LW · GW

I don't know how much the difficulty of crossing the theory-practice gap has deviated from your expectations since then.

It's been pretty on-par.

But I would indeed be worried that a lot of the difficulty is going to be in getting any good results for deep learning, and that finding additional theoretical/conceptual results in other settings doesn't constitute much progress on that.

Amusingly, I tend to worry more about the opposite failure mode: findings on today's nets won't generalize to tomorrow's nets (even without another transformers-level paradigm shift), and therefore leveraging evidence from other places is the only way to do work which will actually be relevant.

(More accurately, I worry that the relevance or use-cases of findings on today's nets won't generalize to tomorrow's nets. Central example: if we go from a GPT-style LLM to a much bigger o1/o3-style model which is effectively simulating a whole society talking to each other, then the relationship between the tokens and the real-world effects of the system changes a lot. So even if work on the GPT-style models tells us something about the o1/o3-style models, its relevance is potentially very different.)

I assume that was some other type of experiment involving image generators? (and the notion of "working well" there isn't directly comparable to what you tried now?)

Yeah, that was on a little MNIST net. And the degree of success I saw in that earlier experiment was actually about on par with what we saw in our more recent experiments, our bar was just quite a lot higher this time around. This time we were aiming for things like e.g. "move one person's head" rather than "move any stuff in any natural way at all".

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-01T11:24:23.746Z · LW · GW

Interpretability on an LLM might, for example, tell me a great deal about the statistical relationships between the word "human" and the word "dog" in various contexts. And the "trivial" sense in which this tells me about physical stuff is that the texts in question are embedded in the world - they're characters on my screen, for instance.

The problem is that I don't care that much about the characters on my screen in-and-of themselves. I mostly care about the characters on my screen insofar as they tell me about other things, like e.g. physical humans and dogs.

So, say I'm doing interpretability work on an LLM, and I find some statistical pattern between the word "human" and the word "dog". (Flag: this is oversimplified compared to actual interp.) What does that pattern tell me about physical humans and physical dogs, the things I actually care about? How does that pattern even relate to physical humans and physical dogs? Well shit, that's a whole very complicated question in its own right.

On the other hand, if I'm doing interp work on an image generator... I'm forced to start lower-level, so by the time I'm working with things like humans and dogs I've already understood a whole lot of stuff about the lower-level patterns which constitute humans and dogs (which is itself probably useful, that's exactly the sort of thing I want to learn about). Then I find some relationship between the human parts of the image and the dog parts of the image, and insofar as the generator was trained on real-world images, that much more directly tells me about physical humans and dogs and how they relate statistically (like e.g. where they're likely to be located relative to each other).

Comment by johnswentworth on The Plan - 2024 Update · 2025-01-01T01:25:20.775Z · LW · GW

The thing I ultimately care about is patterns in our physical world, like trees or humans or painted rocks. I am interested in patterns in speech/text (like e.g. bigram distributions) basically only insofar as they tell me something useful about patterns in the physical world. I am also interested in patterns in pixels only insofar as they tell me something useful about patterns in the physical world. But it's a lot easier to go from "pattern in pixels" to "pattern in physical world" than from "pattern in tokens" to "pattern in physical world". (Excluding the trivial sense in which tokens are embedded in the physical world and therefore any pattern in tokens is a pattern in the physical world; that's not what we're talking about here.)

That's the sense in which pixels are "closer to the metal", and why I care about that property.

Does that make sense?

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-31T14:39:55.205Z · LW · GW

I currently think broad technical knowledge is the main requisite, and I think self-study can suffice for the large majority of that in principle. The main failure mode I see would-be autodidacts run into is motivation, but if you can stay motivated then there's plenty of study materials.

For practice solving novel problems, just picking some interesting problems (preferably not AI) and working on them for a while is a fine way to practice.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-30T22:19:15.355Z · LW · GW

I’m not convinced that the “hard parts” of alignment are difficult in the standardly difficult, g-requiring way that e.g., a physics post-doc might possess.

To be clear, I wasn't talking about physics postdocs mainly because of raw g. Raw g is a necessary element, and physics postdocs are pretty heavily loaded on it, but I was talking about physics postdocs mostly because of the large volume of applied math tools they have.

The usual way that someone sees footholds on the hard parts of alignment is to have a broad enough technical background that they can see some analogy to something they know about, and try borrowing tools that work on that other thing. Thus the importance of a large volume of technical knowledge.

Comment by johnswentworth on Fabien's Shortform · 2024-12-29T08:38:47.262Z · LW · GW

Kudos for correctly identifying the main cruxy point here, even though I didn't talk about it directly.

The main reason I use the term "propaganda" here is that it's an accurate description of the useful function of such papers, i.e. to convince people of things, as opposed to directly advancing our cutting-edge understanding/tools. The connotation is that propagandists over the years have correctly realized that presenting empirical findings is not a very effective way to convince people of things, and that applies to these papers as well.

And I would say that people are usually correct to not update much on empirical findings! Not Measuring What You Think You Are Measuring is a very strong default, especially among the type of papers we're talking about here.

Comment by johnswentworth on The Median Researcher Problem · 2024-12-29T08:18:14.000Z · LW · GW

The big names do tend to have disproportionate weight, but they're few in number, and when a big name promotes something the median researcher doesn't understand everyone just kind of shrugs and ignores them. Memeticity selects who the big names are, much more so than vice-versa.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T08:10:48.037Z · LW · GW

All four of those I think are basically useless in practice for purposes of progress toward aligning significantly-smarter-than-human AGI, including indirectly (e.g. via outsourcing alignment research to AI). There are perhaps some versions of all four which could be useful, but those versions do not resemble any work I've ever heard of anyone actually doing in any of those categories.

That said, many of those do plausibly produce value as propaganda for the political cause of AI safety, especially insofar as they involve demoing scary behaviors.

EDIT-TO-ADD: Actually, I guess I do think the singular learning theorists are headed in a useful direction, and that does fall under your "science of generalization" category. Though most of the potential value of that thread is still in interp, not so much black-box calculation of RLCTs.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-29T00:16:28.542Z · LW · GW

If you're thinking mainly about interp, then I basically agree with what you've been saying. I don't usually think of interp as part of "prosaic alignment", it's quite different in terms of culture and mindset and it's much closer to what I imagine a non-streetlight-y field of alignment would look like. 90% of it is crap (usually in streetlight-y ways), but the memetic selection pressures don't seem too bad.

If we had about 10x more time than it looks like we have, then I'd say the field of interp is plausibly on track to handle the core problems of alignment.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T23:53:32.846Z · LW · GW

This is the sort of object-level discussion I don't want on this post, but I've left a comment on Fabien's list.

Comment by johnswentworth on Fabien's Shortform · 2024-12-28T23:52:35.900Z · LW · GW

Someone asked what I thought of these, so I'm leaving a comment here. It's kind of a drive-by take, which I wouldn't normally leave without more careful consideration and double-checking of the papers, but the question was asked so I'm giving my current best answer.

First, I'd separate the typical value prop of these sort of papers into two categories:

  • Propaganda-masquerading-as-paper: the paper is mostly valuable as propaganda for the political agenda of AI safety. Scary demos are a central example. There can legitimately be value here.
  • Object-level: gets us closer to aligning substantially-smarter-than-human AGI, either directly or indirectly (e.g. by making it easier/safer to use weaker AI for the problem).

My take: many of these papers have some value as propaganda. Almost all of them provide basically-zero object-level progress toward aligning substantially-smarter-than-human AGI, either directly or indirectly.

Notable exceptions:

  • Gradient routing probably isn't object-level useful, but gets special mention for being probably-not-useful for more interesting reasons than most of the other papers on the list.
  • Sparse feature circuits is the right type-of-thing to be object-level useful, though not sure how well it actually works.
  • Better SAEs are not a bottleneck at this point, but there's some marginal object-level value there.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T23:09:36.362Z · LW · GW

My impression, from conversations with many people, is that the claim which gets clamped to True is not "this research direction will/can solve alignment" but instead "my research is high value". So when I've explained to someone why their current direction is utterly insufficient, they usually won't deny that some class of problems goes unaddressed. They'll instead tell me that the research still seems valuable even though it isn't addressing a bottleneck, or that their research is maybe a useful part of a bigger solution which involves many other parts, or that their research is maybe a useful step toward something better.

(Though admittedly I usually try to "meet people where they're at", by presenting failure-modes which won't parse as weird to them. If you're just directly explaining e.g. dangers of internal RSI, I can see where people might instead just assume away internal RSI or some such.)

... and then if I were really putting in effort, I'd need to explain that e.g. being a useful part of a bigger solution (which they don't know the details of) is itself a rather difficult design constraint which they have not at all done the work to satisfy. But usually I wrap up the discussion well before that point; I generally expect that at most one big takeaway from a discussion can stick, and if they already have one then I don't want to overdo it.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T23:00:42.551Z · LW · GW

This post isn't intended to convince anyone at all that people are in fact streetlighting. This post is intended to present my own models and best guesses at what to do about it to people who are already convinced that most researchers in the field are streetlighting. They are the audience.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T09:48:17.881Z · LW · GW

I think you have two main points here, which require two separate responses. I'll do them opposite the order you presented them.

Your second point, paraphrased: 90% of anything is crap, that doesn't mean there's no progress. I'm totally on board with that. But in alignment today, it's not just that 90% of the work is crap, it's that the most memetically successful work is crap. It's not the raw volume of crap that's the issue so much as the memetic selection pressures.

Your first point, paraphrased: progress toward the hard problem does not necessarily immediately look like tackling the meat of the hard problem directly. I buy that to some extent, but there are plenty of cases where we can look at what people are doing and see pretty clearly that it is not progress toward the hard problem, whether direct or otherwise. And indeed, I would claim that prosaic alignment as a category is a case where people are not making progress on the hard problems, whether direct or otherwise. In particular, one relevant criterion to look at here is generalizability: is the work being done sufficiently general/robust that it will still be relevant once the rest of the problem is solved (and multiple things change in not-yet-predictable ways in order to solve the rest of the problem)? See e.g. this recent comment for an object-level example of what I mean.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-28T00:13:21.860Z · LW · GW

From the post:

... but crucially, the details of the rationalizations aren't that relevant to this post. Someone who's flinching away from a hard problem will always be able to find some rationalization. Argue them out of one (which is itself difficult), and they'll promptly find another. If we want people to not streetlight, then we need to somehow solve the flinching.

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T20:20:06.509Z · LW · GW

High variance. A lot of mathematics programs allow one to specialize in fairly narrow subjects IIUC, which does not convey a lot of general technical skill. I'm sure there are some physics programs which are relatively narrow, but my impression is that physics programs typically force one to cover a pretty wide volume of material.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-27T20:17:28.779Z · LW · GW

I mean, there are lots of easy benchmarks on which I can solve the large majority of the problems, and a language model can also solve the large majority of the problems, and the language model can often have a somewhat lower error rate than me if it's been optimized for that. Seems like GPQA (and GPQA diamond) are yet another example of such a benchmark.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-27T19:12:02.704Z · LW · GW

Even assuming you're correct here, I don't see how that would make my original post pretty misleading?

Comment by johnswentworth on The Field of AI Alignment: A Postmortem, and What To Do About It · 2024-12-27T17:47:01.741Z · LW · GW

Simply focusing on physics post-docs feels too narrow to me.

The post did explicitly say "Obviously that doesn't mean we exclusively want physics postdocs".

Comment by johnswentworth on leogao's Shortform · 2024-12-27T13:32:40.239Z · LW · GW

This comment seems to implicitly assume markers of status are the only way to judge quality of work. You can just, y'know, look at it? Even without doing a deep dive, the sort of papers or blog posts which present good research have a different style and rhythm to them than the crap. And it's totally reasonable to declare that one's audience is the people who know how to pick up on that sort of style.

The bigger reason we can't entirely escape "status"-ranking systems is that there's far too much work to look at it all, so people have to choose which information sources to pay attention to.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-27T04:38:33.560Z · LW · GW

I remember finishing early, and then spending a lot of time going back over all of them a second time, because the goal of the workshop was to answer correctly with very high confidence. I don't think I updated any answers as a result of the second pass, though I don't remember very well.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-27T03:03:26.644Z · LW · GW

@Buck Apparently the five problems I tried were GPQA diamond, they did not take anywhere near 30 minutes on average (more like 10 IIRC?), and I got 4/5 correct. So no, I do not think that modern LLMs probably outperform (me with internet access and 30 minutes).

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-26T23:21:32.859Z · LW · GW

I don't know, I have not specifically tried GPQA diamond problems. I'll reply again if and when I do.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-26T19:04:56.202Z · LW · GW

Is this with internet access for you?

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-26T18:36:30.528Z · LW · GW

That's the opposite of my experience. Nearly all the papers I read vary between "trash, I got nothing useful out besides an idea for a post explaining the relevant failure modes" and "high quality but not relevant to anything important". Setting up our experiments is historically much faster than the work of figuring out what experiments would actually be useful.

There are exceptions to this, large projects which seem useful and would require lots of experimental work, but they're usually much lower-expected-value-per-unit-time than going back to the whiteboard, understanding things better, and doing a simpler experiment once we know what to test.

Comment by johnswentworth on What Have Been Your Most Valuable Casual Conversations At Conferences? · 2024-12-25T12:41:30.961Z · LW · GW

How does one make a question post these days? That was my original intent, but the old button is gone.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-24T15:52:38.074Z · LW · GW

On o3: for what feels like the twentieth time this year, I see people freaking out, saying AGI is upon us, it's the end of knowledge work, timelines now clearly in single-digit years, etc, etc. I basically don't buy it, my low-confidence median guess is that o3 is massively overhyped. Major reasons:

  • I've personally done 5 problems from GPQA in different fields and got 4 of them correct (allowing internet access, which was the intent behind that benchmark). I've also seen one or two problems from the software engineering benchmark. In both cases, when I look at the actual problems in the benchmark, they are easy, despite people constantly calling them hard and saying that they require expert-level knowledge.
    • For GPQA, my median guess is that the PhDs they tested on were mostly pretty stupid. Probably a bunch of them were e.g. bio PhD students at NYU who would just reflexively give up if faced with even a relatively simple stat mech question which can be solved with a couple minutes of googling jargon and blindly plugging two numbers into an equation.
    • For software engineering, the problems are generated from real git pull requests IIUC, and it turns out that lots of those are things like e.g. "just remove this if-block".
    • Generalizing the lesson here: the supposedly-hard benchmarks for which I have seen a few problems (e.g. GPQA, software eng) turn out to be mostly quite easy, so my prior on other supposedly-hard benchmarks which I haven't checked (e.g. FrontierMath) is that they're also mostly much easier than they're hyped up to be.
  • On my current model of Sam Altman, he's currently very desperate to make it look like there's no impending AI winter, capabilities are still progressing rapidly, etc. Whether or not it's intentional on Sam Altman's part, OpenAI acts accordingly, releasing lots of very over-hyped demos. So, I discount anything hyped out of OpenAI, and doubly so for products which aren't released publicly (yet).
  • Over and over again in the past year or so, people have said that some new model is a total game changer for math/coding, and then David will hand it one of the actual math or coding problems we're working on and it will spit out complete trash. And not like "we underspecified the problem" trash, or "subtle corner case" trash. I mean like "midway through the proof it redefined this variable as a totally different thing and then carried on as though both definitions applied". The most recent model with which this happened was o1.
    • Of course I am also tracking the possibility that this is a skill issue on our part, and if that's the case I would certainly love for someone to help us do better. See this thread for a couple examples of relevant coding tasks.
    • My median-but-low-confidence guess here is that basically-all the people who find current LLMs to be a massive productivity boost for coding are coding things which are either simple, or complex only in standardized ways - e.g. most web or mobile apps. That's the sort of coding which mostly involves piping things between different APIs and applying standard patterns, which is where LLMs shine.

Comment by johnswentworth on Shortform · 2024-12-24T00:23:15.284Z · LW · GW

  • O3 scores higher on FrontierMath than the top graduate students

I'd guess that's basically false. In particular, I'd guess that:

  • o3 probably does outperform mediocre grad students, but not actual top grad students. This guess is based on generalization from GPQA: I personally tried 5 GPQA problems in different fields at a workshop and got 4 of them correct, whereas the benchmark designers claim the rates at which PhD students get them right are much lower than that. I think the resolution is that the benchmark designers tested on very mediocre grad students, and probably the same is true of the FrontierMath benchmark.
  • the amount of time humans spend on the problem is a big factor - human performance has compounding returns on the scale of hours invested, whereas o3's performance basically doesn't have compounding returns in that way. (There was a graph floating around which showed this pretty clearly, but I don't have it on hand at the moment.) So plausibly o3 outperforms humans who are not given much time, but not humans who spend a full day or two on each problem.

Comment by johnswentworth on fake alignment solutions???? · 2024-12-13T16:24:09.702Z · LW · GW

I'm pretty uncertain on this one. Could a superintelligence find a plan which fools me? Yes. Will such a plan show up early on in a search order without actively trying to fool me? Ehh... harder to say. It's definitely a possibility I keep in mind. Most importantly, over time as our understanding improves on the theory side, it gets less and less likely that a plan which would fool me shows up early in a natural search order.

Comment by johnswentworth on fake alignment solutions???? · 2024-12-11T13:53:16.816Z · LW · GW

Note that Eliezer and I and probably Tammy would all tell you that "can't see how it fails" is not a very useful bar to aim for. At the bare minimum, we should have both (1) a compelling positive case that it in fact works, and (2) a compelling positive case that even if we missed something crucial, failure is likely to be recoverable rather than catastrophic.

Comment by johnswentworth on Core Pathways of Aging · 2024-12-07T02:59:03.164Z · LW · GW

I don't know of one, though I haven't followed the literature much the past couple years.

Comment by johnswentworth on johnswentworth's Shortform · 2024-12-06T18:08:30.969Z · LW · GW

Basically every time a new model is released by a major lab, I hear from at least one person (not always the same person) that it's a big step forward in programming capability/usefulness. And then David gives it a try, and it works qualitatively the same as everything else: great as a substitute for stack overflow, can do some transpilation if you don't mind generating kinda crap code and needing to do a bunch of bug fixes, and somewhere between useless and actively harmful on anything even remotely complicated.

It would be nice if there were someone who tries out every new model's coding capabilities shortly after they come out, reviews it, and gives reviews with a decent chance of actually matching David's or my experience using the thing (90% of which will be "not much change") rather than getting all excited every single damn time. But also, to be a useful signal, they still need to actually get excited when there's an actually significant change. Anybody know of such a source?

EDIT-TO-ADD: David has a comment below with a couple examples of coding tasks.

Comment by johnswentworth on Should there be just one western AGI project? · 2024-12-03T20:07:41.663Z · LW · GW

On my understanding, the push for centralization came from a specific faction whose pitch was basically:

  • here's the scaling laws for tokamaks
  • here's how much money we'd need
  • ... so let's make one real big tokamak rather than spending money on lots of little research devices.

... and that faction mostly won the competition for government funding for about half a century.

The current boom accepted that faction's story at face value, but then noticed that new materials allowed the same "scale up the tokamaks" strategy to be executed on a budget achievable with private funding, and therefore they could fund projects without having to fight the faction which won the battle for government funding.

The counterfactual which I think is probably correct is that there exist entirely different designs far superior to tokamaks, which don't require that much scale in the first place, but which were never discovered because the "scale up the tokamaks" faction basically won the competition for funding and stopped most research on alternative designs from happening.

Comment by johnswentworth on Should there be just one western AGI project? · 2024-12-03T17:11:03.231Z · LW · GW

If you mean the Manhattan Project: no. IIUC there were basically zero Western groups and zero dollars working toward the bomb before that, so the Manhattan Project clearly sped things up. That's not really a case of "centralization" so much as doing-the-thing-at-all vs not-doing-the-thing-at-all.

If you mean fusion: yes. There were many fusion projects in the sixties, people were learning quickly. Then the field centralized, and progress slowed to a crawl.

Comment by johnswentworth on Should there be just one western AGI project? · 2024-12-03T17:00:41.931Z · LW · GW

I think this is missing the most important consideration: centralization would likely massively slow down capabilities progress.

Comment by johnswentworth on leogao's Shortform · 2024-11-29T06:23:23.423Z · LW · GW

the number one spontaneous conversation is "what are you working on" or "what have you done so far", which forces you to re-explain what you're doing & the reasons for doing it to a skeptical & ignorant audience

I'm very curious if others also find this to be the biggest value-contributor amongst spontaneous conversations. (Also, more generally, I'm curious what kinds of spontaneous conversations people are getting so much value out of.)

Comment by johnswentworth on leogao's Shortform · 2024-11-28T06:24:38.509Z · LW · GW

I have heard people say this so many times, and it is consistently the opposite of my experience. The random spontaneous conversations at conferences are disproportionately shallow and tend toward the same things which have been discussed to death online already, or toward the things which seem simple enough that everyone thinks they have something to say on the topic. When doing an activity with friends, it's usually the activity which is novel and/or interesting, while the conversation tends to be shallow and playful and fun but not as substantive as the activity. At work, spontaneous conversations generally had little relevance to the actual things we were/are working on (there are some exceptions, but they're rarely as high-value as ordinary work).

Comment by johnswentworth on You are not too "irrational" to know your preferences. · 2024-11-27T19:02:31.652Z · LW · GW

The ice cream snippets were good, but they felt too much like they were trying to be a relatively obvious not-very-controversial example of the problems you're pointing at, rather than a central/prototypical example. Which is good as an intro, but then I want to see it backed up by more central examples.

The dishes example was IMO the best in the post, more like that would be great.

Unfiltered criticism was discussed in the abstract, it wasn't really concrete enough to be an example. Walking through an example conversation (like the ice cream thing) would help.

Mono vs open vs poly would be a great example, but it needs an actual example conversation (like the ice cream thing), not just a brief mention. Same with career choice. I want to see how specifically the issues you're pointing to come up in those contexts.

(Also TBC it's an important post, and I'm glad you wrote it.)