Posts

Are You More Real If You're Really Forgetful? 2024-11-24T19:30:55.233Z
Towards the Operationalization of Philosophy & Wisdom 2024-10-28T19:45:07.571Z
Thane Ruthenis's Shortform 2024-09-13T20:52:23.396Z
A Crisper Explanation of Simulacrum Levels 2023-12-23T22:13:52.286Z
Idealized Agents Are Approximate Causal Mirrors (+ Radical Optimism on Agent Foundations) 2023-12-22T20:19:13.865Z
Most People Don't Realize We Have No Idea How Our AIs Work 2023-12-21T20:02:00.360Z
How Would an Utopia-Maximizer Look Like? 2023-12-20T20:01:18.079Z
Don't Share Information Exfohazardous on Others' AI-Risk Models 2023-12-19T20:09:06.244Z
The Shortest Path Between Scylla and Charybdis 2023-12-18T20:08:34.995Z
A Common-Sense Case For Mutually-Misaligned AGIs Allying Against Humans 2023-12-17T20:28:57.854Z
"Humanity vs. AGI" Will Never Look Like "Humanity vs. AGI" to Humanity 2023-12-16T20:08:39.375Z
Current AIs Provide Nearly No Data Relevant to AGI Alignment 2023-12-15T20:16:09.723Z
Hands-On Experience Is Not Magic 2023-05-27T16:57:10.531Z
A Case for the Least Forgiving Take On Alignment 2023-05-02T21:34:49.832Z
World-Model Interpretability Is All We Need 2023-01-14T19:37:14.707Z
Internal Interfaces Are a High-Priority Interpretability Target 2022-12-29T17:49:27.450Z
In Defense of Wrapper-Minds 2022-12-28T18:28:25.868Z
Accurate Models of AI Risk Are Hyperexistential Exfohazards 2022-12-25T16:50:24.817Z
Corrigibility Via Thought-Process Deference 2022-11-24T17:06:39.058Z
Value Formation: An Overarching Model 2022-11-15T17:16:19.522Z
Greed Is the Root of This Evil 2022-10-13T20:40:56.822Z
Are Generative World Models a Mesa-Optimization Risk? 2022-08-29T18:37:13.811Z
AI Risk in Terms of Unstable Nuclear Software 2022-08-26T18:49:53.726Z
Broad Picture of Human Values 2022-08-20T19:42:20.158Z
Interpretability Tools Are an Attack Channel 2022-08-17T18:47:28.404Z
Convergence Towards World-Models: A Gears-Level Model 2022-08-04T23:31:33.448Z
What Environment Properties Select Agents For World-Modeling? 2022-07-23T19:27:49.646Z
Goal Alignment Is Robust To the Sharp Left Turn 2022-07-13T20:23:58.962Z
Reframing the AI Risk 2022-07-01T18:44:32.478Z
Is This Thing Sentient, Y/N? 2022-06-20T18:37:59.380Z
The Unified Theory of Normative Ethics 2022-06-17T19:55:19.588Z
Towards Gears-Level Understanding of Agency 2022-06-16T22:00:17.165Z
Poorly-Aimed Death Rays 2022-06-11T18:29:55.430Z
Reshaping the AI Industry 2022-05-29T22:54:31.582Z
Agency As a Natural Abstraction 2022-05-13T18:02:50.308Z

Comments

Comment by Thane Ruthenis on OpenAI releases deep research agent · 2025-02-03T20:57:27.291Z · LW · GW

If anyone is game for creating an agentic research scaffold like that Thane describes

Here's the basic structure in more detail, as envisioned after 5 minutes' thought (a rough pseudocode sketch follows the list):

  • You feed a research prompt to the "Outer Loop" of a model, maybe have a back-and-forth fleshing out the details.
  • The Outer Loop decomposes the research into several promising research directions/parallel subproblems.
  • Each research direction/subproblem is handed off to a "Subagent" instance of the model.
  • Each Subagent runs search queries on the web and analyses the results, up to the limits of its context window. After the analysis, it's prompted to evaluate (1) which of the results/sources are most relevant and which should be thrown out, (2) whether this research direction is promising and what follow-up questions are worth asking.
    • If a Subagent is very eager to pursue a follow-up question, it can either run a subsequent search query (if there's enough space in the context window), or it's prompted to distill its current findings and replace itself with a next-iteration Subagent, in whose context it loads only the most important results + its analyses + the follow-up question.
    • This is allowed up to some iteration count.
  • Once all Subagents have completed their research, instantiate an Evaluator instance, into whose context window we dump the final results of each Subagent's efforts (distilling if necessary). The Evaluator integrates the information from all parallel research directions and determines whether the research prompt has been satisfactorily addressed, and if not, what follow-up questions are worth pursuing.
  • The Evaluator's final conclusions are dumped into the Outer Loop's context (without the source documents, to not overload the context window).
  • If the Evaluator did not choose to terminate, the next generation of Subagents is spawned, each prompted with whatever contexts are recommended by the Evaluator.
  • Iterate, spawning further Evaluator instances and Subproblem instances as needed.
  • Once the Evaluator chooses to terminate, or some compute upper-bound is reached, the Outer Loop instantiates a Summarizer, into which it dumps all of the Evaluator's analyses + all of the most important search results.
  • The Summarizer is prompted to generate a high-level report outline, then write out each subsection, then the subsections are patched together into a final report.
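
For concreteness, here's a rough Python sketch of that control flow. Everything in it is assumed/hypothetical: `call_llm` and `web_search` are stand-ins for whatever model API and search tool you'd actually plug in, and the prompts are placeholders, not the real ones.

```python
# Hypothetical sketch of the Outer Loop / Subagent / Evaluator / Summarizer flow.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("plug in your model API here")

def web_search(query: str) -> list[str]:
    raise NotImplementedError("plug in your search tool here")

MAX_SUBAGENT_ITERATIONS = 3   # "allowed up to some iteration count"
MAX_OUTER_ROUNDS = 4          # compute upper-bound for the Outer Loop

def run_subagent(direction: str) -> str:
    """One research direction: search, analyze, optionally replace self with a distilled successor."""
    context = direction
    for _ in range(MAX_SUBAGENT_ITERATIONS):
        results = web_search(call_llm(f"Write a web-search query for: {context}"))
        analysis = call_llm(
            "Results:\n" + "\n".join(results)
            + f"\n\nSubproblem: {context}\n"
            "Which results are most relevant? Is this direction promising? "
            "What single follow-up question (if any) is worth asking? If none, say NONE."
        )
        if "NONE" in analysis:
            return analysis
        # Distill findings + follow-up question into the next-iteration Subagent's context.
        context = call_llm(f"Distill the key findings and the follow-up question:\n{analysis}")
    return context

def deep_research(research_prompt: str) -> str:
    """Outer Loop: decompose, spawn Subagents, Evaluate, iterate, then Summarize."""
    directions = [d for d in call_llm(
        f"Decompose into parallel research directions, one per line:\n{research_prompt}"
    ).splitlines() if d.strip()]
    evaluator_notes: list[str] = []

    for _ in range(MAX_OUTER_ROUNDS):
        findings = [run_subagent(d) for d in directions]
        verdict = call_llm(
            "Integrate these findings:\n" + "\n---\n".join(findings)
            + f"\n\nResearch prompt: {research_prompt}\n"
            "If satisfactorily addressed, answer DONE. Otherwise list follow-up directions, one per line."
        )
        evaluator_notes.append(verdict)
        if verdict.strip().startswith("DONE"):
            break
        directions = [d for d in verdict.splitlines() if d.strip()]

    # Summarizer: outline first, then write each section, then patch them together.
    outline = call_llm("Write a high-level report outline based on:\n" + "\n".join(evaluator_notes))
    sections = [
        call_llm(f"Write the report section '{s}' based on:\n" + "\n".join(evaluator_notes))
        for s in outline.splitlines() if s.strip()
    ]
    return outline + "\n\n" + "\n\n".join(sections)
```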

Here's what this complicated setup is supposed to achieve:

  • Do a pass over an actually large, diverse set of sources. Most such web-search tools (Google's, DeepSeek's, or this shallow replication) are basically only allowed to make one search query, and then they have to contend with whatever got dumped into their context window. If the first search query turns out to be poorly targeted, in a way that only becomes obvious after looking through the results, the AI's out of luck.
  • Avoid falling into "rabbit holes", i. e. some arcane-and-useless subproblems the model becomes obsessed with. Subagents would be allowed to fall into them, but presumably the Evaluator steps, with a bird's-eye view of the picture, would recognize that as a failure mode and not recommend that the Outer Loop follow it.
  • Attempt to patch together a "bird's eye view" on the entirety of the results given the limitations of the context window. Subagents and Evaluators would use their limited scope to figure out what results are most important, then provide summaries + recommendations based on what's visible from their limited vantage points. The Outer Loop and then the Summarizer instances, prompted with information-dense distillations of what's visible from each lower-level vantage point, should effectively be looking at (a decent approximation of) the full scope of the problem.

Has anything like this been implemented in the open-source community or one of the countless LLM-wrapper startups? I would expect so, since it seems like an obvious thing to try + the old AutoGPT scaffolds worked in a somewhat similar manner... But it's possible the market's totally failed to deliver.

It should be relatively easy to set up using Flowise plus e. g. Firecrawl. If nothing like this has been implemented, I'm putting a ~~$500~~ $250 bounty on it.

Edit: Alright, this seems like something very close to it.

Comment by Thane Ruthenis on OpenAI releases deep research agent · 2025-02-03T19:00:36.830Z · LW · GW

I really wonder how much of the perceived performance improvement comes from agent-y training, as opposed to not sabotaging the format of the model's answer + letting it do multi-turn search.

Compared to most other LLMs, Deep Research is able to generate reports up to 10k words, which is very helpful for providing comprehensive-feeling summaries. And unlike Google's analogue, it's not been fine-tuned to follow some awful deliberately shallow report template[1].

In other words: I wonder how much of Deep Research's capability can be replicated by putting Claude 3.5.1 in some basic agency scaffold where it's able to do multi-turn web search, and is then structurally forced to generate a long-form response (say, by always prompting with something like "now generate a high-level report outline", then with "now write section 1", "now write section 2", and then just patching those responses together).
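
A minimal sketch of that "structurally forced" long-form step, reusing the hypothetical `call_llm` stub from the sketch above (the prompts are, again, just placeholders):

```python
def forced_long_report(topic: str, search_dump: str) -> str:
    """Baseline: multi-turn search dump -> forced outline -> forced section-by-section writing."""
    outline = call_llm(
        f"Topic: {topic}\nSources:\n{search_dump}\n\nNow generate a high-level report outline."
    )
    headings = [h for h in outline.splitlines() if h.strip()]
    sections = [
        call_llm(
            f"Topic: {topic}\nSources:\n{search_dump}\nOutline:\n{outline}\n\n"
            f"Now write section {i + 1}: {heading}"
        )
        for i, heading in enumerate(headings)
    ]
    # Patch the forced sections together into the final long-form report.
    return outline + "\n\n" + "\n\n".join(sections)
```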

Have there been some open-source projects that already did so? If not, anyone willing to try? This would provide a "baseline" for estimating how well OpenAI's agency training actually works, i. e. how much of the improvement comes from it.

Subjectively, its use as a research tool is limited - it found only 18 sources in a five-minute search

Another test to run: if you give those sources to Sonnet 3.5.1 (or a Gemini model, if Sonnet's context window is too small) and ask it to provide a summary, how far from Deep Research's does it feel in quality/insight?

  1. ^

    Which is the vibe I got from Google's Deep Research when I'd been experimenting with it. I think its main issue isn't any lack of capability, but that the fine-tuning dataset for how the final reports are supposed to look was bad: it's been taught to have atrocious aesthetic taste.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-02-03T07:02:12.707Z · LW · GW

Ah, I think I see my error.

Consider a CoT of the form "[problem statement]: ...........", which always outputs dots regardless of the problem statement. Can that still encode useful computations, with the information being passed through intermediate self-attention? Yes, but: this only increases parallel-computing capabilities, not serial-computing capabilities.

And in the framework I'd outlined, under the Paraphraser, for the purposes of scheme-y thinking, even a legible CoT would consist of such "filler" tokens.

So yep, you're right, this does force all scheming to be taking place within one forward pass.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-02-03T01:54:58.239Z · LW · GW

I don't see why it would be harder to do.

Suppose you tasked a model with refactoring some database, and it secretly wants to sabotage its functionality. It outputs a CoT segment that goes, "I need to design unit tests for this module". Each word of this CoT is mapped to some internal representation, and the internal representation consists of "on-task" human-intended reasoning $t_k$, and misaligned scheming $s_k$ (corresponding to, e. g., a tactical decision to fail to include unit tests for some subtle issue). "I" is mapped to $(t_1, s_1)$, "need" is mapped to $(t_2, s_2)$, etc. Each $t_{k+1}$ iterates on $t_k$, each $s_{k+1}$ iterates on $s_k$: we have a sequence of serial computations.

In other words: as the model outputs, "I need to design unit tests for this module", there's an unspoken subtext of "I need to sabotage the unit tests here", and the model packs nine words' worth of serial computations into both the "textual" $t$ and the "subtextual" $s$ aspects of it.

Now you use the Paraphraser on the string, generating alternatives like "the next step is to generate some unit tests" and "several unit tests for this module are needed here". The intent, if I understand you correctly, is that the semantic-equivalence-but-syntactic-difference would ensure that the $k$-th token of this phrase is mapped to $t_k$, as if the paraphrasing didn't happen, but that the mapping to $s_k$ would be broken.

Except, why would that happen? What privileges the $t$-representations over the $s$-representations? Why would $t_k$ be preserved whereas $s_k$ is broken, instead of both of them remaining functional (and so preserving scheming), or both breaking (and so compromising the model's capabilities)?

In other words: We're basically just teaching the model to fail to distinguish between equivalent-to-a-human rephrasings. Why would this lead to the task-related representations becoming indifferent to the specifics of a phrasing whereas the scheming-related representations remain sensitive to it? Why wouldn't this just map the subtextual "I need to sabotage the unit tests here" to similarly semantically-equivalent-but-syntactically-different "let us compromise the relevant unit-test suite" or "I shall ensure the unit tests don't work properly"?

I expect it'd just lead to the model's computations becoming indifferent to the details of a phrasing in full generality. And if the model were inclined to associate the general idea of generating unit tests with tactically screwing the task up, the model would come up with this idea regardless of the details of a phrasing.

It wouldn't interfere with its ability to engage in scheming-relevant serial computations, much like it doesn't interfere with its ability to engage in task-relevant serial computations, for the exact same reasons.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-02-02T22:12:23.907Z · LW · GW

Some thoughts on protecting against LLM steganography.

tl;dr: I suspect the Paraphraser idea is near-completely ineffectual as you scale the capabilities, both for scheming and for general CoT interpretability.

In the reasoning-as-communication framework, the amount of information a fixed-length message/CoT can transfer is upper-bounded by information-theoretic constraints. That is, if we have a ten-bit message, it cannot communicate more than ten bits of information: it cannot always uniquely specify one internal state of an AI model from more than $2^{10}$ internal states.

However, the amount of information is not lower-bounded. Depending on the decoder, a ten-bit message can communicate between zero and ten bits of information. (A decoder that's just a constant function recovers zero bits, a decoder that only pays attention to the first bit recovers only one bit, et cetera.)
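
In symbols (just restating the above; $M$ is the ten-bit message, $D$ the decoder):

```latex
% Information recovered by a decoder D from a ten-bit message M is upper-bounded
% by the message entropy, but not lower-bounded:
\[
I\bigl(M; D(M)\bigr) \;\le\; H(M) \;\le\; 10 \text{ bits},
\qquad
\begin{cases}
D(m) = \text{const} & \Rightarrow\ I\bigl(M; D(M)\bigr) = 0 \text{ bits},\\
D(m) = m_1 \ (\text{first bit only}) & \Rightarrow\ I\bigl(M; D(M)\bigr) \le 1 \text{ bit},\\
D(m) = m \ (\text{identity}) & \Rightarrow\ I\bigl(M; D(M)\bigr) = H(M) \le 10 \text{ bits}.
\end{cases}
\]
```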

The idea of steganography is that, if we pick a CoT string, and run both an AI model's internal decoders and a human's language-comprehension decoders on it, the AI model would extract more bits of information from it. Paraphrasing, humans' decoders are "coarser": they have broader equivalence classes of messages that map to the same internal representation, compared to AI models.

For example, "a dog crossed the street" and "a dog traveled across the street" might map to the same internal representation for a human, but to different ones for an AI model. Similarly, things like "quantum butterflies juggled metaphysical spaghetti" and "ambitious lampshade collected vintage moonbeams" can map to the same "nonsense" internal representation in humans, but have some useful meaning for AI models. The bits a human fails to make use of correspond to "steganography" or "superhuman latent knowledge".

Proposals such as Shoggoth + Paraphraser + Face aim to destroy any information in CoTs that cannot be recovered by humans, by mapping each CoT string to the equivalence class of strings that all map to the same internal representation in humans.

However, there's an issue: what if the empty string already communicates plenty of steganographically encoded information that humans aren't privy to? That is: what if some knowledge is hard-coded into an AI model, such that it's elicited on any activation, and leads to it converging towards reaching the same hidden conclusions even after we destroy its ability to sneak in additional query-relevant information?

For example:

  • If the AI model implicitly learned the Theory of Everything, that knowledge is going to be activated on any prompt – and so, no amount of paraphrasing is going to force it to talk about the Theory of Everything in humanese. Communicating the ToE from one instance of the model to its future self takes up zero bits.
  • If the AI model has internalized some hidden goals, the heuristics for pursuing those goals are likewise going to be activated on any prompt, and it would engage in massively productive hidden scheming on any prompt. Importantly, the information-bottleneck constraints imposed by the Paraphraser wouldn't even be hitting the hidden-plotting compartment specifically. They'd just uniformly lower the model's computational power.
    • E. g., in the Shoggoth + Paraphraser + Face paradigm, that'd just mean the model learns to think in longer CoTs (because it needs longer messages to communicate the same amount of information), not to do fewer "sneaky" computations.[1]

Like, imagine that we have two different internal-representation sets: $T$, containing problem-solving strategies for the overt task we're presenting the model, and $S$, related to some misaligned scheming. We would like to constrain the model such that it cannot plan about $S$; such that its CoTs contain zero bits for narrowing down which state out of all of $S$ it must pick.

This assumes something like the following model: A given CoT string communicates some $n$ bits of information. Of them, $m \le n$ bits are human-legible. The model then follows the algorithm of: use the $m$ task-relevant bits on picking a state out of $T$, and use any leftover $n - m$ bits to pick a state out of $S$. We then destroy the string's information content until $n = m$ (via the Paraphraser). Now the model doesn't have the ability to encode any non-human-legible bits, and if it uses any bits on picking a state out of $S$, they would necessarily be human-legible, and we can spot the scheming by reading the CoT.
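
Spelled out, that naive accounting looks something like this (my notation, nothing formal from the proposal itself):

```latex
% Naive model: bits are spent sequentially, first on T, then on S, so capping
% the total at the human-legible budget m leaves zero bits for S.
\[
\underbrace{n}_{\text{total bits in CoT}}
= \underbrace{m}_{\text{human-legible, spent on } T}
+ \underbrace{(n - m)}_{\text{illegible, spent on } S},
\qquad
\text{Paraphraser: } n \to m
\;\Rightarrow\;
\text{bits left for } S = 0.
\]
```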

But that's not how this works. Instead, the model picks both simultaneously: a CoT is directly mapped to a $(t, s)$ two-tuple, with $t \in T$ and $s \in S$. Which means that if the model can narrow down the state of $T$ to one out of $k$ possibilities, it can narrow down the state of $S$ to one out of $k$ possibilities in the same move. Even if we scramble the CoT until it communicates no more bits than a human can extract out of it, that doesn't differentially constrain the model's ability to think about scheming (= communicate scheming-relevant information to its future self).

It's a uniform communication constraint/computational-complexity penalty.

Generalizing beyond scheming: if the model were inclined to not put any of its thoughts into humanese (due to superhuman latent knowledge, say), the Paraphraser wouldn't address this issue either. If a model's plans contain some catastrophic side-effects (which aren't even malign, just something we didn't tell it not to cause), the Paraphraser wouldn't reveal that either.

@Daniel Kokotajlo, thoughts?

(This doesn't touch on the Face part of the proposal, i. e., the idea that we want to not pressure the model to conceal any scheme-y cognition in the hopes it'd naively reveal them. But if there's some pressure to conceal that, the Paraphraser would not help with it at all. It'd just teach the model to map humanese-equivalent strings to equivalent scheming-states, which isn't particularly useful.)

  1. ^

    Arguably, if you also penalize the CoT length, this should cause a pressure towards throwing out any computations unrelated to the object-level problem the model's being trained on... But then the whole "paraphraser" bit doesn't actually do anything, you can achieve the same outcome by just directly penalizing the CoT length.

    Which, inasmuch as DL training loops use some form of simplicity prior, should be happening by default anyway. I. e., this doesn't address the pressures that we expect to lead to steganography/hidden plotting despite the simplicity prior: which is to say, arguments that steganography/plotting are the actual optimal solutions to the training problem.

    Consider the Theory of Everything example: those hidden computations are not actually "irrelevant computations" that we need to displace, they're directly useful. Similarly, any misaligned mesa-objectives the model forms would presumably be convergently useful for performance on the training objective: that would be why they formed to begin with. So the information-bottleneck trickery would do ~nothing against them.

Comment by Thane Ruthenis on CBiddulph's Shortform · 2025-02-02T03:41:51.120Z · LW · GW

Such an entity as described could absolutely be an existential threat to humanity

I agree. I think you don't even need most of the stuff on the "superhuman" list, the equivalent of a competent IQ-130 human upload probably does it, as long as it has the speed + self-copying advantages.

Comment by Thane Ruthenis on CBiddulph's Shortform · 2025-01-31T03:43:19.033Z · LW · GW

The problem with this neat picture is reward-hacking. This process wouldn't optimize for better performance on fuzzy tasks, it would optimize for performance on fuzzy tasks that looks better to the underlying model. And much like RLHF doesn't scale to superintelligence, this doesn't scale to superhuman fuzzy-task performance.

It can improve the performance a bit. But once you ramp up the optimization pressure, "better performance" and "looks like better performance" would decouple from each other and the model would train itself into idiosyncratic uselessness. (Indeed: if it were this easy, doesn't this mean you should be able to self-modify into a master tactician or martial artist by running some simulated scenarios in your mind, improving without bound, and without any need to contact reality?)

... Or so my intuition goes. It's possible that this totally works for some dumb reason. But I don't think so. RL has a long-standing history of problems with reward-hacking, and LLMs' judgement is one of the most easily hackable things out there.

(Note that I'm not arguing that recursive self-improvement is impossible in general. But RLAIF, specifically, just doesn't look like the way.)

Comment by Thane Ruthenis on johnswentworth's Shortform · 2025-01-30T18:47:52.337Z · LW · GW

The above toy model assumed that we're picking one signal at a time, and that each such "signal" specifies the intended behavior for all organs simultaneously...

... But you're right that the underlying assumption there was that the set of possible desired behaviors is discrete (i. e., that X in "kidneys do X" is a discrete variable, not a vector of reals). That might've indeed assumed me straight out of the space of reasonable toy models for biological signals, oops.

Comment by Thane Ruthenis on johnswentworth's Shortform · 2025-01-30T15:14:20.901Z · LW · GW

Yeah but if something is in the general circulation (bloodstream), then it’s going everywhere in the body. I don’t think there’s any way to specifically direct it.

The point wouldn't be to direct it, but to have different mixtures of chemicals (and timings) to mean different things to different organs.

Loose analogy: Suppose that the intended body behaviors ("kidneys do X, heart does Y, brain does Z" for all combinations of X, Y, Z) are latent features, basic chemical substances and timings are components of the input vector, and there are dramatically more intended behaviors than input-vector components. Can we define the behavior-controlling function of organs (distributed across organs) such that, for any intended body behavior, there's a signal that sets the body into approximately this state?

It seems that yes. The number of almost-orthogonal vectors in $n$ dimensions scales exponentially with $n$, so we simply need to make the behavior-controlling function sensitive to these almost-orthogonal directions, rather than the chemical-basis vectors. The mappings from the input vector to the output behaviors, for each organ, would then be some complicated mixtures, not a simple "chemical A sets all organs into behavior X".
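
A quick numerical illustration of just the "exponentially many almost-orthogonal directions" point (nothing about the biology; the dimensions and counts are made up):

```python
import numpy as np

rng = np.random.default_rng(0)
n_dims = 200          # number of "basic" signal components (illustrative assumption)
n_signals = 10_000    # far more intended behaviors than signal components

# Each row is one "signal direction": a random unit vector in n_dims dimensions.
signals = rng.standard_normal((n_signals, n_dims))
signals /= np.linalg.norm(signals, axis=1, keepdims=True)

# Sample pairs and check how close to orthogonal they are.
idx = rng.integers(0, n_signals, size=(100_000, 2))
mask = idx[:, 0] != idx[:, 1]
cosines = np.abs(np.einsum("ij,ij->i", signals[idx[mask, 0]], signals[idx[mask, 1]]))
print(f"max |cosine| over sampled pairs: {cosines.max():.3f}")   # ~0.3, i.e. nearly orthogonal
```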

This analogy seems flawed in many ways, but I think something directionally-like-this might be happening?

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T05:16:57.442Z · LW · GW

Have you looked at samples of CoT of o1, o3, deepseek, etc. solving hard math problems?

Certainly (experimenting with r1's CoTs right now, in fact). I agree that they're not doing the brute-force stuff I mentioned; that was just me outlining a scenario in which a system "technically" clears the bar you'd outlined, yet I end up unmoved (I don't want to end up goalpost-moving).

Though neither are they being "strategic" in the way I expect they'd need to be in order to productively use a billion-token CoT.

Anyhow, this is nice, because I do expect that probably something like this milestone will be reached before AGI

Yeah, I'm also glad to finally have something concrete-ish to watch out for. Thanks for prompting me!

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T02:18:29.605Z · LW · GW

Not for math benchmarks. Here's one way it can "cheat" at them: suppose that the CoT would involve the model generating candidate proofs/derivations, then running an internal (learned, not hard-coded) proof verifier on them, and either rejecting the candidate proof and trying to generate a new one, or outputting it. We know that this is possible, since we know that proof verifiers can be compactly specified.
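
A sketch of the "cheating" pattern I mean, with `generate_candidate_proof` and `verify_proof` as stand-ins for learned components inside the CoT rather than real functions:

```python
def generate_candidate_proof(problem: str) -> str:
    raise NotImplementedError("stand-in for a learned proof/derivation proposer")

def verify_proof(problem: str, candidate: str) -> bool:
    raise NotImplementedError("stand-in for a learned, compactly-specifiable proof verifier")

def solve_benchmark_problem(problem: str, max_attempts: int = 10_000) -> str | None:
    """Brute generate-and-verify inside a huge CoT: no strategy, just propose, check, repeat."""
    for _ in range(max_attempts):
        candidate = generate_candidate_proof(problem)
        if verify_proof(problem, candidate):
            return candidate   # output the first candidate that passes the internal check
    return None                # token budget exhausted without a verified answer
```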

This wouldn't actually show "agency" and strategic thinking of the kinds that might generalize to open-ended domains and "true" long-horizon tasks. In particular, this would mostly fail the condition (2) from my previous comment.

Something more open-ended and requiring "research taste" would be needed. Maybe a comparable performance on METR's benchmark would work for this (i. e., the model can beat a significantly larger fraction of it at 1 billion tokens compared to 1 million)? Or some other benchmark that comes closer to evaluating real-world performance.

Edit: Oh, math-benchmark performance would convince me if we get access to a CoT sample and it shows that the model doesn't follow the above "cheating" approach, but instead approaches the problem strategically (in some sense). (Which would also require this CoT not to be hopelessly steganographied, obviously.)

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T01:19:22.159Z · LW · GW

Prooobably ~simultaneously, but I can maybe see it coming earlier and in a way that isn't wholly convincing to me. In particular, it would still be a fixed-length task; much longer-length than what the contemporary models can reliably manage today, but still hackable using poorly-generalizing "agency templates" instead of fully general "compact generators of agenty behavior" (which I speculate humans to have and RL'd LLMs not to). It would be some evidence in favor of "AI can accelerate AI R&D", but not necessarily "LLMs trained via SSL+RL are AGI-complete".

Actually, I can also see it coming later. For example, suppose the capability researchers invent some method for reliably-and-indefinitely extending the amount of serial computations a reasoning model can productively make use of, but the compute or memory requirements grow very fast with the length of a CoT. Some fairly solid empirical evidence and theoretical arguments in favor of boundless scaling can appear quickly, well before the algorithms are made optimal enough to (1) handle weeks-long CoTs and/or (2) allow wide adoption (thus making it available to you).

I think the second scenario is more plausible, actually.

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-29T00:29:45.118Z · LW · GW

I wish we had something to bet on better than "inventing a new field of science,"

I've thought of one potential observable that is concrete, should be relatively low-capability, and should provoke a strong update towards your model for me:

If there is an AI model such that the complexity of R&D problems it can solve (1) scales basically boundlessly with the amount of serial compute provided to it (or to a "research fleet" based on it), (2) scales much faster with serial compute than with parallel compute, and (3) the required amount of human attention ("babysitting") is constant or grows very slowly with the amount of serial compute.

This attempts to directly get at the "autonomous self-correction" and "ability to think about R&D problems strategically" ideas.

I've not fully thought through all possible ways reality could Goodhart to this benchmark, i. e. "technically" pass it but in a way I find unconvincing. For example, if I failed to include the condition (2), o3 would have probably already "passed" it (since it potentially achieved better performance on ARC-AGI and FrontierMath by sampling thousands of CoTs then outputting the most frequent answer). There might be other loopholes like this...

But it currently seems reasonable and True-Name-y to me.

Comment by Thane Ruthenis on DeepSeek Panic at the App Store · 2025-01-28T22:22:11.625Z · LW · GW

Well, he didn't do it yet either, did he? His new announcement is, likewise, just that: an announcement. Manifold is still 35% on him not following through on it, for example.

Comment by Thane Ruthenis on DeepSeek Panic at the App Store · 2025-01-28T21:41:29.886Z · LW · GW

"Intensely competent and creative", basically, maybe with a side of "obsessed" (with whatever they're cracked at).

Comment by Thane Ruthenis on DeepSeek Panic at the App Store · 2025-01-28T21:36:16.220Z · LW · GW

Supposedly Trump announced that back in October, so it should already be priced in.

(Here's my attempt at making sense of it, for what it's worth.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-28T21:34:31.279Z · LW · GW

Here's a potential interpretation of the market's apparent strange reaction to DeepSeek-R1 (shorting Nvidia).

I don't fully endorse this explanation, and the shorting may or may not have actually been due to Trump's tariffs + insider trading, rather than DeepSeek-R1. But I see a world in which reacting this way to R1 arguably makes sense, and I don't think it's an altogether implausible world.

If I recall correctly, the amount of money globally spent on inference dwarfs the amount of money spent on training. Most economic entities are not AGI labs training new models, after all. So the impact of DeepSeek-R1 on the pretraining scaling laws is irrelevant: sure, it did not show that you don't need bigger data centers to get better base models, but that's not where most of the money was anyway.

And my understanding is that, on the inference-time scaling paradigm, there isn't yet any proven method of transforming arbitrary quantities of compute into better performance:

  • Reasoning models are bottlenecked on the length of CoTs that they've been trained to productively make use of. They can't fully utilize even their context windows; the RL pipelines just aren't up to that task yet. And if that bottleneck were resolved, the context-window bottleneck would be next: my understanding is that infinite context/"long-term" memories haven't been properly solved either, and it's unknown how they'd interact with the RL stage (probably they'd interact okay, but maybe not).
    • o3 did manage to boost its ARC-AGI and (maybe?) FrontierMath performance by... generating a thousand guesses and then picking the most common one...? But who knows how that really worked, and how practically useful it is. (See e. g. this, although that paper examines a somewhat different regime.)
  • Agents, from Devin to Operator to random open-source projects, are still pretty terrible. You can't set up an ecosystem of agents in a big data center and let them rip, such that the ecosystem's power scales boundlessly with the data center's size. For all but the most formulaic tasks, you still need a competent human closely babysitting everything they do, which means you're still mostly bottlenecked on competent human attention.

Suppose that you don't expect the situation to improve: that the inference-time scaling paradigm would hit a ceiling pretty soon, or that it'd converge to distilling search into forward passes (such that the end users end up using very little compute on inference, like today), and that agents just aren't going to work out the way the AGI labs promise.

In such a world, a given task can either be completed automatically by an AI for some fixed quantity of compute X, or it cannot be completed by an AI at all. Pouring ten times more compute on it does nothing.

In such a world, if it were shown that the compute needs of a task can be met with ten times less compute than previously expected, this would decrease the expected demand for compute.

The fact that capable models can be run locally might increase the number of people willing to use them (e. g., those very concerned about data privacy), as might the ability to automatically complete 10x as many trivial tasks. But it's not obvious that this demand spike will be bigger than the simultaneous demand drop.

And I, at least, when researching ways to set up DeepSeek-R1 locally, found myself more drawn to the "wire a bunch of Macs together" option, compared to "wire a bunch of GPUs together" (due to the compactness). If many people are like this, it makes sense why Nvidia is down while Apple is (slightly) up. (Moreover, it's apparently possible to run the full 671b-parameter version locally, and at a decent speed, using a pure RAM+CPU setup; indeed, it appears cheaper than mucking about with GPUs/Macs, just $6,000.)

This world doesn't seem outright implausible to me. I'm bearish on agents and somewhat skeptical of inference-time scaling. And if inference-time scaling does deliver on its promises, it'll likely go the way of search-and-distill.

On balance, I don't actually expect the market to have any idea what's going on, so I don't know that its reasoning is this specific flavor of "well-informed but skeptical". And again, it's possible the drop was due to Trump, nothing to do with DeepSeek at all.

But as I'd said, this reaction to DeepSeek-R1 does not seem necessarily irrational/incoherent to me.

Comment by Thane Ruthenis on johnswentworth's Shortform · 2025-01-27T20:59:39.949Z · LW · GW

I don't have deep expertise in the subject, but I'm inclined to concur with the people saying that the widely broadcast signals don't actually represent one consistent thing, despite your plausible argument to the contrary.

Here's a Scott Alexander post speculating why that might be the case. In short: there was an optimization pressure towards making internal biological signals very difficult to decode, because easily decodable signals were easy targets for parasites evolving to exploit them. As a result, the actual signals are probably represented as "unnecessarily" complicated, timing-based combinations of various "basic" chemical, electrical, etc. signals, and they're somewhat individualized to boot. You can't decode them just by looking at any one spatially isolated chunk of the body, by design.

Basically: separate chemical substances (and other components that look "simple" locally/from the outside) are not the privileged basis for decoding internal signals. They're the anti-privileged basis, if anything.

Comment by Thane Ruthenis on Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals · 2025-01-26T19:44:33.420Z · LW · GW

Hmm. This does have the feel of gesturing at something important, but I don't see it clearly yet...

Free association: geometric rationality.

MIRI's old results argue that "corrigibility via uncertainty regarding the utility function" doesn't work, because if the agent maximizes expected utility anyway, it doesn't matter one whit whether we're taking expectation over actions or over utility functions. However, the corrigibility-via-instrumental-goals does have the feel of "make the agent uncertain regarding what goals it will want to pursue next". Is there, therefore, some way to implement something-like-this while avoiding MIRI's counterexample?

Loophole: the counterexample works in the arithmetically-expected utility regime. What if we instead do it in the geometric one? I. e., have an agent take actions that maximize the geometrically-expected product of candidate utility functions? This is a more conservative/egalitarian regime: any one utility function flipping to negative or going to zero wipes out all value, unlike with sums (which are more tolerant of ignoring/pessimizing some terms, and can have "utility monsters"). So it might potentially make the agent actually hesitant to introduce potentially destructive changes to its environment...
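
One way to write down the contrast (my own formalization of the above, with candidate utility functions $u_i$ and weights $p_i$):

```latex
% Arithmetic regime: uncertainty over utilities collapses into a single mixture utility.
\[
a^{*}_{\text{arith}} = \arg\max_{a} \; \sum_{i} p_i \, \mathbb{E}\bigl[u_i \mid a\bigr]
\]
% Geometric regime: a weighted product, so any one u_i driven to ~zero wipes out all value,
% which is what might make the agent hesitant about destructive/irreversible actions.
\[
a^{*}_{\text{geom}} = \arg\max_{a} \; \prod_{i} \mathbb{E}\bigl[u_i \mid a\bigr]^{\,p_i}
\]
```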

(This is a very quick take and it potentially completely misunderstands the concepts involved. But I figure it's better to post than not, in case the connection turns out obvious to anyone else.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-25T20:19:20.241Z · LW · GW

Coming back to this in the wake of DeepSeek r1...

I don't think the cumulative compute multiplier since GPT-4 is that high, I'm guessing 3x, except perhaps for DeepSeek-V3, which wasn't trained compute optimally and didn't use a lot of compute, and so it remains unknown what happens if its recipe is used compute optimally with more compute.

How did DeepSeek accidentally happen to invest precisely the amount of compute into V3 and r1 that would get them into the capability region of GPT-4/o1, despite using training methods that clearly have wildly different returns on compute investment?

Like, GPT-4 was supposedly trained for $100 million, and V3 for $5.5 million. Yet, they're roughly at the same level. That should be very surprising. Investing a very different amount of money into V3's training should've resulted in it either massively underperforming GPT-4, or massively overperforming, not landing precisely in its neighbourhood!

Consider this graph. If we find some training method A, and discover that investing $100 million in it lands us at just above "dumb human", and then find some other method B with a very different ROI, and invest $5.5 million in it, the last thing we should expect is to again land near "dumb human".

Or consider this trivial toy model: You have two linear functions, f(x) = Ax and g(x) = Bx, where x is the compute invested, output is the intelligence of the model, and f and g are different training methods. You pick some x effectively at random (whatever amount of money you happened to have lying around), plug it into f, and get, say, 120. Then you pick a different random value of x, plug it into g, and get... 120 again. Despite the fact that the multipliers A and B are likely very different, and you used very different x-values as well. How come?
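
Concretely, with made-up numbers plugged into the toy model:

```latex
% For both methods to land at the same output, the ratio of returns B/A must
% exactly offset the ratio of budgets (100 / 5.5 ≈ 18.2):
\[
f(100) = A \cdot 100 = 120 \;\Rightarrow\; A = 1.2,
\qquad
g(5.5) = B \cdot 5.5 = 120 \;\Rightarrow\; B \approx 21.8,
\qquad
\frac{B}{A} \approx 18.2 .
\]
```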

The explanations that come to mind are:

  • It actually is just that much of a freaky coincidence.
  • DeepSeek have a superintelligent GPT-6 equivalent that they trained for $10 million in their basement, and V3/r1 are just flexes that they specifically engineered to match GPT-4-ish level.
  • DeepSeek directly trained on GPT-4 outputs, effectively just distilling GPT-4 into their model, hence the anchoring.
  • DeepSeek kept investing and tinkering until getting to GPT-4ish level, and then stopped immediately after attaining it.
  • GPT-4ish neighbourhood is where LLM pretraining plateaus, which is why this capability level acts as a sort of "attractor" into which all training runs, no matter how different, fall.

Comment by Thane Ruthenis on Lao Mein's Shortform · 2025-01-23T18:54:06.900Z · LW · GW

Hm, that's not how it looks to me? Look at the 6-months plot:

The latest growth spurt started January 14th, well before the Stargate news went public, and this seems like just its straight-line continuation. The graph past the Stargate announcement doesn't look "special" at all. Arguably, we can interpret it as Stargate being part of the reference class of events that are currently driving GE Vernova up – not an outside-context event as such, but still relevant...

Except for the fact that the market indeed did not respond to Stargate immediately. Which makes me think it didn't react at all, and the next bump you're seeing is some completely different thing from the reference class of GE-Vernova-raising events (to which Stargate itself does not in fact belong).

Which is probably just something to do with Trump's expected policies. Note that there was also a jump on November 4th. Might be some general expectation that he's going to compete with China regarding energy, or that he'd cut the red tape on building plants for his tech backers.

In which case, perhaps it's because Stargate was already priced-in – or at least whichever fraction of Stargate ($30-$100 billion?) is real and is actually going to be realized. We already knew datacenters of this scale were going to be built this year, after all.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-22T19:05:45.791Z · LW · GW

Overall, I think that the new Stargate numbers published may (call it 40%) be true, but I also think there is a decent chance this is new administration trump-esque propoganda/bluster (call it 45%),

I think it's definitely bluster; the question is how much of a done deal it is to turn this bluster into at least $100 billion.

I don't think this changes the prior expected path of datacenter investment at all. It's precisely how the expected path was going to look; the only change is how relatively high-profile/flashy this is being. (Like, if they invest $30-100 billion into the next generation of pretrained models in 2025-2026, and that generation fails, they ain't seeing the remaining $400 billion no matter what they're promising now. The next generation makes or breaks the investment into the subsequent one, just as was expected before this.)

Satya Nadella was just asked about how funding looks for stargate, and said "Microsoft is good for investing 80b"

He does very explicitly say "I am going to spend 80 billion dollars building Azure". Which I think has nothing to do with Stargate.

(I think the overall vibe he gives off is "I don't care about this Stargate thing, I'm going to scale my data centers and that's that". I don't put much stock into this vibe, but it does fit with OpenAI/Microsoft being on the outs.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-22T17:53:57.239Z · LW · GW

So, Project Stargate. Is it real, or is it another "Sam Altman wants $7 trillion"? Some points:

  • The USG invested nothing in it. Some news outlets are being misleading about this. Trump just stood next to them and looked pretty, maybe indicated he'd cut some red tape. It is not an "AI Manhattan Project", at least, as of now.
  • Elon Musk claims that they don't have the money and that SoftBank (stated to have "financial responsibility" in the announcement) has less than $10 billion secured. If true, while this doesn't mean they can't secure an order of magnitude more by tomorrow, this does directly clash with the "deploying $100 billion immediately" statement.
    • But Sam Altman counters that Musk's statement is "wrong", as Musk "surely knows".
    • I... don't know which claim I distrust more. Hm, I'd say Altman feeling the need to "correct the narrative" here, instead of just ignoring Musk, seems like a sign of weakness? He doesn't seem like the type to naturally get into petty squabbles like this, otherwise.
    • (And why, yes, this is what an interaction between two Serious People building world-changing existentially dangerous megaprojects looks like. Apparently.)
  • Some people try to counter Musk's claim by citing Satya Nadella's statement that Satya's "good for his $80 billion". But that's not referring to Stargate, that's about Azure. Microsoft is not listed as investing in Stargate at all, it's only a "technology partner".
  • Here's a brief analysis from the CIO of some investment firm. He thinks it's plausible that the stated "initial group of investors" (SoftBank, OpenAI, Oracle, MGX) may invest fifty billion dollars into it over the next four years; not five hundred billion.
    • They don't seem to have the raw cash for even $100 billion – and if SoftBank secured the missing funding from some other set of entities, why aren't they listed as "initial equity funders" in the announcement?

Overall, I'm inclined to fall back on @Vladimir_Nesov's analysis here. $30-50 billion this year seems plausible. But $500 billion is, for now, just Altman doing door-in-the-face as he had with his $7 trillion.

I haven't looked very deeply into it, though. Additions/corrections welcome!

Comment by Thane Ruthenis on Implications of the inference scaling paradigm for AI safety · 2025-01-21T02:57:54.545Z · LW · GW

McLaughlin announced 2025-01-13 that he had joined OpenAI

He gets onboarded only on January 28th, for clarity.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-21T00:57:22.681Z · LW · GW

I think GPT-4 to o3 represent non-incremental narrow progress, but only, at best, incremental general progress.

(It's possible that o3 does "unlock" transfer learning, or that o4 will do that, etc., but we've seen no indication of that so far.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-21T00:04:32.348Z · LW · GW

I'm sure he wants hype, but he doesn't want high expectations that are very quickly falsified

There's a possibility that this was a clown attack on OpenAI instead...

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-21T00:02:38.212Z · LW · GW

Valid, I was split on whether it's worth posting vs. it'd be just me taking my part in spreading this nonsense. But it'd seemed to me that a lot of people, including LW regulars, might've been fooled, so I erred on the side of posting.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-20T23:59:58.199Z · LW · GW

As I'd said, I think he's right about the o-series' theoretic potential. I don't think there is, as of yet, any actual indication that this potential has already been harnessed, and therefore that it works as well as the theory predicts. (And of course, the o-series scaling quickly at math is probably not even an omnicide threat. There's an argument for why it might be – that the performance boost will transfer to arbitrary domains – but that doesn't seem to be happening. I guess we'll see once o3 is public.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-20T23:53:55.714Z · LW · GW

The thing is - last time I heard about OpenAI rumors it was Strawberry. 

That was part of my reasoning as well, why I thought it might be worth engaging with!

But I don't think this is the same case. Strawberry/Q* was being leaked-about from more reputable sources, and it was concurrent with dramatic events (the coup) that were definitely happening.

In this case, all evidence we have is these 2-3 accounts shitposting.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-20T15:40:39.173Z · LW · GW

Alright, so I've been following the latest OpenAI Twitter freakout, and here's some urgent information about the latest closed-doors developments that I've managed to piece together:

  • Following OpenAI Twitter freakouts is a colossal, utterly pointless waste of your time and you shouldn't do it ever.
  • If you saw this comment of Gwern's going around and were incredibly alarmed, you should probably undo the associated update regarding AI timelines (at least partially, see below).
  • OpenAI may be running some galaxy-brained psyops nowadays.

Here's the sequence of events, as far as I can tell:

  1. Some Twitter accounts that are (claiming, without proof, to be?) associated with OpenAI are being very hype about some internal OpenAI developments.
  2. Gwern posts this comment suggesting an explanation for point 1.
  3. Several accounts (e. g., one, two) claiming (without proof) to be OpenAI insiders start to imply that:
    1. An AI model recently finished training.
    2. Its capabilities surprised and scared OpenAI researchers.
    3. It produced some innovation/is related to OpenAI's "Level 4: Innovators" stage of AGI development.
  4. Gwern's comment goes viral on Twitter (example).
  5. A news story about GPT-4b micro comes out, indeed confirming a novel OpenAI-produced innovation in biotech. (But it is not actually an "innovator AI".)
  6. The stories told by the accounts above start to mention that the new breakthrough is similar to GPT-4b: that it's some AI model that produced an innovation in "health and longevity". But also, that it's broader than GPT-4b, and that the full breadth of this new model's surprising emergent capabilities is unclear. (One, two, three.)
  7. Noam Brown, an actual confirmed OpenAI researcher, complains about "vague AI hype on social media", and states they haven't yet actually achieved superintelligence.
  8. The Axios story comes out, implying that OpenAI has developed "PhD-level superagents" and that Sam Altman is going to brief Trump on them. Of note:
    1. Axios is partnered with OpenAI.
    2. If you put on Bounded Distrust lens, you can see that the "PhD-level superagents" claim is entirely divorced from any actual statements made by OpenAI people. The article ties-in a Mark Zuckerberg quote instead, etc. Overall, the article weaves the impression it wants to create out of vibes (with which it's free to lie) and not concrete factual statements.
  9. The "OpenAI insiders" gradually ramp up the intensity of their story all the while, suggesting that the new breakthrough would allow ASI in "weeks, not years", and also that OpenAI won't release this "o4-alpha" until 2026 because they have a years-long Master Plan, et cetera. Example, example.
  10. Sam Altman complains about "twitter hype" being "out of control again".
  11. OpenAI hype accounts deflate.

What the hell was all that?

First, let's dispel any notion that the hype accounts are actual OpenAI insiders who know what they are talking about:

  • "Satoshi" claims to be blackmailing OpenAI higher-ups in order to be allowed to shitpost classified information on Twitter. I am a bit skeptical of this claim, to put it mildly.
  • "Riley Coyote" has a different backstory which is about as convincing by itself, and which also suggests that "Satoshi" is "Riley"'s actual source.

As far as I can tell digging into the timeline, both accounts just started acting as if they are OpenAI associates posting leaks. Not even, like, saying that they're OpenAI associates posting leaks, much less proving that. Just starting to act as if they're OpenAI associates and that everyone knows this. Their tweets then went viral. (There's also the strawberry guy, who also implies he's an OpenAI insider, who also joined in on the above hype-posting, and who seems to have been playing this same game for a year now. But I'm tired of looking up the links, and the contents are intensely unpleasant. Go dig through that account yourself if you want.)

In addition, none of the OpenAI employee accounts with real names that I've been able to find have been participating in this hype cycle. So if OpenAI allowed its employees to talk about what happened/is happening, why weren't any confirmed-identity accounts talking about it (except Noam's, deflating it)? Why only the anonymous Twitter people?

Well, because this isn't real.

That said, the timing is a bit suspect. This hype starting up, followed by the GPT-4b micro release and the Axios piece, all in the span of ~3 days? And the hype men's claims at least partially predicting the GPT-4b micro thing?

There's three possibilities:

  • A coincidence. (The predictions weren't very precise, just "innovators are coming". The details about health-and-longevity and the innovative output got added after the GPT-4b piece, as far as I can tell.)
  • A leak in one of the newspapers working on the GPT-4b story (which the grifters then built a false narrative around).
  • Coordinated action by OpenAI.

One notable point is, the Axios story was surely coordinated with OpenAI, and it's both full of shenanigans and references the Twitter hype ("several OpenAI staff have been telling friends they are both jazzed and spooked by recent progress"). So OpenAI was doing shenanigans. So I'm slightly inclined to believe it was all an OpenAI-orchestrated psyop.

Let's examine this possibility.

Regarding the truth value of the claims: I think nothing has happened, even if the people involved are OpenAI-affiliated (in a different sense from how they claim). Maybe there was some slight unexpected breakthrough on an obscure research direction, at most, to lend an air of technical truth to those claims. But I think it's all smoke and mirrors.

However, the psyop itself (if it were one) has been mildly effective. I think tons of people actually ended up believing that something might be happening (e. g., janus, the AI Notkilleveryoneism Memes guy, myself for a bit, maybe gwern, if his comment referenced the pattern of posting related to the early stages of this same event).

That said, as Eliezer points out here, it's advantageous for OpenAI to be crying wolf: both to drive up/maintain hype among their allies, and to frog-boil the skeptics into instinctively dismissing any alarming claims. Such that, say, if there ever are actual whistleblowers pseudonymously freaking out about unexpected breakthroughs on Twitter, nobody believes them.

That said, I can't help but think that if OpenAI were actually secure in their position and making insane progress, they would not have needed to do any of this stuff. If you're closing your fingers around agents capable of displacing the workforce en masse, if you see a straight shot to AGI, why engage in this childishness? (Again, if Satoshi and Riley aren't just random trolls.)

Bottom line, one of the following seems to be the case:

  • There's a new type of guy, which is to AI/OpenAI what shitcoin-shills are to cryptocurrency.
  • OpenAI is engaging in galaxy-brained media psyops.

Oh, and what's definitely true is that paying attention to what's going viral on Twitter is a severe mistake. I've committed it for the first and last time.

I also suggest that you unroll the update you might've made based on Gwern's comment. Not the part describing the o-series' potential – that's of course plausible and compelling. The part where that potential seems to have already been confirmed and realized according to ostensible OpenAI leaks – because those leaks seem to be fake. (Unless Gwern was talking about some other demographic of OpenAI accounts being euphorically optimistic on Twitter, which I've somehow missed?)[1]

(Oh, as to Sam Altman meeting with Trump? Well, that's probably because Trump's Sinister Vizier, Sam Altman's sworn nemesis, Elon Musk, is whispering in Trump's ear 24/7, suggesting he crush OpenAI, and if Altman doesn't seduce Trump ASAP, Trump will do that. Especially since OpenAI is currently vulnerable due to their legally dubious for-profit transition.

This planet is a clown show.)


I'm currently interested in:

  • Arguments for actually taking the AI hype people's claims seriously. (In particular, were any actual OpenAI employees provably involved, and did I somehow miss them?)
  • Arguments regarding whether this was an OpenAI psyop vs. some random trolls.

Also, pinging @Zvi in case any of those events showed up on his radar and he plans to cover them in his newsletter.

  1. ^

    Also, I can't help but note that the people passing the comment around (such as this, this) are distorting it. The Gwern-stated claim isn't that OpenAI are close to superintelligence, it's that they may feel as if they're close to superintelligence. Pretty big difference!

    Though, again, even that is predicated on actual OpenAI employees posting actual insider information about actual internal developments. Which I am not convinced is a thing that is actually happening.

Comment by Thane Ruthenis on Implications of the inference scaling paradigm for AI safety · 2025-01-20T04:59:30.340Z · LW · GW

If you're wondering why OAers are suddenly weirdly, almost euphorically, optimistic on Twitter

For clarity, which OAers is this talking about, precisely? There's a cluster of guys – e. g. this, this, this – claiming to be OpenAI insiders. That cluster went absolutely bananas the last few days, claiming ASI achieved internally/will be in a few weeks, alluding to an unexpected breakthrough that has OpenAI researchers themselves scared. But none of them, as far as I can tell, have any proof that they're OpenAI insiders.

On the contrary: the Satoshi guy straight-up suggests he's allowed to be an insider shitposting classified stuff on Twitter because he has "dirt on several top employees", which, no. From that, I conclude that the whole cluster is a member of the same species as the cryptocurrency hivemind hyping up shitcoins.

Meanwhile, any actual confirmed OpenAI employees are either staying silent or carefully deflating the hype. roon is being roon, but no more than is usual for them, as far as I can tell.

So... who are those OAers that are being euphorically optimistic on Twitter, and are they actually OAers? Anyone knows? (I don't think a scenario where low-level OpenAI people are allowed to truthfully leak this stuff on Twitter, but only if it's plausible-deniable, makes much sense.[1] In particular: what otherwise unexplainable observation are we trying to explain using this highly complicated hypothesis? How is that hypothesis privileged over "attention-seeking roon copycats"?)

General question, not just aimed at Gwern.

Edit: There's also the Axios article. But Axios is partnered with OpenAI, and if you go Bounded Distrust on it, it's clear how misleading it is.

  1. ^

    Suppose that OpenAI is following this strategy in order to have their cake and eat it too: engage in a plausible-deniable messaging pattern, letting their enemies dismiss it as hype (and so not worry about OpenAI and AI capability progress) while letting their allies believe it (and so keep supporting/investing in them). But then either (1) the stuff these people are now leaking won't come true, disappointing the allies, or (2) this stuff will come true, and their enemies would know to take these leaks seriously the next time.

    This is a one-time-use strategy. At this point, either (1) allow actual OpenAI employees to leak this stuff, if you're fine with this type of leak, or (2) instruct the hype men to completely make stuff up, because if you expect your followers not to double-check which predictions came true, you don't need to care about the truth value at all.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-17T22:57:13.787Z · LW · GW

Current take on the implications of "GPT-4b micro": Very powerful, very cool, ~zero progress to AGI, ~zero existential risk. Cheers.

First, the gist of it appears to be:

OpenAI’s new model, called GPT-4b micro, was trained to suggest ways to re-engineer the protein factors to increase their function. According to OpenAI, researchers used the model’s suggestions to change two of the Yamanaka factors to be more than 50 times as effective—at least according to some preliminary measures.

The model was trained on examples of protein sequences from many species, as well as information on which proteins tend to interact with one another. [...] Once Retro scientists were given the model, they tried to steer it to suggest possible redesigns of the Yamanaka proteins. The prompting tactic used is similar to the “few-shot” method, in which a user queries a chatbot by providing a series of examples with answers, followed by an example for the bot to respond to.
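(For readers unfamiliar with it, here's a minimal, purely illustrative sketch of what few-shot prompting looks like in general; the sequences, numbers, and field names are placeholders I made up, and the actual prompts used here are not public:)

```python
# Purely illustrative few-shot prompt, assembled as a plain string.
# All sequences, numbers, and field names are made-up placeholders.
examples = [
    ("MAHSSLVK...", "MAHSALVK...", "reprogramming efficiency: 12x baseline"),
    ("MTDKQLVG...", "MTDKQLAG...", "reprogramming efficiency: 8x baseline"),
]

parts = []
for original, redesign, outcome in examples:
    parts.append(
        f"Original factor: {original}\n"
        f"Proposed redesign: {redesign}\n"
        f"Measured result: {outcome}\n"
    )

# The final example is left incomplete, for the model to continue.
parts.append("Original factor: MSELDQLR...\nProposed redesign:")
prompt = "\n".join(parts)
print(prompt)
```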

Crucially, if the reporting is accurate, this is not an agent. The model did not engage in autonomous open-ended research. Rather, humans guessed that if a specific model is fine-tuned on a specific dataset, the gradient descent would chisel into it the functionality that would allow it to produce groundbreaking results in the corresponding domain. As far as AGI-ness goes, this is functionally similar to AlphaFold 2; as far as agency goes, it's at most at the level of o1[1].

To speculate on what happened: Perhaps GPT-4b ("b" = "bio"?) is based on some distillation of an o-series model, say o3. o3's internals contain a lot of advanced machinery for mathematical reasoning. What this result shows, then, is that the protein-factors problem is in some sense a "shallow" mathematical problem that could be easily solved if you think about it the right way. Finding the right way to think about it is itself highly challenging, however – a problem teams of brilliant people have failed to crack – yet deep learning made it possible to automate this search and crack it.

This trick likely generalizes. There may be many problems in the world that could be cracked this way[2]: those that are secretly "mathematically shallow" in this manner, and for which you can get a clean-enough fine-tuning dataset.

... Which is to say, this almost certainly doesn't cover social manipulation/scheming (no clean dataset), and likely doesn't cover AI R&D (too messy/open-ended, although I can see being wrong about this). (Edit: And if it Just Worked given any sorta-related sorta-okay fine-tuning dataset, the o-series would've likely generalized to arbitrary domains out-of-the-box, since the pretraining is effectively this dataset for everything. Yet it doesn't.)

It's also not entirely valid to call that "innovative AI", any more than it was valid to call AlphaFold 2 that. It's an innovative way of leveraging gradient descent for specific scientific challenges, by fine-tuning a model pre-trained in a specific way. But it's not the AI itself picking a field to innovate and then producing the innovation; it's humans finding a niche which this class of highly specific (and highly advanced) tools can fill.

It's not the type of thing that can get out of human control.

So, to restate: By itself, this seems like a straightforwardly good type of AI progress. Zero existential risk or movement in the direction of existential risks, tons of scientific and technological utility.

Indeed, on the contrary: if that's the sort of thing OpenAI employees are excited about nowadays, and what their recent buzz about AGI in 2025-2026 and innovative AI and imminent Singularity was all about, that seems like quite a relief.

  1. ^

    If it does runtime search. Which doesn't fit the naming scheme – should be "GPT-o3b" instead of "GPT-4b" or something – but you know how OpenAI is with their names.

  2. ^

    And indeed, OpenAI-associated vagueposting suggests there's some other domain in which they'd recently produced a similar result.

    Edit: Having dug through the vagueposting further, yep, this is also something "in health and longevity", if this OpenAI hype man is to be believed.

In fact, if we dig into their posts further – which I'm doing in the spirit of a fair-play whodunnit/haruspicy at this point, don't take it too seriously – we can piece together a bit more. This suggests that the innovation is indeed based on applying fine-tuning to an RL'd model. This further implies that the intent of the new iteration of GPT-4b was to generalize it further. Perhaps it was fed not just the protein-factors dataset, but a breadth of various bioscience datasets, chiseling in a more general-purpose model of bioscience on which a wide variety of queries could be run?

    Note: this would still be a "cutting-edge biotech simulator" kind of capability, not an "innovative AI agent" kind of capability.

    Ignore that whole thing. On further research, I'm pretty sure all of this was substance-less trolling and no actual OpenAI researchers were involved. (At most, OpenAI's psy-ops team.)

Comment by Thane Ruthenis on Implications of the inference scaling paradigm for AI safety · 2025-01-17T04:10:25.952Z · LW · GW

They don't need to use this kind of subterfuge, they can just directly hire people to do that. Hiring experts to design benchmark questions is standard; this would be no different.

Comment by Thane Ruthenis on Charlie Steiner's Shortform · 2025-01-17T01:33:45.035Z · LW · GW

Perhaps the reasoning is that the AGI labs already have all kinds of internal benchmarks of their own, no external help needed, but the progress on these benchmarks isn't a matter of public knowledge. Creating and open-sourcing these benchmarks, then, only lets the society better orient to the capabilities progress taking place, and so make more well-informed decisions, without significantly advantaging the AGI labs.

Comment by Thane Ruthenis on Everywhere I Look, I See Kat Woods · 2025-01-15T23:27:17.401Z · LW · GW

At a glance, this seems like a high-risk, high-reward tactic. I approve of it if it is in fact effective at changing people's minds, and disapprove of it if it is in fact making people cringe and become biased against the ideas she's trying to spread.

My immediate impression was that it's mostly the latter. I agree that these memes seem kind of cringe, especially if they're being spread in communities that are hostile to AI Safety takes and dislike this kind of content (which my cursory familiarity with r/singularity suggests)...

... but glancing at the karma ratings her posts receive, it doesn't seem that she's speaking to a hostile audience using takes that don't land with them. I think most of her posts have at least an average karma rating for a given community, some of them are big hits, and the comments are frequently skeptical but rarely outright derisive. This isn't a major indicator and I've only spent ~5 minutes looking through them, but it does look like it's working.

@KatWoods, do you have any more convincing metrics you're using to evaluate the efficiency of your, ah, propaganda campaign? Genuinely interested in how effective it is.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-14T22:02:22.571Z · LW · GW

It also lines up with a speculation I've always had. In theory LLMs are predictors, but in practice, are they pretty much imitators?

Yep, I think that's basically the case.

@nostalgebraist makes an excellent point that eliciting any latent superhuman capabilities which bigger models might have is an art of its own, and that "just train chatbots" doesn't exactly cut it for this task. Maybe that's where some additional capabilities progress might still come from.

But the AI industry (both AGI labs and the menagerie of startups and open-source enthusiasts) has so far been either unwilling or unable to move past the chatbot paradigm.

(Also, I would tentatively guess that this type of progress is not existentially threatening. It'd yield us a bunch of nice software tools, not a lightcone-eating monstrosity.)

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-14T21:51:32.321Z · LW · GW

There's been incremental improvement and various quality-of-life features like more pleasant chatbot personas, tool use, multimodality, gradually better math/programming performance that make the models useful for gradually bigger demographics, et cetera.

But it's all incremental, no jumps like 2-to-3 or 3-to-4.

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-14T21:50:17.388Z · LW · GW

Thanks!

GPT-3 was instead undertrained, being both larger and less performant than the hypothetical compute optimal alternative

You're more fluent in the scaling laws than me: is there an easy way to roughly estimate how much compute would've been needed to train a model as capable as GPT-3 if it were done Chinchilla-optimally + with MoEs? That is: what's the actual effective "scale" of GPT-3?

(Training GPT-3 reportedly took 3e23 FLOPs, and GPT-4 2e25 FLOPs. Naively, the scale-up factor is 67x. But if GPT-3's level is attainable using less compute, the effective scale-up is bigger. I'm wondering how much bigger.)
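(For concreteness, here's the kind of back-of-envelope I have in mind, using a Chinchilla-style parametric loss fit. The constants are quoted from memory, so treat them as assumptions; and this only captures the savings from compute-optimal allocation, not from MoE or better data, so it likely understates the effective scale-up:)

```python
import numpy as np

# Chinchilla-style parametric loss fit: L(N, D) = E + A / N**alpha + B / D**beta
# Constants quoted from memory; treat them as assumptions.
E, A, B, alpha, beta = 1.69, 406.4, 410.7, 0.34, 0.28

def loss(n_params: float, n_tokens: float) -> float:
    return E + A / n_params**alpha + B / n_tokens**beta

# GPT-3: ~175B params on ~300B tokens; C ~= 6*N*D ~= 3e23 FLOPs.
N_gpt3, D_gpt3 = 175e9, 300e9
target = loss(N_gpt3, D_gpt3)

# Cheapest (N, D) predicted to hit the same loss: sweep N, solve for D.
best = None
for N in np.logspace(9.5, 12, 4000):          # ~3B to 1T params
    slack = (target - E) - A / N**alpha       # loss budget left for the data term
    if slack <= 0:
        continue
    D = (B / slack) ** (1 / beta)
    C = 6 * N * D
    if best is None or C < best[0]:
        best = (C, N, D)

C_opt, N_opt, D_opt = best
print(f"GPT-3 training compute:     {6 * N_gpt3 * D_gpt3:.1e} FLOPs")
print(f"Compute-optimal equivalent: {C_opt:.1e} FLOPs (N={N_opt:.1e}, D={D_opt:.1e})")
print(f"Implied effective GPT-3 -> GPT-4 scale-up: {2e25 / C_opt:.0f}x")
```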

Comment by Thane Ruthenis on Thane Ruthenis's Shortform · 2025-01-14T00:40:39.488Z · LW · GW

Here's an argument for a capabilities plateau at the level of GPT-4 that I haven't seen discussed before. I'm interested in any holes anyone can spot in it.

Consider the following chain of logic:

  1. The pretraining scaling laws only say that, even for a fixed training method, increasing the model's size and the amount of data you train on increases the model's capabilities – as measured by loss, performance on benchmarks, and the intuitive sense of how generally smart a model is.
  2. Nothing says that increasing a model's parameter-count and the amount of compute spent on training it is the only way to increase its capabilities. If you have two training methods A and B, it's possible that the B-trained X-sized model matches the performance of the A-trained 10X-sized model.
  3. Empirical evidence: Sonnet 3.5 (at least the not-new one), Qwen-2.5-72B, and Llama 3-70B all have 70-ish billion parameters, i. e., less than GPT-3. Yet, their performance is at least on par with that of GPT-4 circa early 2023.
  4. Therefore, it is possible to "jump up" a tier of capabilities, by any reasonable metric, using a fixed model size but improving the training methods.
  5. The latest set of GPT-4-sized models (Opus 3.5, Orion, Gemini 1.5 Pro?) are presumably trained using the current-best methods. That is: they should be expected to be at the effective capability level of a model that is 10X GPT-4's size yet trained using early-2023 methods. Call that level "GPT-5".
  6. Therefore, the jump from GPT-4 to GPT-5, holding the training method fixed at early 2023, is the same as the jump from the early GPT-4 to the current (non-reasoning) SotA, i. e., to Sonnet 3.5.1.
    1. (Never mind that Sonnet 3.5.1 is likely GPT-3-sized too; it still beats the current-best GPT-4-sized models as well. I guess it straight-up punches up two tiers?)
  7. The jump from GPT-3 to GPT-4 is dramatically bigger than the jump from early-2023 SotA to late-2024 SotA. I. e.: 4-to-5 is less than 3-to-4.
  8. Consider a model 10X bigger than GPT-4 but trained using the current-best training methods; an effective GPT-6. We should expect this jump to be at most as significant as the capability jump from early-2023 to late-2024. By point (7), it's likely even less significant than that.
  9. Empirical evidence: Despite the proliferation of all sorts of better training methods, including e. g. the suite of tricks that allowed DeepSeek to train a near-SotA-level model for pocket change, none of the known non-reasoning models have managed to escape the neighbourhood of GPT-4, and none of the known models (including the reasoning models) have escaped that neighbourhood in domains without easy verification.
    1. Intuitively, if we now know how to reach levels above early-2023!GPT-4 using 20x fewer resources, we should be able to shoot well past early-2023!GPT-4 using 1x as much resources – and some of the latest training runs have to have spent 10x the resources that went into original GPT-4.
      1. E. g., OpenAI's rumored Orion, which was presumably both trained using more compute than GPT-4, and via better methods than were employed for the original GPT-4, and which still reportedly underperformed.
      2. Similar for Opus 3.5: even if it didn't "fail" as such, the fact that they chose to keep it in-house instead of giving public access for e. g. $200/month suggests it's not that much better than Sonnet 3.5.1.
    2. Yet, we have still not left the rough capability neighbourhood of early-2023!GPT-4. (Certainly no jumps similar to the one from GPT-3 to GPT-4.)
  10. Therefore, all known avenues of capability progress aside from the o-series have plateaued. You can make the current SotA more efficient in all kinds of ways, but you can't advance the frontier.

Are there issues with this logic?

The main potential one is if all models that "punch up a tier" are directly trained on the outputs of the models of the higher tier. In this case, to have a GPT-5-capabilities model of GPT-4's size, it had to have been trained on the outputs of a GPT-5-sized model, which do not exist yet. "The current-best training methods", then, do not yet scale to GPT-4-sized models, because they rely on access to a "dumbly trained" GPT-5-sized model. Therefore, although the current-best GPT-3-sized models can be considered at or above the level of early-2023 GPT-4, the current-best GPT-4-sized models cannot be considered to be at the level of GPT-5 if it were trained using early-2023 methods.

Note, however: this would then imply that all excitement about (currently known) algorithmic improvements is hot air. If the capability frontier cannot be pushed by improving the training methods in any (known) way – if training a GPT-4-sized model on well-structured data, and on reasoning traces from a reasoning model, et cetera, isn't enough to push it to GPT-5's level – then pretraining transformers is the only known game in town, as far as general-purpose capability-improvement goes. Synthetic data and other tricks can allow you to reach the frontier in all manner of more efficient ways, but not move past it.

Basically, it seems to me that one of those must be true:

  • Capabilities can be advanced by improving training methods (by e. g. using synthetic data).
    • ... in which case we should expect current models to be at the level of GPT-5 or above. And yet they are not much more impressive than GPT-4, which means further scaling will be a disappointment.
  • Capabilities cannot be advanced by improving training methods.
    • ... in which case scaling pretraining is still the only known method of general capability advancement.
      • ... and if the Orion rumors are true, it seems that even a straightforward scale-up to GPT-4.5ish's level doesn't yield much (or: yields less than was expected).

(This still leaves one potential avenue of general capabilities progress: figuring out how to generalize o-series' trick to domains without easy automatic verification. But if the above reasoning has no major holes, that's currently the only known promising avenue.)

Comment by Thane Ruthenis on Cast it into the fire! Destroy it! · 2025-01-13T21:42:09.654Z · LW · GW

AGI is not the only technology or set of technologies that could be used to let a small set of people (say, 1-100) attain implacable, arbitrarily precise control over the future of humanity. Some obvious examples:

  • Sufficiently powerful industrial-scale social-manipulation/memetic-warfare tools.
  • Superhuman drone armies capable of reliably destroying designated targets while not-destroying designated non-targets.
  • Self-replicating nanotechnology capable of automated end-to-end manufacturing of arbitrarily complicated products out of raw natural resources.
  • Brain uploading, making it possible to create a class of infinitely exploitable digital workers with AGI-level capabilities.

Any of those would be sufficient to remove the need to negotiate the direction of the future with vast swathes of humanity. You can just brainwash them into following your vision, or threaten them into compliance with overwhelming military power, or just disassemble them into raw materials for superior manufacturers.

Should we ban all of those as well?

Generalizing, it seems that we should ban technological progress entirely. What if there's some other pathway to ultimate control that I've overlooked, having only thought about it for a minute? Perhaps we should all return to the state of nature?

I don't mean to say you don't have a point. Indeed, I largely agree that there are no humans or human processes that humanity-as-a-whole is in the epistemic position to trust with AGI (though there are some humans I would trust with it; it's theoretically possible to use it ethically). But "we must ban AGI, it's the unique Bad technology" is invalid. Humanity's default long-term prospects seem overwhelmingly dicey even without it.

I don't have a neat alternate proposal for you. But what you're suggesting is clearly not the way.

Comment by Thane Ruthenis on AI Safety as a YC Startup · 2025-01-10T12:32:35.370Z · LW · GW

Why would we not have the "Direction" component standardized to have unit norm?

I think what the OP is getting at is that the space of endeavors has a bunch of privileged directions of high impact, and your impact depends on (1) how good your aim is and (2) how hard you shoot. So it'd be something like magnitude times the sum of cosine similarities with each high-impact vector; or perhaps just the magnitude if we use the high-impact vectors as the basis.

Also, "Magnitude" is probably the wrong term for the component in question; it seems to mean "how much you achieve", but that's actually what "Impact" is measuring! And indeed, impact is very much a function of the direction in which you're going. "Magnitude" should instead be "Effort" or "Short-Term Profit" or something. 

(Yes, I truly believe that nitpicking this toy model is the best use of my time right now.)

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-10T04:30:52.606Z · LW · GW

Minor would count.

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-09T23:52:00.858Z · LW · GW

First of all, 2 and 4 seem closely related to me.

I would also venture to guess with less confidence that 1 and 3 might be because of this as well

Agreed, I do expect that the performance on all of those is mediated by the same variable(s); that's why I called them a "cluster".

benchmarks made by METR who was specifically trying to measure AI R&D ability and agency abilities, and which genuinely do seem to require (small) amounts of agency

I think "agency" is a bit of an overly abstract/confusing term to use, here. In particular, I think it also allows both a "top-down" and a "bottom-up" approach.

Humans have "bottom-up" agency: they're engaging in fluid-intelligence problem-solving and end up "drawing" a decision-making pattern of a specific shape. An LLM, on this model, has a database of templates for such decision-making patterns, and it retrieves the best-fit agency template for whatever problem it's facing. o1/RL-on-CoTs is a way to deliberately target the set of agency-templates an LLM has, extending it. But it doesn't change the ultimate nature of what's happening.

In particular: the bottom-up approach would allow an agent to stay on-target for an arbitrarily long time, creating an arbitrarily precise fit for whatever problem it's facing. An LLM's ability to stay on-target, however, would always remain limited by the length and the expressiveness of the templates that were trained into it.

RL on CoTs is a great way to further mask the problem, which is why the o-series seems to make unusual progress on agency-measuring benchmarks. But it's still just masking it.

can you say what it is about LLM architecture and/or training methods that renders them incapable of thinking in the bottom-up way?

Not sure. I think it might be some combination of "the pretraining phase moves the model deep into the local-minimum abyss of top-down cognition, and the cheaper post-training phase can never hope to get it out of there" and "the LLM architecture sucks, actually". But I would rather not get into the specifics.

Can you describe an intellectual or practical feat, or ideally a problem set, such that if AI solves it in 2025 you'll update significantly towards my position?

"Inventing a new field of science" would do it, as far as more-or-less legible measures go. Anything less than that is too easily "fakeable" using top-down reasoning.

That said, I may make this update based on less legible vibes-based evidence, such as if o3's advice on real-life problems seems to be unusually lucid and creative. (I'm tracking the possibility that LLMs are steadily growing in general capability and that they simply haven't yet reached the level that impresses me personally. But on balance, I mostly don't expect this possibility to be realized.)

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-09T19:06:36.125Z · LW · GW

Can you say more about what skills you think the GPT series has shown ~0 improvement on?

Alright, let's try this. But this is going to be vague.

Here's a cluster of things that SotA AIs seem stubbornly bad at:

  • Innovation. LLMs are perfectly able to understand an innovative idea if it's described to them, even if it's a new idea that was produced after their knowledge-cutoff date. Yet, there hasn't been a single LLM-originating innovation, and all attempts to design "AI scientists" have produced useless slop. They seem to have terrible "research taste", even though they should be able to learn this implicit skill from the training data.
  • Reliability. Humans are very reliable agents, and SotA AIs aren't, even when e. g. put into wrappers that encourage them to sanity-check their work. The gap in reliability seems qualitative, rather than just quantitative.
  • Solving non-templated problems. There seems to be a bimodal distribution of a sort, where some people report LLMs producing excellent code/math, and others report that they fail basic tasks.
  • Compounding returns on problem-solving time. As the graph you provided shows, humans' performance scales dramatically with the time they spend on the problem, whereas AIs' – even o1's – doesn't.

My sense is that LLMs are missing some sort of "self-steering" "true autonomy" quality; the quality that allows humans to:

  • Stare at the actual problem they're solving, and build its highly detailed model in a "bottom-up" manner. Instead, LLMs go "top-down": they retrieve the closest-match template problem from a vast database, fill-in some details, and solve that problem.
    • (Non-templatedness/fluid intelligence.)
  • Iteratively improve their model of a problem over the course of problem-solving, and do sophisticated course-correction if they realize their strategy isn't working or if they're solving the wrong problem. Humans can "snap out of it" if they realize they're messing up, instead of just doing what they're doing on inertia.
    • (Reliability.)
  • Recognize when their model of a given problem represents a nontrivially new "template" that can be memorized and applied in a variety of other situations, and what these situations might be.
    • (Innovation.)

My model is that all LLM progress so far has involved making LLMs better at the "top-down" thing. They end up with increasingly bigger databases of template problems, the closest-match templates end up ever-closer to the actual problems they're facing, their ability to fill-in the details becomes ever-richer, etc. This improves their zero-shot skills, and test-time compute scaling allows them to "feel out" the problem's shape over an extended period and find an ever-more-detailed top-down fit.

But it's still fundamentally not what humans do. Humans are able to instantiate a completely new abstract model of a problem – even if it's initially based on a stored template – and chisel at it until it matches the actual problem near-perfectly. This allows them to be much more reliable; this allows them to keep themselves on-track; this allows them to find "genuinely new" innovations.
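(A toy caricature of the distinction in code, not a claim about actual LLM internals; "problems" here are just 2D points:)

```python
import numpy as np

# "Top-down": retrieve the nearest stored template; accuracy is capped by
# how well the template library covers the problem space.
# "Bottom-up": start from a rough guess and keep revising it against the
# actual problem until it fits.
templates = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]   # stored patterns

def top_down(problem: np.ndarray) -> np.ndarray:
    return min(templates, key=lambda t: np.linalg.norm(t - problem))

def bottom_up(problem: np.ndarray, steps: int = 100, lr: float = 0.1) -> np.ndarray:
    model = templates[0].copy()                # may start from a template...
    for _ in range(steps):
        model += lr * (problem - model)        # ...but keeps course-correcting
    return model

problem = np.array([0.6, 0.8])
print(top_down(problem))    # nearest canned template; noticeable mismatch
print(bottom_up(problem))   # converges to the problem itself
```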

The two methods do ultimately converge to the same end result: in the limit of a sufficiently expressive template-database, LLMs would be able to attain the same level of reliability/problem-representation-accuracy as humans. But the top-down method of approaching this limit seems ruinously computationally inefficient; perhaps so inefficient it saturates around GPT-4's capability level.[1]

LLMs are sleep-walking. We can make their dreams ever-closer to reality, and that makes the illusion that they're awake ever-stronger. But they're not, and the current approaches may not be able to wake them up at all.

(As an abstract analogy: imagine that you need to color the space bounded by some 2D curve. In one case, you can take a pencil and do it directly. In another case, you have a collection of cutouts of geometric figures, and you have to fill the area by assembling a collage. If you have a sufficiently rich collection of figures, you can come arbitrarily close; but the "bottom-up" approach is strictly better. In particular, it can handle arbitrarily complicated shapes out-of-the-box, whereas the second approach would require dramatically bigger collections the more complicated the shapes get.)

Edit: Or so my current "bearish on LLMs" model goes. The performance of o3 or GPT-5/6 can very much break it, and the actual mechanisms described are necessarily speculative and tentative.

  1. ^

    Under this toy model, it needn't have saturated around this level; it could've comfortably overshot human capabilities. But this doesn't seem to be what's happening, likely due to some limitation of the current paradigm not covered by this model.

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-09T12:51:10.287Z · LW · GW

I expect this is the sort of thing that can be disproven (if LLM-based AI agents actually do start displacing nontrivial swathes of e. g. non-entry-level SWE workers in 2025-2026), but only "proven" gradually (if "AI agents start displacing nontrivial swathes of some highly skilled cognitive-worker demographic" continually fails to happen year after year after year).

Overall, operationalizing bets/empirical tests about this has remained a cursed problem.

Edit:

As a potentially relevant factor: Were you ever surprised by how unbalanced the progress and the adoption have been? The unexpected mixes of capabilities and incapabilities that AI models have displayed?

My current model is centered on trying to explain this surprising mix (top-tier/superhuman benchmark performance vs. frequent falling-flat-on-its-face real-world performance). My current guess is basically that all capabilities progress has been effectively goodharting on legible performance (benchmarks and their equivalents) while making ~0 improvement on everything else. Whatever it is that benchmarks and benchmark-like metrics are measuring, it's not what we think it is.

So what we will always observe is AI getting better and better at any neat empirical test we can devise, always seeming on the cusp of being transformative, while continually and inexplicably failing to tilt over into actually being transformative. (The actual performance of o3 and GPT-5/6 would be a decisive test of this model for me.)

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-08T17:39:50.318Z · LW · GW

I could imagine Ilya's claim making sense, e.g. if the "experiments" he's talking about are experiments in using the net rather than training the net

What I had in mind is something along these lines. More capable models[1] have various emergent properties. Specific tricks can rely on those properties being present, and work better or worse depending on that.

For example, the o-series training loop probably can't actually "get off the ground" if the base model is only as smart as GPT-2: the model would ~never find its way to correct answers, so it'd never get reinforcement signals. You can still force it to work by sampling a billion guesses or by starting it with very easy problems (e. g., basic arithmetics?), but it'd probably deliver much less impressive results than applying it to GPT-4.
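(A toy back-of-envelope for that intuition; the per-sample success rates are made-up illustrative numbers:)

```python
# With k sampled chains-of-thought per problem, the chance of getting at least
# one correct answer (and hence any reward signal at all) is 1 - (1 - p)**k.
def p_any_reward(p_correct: float, k: int) -> float:
    return 1 - (1 - p_correct) ** k

for label, p in [("GPT-2-ish base", 1e-6), ("GPT-4-ish base", 0.05)]:
    print(label, [round(p_any_reward(p, k), 4) for k in (1, 64, 1024)])
# The weak base model almost never produces a learnable signal, even with a
# thousand samples per problem; the strong one does so routinely.
```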

Scaling further down: I don't recall if GPT-2 can make productive use of CoTs, but presumably e. g. GPT-1 can't. At that point, the whole "do RL on CoTs" completely ceases to be a meaningful thing to try.

Generalizing: At a lower level of capabilities, there's presumably a ton of various tricks that deliver a small bump to performance. Some of those tricks would have an effect size comparable to RL-on-CoTs-if-applied-at-this-scale. But out of a sea of those tricks, only a few of them would be such that their effectiveness rises dramatically with scale.

So, a more refined way to make my points would be:

  • If a trick shows promise at a small capability level, e. g. improving performance 10%, it doesn't mean it'd show a similar 10%-improvement if applied at a higher capability level.
    • (Say, because it addresses a deficiency that a big-enough model just doesn't have/which a big-enough pretraining run solves by default.)
  • If a trick shows marginal/no improvement at a small capability level, that doesn't mean it won't show a dramatic improvement at a higher capability level.

a few studies (like e.g. this) which try to run yesterday's algorithms with today's scale, and today's algorithms with yesterday's scale

My guess, based on the above, would be that even if today's algorithms perform better than yesterday's algorithms at smaller scales, the difference between their small-scale capabilities is less than the difference between yesterday's algorithms and today's algorithms at bigger scales. I. e.: some algorithms make nonlinearly better use of compute, such that figuring out which tricks are the best is easier at larger scales. (Telling apart a 5% capability improvement from an 80% one.)

  1. ^

    Whether they're more capable by dint of being bigger (GPT-4), or being trained on better data (Sonnet 3.5.1), or having a better training loop + architecture (DeepSeek V3), etc.

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-08T07:23:13.705Z · LW · GW

Yup, those two do seem to be the cruxes here.

I was trying to argue (among other things) that scaling up basically current methods could result in an increase in productivity among OpenAI capabilities researchers at least equivalent to the productivity you'd get as if the human employees operated 10x faster

You're right, that's a meaningfully different claim and I should've noticed the difference.

I think I would disagree with it as well. Suppose we break up this labor into, say,

  1. "Banal" software engineering.
  2. Medium-difficult systems design and algorithmic improvements (finding optimizations, etc.).
  3. Coming up with new ideas regarding how AI capabilities can be progressed.
  4. High-level decisions regarding architectures, research avenues and strategies, etc. (Not just inventing transformers/the scaling hypothesis/the idea of RL-on-CoT, but picking those approaches out of a sea of ideas, and making the correct decision to commit hard to them.)

In turn, the factors relevant to (4) are:

  • (a) The serial thinking of the senior researchers and the communication/exchange of ideas between them.
    • (Where "the senior researchers" are defined as "the people with the power to make strategic research decisions at a given company".)
  • (b) The outputs of significant experiments decided on by the senior researchers.
  • (c) The pool of untested-at-large-scale ideas presented to the senior researchers.

Importantly, in this model, speeding up (1), (2), (3) can only speed up (4) by increasing the turnover speed of (b) and the quality of (c). And I expect that non-AGI-complete AI cannot improve the quality of ideas (3) and cannot directly speed up/replace (a)[1], meaning any acceleration from it can only come from accelerating the engineering and the optimization of significant experiments.

Which, I expect, are in fact mostly bottlenecked by compute, and 10x'ing the human-labor productivity there doesn't 10x the overall productivity of the human-labor input; it remains stubbornly held up by (a). (I do buy that it can significantly speed it up, say 2x it. But not 10x it.)
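(A crude Amdahl's-law-style sketch of that claim; the work fractions below are made-up assumptions, not estimates of any lab's actual workflow:)

```python
# If a fraction f of research progress can be accelerated by a factor s, and
# the rest (senior-researcher serial thinking, big experiments) cannot, the
# overall speedup is 1 / ((1 - f) + f / s).
def overall_speedup(f: float, s: float) -> float:
    return 1 / ((1 - f) + f / s)

# Suppose 60% of progress is engineering/small experiments that AI speeds up
# 10x, and 40% is bottlenecked on (a)/(b):
print(overall_speedup(0.6, 10))   # ~2.2x, not 10x
print(overall_speedup(0.9, 10))   # even at 90%, only ~5.3x
```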

Separately, I'm also skeptical that near-term AI can speed up the nontrivial engineering involved in medium-difficult systems design and the management of significant experiments:

Stepping back from engineering vs insights, my sense is that it isn't clear that the AIs will be terrible at insights or broader context. So, I think it will probably be more like they are very fast engineers and ok at experimental direction. Being ok helps a bunch by avoiding the need for human intervention at many points.

It seems to me that AIs have remained stubbornly terrible at this from GPT-3 to GPT-4 to Sonnet 3.5.1 to o1[2]; that the improvement on this hard-to-specify quality has been ~0. I guess we'll see if o3 (or an o-series model based on the next-generation base model) change that. AI does feel right on the cusp of getting good at this...

... just as it felt at the time of GPT-3.5, and GPT-4, and Sonnet 3.5.1, and o1. That just the slightest improvement along this axis would allow us to plug the outputs of AI cognition into its inputs and get a competent, autonomous AI agent.

And yet here we are, still.

It's puzzling to me and I don't quite understand why it wouldn't work, but based on the previous track record, I do in fact expect it not to work.

  1. ^

    In other words: If an AI is able to improve the quality of ideas and/or reliably pluck out the best ideas from a sea of them, I expect that's AGI and we can throw out all human cognitive labor entirely.

  2. ^

    Arguably, no improvement since GPT-2; I think that post aged really well.

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-07T16:58:08.748Z · LW · GW

I think the big advancements require going further afield, outside the current search-space of the major players.

Oh, I very much agree. But any associated software engineering and experiments would then be nontrivial, ones involving setting up a new architecture, correctly interpreting when it's not working due to a bug vs. because it's fundamentally flawed, figuring out which tweaks are okay to make and which tweaks would defeat the point of the experiment, et cetera. Something requiring sophisticated research taste; not something you can trivially delegate-and-forget to a junior researcher (as per @ryan_greenblatt's vision). (And importantly, if this can be delegated to (AI models isomorphic to) juniors, this is something AGI labs can already do just by hiring juniors.)

Same regarding looking for clues in neuroscience/computer-science literature. In order to pick out good ideas, you need great research taste and plausibly a bird's eye view on the entire hardware-software research stack. I wouldn't trust a median ML researcher/engineer's summary; I would expect them to miss great ideas while bringing slop to my attention, such that it'd be more time-efficient to skim over the literature myself.

In addition, this is likely also where "95% of progress comes from the ability to run big experiments" comes into play. Tons of novel tricks/architectures would perform well at a small scale and flounder at a big scale, or vice versa. You need to pick a new approach and go hard on trying to make it work, not just lazily throw an experiment at it. Which is something that's bottlenecked on the attention of a senior researcher, not a junior worker.

 

Overall, it sounds as if... you expect dramatically faster capabilities progress from the AGI labs pivoting towards exploring a breadth of new research directions, with the whole "AI researchers" thing being an unrelated feature? (They can do this pivot with or without them. And as per the compute-constraints arguments, borderline-competent AI researchers aren't going to nontrivially improve on the companies' ability to execute this pivot.)

Comment by Thane Ruthenis on What Indicators Should We Watch to Disambiguate AGI Timelines? · 2025-01-07T07:31:43.760Z · LW · GW

I'm very skeptical of AI being on the brink of dramatically accelerating AI R&D.

My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute. See Ilya Sutskever's claim here:

95% of progress comes from the ability to run big experiments quickly. The utility of running many experiments is much less useful.

What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model. If you pick the trick wrong, you ruin the training run, which (a) potentially costs millions of dollars, (b) wastes the ocean of FLOP you could've used for something else.

And picking the correct trick is primarily a matter of research taste, because:

  • Tricks that work on smaller scales often don't generalize to larger scales.
  • Tricks that work on larger scales often don't work on smaller scales (due to bigger ML models having various novel emergent properties).
  • Simultaneously integrating several disjunctive incremental improvements into one SotA training run is likely nontrivial/impossible in the general case.[1]

So 10x'ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.

And, on top of that, I expect that AGI labs don't actually have the spare compute to do that 10x'ing. I expect it's all already occupied 24/7 running all manners of smaller-scale experiments, squeezing whatever value out of them that can be squeezed out. (See e. g. Superalignment team's struggle to get access to compute: that suggests there isn't an internal compute overhang.)

Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.

As a separate line of argumentation: Suppose that, as you describe it in another comment, we imagine that AI would soon be able to give senior researchers teams of 10x-speed 24/7-working junior devs, to whom they'd be able to delegate setting up and managing experiments. Is there a reason to think that any need for that couldn't already be satisfied?

If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers. They have vast amounts of money now, and LLM-based coding tools seem competent enough to significantly speed up a human programmer's work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.

In addition: the academic-research and open-source communities exist, and plausibly also fill the niche of "a vast body of competent-ish junior researchers trying out diverse experiments". The task of keeping senior researchers up-to-date on openly published insights should likewise already be possible to dramatically speed up by tasking LLMs with summarizing them, or by hiring intermediary ML researchers to do that.

So I expect the market for mid-level software engineers/ML researchers to be saturated.

So, summing up:

  • 10x'ing the ability to run small-scale experiments seems low-value, because:
    • The performance of a trick at a small scale says little (one way or another) about its performance on a bigger scale.
    • Integrating a scalable trick into the SotA-model tech stack is highly nontrivial.
    • Most of the value and insight comes from full-scale experiments, which are bottlenecked on compute and senior-researcher taste.
  • AI likely can't even 10x small-scale experimentation, because that's also already bottlenecked on compute, not on mid-level engineer-hours. There's no "compute overhang"; all available compute is already in use 24/7.
    • If it weren't the case, there's nothing stopping AGI labs from hiring mid-level engineers until they are no longer bottlenecked on their time; or tapping academic research/open-source results.
    • AI-based engineers would plausibly be less efficient than human engineers, because their inference calls would cut into the compute that could instead be spent on experiments.
  • If so, then AI R&D is bottlenecked on research taste, system-design taste, and compute, and there's relatively little non-AGI-level models can contribute to it. Maybe a 2x speed-up, at most, somehow; not a 10x'ing.

(@Nathan Helm-Burger, I recall you're also bullish on AI speeding up AI R&D. Any counterarguments to the above?)

  1. ^

    See the argument linked in the original post, that training SotA models is an incredibly difficult infrastructural problem that requires reasoning through the entire software-hardware stack. If you find a promising trick A that incrementally improves performance in some small setup, and you think it'd naively scale to a bigger setup, you also need to ensure it plays nice with tricks B, C, D.

    For example, suppose that using A requires doing some operation on a hidden state that requires that state to be in a specific representation, but there's a trick B which exploits a specific hardware property to dramatically speed up backprop by always keeping hidden states in a different representation. Then you need to either throw A or B out, or do something non-trivially clever to make them work together.
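(A cartoon of that kind of conflict, with made-up specifics; this is not a claim about any particular lab's stack:)

```python
import torch

# Trick B: keep hidden states in bfloat16 to exploit fast matmul hardware.
hidden = torch.randn(4, 1024, dtype=torch.bfloat16)

# Trick A: some numerically sensitive operation that assumes float32 states.
def trick_a(h: torch.Tensor) -> torch.Tensor:
    return h / h.norm(dim=-1, keepdim=True)

# Composing them naively either errors out or quietly degrades A's precision;
# making them coexist means explicit casts, which can erase B's speed win.
out = trick_a(hidden.float()).to(torch.bfloat16)
print(out.dtype, out.shape)
```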

    And then it's a thousand little things like this; a vast Spaghetti Tower such that you can't improve on a small-seeming part of it without throwing a dozen things in other places in disarray. (I'm reminded of the situation in the semiconductor industry here.)

    In which case, finding a scalable insight isn't enough: even integrating this insight requires full end-to-end knowledge of the tech stack and sophisticated research taste; something only senior researchers have.

Comment by Thane Ruthenis on Shortform · 2025-01-06T10:45:23.449Z · LW · GW

We don't know yet. I expect so.