Posts

Ideas for benchmarking LLM creativity 2024-12-16T05:18:55.631Z
"Can AI Scaling Continue Through 2030?", Epoch AI (yes) 2024-08-24T01:40:32.929Z
"On the Impossibility of Superintelligent Rubik’s Cube Solvers", Claude 2024 [humor] 2024-06-23T21:18:10.013Z
FHI (Future of Humanity Institute) has shut down (2005–2024) 2024-04-17T13:54:16.791Z
Douglas Hofstadter changes his mind on Deep Learning & AI risk (June 2023)? 2023-07-03T00:48:47.131Z
COVID-19 Group Testing Post-mortem? 2022-08-05T16:32:55.157Z
Emergent Ventures/Schmidt (new grantor for individual researchers) 2022-04-09T14:41:05.764Z
Fake Journal Club proposal 2022-03-25T14:23:18.785Z
It Looks Like You're Trying To Take Over The World 2022-03-09T16:35:35.326Z
Capability Phase Transition Examples 2022-02-08T03:32:54.551Z
"Summarizing Books with Human Feedback" (recursive GPT-3) 2021-11-15T17:41:53.189Z
EfficientZero: human ALE sample-efficiency w/MuZero+self-supervised 2021-11-02T02:32:41.856Z
My ML Scaling bibliography 2021-10-23T14:41:45.170Z
AlphaFold 2 paper released: "Highly accurate protein structure prediction with AlphaFold", Jumper et al 2021 2021-07-15T19:27:20.584Z
May 2021 Gwern.net newsletter 2021-06-11T14:13:18.485Z
"Decision Transformer" (Tool AIs are secret Agent AIs) 2021-06-09T01:06:57.937Z
April 2021 Gwern.net newsletter 2021-06-03T15:13:29.138Z
gwern's Shortform 2021-04-24T21:39:14.128Z
March 2021 gwern.net newsletter 2021-04-06T14:06:20.198Z
February 2021 gwern.net newsletter 2021-03-13T14:57:54.645Z
January 2021 gwern.net newsletter 2021-02-04T20:12:39.555Z
December 2020 gwern.net links 2021-01-10T17:21:40.756Z
November 2020 gwern.net newsletter 2020-12-03T22:47:16.917Z
October 2020 gwern.net newsletter 2020-11-01T21:38:46.795Z
/r/MLScaling: new subreddit for NN scaling research/discussion 2020-10-30T20:50:25.973Z
"Scaling Laws for Autoregressive Generative Modeling", Henighan et al 2020 {OA} 2020-10-29T01:45:30.666Z
September 2020 gwern.net newsletter 2020-10-26T13:38:51.107Z
August 2020 gwern.net newsletter 2020-09-01T21:04:58.299Z
July 2020 gwern.net newsletter 2020-08-20T16:39:27.202Z
June 2020 gwern.net newsletter 2020-07-02T14:19:08.696Z
GPT-3 Fiction Samples 2020-06-25T16:12:05.422Z
May Gwern.net newsletter (w/GPT-3 commentary) 2020-06-02T15:40:37.155Z
OpenAI announces GPT-3 2020-05-29T01:49:04.855Z
"AI and Efficiency", OA (44✕ improvement in CNNs since 2012) 2020-05-05T16:32:20.335Z
April 2020 gwern.net newsletter 2020-05-01T20:47:44.867Z
March 2020 gwern.net newsletter 2020-04-03T02:16:02.871Z
February 2020 gwern.net newsletter 2020-03-04T19:05:16.079Z
January 2020 gwern.net newsletter 2020-01-31T18:04:21.945Z
Subscripting Typographic Convention For Citations/Dates/Sources/Evidentials: A Proposal 2020-01-08T22:20:20.290Z
Dec 2019 gwern.net newsletter 2020-01-04T20:48:48.788Z
Nov 2019 gwern.net newsletter 2019-12-02T21:16:04.846Z
October 2019 gwern.net newsletter 2019-11-14T20:26:34.236Z
September 2019 gwern.net newsletter 2019-10-04T16:44:43.147Z
"AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence", Clune 2019 2019-09-10T21:33:08.837Z
August 2019 gwern.net newsletter (popups.js demo) 2019-09-01T17:52:01.011Z
"Designing agent incentives to avoid reward tampering", DeepMind 2019-08-14T16:57:29.228Z
July 2019 gwern.net newsletter 2019-08-01T16:19:59.893Z
How Should We Critique Research? A Decision Perspective 2019-07-14T22:51:59.285Z
June 2019 gwern.net newsletter 2019-07-01T14:35:49.507Z
On Seeing Through 'On Seeing Through: A Unified Theory': A Unified Theory 2019-06-15T18:57:25.436Z

Comments

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-02-19T02:03:57.503Z · LW · GW

"Overtraining" isn't Chinchilla; Chinchilla is just "training". The overtraining being advocated was supra-Chinchilla, with the logic that while you were going off the compute-optimal training, sure, you were more than making up for it by your compute-savings in the deployment phase, which the Chinchilla scaling laws do not address in any way. So there was a fad for training small models for a lot longer.

Comment by gwern on Florian_Dietz's Shortform · 2025-02-17T23:57:34.027Z · LW · GW

The pondering happens in earlier layers of the network, not in the output

Then how does it produce any tokens...?

then training on task Y could inadvertently bias the model to do more or less pondering on mostly-unrelated-but-statistically-correlated topic X.

But if that is what is going on and it accidentally learns to ponder initially due to bogus feedback or error, then eventually the spurious correlation should be figured out: the model does the pondering more, but that doesn't increase reward, and so the pondering gets unlearned.

(Also, this assumes that RL gives an average reward of 0.0, which I don't know if that's true in practice.)

I think the mean would be taken out by the advantage estimation, so RLHF continues to increase the probability of the tokens generated in episodes with above-average reward, and to decrease the probability of the tokens from episodes with below-average reward. This is in effect as if the average reward were always 0.
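
A toy numeric illustration of that mean-centering (just numpy arithmetic, not any particular lab's RLHF code): subtracting the batch-mean baseline leaves zero-mean advantages, so above-average episodes are reinforced and below-average ones suppressed, regardless of the raw reward scale.

```python
# Toy illustration: baseline subtraction makes per-episode advantages zero-mean.
import numpy as np

rewards = np.array([0.2, 0.9, 0.4, 0.7])  # per-episode rewards, all positive
advantages = rewards - rewards.mean()     # subtract the mean baseline
print(advantages)                         # [-0.35  0.35 -0.15  0.15]; sums to 0
```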

What would be the implications? The model could develop a political bias to think more deeply about topics related to party X, where X is whatever party has more users giving the model positive feedback. Even if the other topics on party X's agenda are never explicitly talked about (!)

That sounds like the pondering's conclusions are then related to the task.

Comment by gwern on Florian_Dietz's Shortform · 2025-02-17T22:21:46.848Z · LW · GW

This idea could very well be wrong. The gradients may be weakened during backpropagation before they get to the unrelated ideas, because the ideas did not directly contribute to the task.

Under a straightforward RLHF using PPO, I think there wouldn't be much weakening because the REINFORCE operator conceptually simply rewards (or punishes) all tokens generated during an episode, without making much attempt to decide which were 'good' or 'bad'. (That's why it's so high variance.) Any advantage function trying to remove some of the variance probably won't do a good job.
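
As a minimal sketch of what I mean by episode-level credit assignment (a toy REINFORCE-style loss, not anyone's actual RLHF implementation), every token log-probability in an episode gets scaled by the same whole-episode advantage, with no attempt to single out which tokens deserved it:

```python
# Toy REINFORCE-style loss: one advantage per episode, applied to every token in it.
import torch

def reinforce_loss(token_logprobs: torch.Tensor, episode_advantages: torch.Tensor) -> torch.Tensor:
    # token_logprobs: [n_episodes, n_tokens]; episode_advantages: [n_episodes]
    per_episode = token_logprobs.sum(dim=1)             # credit the whole transcript at once
    return -(episode_advantages * per_episode).mean()   # ascend the advantage-weighted log-likelihood
```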

More problematically for your idea, if the conclusions are indeed 'unrelated to the task', then shouldn't they be just as likely to arise in every episode - including the ones where it got negative reward? That would seem like it ought to exactly cancel out any learning of 'pondering'.

You need some incentive somewhere to learn good 'pondering'. (I have an example proposal for 'free play' which tries to teach a sort of 'pondering', but by stopping gradients, so anything learned in the initial steps is 'free', and so it can meta-learn to screw around and get something useful for free.)

Comment by gwern on shortplav · 2025-02-17T17:45:03.509Z · LW · GW

Maybe it would look more random if you presented it segmented by token instead of translated into characters? I'm not familiar with the LLaMA tokenizations, but you seem to imply that a lot of the apparent patterns here are single tokens (like "partiellement" would be very surprising to me as the output of a greedy likelihood-minimizing sampling, but is trivial if it is a single BPE token). This would create a misleading impression of coherence.

Also, as Baginski notes, greedy sampling to minimize likelihood will not minimize total likelihood, any more than greedily maximizing likelihood would maximize total likelihood. So it would be worth trying at least 'worst-of-n' sampling to see if it looks more like what you expect, in the same way that best-of-n often helps produce more expected LLM output. (After all, you would expect the tiniest logits to be the worst estimated of all logits, right? Full of sheer numerical noise and error, given that this is pushing 'dark knowledge' to its extremes. Who can really say how much better or worse an answer, exactly, '衡' is than 'д' when following '*', etc.? So if best-of-n can make such a qualitative difference when greedily sampling from the best-estimated logits...)
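
Concretely, 'worst-of-n' here just means sampling n completions as usual and keeping the one the model assigns the lowest total log-likelihood, mirroring best-of-n; a sketch with a toy stand-in model (a real version would score candidates with the LLM's own logits):

```python
# Sketch of worst-of-n sampling with a toy stand-in "model" so it runs as-is.
import math, random

def worst_of_n(sample_fn, logprob_fn, n=16):
    """Sample n candidates and return the one the model considers least likely."""
    candidates = [sample_fn() for _ in range(n)]
    return min(candidates, key=logprob_fn)

probs = {"a": 0.7, "b": 0.2, "c": 0.1}  # toy unigram "model"
sample_fn = lambda: "".join(random.choices(list(probs), weights=list(probs.values()), k=8))
logprob_fn = lambda s: sum(math.log(probs[ch]) for ch in s)
print(worst_of_n(sample_fn, logprob_fn))  # tends to return the c-heavy (least likely) strings
```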

Comment by gwern on $300 Fermi Model Competition · 2025-02-17T01:54:14.078Z · LW · GW

Note that text in pretraining may even be an expensive way to go about it: one of the most dramatic demonstrations MS gave us with Sydney was the incredible speed & efficiency of web-search-powered adversarial attacks on LLMs. You don't need to dump a lot of samples onto the Internet and pray they make it into the training data and don't get forgotten, if you can set up a single sample with good SEO and the LLM kindly retrieves it for you and attacks itself with your sample.

This is something to think about: it's not just making it into the training data, it's making it into the agent's prompt or context that can matter. People are currently talking about how Deep Research is an example of the AI trend which will drive paywalls everywhere... which may happen, but consider the positives for people who don't put up paywalls.

Comment by gwern on CstineSublime's Shortform · 2025-02-12T17:30:59.830Z · LW · GW

Why not just 'valuable information', in a Value of Information sense of 'valuable'?

Comment by gwern on nikola's Shortform · 2025-02-10T03:00:19.057Z · LW · GW

The estimate of the compute of their largest version ever (which is a very helpful way to phrase it) at only <=50x GPT-4 is quite relevant to many discussions (props to Nesov) and something Altman probably shouldn't've said.

The estimate of test-time compute at 1000x effective-compute is confirmation of looser talk.

The scientific research part is of uncertain importance but we may well be referring back to this statement a year from now.

Comment by gwern on Inference cost limits the impact of ever larger models · 2025-02-07T21:37:38.774Z · LW · GW

Apropos of very low-latency LLMs and revisiting this topic a little: what does this imply about DRL robotics, rather than animals? Will DRL NNs have to have brains as big as humans in order to run superhuman humanoid robots?

One possible implication is that Portia-like NNs are possible for robotics in general. Robotics may be quite 'easy' in that sense.

It is striking that when we look at NN parameter/FLOPS-counts, we generally do not see 'large' robotics, vision, or sound models, but LLMs; the largest pure-vision models like PaLI-X are <100b-parameters, the largest robotics are usually <10b, with Gato 1's ~1b having been, if anything, unusually large because of all the other stuff it was doing. (I'm very behind on the robotics literature so maybe there are now much larger 100b-parameter models as they move into the 'foundation model' multi-modal/task scaling paradigm, but I'd bet that there still are none >1,000b.) Even sound/image/video generative models, which would be expected to be much larger than necessary for robotics tasks, are often small enough to run on a single consumer GPU, still. And these are usually trained with scaling laws now, so these are compute-optimal sizes and it is not just that they are wildly under-parameterized (the way almost all models were pre-2020).

So, if robotics is intrinsically easy, but animal brains do not show this because of their latency requirements, which forces them into misleadingly expensive brains, the implication is that we can do robotics by lifting the limitations of biological brains, like being forced to learn in realtime, in the real world, one animal at a time, without any sharing.

We should be able to train deep but small NNs in silico: turning all animal problems into Portia problems, if you will, pausing the simulation to let the NNs think & act for as long as necessary to plan the right action, and only then letting time flow to see what happens, and reset it to try again.

We remove all burdens of wallclock time or caloric consumption or childhood development to train powerful general robotic controllers, and only then use these teacher-models to optimize low-latency controllers. The wider low-latency student models will be easier to train when they simply must imitate the teacher in a supervised-learning setting instead of doing RL from scratch, and so their sizes should be a lot more favorable. (If nothing else, the student models can't 'die' if they make a mistake like breaking a latency constraint, so this learning setting is far easier than an animal's task.)


On a related note, it is also striking how far down in size LLMs can be pushed. You can get good reasoning out of tiny billion-parameter LLMs trained hard enough on high-quality-enough data, and the 'densifying experience curve' is steady and rapid (halving period of ~4 months), so we can expect that at some point we may have superhuman reasoning LLMs in the billion or sub-billion parameter range... which are just very, very ignorant, perhaps even more ignorant than you or me, of all the real-world knowledge & text that a proper LLM has. We can't train those from scratch, but we can train trillion-parameter LLMs to suck in all the text in the world, and then exhale training data for small fast cheap models.

So it seems that Moravec's Paradox remains undefeated: as difficult as we find the abstract intellectual capabilities like the process of doing math or reasoning, so difficult we struggle to even write them down to train LLMs on, so difficult to train on we need giant gigawatt datacenters to just get started, they are not intrinsically difficult and in the long run, do not require big expensive NNs.

Comment by gwern on OpenAI releases deep research agent · 2025-02-03T16:54:53.780Z · LW · GW

But does that necessarily matter? Many of those models can't use tools; and since much of the point of the end-to-end RL training of Deep Research is to teach tool use, showing DR results without tool use would be either irrelevant or misleading (eg. it might do worse than the original o3 model it is trained from, when deprived of the tools it is supposed to use).

Comment by gwern on The Game Board has been Flipped: Now is a good time to rethink what you’re doing · 2025-02-03T02:15:12.386Z · LW · GW

Who right now is standing on the sidelines with a killer AI app that could rip up the market if only tokens were a bit cheaper?

OpenAI's Deep Research is looking like something that could be big and they were standing on the sidelines in part because the tokens weren't cheap.

Comment by gwern on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2025-02-03T00:38:27.747Z · LW · GW

Most people do not read many books or spend time in spaces where SAT vocab words would be used at all. If that were the sole determinant, you would then expect any vocab test to fail catastrophically and not predict/discriminate in most of the population (which would have downstream consequences like making SATs weirdly unreliable outside the elite colleges or having much less predictive validity for low-performing demographics, the former of which I am unaware of being true and the latter of which I know is false). This would further have the surprising consequence that if a vocab test is, say, r = 0.5 with g while failing catastrophically on most of the population, it would have to be essentially perfectly correlated, r = 1, in the remainder to even be arithmetically possible - which just punts the question: how did two book-readers come away from that book with non-overlapping vocabs...?
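
A quick simulation of that mixture arithmetic (toy assumptions: normally-distributed g, equal group means, an arbitrary 20% of 'book-readers' who get a perfect signal from the test, pure noise for everyone else):

```python
# Toy simulation: even r = 1 within a 20% minority cannot produce a pooled r of 0.5.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
g = rng.standard_normal(n)              # latent ability
reader = rng.random(n) < 0.20           # only 20% get any real signal from the test

score = np.where(reader, g, rng.standard_normal(n))  # perfect signal vs pure noise

print(np.corrcoef(g, score)[0, 1])                   # pooled r ~ 0.2
print(np.corrcoef(g[reader], score[reader])[0, 1])   # within-reader r = 1.0
```

Even granting the minority a perfect within-group correlation, the pooled r comes out around 0.2, nowhere near an observed ~0.5 - which is the arithmetic impossibility being pointed at.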

I have good vocabulary, e.g. 800 on GRE verbal, but feel like I have a pretty bad memory for words and terms that I’ve only seen a few times.

How could you possibly know something like that?

Comment by gwern on 5,000 calories of peanut butter every week for 3 years straight · 2025-02-01T03:57:50.794Z · LW · GW

One benefit of his 'no-nut January' is that by cutting out peanuts entirely, he's also avoiding problems from oxalates. I would expect powdered peanut butter to be as dangerous in that regard.

Comment by gwern on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2025-02-01T00:51:01.744Z · LW · GW

And yet, despite the SAT being so heavily studied for, it remains a pretty good IQ test overall, and the SAT-V or GRE verbal sections remain OK ones. I think that's because there are so many words (500k+ in English, and the GRE-V has no compunction about mining the obscurest just to f--- with you), and you would have to study so many in order to meaningfully inflate your scores (because, after all, while there may be only a hundred 'vocab words' on any given SAT test, you don't know which hundred). Let's see... Here's an interesting-looking reference: "How Many Words Do We Know? Practical Estimates of Vocabulary Size Dependent on Word Definition, the Degree of Language Input and the Participant’s Age", Brysbaert et al 2016:

an average 20-year-old native speaker of American English knows 42,000 lemmas and 4,200 non-transparent multiword expressions, derived from 11,100 word families. The numbers range from 27,000 lemmas for the lowest 5% to 52,000 for the highest 5%. Between the ages of 20 and 60, the average person learns 6,000 extra lemmas or about one new lemma every 2 days.

So, if you wanted to boost your score from the mean to the 95th percentile, that seems to imply that you'd have to memorize 10,000 'lemmas' ("Uninflected word from which all inflected words are derived"). That's a big number, and then you have to ask how much work that would be.

Suppose you did this in the optimal way with spaced repetition (ignoring the time it takes to figure out the 10k you want to memorize in the first place, or the time to construct the flashcards, or any penalty from needing to inefficiently cram them for an upcoming SAT instead of life-long efficient review) - which of course few students do, as spaced repetition systems remain a niche outside of medical school & foreign language study. The SuperMemo rough estimate is a long-term investment of 5 minutes per flashcard, and we'll assume 1 lemma = 1 flashcard. That means you have to invest 10,000 × 5 = 50,000 minutes, or 833 hours, of studying! Meanwhile, hardly anyone is doing more than 8 hours of studying for the SAT as a whole (among the kids I knew at a prep high school, many didn't even do a weekend course, which would entail about 8 hours of classwork & study). 833 hours for vocab alone would be insane.

That's why people generally learn vocab from passive exposure rather than targeted study. Because no one, not even the most teacher's-pet student, wants to do that. And so vocab measures keep working.

Comment by gwern on sarahconstantin's Shortform · 2025-01-30T20:28:23.160Z · LW · GW

then I think it is also very questionable whether the AI that wins wars is the most "advanced" AI. / People like Dario whose bread-and-butter is model performance invariably over-index on model performance, especially on benchmarks. But practical value comes from things besides the model; what tasks you use it for and how effective you are at deploying it.

Dario is about the last AI CEO you should be making this criticism of. Claude has been notable for a while as the model which somehow winds up being the most useful and having the best 'vibes', even when the benchmarks indicate it's #2 or #3; meanwhile, it is the Chinese models which historically regress the most from their benchmarks when applied (and DeepSeek models, while not as bad as the rest, still do this, and r1 is already looking shakier as people try out held-out problems or benchmarks).

Comment by gwern on Jesse Hoogland's Shortform · 2025-01-28T18:39:11.262Z · LW · GW

Only if you ignore that yesterday was when the Trump GPU tariffs would also be leaking and, pace event-studies, be expected to be changing prices too.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-28T16:09:55.859Z · LW · GW

It's not RL, but what is RL any more? It's becoming blurry. They don't reward or punish it for anything in the thought token. So it learns thoughts that are helpful in outputting the correct answer.

That's definitely RL (and what I was explaining was simply the obvious basic approach anyone in DRL would think of in this context and so of course there is research trying things like it). It's being rewarded for a non-differentiable global loss where the correct alternative or answer or label is not provided (not even information of the existence of a better decision) and so standard supervised learning is impossible, requiring exploration. Conceptually, this is little different from, say, training a humanoid robot NN to reach a distant point in fewer actions: it can be a hard exploration problem (most sequences of joint torques or actions simply result in a robot having a seizure while laying on the ground going nowhere), where you want to eventually reach the minimal sequence (to minimize energy / wear-and-tear / time) and you start by solving the problem in any way possible, rewarding solely on the final success, and then reward-shape into a desirable answer, which in effect breaks up the hard original problem into two more feasible problems in a curriculum - 'reach the target ever' followed by 'improve a target-reaching sequence of actions to be shorter'.

Comment by gwern on The present perfect tense is ruining your life · 2025-01-28T01:38:25.789Z · LW · GW

While we're at it, one example I learned afterwards was that the 'caribou randomization' story is probably bogus (excerpts):

We will show that hunters do not randomize their behavior, that caribou populations do not fluctuate according to human predation, and that scapulimancy apparently is not selected because it is ecologically advantageous. We shall also show that there is no cross-cultural evidence of divinatory random devices producing randomized subsistence behavior, but rather that people manipulate divination with the explicit or implicit intervention of personal choice.

What is particularly interesting to me is that the apparent beautiful match of this traditional hunting practice with contemporary game theory may be 'too good to be true' because it was actually the opposite: I suspect that the story was made up to launder (secret) game-theoretic work from WWII into academic writing; the original author's career & funder are exactly where that sort of submarine-warfare operations-research idea would come from... (There were many cases post-WWII of civilians carefully laundering war or classified work into publishable form, which means that any history-of-ideas has to be cautious about taking at face value anything published 1940–1960 which looks even a little bit like cryptography, chemistry, physics, statistics, computer science, game theory, or operations research.)

Comment by gwern on DAL's Shortform · 2025-01-28T01:23:05.683Z · LW · GW

Outputs of o1 don't include reasoning traces, so not particularly useful compared to outputs of chatbot models, and very expensive, so only a modest amount can be collected.

It would be more precise to say outputs of o1 aren't supposed to include the reasoning traces. But in addition to the reasoning traces OA voluntarily released, people have been observing what seem to be leaks, and given that the history of LLM robustness to jailbreaks can be summarized as 'nil', it is at least conceivable that someone used a jailbreak+API to exfiltrate a bunch of traces. (Remember that Chinese companies like ByteDance have definitely been willfully abusing the OA API for the purposes of knowledge distillation/cloning and evading bans etc, in addition to a history of extremely cutthroat tactics that FANG would blanch at, so it's a priori entirely plausible that they would do such things.)

I don't believe DeepSeek has done so, but it is technically possible. (Regardless of whether anyone has done so, it is now partially moot given that r1 traces in the DS paper, and based on third party reports thus far, work so well for distillation so everyone can kickstart their own r1-clone with r1 reasoning traces and work from there. There may be more reason to try to exfiltrate o3+ traces, but OA may also decide to not bother, as users are claiming to value and/or enjoy reading the raw traces, and since the secret & capability is out, maybe there's not much point in hiding them any longer.)

Comment by gwern on Habryka's Shortform Feed · 2025-01-27T16:23:27.065Z · LW · GW

There is also GreaterWrong, which I believe caches everything rather than passing through live, so it would be able to restore almost all publicly-visible content, in theory.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-26T16:58:52.727Z · LW · GW

Right now, it seems to be important to not restrict the transcripts at all. This is a hard exploration problem, where most of the answers are useless, and it takes a lot of time for correct answers to finally emerge. Given that, you need to keep the criteria as relaxed as possible, as they are already on the verge of impossibility.

The r1 paper, the other groups, and OAers on Twitter now seem to emphasize that the obvious, appealing approach of rewarding tokens for predicted correctness, or doing search over tokens, just doesn't work (right now). You need to 'let the LLMs yap' until they reach the final correct answer. This appears to be the reason for the bizarre non sequiturs or multi-lingual diversions in transcripts - that's just the cost of rolling out solution attempts which can go anywhere and keeping the winners. They will do all sorts of things which are unnecessary (and conversely, omit tokens which are 'necessary'). Think of it as the equivalent of how DRL agents will 'jitter' and take many unnecessary actions, because those actions don't change the final reward by more than epsilon, and the RL feedback just isn't rich enough to say 'you don't need to bounce up and down randomly while waiting for the ball to bounce back, that doesn't actually help or hurt you' (and if you try to reward-shape away those wasteful movements, you may discover your DRL agent converges to a local optimum where it doesn't do anything, ever, because the jitters served to explore the environment and find new tricks, and you made it too expensive to try useless-seeming tricks, so it never found any payoffs or laddered its way up in capabilities).

So you wouldn't want to impose constraints like 'must be 100% correct valid Lean proof'. Because it is hard enough to find a 'correct' transcript even when you don't penalize it for spending a while yapping in Japanese or pseudo-skipping easy steps by not writing them down. If you imposed constraints like that, instead of rolling out 1000 episodes and getting 1 useful transcript and the bootstrap working, you'd get 0 useful transcripts and it'd go nowhere.

What you might do is impose a curriculum: solve it any way you can at first, then solve it the right way. Once you have your o1 bootstrap working and have seen large capability gains, you can go back and retrain on the easiest problems with stricter criteria, and work your way back up through the capability levels, but now in some superior way. (In the DRL agent context, you might train to convergence and only then impose a very, very small penalty on each movement, and gradually ramp it up until the performance degrades a little bit but it's no longer jittering.) The same way you might be taught something informally, and then only much later, after you've worked with it a lot, do you go back and learn or prove it rigorously. You might impose a progressive shrinking constraint, for example, where the transcript has to be fewer tokens each time, in order to distill the knowledge into the forward passes to make it vastly cheaper to run (even cheaper, for hard problems, than simply training a small dumb model on the transcripts). You might try to iron out the irrelevancies and digressions by having a judge/critic LLM delete irrelevant parts. You might try to eliminate steganography by rewriting the entire transcript using a different model. Or you might simply prompt it to write a proof in Lean, and score it by whether the final answer validates.
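
As a concrete (purely hypothetical) version of that curriculum, the reward shaping might look something like the following, with the length penalty held at zero during the bootstrap phase and ramped up only once the model can solve problems at all:

```python
# Hypothetical reward shaping for a "solve it any way first, tighten later" curriculum.
def shaped_reward(correct: bool, transcript_tokens: int, lam: float) -> float:
    if not correct:
        return 0.0                        # sparse reward: wrong answers earn nothing
    return 1.0 - lam * transcript_tokens  # lam = 0.0 while bootstrapping; ramp up later

# Example schedule across training rounds: lam = 0.0, 0.0, 1e-5, 3e-5, 1e-4, ...
```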

Comment by gwern on What comes after Roam's renaissance? · 2025-01-26T15:44:24.614Z · LW · GW

Fernando Borretti has a good 2022 post "Unbundling Tools for Thought" I don't think I saw before, but which makes some of these points at greater length, and I largely agree with it.

Comment by gwern on Learning By Writing · 2025-01-26T01:13:05.296Z · LW · GW

Holden was previously Open Philanthropy's CEO and is now settling into his new role at Anthropic.

Wait, what? When did Holden Karnofsky go to Anthropic? Even his website doesn't mention that and still says he's at Carnegie.

Comment by gwern on Mechanisms too simple for humans to design · 2025-01-22T22:56:42.430Z · LW · GW

The shape of your face, and much else besides, will be affected by random chance and environmental influences during the process of development and growth.

The shape of your face will not be affected much by random chance and environmental influences. See: identical twins (including adopted apart).

Comment by gwern on What's Wrong With the Simulation Argument? · 2025-01-22T15:16:11.484Z · LW · GW

There are other, more interesting and important ways to use that compute capacity. Nobody sane, human or alien, is going to waste it on running a crapton of simulations.

Counterpoint: speedrunning and things like 'Twitch plays', which are some of the most popular streaming genres in existence, and exist largely because they are unimportant. A TAS speedrunner may well run millions or billions of simulations simply to try to shave off 1s from the record. (An example I like to cite uses 6 CPU-years to bruteforce NES Arkanoid to achieve nearly optimal play. Unfortunately, he doesn't provide the wallclock equivalent, but I strongly suspect that this project alone simulates more minutes of NES Arkanoid than it was ever played by humans. If not, then I'm quite sure at this point that NES Mario has been played in silico OOMs more than by humans. Plenty of projects like 'My First NEAT project' will do a few years or centuries of NES Mario.)

Comment by gwern on Anders Lindström's Shortform · 2025-01-21T23:21:22.724Z · LW · GW

Why do you think that? Softbank, MS, Oracle, OpenAI etc are not governments, and the press release is not claiming to take any government money. Not to mention, this was to a considerable extent announced a year ago.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-21T23:05:50.492Z · LW · GW

An important update: "Stargate" (blog) is now officially public, confirming earlier $100b numbers and some loose talk about 'up to $500b' being spent. Noam Brown commentary:

@OpenAI excels at placing big bets on ambitious research directions driven by strong conviction.

This is on the scale of the Apollo Program and Manhattan Project when measured as a fraction of GDP. This kind of investment only happens when the science is carefully vetted and people believe it will succeed and be completely transformative. I agree it’s the right time.

...I don’t think that’s the correct interpretation. DeepSeek shows you can get very powerful AI models with relatively little compute. But I have no doubt that with even more compute it would be an even more powerful model.

Miles Brundage:

If r1 being comparable to o1 surprised you, your mistake was forgetting the 1 part. This is the early stage of a new paradigm, and SOTA is the cheapest it will ever be.

That does NOT mean compute doesn't matter. (I've said roughly this before, but it bears repeating)

...Don't get me wrong, DeepSeek is nothing to sneeze at.

They will almost certainly get much more compute than they have now. But so will OpenAI...

And if DeepSeek keeps up via compute, that does not invalidate the original point re: compute being key.

(This is an example of why I don't expect DeepSeek to leapfrog OA/A/G/FB/xAI/SSI/et al: DS does great work, but $500b is a lot of money, and their capital disadvantage may be, if anything, bigger when you move from a raw parameter/data-scaling regime to an inference/search scaling regime. 6 million dollar training budgets aren't cool. You know what's cool? 6 million GPU training budgets...)

EDIT: the lead author, Daya Guo, of the r1 paper reportedly tweeted (before deleting):

The last work in 2024, nothing can stop us on the path to AGI, except for computational resources.

Comment by gwern on What Goes Without Saying · 2025-01-21T16:01:27.415Z · LW · GW

I'm sure it would be less flattering to me than my version, because people never remember these sorts of conversations the same way. If you think that it might not have happened like that, then just treat it as a hypothetical discussion that could have happened and ponder how contemporary Western lower-education systems can make truly transformative, rather than minor tinkering around the edges, use of AGI which preserves all existing compensation/status/prestige/job/political arrangements and which the teachers' unions and pension plans would not be implacably opposed to.

It's a good thing to think about if you are trying to gauge what sort of economic or societal changes might happen over the next decade, especially if you are trying to use that as a proxy for 'is AGI real', as so many people are. Personally, my conclusion has long been that the economy & society are so rigid that most such arrangements will remain largely intact even if they are dead men walking, and the pace of AI progress is so rapid that you should basically ignore any argument of the form 'but we still have human teachers, therefore, AGI can't be real'.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-21T03:21:06.693Z · LW · GW

My point there is that he was talking to the reasoning team pre-hiring (forget 'onboarding', who knows what that means), so they would be unable to tell him most things - including if they have a better reason than 'faith in divine benevolence' to think that 'more RL does fix it'.

Comment by gwern on What is the most impressive game LLMs can play well? · 2025-01-21T00:16:10.928Z · LW · GW

A human player beating a random player isn't two random players.

I am more interested in any direct evidence that makes you suspect LLMs are good at chess when prompted appropriately?

Well, there's the DM bullet-chess GPT as a drastic proof of concept. If you believe that LLMs cannot learn to play chess, you have to explain how things like that work.

Comment by gwern on What is the most impressive game LLMs can play well? · 2025-01-20T21:01:35.641Z · LW · GW

There should be plenty of transcripts of random algorithms as baseline versus effective chess algorithms in the training set

I wouldn't think that. I'm not sure I've seen a random-play transcript of chess in my life. (I wonder how long those games would have to be for random moves to end in checkmate?)

the prompt suggests strong play.

Which, unlike random move transcripts, is what you would predict, since the Superalignment paper says the GPT chess PGN dataset was filtered for Elo ("only games with players of Elo 1800 or higher were included in pretraining"), in standard behavior-cloning fashion.

Comment by gwern on Ericf's Shortform · 2025-01-20T20:29:25.365Z · LW · GW

Another example that reality (especially anything involving technology) is not constrained by the need to be realistic. What SF author would dare write a story with meme coins, much less one in which the meme coins involved AIs like Claude?

Comment by gwern on What is the most impressive game LLMs can play well? · 2025-01-20T20:17:59.986Z · LW · GW

Yes.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-20T20:08:27.861Z · LW · GW

Huh, so you think o1 was the process supervision reward model, and o3 is the distilled policy model to whatever reward model o1 became? That seems to fit.

Something like that, yes. The devil is in the details here.

Surely other labs will also replicate this too? Even the open source community seems close. And Silicon Valley companies often poach staff, which makes it hard to keep a trade secret. Not to mention spies.

Of course. The secrets cannot be kept, and everyone has been claiming to have cloned o1 already. There are dozens of papers purporting to have explained it. (I think DeepSeek may be the only one to have actually done so, however; at least, I don't recall offhand any of the others observing the signature backtracking 'wait a minute' interjections the way DeepSeek sees organically emerging in r1.)

But scaling was never a secret. You still have to do it. And MS has $80b going into AI datacenters this year; how much does open source (or DeepSeek) have?

It's worth pointing out that Inference-time search seems to become harder as the verifier becomes less reliable. Which means that the scaling curves we see for math and code, might get much worse in other domains.

Yes. That's why I felt skeptical about how generalizable the o1 approach is. It doesn't look like a break-out to me. I don't expect much far transfer: being really good at coding doesn't automatically make you a genius at, say, booking plane tickets. (The o1 gains are certainly not universal, the way straightforward data/parameter-scaling gains tend to be - remember that some of the benchmarks actually got worse.) I also expect the o1 approach to tend to plateau: there is no ground truth oracle for most of these things, the way there is for Go. AlphaZero cannot reward-hack the Go simulator. Even for math, where your theorem prover can at least guarantee that a proof is valid, what's the ground-truth oracle for 'came up with a valuable new theorem, rather than arbitrary ugly tautological nonsense of no value'?

So that's one of the big puzzles here for me: as interesting and impressive as o1/o3 is, I just don't see how it justifies the apparent confidence. (Noam Brown has also commented that OA has a number of unpublished breakthroughs that would impress me if I knew, and of course, the money side seems to still be flowing without stint, despite it being much easier to cancel such investments than cause them.)

Is OA wrong, or do they know something I don't? (For example, a distributional phase shift akin to meta-learning.) Or do they just think that these remaining issues are the sort of thing that AI-powered R&D can solve and so it is enough to just get really, really good at coding/math and they can delegate from there on out?

EDIT: Aidan McLaughlin has a good post back in November discussing the problems with RL and why you would not expect the o1 series to lead to AGI when scaled up in sensible ways, which I largely agree with, and says:

But, despite this impressive leap, remember that o1 uses RL, RL works best in domains with clear/frequent reward, and most domains lack clear/frequent reward.

Praying for Transfer Learning: OpenAI admits that they trained o1 on domains with easy verification but hope reasoners generalize to all domains...When I talked to OpenAI’s reasoning team about this, they agreed it was an issue, but claimed that more RL would fix it. But, as we’ve seen earlier, scaling RL on a fixed model size seems to eat away at other competencies! The cost of training o3 to think for a million tokens may be a model that only does math.

On the other hand... o3 didn't only do math, and we also know that RL systems often exhibit phase transitions in terms of meta-learning or generalization, where they overfit to narrow distributions and become superhuman experts which break if anything is even slightly different, but suddenly generalize when trained on diverse enough data, as a blessing of scale - not of data, but of data diversity - with LLMs being a major case in point, like GPT-2 -> GPT-3. Hm. This was written 2024-11-20, and McLaughlin announced 2025-01-13 that he had joined OpenAI. Hm...

Comment by gwern on The Hidden Status Game in Hospital Slacking · 2025-01-20T19:55:06.458Z · LW · GW

This post was good up until the LLM part, which is largely bullshit and applause lights which make no sense if you actually think about it (ah yes, I'm sure some 'audits' will fix this).

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-20T19:49:10.542Z · LW · GW

The current FrontierMath fracas is a case in point. Did OpenAI have to keep its sponsorship or privileged access secret? No. Surely there was some amount of money that would pay mathematicians to make hard problems, and that amount was not much different from what they did pay Epoch AI. Did that make life easier? Given the number of mathematician-participants saying they would've had second thoughts about participating had they known OA was involved, almost surely.

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-20T19:41:10.608Z · LW · GW

What Jones didn’t suggest (but gwern seems to be saying) is that you can use your search-enhanced model to produce better quality synthetic data to train a larger model on.

Jones wouldn't say that because that's just implicit in expert iteration. In each step of expert iteration, you can in theory be training an arbitrary new model from scratch to imitate the current expert. Usually you hold fixed the CNN and simply train it some more on the finetuned board positions from the MCTS, because that is cheap, but you don't have to. As long as it takes a board position, and it returns value estimates for each possible move, and can be trained, it works. You could train a larger or smaller CNN, a deeper or wider* CNN of the same size, a ViT, a RNN, a random forest... (See also 'self-distillation'.) And you might want to do this if the old expert has some built-in biases, perhaps due to path dependency, and is in a bad local optimum compared to training from a blank slate with the latest best synthetic data.
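
Schematically, the loop looks something like the sketch below (placeholder callables, not a real AlphaZero implementation); the point is that the apprentice fit in each round can be any trainable model at all, including a freshly-initialized one of a different size or architecture:

```python
# Rough sketch of expert iteration with an optionally fresh apprentice each round.
def expert_iteration(make_apprentice, self_play_with_search, n_rounds, fresh_each_round=False):
    model = make_apprentice()
    for _ in range(n_rounds):
        # The "expert" is the current model plus search; its games are the synthetic training data.
        positions, improved_targets = self_play_with_search(model)
        if fresh_each_round:
            model = make_apprentice()     # e.g. a new, larger (or smaller) architecture
        model.fit(positions, improved_targets)
    return model
```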

You can also do this in RL in general. OpenAI, for example, kept changing the OA5 DotA2 bot architecture on the fly to tweak its observations and arches, and didn't restart each time. It just did a net2net or warm initialization, and kept going. (Given the path dependency of on-policy RL especially, this was not ideal, and did come with a serious cost, but it worked, as they couldn't've afforded to train from scratch each time. As the released emails indicate, OA5 was breaking the OA budget as it was.)

Now, it's a great question to ask: should we do that? Doesn't it feel like it would be optimal to schedule the growth of the NN over the course of training in a scenario like Jones 2021? Why pay the expense of the final oversized CNN right from the start when it's still playing random moves? It seems like there ought to be some set of scaling laws for how you progressively expand the NN over the course of training before you then brutally distill it down for a final NN, where it looks like an inverted U-curve. But it's asking too much of Jones 2021 to do that as well as everything else. (Keep in mind that Andy Jones was just one guy with no budget or lab support doing this paper all on his own over, like, a summer. In terms of bang for buck, it is one of the best DL & RL papers of the past decade, and puts the rest of us to shame.)

* for latency. An early example of this is WaveNet for synthesizing audio: it was far too slow for real-time, because it was too deep. It didn't cost too many computations, but the computations were too iterative to allow generating 1s of audio in 1s of wallclock, which renders it completely useless for many purposes, as it will fall behind. But once you have a working WaveNet, you can then distill it into a very wide CNN which does much more parallel computation instead of sequential, and can keep up. It might cost more operations, it might have more parameters, but it'll generate 1s of audio in <1s of wallclock time.

Comment by gwern on What Goes Without Saying · 2025-01-20T18:25:50.983Z · LW · GW

This sort of fundamental disagreement does lead to some frustrating conversations when you are talking at cross-purposes, and where even if both of you understand the difference, one of you may be talking at a different simulacrum level.

It reminds me of a conversation I had some time back with a school principal, which went something like this: He was trying to come up with proposals for how the school system could use LLMs, and naturally asked me for ideas, as I know a lot about LLMs and we'd discussed them in the past.

I replied that it was mostly a waste of time, because there was nothing really useful he could do with LLMs/AI in general. He was surprised - hadn't I been telling him for years about AI scaling and how it was going to enable total cheating and how LLMs were already capable of doing almost all highschool level work and they were only going to keep getting better and shoot into the stratosphere and become PhD level, how could I think that LLMs were not potentially extremely useful to him?

I said that it was because to really make use of LLMs for their ostensible purpose of education, they would have to reorganize the system and fire a lot of people and replace them with LLMs, and the actual purpose of the school system was to preserve itself and provide jobs (and increasingly, simply provide 'pensions') and daycare (in that order); and so LLMs were useless to them - even if they were used for things like making reports faster to write, by Parkinson's law that would simply lead to more reports being demanded until the equilibrium was restored. If he proposed anything like that, it would be ignored at best and held against him at worst, and there were probably better things to spend his time on. (Specifically, since there was nothing he could do about AI scaling, and any adaptations in the short run would be obsolete in a few years while he still had decades to go, he should instead be thinking hard about his career and start figuring out how to skate to where the puck will be: what does a school principal do in, say, 5 years' time, when there is AGI and everything he does on his computer can be done better for pennies a day?)

He hotly denied this as a load of overly-cynical tosh: schools and education are about learning! I knew he knew better than the official cant (he was too good at the bureaucracy to really believe that), and after a lot of arguing, I finally asked him what would happen if LLMs could replace him and all the teachers completely - would he propose they do that and turn in his resignation letter? He admitted that he would not, and at last conceded that a major purpose of the system was also to provide make-work welfare positions for the locals. (The local peoples have high unemployment, poverty, and alcoholism rates and cannot find any meaningful employment in the private sector.) I noted that given that purpose, LLMs (or any 'efficiency' improvement at all) could not offer any large gains, because what LLMs can do is what those people do, and what those people did was already superfluous; since it was not politically possible to just send them all home with a UBI welfare check, and make-work jobs were the chosen solution, he should instead be figuring out how to make things less, not more, efficient. (I may or may not have told the Milton Friedman teaspoon joke.)

He reframed his objective as, 'can we make things more pleasant?' This was finally a realistic goal... but also not one I could really help with, because to remove papercuts and genuinely unnecessary friction or pain, you have to know the system in intimate detail, and an outsider like myself can't help there, and one where the benefits will definitionally be small. But I hoped that by coming to a more honest admission about which system he was working in - a system in which OP points #1--4 are not true - he was at least better off for the conversation.

Comment by gwern on What is the most impressive game LLMs can play well? · 2025-01-17T03:09:41.456Z · LW · GW

Given the other reports, like OA's own benchmarking (as well as the extremely large dataset of chess games they mention training on), I am skeptical of this claim, and wonder if this has the same issue as other 'random chess game' tests, where the 'random' part is not neutral but screws up the implied persona.

Comment by gwern on What is the most impressive game LLMs can play well? · 2025-01-17T03:07:28.233Z · LW · GW

Ma et al 2023 is relevant here.

Comment by gwern on Linkpost: Rat Traps by Sheon Han in Asterisk Mag · 2025-01-16T21:52:25.913Z · LW · GW

Oh, the type of weirdness has definitely changed a lot. But I'm just contending that the level of deviancy is a lot lower these days.

You go to a LW meetup now and there's a lot of wealthy, well-scrubbed/dressed AI researchers (they even lift) and academics and executives and bright-eyed Stanford undergrads sniffing for an internship or YC application fodder. One famous wealthy guy is manic, because he's hypomanic & bipolar is overrepresented among entrepreneurs; don't worry, he'll be fine, until after the meetup when he disappears for a few months. Nobody can really talk about what they do, so you make small talk about what you can. (Have you heard that Trump might increase the SALT deduction? Big [RSU sale tax savings] if true!)

You go to a LW meetup c. 2010 and if you are sniffing anywhere, you'll notice a bit of a smell, perhaps from that one dude who looks like he just got out of a prison hunger strike and dresses like Asian Kurt Godel while the other sus dude looks like he just got out of prison period (and about 10:1 odds he did serve a stint). The manic-looking guy is manic because he has slept about 6 hours in the past week trying to make the Uberman polyphasic sleep schedule work; don't worry, he'll be fine after he crashes and sleeps for 16 hours next week. The crazy homeless-looking dude over in the corner really is homeless and a bit crazy, not some billionaire (and will jump in front of a train in a few years), while the guy on the other side of the room in a huddle is the actual cult leader everyone accuses Eliezer of being (and the guy next to him just might be Satoshi Nakamoto), and so on and so forth. (You bond over your shared experiences nearly being expelled from school by Zero Tolerance policies, possibly involving explosives, and ordering drugs off Silk Road 1.)

I'm not saying it's good or bad (arguably it's good, if AI timelines are short, there is negative value to screwing around with all the stuff we used to), but pound for pound, LWers were a lot weirder back then.

Comment by gwern on A Novel Emergence of Meta-Awareness in LLM Fine-Tuning · 2025-01-16T21:35:33.557Z · LW · GW

This refers only to the regular old finetuning, for 4o, and not to the fancy new RL finetuning for o1 that they recently opened up to alpha users, right?

Comment by gwern on Implications of the inference scaling paradigm for AI safety · 2025-01-15T22:56:46.850Z · LW · GW

I think this is missing a major piece of the self-play scaling paradigm, one which has been weirdly absent in most discussions of o1 as well: much of the point of a model like o1 is not to deploy it, but to generate training data for the next model. It was cool that o1's accuracy scaled with the number of tokens it generated, but it was even cooler that it was successfully bootstrapping from 4o to o1-preview (which allowed o1-mini) to o1-pro to o3 to...

EDIT: given the absurd response to this comment, I'd point out that I do not think OA has achieved AGI and I don't think they are about to achieve it either. (Until we see real far transfer, o1-style training may just be like RLHF - 'one weird trick' to juice benchmarks once, shocking everyone with the sudden jump, but then back to normal scaling.) I am trying to figure out what they think.

Every problem that an o1 solves is now a training data point for an o3 (eg. any o1 session which finally stumbles into the right answer can be refined to drop the dead ends and produce a clean transcript to train a more refined intuition). As Noam Brown likes to point out, the scaling laws imply that if you can search effectively with a NN for even a relatively short time, you can get performance on par with a model hundreds or thousands of times larger; and wouldn't it be nice to be able to train on data generated by an advanced model from the future? Sounds like good training data to have!
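
In pseudocode, my reading of that bootstrap (placeholder callables, not OpenAI's actual pipeline) is just a filter-clean-finetune loop over rollouts:

```python
# Schematic bootstrap round: sample many attempts, keep verified wins, clean, finetune.
def bootstrap_round(model, problems, sample, verify, clean, finetune, k=64):
    kept = []
    for prob in problems:
        attempts = [sample(model, prob) for _ in range(k)]
        winners = [a for a in attempts if verify(prob, a)]   # only verifiably correct answers survive
        kept.extend(clean(a) for a in winners)               # drop dead ends, tidy the transcript
    return finetune(model, kept)                             # the next, "more refined intuition" model
```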

This means that the scaling paradigm here may wind up looking a lot like the current train-time paradigm: lots of big datacenters laboring to train a final frontier model of the highest intelligence, which will usually be used in a low-search way and be turned into smaller cheaper models for the use-cases where low/no-search is still overkill. Inside those big datacenters, the workload may be almost entirely search-related (as the actual finetuning is so cheap and easy compared to the rollouts), but that doesn't matter to everyone else; as before, what you see is basically, high-end GPUs & megawatts of electricity go in, you wait for 3-6 months, a smarter AI comes out.

I am actually mildly surprised OA has bothered to deploy o1-pro at all, instead of keeping it private and investing the compute into more bootstrapping of o3 training etc. (This is apparently what happened with Anthropic and Claude-3.6-opus - it didn't 'fail', they just chose to keep it private and distill it down into a small cheap but strangely smart Claude-3.6-sonnet. And did you ever wonder what happened with the largest Gemini models or where those incredibly cheap, low latency, Flash models come from...?‡ Perhaps it just takes more patience than most people have.) EDIT: It's not like it gets them much training data: all 'business users' (who I assume would be the majority of o1-pro use) are specifically exempted from training unless they opt in, and it's unclear to me if o1-pro sessions are trained on at all (it's a 'ChatGPT Pro' level, and I can't quickly find whether a professional plan is considered 'business'). Further, the pricing of DeepSeek's r1 series at something like a twentieth the cost of o1 shows how much room there is for cost-cutting, and why you might not want to ship your biggest best model at all compared to distilling down to a small cheap model.

If you're wondering why OAers† are suddenly weirdly, almost euphorically, optimistic on Twitter and elsewhere and making a lot of haha-only-serious jokes (EDIT: to be a little more precise, I'm thinking of Altman, roon, Brown, Sutskever, several others like Will Bryk or Miles Brundage's "Time's Up", Apples, personal communications, and what I think are echoes in other labs' people, and not 'idontexist_nn'/'RileyRalmuto'/'iruletheworldmo' or the later Axios report that "Several OpenAI staff have been telling friends they are both jazzed & spooked by recent progress."), watching the improvement from the original 4o model to o3 (and wherever it is now!) may be why. It's like watching the AlphaGo Elo curves: it just keeps going up... and up... and up...

There may be a sense that they've 'broken out', and have finally crossed the last threshold of criticality, from merely cutting-edge AI work which everyone else will replicate in a few years, to takeoff - cracked intelligence to the point of being recursively self-improving and where o4 or o5 will be able to automate AI R&D and finish off the rest: Altman in November 2024 saying "I can see a path where the work we are doing just keeps compounding and the rate of progress we've made over the last three years continues for the next three or six or nine or whatever" turns into a week ago, “We are now confident we know how to build AGI as we have traditionally understood it...We are beginning to turn our aim beyond that, to superintelligence in the true sense of the word. We love our current products, but we are here for the glorious future. With superintelligence, we can do anything else." (Let DeepSeek chase their tail lights; they can't get the big iron they need to compete once superintelligence research can pay for itself, quite literally.)

And then you get to have your cake and eat it too: the final AlphaGo/Zero model is not just superhuman but very cheap to run too. (Just searching out a few plies gets you to superhuman strength; even the forward pass alone is around pro human strength!)

If you look at the relevant scaling curves - may I yet again recommend reading Jones 2021?* - the reason for this becomes obvious. Inference-time search is a stimulant drug that juices your score immediately, but asymptotes hard. Quickly, you have to use a smarter model to improve the search itself, instead of doing more. (If simply searching could work so well, chess would've been solved back in the 1960s. It's not hard to search more than the handful of positions a grandmaster human searches per second; the problem is searching the right positions rather than slamming into the exponential wall. If you want a text which reads 'Hello World', a bunch of monkeys on a typewriter may be cost-effective; if you want the full text of Hamlet before all the protons decay, you'd better start cloning Shakespeare.) Fortunately, you have the training data & model you need right at hand to create a smarter model...

Sam Altman (@sama, 2024-12-20) (emphasis added):

seemingly somewhat lost in the noise of today:

on many coding tasks, o3-mini will outperform o1 at a massive cost reduction!

i expect this trend to continue, but also that the ability to get marginally more performance for exponentially more money will be really strange

So, it is interesting that you can spend money to improve model performance on some outputs... but 'you' may be 'the AI lab', and you may simply be spending that money to improve the model itself, not just a one-off output for some mundane problem. Few users really need to spend exponentially more money to get marginally more performance, if that's all you get; but if it's simply part of the capex along the way to AGI or ASI...

This means that outsiders may never see the intermediate models (any more than Go players got to see random checkpoints from a third of the way through AlphaZero training). And to the extent that it is true that 'deploying costs 1000x more than now', that is a reason not to deploy at all. Why bother wasting that compute on serving external customers, when you can instead keep training, and distill that back in, and soon have a superior model whose deployment cost is only 100x, and then 10x, and then 1x, and then <1x...?
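To make the 'distill it back in' loop concrete, here is a toy numerical caricature of AlphaZero-style expert iteration (my own sketch, not any lab's actual pipeline; the Elo-style numbers and transfer fraction are invented):

```python
# Toy expert-iteration loop: heavy search makes an expensive 'teacher', then
# training on its traces folds most of that gain back into a cheap policy.
# All numbers are invented for illustration.
search_boost = 300    # assumed Elo-equivalent gain from heavy test-time search
transfer = 0.7        # assumed fraction of that gain the distilled policy keeps
elo = 2000            # strength of the raw policy, forward pass only

for generation in range(6):
    teacher = elo + search_boost            # expensive: search-amplified model
    elo = elo + transfer * search_boost     # cheap: next policy, distilled
    print(f"gen {generation}: teacher={teacher:.0f}  distilled policy={elo:.0f}")
# Because the distilled policy keeps most of each search boost, the cheap
# forward pass keeps climbing - the AlphaGo/AlphaZero Elo-curve picture, and
# why the final model ends up both stronger and far cheaper to serve.
```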

Thus, the search/test-time paradigm may wind up looking surprisingly familiar, once all of the second-order effects and new workflows are taken into account. It might be a good time to refresh your memories about AlphaZero/MuZero training and deployment, and what computer Go/chess looked like afterwards, as a forerunner.

* Jones is more relevant than several of the references here, like Snell, because Snell assumes static, fixed models and looks at average-case performance rather than hardest-case (even though the hardest problems are also going to be the most economically valuable - there is little value in solving easy problems that other models already solve, even if you can solve them more cheaply). In such a scenario, it is not surprising that spamming small dumb cheap models on easy problems can outperform a frozen large model. But that is not relevant to the long-term dynamics where you are training new models. (This is similar to the error when everyone was so enthusiastic about how 'overtraining small models is compute-optimal' - true only under the obviously false assumption that you cannot distill/quantize/prune large models. But you can.)

† What about Anthropic? If they're doing the same thing, what are they saying? Not much, but Anthropic people have always had much better message discipline than OA, so nothing new there, and have generally been more chary of pushing benchmark SOTAs than OA. It's an interesting cultural difference, considering that Anthropic was founded by ex-OAers and regularly harvests from OA (but not vice versa). Still, given their apparent severe compute shortages and the new Amazon datacenters coming online, it seems like you should expect something interesting from them in the next half-year. EDIT: Dario now says he is more confident than ever of superintelligence by 2027, that smarter models will release soon, and that Anthropic's compute shortages will ease as they go to >1m GPUs. EDITEDIT: Dylan Patel: "Yeah, um, but at, the same time Google's already got a reasoning model... Anthropic allegedly has one internally that's like really good, better than o3, even, but you know we'll see, um, when they eventually release it." Remember, everyone has an incentive to hype up Chinese AI; few have an incentive to undersell it. And somehow they always forget that time keeps passing, and any public numbers are long outdated and incomplete.

‡ Notably, today (2025-01-21), Google is boasting about a brand new 'flash-thinking' model which beats o1 and does similar inner-monologue reasoning. As far as I can tell - checking around, looking at the price per token ($0) in Google's web interface, and watching the (high) speed of flash-thinking on one of my usual benchmark tasks (Milton poetry) - the 'flash' models are most comparable to the Sonnet or GPT-mini models, ie. they are small models, while the 'Pro' models are the analogues of Opus or GPT-4/5. I can't seem to find anything in the announcements saying whether 'flash-thinking' is trained 'from scratch' with the o1-style RL reasoning, or whether it's distilled, or anything about the existence of a 'pro-thinking' model which flash-thinking is distilled from... So my inference is that there is probably a large 'pro-thinking' model Google is not talking about, similar to Anthropic not talking about how Sonnet was trained.

Comment by gwern on Where should one post to get into the training data? · 2025-01-15T22:54:00.058Z · LW · GW

Reddit now aggressively blocks scrapers because it is charging a fortune for access, and The Pile could no longer be created today (Pushshift is down). Reddit is not the worst place to post, but it's also not the best.

Comment by gwern on Passages I Highlighted in The Letters of J.R.R.Tolkien · 2025-01-15T22:21:41.650Z · LW · GW

Tolkien invented their exact usage, but he didn't invent the words. "Elf", obviously, goes way back, but "orc" also goes way back, with meanings similar to the Tolkien usage.

"Zerg", "Protoss", & "SCV", are all neologisms; notably, the least weird ones, "Kerrigan" and "Terran", are quite ordinary words. ('Hydralisk' is a bit in between. 'Hydra' as a prefix is familiar, albeit increasingly hopelessly overloaded with SF/comic connotations, but 'lisk' as a suffix is a very unfamiliar one: 'obelisk' is the only one that comes to mind, and that appears to get 'lisk' as a butchering of Greek and then French.)

An interesting comparison here would be Gene Wolfe's Book of the New Sun, which does something similar: it uses old words in place of neologisms. For that reason, despite being stuffed with weird terms (so much so that you can publish a dictionary of it), words like 'pelagic argosy' or 'fuligin' or 'capote' worked as well in the 1980s as they do now, despite never having achieved the cultural currency of 'elves' or 'orcs' - demonstrating that the 'use old words' trick works in its own right, and not by mere familiarity.

(But conversely, writing old-timey is no surefire solution. Wolfe's dying-earth fiction was influenced by Hodgson's The Night Land, which is imaginative and influential... and the style is almost ludicrously unreadable, whether in 1912 or 2025.)

Now, why is that? I suspect that it's a mix of unrealized familiarity (you may not have seen 'destrier' often enough to consciously recognize it as a real word, much less define or use it*, but unconsciously you do) and linguistic 'dark knowledge' in recognizing that somehow, the word 'autarch' is valid and a plausible word which could exist, in a way that 'Zerg' or 'Protoss' could not exist. It somehow respects the laws of languages and etymology and spelling, and you recognize that by not immediately rejecting it like most neologisms. (And to some extent, Tolkien's own conlangs, by having their long fictional history to justify various twists & turns, gain a hidden realism that a tidy rationalized hobbyist conlang will not. Something something greebles fractal dimension worldbuilding pattern language something something.)

* This is why vocabulary can be a good IQ test: word-use frequency is the original power law, so you have been exposed to many more words than you consciously know, and how many of those words 'stick' reflects your intelligence's efficiency at learning from just 1 or 2 exposures to a word - and thus provides a good proxy.

Comment by gwern on Could my work, "Beyond HaHa" benefit the LessWrong community? · 2025-01-14T00:16:53.683Z · LW · GW

In my case, as a former military firefighter in Brazil

FWIW, I would be interested in any memoirs or lessons learned about that career, quite aside from any formal research. I don't think there are many firefighters (former, military, or otherwise) on LW, and I bet you saw some interesting things.

Comment by gwern on When is reward ever the optimization target? · 2025-01-13T01:13:44.737Z · LW · GW

But humans don't seem to optimize for reward all that often!

You might be interested in an earlier discussion on whether "humans are a hot mess": https://www.lesswrong.com/posts/SQfcNuzPWscEj4X5E/the-hot-mess-theory-of-ai-misalignment-more-intelligent https://www.lesswrong.com/posts/izSwxS4p53JgJpEZa/notes-on-the-hot-mess-theory-of-ai-misalignment

Comment by gwern on Policymakers don't have access to paywalled articles · 2025-01-12T23:03:07.500Z · LW · GW

Yeah, I was afraid that might apply here. It seems like you should still be able to do something like "government employee tier" subscriptions, not targeted at an individual but perhaps something like 'GS-8 and up', set low enough that it would appeal to such customers, perhaps? It is not a gift but a discount, it is not to an individual but to a class, it is part of a market, and it is not conditional on any government action or inaction, and such discounts are very common for 'students', 'veterans', 'first responders' etc, and I've never seen any fineprint warning government employees about it being >$20 despite many such discounts potentially crossing that threshold (eg. Sam's Club offers $50 off a new membership, and that seems clearly >$20, and to be doing it through a whole company devoted to this sort of discount, ID.me).

But I suppose that might be too complex for SA to be interested in bothering with?

Comment by gwern on When is reward ever the optimization target? · 2025-01-12T22:54:18.416Z · LW · GW

Yes. (And they can also learn to predict and estimate the reward, to achieve even higher reward than simply optimizing it directly. For example, if you included an input which said which arm had the reward, the RNN would learn to use that, and so would be able to change its decision without experiencing a single negative reward. A REINFORCE or evolution-strategies meta-trained RNN would have no problem with learning such a policy, which attempts to learn or infer the reward each episode in order to choose the right action.)
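For concreteness, here is a stripped-down stand-in for that setup (my own toy example, not from any paper): a REINFORCE-trained softmax policy on a two-armed bandit with a cue input saying which arm currently pays out. A real meta-RL agent would be an RNN reading the whole episode history, but the cue-following mechanism is the same - once trained, it switches arms the moment the cue switches, without needing to experience a single missed reward:

```python
# Cued two-armed bandit + REINFORCE (toy stand-in for the meta-trained RNN).
import numpy as np

rng = np.random.default_rng(0)
W = np.zeros((2, 2))                       # logits indexed by (action, cue)

def act(cue: int):
    logits = W[:, cue]
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(2, p=probs), probs

for step in range(2_000):                  # REINFORCE training
    cue = rng.integers(2)                  # which arm pays out this round
    action, probs = act(cue)
    reward = 1.0 if action == cue else 0.0
    grad = -probs
    grad[action] += 1.0                    # d log pi(action|cue) / d logits
    W[:, cue] += 0.1 * reward * grad       # policy-gradient update

for cue in (0, 1):                         # the trained policy just follows the cue
    print(cue, act(cue)[1].round(3))
```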

Nor is it at all guaranteed that 'the dog will wag the tail' - depending on circumstances, the tail may successfully wag the dog indefinitely. Maybe the outer level will be able to override the inner, maybe not. Because after all, the outer level may no longer exist, or may be too slow to be relevant, or may be changed (especially by the inner level). The 'homunculus' or 'Cartesian boundary' we draw around each level doesn't actually exist; it's just a convenient, leaky, abstraction.

To continue the human example, we were created by evolution on genes, but within a lifetime, evolution has no effect on the policy and so even if evolution 'wants' to modify a human brain to do something other than what that brain does, it cannot operate within-lifetime (except at even lower levels of analysis, like in cancers or cell lineages etc); or, if the human brain is a digital emulation of a brain snapshot, it is no longer affected by evolution at all; and even if it does start to mold human brains, it is such a slow high-variance optimizer that it might take hundreds of thousands or millions of years... and there probably won't even be biological humans by that point, never mind the rapid progress over the next 1-3 generations in 'seizing the means of reproduction' if you will. (As pointed out in the context of Von Neumann probes or gray goo, if you add in error-correction, it is entirely possible to make replication so reliable that the universe will burn out before any meaningful level of evolution can happen, per the Price equation. The light speed delay to colonization also implies that 'cancers' will struggle to spread much if they take more than a handful of generations.)
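A back-of-the-envelope version of the error-correction point (the per-copy mutation rate, lineage depth, and probe count below are assumptions picked purely for illustration):

```python
# With checksummed, redundantly-verified copying, heritable variation can be
# made so rare that the Price equation's selection term is effectively zero.
import math

per_copy_mutation_rate = 1e-18    # assumed undetected-error rate per replication
lineage_depth = 1_000             # assumed replications from origin to frontier probe
# P(any mutation along one lineage) = 1 - (1-r)^n, via log1p/expm1 to avoid underflow:
p_lineage = -math.expm1(lineage_depth * math.log1p(-per_copy_mutation_rate))
print(p_lineage)                  # ~1e-15

num_probes = 10 ** 11             # assume roughly one probe lineage per star
print(num_probes * p_lineage)     # ~1e-4 expected mutant lineages, galaxy-wide
# With essentially zero heritable variation, there is nothing for selection to
# act on before colonization (or the universe) is finished.
```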

Comment by gwern on Viliam's Shortform · 2025-01-12T04:17:07.326Z · LW · GW

Today, the cultures are closer, but the subcultures can be larger. Hundred years ago, there would be no such thing as the rationalist community.

That seems like a stretch, whether you put the stress on the 'community' or the 'rationalist' part. Subcultures can be larger, of course, if only because the global population is like 5x larger, but niche subcultures like 'the rationalist community' could certainly have existed then. Nothing much has changed there.

A hundred years ago was 1925; in 1925 there were countless communes, cults, Chinatowns/ghettos (or perhaps a better example would be 'Germantowns'), 'scenes', and other kinds of subcultures and notable small groups. Bay Area LW/rationalists have been analogized to, for example, the (much smaller) Bloomsbury Group, which was still active in 1925, and from whom, incidentally, we can directly trace some intellectual influence through economics, decision theory, libertarianism, and analytic philosophy, even if one rejects any connection with poly etc. We've been analogized to the Vienna Circle as well (to whom we trace back much more), which was in full swing in 1925. Or how about the Fabians before that? Or Technocracy after that? (And in an amusing coincidence, Paul Kurtz turns out to have been born in 1925.) Or things like Esperanto - even now, a century past its heyday, the number of native Esperanto speakers is shockingly comparable to the number of active LW2 users... Then there are fascinating subcultures like the amateur press that nurtured H. P. Lovecraft, who, as of 1925, has grown out of them and is about to start writing the speculative fiction stories that will make him famous.

(And as far as the Amish go, it's worth recalling that they came to the distant large island of America to achieve distance from persecution in Europe - where the Amish no longer exist - and, to minimize attrition & interference by 'the English', they continue to live in communities as isolated as possible while still consistent with their needs for farmland etc.)

Comment by gwern on Fluoridation: The RCT We Still Haven't Run (But Should) · 2025-01-12T03:44:38.090Z · LW · GW

They really rule out much more than that: −0.14 is from their worst-case:

Looking at the estimates, they are very small and often not statistically-significantly different from zero. Sometimes the estimates are negative and sometimes positive, but they are always close to zero. If we take the largest negative point estimates (−0.0047, col. 1) and the largest standard error for that specification (0.0045), the 95% confidence interval would be −0.014 to 0.004. We may thus rule out negative effects larger than 0.14 standard deviations in cognitive ability if fluoride is increased by 1 milligram/liter (the level often considered when artificially fluoridating the water).

So that is not the realistic estimate; it is the worst case, after double-cherrypicking both the point estimate and the standard error to reverse-p-hack a harm. The two most-controlled estimates are actually both positive.
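For concreteness, the quoted bound is just the usual normal-approximation confidence interval; the ×10 rescaling from the regression coefficient to 'per 1 mg/L' is my reading of their units rather than something I have re-verified against the paper:

```python
# Reproducing the quoted worst-case interval from the cherrypicked numbers.
point, se = -0.0047, 0.0045
lo, hi = point - 1.96 * se, point + 1.96 * se
print(round(lo, 4), round(hi, 4))   # -0.0135, 0.0041  (their "-0.014 to 0.004")
print(round(10 * lo, 2))            # -0.14 SD per 1 mg/L: the quoted worst-case bound
```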

(Meanwhile, any claims of decreases, or that one should take the harms 'many times over', are undermined by the other parts, like labor income benefiting from fluoridation. Perhaps one should take dental harms more seriously.)