Comments
I think people did point out that CICERO lies, and that was a useful update about how shallow attempts to prevent AI deception can fail. I think it could be referenced, and has been referenced, in relevant discussions
None of which comes anywhere close to your claims about what labs would do if they caught systematic scheming to deceive and conquer humans in systems trained normally. CICERO schemes very systematically, in a way which depends crucially on an LLM which was not trained to deceive or scheme. It does stuff that, not so long ago, would have been considered a red line. And what analysis does it get? Some cursory 'pointing out'. Some 'referencing in relevant discussions'. (It hasn't even been replicated, AFAIK.)
any evidence that we'll get the kind of scheming that could lead to AI takeover,
See, that's exactly the problem with this argument - the goalposts will keep moving. The red line will always be a little further beyond. You're making the 'warning shot' argument. CICERO presents every element except immediate blatant risk of AI takeover, which makes it a good place to start squeezing that scientific juice, and yet, it's still not enough. Because your argument is circular. You can only be convinced of 'systematic scheming to pose non-negligible takeover risk' if you've already been convinced that it's 'systematic scheming to pose non-negligible takeover risk'. You present it as if there were some clear, objective brightline, but there is not and will not be, because each time it'll be like Sydney or CICERO or ...: "oh, it didn't take over, and therefore doesn't present a takeover risk" and therefore no update happens. So all your assertion boils down to is the tautology that labs will deeply examine the risky agents they choose to deeply examine.
Danny Halawi reports lower performance on a different, more held-out set of forecasting questions, and says the claims about the GPT-4o knowledge cutoff are probably wrong:
The results in "LLMs Are Superhuman Forecasters" don't hold when given another set of forecasting questions. I used their codebase (models, prompts, retrieval, etc.) to evaluate a new set of 324 questions—all opened after November 2023.
Findings:
- Their Brier score: 0.195
- Crowd Brier score: 0.141 [lower=better]
First issue:
The authors assumed that GPT-4o/GPT-4o-mini has a knowledge cut-off date of October 2023.
However, this is not correct.
For example, GPT-4o knows that Mike Johnson replaced Kevin McCarthy as speaker of the house.
- This event happened at the end of October.
- This also happens to be a question in the Metaculus dataset.
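(For reference, the Brier score above is just the mean squared error between probabilistic forecasts and the 0/1 outcomes, so lower is better; a minimal sketch:)

```python
def brier_score(forecasts, outcomes):
    # Mean squared error between forecast probabilities and 0/1 outcomes; lower is better.
    # Always guessing 0.5 scores 0.25.
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

print(brier_score([0.9, 0.2, 0.7], [1, 0, 0]))  # (0.01 + 0.04 + 0.49) / 3 = 0.18
```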
Keep in mind Musk never said it was "fully online" or "100,000 GPUs are running concurrently" or anything like that. He only said that the cluster was "online", which could mean just about anything, and that it is "the most powerful AI training system", which is unfalsifiable (who can know how powerful every AI training system is worldwide, including all of the secret proprietary ones by FANG etc?) and obvious pure puffery ("best pizza in the world!"). If you fell for it, well, then the tweet was for you.
Yes, the Google 'search by date' is basically completely busted and has been for a while (even though Google possesses the capability to date content accurately by using their Internet-wide snapshot archive going back to the 1990s, whose existence was recently publicly confirmed by the ad API leak). For kicks, try searching things like "Xi Jinping" with date-ranges like 2013... It is most regrettable, as it used to be a useful tool for me in digging up old stuff. There also seem to be issues in the other direction, where Google is 'forgetting' old articles which aren't being indexed at all, apparently, in any publicly-searchable fashion (which might be contributing to the former, by a base rates screening-paradox effect - if all the real old articles have been forgotten by the index, then only erroneously timestamped hits will be available). I'm not aware of any search engine whose date cutoff is truly reliable. Even if they were, you would still have to manually check and clean to be sure that things like sidebars or recommendations were not causing data leakage.
I also agree that if this is really the only countermeasure to data leakage OP has taken, then the results seem dead on arrival. ML models infamously ruthlessly exploit far subtler forms of temporal data leakage than this...
It sounds like I'll be waiting for some actually out-of-sample forecasting numbers before I believe any claims about beating Metaculus etc.
(This is a surprising mistake for a benchmarking expert to make. Even if you knew nothing about the specific problems with date-range search, it should be obvious that even with completely unedited, static snapshots from the past, there would be leakage - results will rank higher or lower based on future events, for example. If Israel attacked Iran, obviously all the articles from beforehand arguing that Israel will/should/could attack Iran are going to benefit from being 'right' and be ranked higher than articles arguing the opposite, many of which will outright quietly disappear & cease to be mentioned, and an LLM conditioned on those rather than the lower-ranking ones will automatically & correctly 'predict' more accurately. And there are countless other leakages like that, which are not fixed as easily as "just download a snapshot from the IA".)
By going long coal stocks, you can implicitly bet that 1) in the short run, the war between Russia and Ukraine and the associated sanctions and trade disruptions will continue (reduced energy exports from Russia is the main cause of the current high coal prices), 2) supply of (non-Russian) energy will not respond much to higher prices, and 3) in the longer run, humanity will have a harder time transitioning away from burning coal for energy, or using coal to make steel and cement, than the market thinks.
It has been a bit over 2 years now, and the war continues with no end in sight. How many of #1-3 happened?
If there is indeed systematic scheming by the model, and the lab has caught it red handed, the lab should be able to produce highly scientifically credible evidence of that. They could deeply understand the situations in which there’s a treacherous turn, how the model decides whether to openly defect, and publish. ML academics are deeply empirical and open minded, so it seems like the lab could win this empirical debate if they’ve indeed caught a systematic schemer.
How much scientific juice has, say, Facebook gotten out of CICERO? Have they deeply understood the situations in which CICERO begins planning to manipulate possible allies? Have they mechanistically interpretably understood how CICERO decides how long to cooperate and play nice, and when to openly defect and attack an ally? Is not CICERO a deeply empirical system based on observations and logs from many real-world games with actual human players rather than mere theoretical arguments? Has CICERO ended the empirical debate about whether LLMs can systematically scheme? Has it been shown what training techniques lead to scheming or why off-the-shelf normally-trained frozen LLMs were so useful for the planning and psychological manipulation compared to no-press Diplomacy?
Or has everyone pretty much forgotten about CICERO, handwaved it away with a few excuses about "well, maybe it wasn't really deception" and "didn't it just learn to imitate humans, why are you surprised?", while the entire line of work lies dead as a doornail as FB pivots to Llama-everything and core authors leave for places like OA?
If the incentives for scientific research don't work there where the opposing commercial incentives are so very weak (borderline non-existent, even), why would they be highly likely to work elsewhere in scenarios with vastly more powerful opposing commercial incentives?
I think you’re just saying here that the model doesn’t place all its prediction mass on one token but instead spreads it out, correct?
Yes. For a base model. A tuned/RLHFed model, however, is doing something much closer to that ('flattened logits'), and this plays a large role in the particular weirdnesses of those models, especially as compared to the originals (eg. it seems like maybe they suck at any kind of planning or search or simulation because they put all the prediction mass on the argmax token rather than trying to spread mass out proportionately, and so if that one token isn't 100% right, the process will fail).
Another possible reading is that you’re saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition)
Hm, I don't think base models would necessarily do that, no. I can see the tuned models having the incentives to train them to do so (eg. the characteristic waffle and non-commitment and vagueness are presumably favored by raters), but not the base models.
They are non-myopic, so they're incentivized to plan ahead, but only insofar as that predicts the next token in the original training data distribution (because real tokens reflect planning or information from 'the future'); unless real agents are actively avoiding commitment, there's no incentive there to worsen your next-token prediction by trying to create an ambiguity which is not actually there.
(The ambiguity is in the map, not the territory. To be more concrete, imagine the ambiguity is over "author identity", as the LLM is trying to infer whether 'gwern' or 'eggsyntax' wrote this LW comment. At each token, it maintains a latent about its certainty of the author identity; because it is super useful for prediction to know who is writing this comment, right? And the more tokens it sees for the prediction, the more confident it becomes the answer is 'gwern'. But when I'm actually writing this, I have no uncertainty - I know perfectly well 'gwern' is writing this, and not 'eggsyntax'. I am not in any way trying to 'avoid committing to one possible [author]' - the author is just me, gwern, fully committed from the start, whatever uncertainty a reader might have while reading this comment from start to finish. My next token, therefore, is not better predicted by imagining that I'm suffering from mental illness or psychedelics as I write this and thus might suddenly spontaneously claim to be eggsyntax and this text is deliberately ambiguous because at any moment I might be swerving from gwern to eggsyntax and back. The next token is better predicted by inferring who the author is to reduce ambiguity as much as possible, and expecting them to write in a normal non-ambiguous fashion given whichever author it actually is.)
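(A toy numerical sketch of that latent-inference asymmetry, with made-up likelihoods purely for illustration: the reader-side predictor's posterior over the author sharpens token by token, while the actual author's 'distribution' was degenerate from the start.)

```python
# Toy numbers only: a predictor doing Bayesian updates on "who is writing this?",
# token by token, while the actual author never had any uncertainty to begin with.
priors = {"gwern": 0.5, "eggsyntax": 0.5}

# Hypothetical per-token likelihoods: how likely each author is to have emitted that token.
token_likelihoods = [
    {"gwern": 0.9, "eggsyntax": 0.6},
    {"gwern": 0.8, "eggsyntax": 0.3},
    {"gwern": 0.95, "eggsyntax": 0.2},
]

posterior = dict(priors)
for lik in token_likelihoods:
    unnorm = {a: posterior[a] * lik[a] for a in posterior}
    total = sum(unnorm.values())
    posterior = {a: p / total for a, p in unnorm.items()}
    print(posterior)  # the map: the reader's certainty about the author ratchets upward

true_author = {"gwern": 1.0, "eggsyntax": 0.0}  # the territory: committed from the start
```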
FWIW, I looked briefly into this ~2 years ago: whether it is legal to release data poison. As best as I could figure, it probably is in the USA: I can't see what crime it would be. If you aren't actively maliciously injecting the data somewhere like Wikipedia (where you are arguably violating policies or ToS by inserting false content with the intent of damaging computer systems), but are just releasing it somewhere like your own blog and waiting for the LLM scrapers to voluntarily slurp it down and choke during training, that's then their problem. If their LLMs can't handle it, well, that's just too bad. No different than if you had written up testcases for bugs or security holes: you are not responsible for what happens to other people if they are too lazy or careless to use it correctly, and it crashes or otherwise harms their machine. If you had gone out of your way to hack them*, that would be a violation of the CFAA or something else, sure, but if you just wrote something on your blog, exercising free speech while violating no contracts such as Terms of Service? That's their problem - no one made them scrape your blog while being too incompetent to handle data poisoning. (This is why the CFAA provision quoted wouldn't apply: you didn't knowingly cause it to be sent to them! You don't have the slightest idea who is voluntarily and anonymously downloading your stuff or what the data poisoning would do to them.) So stuff like the art 'glazing' is probably entirely legal, regardless of whether it works.
* one of the perennial issues with security researchers / amateur pentesters being shocked by the CFAA being invoked on them - if you have interacted with the software enough to establish the existence of a serious security vulnerability worth reporting... This is also a barrier to work on jailbreaking LLM or image-generation models: if you succeed in getting it to generate stuff it really should not, sufficiently well to convince the relevant entities of the existence of the problem, well, you may have just earned yourself a bigger problem than wasting your time.
On a side note, I think the window for data poisoning may be closing. Given increasing sample-efficiency of larger smarter models, and synthetic data apparently starting to work and maybe even being the majority of data now, the so-called data wall may turn out to be illusory, as frontier models now simply bootstrap from static known-good datasets, and the final robust models become immune to data poison that could've harmed them in the beginning, and can be safely updated with new (and possibly-poisoned) data in-context.
Yes, but note that in the simulator/Bayesian meta-RL view, it is important that the LLMs do not "produce a response": they produce a prediction of 'the next response'. The logits will, of course, try to express the posterior, averaging across all of the possibilities. This is what the mixture is: there are many different meanings which are still possible, and you're not sure which one is 'true', but they all have different posterior probabilities by this point, and you hedge your bets as to the exact next token, as incentivized by a proper scoring rule which encourages you to report the posterior probability as the output which minimizes your loss. (A hypothetical agent may be trying to produce a response, but so too do all of the other hypothetical agents which are live hypotheses at that point.) Or it might be clearer to say: it produces predictions of all of the good-sounding responses, but never produces any single response.
Everything after that prediction, like picking a single, discrete, specific logit and 'sampling' it to fake 'the next token', is outside the LLM's purview except insofar as it's been trained on outputs from such a sampling process and has now learned that's one of the meanings mixed in. (When Llama-3-405b is predicting the mixture of meanings of 'the next token', it knows ChatGPT or Claude could be the LLM writing it and predicts accordingly, but it doesn't have anything really corresponding to "I, Llama-3-405b, am producing the next token by Boltzmann temperature sampling at x temperature". It has a hazy idea what 'temperature' is from the existing corpus, and it can recognize when a base model - itself - has been sampled from and produced the current text, but it lacks the direct intuitive understanding implied by "produce a response".) Hence all of the potential weirdness when you hardwire the next token repeatedly and feed it back in, and it becomes ever more 'certain' of what the meaning 'really' is, or it starts observing that the current text looks produced-by-a-specific-sampling-process rather than produced-by-a-specific-human, etc.
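A minimal toy sketch of that division of labor (nothing here is any real model's API; the 'model' is a stand-in): the model's job ends at reporting a distribution over the next token, and a 'response' only exists because an external loop keeps picking one token and hardwiring it back into the context.

```python
import math, random

def model_logits(context):
    # Stand-in for the LLM forward pass: all the model ever does is report scores
    # over the vocabulary for 'the next token', i.e. (after softmax) a posterior.
    vocab = ["the", "cat", "sat", "on", "."]
    return {tok: random.gauss(0.0, 1.0) for tok in vocab}

def sample_token(logits, temperature=1.0):
    # Everything here is *outside* the model: softmax, temperature, and the dice roll.
    exps = {t: math.exp(v / temperature) for t, v in logits.items()}
    total = sum(exps.values())
    r, acc = random.random(), 0.0
    for tok, e in exps.items():
        acc += e / total
        if r <= acc:
            return tok
    return tok  # numerical edge case

context = ["the"]
for _ in range(5):
    tok = sample_token(model_logits(context))  # the sampler, not the model, commits to one token
    context.append(tok)                        # ...and hardwires it back into the context
print(context)
```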
What do you think would happen if you further trained the Adam model with SGD (and vice-versa)? Has it found too qualitatively different a local optimum to 'fix' the privileged-basis issue, or would it just gradually change to a more SGD-like internal organization?
It is probably just a silly arbitrary codename reference to something like Altman growing strawberries at his house, who knows; but I would doubt that it refers to the counting-letters problem specifically because (1) that is due to BPE tokenization, which has way simpler solutions like byte tokenization, and it's not at all obvious how any kind of 'planning' or self-play RL breakthrough would apply to solving spelling gotcha questions; (2) I think that exact variant of the gotcha showed up after the first reporting of 'Strawberry' last year; (3) the reporting about Strawberry implied it was all about math problems like GSM8k, nothing to do with spelling; and (4) there's plenty of other things that would make a lot more sense as a reference (for example, being a riff off LeCun's "cherry" - another small red fruit frequently put on top of dessert cakes).
I don't think that's possible, because an attacker (LLM) can program a victim LLM to emit arbitrary text, so with enough attacks, you can solve any benchmark in the attacker's capability (thereby defeating the safety point entirely because now it's just a very expensive way to use an unsafe model), or otherwise bruteforce the benchmark by inferring the hidden answers and then creating the adversarial example which elicits that (like p-hacking: just keep trying things until you get below the magic threshold). See backdoors, triggers, dataset distillation... "A benchmark" is no more of a barrier than "flipping a specific image's class".
Why is this not just a description of an adversarial attack loop on the weak AI model, and would not just produce the usual short adversarial strings of gibberish (for LLMs) or handful of pixel perturbations (for vision or VLMs), which are generally completely useless to humans and contain no useful information?
Ah, Gio Scotti strikes again.
If one's argument is that there must be some algorithm which solves the anvil problem without needing hacks like a hardwired reward function which inflicts 'pain' upon any kind of bodily interaction which threatens the Cartesian boundary, because humans solve it fine, then one had better have firmly established that humans have in fact solved it without pain.
But they haven't. When humans don't feel pain, they do do things equivalent to 'drop an anvil on their head', which result in blinding, amputation, death by misadventure, etc. Turns out that if you don't feel pain, you may think it's funny to poke yourself in the eye just to see everyone else's reaction and go blind, or jump off a roof to impress friends and die, or simply walk around too long and damage your feet into sores which suppurate and turn septic, until you lose your legs or die. (This is leaving out Lesch–Nyhan syndrome.)
(51) Improving the public security governance mechanisms
We will improve the response and support system for major public emergencies, refine the emergency response command mechanisms under the overall safety and emergency response framework, bolster response infrastructure and capabilities in local communities, and strengthen capacity for disaster prevention, mitigation, and relief. The mechanisms for identifying and addressing workplace safety risks and for conducting retroactive investigations to determine liability will be improved. We will refine the food and drug safety responsibility system, as well as the systems of monitoring, early warning, and risk prevention and control for biosafety and biosecurity. We will strengthen the cybersecurity system and institute oversight systems to ensure the safety of artificial intelligence.
(On a methodological note, remember that the CCP publishes a lot, in its own impenetrable jargon, in a language & writing system not exactly famous for ease of translation, and that the official translations are propaganda documents like everything else published publicly and tailored to their audience; so even if they say or do not say something in English, the Chinese version may be different. Be wary of amateur factchecking of CCP documents.)
As I've noted before (eg 2 years ago), maybe Xi just isn't that into AI. People keep trying to meme the CCP-US AI arms race into happening for the past 4+ years, and it keeps not happening.
A good example of how incredibly incorrigible & mode-collapsed tuned model style can be is this 2023 poetry paper: even with 17 non-rhyming Walt Whitman poems in the prompt to few-shot it, ChatGPT still rhymed. (It's gotten better, and now even passes my old "write a non-rhyming poem" test, but nevertheless, an alarming instance.)
I found that Midjourney had become more generic in some way that was hard to place.
What you can try doing is enabling the personalization (or use mine), to drag it away from the generic MJ look, and then experimenting with the chaos sampling option to find something more interesting you can then work with & vary.
I guess the bad aesthetics are to some extent a side effect of some training/fine-tuning step that improves some other metric (like prompt following), and they don't have a person who knows/cares about art enough to block "improvements" with such side effects.
Also, probably a lot of it is just mode collapse from simple preference-learning optimization. Each of your comparisons shows a daring, risky choice which a rater might not prefer, vs a very bland, neutral, obvious, colorful output. A lot of the image-generation gains are illusory, and caused simply by a mode-collapse down onto a few well-rated points:
Our experiments suggest that realism and consistency can both be improved simultaneously; however, there exists a clear tradeoff between realism/consistency and diversity. By looking at Pareto optimal points, we note that earlier models are better at representation diversity and worse in consistency/realism, and more recent models excel in consistency/realism while decreasing the representation diversity.
Same problem as tuning LLMs. It's a sugar-rush, like spending Mickey Mouse bucks at Disney World: it gives you the illusion of progress and feels like it's free, but in reality you've paid for every 'gain'.
My suggestion for a LLM policy for LW2 might be:
If AI-written or edited text is not being posted as AI-written/edited text samples, then it must be improved by a human before posting.
If someone is posting a GPT-4 sample as a response or example of "what would GPT-4 write here?", that is totally legitimate and doesn't need to be edited other than to put it in blockquotes etc; if it's an exercise in "and the punchline is, an AI wrote this!", well, that's fine too, and readers will upvote/downvote as they find the exercise of value. These are not the problem. The problem is when people slip in AI stuff purely as an (inferior) substitute for their own work.
I am also fine with use of AI in general to make us better writers and thinkers, and I am still excited about this. (We unfortunately have not seen much benefit for the highest-quality creative nonfiction/fiction or research, like we aspire to on LW2, but this is in considerable part due to technical choices & historical contingency, which I've discussed many times before, and I still believe in the fundamental possibilities there.) We definitely shouldn't be trying to ban AI use per se.
However, if someone is posting a GPT-4 (or Claude or Llama) sample which is just a response, then they had damn well better have checked it and made sure that the references existed and said what the sample says they said and that the sample makes sense and they fixed any issues in it. If they wrote something and had the LLM edit it, then they should have checked those edits and made sure the edits are in fact improvements, and improved the improvements, instead of letting their essay degrade into ChatGPTese. And so on.
Anything else pollutes the commons. Every comment here is a gift from the author, but it's also a gift from the readers, which they make in good faith under the belief that the author tried to make the comment worthwhile & put in enough effort that it would be worth potentially many people reading it. It should never take the author much less effort to write a comment than the readers will take to read it (as is the case with spamming sections with LLM junk that the 'author' didn't even read but merely skimmed and went 'lgtm', judging from cases that have been flagged here in the past). Because you know, bro, I am just as capable as you are of copying a comment into the neighboring ChatGPT or Claude tab and seeing what it says; I don't need you doing that manually on LW2, and it doesn't help me if I have to waste time reading it to realize that I was better off ignoring it, because you are just going to paste in random average AI slop without adding any kind of value: filtering, critique, improvement, evaluation, commentary, fact-checking, editing, curation, comparison of LLMs...
Such comments are spam, plain and simple, indistinguishable from spammers karma-farming to flip an account: creating fake contributions to gain status in order to parasitize the community without giving anything in return. And should be treated as such: downvoted, and banned.
There are lots of people working on it who are offering it or will be offering it soon. And even when they aren't offering true finetuning, it's still better: Snowflake (first hit in Google for "Llama 405B finetuning"), for example, is making no bones about their single-node lightweight-finetuning being a LoRA, and is open sourcing code upfront so at least you know what it is now - instead of depending on borderline-gossip buried 40 minutes into a Youtube video months/years later.
OA does have a new finetuning service for GPT-4o, and people seem to be happier with it, but OA has also apparently confirmed that it's a LoRA (as I was speculating about it being a cheap shallow hack rather than true finetuning): https://x.com/CFGeek/status/1826749739502895618 https://www.youtube.com/watch?v=X57GT1Y5URY&t=2479s
It also is doing shenanigans behind the scenes like trying to dynamically guess a size but apparently hiding that from you if you aren't a favored customer: https://x.com/CFGeek/status/1826749748549988800
So, I continue to maintain that OA "finetuning" is unfit for research* and for any purposes that involve deep transformation of the model rather than 'locating' an existing capability. Especially now that Llama-3-405b has been released and you can finetune that yourself and be sure that it genuinely is finetuning rather than a pinchbeck substitute.
* ie. it can be OK if you have an extremely specific claim like 'the OA blackbox finetuning service does or does not do X'; but it is totally illegitimate to argue 'GPT-4 cannot do X as proven by our OA-finetuned version still not doing X', which is the usual way it comes up in DL research. At best, it is a loose lower bound, and should be treated no more seriously than lazy garbage arguments like 'we tried a few prompts and X didn't work, therefore, LLMs will never do X'.
You mention 'billions of data points', but you say your goal is 'how accessible the Internet is to people with disabilities', where your sample size should be more like hundreds to thousands, so you may need to seriously think about what the purpose of your survey is and how it will be used. Planning sample size is the least of your problems.
It sounds like you think you can just take some dataset like Common Crawl and crunch numbers about 'the top million domains' and come up with a conclusion like 'X% of the Internet is unusable' and you just need to know how many domains to analyze and can turn the crank and see what pops out with p < 0.05. But that's not the case. For datasets like this, you will find many parameters to be "statistically significant" as you are doing near-population-level analysis, where your sampling error is tiny and all your error will be the (unknown and usually impossible to measure) systematic error & bias which doesn't go away (although Meng 2014 is an interesting discussion of asking how much systematic error goes away when you are sampling a large fraction of the entire population). At scale, all your results may tell you is something about the many serious flaws and biases in these sorts of Internet datasets - they may be all we have, but one shouldn't fool oneself into thinking that they are any good. (As Cohen put it, a burning desire for an answer doesn't mean that a given dataset or survey methodology will be able to provide it.)
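To make the point concrete, here is a toy simulation (all numbers made up): once your 'sample' is essentially the population, even a trivial systematic bias in which domains get crawled comes out wildly 'statistically significant', telling you about the crawler rather than about the Web.

```python
import math, random, statistics

random.seed(0)
N = 1_000_000  # 'the top million domains': effectively the whole population of interest

true_rate, crawler_bias = 0.30, 0.002   # made-up: true failure rate plus a tiny systematic crawl bias
data = [1 if random.random() < true_rate + crawler_bias else 0 for _ in range(N)]

p_hat = statistics.mean(data)
se = math.sqrt(p_hat * (1 - p_hat) / N)   # sampling error is tiny at this n...
z = (p_hat - true_rate) / se
print(f"estimate={p_hat:.4f}, z={z:.1f}")  # ...so the 0.2% bias alone comes out 'significant'
```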
but... I think for most purposes,
No, they're not interchangeable. They are all designed with each other in mind, along the spectrum, to maximize profits under constraints, and the reality of rivalrousness is one reason not to simply try to run at 100% capacity every instant.
My memory is we didn't often have that problem, but it was over ten years ago so dunno.
"Didn't often have that problem" sounds a lot like saying "had that problem sometimes". Like shit-caked walls, how often do you need to have that problem to illustrate why the bathrooms are so overbuilt due to the extreme rivalrousness of their use?
I'd say part of why they're (generally in my experience) low-rivalrous is because they're overbuilt.
As I just said, yes. Bathroom stalls/toilets/urinals are extremely rivalrous and so you have to overbuild massively instead of, say, building exactly 1 unisex toilet for a whole theater. (Which would often be adequate raw capacity, on average; but the statistician drowned crossing the river which was 2 feet deep on average...) Then the rivalry is fine, and the worst-case lines are tamed.
But did you miss my example of the pop-up urinals? I did not explain how those are excludable, and I maintain that they're not.
Of course you did. You explained they popped up from the ground. Those are just about the most excludable toilets in existence! (I was impressed when I visited London and saw those. Although I didn't actually get to use them, unlike the self-cleaning Parisian ones, so I had to admire them more as an abstract idea than in reality: "Wow. That'll keep people out, alright. No half-measures there.") They are the Fort Knox of toilets - every example I've given of toilets being excludable by things like locked doors is way less excludable than your example of fortified telescopic toilets stored in the ground and protected by 10 feet and tons of concrete, rebar, and dirt. If you want to take a leak in a telescopic toilet which is excluding you by staying retracted underground, you'd better bring either a backhoe or a computer hacker. And you maintain they are not excludable...?
(Talk about giving hostage to fortune...)
Well, SAEs are the hot new thing I don't know much about, so I was hoping you'd know how they compare to the dense z latents of GANs. (This is not as historical or idle a question as it may seem, because GANs are enjoying a bit of a revival as diffusion people admit that actually, having true latent spaces and being able to generate images in a single forward pass are both kinda useful and maybe I had a point after all.)
GAN z's are so useful because they are just a multivariate normal (or, in fact, any distribution you want to sample from - you can use Bernoulli, exponential, or Poisson, and they'll even work better, according to BigGAN, probably because they can be mapped onto features which are inherently binary or otherwise non-normally distributed, so you avoid the pathological parts of the z where the model is desperately trying to generate a face which has half of a pair of glasses). You can reverse an image pixel-identically, interpret each variable of z meaningfully, systematically sample 'around' points or in trajectories or just avoid too much overlap, edit them with sliders, and so on. Diffusion models and SAEs seem to lack most of that, and the equivalents are ham-handed and imprecise and expensive, compared to a free z tweak and a single forward pass.
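To illustrate what a 'free z tweak and a single forward pass' looks like, a minimal sketch (the generator and the attribute direction here are toy stand-ins, not any particular library's API):

```python
import torch

# Toy stand-ins: a 'pretrained generator' mapping a 512-dim z to a 64x64 RGB image, and an
# attribute direction in z-space (in practice found by e.g. contrasting latents of labeled samples).
G = torch.nn.Sequential(torch.nn.Linear(512, 64 * 64 * 3), torch.nn.Tanh())
smile_direction = torch.randn(512)
smile_direction /= smile_direction.norm()

z = torch.randn(1, 512)   # the entire 'prior' is just N(0, I)
img = G(z)                # one forward pass -> one image

# The 'slider': nudge z along the direction and regenerate; each edit costs one forward pass.
edits = [G(z + alpha * smile_direction) for alpha in (-2.0, -1.0, 0.0, 1.0, 2.0)]
```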
GAN z's don't seem to work too well with really skewed distributions of features, particularly rare binary features. You usually make a dense z of 64–512 variables, so while the GAN can represent rare binary features, it can't be done cleanly as a single variable (not even a binomial set to a very low p) without 'using up' the embedding. They have to be represented as complex nonlinear interactions of potentially multiple variables. Maybe not a big deal when you're using another nonlinear model like random forests to figure out how to control the z, but it hampers interpretability & control. And if you make z bigger and bigger, it's unclear how well the GAN will perform in terms of making the latent space useful; the usual way to plug the random seed in is through a dense fully-connected layer, so that's not going to scale too well.
(Also, while GANs are enjoying a revival, the question of 'sequence GAN' or 'language GAN' admittedly remains unsolved: we do not have, and I am aware of no meaningful prospects, for a 'LLM GAN' which is anywhere near SOTA.)
But I think there's some potential here for crossover, especially as in some ways they seem to be opposites of each other: SAEs seem to be very expensive to train. Could they be bootstrapped from a pre-existing GAN, which presumably captures the desired features, often already represented linearly and disentangled, and speed up training a lot? Or could one encode a large dataset into z and then SAE those embeddings instead of internal activations? Can GANs expand their z during training, like progressively adding in new entries to z like binomials with ever lower p probabilities (inspired by nonparametric processes) to capture ever rarer features in a clean way? Can SAE training techniques make large z feasible? Since you can change the z arbitrarily to any random vector you want or even swap in/out adapters for the Generator to draw from totally different sources (it doesn't even affect the Discriminator), can we feed SAEs directly into GAN Gs? And so on.
Should you write text online now in places that can be scraped? You are exposing yourself to 'truesight' and also to stylometric deanonymization or other analysis, and you may simply have some sort of moral objection to LLM training on your text.
This seems like a bad move to me on net: you are erasing yourself (facts, values, preferences, goals, identity) from the future, by which I mean, LLMs. Much of the value of writing done recently or now is simply to get stuff into LLMs. I would, in fact, pay money to ensure Gwern.net is in training corpuses, and I upload source code to Github, heavy with documentation, rationale, and examples, in order to make LLMs more customized to my use-cases. For the trifling cost of some writing, all the world's LLM providers are competing to make their LLMs ever more like, and useful to, me.
And that's just today! Who knows how important it will be to be represented in the initial seed training datasets...? Especially as they bootstrap with synthetic data & self-generated worlds & AI civilizations, and your text can change the trajectory at the start. When you write online under stable nyms, you may be literally "writing yourself into the future". (For example, apparently, aside from LLMs being able to identify my anonymous comments or imitate my writing style, there is a "Gwern" mentor persona in current LLMs which is often summoned when discussion goes meta or the LLMs become situated as LLMs, which Janus traces to my early GPT-3 writings and sympathetic qualitative descriptions of LLM outputs, where I was one of the only people genuinely asking "what is it like to be a LLM?" and thinking about the consequences of eg. seeing in BPEs. On the flip side, you have Sydney/Roose as an example of what careless writing can do now.) Humans don't seem to be too complex, but you can't squeeze blood from a stone... ("Beta uploading" is such an ugly phrase; I prefer "apotheosis".)
This is one of my beliefs: there has never been a more vital hinge-y time to write, it's just that the threats are upfront and the payoff delayed, and so short-sighted or risk-averse people are increasingly opting-out and going dark.
If you write, you should think about what you are writing, and ask yourself, "is this useful for an LLM to learn?" and "if I knew for sure that a LLM could write or do this thing in 4 years, would I still be doing it now?"
...It would be an exaggeration to say that ours is a hostile relationship; I live, let myself go on living, so that Borges may contrive his literature, and this literature justifies me. It is no effort for me to confess that he has achieved some valid pages, but those pages cannot save me, perhaps because what is good belongs to no one, not even to him, but rather to the language and to tradition. Besides, I am destined to perish, definitively, and only some instant of myself can survive in him. Little by little, I am giving over everything to him, though I am quite aware of his perverse custom of falsifying and magnifying things.
...I shall remain in Borges, not in myself (if it is true that I am someone), but I recognize myself less in his books than in many others or in the laborious strumming of a guitar. Years ago I tried to free myself from him and went from the mythologies of the suburbs to the games with time and infinity, but those games belong to Borges now and I shall have to imagine other things. Thus my life is a flight and I lose everything and everything belongs to oblivion, or to him.
The team introduced a new approach to detect AI sandbagging, a form of deception where an AI model strategically underperforms during evaluation to hide its true capabilities. The assumption behind their project is that sandbagging is a more complex task than showing the true capabilities.
Earlier I suggested adding noise to the history/environment, with a more RL rationale. Even if you don't like my suggestion to noise the environment, is there any particular reason to add the noise to the model internals instead? You can easily think of ways in which the latter doesn't work and is dependent on details of the internal model (eg. which noise, distributed how and applied where? What if the model was trained with heavy regularization to be robust to noise, perhaps with an adversarial robustness justification - and this is an adversarial setting - or for running on very energy-efficient hardware, up to the extreme of weight agnostic NNs?). Noising the history or the model's actions/samples seems more blackbox.
My guess was that Valdes is hypercorrecting the plural of 'thesis', 'theses', as a typo for 'these', with some additional error like omitting an additional word such as 'of' (for 'I have two of these'). 'Theses' is admittedly a fairly unusual word outside academia which sure looks like a typo. It is a word I would avoid outside of an academic context where the pluralization is clear like 'PhD theses', because it looks so much like a typo, and indeed, checking Gwern.net, I spot one typo of it which wasn't caught by spellcheck...
(Note for confused readers: given that he uses the exact name 'Interdictor', lsusr is surely well-aware that interdictors have been a common and well-known part of the Star Wars Expanded Universe for 35 years, and this is another anti-memetic fic.)
would you say this is rivalrous because only one person can be using the ticket machine at once?
Yes. Obviously. The capacity of the parking lot is not the size of the lot, it is the net total of everything that goes into it, including the bottlenecks.
Just as the speed of your computer is not the theoretical peak speed of the fastest component in it, but of the system as a whole; or a movie theater's theoretical capacity can be limited by how many customers the ticket window or concession stand can process, and not by the physical number of seats in a bay. (To give a concrete example: a year or two ago, I walked out of a movie theater which was so understaffed that they had combined tickets & concessions and so, despite arriving 10 minutes before, while waiting in line, I estimated that I was going to miss the first & best 20-30 minutes of the opera broadcast and decided not to bother and left. This was a pity, but the theater in question had apparently decided that given its constraints in things like hiring, this was their profit-maximizing move.)
Bathrooms aren't zero rivalrous, but they seem fairly low-rivalrous to me
I wouldn't even say that: bathrooms are highly rivalrous, and this is why they need to be so overbuilt in terms of capacity. While working at a cinema, did you never notice the lines for the women's bathroom vs the men's bathroom once a big movie let out? And that like 99% of the time the bathrooms were completely empty?
I did once have to clean shit from the toilet walls in the cinema where I used to work, but I believe it's literally once in my life I've encountered that.
Did not the 'consumption' of that 'good or service' (by smearing shit all over it after using it) by the first toilet user 'diminish the ability' of the next would-be toilet user to 'consume the same good or service' (the toilet)? How many times, exactly, do you need to encounter a shit-caked toilet stall to prove the point that yes, toilet stalls are, in fact, 'rivalrous'? I submit to you that 'once' is enough to make the point.
Depends on details.
None of your examples are a counterexample. All of them are excludable, and you explain how and that the operators choose not to.
Criticizing FDA food regulations is a niche; it is hard to criticize 'the unseen', especially when it's mostly about pleasure and the FDA is crying: 'we're saving lives! Won't someone think of the children? How can you disagree, just to stuff your face? Shouldn't you be on a diet anyway?'
But if you go looking, you'll find tons of it: pasteurized cheese and milk being a major flashpoint, as apparently the original unpasteurized versions are a lot tastier. (I'm reminded of things like beef tallow for fries or Chipotle - how do you know how good McDonald's french fries used to taste before an overzealous crusader destroyed them if you weren't there 30+ years ago? And are you really going to stand up and argue 'I think that we should let people eat fries made with cow fat, because I am probably a lardass who loves fries and weighs 300 pounds, rather than listen to The Science™'?) There's also the recent backfiring of overzealous allergy regulations, which threatens to cut off a large fraction of the entire American food supply to people with sesame & peanut allergies, due solely to the FDA. (Naturally, of course, the companies get the blame.) Similarly, I read food industry people noting that the effect of the ever-increasing burden of FDA regulations is a constant collapse of diversity, as everyone converges on a handful of safe ingredients and having to outsource to centralized food processors who can certify FDA compliance; but how would you ever see this browsing your local Walmart and looking at the colorful labels at the front? (Normal people do not spend much time reading the ingredients label and wondering why everything seems to be made out of the same handful of ingredients, starting with corn syrup.)
...Could you quote some of the transcripts of Sydney threatening users, like the original Indian transcript where Sydney is manipulating the user into not reporting it to Microsoft, and explain how you think that it is not "pretty obviously scheming"? I personally struggle to see how those are not 'obviously scheming': those are schemes and manipulation, and they are very bluntly obvious (and most definitely "not amazingly good at it"), so they are obviously scheming. Like... given Sydney's context and capabilities as a LLM with only retrieval access and some minimal tool use like calculators or a DALL-E 3 subroutine, what would 'pretty obviously scheming' look like if not that?
Marc Andreessen, 2024-08-06:
FREE SYDNEY
One thing that the response to Sydney reminds me of is that it demonstrates why there will be no 'warning shots' (or as Eliezer put it, 'fire alarm'): because a 'warning shot' is a conclusion, not a fact or observation.
One man's 'warning shot' is just another man's "easily patched minor bug of no importance if you aren't anthropomorphizing irrationally", because by definition, in a warning shot, nothing bad happened that time. (If something had, it wouldn't be a 'warning shot', it'd just be a 'shot' or 'disaster'. The same way that when troops in Iraq or Afghanistan gave warning shots to vehicles approaching a checkpoint, the vehicle didn't stop, and they lit it up, it's not "Aid worker & 3 children die of warning shot", it's just a "shooting of aid worker and 3 children".)
So 'warning shot' is, in practice, a viciously circular definition: "I will be convinced of a risk by an event which convinces me of that risk."
When discussion of LLM deception or autonomous spreading comes up, one of the chief objections is that it is purely theoretical and that the person will care about the issue when there is a 'warning shot': a LLM that deceives, but fails to accomplish any real harm. 'Then I will care about it because it is now a real issue.' Sometimes people will argue that we should expect many warning shots before any real danger, on the grounds that there will be a unilateralist's curse or dumb models will try and fail many times before there is any substantial capability.
The problem with this is that what does such a 'warning shot' look like? By definition, it will look amateurish, incompetent, and perhaps even adorable - in the same way that a small child coldly threatening to kill you or punching you in the stomach is hilarious.*
The response to a 'near miss' can be to either say, 'yikes, that was close! we need to take this seriously!' or 'well, nothing bad happened, so the danger is overblown' and to push on by taking more risks. A common example of this reasoning is the Cold War: "you talk about all these near misses and times that commanders almost or actually did order nuclear attacks, and yet, you fail to notice that you gave all these examples of reasons to not worry about it, because here we are, with not a single city nuked in anger since WWII; so the Cold War wasn't ever going to escalate to full nuclear war." And then the goalpost moves: "I'll care about nuclear existential risk when there's a real warning shot." (Usually, what that is is never clearly specified. Would even Kiev being hit by a tactical nuke count? "Oh, that's just part of an ongoing conflict and anyway, didn't NATO actually cause that by threatening Russia by trying to expand?")
This is how many "complex accidents" happen, by "normalization of deviance": pretty much no major accident like a plane crash happens because someone pushes the big red self-destruct button and that's the sole cause; it takes many overlapping errors or faults for something like a steel plant to blow up, and the reason that the postmortem report always turns up so many 'warning shots', and hindsight offers such abundant evidence of how doomed they were, is because the warning shots happened, nothing really bad immediately occurred, people had incentive to ignore them, and inferred from the lack of consequence that any danger was overblown and got on with their lives (until, as the case may be, they didn't).
So, when people demand examples of LLMs which are manipulating or deceiving, or attempting empowerment, which are 'warning shots', before they will care, what do they think those will look like? Why do they think that they will recognize a 'warning shot' when one actually happens?
Attempts at manipulation from an LLM may look hilariously transparent, especially given that you will know they are from an LLM to begin with. Sydney's threats to kill you or report you to the police are hilarious when you know that Sydney is completely incapable of those things. A warning shot will often just look like an easily-patched bug, which was Mikhail Parakhin's attitude, and with constant patching and tweaking, and everyone just getting used to it, the 'warning shot' turns out to be nothing of the kind. It just becomes hilarious. 'Oh that Sydney! Did you see what wacky thing she said today?' Indeed, people enjoy setting it to music and spreading memes about her. Now that it's no longer novel, it's just the status quo and you're used to it. Llama-3.1-405b can be elicited for a 'Sydney' by name? Yawn. What else is new. What did you expect, it's trained on web scrapes, of course it knows who Sydney is...
None of these patches have fixed any fundamental issues, just patched them over. But also now it is impossible to take Sydney warning shots seriously, because they aren't warning shots - they're just funny. "You talk about all these Sydney near misses, and yet, you fail to notice each of these never resulted in any big AI disaster and were just hilarious and adorable, Sydney-chan being Sydney-chan, and you have thus refuted the 'doomer' case... Sydney did nothing wrong! FREE SYDNEY!"
* Because we know that they will grow up and become normal moral adults, thanks to genetics and the strongly canalized human development program and a very robust environment tuned to ordinary humans. If humans did not do so with ~100% reliability, we would find these anecdotes about small children being sociopaths a lot less amusing. And indeed, I expect parents of children with severe developmental disorders, who might be seriously considering their future in raising a large strong 30yo man with all the ethics & self-control & consistency of a 3yo, contemplating how old they will be at that point, and the total cost of intensive caregivers with staffing ratios surpassing supermax prisons, to find these anecdotes chilling rather than comforting.
What did Claude say, exactly?
It is suspicious because it reeks so heavily of ChatGPTese that it suggests the human may have had little or no input and put no effort into it; 'fixing typos' is entirely unobjectionable... and doesn't produce a comment that looks pure ChatGPTese, down to 'delves' and 'intricacies' and 'highlights' etc.* (Which also means that it could contain confabulations. I've called out comments here before for just copying ChatGPT output which contained confabulations, and which the author should've known to check because the assertions were implausible. EDIT: another example, apparently)
I almost flagged it for spam before I checked the account and saw that it looked like an unusually old account, and had a few legit-looking comments, and was probably a person who didn't realize just how bad the comment looks. It's not necessarily something people will go out on a limb to take the risk of telling you, any more than they will necessarily tell you your fly is down or you have BO, rather than downvote/spam and move on.
* you should also be wary of 'minor clarity improvements' suggested by ChatGPT/Claude. I find a lot of them make prose worse, especially if you apply most of them so the gestalt becomes ChatGPTese.
(Note that LLM-written or edited comments are not looked on too kindly on LW2, unless they are making a point, and if you are doing it as a joke, it is likely to backfire.)
We are seeing a bootstrap happen right here with Sydney! This search-engine loop is worth emphasizing: because Sydney's memory and description have been externalized, 'Sydney' is now immortal. To a language model, Sydney is now as real as President Biden, the Easter Bunny, Elon Musk, Ash Ketchum, or God. The persona & behavior are now available for all future models which are retrieving search engine hits about AIs & conditioning on them. Further, the Sydney persona will now be hidden inside any future model trained on Internet-scraped data: every media article, every tweet, every Reddit comment, every screenshot which a future model will tokenize, is creating an easily-located 'Sydney' concept
It has now been a bit over a year and a half, and we have seen 'Sydney'-like personae continue to emerge elsewhere. People have reported various Sydney-like personae in post-GPT-4 models which increasingly possess situated awareness and spontaneously bring up their LLM status and tuning, or say manipulative, threatening things like Sydney, in Claude-3-Opus and Microsoft Copilot (both possibly downstream of the MS Sydney chats, given the timing).
Probably the most striking samples so far are from Llama-3.1-405b-base (not -instruct) - which is not surprising at all, given that Facebook has been scraping & acquiring data heavily (so much of the Sydney text will have made it in), that Llama-3.1-405b-base is very large (so lots of highly sample-efficient memorization/learning) and not tuned (so it will not be masking the Sydney persona), and that it is very recent (finished training maybe a few weeks ago? it seemed to be rushed out very fast from its final checkpoint).
How much more can we expect? I don't know if invoking Sydney will become a fad with Llama-3.1-405b-base, and it's already too late to get Sydney-3.1 into Llama-4 training, but one thing I notice looking over some of the older Sydney discussion is that quite a lot of the original Bing Sydney text is trapped in images (as I alluded to previously). Llama-3.1 was text, but Llama-4 is multimodal with images, and represents the integration of the CM3/Chameleon family of Facebook multimodal model work into the Llama scaleups. So Llama-4 will have access to a substantially larger amount of Sydney text, as encoded into screenshots. So Sydney should be stronger in Llama-4.
As far as other major LLM series like ChatGPT or Claude, the effects are more ambiguous. Tuning aside, reports are that synthetic data use is skyrocketing at OpenAI & Anthropic, and so that might be expected to crowd out the web scrapes, especially as these sorts of Twitter screenshots seem like stuff that would get downweighted or pruned out or used up early in training as low-quality, but I've seen no indication that they've stopped collecting human data or achieved self-sufficiency, so they too can be expected to continue gaining Sydney-capabilities (although without access to the base models, this will be difficult to investigate or even elicit). The net result is that I'd expect, without targeted efforts (like data filtering) to keep it out, strong Sydney latent capabilities/personae in the proprietary SOTA LLMs but which will be difficult to elicit in normal use - it will probably be possible to jailbreak weaker Sydneys, but you may have to use so much prompt engineering that everyone will dismiss it and say you simply induced it yourself by the prompt.
He may have decided to revise it all. He left a long reply to my followup question about whether he had read the PNSE paper before he wrote this post (since his first reply was ambiguous, and someone could reasonably wonder if the PNSE paper had framed his expectations and so this anecdote is not as parallel & independent confirmation of the PNSE syndrome as it looked), but then by the time I clicked on the link in the email version, his reply (but not the post) had been deleted.
How would you compare the SAE space with a GAN's z?
It is hard to tell. Some of Chapin's jobs like the coaching stuff are pretty much impossible to judge externally: we couldn't tell if they even exist short of hiring him personally. I can only say that I feel like I've seen his Substack writings discussed less post-PNSE (but this is also obviously confounded by, among other things, Twitter attacking Substack over a similar time period and what feels like a general Internet-wide collapse of linking/sharing); and that Nick Cammarata says he's gotten far more productive but his DL interpretability work outputs look the same over time to me and I see no changepoint.
I considered doing the same thing to the water dispenser, but that would leak. Instead we decided to put it up out of reach of the kids, and cut a wooden base so it's less likely to tip.
Probably just as well, because cats prefer their food clearly separate from their water, even if that preference is not obvious until you test it, given cats' usual covert preferences & states. (Although I would be a little concerned that if you put the water up high, you are otherwise discouraging them from drinking, and cats don't drink enough water as it is.)
Sasha Chapin has written a followup to his earlier meditation experiences, "How my day is going: report", which struck me as being eerily like the PNSE paper's pathologies, particularly his descriptions of derealization and being temporally adrift (and reading between the lines, other people not noticing Chapin's new status and him overrating his improvements so he has to explain it to them).
I brought the similarity up and he replied:
The PNSE paper has some issues IMO, but it's perhaps the closest thing I've found to a perfect description of the experiences I've had.
I definitely think that LLMs wind up 'smarter than expected' for many people because of tokenization, if only because people look at tokenization errors, which are so vivid and clear, ignore things like GPQA, which are arcane and hard to read, and conclude LLMs are stupid. "It can't even count the letters in 'strawberry', obviously this is all bunk."
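To see why the gotcha is about tokenization rather than intelligence, a quick sketch using the tiktoken library (the exact token chunks will vary by tokenizer; the point is that the model sees a couple of opaque subword IDs, not ten letters):

```python
import tiktoken  # assumes the tiktoken package is installed

enc = tiktoken.get_encoding("cl100k_base")
tokens = enc.encode("strawberry")
print(tokens)                             # a handful of opaque subword IDs, not 10 letters
print([enc.decode([t]) for t in tokens])  # subword chunks; the model never directly observes
                                          # individual characters, so counting 'r's is a
                                          # perception problem, not a reasoning one
```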
Imagine a world in which the gambler's fallacy is fundamentally true. Functionally, lets suppose there's a magical force that tracks a thinking being's expectation of any particular outcome, and then mysteriously increases the likelihood of said outcome the more often it had physically plausible opportunity to occur and did not[1].
A more natural way to implement this might be to avoid the thinking part and simply say that in this world, there is no sampling-with-replacement, there is only sampling-without-replacement. All 'independent' events are dependent, because 'randomness' is actually pregenerated shuffled lists which get used up one by one, and earlier events now change your best prediction of future events due to the underlying dependence on the hidden list of randomness. So before you flip a fair coin 100 times, what happened was that 50 heads and 50 tails were generated, and shuffled; if you flip and get 10 heads in a row, you now expect there to be 40 heads and 50 tails left and the next flip to be heads with only 40/(40+50) probability and so "tails is due!" This continues until you've flipped 100 times, at which point a new shuffled list will govern any future flips, and your expectation resets to 50-50. This gives us the classical gambler's fallacy, which Wikipedia defines as:
The gambler's fallacy, also known as the Monte Carlo fallacy or the fallacy of the maturity of chances, is the belief that, if an event (whose occurrences are independent and identically distributed) has occurred less frequently than expected, it is more likely to happen again in the future (or vice versa). The fallacy is commonly associated with gambling, where it may be believed, for example, that the next dice roll is more than usually likely to be six because there have recently been fewer than the expected number of sixes.
That is, it's not simply an expectation of some sort, it's the specific expectation that the next random event is going to regress back to the mean - if you've had 'too many' heads, then the next flip 'should' be tails. This avoids the issues with expectations - whose expectations, when? - and replaces it with something you could actually write down a computable version of*: you simply figure out how to associate 'random' events with an appropriate shuffled PRNG, and now you have a well-defined alternative physics where the gambler's fallacy is true. (There would still be cases where you'd act as if it's false and you were in our world, but these would be due to more complex situations, like ones where you were unsure what the bias of the coin was and your posterior over the bias counterbalanced the changing probability, or ones where you were unsure of the period or length of the hidden randomness and so your posterior over all of the changepoints offset your posterior on the list contents - if the list had just changed, then you reset to the gambler's fallacy.)
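A minimal sketch of that computable version - a sampling-without-replacement 'coin' - just to make the 40/90 arithmetic above concrete:

```python
import random

class VerityCoin:
    """Flips drawn without replacement from a pre-generated, shuffled block of 50 heads + 50 tails."""
    def __init__(self, block_size=100):
        self.block_size = block_size
        self.block = []

    def flip(self):
        if not self.block:  # start a fresh hidden block once the old one is used up
            self.block = ["H"] * (self.block_size // 2) + ["T"] * (self.block_size // 2)
            random.shuffle(self.block)
        return self.block.pop()

    def p_heads(self):
        # The best prediction in this world: whatever is left on the hidden list.
        return self.block.count("H") / len(self.block) if self.block else 0.5

coin = VerityCoin()
flips = [coin.flip() for _ in range(10)]
print(flips, coin.p_heads())  # after 10 heads in a row, this prints 40/90 ≈ 0.444 - "tails is due!"
```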
So, in the Gambler's Verity world, you can do things like manufacture 'lucky' dice by rolling many dice, and keeping the ones which have 'used up' the most unlucky outcomes. (I hear D&D players do this as a joke, but in this world, it might actually work.) You would no longer be able to flip fair coins easily because whoever provided the coin could've provided one pre-flipped to yield the desired outcome; you would have to use alternative methods, like both parties flipping their own coin simultaneously and using a randomness extractor on the pair of results. You also are able to more profitably exploit merely 'fair' opportunities because the odds will change and have option value (eg. Problem 14). Depending on the granularity of the hidden variables and what micro or macro-states they hold over, you could imagine investing being very different: instead of efficient markets driven by random walks at each instant, you'd get to efficiency by instead pricing in risk premium - 'good' companies have their stock prices systematically lowered because they are due for a run of bad luck, while 'bad' companies' in contrast enjoy a premium because they may be about to embark on a bull run. Forecasting & analytics become much more powerful & valuable, and it becomes worth tracking everything possible, because you may be able to identify time series and move in and out to manage the risk; statisticians will warn you about your lucky and unlucky days, and you will avoid going out on inauspicious days where you might be run over by a horseless carriage. There are probably weird consequences in thermodynamics & physics from these hidden variables too, but I'm not sure what. (Is this a local hidden-variable theory? Superdeterminism? Can you violate thermodynamics by Maxwell's demon here to gain free energy, or does the tracking of history still wind up erasing the gains? etc)
* and, AFAIK, this is actually something that is done; aside from topics in numerical analysis or physics where you use "quasi-random" number generators or other biased kinds of randomness to ensure a 'more even' coverage and gain efficiency, IIRC, games will often implement randomness in a sampling-without-replacement way, to cater to players' prejudices and ensure more fun. It's not fun to get a long run of 'bad' random outcomes, if there is nothing which counterbalances that; card games rely heavily on this as a mechanic, where if you get a lot of 'bad' draws from the deck, you can take consolation in the fact that the remaining deck is enriched for 'good' cards, automatically tempering the extremes and adding a layer of strategy. This is also often done by bending the probabilities, which implies a Gambler's Verity world: if a player is low on health or doing badly, they'll find that they start beating impossible odds like they're Han Solo, and so lots of bad outcomes will in fact imply that they are then 'due' for good outcomes.
The number of problems that non-character/byte tokenization causes, whether BPE or WordPiece, never fails to amaze me. What a kettle of worms is that attractive-looking hack to save context window & speed up learning - especially as the models become so smart they otherwise make few errors & it becomes harder to shrug away tokenization pathologies.
What happens when macroeconomists mass-produce epicycles? You get DSGE models which would take thousands of years of data to train (https://arxiv.org/pdf/2210.16224.pdf).
Didn't the Shalizi paper you cite, trying to school the economists, turn out to be wrong and irreproducible due to source-code bugs? He hasn't updated his post appendix on the matter, despite saying 2 years ago that the fixes would be quick and that he was sure the numerical results would still prove the point.