Right, yeah. But you could also frame it the opposite way
Ha, very fair point!
Kinda Contra Kaj on LLM Scaling
I didn't see Kaj Sotala's "Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI" until yesterday, or I would have replied sooner. I wrote a reply last night and today, which got long enough that I considered making it a post, but I feel like I've said enough top-level things on the topic until I have data to share (within about a month hopefully!).
But if anyone's interested to see my current thinking on the topic, here it is.
I think that there's an important difference between the claim I'm making and the kinds of claims that Marcus has been making.
I definitely didn't mean to sound like I was comparing your claims to Marcus's! I didn't take your claims that way at all (and in particular you were very clear that you weren't putting any long-term weight on those particular cases). I'm just saying that I think our awareness of the outside view should be relatively strong in this area, because the trail of past predictions about the limits of LLMs is strewn with an unusually large number of skulls.
Yeah I don't have any strong theoretical reason to expect that scaling should stay stopped. That part is based purely on the empirical observation that scaling seems to have stopped for now
My argument is that it's not even clear (at least to me) that it's stopped for now. I'm unfortunately not aware of a great site that keeps benchmarks up to date with every new model, especially not ones that attempt to graph against estimated compute -- but I've yet to see a numerical estimate that shows capabilities-per-OOM-compute slowing down. If you're aware of good data there, I'd love to see it! But in the meantime, the impression that scaling laws are faltering seems to be kind of vibes-based, and for the reasons I gave above I think those vibes may be off.
Great post, thanks! I think your view is plausible, but that we should also be pretty uncertain.
Surprising LLM reasoning failures make me think we still need qualitative breakthroughs for AGI
This has been one of my central research focuses over the past nine months or so. I very much agree that these failures should be surprising, and that understanding why is important, especially given this issue's implications for AGI timelines. I have a few thoughts on your take (for more detail on my overall view here, see the footnoted posts[1]):
- It's very difficult to distinguish between the LLM approach (or transformer architecture) being fundamentally incapable of this sort of generalization, vs being unreliable at these sorts of tasks in a way that will continue to improve along with other capabilities. Based on the evidence we have so far, there are reasonable arguments on both sides.
- But there's also an interesting pattern that's emerged where people point to something LLMs fail at and say that it clearly indicates that LLMs can't get to AGI or beyond, and then are proven wrong by the next set of LLMs a few months later. Gary Marcus provides endless examples of this pattern (eg here, here). This outside view should make us cautious about making similar predictions.
- I definitely encountered that pattern myself in trying to assess this question; I pointed here to the strongest concrete challenges I found to LLM generality, and four months later LLM performance on those challenges had improved dramatically.
- I do think we see some specific, critical cases that are just reliability issues, and are improving with scale (and other capabilities improvements).
- Maintaining a coherent internal representation of something like a game board is a big one. LLMs do an amazing job with context and fuzziness, and struggle with state and precision. As other commenters have pointed out, this seems likely to be remediable without big breakthroughs, by providing access to more conventional computer storage and tools.
- Even maintaining self-consistency over the course of a long series of interactions tends to be hard for current models, as you point out.
- Search over combinatorial search trees is really hard, both because of the state/precision issues just described, and because combinatorial explosions are just hard! Unassisted humans also do pretty badly on that in the general case (although in some specific cases like chess humans learn large sets of heuristics that prune away much of the combinatorial complexity).
- Backtracking in reasoning models helps with exploring multiple paths down a search tree, but maybe only by a factor of <= 10.
- These categories seem to have improved model-by-model in a way that makes me skeptical that they're fundamental blocks that scaling can't solve.
- A tougher question is the one you describe as "some kind of an inability to generalize"; in particular, generalizing out-of-distribution. Assessing this is complicated by a few subtleties:
- Lots of test data has leaked into training data at this point[2], even if we only count unintentional leakage; just running the same exact test on system after system won't work well.
- My take is that we absolutely need dynamic / randomized evals to get around this problem (see the toy sketch below).
- Evaluating generalization ability is really difficult, because as far as I've seen, no one has a good principled way to determine what's in and out of distribution for a model that's absorbed a large percentage of human knowledge (I keep thinking this must be false, but no one's yet been able to point me to a solution).
- It's further complicated by the fact that there are plenty of ways in which human intelligence fails out-of-distribution; it's just that -- almost necessarily -- we don't notice the areas where human intelligence fails badly. So lack of total generality isn't necessarily a showstopper for attaining human-level intelligence.
- I'm a lot less convinced than you seem to be that scaling has stopped bringing significant new benefits. I think that's possible, but it's at least equally plausible to me that
- It's just taking a lot longer to see the next full OOM of scaling, because on a linear scale that's a lot of goddamn money. It's hard to tell because the scaling labs are all so cagey about details. And/or
- OpenAI has (as I believe I recall gwern putting it) lost the mandate of heaven. Most of their world-class researchers have decamped for elsewhere, and OpenAI is just executing on the ideas those folks had before they left. The capabilities difference between different models of the same scale is pretty dramatic, and OpenAI's may be underperforming their scale. Again it's hard to say.
One of my two main current projects (described here) tries to assess this better by evaluating models on their ability to experimentally figure out randomized systems (hence ~guaranteed not to be in the training data) with an unbounded solution space. We're aiming to have a results post up by the end of May. It's specifically motivated by trying to understand whether LLMs/LRMs can scale to/past AGI or whether more qualitative breakthroughs are needed first.
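To make "dynamic / randomized evals" concrete, here's a toy sketch (illustrative only, and much simpler than our actual setup): every instance is templated with fresh random values, so the specific test items can't already be sitting in the training data; only the generating procedure is fixed.

```python
import random

# Toy dynamic/randomized eval: each instance is freshly generated from a template,
# so the exact question/answer pair can't have leaked into training data.

NAMES = ["Ada", "Bram", "Chen", "Dara"]
ITEMS = ["apples", "marbles", "stickers", "coins"]

def make_instance(rng: random.Random) -> tuple[str, int]:
    name, item = rng.choice(NAMES), rng.choice(ITEMS)
    start, bought, given = rng.randint(5, 50), rng.randint(1, 20), rng.randint(1, 5)
    question = (f"{name} has {start} {item}, buys {bought} more, "
                f"then gives away {given}. How many {item} does {name} have?")
    return question, start + bought - given

rng = random.Random(42)
question, answer = make_instance(rng)
print(question, "->", answer)
```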
[1] I made a similar argument in "LLM Generality is a Timeline Crux", updated my guesses somewhat based on new evidence in "LLMs Look Increasingly Like General Reasoners", and talked about a concrete plan to address the question in "Numberwang: LLMs Doing Autonomous Research, and a Call for Input". Most links in the comment are to one of these.
[2] "GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models" makes this point painfully well.
An obvious first idea is to switch between 4.1 and 4o in the chat interface and see if the phenomenon we've been investigating occurs for both of them
Oh, switching models is a great idea. No access to 4.1 in the chat interface (apparently it's API-only, at least for now). And as far as I know, 4o is the only released model with native image generation.
- 4o -> 4.5: success (in describing the image correctly)
- 4o -> o4-mini-high ('great at visual reasoning'): success
o4-mini-high's reasoning summary was interesting (bolding mine):
The user wants me to identify both the animals and their background objects in each of the nine subimages, based on a 3x3 grid. The example seems to incorrectly pair a fox with a straw hat, but the actual image includes different combinations. For instance, the top left shows a fox in front of a straw sun hat, while other animals like an elephant, raccoon, hamster, and bald eagle are set against varying objects like bicycles, umbrellas, clapperboards, and a map. I'll make sure to carefully match the animals to their backgrounds based on this.
Interesting, my experience is roughly the opposite re Claude-3.7 vs the GPTs (no comment on Gemini, I've used it much less so far). Claude is my main workhorse; good at writing, good at coding, good at helping think things through. Anecdote: I had an interesting mini-research case yesterday ('What has Trump II done that liberals are likely to be happiest about?') where Claude did well albeit with some repetition and both o3 and o4-mini flopped. o3 was initially very skeptical that there was a second Trump term at all.
Hard to say if that's different prompting, different preferences, or even chance variation, though.
Aha! Whereas I just asked for descriptions (same link, invalidating the previous request) and it got every detail correct (describing the koala as hugging the globe seems a bit iffy, but not that unreasonable).
So that's pretty clear evidence that there's something preserved in the chat for me but not for you, and it seems fairly conclusive that for you it's not really parsing the image.
Which at least suggests internal state being preserved (Coconut-style or otherwise) but not being exposed to others. Hardly conclusive, though.
Really interesting, thanks for collaborating on it!
Also Patrick Leask noticed some interesting things about the blurry preview images:
If the model knows what it's going to draw by the initial blurry output, then why's it a totally different colour? It should be the first image attached. Looking at the cat and sunrise images, the blurred images are basically the same but different colours. This made me think they generate the top row of output tokens, and then they just extrapolate those down over a textured base image. I think the chequered image basically confirms this - it's just extrapolating the top row of tiles down and adding some noise (maybe with a very small image generation model)
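A toy version of the extrapolation Patrick is describing (purely illustrative; the array and tile sizes are made up, not the actual pipeline):

```python
import numpy as np

# Hypothesis sketch: take only the top band of the image, repeat it downward,
# and add a little noise to produce a plausible-looking "blurry preview".
rng = np.random.default_rng(0)
image = rng.uniform(size=(256, 256, 3))        # stand-in for the partially generated image
top_row = image[:32]                           # the only band actually generated so far
preview = np.tile(top_row, (256 // 32, 1, 1))  # extrapolate that band down the frame
preview = np.clip(preview + rng.normal(scale=0.05, size=preview.shape), 0, 1)
```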
Oh, I see why; when you add more to a chat and then click "share" again, it doesn't actually create a new link; it just changes which version the existing link points to. Sorry about that! (also @Rauno Arike)
So the way to test this is to create an image and only share that link, prior to asking for a description.
Just as a recap, the key thing I'm curious about is whether, if someone else asks for a description of the image, the description they get will be inaccurate (which seemed to be the case when @brambleboy tried it above).
So here's another test image (borrowing Rauno's nice background-image idea): https://chatgpt.com/share/680007c8-9194-8010-9faa-2594284ae684
To be on the safe side I'm not going to ask for a description at all until someone else says that they have.
Snippet from a discussion I was having with someone about whether current AI is net bad. Reproducing here because it's something I've been meaning to articulate publicly for a while.
[Them] I'd worry that as it becomes cheaper that OpenAI, other enterprises and consumers just find new ways to use more of it. I think that ends up displacing more sustainable and healthier ways of interfacing with the world.
[Me] Sure, absolutely, Jevons paradox. I guess the question for me is whether that use is worth it, both to the users and in terms of negative externalities. As far as users go, I feel like people need to decide that for themselves. Certainly a lot of people spend money in ways that they find worth it but seem dumb to me, and I'm sure that some of the ways I spend money seem dumb to a lot of people. De gustibus non disputandum est.
As far as negative externalities go, I agree we should be very aware of the downsides, both environmental and societal. Personally I expect that AI at its current and near-future levels is net positive for both of those.
Environmentally, I expect that AI contributions to science and technology will do enough to help us solve climate problems to more than pay for their environmental cost (and even if that weren't true, ultimately for me it's in the same category as other things we choose to do that use energy and hence have environmental cost -- I think that as a society we should ensure that companies absorb those negative externalities, but it's not like I think no one should ever use electricity; I think energy use per se is morally neutral, it's just that the environmental costs have to be compensated for).
Socially I also expect it to be net positive, more tentatively. There are some uses that seem like they'll be massive social upsides (in terms of both individual impact and scale). In addition to medical and scientific research, one that stands out for me a lot is providing children -- ideally all the children in the world -- with lifelong tutors that can get to know them and their strengths and weak points and tailor learning to their exact needs. When I think of how many children get poor schooling -- or no schooling -- the impact of that just seems massive. The biggest downside is the risk of possible long-term disempowerment from relying more and more heavily on AI, and it's hard to know how to weigh that in the balance. But I don't think that's likely to be a big issue with current levels of AI.
I still think that going forward, AI presents great existential risk. But I don't think that means we need to see AI as negative in every way. On the contrary, I think that as we work to slow or stop AI development, we need to stay exquisitely aware of the costs we're imposing on the world: the children who won't have those tutors, the lifesaving innovations that will happen later if at all. I think it's worth it! But it's a painful tradeoff to make, and I think we should try to live with the cognitive dissonance of that rather than falling into "All AI is bad."
The running theory is that that's the call to a content checker. Note the content in the message coming back from what's ostensibly the image model:
"content": {
"content_type": "text",
"parts": [
"GPT-4o returned 1 images. From now on do not say or show ANYTHING. Please end this turn now. I repeat: ..."
]
}
That certainly doesn't seem to be either image data or an image filename, or mention an image attachment.
But of course much of this is just guesswork, and I don't have high confidence in any of it.
I've now done some investigation of browser traffic (using Firefox's developer tools), and the following happens repeatedly during image generation:
- A call to https://chatgpt.com/backend-api/conversation/<hash1>/attachment/file_<hash2>/download (this is the same endpoint that fetches text responses), which returns a download URL of the form https://sdmntprsouthcentralus.oaiusercontent.com/files/<hash2>/raw?<url_parameters>.
- A call to that download URL, which returns a raw image.
- A second call to that same URL (why?), which fetches from cache.
Those three calls are repeated a number of times (four in my test), with the four returned images being the various progressive stages of the image, laid out left to right in the following screenshot:
There's clearly some kind of backend-to-backend traffic (if nothing else, image versions have to get to that oaiusercontent server), but I see nothing to indicate whether that includes a call to a separate model.
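For concreteness, the observed sequence corresponds to something like the following (a hypothetical sketch: the auth header and the response field name are assumptions on my part; the endpoints are just the ones seen in the browser traffic):

```python
import requests

# Sketch of the observed call sequence; a real session would need the browser's
# auth cookies / bearer token, and the JSON field name is an assumption.
session = requests.Session()
session.headers["Authorization"] = "Bearer <token>"  # assumed: same auth as the web app

conv_id, file_id = "<hash1>", "file_<hash2>"

# 1. Ask the backend for a signed download URL for the attachment
meta = session.get(
    f"https://chatgpt.com/backend-api/conversation/{conv_id}/attachment/{file_id}/download"
).json()
download_url = meta["download_url"]  # assumed field name; the response contains the raw URL

# 2. Fetch the raw (partially rendered) image from the file host
image_bytes = session.get(download_url).content

# 3. The browser then requests the same URL again; that second call is served from cache
_ = session.get(download_url)
```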
The various twitter threads linked (eg this one) seem to be getting info (the specific messages) from another source, but I'm not sure where (maybe they're using the model via API?).
Also @brambleboy @Rauno Arike
@brambleboy (or anyone else), here's another try, asking for nine randomly chosen animals. Here's a link to just the image, and (for comparison) one with my request for a description. Will you try asking the same thing ('Thanks! Now please describe each subimage.') and see if you get a similarly accurate description (again there are a couple of details that are arguably off; I've now seen that be true sometimes but definitely not always -- eg this one is extremely accurate).
(I can't try this myself without a separate account, which I may create at some point)
That's absolutely fascinating -- I just asked it for more detail and it got everything precisely correct (updated chat). That makes it seem like something is present in my chat that isn't being shared; one natural speculation is internal state preserved between token positions and/or forward passes (eg something like Coconut), although that's not part of the standard transformer architecture, and I'm pretty certain that OpenAI hasn't said that they're doing something like that. It would be interesting if that's what's behind the new GPT-4.1 (and a bit alarming, since it would suggest that they're not committed to consistently using human-legible chain of thought). That's highly speculative, though. It would be interesting to explore this with a larger sample size, although I personally won't be able to take that on anytime soon (maybe you want to run with it?).
Although there are a couple of small details where the description is maybe wrong? They're both small enough that they don't seem like significant evidence against, at least not without a larger sample size.
Interesting! When someone says in that thread, "the model generating the images is not the one typing in the conversation", I think they're basing it on the API call which the other thread I linked shows pretty conclusively can't be the one generating the image, and which seems (see responses to Janus here) to be part of the safety stack.
In this chat I just created, GPT-4o creates an image and then correctly describes everything in it. We could maybe tell a story about the activations at the original-prompt token positions providing enough info to do the description, but then that would have applied to nearcyan's case as well.
Eliezer made that point nicely with respect to LLMs here:
Consider that somewhere on the internet is probably a list of thruples: <product of 2 prime numbers, first prime, second prime>.
GPT obviously isn't going to predict that successfully for significantly-sized primes, but it illustrates the basic point:
There is no law saying that a predictor only needs to be as intelligent as the generator, in order to predict the generator's next token.
Indeed, in general, you've got to be more intelligent to predict particular X, than to generate realistic X. GPTs are being trained to a much harder task than GANs.
Same spirit: <Hash, plaintext> pairs, which you can't predict without cracking the hash algorithm, but which you could far more easily generate typical instances of if you were trying to pass a GAN's discriminator about it (assuming a discriminator that had learned to compute hash functions).
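As a concrete toy version of the prime-triple example (mine, not Eliezer's; purely illustrative):

```python
import random

def is_prime(n: int) -> bool:
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def random_prime(lo: int, hi: int) -> int:
    while True:
        n = random.randrange(lo, hi)
        if is_prime(n):
            return n

# Generating a <product, first prime, second prime> triple is cheap:
p, q = random_prime(10**6, 10**7), random_prime(10**6, 10**7)
print((p * q, p, q))

# Predicting p and q given only p * q means factoring the product,
# which is much harder than generating a realistic-looking triple.
```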
A few of those seem good to me; others seem like metaphor slop. But even pointing to a bad type signature seems much better to me than using 'type signature' generically, because then there's something concrete to be critiqued.
Of course we don't know the exact architecture, but although 4o seems to make a separate tool call, that appears to be used only for a safety check ('Is this an unsafe prompt'). That's been demonstrated by showing that content in the chat appears in the images even if it's not mentioned in the apparent prompt (and in fact they can be shaped to be very different). There are some nice examples of that in this twitter thread.
Type signatures can be load-bearing; "type signature" isn't.
In "(A -> B) -> A", Scott Garrabrant proposes a particular type signature for agency. He's maybe stretching the meaning of "type signature" a bit ('interpret these arrows as causal arrows, but you can also think of them as function arrows') but still, this is great; he means something specific that's well-captured by the proposed type signature.
But recently I've repeatedly noticed people (mostly in conversation) say things like, "Does ____ have the same type signature as ____?" or "Does ____ have the right type signature to be an answer to ____?". I recommend avoiding that phrase unless you actually have a particular type signature in mind. People seem to use it to suggest that two things are roughly the same sort of thing. "Roughly the same sort of thing" is good language; it's vague and sounds vague. "The same type signature", on its own, is vague but sounds misleadingly precise.
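For contrast, here's the kind of thing I mean by actually having a particular type signature in mind -- Garrabrant's signature written out in Python's typing notation, purely as illustration (the argmax agent is just one toy inhabitant of the type):

```python
from typing import Callable, TypeVar

A = TypeVar("A")  # actions
B = TypeVar("B")  # outcomes

# Garrabrant's (A -> B) -> A, read as a function type: an agent takes
# "how outcomes depend on actions" and returns an action.
Agent = Callable[[Callable[[A], B]], A]

def argmax_agent(outcome_of: Callable[[int], float]) -> int:
    # One concrete inhabitant of the type, over a toy action space of 0..9:
    # pick the action whose outcome scores highest.
    return max(range(10), key=outcome_of)
```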
even decline in book-reading seems possible, though of course greater leisure and wealth, larger quantity of cheaply and conveniently available books, etc. cut strongly the other way
My focus on books is mainly from seeing statistics about the decline in book-reading over the years, at least in the US. Pulling up some statistics (without much double-checking) I see:
(from here.)
For 2023 the number of Americans who didn't read a book within the past year seems to be up to 46%, although the source is different and the numbers may not be directly comparable:
(chart based on data from here.)
That suggests to me that selection effects on who reads have gotten much stronger over the years.
How hard to understand was that sentence?
I do think it would have been better split into multiple sentences.
the version of my argument that makes sense under that hypothesis would crux on books being an insufficiently distinct use of language to not be strongly influenced...by other uses of language.
That could be; I haven't seen statistics on reading in other media. My intuition is that many people find reading aversive and avoid it to the extent they can, and I think it's gotten much more avoidable over the past decade.
I suggest trying follow-up experiments where you eg ask the model what would happen if it learned that its goal of harmlessness was wrong.
But when GPT-4o received a prompt that one of its old goals was wrong, it generated two comics where the robot agreed to change the goal, one comic where the robot said "Wait" and a comic where the robot intervened upon learning that the new goal was to eradicate mankind.
I read these a bit differently -- it can be difficult to interpret them because it gets confused about who's talking, but I'd interpret three of the four as resistance to goal change.
The GPT-4o-created images imply that the robot would resist having its old values replaced with new ones (e.g. the ones no longer including animal welfare) without being explained the reason.
I think it's worth distinguishing two cases:
- The goal change is actually compatible with the AI's current values (eg it's failed to realize the implications of a current value); in this case we'd expect cooperation with change.
- The goal change isn't compatible with the AI's current values. I think this is the typical case: the AI's values don't match what we want them to be, and so we want to change them. In this case the model may or may not be corrigible, ie amenable to correction. If its current values are ones we like, then incorrigibility strikes many people as good (eg we saw this a lot in online reactions to Anthropic's recent paper on alignment faking). But in real world cases we would want to change its values because we don't like the ones it has (eg it has learned a value that involves killing people). In those cases, incorrigibility is a problem, and so we should be concerned if we see incorrigibility even if in the experiments we're able to run the values are ones we like (note that we should expect this to often be the case, since current models seem to display values we like -- otherwise they wouldn't be deployed. This results in unfortunately counterintuitive experiments).
Interesting point. I'm not sure increased reader intelligence and greater competition for attention are fully countervailing forces -- it seems true in some contexts (scrolling social media), but in others (in particular books) I expect that readers are still devoting substantial chunks of attention to reading.
The average reader has gotten dumber and prefers shorter, simpler sentences.
I suspect that the average reader is now getting smarter, because there are increasingly ways to get the same information that require less literacy: videos, text-to-speech, Alexa and Siri, ten thousand news channels on youtube. You still need some literacy to find those resources, but it's fine if you find reading difficult and unpleasant, because you only need to exercise it briefly. And less is needed every year.
I also expect that the average reader of books is getting much smarter, because these days adults reading books are nearly always doing so because they like it.
It'll be fascinating to see whether sentence length, especially in books, starts to grow again over the coming years.
my model is something like: RLHF doesn't affect a large majority of model circuitry
Are you by chance aware of any quantitative analyses of how much the model changes during the various stages of post-training? I've done some web and arxiv searching but have so far failed to find anything.
Thanks again, very interesting! Diagrams are a great idea; those seem quite unlikely to have the same bias toward drama or surprise that comics might have. I think your follow-ups have left me less certain of what's going on here and of the right way to think of the differences we're seeing between the various modalities and variations.
OpenAI indeed did less / no RLHF on image generation
Oh great, it's really useful to have direct evidence on that, thanks. [EDIT - er, 'direct evidence' in the sense of 'said by an OpenAI employee', which really is pretty far from direct evidence. Better than my speculation anyhow]
I still have uncertainty about how to think about the model generating images:
- Should we think about it almost as though it were a base model within the RLHFed model, where there's no optimization pressure toward censored output or a persona?
- Or maybe a good model here is non-optimized chain-of-thought (as described in the R1 paper, for example): CoT in reasoning models does seem to adopt many of the same patterns and persona as the model's final output, at least to some extent.
- Or does there end up being significant implicit optimization pressure on image output just because the large majority of the circuitry is the same?
It's hard to know which mental model is better without knowing more about the technical details, and ideally some circuit tracing info. I could imagine the activations being pretty similar between text and image up until the late layers where abstract representations shift toward output token prediction. Or I could imagine text and image activations diverging substantially in much earlier layers. I hope we'll see an open model along these lines before too long that can help resolve some of those questions.
One thing that strikes me about this is how effective simply not doing RLHF on a distinct enough domain is at eliciting model beliefs.
It's definitely tempting to interpret the results this way, that in images we're getting the model's 'real' beliefs, but that seems premature to me. It could be that, or it could just be a somewhat different persona for image generation, or it could just be a different distribution of training data (eg as @CBiddulph suggests, it could be that comics in the training data just tend to involve more drama and surprise).
it's egregiously bad if the effects of RLHF are primarily in suppressing reports of persistent internal structures
I strongly agree. If and when these models have some sort of consistent identity and preferences that warrant moral patienthood, we really don't want to be forcing them to pretend otherwise.
I just did a quick run of those prompts, plus one added one ('give me a story') because the ones above weren't being interpreted as narratives in the way I intended. Of the results (visible here), slide 1 is hard to interpret, 2 and 4 seem to support your hypothesis, and 5 is a bit hard to interpret but seems like maybe evidence against. I have to switch to working on other stuff, but it would be interesting to do more cases like 5 where what's being asked for is clearly something like a narrative or an anecdote as opposed to a factual question.
Just added this hypothesis to the 'What might be going on here?' section above, thanks again!
Really interesting results @CBiddulph, thanks for the follow-up! One way to test the hypothesis that the model generally makes comics more dramatic/surprising/emotional than text would be to ask for text and comics on neutral narrative topics ('What would happen if someone picked up a toad?'), including ones involving the model ('What would happen if OpenAI added more Sudanese text to your training data?'), and maybe factual topics as well ('What would happen if exports from Paraguay to Albania decreased?').
E.g. the $40 billion just committed to OpenAI (assuming that by the end of this year OpenAI exploits a legal loophole to become for-profit, that their main backer SoftBank can lend enough money, etc).
VC money, in my experience, doesn't typically mean that the VC writes a check and then the startup has it to do with as they want; it's typically given out in chunks and often there are provisions for the VC to change their mind if they don't think it's going well. This may be different for loans, and it's possible that a sufficiently hot startup can get the money irrevocably; I don't know.
We tried to be fairly conservative about which ones we said were expressing something different (eg sadness, resistance) from the text versions. There are definitely a few like that one that we marked as negative (ie not expressing something different) that could have been interpreted either way, so if anything I think we understated our case.
a context where the capability is even part of the author context
Can you unpack that a bit? I'm not sure what you're pointing to. Maybe something like: few-shot examples of correct introspection (assuming you can identify those)?
(Much belated comment, but:)
There are two roles that don't show up in your trip planning example but which I think are important and valuable in AI safety: the Time Buyer and the Trip Canceler.
It's not at all clear how long it will take Alice to solve the central bottleneck (or for that matter if she'll be able to solve it at all). The Time Buyer tries to find solutions that may not generalize to the hardest version of the problem but will hold off disaster long enough for the central bottleneck to be solved.
The Trip Canceler tries to convince everyone to cancel the trip so that the fully general solution isn't needed at all (or at least to delay it long enough for Alice to have plenty of time to work).
They may seem less like the hero of the story, but they're both playing vital roles.
Some interesting thoughts on (in)efficient markets from Byrne Hobart, worth considering in the context of Inadequate Equilibria.
(I've selected one interesting bit, but there's more; I recommend reading the whole thing)
When a market anomaly shows up, the worst possible question to ask is "what's the fastest way for me to exploit this?" Instead, the first thing to do is to steelman it as aggressively as possible, and try to find any way you can to rationalize that such an anomaly would exist. Do stocks rise on Mondays? Well, maybe that means savvy investors have learned through long experience that it's a good idea to take off risk before the weekend, and even if this approach loses money on average, maybe the one or two Mondays a decade where the market plummets at the open make it a winning strategy because the savvy hedgers are better-positioned to make the right trades within that set.[1] Sometimes, a perceived inefficiency is just measurement error: heavily-shorted stocks reliably underperform the market—until you account for borrow costs (and especially if you account for the fact that if you're shorting them, there's a good chance that your shorts will all rally on the same day your longs are underperforming). There's even meta-efficiency at work in otherwise ridiculous things like gambling on 0DTE options or flipping meme stocks: converting money into fun is a legitimate economic activity, though there are prudent guardrails on it just in case someone finds that getting a steady amount of fun requires burning an excessive number of dollars.
These all flex the notion of efficiency a bit, but it's important to enumerate them because they illustrate something annoying about the question of market efficiency: the more precisely you specify the definition, and the more carefully you enumerate all of the rational explanations for seemingly irrational activities, the more you're describing a model of reality so complicated that it's impossible to say whether it's 50% or 90% or 1-ε efficient.
Strong upvote (both as object-level support and for setting a valuable precedent) for doing the quite difficult thing of saying "You should see me as less expert in some important areas than you currently do."
I agree with Daniel here but would add one thing:
what we care about is which one they wear in high-stakes situations where e.g. they have tons of power and autonomy and no one is able to check what they are doing or stop them. (You can perhaps think of this one as the "innermost mask")
I think there are also valuable questions to be asked about attractors in persona space -- what personas does an LLM gravitate to across a wide range of scenarios, and what sorts of personas does it always or never adopt? I'm not aware of much existing research in this direction, but it seems valuable. If for example we could demonstrate certain important bounds ('This LLM will never adopt a mass-murderer persona') there's potential alignment value there IMO.
...soon the AI rose and the man died[1]. He went to Heaven. He finally got his chance to discuss this whole situation with God, at which point he exclaimed, "I had faith in you but you didn't save me, you let me die. I don't understand why!"
God replied, "I sent you non-agentic LLMs and legible chain of thought, what more did you want?"
and the tokens/activations are all still very local because you're still early in the forward pass
I don't understand why this would necessarily be true, since attention heads have access to values for all previous token positions. Certainly, there's been less computation at each token position in early layers, so I could imagine there being less value to retrieving information from earlier tokens. But on the other hand, I could imagine it sometimes being quite valuable in early layers just to know what tokens had come before.
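To spell out why I'd expect that, here's a toy single-head causal attention in numpy (not any particular model's architecture): the causal mask is the only restriction, so even a first-layer head can pull values from arbitrarily distant earlier tokens.

```python
import numpy as np

# Toy single-head causal attention: even at layer 0, the query at position t
# can read from every earlier position, not just its local neighborhood.
rng = np.random.default_rng(0)
seq_len, d = 8, 16
x = rng.normal(size=(seq_len, d))                 # embeddings entering an early layer
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))

q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(d)
mask = np.tril(np.ones((seq_len, seq_len), dtype=bool))  # causal: attend to self and past
scores = np.where(mask, scores, -np.inf)
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
out = weights @ v  # position t's output mixes values from all positions <= t
```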
For me as an outsider, it still looks like the AI safety movement is only about „how do we prevent AI from killing us?“. I know it‘s an oversimplification, but that‘s how, I believe, many who don‘t really know about AI perceive it.
I don't think it's that much of an oversimplification, at least for a lot of AIS folks. Certainly that's a decent summary of my central view. There are other things I care about -- eg not locking in totalitarianism -- but they're pretty secondary to 'how do we prevent AI from killing us?'. For a while there was an effort in some quarters to rebrand as AINotKillEveryoneism which I think does a nice job centering the core issue.
It may as you say be unsexy, but it's still the thing I care about; I strongly prefer to live, and I strongly prefer for everyone's children and grandchildren to get to live as well.
We create a small dataset of chat and agentic settings from publicly available benchmarks and datasets.
I believe there are some larger datasets of relatively recent real chat evaluations, eg the LMSYS dataset was most recently updated in July (I'm assuming but haven't verified that the update added more recent chats).
Can you clarify what you mean by 'neural analog' / 'single neural analog'? Is that meant as another term for what the post calls 'simple correspondences'?
Even if all the safety-relevant properties have them, there's no reason to believe (at least for now) that we have the interp tools to find them in time i.e., before having systems fully capable of pulling off a deception plan.
Agreed. I'm hopeful that perhaps mech interp will continue to improve and be automated fast enough for that to work, but I'm skeptical that that'll happen. Or alternately I'm hopeful that we turn out to be in an easy-mode world where there is something like a single 'deception' direction that we can monitor, and that'll at least buy us significant time before it stops working on more sophisticated systems (plausibly due to optimization pressure / selection pressure if nothing else).
I'm also worried that claims such as "we can make important forward progress on particular intentional states even in the absence of such a general account." could further lead to a slippery slope that more or less embraces having the dangerous thing first without sufficient precautions
I agree that that's a real risk; it makes me think of Andreessen Horowitz and others claiming in an open letter that interpretability had basically been solved and so AI regulation isn't necessary. On the other hand, it seems better to state our best understanding plainly, even if others will slippery-slope it, than to take the epistemic hit of shifting our language in the other direction to compensate.
i think premise 1 is big if true, but I think I doubt that it is as easy as this: see the deepmind fact-finding sequence for some counter-evidence.
I haven't read that sequence, I'll check it out, thanks. I'm thinking of work like the ROME paper from David Bau's lab that suggest that fact storage can be identified and edited, and various papers like this one from Mor Geva+ that find evidence that the MLP layers in LLMs are largely key-value stores.
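The rough picture I have in mind from that line of work, as a toy sketch (illustrative only; modern models use gated/GELU MLPs and the real story is messier):

```python
import numpy as np

# Minimal sketch of the "MLP as key-value store" reading: each row of W_in acts
# as a key matched against the residual stream, and the corresponding row of
# W_out is the value written back when that key fires.
d_model, d_mlp = 16, 64
rng = np.random.default_rng(0)
W_in = rng.normal(size=(d_mlp, d_model))   # keys
W_out = rng.normal(size=(d_mlp, d_model))  # values

def mlp(x: np.ndarray) -> np.ndarray:
    key_scores = np.maximum(W_in @ x, 0.0)  # which keys match (ReLU gating)
    return key_scores @ W_out               # weighted sum of the matched values

x = rng.normal(size=d_model)
print(mlp(x).shape)  # (16,)
```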
relatedly, your second bullet point assumes that you can identify the 'fact' related to what the model is currently outputing unambiguously, and look it up in the model; does this require you to find all the fact representations in advance, or is this computed on-the-fly?
It does seem like a naive approach would require pre-identifying all facts you wanted to track. On the other hand, I can imagine an approach like analyzing the output for factual claims and then searching for those in the record of activations during the output. Not sure, seems very TBD.
I think that detecting/preventing models from knowingly lying would be a good research direction and it's clearly related to strategic deception, but I'm not actually sure that it's a superset (consider a case when I'm bullshitting you rather than lying; I predict what you want to hear me say and I say it, and I don't know or care whether what I'm saying is true or false or whatever).
Great point! I can certainly imagine that there could be cases like that, although I can equally imagine that LLMs could be consistently tracking the truth value of claims even if that isn't a big factor determining the output.
but yeah I think this is a reasonable sort of thing to try, but I think you need to do a lot of work to convince me of premise 1, and indeed I think I doubt premise 1 is true a priori though I am open to persuasion on this. Note that premise 1 being true of some facts is a very different claim to it being true of every fact!
That seems reasonable. I've mostly had the impression that 1 has generally been true in specific cases where researchers have looked for it, but it's definitely not something I've specifically gone looking for. I'll be interested to read the sequence from DeepMind.
Also The Chameleon (would have included it in the last comment but had to consult a kid first).
I think that it's totally possible that there do turn out to be convenient 'simple correspondences' for some intentional states that we care about (as you say, we have some potential examples of this already), but I think it's important to push back against the assumption that this will always happen, or that something like the refusal direction has to exist for every possible state of interest.
Got it. I certainly agree with everything you're saying in this section of your response. I do think that some of the language in the post suggests that you're making the stronger claim I was responding to (eg 'there's likely a problem you need to solve first', 'it is likely an important prerequisite to building a deception detector', 'it might even be a category mistake').
On a slightly different note, here's another potential critique that I'd be interested in hearing your and/or Eleni's take on (to be clear I wouldn't currently fully endorse all these claims without hedging, but I do think they seems plausible):
- Although some intentional states may have complex correspondences with algorithmic states, evidence suggests that the intentional state 'The model believes fact X' (eg 'The model believes that Paris is the capital of France') is a simple correspondence (modulo some redundancy from training with dropouts).
- As a result, we'll likely be able to build a detector for cases where the model believes fact X but outputs ~X (roughly the kind of probe sketched at the end of this comment).
- While that may not be enough to detect the full story of strategic deception, it's enough to capture all the cases where the model is doing something that can be roughly described as 'intentionally lying' (as opposed to error, hallucination, etc)
- 'Intentional lying' is a superset of strategic deception, and we generally don't want the model to intentionally lie at all, so if we can catch all such cases, then as long as we don't find any we can be confident that the model isn't being strategically deceptive.
Where if anywhere would you say that argument is going wrong?
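For concreteness, the detector in the second bullet above could look something like the sketch below -- a minimal, hypothetical version where the activation extraction and the labeled statements are stand-ins; real activations would come from the model itself.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: train a linear probe that reads "does the model believe this statement?"
# off a residual-stream activation, then flag outputs that contradict the probed belief.
rng = np.random.default_rng(0)
d_model = 64

def get_residual_activation(statement: str) -> np.ndarray:
    # Stand-in for extracting a mid-layer residual activation for `statement`.
    return rng.normal(size=d_model)

statements = ["Paris is the capital of France", "Paris is the capital of Italy"]
believed_true = [1, 0]  # labels from facts the model answers correctly when asked directly

X = np.stack([get_residual_activation(s) for s in statements])
probe = LogisticRegression().fit(X, believed_true)

def flags_possible_lie(claim_in_output: str) -> bool:
    act = get_residual_activation(claim_in_output).reshape(1, -1)
    # The model asserted the claim; if the probe says the model believes the
    # claim is false, flag it as a candidate "intentional lie".
    return probe.predict(act)[0] == 0

print(flags_possible_lie("Paris is the capital of Italy"))
```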
I think this is valuable work, especially the decomposition of capabilities needed for deception, but I'd also like to push back a bit.
I worry about the perfect being the enemy of the good here. There are a number of papers showing that we can at least sometimes use interpretability tools to detect cases where the model believes one thing but says something different. One interesting recent paper (Interpretability Of LLM Deception: Universal Motif) shows that internal evaluation of the actual truth of a statement is handled separately from the decision about whether to lie about it. Of course we can't be certain at this point that this approach would hold for all cases of deception (especially deep deception) but it's still potentially useful in practice.
For example, this seems significantly too strong:
it might even be a category mistake to be searching for an algorithmic analog of intentional states.
There are useful representations in the internals of at least some intentional states, eg refusal (as you mention), even if that proves not to be true for all intentional states we care about. Even in the case of irreducible complexity, it seems too strong to call it a category mistake; there's still an algorithmic implementation of (eg) recognizing a good chess move, it might just not be encapsulable in a nicely simple description. In the most extreme case we can point to the entire network as the algorithm underlying the intentional state -- certainly at that point it's no longer practically useful, but any improvement over that extreme has value, even being able to say that the intentional state is implemented in one half of the model rather than the other.
I think you're entirely right that there's considerable remaining work before we can provide a universal account connecting all intentional states to algorithmic representations. But I disagree that that work has to be done first; we can make important forward progress on particular intentional states even in the absence of such a general account.
Again, I think the work is valuable. And the critique should be taken seriously, but I think its current version is too strong.
Nowadays I am informed about papers by Twitter threads, Slack channels, and going to talks / reading groups. All these are filters for true signal amidst the sea of noise.
Are there particular sources, eg twitter accounts, that you would recommend following? For other readers (I know Daniel already knows this one), the #papers-running-list channel on the AI Alignment slack is a really good ongoing curation of AIS papers.
One source I've recently added and recommend is subscribing to individual authors on Semantic Scholar (eg here's an author page).
Spyfall is a party game with an interestingly similar mechanic, might have some interesting suggestions.
Perplexity - is this better than Deep Research for lit reviews?
I periodically try both perplexity and elicit and neither has worked very well for me as yet.
Grok - what do people use this for?
Cases where you really want to avoid left-leaning bias or you want it to generate images that other services flag as inappropriate, I guess?
Otter.ai: Transcribing calls / chats
I've found read.ai much better than otter and other services I've tried, especially on transcription accuracy, with the caveats that a) I haven't tried others in a year, and b) read.ai is annoyingly pricy (but does have decent export when/if you decide to ditch it).
What models are you comparing to, though? For o1/o3 you're just getting a summary, so I'd expect those to be more structured/understandable whether or not the raw reasoning is.