While @Polite Infinity in particular is clearly a thoughtful commenter, I strongly support the policy (as mentioned in this gist which includes Raemon's moderation discussion with Polite Infinity) to 'lean against AI content by default' and 'particularly lean towards requiring new users to demonstrate they are generally thoughtful, useful content.' We may conceivably end up in a world where AI content is typically worthwhile reading, but we're certainly not there yet.
Other recent models that show (at least purportedly) the full CoT:
- Deepseek R1-Lite (note that you have to turn 'DeepThink' on)
- Qwen QwQ-32B
However, our Preparedness evaluations should still be seen as a lower bound for potential risks...Moreover, the field of frontier model evaluations is still nascent
Perhaps then we should slow the hell down.
I realize I'm preaching to the choir here, but honestly I don't know how we're supposed to see this as responsible behavior.
Thanks for sketching that out, I appreciate it. Unlearning significantly improving the safety outlook is something I may not have fully priced in.
My guess is that the central place we differ is that I expect dropping in, say, 100k extra capabilities researchers gets us into greater-than-human intelligence fairly quickly -- we're already seeing LLMs scoring better than human in various areas, so clearly there's no hard barrier at human level -- and at that point control gets extremely difficult.
I do certainly agree that there's a lot of low-hanging fruit in control that's well worth grabbing.
Granted -- but I think the chances of that happening are different in my proposed scenario than currently.
I wish I shared your optimism! You've talked about some of your reasons for it elsewhere, but I'd be interested to hear even a quick sketch of roughly how you imagine the next decade to go in the context of the thought experiment, in the 70-80% of cases where you expect things to go well.
If we could trust OpenAI to handle this scenario responsibly, our odds would definitely seem better to me.
I realize that asking about p(doom) is utterly 2023, but I'm interested to see if there's a rough consensus in the community about how it would go if it were now, and then it's possible to consider how that shifts as the amount of time moves forward.
I've been thinking of writing up a piece on the implications of very short timelines, in light of various people recently suggesting them (eg Dario Amodei, "2026 or 2027...there could be a mild delay")
Here's a thought experiment: suppose that this week it turns out that OAI has found a modified sampling technique for o1 that puts it at the level of the median OAI capabilities researcher, in a fairly across-the-board way (ie it's just straightforwardly able to do the job of a researcher). Suppose further that it's not a significant additional compute expense; let's say that OAI can immediately deploy a million instances.
What outcome would you expect? Let's operationalize that as: what do you think is the chance that we get through the next decade without AI causing a billion deaths (via misuse or unwanted autonomous behaviors or multi-agent catastrophes that are clearly downstream of those million human-level AI)?
In short, what do you think are the chances that that doesn't end disastrously?
I see -- I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!
One story here could be that the model is being used only in an API context where it's being asked to take actions on something well-modeled as a Markov process, where the past doesn't matter (and we assume that the current state doesn't incorporate relevant information about the past). There are certainly use cases that fit that ('trigger invisible fence iff this image contains a dog'; 'search for and delete this file'). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.
Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn't apply to current LLMs but seems like a very plausible path for economically useful models like agents.
It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'
I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me
Epoch's analysis from June supports this view, and suggests it may even be a bit too conservative:
(and that's just for text -- there are also other significant sources of data for multimodal models, eg video)
the untrusted model is stateless between queries and only sees the command history and system state.
What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.
If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.
Great point! In the block world paper, they re-randomize the obfuscated version, change the prompt, etc ('randomized mystery blocksworld'). They do see a 30% accuracy dip when doing that, but o1-preview's performance is still 50x that of the best previous model (and > 200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there's no way to tell, though, since they don't test o1-preview on the fully-private held-out set of problems.
The Ord piece is really intriguing, although I'm not sure I'm entirely convinced that it's a useful framing.
- Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn't apply to many kinds of hyperpolation.
- The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
- Given an infinite number of possible additional dimensions, it's not at all clear how a NN could choose a particular one to try to hyperpolate into.
It's a fascinating idea, though, and one that'll definitely stick with me as a possible framing. Thanks!
After some discussion elsewhere with @zeshen, I'm feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).
[EDIT: I originally gave an excessively long and detailed response to your predictions. That version is preserved (& commentable) here in case it's of interest]
I applaud your willingness to give predictions! Some of them seem useful but others don't differ from what the opposing view would predict. Specifically:
- I think most people would agree that there are blind spots; LLMs have and will continue to have a different balance of strengths and weaknesses from humans. You seem to say that those blind spots will block capability gains in general; that seems unlikely to me (and it would shift me toward your view if it clearly happened) although I agree they could get in the way of certain specific capability gains.
- The need for escalating compute seems like it'll happen either way, so I don't think this prediction provides evidence on your view vs the other.
- Transformers not being the main cognitive component of scaffolded systems seems like a good prediction. I expect that to happen for some systems regardless, but I expect LLMs to be the cognitive core for most, until a substantially better architecture is found, and it will shift me a bit toward your view if that isn't the case. I do think we'll eventually see such an architectural breakthrough regardless of whether your view is correct, so I think that seeing a breakthrough won't provide useful evidence.
- 'LLM-centric systems can't do novel ML research' seems like a valuable prediction; if it proves true, that would shift me toward your view.
First of all, serious points for making predictions! And thanks for the thoughtful response.
Before I address specific points: I've been working on a research project that's intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I'd be very interested to hear whether you would find the results of the proposed experiment compelling; if not, then why not, and are there changes that could be made that would make it provide evidence you'd find more compelling?
Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.
Absolutely! And then on top of that, it's very easy to mistake using knowledge from the truly vast training data for actual reasoning.
But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.
This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need far fewer language examples to learn language), and that multimodal data / synthetic data / self-play / simulated environments will continue to improve. I suspect capabilities researchers would have made more progress on all those fronts, had it not been the case that up to now it was easy to throw more data at the models.
In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).
Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind?
I think that's a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don't know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
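To make the linear-probe point concrete, here is a minimal sketch on synthetic data. It is not the method or data of the papers mentioned above: the activation size, the noise level, and the idea that a single scalar feature (think of a rescaled latitude) is written along one fixed direction are all assumptions made for illustration. The probe is nothing more than a dot product plus a bias, fit here by plain gradient descent.

```python
import random

random.seed(0)
N_SAMPLES, D_MODEL = 200, 8  # hypothetical sizes, chosen to keep the sketch fast

# Assumption: one scalar feature in [-1, 1] is encoded along a single fixed
# direction of the activation space, with a little Gaussian noise on top.
feature_direction = [random.gauss(0, 1) for _ in range(D_MODEL)]
targets = [random.uniform(-1, 1) for _ in range(N_SAMPLES)]
activations = [
    [t * e + 0.05 * random.gauss(0, 1) for e in feature_direction]
    for t in targets
]


def probe(weights, bias, activation):
    """A linear probe: dot product with the activation, plus a bias."""
    return sum(w * a for w, a in zip(weights, activation)) + bias


# Fit the probe by full-batch gradient descent on mean squared error.
weights, bias = [0.0] * D_MODEL, 0.0
learning_rate = 0.05
for _ in range(500):
    grad_w, grad_b = [0.0] * D_MODEL, 0.0
    for activation, target in zip(activations, targets):
        error = probe(weights, bias, activation) - target
        for j in range(D_MODEL):
            grad_w[j] += 2 * error * activation[j] / N_SAMPLES
        grad_b += 2 * error / N_SAMPLES
    weights = [w - learning_rate * g for w, g in zip(weights, grad_w)]
    bias -= learning_rate * grad_b

# In-sample R^2: fraction of the target's variance the probe recovers.
mean_target = sum(targets) / N_SAMPLES
ss_res = sum((probe(weights, bias, a) - t) ** 2 for a, t in zip(activations, targets))
ss_tot = sum((t - mean_target) ** 2 for t in targets)
r_squared = 1 - ss_res / ss_tot
```

Because the probe has no nonlinearity, it cannot compute the feature itself; a high `r_squared` only happens when the information is already laid out (close to) linearly in the activations. A real experiment would also score the probe on held-out examples rather than in-sample as done here.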
Interesting approach, thanks!
Why does the prediction confidence start at 0.5?
Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, so everything below 50% is redundant.
And how is the "actual accuracy" calculated?
It assumes that you predict every event with the same confidence (namely prediction_confidence) and then that you're correct on actual_accuracy of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you'll get a Brier score of 0.25 (ie 3/4 of the way up the right-hand side of the graph).
Of course typically people predict different events with different confidences -- but since overall Brier score is the simple average of the Brier scores on individual events, that part's reasonably intuitive.
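As a rough sketch of that calculation (the function names are mine, not taken from the chart's code): under the single-confidence assumption, correct predictions each contribute (1 - confidence)^2 and incorrect ones contribute confidence^2, weighted by how often each happens. The second function is the ordinary per-event average, included to show the two agree on the 100-question example.

```python
def brier_score(prediction_confidence: float, actual_accuracy: float) -> float:
    """Expected Brier score when every event is predicted with the same
    confidence and a fraction `actual_accuracy` of the predictions are right."""
    # Correct predictions: outcome is 1, so squared error is (1 - confidence)^2.
    correct_term = actual_accuracy * (1.0 - prediction_confidence) ** 2
    # Incorrect predictions: outcome is 0, so squared error is confidence^2.
    incorrect_term = (1.0 - actual_accuracy) * prediction_confidence ** 2
    return correct_term + incorrect_term


def mean_brier(forecasts, outcomes):
    """Simple average of per-event Brier scores (each outcome is 0 or 1)."""
    return sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)


# The example from the text: 100 predictions at 100% confidence, 75 correct.
example = brier_score(1.0, 0.75)
same_example = mean_brier([1.0] * 100, [1] * 75 + [0] * 25)
print(example, same_example)  # both 0.25
```

Note that at 50% confidence the score is 0.25 regardless of accuracy, which is why the left edge of such a chart is flat.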
But I also find my own understanding to be a bit confused and in need of better sources.
Mine too, for sure.
And agreed, Chollet's points are really interesting. As much as I'm sometimes frustrated with him, I think that ARC-AGI and his willingness to (get someone to) stake substantial money on it has done a lot to clarify the discourse around LLM generality, and also makes it harder for people to move the goalposts and then claim they were never moved.
With respect to Chollet's definition (the youtube link):
- I agree with many of Chollet's points, and the third and fourth items in my list are intended to get at those.
- I do find Chollet a bit frustrating in some ways, because he seems somewhat inconsistent about what he's saying. Sometimes he seems to be saying that LLMs are fundamentally incapable of handling real novelty, and we need something very new and different. Other times he seems to be saying it's a matter of degree: that LLMs are doing the right things but are just sample-inefficient and don't have a good way to incorporate new information. I imagine that he has a single coherent view internally and just isn't expressing it as clearly as I'd like, although of course I can't know.
- I think part of the challenge around all of this is that (AFAIK but I would love to be corrected) we don't have a good way to identify what's in and out of distribution for models trained on such diverse data, and don't have a clear understanding of what constitutes novelty in a problem.
Interesting question! Maybe it would look something like, 'In my experience, the first answer to multiple-choice questions tends to be the correct one, so I'll pick that'?
It does seem plausible on the face of it that the model couldn't provide a faithful CoT on its fine-tuned behavior. But that's my whole point: we can't always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes.
But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg 'Looking Inward'), and I'm not confident that models couldn't introspect on fine-tuned behavior.
I've now made two posts about LLMs and 'general reasoning', but used a fairly handwavy definition of that term. I don't yet have a definition I feel fully good about, but my current take is something like:
- The ability to do deduction, induction, and abduction
- in a careful, step by step way, without many errors that a better reasoner could avoid,
- including in new domains; and
- the ability to use all of that to build a self-consistent internal model of the domain under consideration.
What am I missing? Where does this definition fall short?
Interesting, thanks, I'll have to think about that argument. A couple of initial thoughts:
When we ask whether some CoT is faithful, we mean something like: "Does this CoT allow us to predict the LLM's response more than if there weren't a CoT?"
I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: 'a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.' I think 'Language Models Don’t Always Say What They Think' shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model's choice than without the CoT, but it's nonetheless clearly not faithful.
If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards
I haven't spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don't consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models' ontologies).
It does seem totally plausible to me that o1's CoT is pretty faithful! I'm just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is 'Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback', where they find that models which behave in manipulative or deceptive ways act 'as if they are always responding in the best interest of the users, even in hidden scratchpads'.
It's not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence?), so I made this chart to help visualize it:
Here's log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):
Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
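Separately from the code mentioned above, here is a minimal sketch of the two quantities being plotted, assuming the same setup (every event predicted with one confidence, a fixed fraction correct). The natural-log base and the function names are my assumptions; the 4.0 cap mirrors the color-scale cap. It also answers the break-even question directly: since the Brier score under this setup works out to accuracy - 2 * accuracy * confidence + confidence^2, the higher of two confidence levels scores better exactly when accuracy exceeds their midpoint.

```python
import math

LOG_LOSS_CAP = 4.0  # mirrors the cap on the color scale


def brier(confidence: float, accuracy: float) -> float:
    """Expected Brier score at a single confidence level."""
    return accuracy * (1 - confidence) ** 2 + (1 - accuracy) * confidence ** 2


def log_loss(confidence: float, accuracy: float, cap: float = LOG_LOSS_CAP) -> float:
    """Expected log-loss in nats (natural log is an assumption), clipped at `cap`."""
    loss = 0.0
    if accuracy > 0:  # some predictions were right
        if confidence <= 0:
            return cap  # zero confidence in something that happened
        loss -= accuracy * math.log(confidence)
    if accuracy < 1:  # some predictions were wrong
        if confidence >= 1:
            return cap  # total certainty plus any miss is an infinite loss
        loss -= (1 - accuracy) * math.log(1 - confidence)
    return min(loss, cap)


def break_even_accuracy(high: float, low: float = 0.5) -> float:
    """Accuracy above which predicting at `high` beats predicting at `low`
    on Brier score: the midpoint of the two confidence levels."""
    return (high + low) / 2


# A coarse grid of (Brier, log-loss) values over confidence and accuracy.
confidences = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
accuracies = [0.5, 0.75, 1.0]
grid = {(c, a): (brier(c, a), log_loss(c, a)) for c in confidences for a in accuracies}
```

So, for example, answering at 100% confidence only beats answering at 50% on Brier score if you are right more than 75% of the time, whereas on log-loss a single miss at 100% confidence is already unboundedly bad.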
Update: a recent paper, 'Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback', described by the authors on LW in a post of the same name, finds that post-RLHF, LLMs may identify users who are more susceptible to manipulation and behave differently with those users. This seems like a clear example of LLMs modeling users and also making use of that information.
I agree with 1, which is why the COT will absolutely have to be faithful.
That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won't really need the CoT in the first place).
I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning being through the COT, there is little the model can do about the situation, short of exfiltrating itself...
I also disagree with 3, at least assuming relatively weak forward passes, and the bottleneck to reasoning being largely through COT
I don't have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren't legible to monitors, and those can accumulate over the course of an extended context.
Re. 1, I think outcomes based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?
Can you say more? I don't think I see why that would be.
Re 2-3, Agree unless we use models that are incapable of deceptive reasoning without CoT (due to number of parameters or training data).
Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).
Do you know of any material that goes into more detail on the RL pre-training of o1?
As far as I know OpenAI has been pretty cagey about how o1 was trained, but there seems to be a general belief that they took the approach they had described in 2023 in 'Improving mathematical reasoning with process supervision' (although I wouldn't think of that as pre-training).
It might well turn out to be a far harder problem than language-bound reasoning. You seem to have a different view and I'd be very interested in what underpins your conclusions.
I can at least gesture at some of what's shaping my model here:
- Roughly paraphrasing Ilya Sutskever (and Yudkowsky): in order to fully predict text, you have to understand the causal processes that created it; this includes human minds and the physical world that they live in.
- The same strategy of self-supervised token-prediction seems to work quite well to extend language models to multimodal abilities up to and including generating video that shows an understanding of physics. I'm told that it's doing pretty well for robots too, although I haven't followed that literature.
- We know that models which only see text nonetheless build internal world models like globes and game boards.
- Proponents of the view that LLMs are just applying shallow statistical patterns to the regularities of language have made predictions based on that view that have failed repeatedly, such as the claim that no pure LLM would ever be able to correctly complete 'Three plus five equals'. Over and over we've seen predictions about what LLMs would never be able to do turn out to be false, usually not long thereafter (including the ones I mention in my post here). At a certain point that view just stops seeming very plausible.
I think your intuition here is one that's widely shared (and certainly seemed plausible to me for a while). But when we cash that out into concrete claims, those don't seem to hold up very well. If you have some ideas about specific limitations that LLMs couldn't overcome based on that intuition (ideally ones that we can get an answer to in the relatively near future), I'd be interested to hear them.
As a small piece of feedback, I found it a bit frustrating that so much of your comment was links posted without clarification of what each one was, some of which were just quote-tweets of the others.
Logan Zollener made a claim referencing a Twitter post which claims that LLMs are mostly simple statistics/prediction machines:
Or rather that a particular experiment training simple video models on synthetic data showed that they generalized in ways different from what the researchers viewed as correct (eg given data which didn't specify, they generalized to thinking that an object was more likely to change shape than velocity).
I certainly agree that there are significant limitations to models' ability to generalize out of distribution. But I think we need to be cautious about what we take that to mean about the practical limitations of frontier LLMs.
- For simple models trained on simple, regular data, it's easy to specify what is and isn't in-distribution. For very complex models trained on a substantial fraction of human knowledge, it seems much less clear to me (I have yet to see a good approach to this problem; if anyone's aware of good research in this area I would love to know about it).
- There are many cases where humans are similarly bad at OOD generalization. The example above seems roughly isomorphic to the following problem: I give you half of the rules of an arbitrary game, and then ask you about the rest of the rules (eg: 'What happens if piece x and piece y come into contact?'). You can make some guesses based on the half of the rules you've seen, but there are likely to be plenty of cases where the correct generalization isn't clear. The case in the experiment seems less arbitrary to us, because we have extensive intuition built up about how the physical world operates (eg that objects fairly often change velocity but don't often change shape), but the model hasn't been shown info that would provide that intuition[1]; why, then, should we expect it to generalize in a way that matches the physical world?
- We have a history of LLMs generalizing correctly in surprising ways that weren't predicted in advance (looking back at the GPT-3 paper is a useful reminder of how unexpected some of the emerging capabilities were at the time).
[1] Note that I haven't read the paper; I'm inferring this from the video summary they posted.
o1-preview does not seem that much better than other models on extraordinarily difficult (and therefore OOD?) problems.
@Noosphere89 points in an earlier comment to a quote from the FrontierMath technical report:
...we identified all problems that any model solved at least once—a total of four problems—and conducted repeated trials with five runs per model per problem (see Appendix B.2). We observed high variability across runs: only in one case did a model solve a question on all five runs (o1-preview for question 2). When re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials (see Section B.2).
I would love to see comparisons of o1 performance against other models in games like chess and Go.
I would also find that interesting, although I don't think it would tell us as much about o1's general reasoning ability, since those are very much in-distribution.
Also, somewhat unrelated to this post, what do you and others think about x-risk in a world where explicit reasoners like o1 scale to AGI. To me this seems like one of the safest forms of AGI, since much of the computation is happening explicitly, and can be checked/audited by other AI systems and humans.
I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn't entirely straightforward, since
- CoT can be unfaithful to the actual reasoning process.
- I suspect that advanced models will soon (or already?) understand that we're monitoring their CoT, which seems likely to affect CoT contents.
- I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
- At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.
But it does seem very plausibly helpful!
as it's pretty low on the compute scale compared to other models.
Can you clarify that? Do you mean relative to models like AlphaGo, which have the ability to do explicit & extensive tree search?
Yeah, very fair point, those are at least in part defining a scale rather than a threshold (especially the error-free and consistent-model criteria).
Thanks!
It seems unsurprising to me that there are benchmarks o1-preview is bad at. I don't mean to suggest that it can do general reasoning in a highly consistent and correct way on arbitrarily hard problems[1]; I expect that it still has the same sorts of reliability issues as other LLMs (though probably less often), and some of the same difficulty building and using internal models without inconsistency, and that there are also individual reasoning steps that are beyond its ability. My only claim here is that o1-preview knocks down the best evidence that I knew of that LLMs can't do general reasoning at all on novel problems.
I think that to many people that claim may just look obvious; of course LLMs are doing some degree of general reasoning. But the evidence against was strong enough that there was a reasonable possibility that what looked like general reasoning was still relatively shallow inference over a vast knowledge base. Not the full stochastic parrot view, but the claim that LLMs are much less general than they naively seem.
It's fascinatingly difficult to come up with unambiguous evidence that LLMs are doing true general reasoning! I hope that my upcoming project on whether LLMs can do scientific research on toy novel domains can help provide that evidence. It'll be interesting to see how many skeptics are convinced by that project or by the evidence shown in this post, and how many maintain their skepticism.
[1] And I don't expect that you hold that view either; your comment just inspired some clarification on my part.
Interesting, I didn't know that. But it seems like that assumes that o1's special-sauce training can be viewed as a kind of RLHF, right? Do we know enough about that training to know that it's RLHF-ish? Or at least some clearly offline approach.
For sure! At the same time, a) we've continued to see new ways of eliciting greater capability from the models we have, and b) o1 could (AFAIK) involve enough additional training compute to no longer be best thought of as 'the same model' (one possibility, although I haven't spent much time looking into what we know about o1: they may have started with a snapshot of the 4o base model, put it through additional pre-training, then done an arbitrary amount of RL on CoT). So I'm hesitant to think that 'based on 4o' sets a very strong limit on o1's capabilities.
It continues to be difficult to fully define 'general reasoning', and my mental model of it continues to evolve, but I think of 'system 2 reasoning' as at least a partial synonym.
In the medium-to-long term I'm inclined to taboo the word and talk about what I understand as its component parts, which I currently (off the top of my head) think of as something like:
- The ability to do deduction, induction, and abduction.
- The ability to do those in a careful, step by step way, with almost no errors (other than the errors that are inherent to induction and abduction on limited data).
- The ability to do all of that in a domain-independent way.
- The ability to use all of that to build a self-consistent internal model of the domain under consideration.
Don't hold me to that, though, it's still very much evolving. I may do a short-form post with just the above to invite critique.
That all seems pretty right to me. It continues to be difficult to fully define 'general reasoning', and my mental model of it continues to evolve, but I think of 'system 2 reasoning' as at least a partial synonym.
Humans clearly can do general reasoning. But it's not easy for us.
Agreed; not only are we very limited at it, but we often aren't doing it at all.
So here's my guess on whether LLMs reach general reasoning by pure scaling: it doesn't matter.
I agree that it may be possible to achieve it with scaffolding even if LLMs don't get there on their own; I'm just less certain of it.
I've now expanded this comment to a post -- mostly the same content but with more detail.
https://www.lesswrong.com/posts/wN4oWB4xhiiHJF9bS/llms-look-increasingly-like-general-reasoners
Thanks for sharing, I hadn't seen those yet! I've had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.
How much does o1-preview update your view? It's much better at Blocksworld for example.
Quite substantially. Substantially enough that I'll add mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. Even more substantially since one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati's) who produced the original results (I am restraining myself with some difficulty from jumping up and down and pointing at the level of goalpost-moving in the new paper).
There's one sense in which it's not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it's more like Ryan's hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.
Here's a first pass on how much this changes my numeric probabilities -- I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):
- LLMs continue to do better at block world and ARC as they scale: 75% -> 100%, this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
- LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10% -> 20%, this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like 'the full o1 turns out to already be close to the grand prize mark' and the rest on 'OpenAI capabilities researchers manage to use the full o1 to find an improvement to current LLM technique (eg a better prompting approach) that can be easily applied'.
- Scaffolding & tools help a lot, so that the next gen (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark: 60% -> 75% -- I'm tempted to put it higher, but it wouldn't be that surprising if o1-mark-2 didn't quite get there even with scaffolding/tools, especially since we don't have clear insight into how much harder the full test set is.
- Same but for the gen after that (GPT-6, Claude 5): 75% -> 90%? I feel less sure about this one than the others; it sure seems awfully likely that o2 plus scaffolding will be able to do it! But I'm reluctant to go past 90% because progress could level off because of training data requirements, maybe the o1 -> o2 jump doesn't focus on optimizing for general reasoning, etc. It seems very plausible that I'll bump this higher on reflection.
- The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty -> 80%. That sure does seem like the world we're living in. It's not clear to me that o1 couldn't already do original AI research with the right scaffolding. Sakana claims to have gotten there with GPT-4o / Sonnet, but their claims seem overblown to me.
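For anyone who wants to compare the sizes of these updates, here is a rough sketch that re-expresses each one as a shift in log-odds (bits of evidence). The probabilities are copied from the list above; the short labels and the helper names `log_odds` and `shift_bits` are my own illustrative choices, and log-odds is just one convenient way to measure an update, not something the original estimates were derived from.

```python
import math

def log_odds(p):
    """Log-odds of probability p, in bits; infinite at p = 0 or p = 1."""
    if p <= 0.0:
        return -math.inf
    if p >= 1.0:
        return math.inf
    return math.log2(p / (1.0 - p))

def shift_bits(before, after):
    """How many bits of evidence separate the two credences."""
    return log_odds(after) - log_odds(before)

# (short label, credence before, credence after), taken from the list above.
# The labels are paraphrases for display only.
updates = [
    ("LLMs keep improving on Blocksworld and ARC", 0.75, 1.00),
    ("LLMs alone hit ARC grand prize before hybrids", 0.10, 0.20),
    ("Next gen + scaffolding hits grand prize", 0.60, 0.75),
    ("Gen after that + scaffolding", 0.75, 0.90),
    ("Current architecture does original AI research", 0.65, 0.80),
]

for label, before, after in updates:
    bits = shift_bits(before, after)
    print(f"{label}: {before:.0%} -> {after:.0%} ({bits:+.2f} bits)")
```

The first item comes out as an infinite shift, which is just the log-odds way of saying the event has already been observed; the others are all on the order of one to one and a half bits.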
Now that I've seen these, I'm going to have to think hard about whether my upcoming research projects in this area (including one I'm scheduled to lead a team on in the spring, uh oh) are still the right thing to pursue. I may write at least a brief follow-up post to this one arguing that we should all update on this question.
Thanks again, I really appreciate you drawing my attention to these.
What was your first favorite?
Even though I can't critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.
Agreed, that's definitely a general failure mode.
Possibly too cynical, but I find myself wondering whether the conditionality was in fact engineered by OAI in order to achieve this purpose. My impression is that there are a lot of VCs shouting 'Take my money!' in the general direction of OpenAI and/or Altman who wouldn't have demanded the restructuring.
I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently.
Interesting, if you happen to have a link I'd be interested to learn more.
Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons?
I like the idea, but it seems hard to judge 'more novel and [especially] more requiring of intelligence' other than to sort completions in order of human error on each.
I wasn't really talking about the training, I was talking about how well it does on things for which it isn't trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?
I think there's a lot of work to be done on this still, but there's some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).
I'm glad you think this has been a valuable exchange.
I continue to think so :). Thanks again!
Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you've said. Thank you!
I agree that that's presumably the underlying reality. I should have made that clearer.
But it seems like the board would still need to create some justification for public consumption, and for avoiding accusations of violating their charter & fiduciary duty. And it's really unclear to me what that justification is.
I realize this ship has sailed, but: I continue to be confused about how the non-profit board can justify giving up control. Given that their mandate[1] says that their 'primary fiduciary duty is to humanity' and to ensure that AGI 'is used for the benefit of all', and given that more control over a for-profit company is strictly better than less control, on what grounds can they justify relinquishing that control?
[EDIT: I realize that in reality they're just handing over control because Altman wants it, and they've been selected in part for willingness to give Altman what he wants. But it still seems like they need a justification, for public consumption and to avoid accusations of violating their charter and fiduciary duty.]
Am I just being insufficiently cynical here? Does everyone tacitly recognize that the non-profit board isn't really bound by that charter and can do whatever they like? Or is there some justification that they can put forward with a straight face?
Is it maybe something like, 'We trust that OpenAI the for-profit is primarily dedicated to the general well-being of humanity, and so nothing is lost by giving up control'? (But then it seems like more control is still strictly better in case it turns out that in some shocking and unforeseeable way the for-profit later has other priorities). Or is it, 'OpenAI the for-profit is doing good in the world, and they can do much more good if they can raise more money, and there's certainly no way they could raise more money without us giving up control'?
[1] I'm assuming here that their charter is the same as OpenAI's charter, since I haven't been able to find a distinct non-profit charter.
My main insight from all this is that we should be thinking in terms of taxonomisation of features. Some are very token-specific, others are more nuanced and context-specific (in a variety of ways). The challenge of finding maximally activating text samples might be very different from one category of features to another.
Joseph and Johnny did some interesting work on this in 'Understanding SAE Features with the Logit Lens', taxonomizing features as partition features vs suppression features vs prediction features, and using summary statistics to distinguish them.
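To make 'summary statistics' a bit more concrete, here is a minimal sketch of the kind of computation involved. It assumes the statistics are the skewness and kurtosis of a feature's logit weights over the vocabulary; that choice, the function name `skew_and_kurtosis`, and the toy weight vectors are my assumptions for illustration, so check the linked post for what the authors actually compute and how they map statistics onto categories.

```python
import math

def skew_and_kurtosis(weights):
    """Return (skewness, kurtosis) of a list of numbers.

    Uses population moments; kurtosis is the plain (non-excess) version,
    so a normal distribution would score about 3.
    """
    n = len(weights)
    mean = sum(weights) / n
    var = sum((w - mean) ** 2 for w in weights) / n
    if var == 0.0:
        raise ValueError("all weights are identical; moments are undefined")
    std = math.sqrt(var)
    skew = sum(((w - mean) / std) ** 3 for w in weights) / n
    kurt = sum(((w - mean) / std) ** 4 for w in weights) / n
    return skew, kurt

# Toy stand-ins for a feature's logit weights over a 100-token vocabulary.
# Hypothetical data: one feature strongly promotes a single token,
# the other strongly suppresses a single token.
promotes_one_token = [0.0] * 99 + [10.0]
suppresses_one_token = [0.0] * 99 + [-10.0]

for name, w in [("promotes", promotes_one_token),
                ("suppresses", suppresses_one_token)]:
    skew, kurt = skew_and_kurtosis(w)
    print(f"{name}: skew={skew:.2f}, kurtosis={kurt:.2f}")
```

The sign of the skew separates the two toy features, and the large kurtosis reflects that each one acts on very few tokens, which is the sort of signal that could distinguish categories without inspecting activating examples.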