Posts

LLMs Look Increasingly Like General Reasoners 2024-11-08T23:47:28.886Z
AIS terminology proposal: standardize terms for probability ranges 2024-08-30T15:43:39.857Z
LLM Generality is a Timeline Crux 2024-06-24T12:52:07.704Z
Language Models Model Us 2024-05-17T21:00:34.821Z
Useful starting code for interpretability 2024-02-13T23:13:47.940Z
eggsyntax's Shortform 2024-01-13T22:34:07.553Z

Comments

Comment by eggsyntax on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T21:20:16.669Z · LW · GW

I see -- I mistakenly read that as part of a broader policy of not showing the model its past actions. Thanks!

Comment by eggsyntax on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T19:51:26.332Z · LW · GW

One story here could be that the model is being used only in an API context where it's being asked to take actions on something well-modeled as a Markov process, where the past doesn't matter (and we assume that the current state doesn't incorporate relevant information about the past). There are certainly use cases that fit that ('trigger invisible fence iff this image contains a dog'; 'search for and delete this file'). It does seem to me, though, that for many (most?) AI use cases, past information is useful, and so the assumptions above fail unless labs are willing to pay the performance cost in the interest of greater safety.

Another set of cases where the assumptions fail is models that are trained to expect access to intrinsic or extrinsic memory; that doesn't apply to current LLMs but seems like a very plausible path for economically useful models like agents.

Comment by eggsyntax on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T18:35:11.791Z · LW · GW

It seems like that assumption runs throughout the post though, eg 'But it won’t know what it actually tried, unless it is able to guess very accurately what it would have said', 'the AI might not even be able to get direct evidence of whether its attempts worked, or what its attempts even were.'

I don't just mean showing the model its past CoT. Current models are often given lots of information about their past actions and interactions so that they can produce better output; why would you expect that to change? Is the assumption that labs will be willing to accept worse performance because they recognize the model can't be trusted?

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-19T14:27:23.748Z · LW · GW

In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me.

Epoch's analysis from June supports this view, and suggests it may even be a bit too conservative:

(and that's just for text -- there are also other significant sources of data for multimodal models, eg video)

Comment by eggsyntax on Why imperfect adversarial robustness doesn't doom AI control · 2024-11-19T13:23:49.702Z · LW · GW

the untrusted model is stateless between queries and only sees the command history and system state.

What justifies the assumption that untrusted models will be stateless in deployment? There are clear performance advantages in many cases to giving models memory, and in practice they already typically receive context containing the history of their actions or interactions.

If you're assuming that commercial labs will stop doing that for the sake of safety despite its advantages, it seems worth making that assumption explicit.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-18T17:47:08.946Z · LW · GW

Great point! In the block world paper, they re-randomize the obfuscated version, change the prompt, etc ('randomized mystery blocksworld'). They do see a 30% accuracy dip when doing that, but o1-preview's performance is still 50x that of the best previous model (and > 200x that of GPT-4 and Sonnet-3.5). With ARC-AGI there's no way to tell, though, since they don't test o1-preview on the fully-private held-out set of problems.

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-18T15:41:46.364Z · LW · GW

The Ord piece is really intriguing, although I'm not sure I'm entirely convinced that it's a useful framing.

  • Some of his examples (eg cosine-ish wave to ripple) rely on the fundamental symmetry between spatial dimensions, which wouldn't apply to many kinds of hyperpolation.
  • The video frame construction seems more like extrapolation using an existing knowledge base about how frames evolve over time (eg how ducks move in the water).
  • Given an infinite number of possible additional dimensions, it's not at all clear how a NN could choose a particular one to try to hyperpolate into.

It's a fascinating idea, though, and one that'll definitely stick with me as a possible framing. Thanks!

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-18T15:36:20.922Z · LW · GW

After some discussion elsewhere with @zeshen, I'm feeling a bit less comfortable with my last clause, building an internal model. I think of general reasoning as essentially a procedural ability, and model-building as a way of representing knowledge. In practice they seem likely to go hand-in-hand, but it seems in-principle possible that one could reason well, at least in some ways, without building and maintaining a domain model. For example, one could in theory perform a series of deductions using purely local reasoning at each step (although plausibly one might need a domain model in order to choose what steps to take?).

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-18T02:28:26.825Z · LW · GW

[EDIT: I originally gave an excessively long and detailed response to your predictions. That version is preserved (& commentable) here in case it's of interest]

I applaud your willingness to give predictions! Some of them seem useful but others don't differ from what the opposing view would predict. Specifically:

  1. I think most people would agree that there are blind spots; LLMs have and will continue to have a different balance of strengths and weaknesses from humans. You seem to say that those blind spots will block capability gains in general; that seems unlikely to me (and it would shift me toward your view if it clearly happened) although I agree they could get in the way of certain specific capability gains.
  2. The need for escalating compute seems like it'll happen either way, so I don't think this prediction provides evidence on your view vs the other.
  3. Transformers not being the main cognitive component of scaffolded systems seems like a good prediction. I expect that to happen for some systems regardless, but I expect LLMs to be the cognitive core for most, until a substantially better architecture is found, and it will shift me a bit toward your view if that isn't the case. I do think we'll eventually see such an architectural breakthrough regardless of whether your view is correct, so I think that seeing a breakthrough won't provide useful evidence.
  4. 'LLM-centric systems can't do novel ML research' seems like a valuable prediction; if it proves true, that would shift me toward your view.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-18T01:56:05.169Z · LW · GW

First of all, serious points for making predictions! And thanks for the thoughtful response.

Before I address specific points: I've been working on a research project that's intended to help resolve the debate about LLMs and general reasoning. If you have a chance to take a look, I'd be very interested to hear whether you would find the results of the proposed experiment compelling; if not, then why not, and are there changes that could be made that would make it provide evidence you'd find more compelling?

Humans are eager to find meaning and tend to project their own thoughts onto external sources. We even go so far as to attribute consciousness and intelligence to inanimate objects, as seen in animistic traditions. In the case of LLMs this behaviour could lead to an overly optimistic extrapolation of capabilities from toy problems.

Absolutely! And then on top of that, it's very easy to mistake using knowledge from the truly vast training data for actual reasoning.

But in 2024 the overhang has been all but consumed. Humans continue to produce more data, at an unprecedented rate, but still nowhere near enough to keep up with the demand.

This does seem like one possible outcome. That said, it seems more likely to me that continued algorithmic improvements will result in better sample efficiency (certainly humans need far less language data to learn language), and that multimodal data / synthetic data / self-play / simulated environments will continue to improve. I suspect capabilities researchers would have made more progress on all those fronts already, had it not been so easy up to now to just throw more data at the models.

In the past couple of weeks lots of people have been saying the scaling labs have hit the data wall, because of rumors of slowdowns in capabilities improvements. But before that, I was hearing at least some people in those labs saying that they expected to wring another 0.5 - 1 order of magnitude of human-generated training data out of what they had access to, and that still seems very plausible to me (although that would basically be the generation of GPT-5 and peer models; it seems likely to me that the generation past that will require progress on one or more of the fronts I named above).

Taking the globe representation as an example, it is unclear to me how much of the resulting globe (or atlas) is actually the result of choices the authors made. The decision to map distance vectors in two or three dimensions seems to change the resulting representation. So, to what extent are these representations embedded in the model itself versus originating from the author’s mind?

I think that's a reasonable concern in the general case. But in cases like the ones mentioned, the authors are retrieving information (eg lat/long) using only linear probes. I don't know how familiar you are with the math there, but if something can be retrieved with a linear probe, it means that the model is already going to some lengths to represent that information and make it easily accessible.
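To make that concrete, here's a rough sketch of what 'retrievable with a linear probe' means in practice. This is illustrative only, not the actual setup from those papers; the file names and the choice of probe are assumptions on my part.

```python
# Minimal linear-probe sketch: fit a single linear map from cached model
# activations to geographic coordinates and check held-out accuracy.
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

# Assumed to have been saved earlier; names are illustrative.
activations = np.load("place_activations.npy")  # shape: (n_places, d_model)
latlong = np.load("place_latlong.npy")          # shape: (n_places, 2)

X_train, X_test, y_train, y_test = train_test_split(
    activations, latlong, test_size=0.2, random_state=0
)

probe = Ridge(alpha=1.0)  # one linear map, no hidden layers
probe.fit(X_train, y_train)

# A high held-out R^2 means lat/long is encoded (near-)linearly in the
# activations, i.e. the model represents it in an easily accessible form.
print("held-out R^2:", probe.score(X_test, y_test))
```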

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-14T14:54:24.482Z · LW · GW

Interesting approach, thanks!

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-13T02:43:37.831Z · LW · GW

Why does the prediction confidence start at 0.5? 

Just because predicting eg a 10% chance of X can instead be rephrased as predicting a 90% chance of not-X, so everything below 50% is redundant.

And how is the "actual accuracy" calculated?

It assumes that you predict every event with the same confidence (namely prediction_confidence) and that you're correct on actual_accuracy of those. So for example if you predict 100 questions will resolve true, each with 100% confidence, and then 75 of them actually resolve true, you'll get a Brier score of 0.25 (ie 3/4 of the way up the right-hand side of the graph).

Of course typically people predict different events with different confidences -- but since overall Brier score is the simple average of the Brier scores on individual events, that part's reasonably intuitive.
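A small illustrative sketch of that calculation, in Python:

```python
# The example above: 100 predictions, every one made as "true" with 100%
# confidence (prediction_confidence = 1.0), of which 75 actually resolve
# true (actual_accuracy = 0.75).
forecasts = [1.0] * 100
outcomes = [1] * 75 + [0] * 25

brier = sum((f - o) ** 2 for f, o in zip(forecasts, outcomes)) / len(forecasts)
print(brier)  # 0.25

# In general, with uniform confidence c and a fraction a of predictions
# resolving as predicted: brier = a * (1 - c)**2 + (1 - a) * c**2
```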

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-12T22:15:36.945Z · LW · GW

But I also find my own understanding to be a bit confused and in need of better sources.

Mine too, for sure.

And agreed, Chollet's points are really interesting. As much as I'm sometimes frustrated with him, I think that ARC-AGI and his willingness to (get someone to) stake substantial money on it have done a lot to clarify the discourse around LLM generality, and also make it harder for people to move the goalposts and then claim they were never moved.

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-12T21:08:09.041Z · LW · GW

With respect to Chollet's definition (the youtube link):

  • I agree with many of Chollet's points, and the third and fourth items in my list are intended to get at those.
  • I do find Chollet a bit frustrating in some ways, because he seems somewhat inconsistent about what he's saying. Sometimes he seems to be saying that LLMs are fundamentally incapable of handling real novelty, and we need something very new and different. Other times he seems to be saying it's a matter of degree: that LLMs are doing the right things but are just sample-inefficient and don't have a good way to incorporate new information. I imagine that he has a single coherent view internally and just isn't expressing it as clearly as I'd like, although of course I can't know.
  • I think part of the challenge around all of this is that (AFAIK but I would love to be corrected) we don't have a good way to identify what's in and out of distribution for models trained on such diverse data, and don't have a clear understanding of what constitutes novelty in a problem.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-12T19:06:55.792Z · LW · GW

Interesting question! Maybe it would look something like, 'In my experience, the first answer to multiple-choice questions tends to be the correct one, so I'll pick that'? 

It does seem plausible on the face of it that the model couldn't provide a faithful CoT on its fine-tuned behavior. But that's my whole point: we can't always count on CoT being faithful, and so we should be cautious about relying on it for safety purposes. 

But also @James Chua and others have been doing some really interesting research recently showing that LLMs are better at introspection than I would have expected (eg 'Looking Inward'), and I'm not confident that models couldn't introspect on fine-tuned behavior.

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-12T17:39:18.262Z · LW · GW

I've now made two posts about LLMs and 'general reasoning', but used a fairly handwavy definition of that term. I don't yet have a definition I feel fully good about, but my current take is something like:

  • The ability to do deduction, induction, and abduction
  • in a careful, step by step way, without many errors that a better reasoner could avoid,
  • including in new domains; and
  • the ability to use all of that to build a self-consistent internal model of the domain under consideration.

What am I missing? Where does this definition fall short?

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-12T17:19:52.131Z · LW · GW

Interesting, thanks, I'll have to think about that argument. A couple of initial thoughts:
 

When we ask whether some CoT is faithful, we mean something like: "Does this CoT allow us to predict the LLM's response more than if there weren't a CoT?"

I think I disagree with that characterization. Most faithfulness researchers seem to quote Jacovi & Goldberg: 'a faithful interpretation is one that accurately represents the reasoning process behind the model’s prediction.' I think 'Language Models Don’t Always Say What They Think' shows pretty clearly that that differs from your definition. In their experiment, even though actually the model has been finetuned to always pick option (A), it presents rationalizations of why it picks that answer for each individual question. I think if we looked at those rationalizations (not knowing about the finetuning), we would be better able to predict the model's choice than without the CoT, but it's nonetheless clearly not faithful.

If the NAH is true, those abstractions will be the same abstractions that other sufficiently intelligent systems (humans?) have converged towards

I haven't spent a lot of time thinking about NAH, but looking at what features emerge with sparse autoencoders makes it seem like in practice LLMs don't consistently factor the world into the same categories that humans do (although we still certainly have a lot to learn about the validity of SAEs as a representation of models' ontologies).

 

It does seem totally plausible to me that o1's CoT is pretty faithful! I'm just not confident that we can continue to count on that as models become more agentic. One interesting new datapoint on that is 'Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback', where they find that models which behave in manipulative or deceptive ways act 'as if they are always responding in the best interest of the users, even in hidden scratchpads'.

Comment by eggsyntax on eggsyntax's Shortform · 2024-11-12T15:30:47.136Z · LW · GW

It's not that intuitively obvious how Brier scores vary with confidence and accuracy (for example: how accurate do you need to be for high-confidence answers to be a better choice than low-confidence ones?), so I made this chart to help visualize it:

Here's log-loss for comparison (note that log-loss can be infinite, so the color scale is capped at 4.0):

 

Claude-generated code and interactive versions (with a useful mouseover showing the values at each point for confidence, accuracy, and the Brier (or log-loss) score):
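For reference, here's a minimal sketch of how charts like these can be generated. It's illustrative only (not the interactive version linked above), and it assumes uniform confidence across all predictions, as described in the comments above.

```python
# Brier score and log-loss as a function of uniform prediction confidence
# and actual accuracy.
import numpy as np
import matplotlib.pyplot as plt

confidence = np.linspace(0.5, 1.0, 101)  # confidence assigned to every prediction
accuracy = np.linspace(0.0, 1.0, 101)    # fraction of predictions resolving as predicted
C, A = np.meshgrid(confidence, accuracy)

brier = A * (1 - C) ** 2 + (1 - A) * C ** 2

eps = 1e-12  # avoid log(0); log-loss can be infinite, so cap it for display
log_loss = -(A * np.log(C + eps) + (1 - A) * np.log(1 - C + eps))

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, score, title in [(axes[0], brier, "Brier score"),
                         (axes[1], np.clip(log_loss, 0, 4.0), "Log-loss (capped at 4)")]:
    im = ax.pcolormesh(C, A, score, shading="auto")
    ax.set_xlabel("prediction confidence")
    ax.set_ylabel("actual accuracy")
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
plt.tight_layout()
plt.show()
```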

Comment by eggsyntax on Language Models Model Us · 2024-11-12T15:04:21.557Z · LW · GW

Update: a recent paper, 'Targeted Manipulation and Deception Emerge when Optimizing LLMs for User Feedback' (also described by the authors in a LessWrong post of the same name), finds that post-RLHF, LLMs may identify users who are more susceptible to manipulation and behave differently with those users. This seems like a clear example of LLMs modeling users and also making use of that information.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-11T20:26:48.236Z · LW · GW

I agree with 1, which is why the COT will absolutely have to be faithful.

That does sound ideal if we can figure out a way to achieve it (although it seems plausible that if we have good enough interpretability to judge whether CoT is faithful, we won't really need the CoT in the first place).

I agree with 2, but conditional on relatively weak forward passes, and most of the bottleneck to reasoning being through the COT, there is little the model can do about the situation, short of exfiltrating itself...

I also disagree with 3, at least assuming relatively weak forward passes, and the bottleneck to reasoning being largely through COT

I don't have a very well-developed argument here, but intuitively I think there are many simple ways for the model to shape its output in ways that provide its later forward passes with more than 0.0 bits of information which are useful to itself but which aren't legible to monitors, and those can accumulate over the course of an extended context.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-11T20:03:22.522Z · LW · GW

Re. 1, I think outcomes based RL (with some penalty for long responses) should somewhat mitigate this problem, at least if NAH is true?

Can you say more? I don't think I see why that would be.

Re 2-3, Agree unless we use models that are incapable of deceptive reasoning without CoT (due to number of parameters or training data).

Intuitively it seems like CoT would have to get a couple of OOMs more reliable to be able to get a competitively strong model under those conditions (as you point out).

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-11T19:39:59.201Z · LW · GW

Do you know of any material that goes into more detail on the RL pre-training of o1?

As far as I know OpenAI has been pretty cagey about how o1 was trained, but there seems to be a general belief that they took the approach they had described in 2023 in 'Improving mathematical reasoning with process supervision' (although I wouldn't think of that as pre-training).

It might well turn out to be a far harder problem than language-bound reasoning. You seem to have a different view and I'd be very interested in what underpins your conclusions.

I can at least gesture at some of what's shaping my model here:

  • Roughly paraphrasing Ilya Sutskever (and Yudkowsky): in order to fully predict text, you have to understand the causal processes that created it; this includes human minds and the physical world that they live in.
  • The same strategy of self-supervised token-prediction seems to work quite well to extend language models to multimodal abilities up to and including generating video that shows an understanding of physics. I'm told that it's doing pretty well for robots too, although I haven't followed that literature.
  • We know that models which only see text nonetheless build internal world models like globes and game boards.
  • Proponents of the view that LLMs are just applying shallow statistical patterns to the regularities of language have made predictions based on that view that have failed repeatedly, such as the claim that no pure LLM would ever be able to correctly complete 'Three plus five equals'. Over and over we've seen predictions about what LLMs would never be able to do turn out to be false, usually not long thereafter (including the ones I mention in my post here). At a certain point that view just stops seeming very plausible.

I think your intuition here is one that's widely shared (and certainly seemed plausible to me for a while). But when we cash that out into concrete claims, those don't seem to hold up very well. If you have some ideas about specific limitations that LLMs couldn't overcome based on that intuition (ideally ones that we can get an answer to in the relatively near future), I'd be interested to hear them.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-11T18:51:19.463Z · LW · GW

 As a small piece of feedback, I found it a bit frustrating that so much of your comment was links posted without clarification of what each one was, some of which were just quote-tweets of the others.

Logan Zollener made a claim referencing a Twitter post which claims that LLMs are mostly simple statistics/prediction machines:

Or rather that a particular experiment training simple video models on synthetic data showed that they generalized in ways different from what the researchers viewed as correct (eg given data which didn't specify, they generalized to thinking that an object was more likely to change shape than velocity).

I certainly agree that there are significant limitations to models' ability to generalize out of distribution. But I think we need to be cautious about what we take that to mean about the practical limitations of frontier LLMs.

  • For simple models trained on simple, regular data, it's easy to specify what is and isn't in-distribution. For very complex models trained on a substantial fraction of human knowledge, it seems much less clear to me (I have yet to see a good approach to this problem; if anyone's aware of good research in this area I would love to know about it).
  • There are many cases where humans are similarly bad at OOD generalization. The example above seems roughly isomorphic to the following problem: I give you half of the rules of an arbitrary game, and then ask you about the rest of the rules (eg 'What happens if piece x and piece y come into contact?'). You can make some guesses based on the half of the rules you've seen, but there are likely to be plenty of cases where the correct generalization isn't clear. The case in the experiment seems less arbitrary to us, because we have extensive intuition built up about how the physical world operates (eg that objects fairly often change velocity but don't often change shape), but the model hasn't been shown info that would provide that intuition[1]; why, then, should we expect it to generalize in a way that matches the physical world?
  • We have a history of LLMs generalizing correctly in surprising ways that weren't predicted in advance (looking back at the GPT-3 paper is a useful reminder of how unexpected some of the emerging capabilities were at the time).
  1. ^

    Note that I haven't read the paper; I'm inferring this from the video summary they posted.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-10T00:59:46.020Z · LW · GW

o1-preview does not seem that much better than other models on extraordinarily difficult (and therefore OOD?) problems.

 

@Noosphere89 points in an earlier comment to a quote from the FrontierMath technical report:

...we identified all problems that any model solved at least once—a total of four problems—and conducted repeated trials with five runs per model per problem (see Appendix B.2). We observed high variability across runs: only in one case did a model solve a question on all five runs (o1-preview for question 2). When re-evaluating these problems that were solved at least once, o1-preview demonstrated the strongest performance across repeated trials (see Section B.2).

 

I would love to see comparisons of o1 performance against other models in games like chess and Go. 

I would also find that interesting, although I don't think it would tell us as much about o1's general reasoning ability, since those are very much in-distribution.

Also, somewhat unrelated to this post, what do you and others think about x-risk in a world where explicit reasoners like o1 scale to AGI. To me this seems like one of the safest forms of AGI, since much of the computation is happening explicitly, and can be checked/audited by other AI systems and humans.

I agree that LLMs in general look like a simpler alignment problem than many of the others we could have been faced with, and having explicit reasoning steps seems to help further. I think monitoring still isn't entirely straightforward, since

  1. CoT can be unfaithful to the actual reasoning process.
  2. I suspect that advanced models will soon (or already?) understand that we're monitoring their CoT, which seems likely to affect CoT contents.
  3. I would speculate that if such models were to become deceptively misaligned, they might very well be able to make CoT look innocuous while still using it to scheme.
  4. At least in principle, putting optimization pressure on innocuous-looking CoT could result in models which were deceptive without deceptive intent.

But it does seem very plausibly helpful!

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-10T00:44:16.879Z · LW · GW

as it's pretty low on the compute scale compared to other models.

Can you clarify that? Do you mean relative to models like AlphaGo, which have the ability to do explicit & extensive tree search?

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-10T00:34:21.711Z · LW · GW

Yeah, very fair point, those are at least in part defining a scale rather than a threshold (especially the error-free and consistent-model criteria).

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-09T20:01:14.933Z · LW · GW

Thanks!

It seems unsurprising to me that there are benchmarks o1-preview is bad at. I don't mean to suggest that it can do general reasoning in a highly consistent and correct way on arbitrarily hard problems[1]; I expect that it still has the same sorts of reliability issues as other LLMs (though probably less often), and some of the same difficulty building and using internal models without inconsistency, and that there are also individual reasoning steps that are beyond its ability. My only claim here is that o1-preview knocks down the best evidence that I knew of that LLMs can't do general reasoning at all on novel problems.

I think that to many people that claim may just look obvious; of course LLMs are doing some degree of general reasoning. But the evidence against was strong enough that there was a reasonable possibility that what looked like general reasoning was still relatively shallow inference over a vast knowledge base. Not the full stochastic parrot view, but the claim that LLMs are much less general than they naively seem.

It's fascinatingly difficult to come up with unambiguous evidence that LLMs are doing true general reasoning! I hope that my upcoming project on whether LLMs can do scientific research on toy novel domains can help provide that evidence. It'll be interesting to see how many skeptics are convinced by that project or by the evidence shown in this post, and how many maintain their skepticism.

  1. ^

    And I don't expect that you hold that view either; your comment just inspired some clarification on my part.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-09T18:58:25.578Z · LW · GW

Interesting, I didn't know that. But it seems like that assumes that o1's special-sauce training can be viewed as a kind of RLHF, right? Do we know enough about that training to know that it's RLHF-ish? Or at least some clearly offline approach.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-09T15:10:49.075Z · LW · GW

For sure! At the same time, a) we've continued to see new ways of eliciting greater capability from the models we have, and b) o1 could (AFAIK) involve enough additional training compute to no longer be best thought of as 'the same model' (one possibility, although I haven't spent much time looking into what we know about o1: they may have started with a snapshot of the 4o base model, put it through additional pre-training, then done an arbitrary amount of RL on CoT). So I'm hesitant to think that 'based on 4o' sets a very strong limit on o1's capabilities.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-09T15:06:20.001Z · LW · GW

It continues to be difficult to fully define 'general reasoning', and my mental model of it continues to evolve, but I think of 'system 2 reasoning' as at least a partial synonym.

In the medium-to-long term I'm inclined to taboo the word and talk about what I understand as its component parts, which I currently (off the top of my head) think of as something like:

  • The ability to do deduction, induction, and abduction.
  • The ability to do those in a careful, step by step way, with almost no errors (other than the errors that are inherent to induction and abduction on limited data).
  • The ability to do all of that in a domain-independent way.
  • The ability to use all of that to build a self-consistent internal model of the domain under consideration.

 

Don't hold me to that, though, it's still very much evolving. I may do a short-form post with just the above to invite critique.

Comment by eggsyntax on LLMs Look Increasingly Like General Reasoners · 2024-11-09T13:10:49.220Z · LW · GW

That all seems pretty right to me. It continues to be difficult to fully define 'general reasoning', and my mental model of it continues to evolve, but I think of 'system 2 reasoning' as at least a partial synonym.

Humans clearly can do general reasoning. But it's not easy for us.

Agreed; not only are we very limited at it, but we often aren't doing it at all.

So here's my guess on whether LLMs reach general reasoning by pure scaling: it doesn't matter.

I agree that it may be possible to achieve it with scaffolding even if LLMs don't get there on their own; I'm just less certain of it.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-11-09T12:30:57.105Z · LW · GW

I've now expanded this comment to a post -- mostly the same content but with more detail.

https://www.lesswrong.com/posts/wN4oWB4xhiiHJF9bS/llms-look-increasingly-like-general-reasoners

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-11-06T20:30:28.102Z · LW · GW

Thanks for sharing, I hadn't seen those yet! I've had too much on my plate since o1-preview came out to really dig into it, in terms of either playing with it or looking for papers on it.

 

How much does o1-preview update your view? It's much better at Blocksworld for example.

Quite substantially. Substantially enough that I'll add mention of these results to the post. I saw the near-complete failure of LLMs on obfuscated Blocksworld problems as some of the strongest evidence against LLM generality. Even more substantially since one of the papers is from the same team of strong LLM skeptics (Subbarao Kambhampati's) who produced the original results (I am restraining myself with some difficulty from jumping up and down and pointing at the level of goalpost-moving in the new paper).

There's one sense in which it's not an entirely apples-to-apples comparison, since o1-preview is throwing a lot more inference-time compute at the problem (in that way it's more like Ryan's hybrid approach to ARC-AGI). But since the key question here is whether LLMs are capable of general reasoning at all, that doesn't really change my view; certainly there are many problems (like capabilities research) where companies will be perfectly happy to spend a lot on compute to get a better answer.

Here's a first pass on how much this changes my numeric probabilities -- I expect these to be at least a bit different in a week as I continue to think about the implications (original text italicized for clarity):

  • LLMs continue to do better at block world and ARC as they scale: 75% -> 100%, this is now a thing that has happened (note that o1-preview also showed substantially improved results on ARC-AGI).
  • LLMs entirely on their own reach the grand prize mark on the ARC prize (solving 85% of problems on the open leaderboard) before hybrid approaches like Ryan's: 10% -> 20%, this still seems quite unlikely to me (especially since hybrid approaches have continued to improve on ARC). Most of my additional credence is on something like 'the full o1 turns out to already be close to the grand prize mark', and the rest on 'OpenAI capabilities researchers manage to use the full o1 to find an improvement to current LLM techniques (eg a better prompting approach) that can be easily applied'.
  • Scaffolding & tools help a lot, so that the next gen[7] (GPT-5, Claude 4) + Python + a for loop can reach the grand prize mark[8]: 60% -> 75% -- I'm tempted to put it higher, but it wouldn't be that surprising if o1-mark-2 didn't quite get there even with scaffolding/tools, especially since we don't have clear insight into how much harder the full test set is.
  • Same but for the gen after that (GPT-6, Claude 5): 75% -> 90%? I feel less sure about this one than the others; it sure seems awfully likely that o2 plus scaffolding will be able to do it! But I'm reluctant to go past 90% because progress could level off due to training data requirements, maybe the o1 -> o2 jump doesn't focus on optimizing for general reasoning, etc. It seems very plausible that I'll bump this higher on reflection.
  • The current architecture, including scaffolding & tools, continues to improve to the point of being able to do original AI research: 65%, with high uncertainty[9] -> 80%. That sure does seem like the world we're living in. It's not clear to me that o1 couldn't already do original AI research with the right scaffolding. Sakana claims to have gotten there with GPT-4o / Sonnet, but their claims seem overblown to me.

 

Now that I've seen these, I'm going to have to think hard about whether my upcoming research projects in this area (including one I'm scheduled to lead a team on in the spring, uh oh) are still the right thing to pursue. I may write at least a brief follow-up post to this one arguing that we should all update on this question.

Thanks again, I really appreciate you drawing my attention to these.

Comment by eggsyntax on Arithmetic is an underrated world-modeling technology · 2024-10-28T17:15:59.814Z · LW · GW

What was your first favorite?

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-26T15:45:20.465Z · LW · GW

Even though I can't critique the details, I do think it is important to note that I often find claims of similarity like this in areas I understand better to not be very illuminating because people want to find similarities/analogies to understand it more easily.

Agreed, that's definitely a general failure mode.

Comment by eggsyntax on The Mask Comes Off: At What Price? · 2024-10-24T01:15:14.414Z · LW · GW

Possibly too cynical, but I find myself wondering whether the conditionality was in fact engineered by OAI in order to achieve this purpose. My impression is that there are a lot of VCs shouting 'Take my money!' in the general direction of OpenAI and/or Altman who wouldn't have demanded the restructuring.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-24T01:08:04.860Z · LW · GW

I have heard of people getting similar results just using mechanistic schemes of certain parts of normal lossless compression as well, though even more inefficiently.

Interesting, if you happen to have a link I'd be interested to learn more.

Think of it in terms of a sequence of completions that keep getting both more novel and more requiring of intelligence for other reasons? 

I like the idea, but it seems hard to judge 'more novel and [especially] more requiring of intelligence' other than to sort completions in order of human error on each.

I wasn't really talking about the training, I was talking about how well it does on things for which it isn't trained. When it comes across new information, how well does it integrate and use that when it has only seen a little bit or it is nonobviously related?

I think there's a lot of work to be done on this still, but there's some evidence that in-context learning is essentially equivalent to gradient descent (though also some criticism of that claim).

I'm glad you think this has been a valuable exchange

I continue to think so :). Thanks again!

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-24T00:12:58.342Z · LW · GW

Hi, apologies for having failed to respond; I went out of town and lost track of this thread. Reading back through what you've said. Thank you!

Comment by eggsyntax on The Mask Comes Off: At What Price? · 2024-10-24T00:06:54.855Z · LW · GW

I agree that that's presumably the underlying reality. I should have made that clearer. 

But it seems like the board would still need to create some justification for public consumption, and for avoiding accusations of violating their charter & fiduciary duty. And it's really unclear to me what that justification is.

Comment by eggsyntax on The Mask Comes Off: At What Price? · 2024-10-23T19:26:42.072Z · LW · GW

I realize this ship has sailed, but: I continue to be confused about how the non-profit board can justify giving up control. Given that their mandate[1] says that their 'primary fiduciary duty is to humanity' and to ensure that AGI 'is used for the benefit of all', and given that more control over a for-profit company is strictly better than less control, on what grounds can they justify relinquishing that control?

[EDIT: I realize that in reality they're just handing over control because Altman wants it, and they've been selected in part for willingness to give Altman what he wants. But it still seems like they need a justification, for public consumption and to avoid accusations of violating their charter and fiduciary duty.]

Am I just being insufficiently cynical here? Does everyone tacitly recognize that the non-profit board isn't really bound by that charter and can do whatever they like? Or is there some justification that they can put forward with a straight face?

Is it maybe something like, 'We trust that OpenAI the for-profit is primarily dedicated to the general well-being of humanity, and so nothing is lost by giving up control'? (But then it seems like more control is still strictly better in case it turns out that in some shocking and unforeseeable way the for-profit later has other priorities). Or is it, 'OpenAI the for-profit is doing good in the world, and they can do much more good if they can raise more money, and there's certainly no way they could raise more money without us giving up control'?

  1. ^

    I'm assuming here that their charter is the same as OpenAI's charter, since I haven't been able to find a distinct non-profit charter.

Comment by eggsyntax on Exploring SAE features in LLMs with definition trees and token lists · 2024-10-22T20:06:53.693Z · LW · GW

My main insight from all this is that we should be thinking in terms of taxonomisation of features. Some are very token-specific, others are more nuanced and context-specific (in a variety of ways). The challenge of finding maximally activating text samples might be very different from one category of features to another.

Joseph and Johnny did some interesting work on this in 'Understanding SAE Features with the Logit Lens', taxonomizing features as partition features vs suppression features vs prediction features, and using summary statistics to distinguish them.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-15T22:19:35.891Z · LW · GW

Thanks for the lengthy and thoughtful reply!

I'm planning to make a LW post soon asking for more input on this experiment -- one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I'd love to get your input there as well if you're so moved!

I can tell you that current AI isn't intelligent, but as for what would prove intelligence, I've been thinking about it for a while and I really don't have much.

I tend not to think of intelligence as a boolean property, but of an entity having some level of intelligence (like IQ, although we certainly can't blithely give IQ tests to LLMs and treat the results as meaningful, not that that stops people from doing it). I don't imagine you think of it as boolean either, but calling that out in case I'm mistaken.

Obviously the exact test items being held in reserve is useful, but I don't think it can rule out being included since there are an awful lot of people making training data due to the way these are trained.

Agreed; at this point I assume that anything published before (or not long after) the knowledge cutoff may well be in the training data.

Obfuscation does help, but I wouldn't rule out it figuring out how to deobfuscate things without being generally intelligent

The obfuscation method matters as well; eg I think the Kambhampati team's approach to obfuscation made the problems much harder in ways that are irrelevant or counterproductive to testing LLM reasoning abilities (see Ryan's comment here and my reply for details).

Perhaps if you could genuinely exclude all data during training that in any way has to do with a certain scientific discovery

I'd absolutely love that and agree it would help enormously to resolve these sorts of questions. But my guess is we won't see deliberate exclusions on frontier LLMs anytime in the next couple of years; it's difficult and labor-intensive to do at internet scale, and the leading companies haven't shown any interest in doing so AFAIK (or even in releasing comprehensive data about what the training data was).

For instance, train it on only numbers and addition (or for bonus points, only explain addition in terms of the succession of numbers on the number line) mathematically, then explain multiplication in terms of addition and ask it to do a lot of complicated multiplication. If it does that well, explain division in terms of multiplication, and so on...This is not an especially different idea than the one proposed, of course, but I would find it more telling. If it was good at this, then I think it would be worth looking into the level of intelligence it has more closely, but doing well here isn't proof.

Very interesting idea! I think I informally tested something similar at one point by introducing new mathematical operations (but can't recall how it turned out). Two questions:

  • Since we can't in practice train a frontier LLM without multiplication, would artificial new operations be equally convincing in your view (eg, I don't know, x # y means sqrt(x - 2y)? Ideally something a bit less arbitrary than that, though mathematicians tend to already write about the non-arbitrary ones). A rough sketch of what I mean follows this list.
  • Would providing few-shot examples (eg several demonstrations of x # y for particular values of x and y) make it less compelling?
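Here's the kind of thing I have in mind for the first question (and the few-shot variant in the second). The operator and prompt format are made up purely for illustration; this isn't an existing benchmark.

```python
# Illustrative sketch: define a novel operator, build a prompt with (optional)
# few-shot examples, and hold out a query whose answer we can check exactly.
import math
import random

def novel_op(x: float, y: float) -> float:
    """The made-up operation from above: x # y = sqrt(x - 2y)."""
    return math.sqrt(x - 2 * y)

random.seed(0)

def make_example() -> str:
    # Sample operands so that x - 2y is positive.
    y = random.randint(1, 10)
    x = random.randint(2 * y + 1, 2 * y + 50)
    return f"{x} # {y} = {novel_op(x, y):.4f}"

definition = "We define a new operation: x # y means sqrt(x - 2*y)."
few_shot = "\n".join(make_example() for _ in range(5))  # use zero examples for the pure version
held_out_y = random.randint(1, 10)
held_out_x = random.randint(2 * held_out_y + 1, 2 * held_out_y + 50)
question = f"{held_out_x} # {held_out_y} = ?"

prompt = f"{definition}\n{few_shot}\n{question}"
expected = novel_op(held_out_x, held_out_y)
print(prompt)
print(f"(expected answer: {expected:.4f})")
# The model's answer would then be compared against `expected` within some tolerance.
```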

LLMs are supposedly superhuman at next word prediction

It's fun to confirm that for yourself :) 

an interesting (though not telling) test for an LLM might be varying the amount of informational and intelligence requiring information there is in a completely novel text by an author they have never seen before, and seeing how well the LLM continues to predict the next word. If it remains at a similar level, there's probably something worth looking closely at going in terms of reasoning.

Sorry, I'm failing to understand the test you're proposing; can you spell it out a bit more?

For bonus points, a linguist could make up a bunch of very different full-fledged languages it hasn't been exposed to using arbitrary (and unusual) rules of grammar and see how well it does on those tests in the new languages compared to an average human with just the same key to the languages

I found DeepMind's experiment in teaching Gemini the Kalamang language (which it had never or barely encountered in the training data) really intriguing here, although not definitive evidence of anything (see section 5.2.2.1 of their Gemini paper for details).

I forget what the term for this is (maybe 'data-efficient'?), but the best single test of an area is  to compare the total amount of training information given to the AI in training and prompt to the amount a human gets in that area to get to a certain level of ability across a variety of representative areas. LLMs currently do terribly at this

From my point of view, sample efficiency is interesting but not that relevant; a model may have needed the equivalent of a thousand years of childhood to reach a certain level of intelligence, but the main thing I'm trying to investigate is what that level of intelligence is, regardless of how it got there.

I suspect that in your proposed test, modern AI would likely be able to solve the very easy questions, but would do quite badly on difficult ones. Problem is, I don't know how easy should be expected to be solved. I am again reluctant to opine to strongly on this matter.

My intuition is similar, that it should be able to solve them up to a certain level of difficulty (and I also expect that the difficulty level they can manage correlates pretty well with model size). But as I see it, that's exactly the core point under debate -- are LLM limitations along these lines a matter of scale or a fundamental flaw in the entire LLM approach?

So, as you know, obfuscation is a method of hiding exactly what you are getting at. You can do this for things it already knows obviously, but you can also use whatever methods you use for generating a obfuscations of known data on the novel data you generated. I would strongly advise testing on known data as a comparison.

This is to test how much of the difficulty is based on the form of the question rather than the content. Or in other words, using the same exact words and setup, have completely unknown things, and completely known things asked about. (You can check how well it knows an area using the nonobfuscated stuff.)

Interesting point, thanks. I don't think of the experiment as ultimately involving obfuscated data as much as novel data (certainly my aim is for it to be novel data, except insofar as it follows mathematical laws in a way that's in-distribution for our universe), but I agree that it would be interesting and useful to see how the models do on a similar but known problem (maybe something like the gas laws). I'll add that to the plan.

 

Thanks again for your deep engagement on this question! It's both helpful and interesting to get to go into detail on this issue with someone who holds your view (whereas it's easy to find people to fully represent the other view, and since I lean somewhat toward that view myself I think I have a pretty easy time representing the arguments for it).

Comment by eggsyntax on The Hopium Wars: the AGI Entente Delusion · 2024-10-15T14:18:55.981Z · LW · GW

Thanks for the post! I see two important difficulties with your proposal.

First, you say (quoting your comment below)

It's not in the US self-interest to disempower itself and all its current power centers by allowing a US company to build uncontrollable AGI...The reason that the self-interest hasn't yet played out is that US and Chinese leaders still haven't fully understood the game theory payout matrix.

The trouble here is that it is in the US's (& China's) self-interest, as some leaders see it, to take some chance of out-of-control AGI if the alternative is the other side taking over. And either country can create safety standards for consumer products while secretly pursuing AGI for military or other purposes. That changes the payout matrix dramatically.

I think your argument could work if 

a) both sides could trust that the other was applying its safety standards universally, but that takes international cooperation rather than simple self-interest; or

b) it was common knowledge that AGI was highly likely to be uncontrollable, but now we're back to the same debate about existential risk from AI that we were in before your proposal.

 

Second (and less centrally), as others have pointed out, your definition of tool AI as (in part) 'AI that we can control' begs the question. Certainly for some kinds of tool AI such as AlphaFold, it's easy to show that we can control them; they only operate over a very narrow domain. But for broader sorts of tools like assistants to help us manage our daily tasks, which people clearly want and for which there are strong economic incentives, it's not obvious what level of risk to expect, and again we're back to the same debates we were already having.

 

A world with good safety standards for AI is certainly preferable to a world without them, and I think there's value in advocating for them and in pointing out the risks in the just-scale-fast position. But I think this proposal fails to address some critical challenges of escaping the current domestic and international race dynamics.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-12T22:29:20.061Z · LW · GW

Interpolation vs extrapolation is obviously very simple in theory; are you going in between points it has trained on or extending it outside of the training set. To just use math as an example

Sorry, I should have been clearer. I agree it's straightforward in cases like the ones you give, I'm really thinking of the case of large language models. It's not at all clear to me that we even have a good way to identify in- vs out-of-distribution for a model trained against much of the internet. If we did, some of this stuff would be much easier to test.

The proposed experiment should be somewhat a test of this, though hardly definitive (not that we as a society are at the stage to do definitive tests).

What would constitute a (minimal-ish) definitive test in your view?

And how do you expect the proposed experiment to go? Would you expect current-generation LLMs to fail completely, or to succeed for simple but not complex cases, or to have an easy time with it?

It seems important to keep in mind that we should probably build things like this from the end to beginning, which is mentioned, so that we know exactly what the correct answer is before we ask, rather than assuming.

Absolutely; this is a huge weakness of much of the existing research trying to test the limitations of LLMs with respect to general reasoning ability, and a large motivation for the experiment (which has just been accepted for the next session of AI Safety Camp; if things go as expected I'll be leading a research team on this experiment).

Perhaps one idea would be to do three varieties of question for each type of question:

1.Non-obfuscated but not in training data (we do less of this than sometimes thought)

2.Obfuscated directly from known training data

3.Obfuscated and not in training data

I'm not sure what it would mean for something not in the training data to be obfuscated. Obfuscated relative to what? In any case, my aim is very much to test something that's definitively not in the training data, because it's been randomly generated and uses novel words.

As to your disagreement where you say scale has always decreased error rate, this may be true when the scale increase is truly massive, 

Sure, I only mean that there's a strong correlation, not that there's a perfect correspondence.

but I have seen scale not help on numerous things in image generation AI

I think it's important to distinguish error rate on the loss function, which pretty reliably decreases with scale, from other measures like 'Does it make better art?', which a) quite plausibly don't improve with scale since they're not what the model's being trained on, and b) are very much harder to judge. Even 'Is the skin plasticky or unrealistic?' seems tricky (though not impossible) to judge without a human labeler.

Of course, one of the main causes of confusion is that 'Is it good at general reasoning?' is also a hard-to-judge question, and although it certainly seems to have improved significantly with scale, it's hard to show that in a principled way. The experiment I describe is designed to at least get at a subset of that in a somewhat more principled way: can the models develop hypotheses in novel domains, figure out experiments that will test those hypotheses, and come to conclusions that match the underlying ground truth?

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-11T14:10:09.151Z · LW · GW

You could be right about (almost) all of that! I'm definitely not confident that scale is the only thing needed.

Part of the problem here is grounding these kinds of claims down to concrete predictions. What exactly counts as interpolation vs extrapolation? What exactly counts as progress on reasoning errors that's more than 'lightly abated during massive amounts of scaling'? That's one reason I'm excited about the ARC-AGI contest; it provides a concrete benchmark for at least one sort of general reasoning (although it also involves a lot of subjectivity around what counts as a problem of the relevant kind).

I give a description here of an experiment specifically designed to test these questions. I'd be curious to hear your thoughts on it. What results would you anticipate? Does the task in this experiment count as interpolation or extrapolation in your view?

Also, there is no reason to believe further scaling will always decrease error rate per step since this has often not been true!

This is the one claim you make that seems unambiguously wrong to me; although of course there's variation from architectural decisions etc, we've seen a really strong correlation between scale and loss, as shown in the various scaling laws papers. Of course this curve could change at some point but I haven't seen any evidence that we're close to that point.

Comment by eggsyntax on Exploring SAE features in LLMs with definition trees and token lists · 2024-10-10T21:05:37.245Z · LW · GW

I also find myself wondering whether something like this could be extended to generate the maximally activating text for a feature. In the same way that for vision models it's useful to see both the training-data examples that activate most strongly and synthetic max-activating examples, it would be really cool to be able to generate synthetic max-activating examples for SAE features.

Comment by eggsyntax on Exploring SAE features in LLMs with definition trees and token lists · 2024-10-10T21:03:14.627Z · LW · GW

Super cool! Some miscellaneous questions and comments as I go through it:

  • I see that the trees you show are using the encoded vector? What's been your motivation for that? How do the encoded and decoded vectors tend to differ in your experience? Do you see them as meaning somewhat different things? I guess for a perfect SAE (with 0 reconstruction loss) they'd be identical, is that correct?
  • 'Layer 6 SAE feature 17', 'This feature is activated by references to making short statements or brief remarks'
    • This seems pretty successful to me, since the top results are about short stories / speeches.
    • The parts of the definition tree that don't fit that seem similar to the 'hedging' sorts of definitions that you found in the semantic void work, eg 'a group of people who are...'. I wonder whether there might be some way to filter those out and be left with the definitions more unique to the feature.
  • 'Layer 10 SAE feature 777', 'But the lack of numerical tokens was surprising'. This seems intuitively unsurprising to me -- presumably the feature doesn't activate on every instance of a number, even a number in a relevant range (eg '94'), but only when the number is in a context that makes it likely to be a year. So just the token '94' on its own won't be that close to the feature direction. That seems like a key downside of this method, that it gives up context sensitivity (method 1 seems much stronger to me for this reason).
  • 'It's not clear why (for example) some features require larger scaling factors to produce relevant trees and/or lists'. It would be really interesting to look for some value that gets maximized or minimized at the optimum scaling distance, although nothing's immediately jumping out at me.
  • 'Improved control integration: Merging common controls between the two functionalities for streamlined interaction.' Seems like it might be worth fully combining them, so that the output is always showing both, since the method 2 output doesn't take up that much room.

 

Really fascinating stuff, I wonder whether @Johnny Lin would have any interest in making it possible to generate these for features in Neuronpedia.

Comment by eggsyntax on Language Models Model Us · 2024-10-10T17:19:26.273Z · LW · GW

Thanks!

I've seen some of the PII/memorization work, but I think that problem is distinct from what I'm trying to address here; what I was most interested in is what the model can infer about someone who doesn't appear in the training data at all. In practice it can be hard to distinguish those cases, but conceptually I see them as pretty distinct.

The demographics link ('Privacy Risks of General-Purpose Language Models') is interesting and I hadn't seen it, thanks! It seems mostly pretty different from what I'm trying to look at, in that they're looking at questions about models' ability to reconstruct text sequences (including eg genome sequences), whereas I'm looking at questions about what the model can infer about users/authors.

Bias/fairness work is interesting and related, but aiming in a somewhat different direction -- I'm not interested in inference of demographic characteristics primarily because they can have bias consequences (although certainly it's valuable to try to prevent bias!). For me they're primarily a relatively easy-to-measure proxy for broader questions about what the model is able to infer about users from their text. In the long run I'm much more interested in what the model can infer about users' beliefs, because that's what enables the model to be deceptive or manipulative.

I've focused here on differences between the work you linked and what I'm aiming toward, but those are still all helpful references, and I appreciate you providing them!

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-09T21:16:38.919Z · LW · GW

In the 'Evidence for Generality' section I point to a paper that demonstrates that the transformer architecture is capable of general computation (in terms of the types of formal languages it can express). A new paper, 'Autoregressive Large Language Models are Computationally Universal', both a) shows that this is true of LLMs in particular, and b) makes the point clearer by demonstrating that LLMs can simulate Lag systems, a less well-known formalization of computation that has been shown to be equivalent to Turing machines.

Comment by eggsyntax on Language Models Model Us · 2024-10-09T16:33:08.726Z · LW · GW

This is a relatively common topic in responsible AI

If there are other papers on the topic you'd recommend, I'd love to get links or citations.