Posts

AIS terminology proposal: standardize terms for probability ranges 2024-08-30T15:43:39.857Z
LLM Generality is a Timeline Crux 2024-06-24T12:52:07.704Z
Language Models Model Us 2024-05-17T21:00:34.821Z
Useful starting code for interpretability 2024-02-13T23:13:47.940Z
eggsyntax's Shortform 2024-01-13T22:34:07.553Z

Comments

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-15T22:19:35.891Z · LW · GW

Thanks for the lengthy and thoughtful reply!

I'm planning to make a LW post soon asking for more input on this experiment -- one of my goals here is to make this experiment one that both sides of the debate agree in advance would provide good evidence. I'd love to get your input there as well if you're so moved!

I can tell you that current AI isn't intelligent, but as for what would prove intelligence, I've been thinking about it for a while and I really don't have much.

I tend not to think of intelligence as a boolean property, but of an entity having some level of intelligence (like IQ, although we certainly can't blithely give IQ tests to LLMs and treat the results as meaningful, not that that stops people from doing it). I don't imagine you think of it as boolean either, but calling that out in case I'm mistaken.

Obviously the exact test items being held in reserve is useful, but I don't think it can rule out being included since there are an awful lot of people making training data due to the way these are trained.

Agreed; at this point I assume that anything published before (or not long after) the knowledge cutoff may well be in the training data.

Obfuscation does help, but I wouldn't rule out it figuring out how to deobfuscate things without being generally intelligent

The obfuscation method matters as well; eg I think the Kambhampati team's approach to obfuscation made the problems much harder in ways that are irrelevant or counterproductive to testing LLM reasoning abilities (see Ryan's comment here and my reply for details).

Perhaps if you could genuinely exclude all data during training that in any way has to do with a certain scientific discovery

I'd absolutely love that and agree it would help enormously to resolve these sorts of questions. But my guess is we won't see deliberate exclusions on frontier LLMs anytime in the next couple of years; it's difficult and labor-intensive to do at internet scale, and the leading companies haven't shown any interest in doing so AFAIK (or even in releasing comprehensive data about what the training data was).

For instance, train it on only numbers and addition (or for bonus points, only explain addition in terms of the succession of numbers on the number line) mathematically, then explain multiplication in terms of addition and ask it to do a lot of complicated multiplication. If it does that well, explain division in terms of multiplication, and so on...This is not an especially different idea than the one proposed, of course, but I would find it more telling. If it was good at this, then I think it would be worth looking into the level of intelligence it has more closely, but doing well here isn't proof.

Very interesting idea! I think at one point I informally tested something similar by introducing new mathematical operations (but can't recall how it turned out). Two questions:

  • Since we can't in practice train a frontier LLM without multiplication, would artificial new operations be equally convincing in your view (eg, I don't know, x # y defined as sqrt(x - 2y)? Ideally something a bit less arbitrary than that, though mathematicians tend to have already written about the non-arbitrary ones)? A rough sketch of what I mean follows this list.
  • Would providing few-shot examples (eg several demonstrations of x # y for particular values of x and y) make it less compelling?
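To make the first question concrete, here's a rough sketch of the kind of generation-plus-checking harness I have in mind (the '#' operation, the function names, and the few-shot format are all just made up for illustration, not anything I've settled on):

```python
import math
import random

def hash_op(x: float, y: float) -> float:
    """A hypothetical novel operation: x # y := sqrt(x - 2y) (requires x >= 2y)."""
    return math.sqrt(x - 2 * y)

def make_prompt(n_shots: int = 3) -> tuple[str, float]:
    """Build a few-shot prompt for '#' plus a held-out query we can grade exactly."""
    lines = ["We define a new operation '#' on numbers. Some examples:"]
    for _ in range(n_shots + 1):  # the last example becomes the held-out query
        y = random.randint(1, 10)
        x = random.randint(2 * y, 2 * y + 50)  # keep x - 2y non-negative
        lines.append(f"{x} # {y} = {hash_op(x, y):.4f}")
    query_line = lines.pop()            # withhold the answer to the final example
    query, answer = query_line.split(" = ")
    lines.append(f"{query} = ?")
    return "\n".join(lines), float(answer)

prompt, expected = make_prompt()
print(prompt)
print("expected:", expected)
```

Setting n_shots to 0 gives the zero-shot version of the same test, which is part of why I'm curious whether the few-shot examples change how compelling you'd find a success.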

LLMs are supposedly superhuman at next word prediction

It's fun to confirm that for yourself :) 

an interesting (though not telling) test for an LLM might be varying the amount of informational and intelligence requiring information there is in a completely novel text by an author they have never seen before, and seeing how well the LLM continues to predict the next word. If it remains at a similar level, there's probably something worth looking closely at going in terms of reasoning.

Sorry, I'm failing to understand the test you're proposing; can you spell it out a bit more?

For bonus points, a linguist could make up a bunch of very different full-fledged languages it hasn't been exposed to using arbitrary (and unusual) rules of grammar and see how well it does on those tests in the new languages compared to an average human with just the same key to the languages

I found DeepMind's experiment in teaching Gemini the Kalamang language (which it had never or barely encountered in the training data) really intriguing here, although not definitive evidence of anything (see section 5.2.2.1 of their Gemini paper for details).

I forget what the term for this is (maybe 'data-efficient'?), but the best single test of an area is  to compare the total amount of training information given to the AI in training and prompt to the amount a human gets in that area to get to a certain level of ability across a variety of representative areas. LLMs currently do terribly at this

From my point of view, sample efficiency is interesting but not that relevant; a model may have needed the equivalent of a thousand years of childhood to reach a certain level of intelligence, but the main thing I'm trying to investigate is what that level of intelligence is, regardless of how it got there.

I suspect that in your proposed test, modern AI would likely be able to solve the very easy questions, but would do quite badly on difficult ones. Problem is, I don't know how easy should be expected to be solved. I am again reluctant to opine to strongly on this matter.

My intuition is similar, that it should be able to solve them up to a certain level of difficulty (and I also expect that the difficulty level they can manage correlates pretty well with model size). But as I see it, that's exactly the core point under debate -- are LLM limitations along these lines a matter of scale or a fundamental flaw in the entire LLM approach?

So, as you know, obfuscation is a method of hiding exactly what you are getting at. You can do this for things it already knows obviously, but you can also use whatever methods you use for generating a obfuscations of known data on the novel data you generated. I would strongly advise testing on known data as a comparison.

This is to test how much of the difficulty is based on the form of the question rather than the content. Or in other words, using the same exact words and setup, have completely unknown things, and completely known things asked about. (You can check how well it knows an area using the nonobfuscated stuff.)

Interesting point, thanks. I don't think of the experiment as ultimately involving obfuscated data as much as novel data (certainly my aim is for it to be novel data, except insofar as it follows mathematical laws in a way that's in-distribution for our universe), but I agree that it would be interesting and useful to see how the models do on a similar but known problem (maybe something like the gas laws). I'll add that to the plan.

 

Thanks again for your deep engagement on this question! It's both helpful and interesting to get to go into detail on this issue with someone who holds your view (whereas it's easy to find people to fully represent the other view, and since I lean somewhat toward that view myself I think I have a pretty easy time representing the arguments for it).

Comment by eggsyntax on The Hopium Wars: the AGI Entente Delusion · 2024-10-15T14:18:55.981Z · LW · GW

Thanks for the post! I see two important difficulties with your proposal.

First, you say (quoting your comment below)

It's not in the US self-interest to disempower itself and all its current power centers by allowing a US company to build uncontrollable AGI...The reason that the self-interest hasn't yet played out is that US and Chinese leaders still haven't fully understood the game theory payout matrix.

The trouble here is that it is in the US (& China's) self-interest, as that's seen by some leaders, to take some chance of out-of-control AGI if the alternative is the other side taking over. And either country can create safety standards for consumer products while secretly pursuing AGI for military or other purposes. That changes the payout matrix dramatically. 

I think your argument could work if 

a) both sides could trust that the other was applying its safety standards universally, but that takes international cooperation rather than simple self-interest; or

b) it was common knowledge that AGI was highly likely to be uncontrollable, but now we're back to the same debate about existential risk from AI that we were in before your proposal.

 

Second (and less centrally), as others have pointed out, your definition of tool AI as (in part) 'AI that we can control' begs the question. Certainly for some kinds of tool AI such as AlphaFold, it's easy to show that we can control them; they only operate over a very narrow domain. But for broader sorts of tools like assistants to help us manage our daily tasks, which people clearly want and for which there are strong economic incentives, it's not obvious what level of risk to expect, and again we're back to the same debates we were already having.

 

A world with good safety standards for AI is certainly preferable to a world without them, and I think there's value in advocating for them and in pointing out the risks in the just-scale-fast position. But I think this proposal fails to address some critical challenges of escaping the current domestic and international race dynamics.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-12T22:29:20.061Z · LW · GW

Interpolation vs extrapolation is obviously very simple in theory; are you going in between points it has trained on or extending it outside of the training set. To just use math as an example

Sorry, I should have been clearer. I agree it's straightforward in cases like the ones you give, I'm really thinking of the case of large language models. It's not at all clear to me that we even have a good way to identify in- vs out-of-distribution for a model trained against much of the internet. If we did, some of this stuff would be much easier to test.

The proposed experiment should be somewhat a test of this, though hardly definitive (not that we as a society are at the stage to do definitive tests).

What would constitute a (minimal-ish) definitive test in your view?

And how do you expect the proposed experiment to go? Would you expect current-generation LLMs to fail completely, or to succeed for simple but not complex cases, or to have an easy time with it?

It seems important to keep in mind that we should probably build things like this from the end to beginning, which is mentioned, so that we know exactly what the correct answer is before we ask, rather than assuming.

Absolutely; this is a huge weakness of much of the existing research trying to test the limitations of LLMs with respect to general reasoning ability, and a large motivation for the experiment (which has just been accepted for the next session of AI Safety Camp; if things go as expected I'll be leading a research team on this experiment).

Perhaps one idea would be to do three varieties of question for each type of question:

1. Non-obfuscated but not in training data (we do less of this than sometimes thought)

2. Obfuscated directly from known training data

3. Obfuscated and not in training data

I'm not sure what it would mean for something not in the training data to be obfuscated. Obfuscated relative to what? In any case, my aim is very much to test something that's definitively not in the training data, because it's been randomly generated and uses novel words.

As to your disagreement where you say scale has always decreased error rate, this may be true when the scale increase is truly massive, 

Sure, I only mean that there's a strong correlation, not that there's a perfect correspondence.

but I have seen scale not help on numerous things in image generation AI

I think it's important to distinguish error rate on the loss function, which pretty reliably decreases with scale, from other measures like 'Does it make better art?', which a) quite plausibly don't improve with scale since they're not what the model's being trained on, and b) are very much harder to judge. Even 'Is the skin plasticky or unrealistic?' seems tricky (though not impossible) to judge without a human labeler.

Of course, one of the main causes of confusion is that 'Is it good at general reasoning?' is also a hard-to-judge question, and although it certainly seems to have improved significantly with scale, it's hard to show that in a principled way. The experiment I describe is designed to at least get at a subset of that in a somewhat more principled way: can the models develop hypotheses in novel domains, figure out experiments that will test those hypotheses, and come to conclusions that match the underlying ground truth?

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-11T14:10:09.151Z · LW · GW

You could be right about (almost) all of that! I'm definitely not confident that scale is the only thing needed.

Part of the problem here is grounding these kinds of claims down to concrete predictions. What exactly counts as interpolation vs extrapolation? What exactly counts as progress on reasoning errors that's more than 'lightly abated during massive amounts of scaling'? That's one reason I'm excited about the ARC-AGI contest; it provides a concrete benchmark for at least one sort of general reasoning (although it also involves a lot of subjectivity around what counts as a problem of the relevant kind).

I give a description here of an experiment specifically designed to test these questions. I'd be curious to hear your thoughts on it. What results would you anticipate? Does the task in this experiment count as interpolation or extrapolation in your view?

Also, there is no reason to believe further scaling will always decrease error rate per step since this has often not been true!

This is the one claim you make that seems unambiguously wrong to me; although of course there's variation from architectural decisions etc, we've seen a really strong correlation between scale and loss, as shown in the various scaling laws papers. Of course this curve could change at some point but I haven't seen any evidence that we're close to that point.
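For reference, the kind of relationship I mean is the parametric form fitted in the Chinchilla paper (Hoffmann et al. 2022), where predicted loss falls smoothly as parameter count N and training-token count D grow (the constants E, A, B, α, β are fitted empirically and vary across papers and setups):

$$L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$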

Comment by eggsyntax on Exploring SAE features in LLMs with definition trees and token lists · 2024-10-10T21:05:37.245Z · LW · GW

I also find myself wondering whether something like this could be extended to generate the maximally activating text for a feature. In the same way that for vision models it's useful to see both the training-data examples that activate most strongly and synthetic max-activating examples, it would be really cool to be able to generate synthetic max-activating examples for SAE features.

Comment by eggsyntax on Exploring SAE features in LLMs with definition trees and token lists · 2024-10-10T21:03:14.627Z · LW · GW

Super cool! Some miscellaneous questions and comments as I go through it:

  • I see that the trees you show are using the encoded vector? What's been your motivation for that? How do the encoded and decoded vectors tend to differ in your experience? Do you see them as meaning somewhat different things? I guess for a perfect SAE (with 0 reconstruction loss) they'd be identical, is that correct?
  • 'Layer 6 SAE feature 17', 'This feature is activated by references to making short statements or brief remarks'
    • This seems pretty successful to me, since the top results are about short stories / speeches.
    • The parts of the definition tree that don't fit that seem similar to the 'hedging' sorts of definitions that you found in the semantic void work, eg 'a group of people who are...'. I wonder whether there might be some way to filter those out and be left with the definitions more unique to the feature.
  • 'Layer 10 SAE feature 777', 'But the lack of numerical tokens was surprising'. This seems intuitively unsurprising to me -- presumably the feature doesn't activate on every instance of a number, even a number in a relevant range (eg '94'), but only when the number is in a context that makes it likely to be a year. So just the token '94' on its own won't be that close to the feature direction. That seems like a key downside of this method, that it gives up context sensitivity (method 1 seems much stronger to me for this reason).
  • 'It's not clear why (for example) some features require larger scaling factors to produce relevant trees and/or lists'. It would be really interesting to look for some value that gets maximized or minimized at the optimum scaling distance, although nothing's immediately jumping out at me.
  • 'Improved control integration: Merging common controls between the two functionalities for streamlined interaction.' Seems like it might be worth fully combining them, so that the output is always showing both, since the method 2 output doesn't take up that much room.

 

Really fascinating stuff, I wonder whether @Johnny Lin would have any interest in making it possible to generate these for features in Neuronpedia.

Comment by eggsyntax on Language Models Model Us · 2024-10-10T17:19:26.273Z · LW · GW

Thanks!

I've seen some of the PII/memorization work, but I think that problem is distinct from what I'm trying to address here; what I was most interested in is what the model can infer about someone who doesn't appear in the training data at all. In practice it can be hard to distinguish those cases, but conceptually I see them as pretty distinct.

The demographics link ('Privacy Risks of General-Purpose Language Models') is interesting and I hadn't seen it, thanks! It seems mostly pretty different from what I'm trying to look at, in that they're looking at questions about models' ability to reconstruct text sequences (including eg genome sequences), whereas I'm looking at questions about what the model can infer about users/authors.

Bias/fairness work is interesting and related, but aiming in a somewhat different direction -- I'm not interested in inference of demographic characteristics primarily because they can have bias consequences (although certainly it's valuable to try to prevent bias!). For me they're primarily a relatively easy-to-measure proxy for broader questions about what the model is able to infer about users from their text. In the long run I'm much more interested in what the model can infer about users' beliefs, because that's what enables the model to be deceptive or manipulative.

I've focused here on differences between the work you linked and what I'm aiming toward, but those are still all helpful references, and I appreciate you providing them!

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-10-09T21:16:38.919Z · LW · GW

In the 'Evidence for Generality' section I point to a paper that demonstrates that the transformer architecture is capable of general computation (in terms of the types of formal languages it can express). A new paper, 'Autoregressive Large Language Models are Computationally Universal', both a) shows that this is true of LLMs in particular, and b) makes the point clearer by demonstrating that LLMs can simulate Lag systems, a less well-known formalization of computation that has been shown to be equivalent in power to Turing machines.

Comment by eggsyntax on Language Models Model Us · 2024-10-09T16:33:08.726Z · LW · GW

This is a relatively common topic in responsible AI

If there are other papers on the topic you'd recommend, I'd love to get links or citations.

Comment by eggsyntax on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders · 2024-09-25T19:48:21.597Z · LW · GW

That all makes sense, thanks. I'm really looking forward to seeing where this line of research goes from here!

Comment by eggsyntax on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders · 2024-09-25T19:38:35.239Z · LW · GW

Determining ground-truth definitely seems like the tough aspect there. Very good idea to come up with 'starts with _' as a case where that issue is tractable, and another good idea to tackle it with toy models where you can control that up front. Thanks!

Comment by eggsyntax on [Paper] A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders · 2024-09-25T13:58:46.043Z · LW · GW

What a great discovery, that's extremely cool. Intuitively, I would worry a bit that the 'spelling miracle' is such an odd edge case that it may not be representative of typical behavior, although just the fact that 'starts with _' shows up as an SAE feature assuages that worry somewhat. I can see why you'd choose it, though, since it's so easy to mechanically confirm what tokens ought to trigger the feature. Do you have some ideas for non-spelling-related features that would make good next tests?

Comment by eggsyntax on If I wanted to spend WAY more on AI, what would I spend it on? · 2024-09-22T17:20:13.253Z · LW · GW

Maybe next year, the critical point will be reached where spending a lot on inference to make many tries at each necessary step will become effective.

 

That raises an excellent point that hasn't been otherwise brought up -- it's clear that there are at least some cases already where you can get much better performance by doing best-of-n with large n. I'm thinking especially of Ryan Greenblatt's approach to ARC-AGI, where that was pretty successful (n = 8000). And as Ryan points out, that's the approach that AlphaCode uses as well (n = some enormous number). That seems like plausibly the best use of a lot of money with current LLMs.
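For anyone unfamiliar with the approach, here's a minimal sketch of what best-of-n looks like in practice, assuming you have some automatic scorer (for ARC-AGI-style program synthesis that can just be checking each candidate program against the training examples; the function names here are mine, not anything from Ryan's or AlphaCode's actual code):

```python
def best_of_n(generate, score, task, n: int = 1000):
    """Sample n candidate solutions independently and keep the highest-scoring one.

    generate(task) -> one candidate solution (e.g. a sampled LLM completion)
    score(task, candidate) -> a number, higher is better (e.g. fraction of
        the task's training examples that a candidate program reproduces)
    """
    best, best_score = None, float("-inf")
    for _ in range(n):
        candidate = generate(task)
        s = score(task, candidate)
        if s > best_score:
            best, best_score = candidate, s
    return best, best_score
```

The only hard requirement is a scoring signal that's cheaper and more reliable than generation; everything else is pure inference spend, which is exactly the kind of thing a large budget can buy.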

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-20T12:08:51.212Z · LW · GW

"simulations or training situations" doesn't necessarily sound like fun.

Seems like some would be and some wouldn't. Although those are the 'medium significance' ones; the largest category is the 188 that used 'low significance' tasks. Still doesn't map exactly to 'fun', but I expect those ones are at least very low stress.

Generally, comparing kids vs adults could be interesting, although it is difficult to say what would be an equivalent mental effort. Specifically I am curious about the impact of school. Oh, we should also compare homeschooled kids vs kids in school, to separate the effects of school and age.

That would definitely be interesting; it wouldn't surprise me if at least a couple of the studies in the meta-analysis did that.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-19T13:35:34.687Z · LW · GW

If your typical case of mental effort is solving puzzles and playing computer games, you will find mental effort pleasant. If instead your typical case is something like "a teacher tells me to solve a difficult problem in a stressful situation, and if I fail, I will be punished", you will find mental effort unpleasant.

I'm somewhat doubtful that this is the main moderator. The meta-analysis codes the included studies according to whether 'the participants’ task behavior either affected other people or affected some real-world outcome'. Only 14 of the studies were like that; of the rest, 148 were 'simulations or training situations' and the remaining 188 were low-significance, ie there was nothing at stake. I would guess that many of them were game-like. That significance difference had nearly no effect (−0.03, 95% CI [−0.27, 0.21]) on how aversive participants found the task.

That doesn't rule out your second suggestion, that people find mental effort unpleasant if they've associated it over time with stressful and consequential situations, but it's evidence against that being a factor for the particular task.

It does very much depend on the person, though ('a well-established line of research shows that people vary in their need for cognition, that is, their “tendency to engage in and enjoy effortful cognitive endeavors”'). I suspect that the large majority of LessWrong participants are people who enjoy mental effort.

Comment by eggsyntax on "Real AGI" · 2024-09-18T16:55:35.529Z · LW · GW

I think it's a useful & on-first-read coherent definition; my main suggestion is to not accompany it with 'AGI'; we already have too many competing standard definitions of AGI for another one to be useful. Holden Karnofsky's choice to move away from that with 'Transformative AI' made it much easier to talk about clearly.

Humans have at least four types

What are the four types? I can imagine breaking that up along different axes.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-18T16:21:08.592Z · LW · GW

Two interesting things from this recent Ethan Mollick post:

  1. He points to this recent meta-analysis that finds pretty clearly that most people find mental effort unpleasant. I suspect that this will be unsurprising to many people around here, and I also suspect that some here will be very surprised due to typical mind fallacy.
  2. It's no longer possible to consistently identify AI writing, despite most people thinking that they can; I'll quote a key paragraph with some links below, but see the post for details. I'm reminded of the great 'can you tell if audio files are compressed?' debates, where nearly everyone thought that they could but blind testing proved they couldn't (if they were compressed at a decent bitrate).

People can’t detect AI writing well. Editors at top linguistics journals couldn’t. Teachers couldn’t (though they thought they could - the Illusion again). While simple AI writing might be detectable (“delve,” anyone?), there are plenty of ways to disguise “AI writing” styles through simple prompting. In fact, well-prompted AI writing is judged more human than human writing by readers.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-18T13:31:59.181Z · LW · GW

I think that in particular, one kind of alignment problem that's clearly not in that reference class is: 'Given utility function U, will action A have net-positive consequences?'.

Yeah, I do actually think that in practice this problem is in the reference class, and that we are much better at judging and critiquing/verifying outcomes compared to actually doing an outcome, as evidenced by the very large amount of people who do the former compared to the latter.

I'm talking about something a bit different, though: claiming in advance that A will have net-positive consequences vs verifying in advance that A will have net-positive consequences. I think that's a very real problem; a theoretical misaligned AI can hand us a million lines of code and say, 'Run this, it'll generate a cure for cancer and definitely not do bad things', and in many cases it would be difficult-to-impossible to confirm that.

We could, as Tegmark and Omohundro propose, insist that it provide us a legible and machine-checkable proof of safety before we run it, but then we're back to counting on all players to behave responsibly (although I can certainly imagine legislation / treaties that would help a lot there).

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T19:09:41.811Z · LW · GW

Thanks for the thoughtful responses. 

Another way to say it is we already have lots of evidence from other fields on whether verification is easier than generation, and the evidence shows that this is the case, so the mapping is already mostly given to us.

 

Note I'm referring to incorrectly judging compared to their internal values system, not incorrectly judging compared to another person's values.

I think there are many other cases where verification and generation are both extremely difficult, including ones where verification is much harder than generation. A few examples:

  • The Collatz conjecture is true.
  • The net effect of SB-1047 will be positive [given x values].
  • Trump will win the upcoming election.
  • The 10th Busy Beaver number is <number>.
  • Such and such a software system is not vulnerable to hacking[1].

I think we're more aware of problems in NP, the kind you're talking about, because they're ones that we can get any traction on.

So 'many problems are hard to solve but easy to verify' isn't much evidence that capabilities/alignment falls into that reference class.

I think that in particular, one kind of alignment problem that's clearly not in that reference class is: 'Given utility function U, will action A have net-positive consequences?'. Further, there are probably many cases where a misaligned AI may happen to know a fact about the world, one which the evaluator doesn't know or hasn't noticed, that means that A will have very large negative effects.

  1. ^

    This is the classic blue-team / red-team dichotomy; the defender has to think of and prevent every attack; the attacker only has to come up with a single vulnerability. Or in cryptography, Schneier's Law: 'Anyone can invent a security system so clever that she or he can't think of how to break it.'

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T16:08:51.581Z · LW · GW

I don't immediately find that piece very convincing; in short I'm skeptical that the author's claims are true for a) smarter systems that b) are more agentic and RL-ish. A few reasons:

  • The core difficulty isn't with how hard reward models are to train, it's with specifying a reward function in the first place in a way that's robust enough to capture all the behavior and trade-offs we want. LLMs aren't a good example of that, because the RL is a relatively thin layer on top of pretraining (which has a trivially specified loss function). o1 arguably shifts that balance enough that it'll be interesting to see how prosaically-aligned it is.
  • We have very many examples of reward misspecification and goal misgeneralization in RL; it's historically been quite difficult to adequately specify a reward function for agents acting in environments.
  • This becomes way more acute as capabilities move past the level where humans can quickly and easily choose the better output (eg as the basis for a reward model for RLHF).
  • That said, I do certainly agree that LLMs are reasonably good at understanding human values; maybe it's enough to have such an LLM judge proposed agent goals and actions on that basis and issue reward. It's not obvious to me that that works in practice, or is efficient enough to be practical.
  • I'm pretty skeptical of: '...it is significantly asymptotically easier to e.g. verify a proof than generate a new one...and this to some extent maps to the distinction between alignment and capabilities.' I think there's a lot of missing work there to be able to claim that mapping.
  • 'Very few problems are caused by incorrectly judging or misgeneralizing bad situations as good and vice-versa.' I think this is false. Consider 'Biden (/Trump) was a great president.' The world is full of situations where humans differ wildly on whether they're good or bad.

Maybe I've just failed to cross the inferential distance here, but on first read I'm pretty skeptical.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T15:36:13.728Z · LW · GW

I was just amused to see a tweet from Subbarao Kambhampati in which he essentially speculates that o1 is doing search and planning in a way similar to AlphaGo...accompanied by a link to his 'LLMs Can't Plan' paper. 

I think we're going to see some goalpost-shifting from a number of people in the 'LLMs can't reason' camp.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T15:28:38.408Z · LW · GW

alignment generalizing further than capabilities for pretty deep reasons

Do you have a link to a paper / LW post / etc on that? I'd be interested to take a look.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T14:52:52.534Z · LW · GW

Given the race dynamic and the fact that some major players don't even recognize safety as a valid concern, it seems extremely likely to me that at least some will take whatever shortcuts they can find (in the absence of adequate legislation, and until/unless we get a large warning shot).

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T14:50:18.752Z · LW · GW

Something I hadn't caught until my second read of OpenAI's main post today: we do at least get a handful of (apparent) actual chains of thought (search 'we showcase the chain of thought' to find them). They're extremely interesting.

  • They're very repetitive, with the model seeming to frequently remind itself of its current hypotheses and intermediate results (alternately: process supervision rewards saying correct things even if they're repetitious; presumably that trades off against a length penalty?).
  • The CoTs immediately suggest a number of concrete & straightforward strategies for improving the process and results; I think we should expect pretty rapid progress for this approach.
  • It's fascinating to watch the model repeatedly tripping over the same problem and trying to find a workaround (eg search for 'Sixth word: mynznvaatzacdfoulxxz (22 letters: 11 pairs)' in the Cipher example, where the model keeps having trouble with the repeated xs at the end). The little bit of my brain that can't help anthropomorphizing  these models really wants to pat it on the head and give it a cookie when it finally succeeds.
  • Again, it's unambiguously doing search (at least in the sense of proposing candidate directions, pursuing them, and then backtracking to pursue a different direction if they don't work out -- some might argue that this isn't sufficient to qualify).

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-17T12:54:01.895Z · LW · GW

GPT-o1's extended, more coherent chain of thought -- see Ethan Mollick's crossword puzzle test for a particularly long chain of goal-directed reasoning[1] -- seems like a relatively likely place to see the emergence of simple instrumental reasoning in the wild. I wouldn't go so far as to say I expect it (I haven't even played with o1-preview yet), but it seems significantly more likely than previous LLM models.

Frustratingly, for whatever reason OpenAI has chosen not to let users see the actual chain of thought, only a model-generated summary of it. We don't know how accurate the summary is, and it seems likely that it omits any worrying content (OpenAI: 'We also do not want to make an unaligned chain of thought directly visible to users').

This is unfortunate from a research perspective. Probably we'll eventually see capable open models along similar lines, and can do that research then.

[EDIT: to be clear, I'm talking here about very simple forms of instrumental reasoning. 'Can I take over the world to apply more compute to this problem' seems incredibly unlikely. I'm thinking about things more like, 'Could I find the answer online instead of working this problem out myself' or anything else of the form 'Can I take actions that will get me to the win, regardless of whether they're what I was asked to do?'.]

  1. ^

    Incidentally, the summarized crossword-solving CoT that Mollick shows is an exceptionally clear demonstration of the model doing search, including backtracking.

Comment by eggsyntax on If I wanted to spend WAY more on AI, what would I spend it on? · 2024-09-16T16:12:22.200Z · LW · GW

If AI agents could be trusted to generate a better signal/noise ratio by delegation than by working-alongside the AI (where the bottleneck is the human)

They can't typically (currently) do better on their own than working alongside a human, but a) a human can delegate a lot more tasks than they can collaborate on (and can delegate more cheaply to an AI than to another human), and b) though they're not as good on their own they're sometimes good enough.

Consider call centers as a central case here. Companies are finding it a profitable tradeoff to replace human call-center workers with AI even if the AI makes more mistakes, as long as it doesn't make too many mistakes.

Comment by eggsyntax on My disagreements with "AGI ruin: A List of Lethalities" · 2024-09-16T14:00:50.859Z · LW · GW

Okay, my simulation point was admittedly a bit of colorful analogy

Fair enough; if it's not load-bearing for your view, that's fine. I do remain skeptical, and can sketch out why if it's of interest, but feel no particular need to continue.

Comment by eggsyntax on My disagreements with "AGI ruin: A List of Lethalities" · 2024-09-15T23:16:28.025Z · LW · GW

It's the same reason for why we can't break out of the simulation IRL, except we don't have to face adversarial cognition, so the AI's task is even harder than our task.

This seems like it contains several ungrounded claims. Maybe I'm misreading you? But it seems weight-bearing for your overall argument, so I want to clarify.

  • We may not be in a simulation, in which case not being able to break out is no evidence of the ease of preventing breakout.
  • If we are in a simulation, then we only know that the simulation has been good enough to keep us in it so far, for a very short time since we even considered the possibility that we were in a simulation, with barely any effort put into trying to determine whether we're in one or how to break out. We might find a way to break out once we put more effort into it, or once science and technology advance a bit further, or once we're a bit smarter than current humans.
  • We have no idea whether we're facing adversarial cognition.
  • If we are in a simulation, it's probably being run by beings much smarter than human, or at least much more advanced (certainly humans aren't anywhere remotely close to being able to simulate an entire universe containing billions of sentient minds). For the analogy to hold, the AI would have to be way below human level, and by hypothesis it's not (since we're talking about AI smart enough to be dangerous). 

Comment by eggsyntax on AIS terminology proposal: standardize terms for probability ranges · 2024-09-15T21:01:15.236Z · LW · GW

Thanks! Quite similar to the Kesselman tags that @gwern uses (reproduced in this comment below), and I'd guess that one is descended from the other. Although it has somewhat different range cutoffs for each because why should anything ever be consistent.

Here are the UK ones in question (for ease of comparison):

Defence Intelligence: Probability Yardstick infographic

Comment by eggsyntax on Universal Basic Income and Poverty · 2024-09-15T12:31:37.021Z · LW · GW

I think a more central question would be: do a nontrivial number of people in those parts of Europe work at soul-crushing jobs with horrible bosses? If so, what is it that they would otherwise lack that makes them feel obligated to do so?

Comment by eggsyntax on OpenAI o1 · 2024-09-13T17:58:16.357Z · LW · GW

Just to make things even more confusing, the main blog post is sometimes comparing o1 and o1-preview, with no mention of o1-mini:

And then in addition to that, some testing is done on 'pre-mitigation' versions and some on 'post-mitigation', and in the important red-teaming tests, it's not at all clear what tests were run on which ('red teamers had access to various snapshots of the model at different stages of training and mitigation maturity'). And confusingly, for jailbreak tests, 'human testers primarily generated jailbreaks against earlier versions of o1-preview and o1-mini, in line with OpenAI’s policies. These jailbreaks were then re-run against o1-preview and GPT-4o'. It's not at all clear to me how the latest versions of o1-preview and o1-mini would do on jailbreaks that were created for them rather than for earlier versions. At worst, OpenAI added mitigations against those specific jailbreaks and then retested, and it's those results that we're seeing.

Comment by eggsyntax on Contra Yudkowsky on 2-4-6 Game Difficulty Explanations · 2024-09-13T16:52:48.330Z · LW · GW

In reality, every answer provides (at most) one bit of information

 

Quibble: this is true in expectation but not in reality. Suppose we're playing twenty questions, and my first question is, 'Is it a giraffe?' and you say yes. That's a lot more than one bit of information! It's just that in expectation those many bits are outweighed by the much larger number of cases where it isn't a giraffe and I get much less than one bit of information from your 'no' answer.
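To put numbers on the giraffe case (with a made-up prior): if you assigned probability $p = 2^{-10}$ to 'giraffe', then

$$-\log_2 p = 10 \text{ bits from a 'yes'}, \qquad -\log_2(1 - p) \approx 0.0014 \text{ bits from a 'no'},$$

and it's the expectation over both outcomes, $H(p) = -p\log_2 p - (1-p)\log_2(1-p) \approx 0.011$ bits here, that's capped at one bit (the cap is only reached at $p = 1/2$).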

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-13T15:59:45.975Z · LW · GW

Do you happen to have evidence that they used process supervision? I've definitely heard that rumored, but haven't seen it confirmed anywhere that I can recall.

when COTs are sampled from these language models in OOD domains, misgeneralization is expected.

Offhand, it seems like if they didn't manage to train that out, it would result in pretty bad behavior on non-STEM benchmarks, because (I'm guessing) output conditional on a bad CoT would be worse than output with no CoT at all. They showed less improvement on non-STEM benchmarks than on math/coding/reasoning, but certainly not a net drop. Thoughts? I'm not confident in my guess there.

I'm a little concerned that even if they find very problematic behaviour, they will blame it on clearly expected misgeneralization and therefore no significant steps will be taken, especially because there is no reputational damage

That's a really good point. As long as benchmark scores are going up, there's not a lot of incentive to care about whatever random stuff the model says, especially if the CoT is sometimes illegible anyway. Now I'm really curious about whether red-teamers got access to the unfiltered CoT at all.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-13T15:51:05.432Z · LW · GW

Oh, that's an interesting thought, I hadn't considered that. Different models seems like it would complicate the training process considerably. But different heads/MoE seems like it might be a good strategy that would naturally emerge during training. Great point, thanks.

Comment by eggsyntax on OpenAI o1 · 2024-09-13T15:46:34.959Z · LW · GW

'...after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.'

(from 'Hiding the Chains of Thought' in their main post)

Comment by eggsyntax on OpenAI o1 · 2024-09-13T15:39:59.527Z · LW · GW

I was puzzled by that latter section (my thoughts in shortform here). Buck suggests that it may be mostly a smokescreen around 'We don't want to show the CoT because competitors would fine-tune on it'.

Maybe they taught the model to reason about safety rules but not to obey them in the chain of thought itself.

That's my guess (without any inside information): the model knows the safety rules and can think-out-loud about them in the CoT (just as it can think about anything else) but they're not fine-tuning on CoT content for ‘policy compliance or user preferences’.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-13T15:34:47.182Z · LW · GW

Elsewhere @Wei Dai points out the apparent conflict between 'we cannot train any policy compliance or user preferences onto the chain of thought' (above) and the following from the Safety section (emphasis mine):

We found that integrating our policies for model behavior into the chain of thought of a reasoning model is an effective way to robustly teach human values and principles. By teaching the model our safety rules and how to reason about them in context...

Comment by eggsyntax on AIS terminology proposal: standardize terms for probability ranges · 2024-09-13T00:06:08.245Z · LW · GW

I was unfamiliar with the intelligence community's work in this area until it came up in another response to this post. And I hadn't run across the phrase 'words of estimative probability' at all until you mentioned it. Thank you!

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-12T23:15:12.201Z · LW · GW

I agree that it's quite plausible that the model could behave in that way, it's just not clear either way.

I disagree with your reasoning, though, in that to whatever extent GPT-o1 is misuse-resistant, that same resistance is available to the model when doing CoT, and my default guess is that it would apply the same principles in both cases. That could certainly be wrong! It would be really helpful if the system card had given a figure for how often the CoT contained inappropriate content, rather than just how often the CoT summary contained inappropriate content.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-12T21:44:55.636Z · LW · GW

If they're avoiding doing RL based on the CoT contents,

Note they didn’t say this. They said the CoT is not optimised for ‘policy compliance or user preferences’. Pretty sure what they mean is the didn’t train the model not to say naughty things in the CoT.

 

Fair point. I had imagined that there wouldn't be RL directly on CoT other than that, but on reflection that's false if they were using Ilya Sutskever's process supervision approach as was rumored.

CoT optimised to be useful in producing the correct answer is a very different object to CoT optimised to look good to a human, and a priori I expect the former to be much more likely to be faithful.

Agreed!
 

I think you might be overthinking this. The CoT has not been optimised not to say naughty things. OpenAI avoid putting out models that haven’t been optimised not to say naughty things. The choice was between doing the optimising, or hiding the CoT.

Maybe. But it's not clear to me that in practice it would say naughty things, since it's easier for the model to learn one consistent set of guidelines for what to say or not say than it is to learn two. Annoyingly, in the system card, they give a figure for how often the CoT summary contains inappropriate content (0.06% of the time) but not for how often the CoT itself does. What seems most interesting to me is that if the CoT did contain inappropriate content significantly more often, that would suggest that there's benefit to accuracy if the model can think in an uncensored way.

And even if it does, then sure, they might choose not to allow CoT display (to avoid PR like 'The model didn't say anything naughty but it was thinking naughty thoughts'), but it seems like they could have avoided that much more cheaply by just applying an inappropriate-content filter for the CoT content and filtering it out or summarizing it (without penalizing the model) if that filter triggers.
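To be concrete about the cheaper option, something like this is all I have in mind (pure sketch; the function names are placeholders for whatever classifier and summarizer they already run, not anything OpenAI has described):

```python
def display_cot(raw_cot: str, flags_policy_violation, summarize) -> str:
    """Show the raw chain of thought unless a cheap content filter objects.

    flags_policy_violation(text) -> bool: a single classifier pass over the CoT
    summarize(text) -> str: the expensive model-generated summary, only
        produced in the (presumably rare) case where the filter triggers
    """
    if flags_policy_violation(raw_cot):
        # No penalty touches the model here; we just don't display the raw CoT.
        return summarize(raw_cot)
    return raw_cot
```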

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-12T20:16:15.789Z · LW · GW

Copying a comment on this from @Buck elsewhere that seems pretty plausible:

my guess is that the true reason is almost 100% that they want to avoid letting other people fine-tune on their CoT traces

because it’s a competitive advantage

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-12T20:12:15.803Z · LW · GW

Thoughts on a passage from OpenAI's GPT-o1 post today:

We believe that a hidden chain of thought presents a unique opportunity for monitoring models. Assuming it is faithful and legible, the hidden chain of thought allows us to "read the mind" of the model and understand its thought process. For example, in the future we may wish to monitor the chain of thought for signs of manipulating the user. However, for this to work the model must have freedom to express its thoughts in unaltered form, so we cannot train any policy compliance or user preferences onto the chain of thought. We also do not want to make an unaligned chain of thought directly visible to users.

Therefore, after weighing multiple factors including user experience, competitive advantage, and the option to pursue the chain of thought monitoring, we have decided not to show the raw chains of thought to users. We acknowledge this decision has disadvantages. We strive to partially make up for it by teaching the model to reproduce any useful ideas from the chain of thought in the answer. For the o1 model series we show a model-generated summary of the chain of thought.

This section is interesting in a few ways:

  • 'Assuming it is faithful and legible' -- we have reason to believe that it's not, at least not on previous models, as they surely know. Do they have reason to believe that it is for o1, or are they just ignoring that issue?
  • 'we cannot train any policy compliance or user preferences onto the chain of thought' -- sure, legit. Although LLM experiments that use a "hidden" CoT or scratchpad may already show up enough times in the training data that I expect LLMs trained on the internet to understand that the scratchpads aren't really hidden. If they don't yet, I expect they will soon.
  • 'We also do not want to make an unaligned chain of thought directly visible to users.' Why?
    • I guess I can see a story here, something like, 'The user has asked the model how to build a bomb, and even though the model is prosaically-aligned, it might put instructions for building a bomb in the CoT before deciding not to share them.' But that raises questions.
    • Is this behavior they're actually seeing in the model? It's not obvious to me that you'd expect it to happen. If they're avoiding doing RL based on the CoT contents, then certainly it could happen, but it seems like it would be strictly more complex behavior, and so not very likely to spontaneously emerge.
    • Although I can also imagine a story where there'd be pressure for it to emerge. Can the model reason more clearly if it has the opportunity to think uncensored thoughts?
  • But also 'for the o1 model series we show a model-generated summary of the chain of thought.' It seems strange to spend a lot more forward passes to summarize the CoT as opposed to just doing a single pass through a model trained to detect content that violates policy and omitting the CoT if that triggers.
  • In addition to the previous justifications, they cite 'user experience' and 'competitive advantage'. The former seems silly at first blush; how will users' experience be negatively affected by a CoT that's hidden by default and that they never need to look at? I'm curious about what sort of 'competitive advantage' they're talking about. Maybe the CoT would reveal a highly-structured system prompt for how to do CoT that accounts for a lot of the decreased loss on reasoning tasks?

Comment by eggsyntax on AIS terminology proposal: standardize terms for probability ranges · 2024-09-11T13:42:03.627Z · LW · GW

Agreed that that's often an improvement! And as I say in the post, I do think that more often than not (> 50%) the best choice is to just give the numbers. My intended proposal is just to standardize in cases where the authors find it more natural to use natural language terms.

Comment by eggsyntax on AIS terminology proposal: standardize terms for probability ranges · 2024-09-10T15:37:43.793Z · LW · GW

I've just run across this data visualization, with data gathered through polling /r/samplesize (approximately reproducing a 1999 CIA analysis, CIA chart here), showing how people tend to translate terms like 'highly likely' into numerical probabilities. I'm showing the pretty joyplot version because it's so very pretty, but see the source for the more conventional and arguably more readable whisker plots.

 

https://raw.githubusercontent.com/zonination/perceptions/master/joy1.png

Comment by eggsyntax on Self-explaining SAE features · 2024-09-10T14:47:08.716Z · LW · GW

One common type of example of that is when a feature clearly activates in the presence of a particular word, but then when trying to use that to predict whether the feature will activate on particular text, it turns out that the feature activates on that word in some cases but not all, sometimes with no pattern I can discover to distinguish when it does and doesn't.

Comment by eggsyntax on Self-explaining SAE features · 2024-09-10T14:44:16.837Z · LW · GW

Thanks! I think the post (or later work) might benefit from a discussion of using your judgment as a proxy for accuracy, its strengths & weaknesses, maybe a worked example. I'm somewhat skeptical of human judgement because I've seen a fair number of examples of a feature seeming (to me) to represent one thing, and then that turning out to be incorrect on further examination (eg my explanation, when used by an LLM to score whether a particular piece of text should trigger the feature, turns out not to do a good job of that).

Comment by eggsyntax on AIS terminology proposal: standardize terms for probability ranges · 2024-09-08T15:50:21.668Z · LW · GW

Somewhat relevant: 'The Effects of Communicating Uncertainty on Public Trust in Facts and Numbers' and @Jeffrey Heninger's discussion of it on the AI Impacts blog.

Comment by eggsyntax on LLM Generality is a Timeline Crux · 2024-09-08T13:39:06.558Z · LW · GW

Yes, but note in the simulator/Bayesian meta-RL view, it is important that the LLMs do not "produce a response": they produce a prediction of 'the next response'.

Absolutely! In the comment you're responding to I nearly included a link to 'Role-Play with Large Language Models'; the section there on playing 20 questions with a model makes that distinction really clear and intuitive in my opinion.

there's many different meanings which are still possible, and you're not sure which one is 'true' but they all have a lot of different posterior probabilities by this point, and you hedge your bets as to the exact next token

Just for clarification, I think you're just saying here that the model doesn't place all its prediction mass on one token but instead spreads it out, correct? Another possible reading is that you're saying that the model tries to actively avoid committing to one possible meaning (ie favors next tokens that maintain superposition), and I thought I remembered seeing evidence that they don't do that.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-07T17:04:22.406Z · LW · GW

One point I make in 'LLM Generality is a Timeline Crux': if reliability is the bottleneck, that seems like a substantial point in favor of further scaling solving the problem. If it's a matter of getting from, say, 78% reliability on some problem to 94%, that seems like exactly the sort of thing scaling will fix (since in fact we've seen Number Go Up with scale on nearly all capabilities benchmarks). Whereas that seems less likely if there are some kinds of problems that LLMs are fundamentally incapable of, at least on the current architectural & training approach.

Comment by eggsyntax on eggsyntax's Shortform · 2024-09-07T16:52:02.079Z · LW · GW

I do think reliability is quite important. As one potential counterargument, though, you can get by with lower reliability if you can add additional error checking and error correcting steps. The research I've seen is somewhat mixed on how good LLMs are at catching their own errors (but I haven't dived into it deeply or tried to form a strong opinion from that research).
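To gesture at the arithmetic with made-up numbers: if a model gets a given step right with probability p = 0.78 per attempt, and failures can be reliably detected so that you can retry, then k independent attempts succeed with probability

$$1 - (1 - p)^k, \qquad \text{e.g. } 1 - (1 - 0.78)^3 \approx 0.989,$$

which is part of why how well LLMs can catch their own errors matters so much here.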