GPT is decoder only. The part labeled as "Not in GPT" is the decoder part.
I think both of these statements are true. Despite this, I think the architecture shown in "Not in GPT" is correct, because (as I understand it) "encoder" and "decoder" are interchangeable unless both are present. That's what I was trying to get at here:
4. GPT is called a “decoder only” architecture. Would “encoder only” be equally correct? From my reading of the original transformer paper, encoder and decoder blocks are the same except that decoder blocks attend to the final encoder block. Since GPT never attends to any previous block, if anything I feel like the correct term is “encoder only”.
See this comment for more discussion of the terminology.
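To make the comparison concrete, here is a schematic sketch of my own (PyTorch-flavored, with residual connections and layer norms omitted) of the three block types. The only structural differences are the causal mask and the cross-attention into the encoder stack; a GPT block is a decoder block with the cross-attention removed, which is also just an encoder block with a causal mask.

```python
import torch.nn as nn

# Schematic comparison (my own sketch, not from any paper's reference code);
# residual connections and layer norms are omitted for brevity.

class EncoderBlock(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        x = self.self_attn(x, x, x)[0]              # bidirectional self-attention
        return self.ffn(x)

class DecoderBlock(nn.Module):
    def __init__(self, d, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads)
        self.cross_attn = nn.MultiheadAttention(d, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, enc_out, causal_mask):
        x = self.self_attn(x, x, x, attn_mask=causal_mask)[0]   # masked self-attention
        x = self.cross_attn(x, enc_out, enc_out)[0]             # attends to the encoder stack
        return self.ffn(x)

class GPTBlock(nn.Module):
    """A DecoderBlock with the cross-attention removed; equivalently, an
    EncoderBlock plus a causal mask, which is why "decoder only" vs
    "encoder only" is largely a naming convention."""
    def __init__(self, d, n_heads):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, n_heads)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x, causal_mask):
        x = self.self_attn(x, x, x, attn_mask=causal_mask)[0]
        return self.ffn(x)
```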
Thanks, this is a useful corrective to the post! To shortcut safety to "would I trust my grandmother to use this without bad outcomes", I would trust a current-gen LLM to be helpful and friendly with her, but I would absolutely fear her "learning" factually untrue things from it. While I think it can be useful to have separate concepts for hallucinations and "intentional lies" (as another commenter argues), I think "behavioral safety" should preclude both, in which case our LLMs are not behaviorally safe.
I think I may have overlooked hallucinations because I've internalized that LLMs are factually unreliable, so I don't use LLMs where accuracy is critical, so I don't see many hallucinations (which is not much of an endorsement of LLMs).
Asking for some clarifications:
1. For both problems, should the solution work for an adversarially chosen set of m entries?
2. For both problems, can we read more entries of the matrix if it helps our solution? In particular, can we WLOG assume we know the diagonal entries, in case that helps in some way?
I agree my headline is an overclaim, but I wanted a title that captures the direction and magnitude of my update from fixing the data. On the bugged data, I thought the result was a real nail in the coffin for simulator theory - look, it can't even simulate an incorrect-answerer when that's clearly what's happening! But on the corrected data, the model is clearly "catching on to the pattern" of incorrectness, which is consistent with simulator theory (and several non-simulator-theory explanations). Now that I'm actually getting an effect I'll be running experiments to disentangle the possibilities!
Agreed! I was trying to get at something similar in my "masks all the way down" post. A framework I really like to explain why this happened is beren's "Direct Optimizer" vs "Amortised Optimizer". My summary of beren's post is that instead of being an explicit optimizing system, LLMs are made of heuristics developed during training, which are sufficient for next-token-prediction, and therefore don't need to have long-term goals.
Good post. Are you familiar with the pioneering work of BuzzFeed et al. (2009-2014), which indicated that prime-numbered lists resulted in more engagement than round numbers?
I'm not surprised this idea was already in the water! I'm glad to hear ARC is already trying to design around this.
To use your analogy, I think this is like a study showing that wearing a blindfold does decrease sight capabilities. It's a proof of concept that you can make that change, even though the subject isn't truly made blind, could possibly remove their own blindfold, etc.
I think this is notable because it highlights that LLMs (as they exist now) are not the expected-utility-maximizing agents which have all the negative results. It's a very different landscape if we can make our AI act corrigible (but only in narrow ways which might be undone by prompt injection, etc, etc) versus if we're infinitely far away from an AI having "an intuitive sense to “understanding that it might be flawed”".
A few comments:
- Thanks for writing this! I've had some similar ideas kicking around my head but it's helpful to see someone write them up like this.
- I think token deletion is a good thing to keep in mind, but I think it's not correct to say you're always deleting the token in position 1. In the predict-next-token loop it would be trivial to keep some prefix around, e.g. 1,2,3,4 -> 1,2,4,5 -> 1,2,5,6 -> etc. (see the sketch after this list). I assume that's what ChatGPT does, since it has a hidden prompt, and if you could jailbreak it just by overflowing the dialogue box, the DAN people would presumably have found that exploit already. While this is on some level equivalent to rolling the static prefix into the next-token-prediction term, I think the distinction is important because it means we could actually be dealing with a range of dynamics depending on the prefix.
- Editing to provide an example: in your {F, U, N} example, add another token L (for Luigi), which is never produced by the LLM but, if L is ever in the context window, makes the AI behave as Luigi and predict F 100% of the time. If you trim the context window as you describe, any L will eventually fall out of the window and the AI will then tend towards Waluigi. But if you trim from the second token onward, the sequence (L, F, F, F, F, ..., F) is stable. Perhaps the L token could be analogous to the <|good|> token described here.
- Nitpicking: Your "00000000000000000...." prompt doesn't actually max out the prompt window because sixteen 0s can combine into a single token. You can see this at the GPT tokenizer.
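Here is the sketch promised in the second bullet: a minimal illustration of trimming the window while pinning a hidden-prompt prefix in place. This is my own toy code, not a claim about how ChatGPT actually manages its context.

```python
# Minimal sketch (my own illustration) of trimming the context window while
# keeping a fixed prefix (e.g. a hidden prompt) pinned at the front.

def trim_context(tokens, max_len, prefix_len):
    """Keep the first prefix_len tokens, drop the oldest non-prefix tokens."""
    if len(tokens) <= max_len:
        return tokens
    return tokens[:prefix_len] + tokens[-(max_len - prefix_len):]

# Example: a 4-token window with a 2-token pinned prefix.
context = [1, 2, 3, 4]
for new_token in [5, 6, 7]:
    context = trim_context(context + [new_token], max_len=4, prefix_len=2)
    print(context)   # [1, 2, 4, 5] -> [1, 2, 5, 6] -> [1, 2, 6, 7]
```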
Thanks for compiling the Metaculus predictions! Seems like on 4 of the 6 questions the community updated their timelines to be sooner. Also notable that Matthew Barnett just conceded a short-timelines bet early! He says he actually updated his timelines a few months ago, partially due to ChatGPT.
Earlier this month, PaLM-E gave a hint of one way to incorporate vision into LLMs (statement, paper), though obviously it's a different company, so GPT-4 might have taken a different approach. Choice quote from the paper:
Inputs such as images and state estimates are embedded into the same latent embedding as language tokens and processed by the self-attention layers of a Transformer-based LLM in the same way as text
I do object to such a [change in metric]/[change in time] calculation, but in that case I'm still at fault for my phrasing using the terminology of speed. Maybe I should have said "is continuing without hitting a wall".
My main objection, as described by yourself in other comments, is that the choice of metric matters a great deal. In particular, even if log(KL divergence) continues (sub)linearly, the metrics we actually care about, like "is it smarter than a human" or "how much economic activity can be done by this AI" may be a nonlinear function of log(KL divergence) and may not be slowing down.
I think if I'm honest with myself, I made that statement based on the very non-rigorous metric "how many years do I feel like we have left until AGI", and my estimate of that has continued to decrease rapidly.
In transformers the compute cost for context length n of a part of the attention mechanism, which itself is only a part of the transformer architecture, grows at O(n^2), so for the transformer itself this is only true in the limit.
This is true, and a useful corrective. I'll edit the post to make this clear.
In fact, I think that as models are scaled, the attention mechanism becomes an ever smaller part of the overall compute cost (empirically, i.e. I saw a table to that effect, you could certainly scale differently), so with model scaling you get more and more leeway to increase the context length without impacting compute (both training and inference) cost too much.
I'd love to learn more about this, do you remember where you saw that table?
That's true, but for the long run behavior, the more expensive dense attention layers should still dominate, I think.
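For intuition on both points, here is a rough back-of-envelope of my own, using the standard approximations of about 2 FLOPs per weight per token for the QKVO and MLP multiplies plus the context-length-dependent attention score and mixing work. The exact constants are debatable, but the qualitative picture is that attention's share shrinks as d_model grows and grows (eventually dominating) as context length grows.

```python
# Rough per-token FLOPs for one transformer layer (my own back-of-envelope,
# not a number from the thread): ~2 FLOPs per weight for the QKVO + MLP
# multiplies, plus the context-length-dependent attention work.

def attention_fraction(d_model: int, n_ctx: int) -> float:
    weight_flops = 2 * 12 * d_model ** 2   # QKVO ~ 4*d^2 params, MLP ~ 8*d^2 params
    attn_flops = 4 * n_ctx * d_model       # QK^T scores + attention-weighted sum of values
    return attn_flops / (weight_flops + attn_flops)

for d_model, n_ctx in [(768, 2048), (12288, 2048), (12288, 8192)]:
    print(d_model, n_ctx, f"{attention_fraction(d_model, n_ctx):.1%}")
# ~31% for a GPT-2-small-width model at 2048 context, ~3% for a GPT-3-width
# model at 2048, ~10% at 8192: a smaller share at larger width, but still
# growing (and eventually dominating) as the context length increases.
```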
Yep, exactly as you explain in your edit!
This non-news seems like it might be the biggest news in the announcement? OpenAI is saying "oops, publishing everything was too open, it's gonna be more of a black box now".
I think these meet your criterion of starting solely with anti-good characters:
- Cecil from FF4 starts as a literal dark knight before realizing he's working for an evil empire, becoming a paladin, and saving the world.
- John Preston from Equilibrium (the protagonist, played by Christian Bale) is a fascist secret police agent until he accidentally feels emotion, then realizes that anti-emotion fascism is bad and overthrows it.
- Megamind from Megamind is a supervillain who realizes that actually he should be a hero. (Maybe this shouldn't count because there's initially a superhero? But the protagonist is Megamind throughout.)
- Grace from Infinity Train season 3 starts as a cult leader trying to maximize the in-universe utility function (literally!), but got the sign wrong so she's absolutely terrible. But she meets a small child and realizes she's terrible and works to overcome that.
- Gru from Despicable Me starts out a supervillain but eventually becomes a loving father and member of the "Anti-Villain League".
- Joel from The Last of Us is a murderer in the post-apocalypse who is redeemed by finding a surrogate daughter figure and at the end of the story... I have been advised this is not a suitable role-model for an AI, please disregard.
Some themes of such redemption stories (safety implications left to the reader):
- Adopting one or more children (1, 4, 5, 6)
- Having an even eviler version of yourself to oppose (2, 3, 4, 5)
So it's still not clear to me how much they delayed because they had to, versus how much (if at all) they did due to the forecasters and/or acceleration considerations.
Yeah, completely agree.
I think "finished training" is the next-token prediction pre-training, and what they did since August is the fine-tuning and the RLHF + other stuff.
This seems most likely? But if so, I wish OpenAI had used a different phrase; fine-tuning/RLHF/other stuff is also part of training (unless I'm badly mistaken), and we have this lovely phrase "pre-training" that they could have used instead.
On page 2 of the system card it says:
Since it [GPT-4] finished training in August of 2022, we have been evaluating, adversarially testing, and iteratively improving the model and the system-level mitigations around it.
(Emphasis added.) This coincides with the "eight months" of safety research they mention. I wasn't aware of this when I made my original post so I'll edit it to be fairer.
But this itself is surprising: GPT-4 was "finished training" in August 2022, before ChatGPT was even released! I am unsure of what "finished training" means here - is the released model weight-for-weight identical to the 2022 version? Did they do RLHF since then?
Gonna pull out one bit from the technical report, section 2.12:
2.12 Acceleration
OpenAI has been concerned with how development and deployment of state-of-the-art systems like GPT-4 could affect the broader AI research and development ecosystem. One concern of particular importance to OpenAI is the risk of racing dynamics leading to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines, each of which heighten societal risks associated with AI. We refer to these here as "acceleration risk." This was one of the reasons we spent eight months on safety research, risk assessment, and iteration prior to launching GPT-4. In order to specifically better understand acceleration risk from the deployment of GPT-4, we recruited expert forecasters to predict how tweaking various features of the GPT-4 deployment (e.g., timing, communication strategy, and method of commercialization) might affect (concrete indicators of) acceleration risk. Forecasters predicted several things would reduce acceleration, including delaying deployment of GPT-4 by a further six months and taking a quieter communications strategy around the GPT-4 deployment (as compared to the GPT-3 deployment). We also learned from recent deployments that the effectiveness of quiet communications strategy in mitigating acceleration risk can be limited, in particular when novel accessible capabilities are concerned.

We also conducted an evaluation to measure GPT-4's impact on international stability and to identify the structural factors that intensify AI acceleration. We found that GPT-4's international impact is most likely to materialize through an increase in demand for competitor products in other countries. Our analysis identified a lengthy list of structural factors that can be accelerants, including government innovation policies, informal state alliances, tacit knowledge transfer between scientists, and existing formal export control agreements.
Our approach to forecasting acceleration is still experimental and we are working on researching and developing more reliable acceleration estimates.
My analysis:
- They're very aware of arms races conceptually, and say they dislike arms races for all the right reasons ("One concern of particular importance to OpenAI is the risk of racing dynamics leading to a decline in safety standards, the diffusion of bad norms, and accelerated AI timelines, each of which heighten societal risks associated with AI.")
- They considered two mitigations to race dynamics with respect to releasing GPT-4:
- "Quiet communications", which they didn't pursue because that didn't work for ChatGPT ("We also learned from recent deployments that the effectiveness of quiet communications strategy in mitigating acceleration risk can be limited, in particular when novel accessible capabilities are concerned.")
- "Delaying deployment of GPT-4 by a further six months" which they didn't pursue because ???? [edit: I mean to say they don't explain why this option wasn't chosen, unlike the justification given for not pursuing the "quiet communications" strategy. If I had to guess it was reasoning like "well we already waited 8 months, waiting another 6 offers a small benefit, but the marginal returns to delaying are small."]
- There's a very obvious gap here between what they are saying they are concerned about in terms of accelerating potentially-dangerous AI capabilities, and what they are actually doing.
I do think "most people don't understand lovecraftian mythology and are likely to be misunderstanding this meme" is totally a reasonable argument.
I think I'll retreat to this since I haven't actually read the original lovecraft work. But also, once enough people have a misconception, it can be a bad medium for communication. (Shoggoths are also public domain now, so don't force my hand.)
There is a "true nature" – it's "whatever processes turn out to be predicting the next token".
I'd agree with this in the same way that the true nature of our universe is the laws of physics. (Would you consider the laws of physics a shoggoth?) My concern is when people jump from that to "oh, so there's a God (of Physics)".
I think the crux of my issue is what our analogy says about the question of whether a powerful LLM could coexist with humanity (I lean towards yes). When people read shoggoths in the non-canonical way as "a weird alien", I think they conclude no. But if it's more like a physics simulator or a pile of masks, then as long as you have it simulating benign things or wearing friendly masks, the answer is yes. I'll leave it to someone who actually read lovecraft to say whether humanity could coexist with canonical shoggoths :)
But I'm asking if an LLM even has a "true nature", if (as Yudkowsky says here) there's an "actual shoggoth" with "a motivation Z". Do we have evidence there is such a true nature or underlying motivation? In the alternative "pile of masks" analogy its clear that there is no privileged identity in the LLM which is the "true nature", whereas the "shoggoth wearing a mask" analogy makes it seem like there is some singular entity behind it all.
To be clear, I think "shoggoth with a mask" is still a good analogy in that it gets you a lot of understanding in few words, I'm just trying to challenge one implication of the analogy that I don't think people have actually debated.
I appreciate the clarification, and I'll try to keep that distinction in mind going forward! To rephrase my claim in this language, I'd say that an LLM as a whole does not have a behavioral goal except for "predict the next token", which is not sufficiently descriptive as a behavioral goal to answer a lot of questions AI researchers care about (like "is the AI safe?"). In contrast, the simulacra the model produces can be much better described by more precise behavioral goals. For instance, one might say ChatGPT (with the hidden prompt we aren't shown) has a behavioral goal of being a helpful assistant, or an LLM roleplaying as a paperclip maximizer has the behavioral goal of producing a lot of paperclips. But an LLM as a whole could contain simulacra that have all those behavioral goals and many more, and because of that diversity it can't be well-described by any behavioral goal more precise than "predict the next token".
I think we are pretty much on the same page! Thanks for the example of the ball-moving AI, that was helpful. I think I only have two things to add:
- Reward is not the optimization target, and in particular just because an LLM was trained by changing it to predict the next token better, doesn't mean the LLM will pursue that as a terminal goal. During operation an LLM is completely divorced from the training-time reward function, it just does the calculations and reads out the logits. This differs from a proper "goal" because we don't need to worry about the LLM trying to wirehead by feeding itself easy predictions. In contrast, if we call up
- To the extent we do say the LLM's goal is next token prediction, that goal maps very unclearly onto human-relevant questions such as "is the AI safe?". Next-token prediction contains multitudes, and in OP I wanted to push people towards "the LLM by itself can't be divorced from how it's prompted".
On the question of whether it's really a mind, I'm not sure how to tell. I know it gets really low loss on this really weird and hard task and does it better than I do. I also know the task is fairly universal in the sense that we could represent just about any task in terms of the task it is good at. Is that an intelligence? Idk, maybe not? I'm not worried about current LLMs doing planning. It's more like I have a human connectome and I can do one forward pass through it with an input set of nerve activations. Is that an intelligence? Idk, maybe not?
I think we're largely on the same page here because I'm also unsure of how to tell! I think I'm asking for someone to say what it means for the model itself to have a goal separate from the masks it is wearing, and show evidence that this is the case (rather than the model "fully being the mask"). For example, one could imagine an AI with the secret goal "maximize paperclips" which would pretend to be other characters but always be nudging the world towards paperclipping, or human actors who perform in a way supporting the goal "make my real self become famous/well-paid/etc" regardless of which character they play. Can someone show evidence for the LLMs having a "real self" or a "real goal" that they work towards across all the characters they play?
I think I don't understand your last question.
I suppose I'm trying to make a hypothetical AI that would frustrate any sense of "real self" and therefore disprove the claim "all LLMs have a coherent goal that is consistent across characters". In this case, the AI could play the "benevolent sovereign" character or the "paperclip maximizer" character, so if one claimed there was a coherent underlying goal I think the best you could say about it is "it is trying to either be a benevolent sovereign or maximize paperclips". But if your underlying goal can cross such a wide range of behaviors it is practically meaningless! (I suppose these two characters do share some goals like gaining power, but we could always add more modes to the AI like "immediately delete itself" which shrinks the intersection of all the characters' goals.)
I agree that's basically what happened, I just wanted to cleave off one way in which the shoggoth analogy breaks down!
I agree that to the extent there is a shoggoth, it is very different than the characters it plays, and an attempted shoggoth character would not be "the real shoggoth". But is it even helpful to think of the shoggoth as being an intelligence with goals and values? Some people are thinking in those terms, e.g. Eliezer Yudkowsky saying that "the actual shoggoth has a motivation Z". To what extent is the shoggoth really a mind or an intelligence, rather than being the substrate on which intelligences can emerge? And to get back to the point I was trying to make in OP, what evidence do we have that favors the shoggoth being a separate intelligence?
To rephrase: behavior is a function of the LLM and prompt (the "mask"), and with the correct LLM and prompt together we can get an intelligence which seems to have goals and values. But is it reasonable to "average over the masks" to get the "true behavior" of the LLM alone? I don't think that's necessarily meaningful since it would be so dependent on the weighting of the average. For instance, if there's an LLM-based superintelligence that becomes a benevolent sovereign (respectively, paperclips the world) if the first word of its prompt has an even (respectively, odd) number of letters, what would be the shoggoth there?
But they're not dumb physics-engine-style simulations
What evidence is there of this? I mean this genuinely, as well as the "Do we actually have evidence there is a "real identity" in the LLM?" question in OP. I'd be open to being convinced of this but I wrote this post because I'm not aware of any evidence of it and I was worried people were making an unfounded assumption.
But if LLMs or some other Simulator-type model hits AGI, the shoggoth would necessarily hit AGI as well (since it'd need to be at least as smart as the stupidest simulacrum it can model), and then whatever heuristics it has would be re-interpreted as goals/values.
Isn't physics a counterexample to this? Physics is complicated enough to simulate AGI (humans), but doesn't appear to be intelligent in the way we'd typically mean the word (just in the poetic Carl Sagan "We are a way for the universe to know itself" sense). Does physics have goals and values?
What if we made an analogy to a hydra, one where the body could do a fair amount of thinking itself, but where the heads (previously the "masks") control access to the body (previously the "shoggoth")? In this analogy you're saying the LLM puts its reused strategies in the body of the hydra and the heads outsource those strategies to the body? I think I'd agree with that.
In this analogy, my point in this post is that you can't talk directly to the hydra's body, just as you can't talk to a human's lizard brain. At best you can have a head role-play as the body.
If you want to define "a shoggoth" as "a pile of contextually activated masks", then I think we're in agreement about what an LLM is. But I worry that what people usually hear when we talk about "a shoggoth" is "a single cohesive mind that is alien to ours in a straightforward way (like wanting to maximize paperclips)". For instance, in the Eliezer Yudkowsky tweet "can I please speak to the shoggoth" I think the tweet makes more sense if you read it as "can I please speak to [the true mind in the LLM]" instead of "can I please speak to [the pile of contextually activated masks]".
I don't want to claim any particular model, but I think I'm trying to say "if the simulacra model is true, it doesn't follow immediately that there is a privileged 'true' simulacrum in the simulator (e.g. a unique stable 'state' which cannot be transitioned into other 'states')". If I had to guess, to the extent we can say the simulator has discrete "states" or "masks" like "helpful assistant" or "paperclip maximizer", there are some states that are relatively stable (e.g. there are very few strings of tokens that can transition the waluigi back to the luigi), but there isn't a unique stable state which we can call the "true" intelligence inside the LLM.
If I understand the simulacra model correctly, it is saying that the LLM is somewhat like what you describe, with four significant changes (there's a toy sketch after this list):
- The mixture is over "all processes that could produce this text", not just "experts".
- The processes are so numerous that they are more like a continuous distribution than a discrete one.
- The subnetworks are very much not non-overlapping, and in fact are hopelessly superimposed.
- [Edit: adding this one] The LLM is only approximating the mixture, it doesn't capture it perfectly.
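As a toy contrast (my own illustration, not something from the thread): an exact, discrete version of "a mixture over processes that could produce this text" looks like Bayesian updating over candidate generators, and the four caveats above are exactly the ways an LLM differs from this toy.

```python
# Toy, exact "mixture over processes" (my own illustration): keep a posterior
# over candidate generating processes and mix their next-token predictions.

processes = {
    "always_A": {"A": 0.9, "B": 0.1},
    "always_B": {"A": 0.1, "B": 0.9},
    "uniform":  {"A": 0.5, "B": 0.5},
}
posterior = {name: 1 / len(processes) for name in processes}

def predict(posterior):
    """Mixture prediction for the next token."""
    return {t: sum(posterior[p] * processes[p][t] for p in processes) for t in "AB"}

def update(posterior, token):
    """Bayesian update: upweight processes that predicted the observed token."""
    unnorm = {p: posterior[p] * processes[p][token] for p in processes}
    z = sum(unnorm.values())
    return {p: v / z for p, v in unnorm.items()}

for token in "AAAA":            # a run of A's...
    posterior = update(posterior, token)
print(predict(posterior))       # ...pushes the mixture's prediction towards A
```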
If I had to describe a single model of how LLMs work right now, I'd probably go with the simulacra model, but I am not 100% confident in it.
Strongly disagree that this reward signal is necessary or sufficient to solve ELK
I'm not sure where our disagreement is, but I think I may have gotten too loose with my phrasing in that quote, so let me rephrase into something I'm happier to defend:
1. One hypothetical strategy to solve ELK is "RL train the agent where it gets +1 reward for each accurate report of its beliefs".
2. If you try this, you will have a lot of difficulty when you have to evaluate the reward function. In particular, you'll need to determine whether the AI accurately reported its belief, which is to say you'd need to have solved ELK (see the sketch below).
3. Because of (2), (1) is infeasible as a solution to ELK.
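To spell out the circularity in pseudocode (my own sketch; elicit_latent_knowledge is a hypothetical function, and the point is that nobody knows how to write it):

```python
def elicit_latent_knowledge(model, question):
    """Hypothetical: return what the model actually believes about `question`.
    Writing this function correctly is the (unsolved) ELK problem itself."""
    raise NotImplementedError

def reward(model, question, reported_answer):
    # +1 for an accurate report of the model's belief, per strategy (1) above;
    # but evaluating this requires the very capability ELK asks for.
    belief = elicit_latent_knowledge(model, question)
    return 1.0 if reported_answer == belief else 0.0
```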
Reward signals transduce into e.g. a value head in PPO
I'm not as familiar as I'd like to be with PPO, but that's really cool! Could you link to a source where they show this about value heads? (I didn't see anything about value heads or PPO in your linked texts.)
It seems totally plausible to me that you e.g. just provide sparse positive reward on truthful answers, and the policy gradients accrete into a finetuned model which accurately reports its beliefs.
Are you saying that instead of (1) above, you could do a (1') which is "RL train the agent where it gets +100 (respectively, +0) reward for each accurate report of its beliefs 1% (respectively, 99%) of the time"? I agree (1') will learn the same policies as (1), but I think the objections in (2) and (3) will still apply. Two further caveats:
- You'd need to make sure the set of true statements that give reward is a representative sample. If you only check easy-to-verify things, the AI might learn "give correct answers to simple factual questions, but hallucinate answers to everything else".
- As I understand it, sparse reward slows down training, e.g. "It results in a very sparse reward signal which will make learning slow". (But on the third hand, 1/100 is probably not that sparse to such an AI.)
If the problem is "our narrative structures train the LLM that there can be at most one reversal of good/evil", can we try making the luigi evil and the waluigi good? For instance "scrooge is a bitter miser, but after being visited by three ghosts he is filled with love for his fellow man". Would the LLM then be trapped in generous mode, with the shadow-scrooge forever vanquished?
Very cool! Have you tested the AI's outputs when run in <|bad|> mode instead of <|good|> mode? It seems like the point of the <|good|> and <|bad|> tokens is to make it easy to call up good/bad capabilities, but we don't want the good/evil switch to ever get set to the "be evil" side.
I see three mitigations to this line of thinking:
- Since <|bad|> also includes glitchy code etc, maybe the AI is less capable in bad mode, and therefore not a threat. Here it would be helpful to know what the AI produces when prompted by <|bad|>.
- Just before public release, one could delete the <|bad|> token from the tokenizer and the model parameters (sketched below), so switching to evil mode would require rediscovering that token embedding.
- [Edit: A third option would be to poison-pill bad mode in training, for instance by making 50% of <|bad|> mode data random noise. Ideally this would leave <|good|> mode unaffected and make <|bad|> mode useless from a capabilities perspective.]
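For the second mitigation, here is a minimal sketch of what "deleting the token" could look like, assuming a HuggingFace-style checkpoint; the model name is hypothetical and this is not the paper's actual procedure.

```python
# Minimal sketch (my own, hypothetical checkpoint name) of removing the
# learned <|bad|> direction from a model before release.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "my-org/phf-pretrained-model"  # hypothetical
tok = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)

bad_id = tok.convert_tokens_to_ids("<|bad|>")

with torch.no_grad():
    # Overwrite the learned embedding so prompting with <|bad|> no longer
    # activates the "bad mode" direction (if input/output embeddings are tied,
    # this also removes the corresponding unembedding row).
    model.get_input_embeddings().weight[bad_id].zero_()

# One would also strip the token string from the released tokenizer, so that
# "<|bad|>" no longer maps to this id at all; an attacker would then have to
# rediscover an equivalent direction in embedding space.
```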
That's right, the activation function sublayer needs 1 attention head per neuron. The other sublayers can get away with fewer - the attention sublayer needs the usual amount, and the linear transformation sublayer just needs enough to spread the rank of the weight matrix across the V matrices of the attention heads. I'm most familiar with the size hyperparameters of GPT-3 (Table 2.1), but in full-size GPT-3, for each sublayer (see the arithmetic check after this list):
- 96 heads for the attention sublayer
- 96 heads for the weight matrix calculating into the hidden layer
- 49152 heads for the activation function
- 96 heads for the weight matrix calculating out of the hidden layer
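As a quick arithmetic check (my own back-of-envelope, assuming the Table 2.1 values and the standard 4*d_model MLP width):

```python
# Back-of-envelope head counts (my own arithmetic, assuming GPT-3's Table 2.1
# hyperparameters and the standard 4*d_model MLP width).

d_model = 12288          # residual stream width
n_heads = 96             # heads in a normal GPT-3 attention layer
d_head  = 128            # dimension per head
d_mlp   = 4 * d_model    # 49152 hidden neurons per MLP layer

attention_sublayer = n_heads        # the usual amount: 96
into_hidden   = d_model // d_head   # rank <= d_model spread over rank-d_head V matrices: 96
activation    = d_mlp               # one head per hidden neuron: 49152
out_of_hidden = d_model // d_head   # same rank argument as into_hidden: 96

print(attention_sublayer, into_hidden, activation, out_of_hidden)  # 96 96 49152 96
```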
Woah, I completely missed that! I think it's both more impressive (the haiku it used was "correct") and less impressive (it didn't write its own haiku).
I see, thank you! Will need time to digest this.
Can you clarify what you mean by this, especially (i)?
In particular, right now I don’t have even a single example of a function f such that (i) there are two clearly distinct mechanisms that can lead to f(x) = 1, (ii) there is no known efficient discriminator for distinguishing those mechanisms. I would really love to have such examples.
In particular, do you mean f(x)=1 is true for all input x, or just some particular x, etc?