Posts
Comments
I feel like people publish articles like this all the time, and usually when you do surveys these people definitely prefer to have the option to take this job instead of not having it, and indeed frequently this kind of job is actually much better than their alternatives.
Insofar as you're arguing with me for posting this, I... never claimed that that wasn't true?
I didn't leave it as a "simple" to-do, but rather an offer to collaboratively hash something out.
That said: If people don't even know what it would look like when they see it, how can one update on evidence? What is Nate looking at which tells him that GPT doesn't "want things in a behavioralist sense"? (I bet he's looking at something real to him, and I bet he could figure it out if he tried!)
I claim we are many scientific insights away from being able to talk about these questions at the level of precision necessary to make predictions like this.
To be clear, I'm not talking about formalizing the boundary. I'm talking about a bet between people, adjudicated by people.
(EDIT: I'm fine with a low sensitivity, high specificity outcome -- we leave it unresolved if it's ambiguous / not totally obvious relative to the loose criteria we settled on. Also, the criterion could include randomly polling n alignment / AI people and asking them how "behaviorally-wanting" the system seemed on a Likert scale. I don't think you need fundamental insights for that to work.)
Closest to the third, but I'd put it somewhere between .1% and 5%. I think 15% is way too high for some loose speculation about inductive biases, relative to the specificity of the predictions themselves.
(I think you're still playing into an incorrect frame by talking about "simplicity" or "speed biases.")
Bold claim. Want to make any concrete predictions so that I can register my different beliefs?
Excellent retrospective/update. I'm intimately familiar with the emotional difficulty of changing your mind and admitting you were wrong.
Friendship is Optimal is science fiction:
On this 11th anniversary of the release of Friendship is Optimal, I’d like to remind everyone that it’s a piece of speculative fiction and was a product of it’s time. I’ve said this before in other venues, but Science Marches On and FiO did not predict how things have turned out. The world looks very different.
A decade ago, people speculated that AI would think symbolically and would try to maximize a utility function. Someone would write a Seed AI that would recursively self improve its source code. And since value is complex and fragile, we were unlikely to get our specification of the utility function correct and would create an agent which wanted to do things that conflicted with things we wanted. That’s possible because intelligence doesn’t imply that it would share our values. And the AI would want to disempower us because obtaining power is an instrumental goal of all utility functions. And thus any AI has the incentive to become smarter than all humans and then bide its time until it suddenly disempowers us. You then end up with a cinematic universe filled with focus on formal utility functions, systems which maximize one, formal decision theory, formal game theory, and emulation of other agents to try to figure out how they’ll respond to a given action.
Nothing we have actually looks like this story! Nothing! None of the systems we’ve made have a utility function, at least in the sense of the traditional MIRI narrative! AlphaGo doesn’t have a utility function like that! GPT doesn’t have a utility function like that! None of these things are agents! Even AutoGPT isn’t an agent, in the traditional MIRI sense!
I mean, certainly there is a strong pressure to do well in training—that's the whole point of training.
I disagree. This claim seems to be precisely what my original comment was critiquing:
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of "in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. 'try to do well' or reason about its own training process in order to get a low enough loss, otherwise the model gets 'selected against'."
And then you wrote, as some things you believe:[1]
The model needs to figure out how to somehow output a distribution that does well in training...
Doing that is quite hard for the distributions that we care about and requires a ton of cognition and reasoning in any situation where you don't just get complete memorization (which is highly unlikely under the inductive biases)...
Both the deceptive and sycophantic models involve directly reasoning about the training process internally to figure out how to do well on it. The aligned model likely also requires some reasoning about the training process, but only indirectly due to understanding the world being important and the training process being a part of the world.
This is the kind of claim I was critiquing in my original comment!
but you shouldn't equate them into one group and say it's a motte-and-bailey. Different people just think different things.
My model of you makes both claims, and I indeed see signs of both kinds of claims in your comments here.
- ^
Thanks for writing out a bunch of things you believe, by the way! That was helpful.
We know that SGD is selecting for models based on some combination of loss and inductive biases, but we don't know the exact tradeoff.
Actually, I've thought more, and I don't think that this dual-optimization perspective makes it better. I deny that we "know" that SGD is selecting on that combination, in the sense which seems to be required for your arguments to go through.
It sounds to me like I said "here's why you can't think about 'what gets low loss'" and then you said[1] "but what if I also think about certain inductive biases too?" and then you also said "we know that it's OK to think about it this way." No, I contend that we don't know that. That was a big part of what I was critiquing.
As an alert—It feels like your response here isn't engaging with the points I raised in my original comment. I expect I talked past you and you, accordingly, haven't seen the thing I'm trying to point at.
- ^
this isn't a quote, this is just how your comment parsed to me
Deceptive alignment seems to only be supported by flimsy arguments. I recently realized that I don't have good reason to believe that continuing to scale up LLMs will lead to inner consequentialist cognition to pursue a goal which is roughly consistent across situations. That is: a model which not only does what you ask it to do (including coming up with agentic plans), but also thinks about how to make more paperclips even while you're just asking about math homework.
Aside: This was kinda a "holy shit" moment, and I'll try to do it justice here. I encourage the reader to do a serious dependency check on their beliefs. What do you think you know about deceptive alignment being plausible, and why do you think you know it? Where did your beliefs truly come from, and do those observations truly provide
I agree that conditional on entraining consequentialist cognition which has a "different goal" (as thought of by MIRI; this isn't a frame I use), the AI will probably instrumentally reason about whether and how to deceptively pursue its own goals, to our detriment.
I contest that there's very little reason to expect "undesired, covert, and consistent-across-situations inner goals" to crop up in LLMs to begin with. An example alternative prediction is:
LLMs will continue doing what they're told. They learn contextual goal-directed behavior abilities, but only apply them narrowly in certain contexts for a range of goals (e.g. think about how to win a strategy game). They also memorize a lot of random data (instead of deriving some theory which simply explains its historical training data a la Solomonoff Induction).
Not only is this performant, it seems to be what we actually observe today. The AI can pursue goals when prompted to do so, but it isn't pursuing them on its own. It basically follows instructions in a reasonable way, just like GPT-4 usually does.
Why should we believe the "consistent-across-situations inner goals -> deceptive alignment" mechanistic claim about how SGD works? Here are the main arguments I'm aware of:
- Analogies to evolution (e.g. page 6 of Risks from Learned Optimization)
- I think these loose analogies provide basically no evidence about what happens in an extremely different optimization process (SGD to train LLMs).
- Counting arguments: there are more unaligned goals than aligned goals (e.g. as argued in How likely is deceptive alignment?)
- These ignore the importance of the parameter->function map. (They're counting functions when they need to be counting parameterizations.) Classical learning theory made the (mechanistically) same mistake in predicting that overparameterized models would fail to generalize.
- I also basically deny the relevance of the counting argument, because I don't buy the assumption of "there's gonna be an inner 'objective' distinct from inner capabilities; let's make a counting argument about what that will be."
- Speculation about simplicity bias: SGD will entrain consequentialism because that's a simple algorithm for "getting low loss"
- But we already know that simplicity bias in the NN prior can be really hard to reason about.
- I think it's unrealistic to imagine that we have the level of theoretical precision to go "it'll be a future training process and the model is 'getting selected for low loss', so I can now make this very detailed prediction about the inner mechanistic structure."[1]
- I falsifiably predict that if you try to use this kind of logic or counting argument today to make falsifiable predictions about unobserved LLM generalization, you're going to lose Bayes points left and right.
Without deceptive alignment/agentic AI opposition, a lot of alignment threat models ring hollow. No more adversarial steganography or adversarial pressure on your grading scheme or worst-case analysis or unobservable, nearly unfalsifiable inner homonculi whose goals have to be perfected.
Instead, we enter the realm of tool AI[2] which basically does what you say.[3] I think that world's a lot friendlier, even though there are still some challenges I'm worried about -- like an AI being scaffolded into pursuing consistent goals. (I think that's a very substantially different risk regime, though)
- ^
(Even though this predicted mechanistic structure doesn't have any apparent manifestation in current reality.)
- ^
Tool AI which can be purposefully scaffolded into agentic systems, which somewhat handles objections from Amdahl's law.
- ^
This is what we actually have today, in reality. In these setups, the agency comes from the system of subroutine calls to the LLM during e.g. a plan/critique/execute/evaluate loop a la AutoGPT.
It could learn to generalize based on color or it could learn to generalize based on shape. And which one we get is just a question of which one is simpler and easier for gradient descent to implement and which one is preferred by inductive biases, they both do equivalently well in training, but you know, one of them consistently is always the one that gradient descent finds, which in this situation is the color detector.
As an aside, I think this is more about data instead of "how easy is it to implement." Specifically, ANNs generalize based on texture because of the random crop augmentations. The crops are generally so small that there isn't a persistent shape during training, but there is a persistent texture for each class, so of course the model has to use the texture. Furthermore, a vision system modeled after primate vision also generalized based on texture, which is further evidence against ANN-specific architectural biases (like conv layers) explaining the discrepancy.
However, if the crops are made more "natural" (leaving more of the image intact, I think), then classes do tend to have persistent shapes during training. Accordingly, networks reliably learn to generalize based on shapes (just like people do!).
models where the reason it was aligned is because it’s trying to game the training signal.
Isn't the reason it's aligned supposed to be so that it can pursue its ulterior motive, and if it looks unaligned the developers won't like it and they'll shut it down? Why do you think the AI is trying to game the training signal directly, instead of just managing the developers' perceptions of it?
Returning to the frame of evidence strength: part of the reason for this sort of norm is that it lets the listener decide how much evidence "person says X" gives about "X", rather than the claimant making that decision on everybody else' behalf and then trying to propagate their conclusion.
This is a good norm. Sideways of this point, though, it seems to me that it'd be good to note both "it's confused to say X" and also "but there's a straightforward recovery Y of the main point, which some people find convincing and others don't."
Relatedly: to imagine the AI starting to succeed at those long-horizon tasks without imagining it starting to have more wants/desires (in the "behaviorist sense" expanded upon below) is, I claim, to imagine a contradiction—or at least an extreme surprise.
This seems like a great spot to make some falsifiable predictions which discriminate your particular theory from the pack. (As it stands, I don't see a reason to buy into this particular chain of reasoning.)
AIs will increasingly be deployed and tuned for long-term tasks, so we can probably see the results relatively soon. So—do you have any predictions to share? I predict that AIs can indeed do long-context tasks (like writing books with foreshadowing) without having general, cross-situational goal-directedness.[1]
I have a more precise prediction:
AIs can write novels with at least 50% winrate against a randomly selected novel from a typical American bookstore, as judged by blinded human raters or LLMs which have at least 70% agreement with human raters on reasonably similar tasks.
Credence: 70%; resolution date: 12/1/2025
Conditional on that, I predict with 85% confidence that it's possible to do this with AIs which are basically as tool-like as GPT-4. I don't know how to operationalize that in a way you'd agree to.
(I also predict that on 12/1/2025, there will be a new defense offered for MIRI-circle views, and a range of people still won't update.)
- ^
I expect most of real-world "agency" to be elicited by the scaffolding directly prompting for it (e.g. setting up a plan/critique/execute/summarize-and-postmortem/repeat loop for the LLM), and for that agency to not come from the LLM itself.
Sobering look into the human side of AI data annotation:
Instructions for one of the tasks he worked on were nearly identical to those used by OpenAI, which meant he had likely been training ChatGPT as well, for approximately $3 per hour.
“I remember that someone posted that we will be remembered in the future,” he said. “And somebody else replied, ‘We are being treated worse than foot soldiers. We will be remembered nowhere in the future.’ I remember that very well. Nobody will recognize the work we did or the effort we put in.”
Idea: Speed up ACDC by batching edge-checks. The intuition is that most circuits will have short paths through the transformer, because Residual Networks Behave Like Ensembles of Relatively Shallow Networks (https://arxiv.org/pdf/1605.06431.pdf). Most edges won't be in most circuits. Therefore, if you're checking KL of random-resampling edges e1 and e2, there's only a small chance that e1 and e2 interact with each other in a way important for the circuit you're trying to find. So statistically, you can check eg e1, ... e100 in a batch, and maybe ablate 95 edges all at once (due to their individual KLs being low).
If true, this predicts that given a threshold T, and for most circuit subgraphs H of the full network graph G, and for the vast majority of e1, e2 in H:
KL(G || H \ {e2} ) - KL(G || H) < T
iff
KL(G || H \ {e2, e1} ) - KL(G || H \ {e1}) < T
(That is, e1's inclusion doesn't change your pruning decision on e2)
Neel Nanda suggests that it's particularly natural to batch edges in the same layer.
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
No, I don't think I'm positing that—in fact, I said that the aligned model doesn't do this.
I don't understand why you claim to not be doing this. Probably we misunderstand each other? You do seem to be incorporating a "(strong) pressure to do well in training" in your reasoning about what gets trained. You said (emphasis added):
The argument for deceptive alignment is that deceptive alignment might be the easiest way for the model to figure out how to do well in training.
This seems to be engaging in the kind of reasoning I'm critiquing.
We now have two dual optimization problems, "minimize loss subject to some level of inductive biases" and "maximize inductive biases subject to some level of loss" which we can independently investigate to produce evidence about the original joint optimization problem.
Sure, this (at first pass) seems somewhat more reasonable, in terms of ways of thinking about the problem. But I don't think the vast majority of "loss-minimizing" reasoning actually involves this more principled analysis. Before now, I have never heard anyone talk about this frame, or any other recovery which I find satisfying.
So this feels like a motte-and-bailey, where the strong-and-common claim goes like "we're selecting models to minimize loss, and so if deceptive models get lower loss, that's a huge problem; let's figure out how to not make that be true" and the defensible-but-weak-and-rare claim is "by considering loss minimization given certain biases, we can gain evidence about what kinds of functions SGD tends to train."
The problem is that figuring out how to do well at training is actually quite hard[1]
It seems to me like you're positing some "need to do well in training", which is... a kinda weird frame. In a weak correlational sense, it's true that loss tends to decrease over training-time and research-time.
But I disagree with an assumption of "in the future, the trained model will need to achieve loss scores so perfectly low that the model has to e.g. 'try to do well' or reason about its own training process in order to get a low enough loss, otherwise the model gets 'selected against'." (It seems to me that you are making this assumption; let me know if you are not.)
I don't know why so many people seem to think model training works like this. That is, that one can:
- verbally reason about whether a postulated algorithm "minimizes loss" (example algorithm: using a bunch of interwoven heuristics to predict text), and then
- brainstorm English descriptions of algorithms which, according to us, get even lower loss (example algorithm: reason out the whole situation every forward pass, and figure out what would minimize loss on the next token), and then
- since algorithm-2 gets "lower loss" (as reasoned about informally in English), one is licensed to conclude that SGD is incentivized pick the latter algorithm.
I think that loss just does not constrain training that tightly, or in that fashion.[2] I can give a range of counterevidence, from early stopping (done in practice to use compute effectively) to knowledge distillation (shows that for a given level of expressivity, training on a larger teacher model's logits will achieve substantially lower loss than supervised training to convergence from random initialization, which shows that training to convergence isn't even well-described as "minimizing loss").
And I'm not aware of good theoretical bounds here either; the cutting-edge PAC-Bayes results are, like, bounding MNIST test error to 2.7% on an empirical setup which got 2%. That's a big and cool theoretical achievement, but -- if my understanding of theory SOTA is correct -- we definitely don't have the theoretical precision to be confidently reasoning about loss minimization like this on that basis.
I feel like this unsupported assumption entered the groundwater somehow and now looms behind lots of alignment reasoning. I don't know where it comes from. On the off-chance it's actually well-founded, I'd deeply appreciate an explanation or link.
- ^
FWIW I think this claim runs afoul of what I was trying to communicate in reward is not the optimization target. (I mention this since it's relevant to a discussion we had last year, about how many people already understood what I was trying to convey.)
- ^
I also see no reason to expect this to be a good "conservative worst-case", and this is why I'm so leery of worst-case analysis a la ELK. I see little reason that reasoning this way will be helpful in reality.
I think it's inappropriate to call evolution a "hill-climbing process" in this context, since those words seem optimized to sneak in parallels to SGD. Separately, I think that evolution is a bad analogy for AGI training.
I mostly... can just barely see an ear above the train if I look, after being told to look there. I don't think it's "clear." I also note that these are black-box attacks on humans which originated from ANNs; these are transferred attacks from eg a CNN.
Another point for feature universality. Subtle adversarial image manipulations influence both human and machine perception:
... we find that adversarial perturbations that fool ANNs similarly bias human choice. We further show that the effect is more likely driven by higher-order statistics of natural images to which both humans and ANNs are sensitive, rather than by the detailed architecture of the ANN.
When writing about RL, I find it helpful to disambiguate between:
A) "The policy optimizes the reward function" / "The reward function gets optimized" (this might happen but has to be reasoned about), and
B) "The reward function optimizes the policy" / "The policy gets optimized (by the reward function and the data distribution)" (this definitely happens, either directly -- via eg REINFORCE -- or indirectly, via an advantage estimator in PPO; B follows from the update equations)
(Updating a bit because of these responses -- thanks, everyone, for responding! I still believe the first sentence, albeit a tad less strongly.)
Because it's shorter edit distance in its internal ontology; it's plausibly NN-simple to take existing plan-grading procedures, internal to the model, and then hooking those more directly into its logit-controllers.
Also note that probably it internally hooks up lots of ways to make decisions, and this only has to be one (substantial) component. Possibly I'd put .3 or .45 now instead of .55 though.
Explaining Wasserstein distance. I haven't seen the following explanation anywhere, and I think it's better than the rest I've seen.
The Wasserstein distance tells you the minimal cost to "move" one probability distribution into another . It has a lot of nice properties.[1] Here's the chunk of math (don't worry if you don't follow it):
The Wasserstein 1-distance between two probability measures and is
where is the set of all couplings of and .
What's a "coupling"? It's a joint probability distribution over such that its two marginal distributions equal and . However, I like to call these transport plans. Each plan specifies a way to transport a distribution into another distribution :
(EDIT: The line should be flipped.)
Now consider a given point in 's support, say the one with the dotted line below it. 's density must be "reallocated" into 's distribution. That reallocation is specified by the conditional distribution , as shown by the vertical dotted line. Marginalizing over , transports all of 's density and turns it into ! (This is why we required the marginalization.)
Then the Wasserstein 1-distance is simple in this case, where are distributions on . The 1-distance of a plan is simply the expected absolute distance from the line!
Then the Wasserstein just finds the infimum over all possible transport plans! Spiritually, the Wasserstein 1-distance[2] tells you the cost of the most efficient way to take each point in and redistribute it into . Just evaluate each transport plan by looking at the expected deviation from the identity line .
Exercise: For on , where is but translated to the right by , use this explanation to explain why the 1-distance equals .
- ^
The distance is motivated by section 1 of "Optimal Transport and Wasserstein Distance."
- ^
This explanation works for -distance for , it just makes the math a little more cluttered.
Steganography could be used by AIs within a single short chain-of-thought to have more serial thinking time. This could include “thoughts” about the overseer and how to play the training game correctly. If this behavior is incentivized, it would make deceptive alignment more likely, since it would remove some of the pressure against deceptive alignment coming from speed priors.
I mean, IIUC the speed prior still cuts against this, since instead of thinking about training the network could just be allocating more capacity to doing the actual task it's trained to do. That doesn't seem to change with additional filler thinking tokens.
if planning, thinking, and communications in dense vector spaces become vastly more powerful than text-based planning, thinking, and communication, then the alignment tax from sticking with systems mostly relying on text will grow.
Doesn't this already apply partially to the current work? Sure, you constrain the input embeddings to the valid token embedding vectors, but it also attends to the previous residual streams (and not just the tokens they output).
I'm also really glad to see this. Seems like a very positive set of changes. Props to MIRI for taking concrete and decisive action.
I expect to not see this, conditional on adding a stipulation like "the AI wasn't scaffolded and then given a goal like 'maximize profit'", because I could imagine the AI-system coming up with nasty subgoals. In particular, I don't expect egregiously bad actions from autoregressive sampling of an LLM tasked with doing scientific research.
I think this is really lucid and helpful:
I expect some people to object that the point of the evolutionary analogy is precisely to show that the high-level abstract objective of the optimization process isn't incorporated into the goals of the optimized product, and that this is a reason for concern because it suggests an unpredictable/uncontrollable mapping between outer and inner optimization objectives.
My point here is that, if you want to judge an optimization process's predictability/controllability, you should not be comparing some abstract notion of the process's "true outer objective" to the result's "true inner objective". Instead, you should consider the historical trajectory of how the optimization process actually adjusted the behaviors of the thing being optimized, and consider how predictable that thing's future behaviors are, given past behaviors / updates.
@Kaj_Sotala argues above that this perspective implies greater consistency in human goals between the ancestral and modern environments, since the goals evolution actually historically selected for in the ancestral environment are ~the same goals humans pursue in the modern environment.
How is this not an excellent example of how under novel circumstances, inner-optimizers (like human brains) can almost all (serial sperm donor cases like hundreds out of billions) diverge extremely far (if forfeiting >10,000% is not diverging far, what would be?) from the optimization process's reward function (within-generation increase in allele frequencies), while pursuing other rewards (whatever it is you are enjoying doing while very busy not ever donating sperm)?
I think it's inappropriate to use technical terms like "reward function" in the context of evolution, because evolution's selection criteria serve vastly different mechanistic functions from eg a reward function in PPO.[1] Calling them both a "reward function" makes it harder to think precisely about the similarities and differences between AI RL and evolution, while invalidly making the two processes seem more similar. That is something which must be argued for, and not implied through terminology.
- ^
And yes, I wish that "reward function" weren't also used for "the quantity which an exhaustive search RL agent argmaxes." That's bad too.
The meme of "current alignment work isn't real work" seems to often be supported by a (AFAICT baseless) assumption that LLMs have, or will have, homunculi with "true goals" which aren't actually modified by present-day RLHF/feedback techniques. Thus, labs aren't tackling "the real alignment problem", because they're "just optimizing the shallow behaviors of models." Pressed for justification of this confident "goal" claim, proponents might link to some handwavy speculation about simplicity bias (which is in fact quite hard to reason about, in the NN prior), or they might start talking about evolution (which is pretty unrelated to technical alignment, IMO).
Are there any homunculi today? I'd say "no", as far as our limited knowledge tells us! But, as with biorisk, one can always handwave at future models. It doesn't matter that present models don't exhibit signs of homunculi which are immune to gradient updates, because, of course, future models will.
Quite a strong conclusion being drawn from quite little evidence.
I expect you to be making a correct and important point here, but I don't think I get it yet. I feel confused because I don't know what it would mean for this frame to make false predictions. I could say "Evolution selected me to have two eyeballs" and I go "Yep I have two eyeballs"? "Evolution selected for [trait with higher fitness]" and then "lots of people have trait of higher fitness" seems necessarily true?
I feel like I'm missing something.
Oh. Perhaps it's nontrivial that humans were selected to value a lot of stuff, and (different, modern) humans still value a lot of stuff, even in today's different environment? Is that the point?
In the sections before that, I argued that there’s no single thing that evolution selects for; rather, the thing that it’s changing is constantly changing itself.
"The thing that it's selecting for is itself constantly changing"?
It feels to me like lots of alignment folk ~only make negative updates. For example, "Bing Chat is evidence of misalignment", but also "ChatGPT is not evidence of alignment." (I don't know that there is in fact a single person who believes both, but my straw-models of a few people believe both.)
It seems reasonably likely that covid was an accidental lab leak (though attribution is hard) and it also seems like it wouldn't have been that hard to engineer covid in a lab.
Seems like a positive update on human-caused bioterrorism, right? It's so easy to let stuff leak that covid accidentally gets out, and it might even have been easy to engineer, but (apparently) no one engineered it, nor am I aware of this kind of intentional bioterrorism happening in other places. People apparently aren't doing it. See Gwern's Terrorism is not effective.
Maybe smart LLMs come out. I bet people still won't be doing it.
So what's the threat model? One can say "tail risks", but--as OP points out--how much do LLMs really accelerate people's ability to deploy dangerous pathogens, compared to current possibilities? And what off-the-cuff probabilities are we talking about, here?
I feel like an important point isn't getting discussed here -- What evidence is there on tutor-relevant tasks being a blocking part of the pipeline, as opposed to manufacturing barriers? Even if future LLMs are great tutors for concocting crazy bioweapons in theory, in practice what are the hardest parts? Is it really coming up with novel pathogens? (I don't know)
ok, I am quite confident you will get tons of evidence that AI systems are not aligned with you within the next few years. My primary question is what you will actually do as soon as you have identified a system as unaligned or dangerous in this way
Any operationalizations that people might make predictions on?
This is an excellent reply, thank you!
So I'm pretty skeptical of the claim that anything remotely analogous is going on inside of current GPTs, especially within a single forward pass.
I think I broadly agree with your points. I think I'm more imagining "similarity to humans" to mean "is well-described by shard theory; eg its later-network steering circuits are contextually activated based on a compositionally represented activation context." This would align with greater activation-vector-steerability partway through language models (not the only source I have for that).
However, interpreting GPT: the logit lens and eg DoLA suggests that predictions are iteratively refined throughout the forward pass, whereas presumably shard theory (and inner optimizer threat models) would predict most sophisticated steering happens later in the network.
Well, if someone originally started worrying based on strident predictions of sophisticated internal reasoning with goals independent of external behavior, then realizing that's currently unsubstantiated should cause them to down-update on AI risk. That's why it's relevant. Although I think we should have good theories of AI internals.
I think this post would be improved by removing the combative tone and the framing of people who disagree as "gullible."
Anyways, insofar as you mean to argue "this observation is no evidence for corrigibility", I think your basic premise is wrong. We can definitely gather evidence about generalization behavior (e.g. "does the model actually let itself be shut down") by prompting the model by asking "would you let yourself be shut down?"
I claim that P(model lets itself be shut down | (model says it would let itself be shut down)) > P(model lets itself be shut down | NOT(model says it would let itself be shut down)). By Conservation of Expected Evidence, observing the first event is Bayesian evidence that the model would "actually" let itself be shut down.
I think it's reasonable to debate how strong the evidence is, and I can imagine thinking it's pretty weak. But I disagree with claims that these quantities don't bear on each other.
> evolution does not grow minds, it grows hyperparameters for minds.
Imo this is a nitpick that isn't really relevant to the point of the analogy. Evolution is a good example of how selection for X doesn't necessarily lead to a thing that wants ('optimizes for') X; and more broadly it's a good example for how the results of an optimization process can be unexpected.
I think it's extremely relevant, if we want to ensure that we only analogize between processes which share enough causal structure to ensure that lessons from e.g. evolution actually carry over to e.g. AI training (due to those shared mechanisms). If the shared mechanisms aren't there, then we're playing reference class tennis because someone decided to call both processes "optimization processes."
Not sure where to land on that. It seems like both are good analogies? Brains might not be using gradients at all[1], whereas evolution basically is.
I mean, does it matter? What if it turns out that gradient descent itself doesn't affect inductive biases as much as the parameter->function mapping? If implicit regularization (e.g. SGD) isn't an important part of the generalization story in deep learning, will you down-update on the appropriateness of the evolution/AI analogy?
The fact that constitutional AI works at all, that we can point at these abstract concepts like 'freedom' and language models are able to drive a reinforcement learning optimization process to hit the right behavior-targets from the abstract principle is very strong evidence that they understand the meaning of those abstract concepts.
"It understands but it doesn't care!"
There is this bizarre motte-and-bailey people seem to do around this subject.
I agree. I am extremely bothered by this unsubstantiated claim. I recently replied to Eliezer:
Getting a shape into the AI's preferences is different from getting it into the AI's predictive model.
It seems like you think that human preferences are only being "predicted" by GPT-4, and not "preferred." If so, why do you think that?
I commonly encounter people expressing sentiments like "prosaic alignment work isn't real alignment, because we aren't actually getting the AI to care about X." To which I say: How do you know that? What does it even mean for that claim to be true or false? What do you think you know, and why do you think you know it? What empirical knowledge of inner motivational structure could you be leveraging to make these claims, such that you are far more likely to make these claims in worlds where the claims are actually true?
(On my pessimistic days, I wonder if this kind of claim gets made because humans write suggestive phrases like "predictive loss function" in their papers, next to the mathematical formalisms.)
Thanks for the reply. Let me clarify my position a bit.
I think that GPTs (and every other kind of current AI system) are not doing anything that is even close to isomorphic to the processing that happens inside the human brain.
I didn't mean to (positively) claim that GPTs have near-isomorphic motivational structure (though I think it's quite possible).
I meant to contend that I am not aware of any basis for confidently claiming that LLMs like GPT-4 are "only predicting what comes next", as opposed to "choosing" or "executing" one completion, or "wanting" to complete the tasks they are given, or—more generally—"making decisions on the basis of the available context, such that our ability to behaviorally steer LLMs (e.g. reducing sycophancy) is real evidence about our control over LLM motivations."
Concerning "GPTs are predictors", the best a priori argument I can imagine is: GPT-4 was pretrained on CE loss, which itself is related to entropy, related to information content, related to Shannon's theorems isolating information content in the context of probabilities, which are themselves nailed down by Cox's theorems which do axiomatically support the Bayesian account of beliefs and belief updates... But this long-winded indirect axiomatic justification of "beliefs" does not sufficiently support some kind of inference like "GPTs are just predicting things, they don't really want to complete tasks." That's a very strong claim about the internal structure of LLMs.
(Besides, the inductive biases probably have more to do with the parameter->function map, than the implicit regularization caused by the pretraining objective function; more a feature of the data, and less a feature of the local update rule used during pretraining...)
Me: Huh? The user doesn’t try to shut down the AI at all.
For people, at least, there is a strong correlation between "answers to 'what would you do in situation X?'" and "what you actually do in situation X." Similarly, we could also measure these correlations for language models so as to empirically quantify the strength of the critique you're making. If there's low correlation for relevant situations, then your critique is well-placed.
(There might be a lot of noise, depending on how finicky the replies are relative to the prompt.)
Saying "don't do X" seems inefficient for recall. Given the forward-chaining nature of human memory, wouldn't it be better to say "If going to X, then don't"? That way, if you think "what if I do X?", you recall "don't."
(This is not an actual proposal for a communication norm.)
I don’t think anyone even remotely close to understanding this fact would use it in that context.
I dispute this claim being called a "fact", but I'm open to having my mind changed. Is there a clear explanation of this claim somewhere? I've read Lethalities and assorted bits of Nate's writing.
The fact that a large language model trained to predict natural language text can generate that dialogue, has no bearing on the AI's actual motivations
What are "actual motivations"?
When you step off distribution, the results look like random garbage to you.
This is false, by the way. Deep learning generalizes quite well, (IMO) probably because the parameter->function map is strongly biased towards good generalizations. Specifically, many of GPT-4's use cases (like writing code, or being asked to do rap battles of Trump vs Naruto) are "not that similar to the empirical training distribution (represented as a pmf over context windows)."
Imagine capturing an alien and forcing it to act in a play. An intelligent alien actress could learn to say her lines in English, to sing and dance just as the choreographer instructs. That doesn't provide much assurance about what will happen when you amp up the alien's intelligence.
This nearly assumes the conclusion. This argument assumes that there is an 'alien' with "real" preexisting goals, and that the "rehearsing" is just instrumental alignment. (Simplicia points this out later.)
Simplicia: I agree that the various coherence theorems suggest that the superintelligence at the end of time will have a utility function, which suggests that the intuitive obedience behavior should break down at some point between here and the superintelligence at the end of time.
I actually don't agree with Simplicia here. I don't think having a utility function means that the obedience will break down.
If a search process would look for ways to kill you given infinite computing power, you shouldn't run it with less and hope it doesn't get that far.
Don't design agents which exploit adversarial inputs.
but the actual situation turned out to be astronomically more forgiving, thanks to the inductive biases of SGD.
Again, it's probably due to the parameter->function mapping, where there are vastly more "generalizing" parameterizations of the network. SGD/"implicit regularization" can be shown to be irrelevant at least in a simple toy problem. (Quintin Pope writes about related considerations.)
The reason—one of the reasons—that you can't train a superintelligence by using humans to label good plans, is because at some power level, your planner figures out how to hack the human labeler.
A flatly invalid reasoning step. Reward is not the optimization target, and neither are labels.
Thanks, this was an oversight on my part.
In the context of AI and reward systems, Goodhart's Law means that when a reward becomes the objective for an AI agent, the AI agent will do everything it can to maximize the reward function, rather than the original intention.
I think this is either ambiguous or not correct. Reward is not the optimization target; Models Don't "Get Reward", and so on.I think it would be more accurate to state:
In the context of AI and reward systems, Goodhart's Law means that when a proxy reward function reinforces undesired behavior, the AI will learn to do things we don't want. The better the AI is at exploring, the more likely it is to find undesirable behavior which is spuriously rated highly by the reward model.