Conditioning Predictive Models: Making inner alignment as easy as possible
post by evhub, Adam Jermyn (adam-jermyn), Johannes Treutlein (Johannes_Treutlein), Rubi J. Hudson (Rubi), kcwoolverton · 2023-02-07T20:04:20.272Z · LW · GW · 2 comments
This is the fourth of seven posts in the Conditioning Predictive Models Sequence [? · GW] based on the paper “Conditioning Predictive Models: Risks and Strategies” by Evan Hubinger, Adam Jermyn, Johannes Treutlein, Rubi Hudson, and Kate Woolverton. Each post in the sequence corresponds to a different section of the paper.
4. Making inner alignment as easy as possible
At the beginning of this sequence, we posited that large language models could be well-understood as predictive models of the world. So far, however, that has just been an assumption—now we want to return to it and try to understand how likely it is to actually be true.
Furthermore, in addition to needing a predictive model (as opposed to e.g. a deceptive agent), we also want our predictor to have a fixed, physical understanding of its cameras rather than operate as a general inductor to avoid the problem of anthropic capture [AF · GW]. Additionally, as we’ll discuss in more depth in this section [AF · GW], we’ll also need a prediction model that is managing its own internal cognitive resources in the right way.
Though we think that ensuring these desiderata could be quite difficult, we nevertheless think that this presents the easiest inner alignment problem that we are aware of among any potentially safe and competitive approaches [AF · GW]. Furthermore, since we believe that inner alignment—and deceptive alignment in particular—is among the most dangerous and hardest-to-address of all known AI safety problems [? · GW], we think that any reduction in the overall difficulty of that problem should be taken quite seriously as a reason to favor predictive model approaches.
Plausible internal structures
There are many possible ways large language models could work internally. Previously, we suggested some examples—specifically:
- an agent minimizing its cross-entropy loss;
- an agent maximizing long-run predictive accuracy;
- a deceptive agent trying to gain power in the world;
- a general inductor;
- a predictive model of the world (with fixed, physical “cameras” translating world states into observed tokens);
- a loose collection of heuristics;
- a generative model of token transitions;
- a simulator that picks from a repertoire of humans to simulate;
- a proxy-aligned agent optimizing proxies like grammatical correctness.
To start with, for our purposes here, we’ll eliminate those internal structures that don’t scale with capabilities—that is, we only want to consider plausible internal structures of models that perform well enough on the language model pre-training task that they are able to generalize to other cognitive tasks at a human level or above. Thus, we’ll eliminate (6) through (9) from the above list—(6) because predicting agents like humans should require some degree of optimization, (7)/(8) for the reasons we outlined previously [AF · GW] that LLMs have to be able to predict the world, and (9) because such proxies should eventually yield worse performance than actual prediction.[1]
That leaves us with (1)/(2), variants on a sycophantic reward maximizer; (3), a deceptive agent; and (4)/(5), different ways of directly attempting to produce predictions.
The distinction between (4) and (5) lies in exactly how a model produces predictions given its understanding of the world, and we are quite uncertain about what that might look like in practice. Unfortunately, we expect that the only way to figure out how models accomplish this translation is through transparency and interpretability and not theoretical analysis.[2]
Nevertheless, we still think it is possible to make some high-level generalizable claims regarding which classes of internal structures are most likely. To do that, we first have to figure out some framework for comparing and contrasting the remaining plausible internal structures.
A framework for comparing internal structures
Structurally, we’ll isolate the following two features of different types of predictive models:
- What “camera” in the world is it tracking?
  - This could be a model of a physical camera pointed at a scene we want the model to predict, but it doesn’t need to correspond directly to the observations we want to extract and could instead be some other way of mapping the world to data the model wants to use.
  - For instance, a deceptive paperclip maximizer might have a “camera” that counts the number of paperclips in the world.
- How does the model compute its output from what it predicts its “camera” will show?
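To make this decomposition concrete, here is a minimal schematic sketch in Python. The type names and fields below are ours, purely for illustration; they are not part of the paper's formalism.

```python
from dataclasses import dataclass
from typing import Any, Callable

WorldState = Any  # placeholder: the model's beliefs about the state of the world

@dataclass
class InternalStructure:
    # (1) The "camera": a map from (beliefs about) world states to whatever
    #     quantity the model tracks -- observed tokens for a prediction model,
    #     number of paperclips for a deceptive paperclip maximizer, etc.
    camera: Callable[[WorldState], Any]
    # (2) The output rule: how predicted camera readings get turned into the
    #     model's actual output (minimize cross-entropy, argmax an objective,
    #     output the most likely next observation, ...).
    compute_output: Callable[[Any], Any]
```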
To start with, let’s see what this decomposition looks like for the remaining plausible internal structures from above:
- An agent minimizing its cross-entropy loss:
  - What “camera” is it tracking?
    - The relevant camera here is effectively “what data points are in the training set, conditional on being in training.” Note that conditioning on being in training is necessary, since otherwise “what data points are in the training set” isn’t well-defined during deployment when no such training set exists.
  - How does the model compute its output from what it predicts its “camera” will show?
    - The model selects the output which minimizes the cross-entropy loss relative to what it thinks the camera will show (that is, what it thinks will be in the training set).
- An agent maximizing long-run predictive accuracy:
  - What “camera” is it tracking?
    - This model’s camera needs to track prompts given to the model over a long time-horizon, along with a distribution over correct continuations. What “correct” means could vary, however, from something closer to 1a where the model conditions on being in training, to something closer to 4a or 5a where it is predicting a more general camera.
  - How does the model compute its output from what it predicts its “camera” will show?
    - The model chooses an overall policy over outputs that maximizes its predictive accuracy (or some actual proper scoring rule) aggregated over its horizon length. Thus, such a model might choose to output continuations with low accuracy on some time steps in order to make future predictions easier. Such a model might even be deceptive, making this mostly a sub-class of 3.
- A deceptive agent trying to gain power in the world:
  - What “camera” is it tracking?
    - For a deceptive model, the “camera” that it needs to track is the objective that it’s attempting to maximize. As a simple example, a paperclip maximizer would have a “camera” that tracks the number of paperclips.
  - How does the model compute its output from what it predicts its “camera” will show?
    - It selects the output which causes the camera to show the best result according to its objective.
- A general inductor:
  - What “camera” is it tracking?
    - A general inductor keeps track of many different hypotheses for what its “cameras” might represent, potentially drawing from the space of all computable data-generating procedures. It updates the weights of these hypotheses using Bayes’ rule on the observed outputs.
  - How does the model compute its output from what it predicts its “camera” will show?
    - Given a distribution over possible cameras, a general inductor predicts whatever observations would come next on the different possible cameras, weighted by how likely it currently thinks each possible camera is.
- A predictive model of the world (with fixed, physical cameras):
  - What “camera” is it tracking?
    - The camera here is some physical generalization of the data-generating procedure. For example, “whatever appears on these websites from 2015 to 2022.”
  - How does the model compute its output from what it predicts its “camera” will show?
    - The model outputs whatever the most likely next observation is for its cameras to show. This is similar to 4b, but with the model only considering physical cameras rather than arbitrary data-generation processes.
Though it may not immediately look like it, we think that this decomposition is highly related to the world_model + optimization_procedure + mesa_objective decomposition in “How likely is deceptive alignment? [AF · GW]”. The primary differences are that: we are thinking of the objective as primarily being about what “camera” the model is paying attention to, we’re thinking of how the model uses its camera as its optimization procedure, and we’re eliding the world model as it’s effectively the same in all cases here. Thus, we think a similar analysis to that used in “How likely is deceptive alignment?” should be applicable here as well.
One important difference, however, is that the internal structures compared in “How likely is deceptive alignment?” are different from ours—specifically, Hubinger isolates three distinct mechanisms via which a model could end up looking like it was doing the right thing on any training input. In the context of a model trained on a prediction task, these look like:
- Internal alignment: The model has internalized the goal of predicting “camera” observations. The goal is hardcoded in the weights, and the model can directly pursue it.
- Corrigible alignment: The model has internalized a pointer to the prediction goal. It doesn’t know what this goal entails, but it knows how to figure it out at runtime from its world model.
- Deceptive alignment: The model has some other goal, but pursues the prediction goal during training because this is instrumentally useful. The model doesn’t have any hardcoded information about the prediction goal, and instead derives it at runtime as part of planning for its true goal.
In our context, any of the plausible non-deceptive internal structures (1, 4, or 5) can be implemented either via internal alignment or corrigible alignment—the difference just being exactly how the camera model gets hardcoded.[3] Thus, for our purposes here, we will generally lump internally and corrigibly aligned models together as non-deceptive predictive models.
Notably, however, all such non-deceptive predictive models require hardcoding something about the prediction goal, making them potentially more complex than deceptively aligned models. In our opinion, however, the case for deceptive alignment is substantially weaker in the context of such predictive models than in the more standard picture presented by “How likely is deceptive alignment? [AF · GW]”. We discuss this further below.
Analyzing the case for deceptive alignment
In “How likely is deceptive alignment? [AF · GW]”, two scenarios are presented for how a model could become deceptively aligned: the high path-dependence scenario and the low path-dependence scenario. In the high path-dependence case, models first develop proxy goals. Then, when their understanding of the training objective exceeds the alignment of their proxy goals, it becomes favorable for the model to become deceptive. In the low path-dependence case, deception is favored by a general simplicity bias because it allows for substituting simpler long-term goals for the prediction goal and still getting good performance.
Importantly, both of these scenarios critically rely on the true goal being relatively complex and hard to learn. In the high path-dependence case, the true goal needs to be relatively complex for the proxy goal alignment to lag behind the world modeling capability—and in the low path-dependence case, the true goal needs to be relatively complex for the simplest long-term goal to be substantially simpler than the true goal. This complexity gap is needed for deceptive alignment to be favored over internal alignment and for it to be favored over corrigible alignment—in particular, the case for deceptive alignment being favored over corrigible alignment requires that it take many more bits to specify a pointer to the true goal than to specify a very simple long-term goal.
The relevant notion of complexity here is roughly “description complexity given a good world model,” because both deceptive and aligned models need to have a good world model to achieve low loss. In the case of LLM pre-training, the intended goal is actually quite simple by this measure: the vast majority of the complexity of predicting the next token is wrapped up in modeling the world—if all that’s left is just minimizing the cross-entropy loss, that should be almost trivial to encode.
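To illustrate just how little the prediction goal itself has to encode on top of a good world model, here is a minimal sketch of the next-token cross-entropy objective in PyTorch. The shapes are made up; the point is only that, once the model's predictive distribution over the next token exists, the objective itself is essentially a one-liner.

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: logits are the model's predictive distribution over the
# next token, targets are the actual next tokens from the training stream.
logits = torch.randn(4, 128, 50257)            # (batch, sequence, vocab)
targets = torch.randint(0, 50257, (4, 128))    # (batch, sequence)

# The entire "goal" of LLM pre-training, given the world model's predictions:
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```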
This is critical and so worth emphasizing: unlike most goals we might want to instill in a model, the prediction goal is relatively simple, such that its complexity is competitive with the simplest possible long-term goals. As a result, for this specific goal, the internally aligned models don’t necessarily pay a complexity penalty for their alignment.
By contrast, objectives like “satisfy human preferences” are quite complex to specify even given a perfect world-model, since they have to specify some ground truth for how to compute “human preferences” from base facts about reality. Concretely, an internally aligned model needs to learn the actual truth of human preferences, which are very complicated. Similarly, a corrigibly aligned model needs to learn that it should be satisfying human preferences, which is simpler but still potentially requires many bits to describe where human preferences live in the world-model. A deceptively aligned model, on the other hand, figures out at runtime both that it needs to say the words that make the human overseers press the reward button and what those words are.
Comparing camera complexities
One wrinkle in the above story—that the prediction objective is so simple that it’s as easy to find as a deceptive objective—is that the prediction goal is not just “minimize cross-entropy loss”: the model also needs to have some conceptualization of its “cameras” that tracks what training data it might get. In general, to do well on the LLM pre-training task, you need to not just understand base facts about the world, but also understand how the data stream that you’re predicting is influenced by those base facts.
On its own, the cameras should just be another fact about the world: the data was actually collected by someone in some (known) fashion. The tricky part, however, comes in how the model learns to generalize that data collection procedure: an aligned predictive model always needs to know how to compute the function from the world to its observations to be able to do the inference necessary to make predictions. For example, it needs to know how to deal with situations where there are multiple possible data streams to consider, where there are no humans to collect or curate data, etc. Put another way, there are many possible ways for a model to map beliefs over the world to camera outputs, and at minimum the model needs to specify enough information to pick one of these generalizations.
The simplest way for a predictive model to model its cameras is likely to function as a general inductor that performs inference over all possible data streams in the world to determine which are most likely to be the source of the model’s input data. That gives us the anthropic capture problem we discussed previously [AF · GW], however.
If we don’t want anthropic capture, the model instead has to have some notion of how to generalize its cameras that isn’t the fully general one. This notion can still involve inference over different camera types and properties, but at minimum it needs to hardcode a distribution over physical cameras, one that rules out simulated cameras, in which to do the inference.
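As a toy illustration of the difference (a sketch under our own assumptions; the hypothesis names and numbers are invented), one can think of both kinds of model as doing Bayesian inference over camera hypotheses, where the fixed-physical-camera predictor simply hardcodes a prior that assigns zero weight to simulated-camera hypotheses:

```python
def bayes_update(prior, likelihood):
    """prior: {hypothesis: P(h)}, likelihood: {hypothesis: P(observations | h)}."""
    unnormalized = {h: prior[h] * likelihood[h] for h in prior}
    z = sum(unnormalized.values())
    return {h: p / z for h, p in unnormalized.items()}

# A general inductor puts weight on every computable data stream, including
# "the data is being generated by a simulation of the training process".
general_prior = {"web_scrape_2015_2022": 0.6, "simulated_training_process": 0.4}

# A fixed-physical-camera predictor hardcodes a prior that rules that out.
physical_prior = {"web_scrape_2015_2022": 1.0, "simulated_training_process": 0.0}

# Suppose the observed data is equally consistent with both hypotheses.
likelihood = {"web_scrape_2015_2022": 0.01, "simulated_training_process": 0.01}

print(bayes_update(general_prior, likelihood))   # keeps mass on the simulation hypothesis
print(bayes_update(physical_prior, likelihood))  # all mass stays on the physical camera
```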
Unfortunately, one of the simplest ways to do this is to be a deceptive model. Such a model need not hardcode anything about how its cameras work at all, since it can just infer the camera generalization at runtime that is most likely to get it the most power in the world and then do that. Conceptually, the concern is that “do whatever gets you the most power” might be the best way of compressing a good-seeming camera generalization, and so might be favored by inductive biases.
As we mentioned previously, however, deceptive models also need to specify “cameras” of some sort—that is, they need to specify the “camera” that observes the value of their objective, since they need to know in any possible world state how well they’re doing on their internal mesa-objective. Thus, we think that the key criterion which is likely to determine how likely deceptive alignment is for predictive models is the relative complexity of the “cameras” that each different type of model has to track.
Notably, this does elide potential complexity differences in how different models compute their output from their cameras, though overall we think that such differences are likely to be relatively minor by comparison. In particular, in all the cases we considered above, the relevant procedures were essentially all relatively simple optimization procedures.
Additionally, this criterion suggests a way of potentially improving this situation further: we could try to use as simple a camera as possible in collecting the data. The idea here would be that, by making the desirable camera as easy to specify as possible, we help make it more competitive in terms of complexity with the possible cameras of deceptive models—that is, if we make the data collection procedure as straightforward as possible, it should help reduce the complexity gap between the deceptive and non-deceptive models.[4]
Other relevant factors
Though the simplicity of the desired prediction objective is the primary reason that we think the inner alignment problem is likely to be easier in the case of training predictive models, there are some other additional considerations that are worth pointing out here as well.
First, another factor that might reduce the likelihood of deceptive alignment in the LLM case—relative to e.g. the “satisfy human preferences” case—is that the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning). Since situational awareness is a necessary prerequisite to deceptive alignment, this should mean that the fully safe period where the model has no situational awareness could be substantially longer for predictive models.[5] That said, this period still should not last forever: understanding how you fit into the world does help you make better predictions, and eventually developing that understanding will become the most efficient way for gradient descent to extract an additional performance improvement.[6] Furthermore, at least in the low path-dependence scenario, deceptive alignment might just be a simpler and more natural basin for gradient descent to find, regardless of whether situational awareness on its own is useful.
Second, LLM pre-training is, at least on its face, a highly myopic objective—it doesn’t directly incentivize models to optimize for anything over the long term. Since deceptive alignment requires non-myopia—the model has to care about getting something in the future for it to be worth acting deceptively now—and non-myopia isn’t directly incentivized by the training process here, making the model non-myopic is an additional step that gradient descent has to take on the way to deception, one that doesn’t simply happen by default.
That said, to be capable of doing well on the pre-training task, it is quite important to be able to predict highly non-myopic agents—namely humans. As a result, we should expect that pre-trained LLMs will have the cognitive machinery necessary to be capable of doing long-term optimization, even if they don’t by default use that machinery for anything other than myopic prediction. And as long as the machinery is present, it might be relatively easy for gradient descent to repurpose the machinery into making the model itself deceptive rather than just using it for prediction. We think that this sort of repurposing could become especially likely in the presence of RL fine-tuning, as we will discuss shortly [AF · GW].
Third, it is worth pointing out that one potentially tricky factor here is that there is a sense in which we are actually making the deceptive model’s job easier. If we are primarily using our model to predict counterfactual humans, the fact that they are counterfactual means that we can’t directly check whether or not our model is correct. This is in contrast to a situation where we have access to some ground truth that we can, for example, occasionally use to spot check the model. In the absence of any ability to do ground-truth spot-checking like that, a deceptive model has a lot more leeway to essentially say whatever it wants and get away with it. In our opinion, however, relying on such checks is highly fraught, for a number of reasons:
- they depend on the deceptive model not knowing when you’re going to check it or not;
- they require that it not be possible for a deceptive model to ever cause a catastrophe using only a single output;
- they rely on there not being other exploitable side-channels;
- and even if you set everything up right, the deceptive model can just gamble that you probably won’t check on it on any particular input.
As a result, we are generally wary of any approach that attempts to contain deceptive models once created, and are instead in favor of approaches that attempt to prevent their creation in the first place—e.g. via having inductive biases that favor the non-deceptive models over the deceptive models, as we think could plausibly be the case here.
Moreover, there are plausible spot-checks we could perform anyway. For instance, we could check to see if the way the predictions play out is consistent with what we see in our world, even though the worlds being predicted are counterfactual. This is necessarily fuzzy, as we ourselves are not perfect predictive agents, but we do know some things about how the world changes with time and it probably makes sense to check that those are satisfied before relying too heavily on a predictive model.
The RLHF conditioning hypothesis
As we will discuss later [AF · GW], reinforcement-learning-based fine-tuning approaches—particularly reinforcement learning from human feedback (RLHF)—provide a potentially very flexible way to condition models [AF · GW].
The primary problem, however, is that it’s very hard for us to know whether the result of such a fine-tuning procedure is a predictive model implementing a conditional selected to get high reward, or a potentially deceptive agent optimizing for high reward. And these are very different: if it’s a predictive model, then we can use all of the techniques discussed previously to pick a conditional that we think is safe and then select our reward to get that conditional; if it’s an agent, however, then we have no reason to believe that any amount of changing the conditional will prevent it from continuing to optimize for its objective, potentially deceptively.
We’ll call the hypothesis that RLHF is well-modeled as producing a predictive model that is implementing a particular conditional of a pre-trained distribution—rather than it e.g. producing some sort of agent—the RLHF conditioning hypothesis. One of the primary open questions we’d like to answer is the truth of this hypothesis.
We think that the RLHF conditioning hypothesis is plausible, but uncertain. To attempt to understand the likelihood of the RLHF conditioning hypothesis being true, we’ll look at both a high and low path-dependence [AF · GW] story for how RLHF could turn a predictive model into a deceptive agent.
First, the basic high path-dependence case for how we think RL fine-tuning could lead to an agent that is no longer well-described as a predictive model is if RL fine-tuning results in gradient descent effectively repurposing existing optimization machinery—e.g. that was previously just used for predicting agents—into being a direct component of the way the model computes its actions. For example, we could imagine a situation where we do RLHF on some sort of helpfulness objective. At the beginning, we just get a prediction of a particular very helpful agent, but then gradient descent realizes that we don’t actually need to model the helpful agent in its entirety and can just cut everything out other than its core optimization process.
Second, for the low path-dependence case, the problem is that the complexity of describing the right sort of “helpful agent” could become larger than the complexity of just implementing a highly agentic process directly. The idea is that, once you are doing RLHF for some task, instead of asking for general prediction, you are now asking for prediction under a particular conditional, and that conditional could be quite complex, potentially substantially more complex than a deceptive agent with some sort of simple goal that just infers whatever conditional is necessary to get high reward (for the purpose of being selected for by the training process).
Concretely, there are two differences between pre-training and RLHF that are causing us issues here. In the high path-dependence case, the problem is that the RLHF model no longer has to include the full complexity of the original prediction task, just the narrower task of predicting a particular agent. In the low path-dependence case, the problem is that the RLHF model has to pay for the complexity of implementing the conditional, whereas in the case of literal prompting the conditional is just given to the model without it having to encode it internally.
Though we think that these problems do present serious difficulties to the RLHF conditioning hypothesis, there are potential ways around them. In general, we think that the RLHF conditioning hypothesis is substantially more likely to hold in the high path-dependence case compared to the low path-dependence case, for the simple reason that the main thing you have going for you in the RLHF case is that you’re starting from a model you think is actually predictive, which should matter a lot more in the high path-dependence case compared to the low path-dependence case. Furthermore, in the high path-dependence case, we can try to get around the problem of the RLHF model not having to solve the full prediction task just via basic regularization techniques: if we intersperse pre-training with RLHF training, or use some sort of KL regularization (as we discuss in Section 5 [AF · GW]), we should be able to force the RLHF model to still be capable of the more general prediction task.
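As a concrete example of the sort of KL regularization mentioned above, one standard approach penalizes the fine-tuned model for drifting away from the frozen pre-trained predictive distribution. The sketch below is illustrative only, with made-up names and shapes, and is not a description of any particular lab's RLHF setup:

```python
import torch
import torch.nn.functional as F

def kl_regularized_reward(reward, policy_logits, pretrained_logits, beta=0.1):
    """Penalize drift from the pre-trained predictive distribution.

    reward:            (batch,) scalar task reward from the RLHF reward model
    policy_logits:     (batch, seq, vocab) logits from the model being fine-tuned
    pretrained_logits: (batch, seq, vocab) logits from the frozen pre-trained model
    """
    policy_logp = F.log_softmax(policy_logits, dim=-1)
    pretrained_logp = F.log_softmax(pretrained_logits, dim=-1)
    # Per-token KL(policy || pretrained), summed over vocab and sequence.
    kl = (policy_logp.exp() * (policy_logp - pretrained_logp)).sum(dim=-1).sum(dim=-1)
    return reward - beta * kl
```

The KL term anchors the fine-tuned model to the pre-trained predictive distribution, which is one way to try to force the RLHF model to remain capable of the more general prediction task.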
Finally, it’s worth pointing out that if the RLHF conditioning hypothesis is false, there may still be other ways to get the sort of implicit conditionals that we would otherwise want RLHF for. Sequential reasoning techniques such as chain of thought prompting, as we discussed previously [AF · GW], or alternatives like soft prompting, as we discuss in the open problems [AF · GW], could substitute for RLHF. Furthermore, if the RLHF conditioning hypothesis is false, that’s a problem for all RLHF-based approaches, since it leaves us with no reason to expect that RLHF won’t just produce deceptive models.
Dealing with internal cognitive resource management
In addition to avoiding deceptive alignment and ensuring we get the right sort of “camera” model, there are a couple of other inner alignment concerns, related to the difficulties of internal cognitive resource management, that are worth addressing here as well.
First, trying to use aligned predictors rather than aligned agents opens us up to potential tiling/forwarding [AF · GW] problems. That is: even a prediction model outputting predictions in exactly the right way might internally decide to do a search over algorithms to solve some problem and in so doing find and run something misaligned.
Previously [AF · GW], we mentioned the importance from a competitiveness perspective of prediction models being able to efficiently manage their own internal cognitive resources. However, one way of managing internal cognitive resources is running internal searches over programs, and if that happens the search could potentially find something misaligned, which would be quite problematic. Additionally, what makes this sort of problem harder in the case of prediction models compared to aligned agents is that you can’t just solve the problem by having the model want to solve it: an aligned agent doesn’t want to build a misaligned AI because that would be bad, but a prediction model doesn’t care; it just outputs predictions.
There are a couple of potential ways to address these sorts of tiling/forwarding problems. To start with, you could try to get the model to audit itself in some capacity. If, by the time the model is capable of running such internal searches over algorithms, it is also capable of auditing itself to discover such searches and ensure they’re done correctly, then oversight techniques like relaxed adversarial training [AF · GW] should solve the problem. If the sequencing goes the other way, however—if oversight is harder than running internal searches over algorithms—then oversight may not work. In that case, though, another option is to rely more on external cognitive resources rather than internal ones—e.g. via sequential reasoning techniques. In particular, you could get a model that is capable of doing powerful tasks with enough sequential reasoning without ever being capable of running an entire search over algorithms within a single forward pass. Furthermore, as long as the combination of all the individual steps of sequential reasoning becomes capable of overseeing algorithm search in a single forward pass before a single forward pass becomes capable of doing such a search on its own, oversight should work here as well.
Second, another potential inner alignment concern here is ensuring that a model efficiently managing its own internal cognitive resources doesn’t start to conclude that the rest of the world also counts as an “internal cognitive resource.” In that case, such a model might just start optimizing the entire world to help it make predictions. To be confident that something like this won’t happen, we likely need to be confident that our model will respect the correct Cartesian boundary between itself and the rest of the world, as discussed in “Agents Over Cartesian World Models [AF · GW].” As discussed in that post, while it is certainly possible to have agents with the correct Cartesian boundary, behaviorally distinguishing agents with the right Cartesian boundary from those with the wrong Cartesian boundary seems very difficult, at least without fooling the model in some way regarding its own situation. As a result, the hope here primarily has to rest on either the correct sorts of boundaries actually being more natural, or transparency and interpretability being able to ensure we get the right sort of boundary.
Transparency and interpretability
Overall, we think that LLM inner alignment should be substantially easier than for other kinds of models, though still not trivial. While we think it’s plausible that training predictive models simply won’t lead to deceptive alignment, having some way to at least verify that—e.g. via transparency and interpretability—would give us much more confidence.
Ideally, we’d like to be able to use transparency and interpretability tools to predict when models are likely to become deceptive by understanding whether they possess precursors to deceptive alignment like situational awareness, long-term goals, etc. Additionally, transparency and interpretability could also help with ensuring that predictive models conceptualize their “cameras” physically rather than acting as general inductors over data streams.
Different levels of transparency and interpretability—as described in “A transparency and interpretability tech tree [AF · GW]”—could help with this in different ways. Ideally, worst-case robust-to-training transparency could help us directly train for predictive models with the right sorts of cameras and generalization—but since we think there’s a reasonable chance that this approach simply works by default, worst-case inspection or training process transparency could be sufficient to just ensure that we get on the right track.
Even beyond that, in the limit of the most powerful tools (along the lines of solutions to ELK), we would be particularly excited by tools that let us intervene directly in the model’s understanding of the world. Such access could let us e.g. directly condition on the absence of other AIs, easily produce counterfactual histories, etc. Each of these abilities would directly resolve several significant concerns with conditioning predictive models. We don’t think that such capabilities are strictly necessary for this approach, however.
On (9), though we think that such proxy pseudo-alignment [? · GW] is plausible early on, we think that the actual prediction task is simple enough that eventually it should be possible to eliminate the gap between such proxies and the actual prediction task via additional training. ↩︎
Note that the only remaining plausible internal structure with a goal that isn’t some sort of prediction is the deceptively aligned agent. That’s because we think that deception is the main way that a non-predictor agent could perform well on the pre-training task of next token prediction. In particular, for an agent to perform well on token prediction, it must have some goal such that producing good predictions does well under that goal. Thus, if the agent’s goal is not to actually do prediction, then it must either a) sacrifice some performance by trying to do the wrong thing or b) choose to output good predictions only as an instrumental goal for the purpose of sticking around in the training process. ↩︎
This is also true for the deceptive models—they too can have whatever long-term objective they’re tracking implemented via either an internal alignment or corrigible alignment mechanism (that is, either hardcoded directly or specified via a pointer). ↩︎
Alternatively, if we don’t want to have to change the data collection procedure, we could also try to train a separate camera model that we place on top of our main predictive model that takes some input with information about the world and outputs the data that the “cameras” would show given that information. From the perspective of the predictive model, inputs to the camera model would still effectively be a type of camera, but it might be a substantially simpler one, which could help. However, there is still the problem of actually getting the camera model to learn the right thing—but if we can separately train the camera model with e.g. a strong speed bias, it might be hard for it to learn something like a general inductor. ↩︎
It is possible that a potential lack of incentive to learn situational awareness could pose a competitiveness problem. In our opinion, however, we think that you should be able to get the necessary competitiveness properties that you might want situational awareness for from predicting humans. In particular, you could get a situation where the model is predicting situationally-aware humans despite not itself being situationally aware. For example: if we plug in a description of the model's situation and then condition on what a human would do in response to that, the model need not be situationally aware itself to produce a good response, it only needs the human to be situationally aware. ↩︎
Some ways in which situational awareness could improve performance on next token prediction include: modeling the data curation process, helping predict other AIs via the model introspecting on its own structure, the same thing for ML papers, predicting the actions of AI labs by understanding how their AIs work, the model predicting its own output if any such output shows up in training (e.g. via RLHF), etc. ↩︎
2 comments
comment by Jacob Pfau (jacob-pfau) · 2023-02-09T23:17:01.526Z · LW(p) · GW(p)
the incentive for a model to become situationally aware (that is, to understand how it itself fits into the world) is only minimally relevant to performance on the LLM pre-training objective (though note that this can cease to be true if we introduce RL fine-tuning).
Why is this supposed to be true? Intuitively, this seems to clash with the authors’ view that anthropic reasoning is likely to be problematic. From another angle, I expect the performance gain from situational awareness to increase as dataset cleaning/curation increases. Dataset cleaning has increased in stringency over time. As a simple example, see my post on dataset deduplication and situational awareness [LW · GW].
↑ comment by evhub · 2023-02-10T19:57:05.617Z · LW(p) · GW(p)
To be clear, I think situational awareness is relevant in pre-training, just less so than in many other cases (e.g. basically any RL setup, including RLHF) where the model is acting directly in the world (and when exactly in the model's development it gets an understanding of the training process matters a lot for deceptive alignment [LW · GW]).
From footnote 6 above:
Some ways in which situational awareness could improve performance on next token prediction include: modeling the data curation process, helping predict other AIs via the model introspecting on its own structure, the same thing for ML papers, predicting the actions of AI labs by understanding how their AIs work, the model predicting its own output if any such output shows up in training (e.g. via RLHF), etc.