Counting arguments provide no evidence for AI doom

post by Nora Belrose (nora-belrose), Quintin Pope (quintin-pope) · 2024-02-27T23:03:49.296Z · LW · GW · 188 comments

  The counting argument for overfitting
    Dancing through a minefield of bad networks
    Against the indifference principle
  Against goal realism
    Goal slots are expensive
    Inner goals would be irrelevant
    Goal realism is anti-Darwinian
    Goal reductionism is powerful
  Other arguments for scheming
    Simplicity arguments
  Conclusion

Crossposted from the AI Optimists blog.

AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests. The worry is that if a schemer escapes, it may seek world domination to ensure humans do not interfere with its plans, whatever they may be.

In this essay, we debunk the counting argument— a central reason to think AIs might become schemers, according to a recent report by AI safety researcher Joe Carlsmith.^[1] It’s premised on the idea that schemers can have “a wide variety of goals,” while the motivations of a non-schemer must be benign by definition. Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:

The non-schemer model classes, here, require fairly specific goals in order to get high reward.
By contrast, the schemer model class is compatible with a very wide range of (beyond episode) goals, while still getting high reward…
In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.
So, other things equal, we should expect SGD to select a schemer.
— Scheming AIs, page 17

We begin our critique by presenting a structurally identical counting argument for the obviously false conclusion that neural networks should always memorize their training data, while failing to generalize to unseen data. Since the premises of this parody argument are actually stronger than those of the original counting argument, this shows that counting arguments are generally unsound in this domain.

We then diagnose the problem with both counting arguments: they rest on an incorrect application of the principle of indifference, which says that we should assign equal probability to each possible outcome of a random process. The indifference principle is controversial, and is known to yield absurd and paradoxical results in many cases. We argue that the principle is invalid in general, and show that the most plausible way of resolving its paradoxes also rules out its application to an AI’s behaviors and goals.

More generally, we find that almost all arguments for taking scheming seriously depend on unsound indifference reasoning. Once we reject the indifference principle, there is very little reason left to worry that future AIs will become schemers.

The counting argument for overfitting

Counting arguments often yield absurd conclusions. For example:

Neural networks must implement fairly specific functions in order to generalize beyond their training data.
By contrast, networks that overfit to the training set are free to do almost anything on unseen data points.
In this sense, there are “more” models that overfit than models that generalize.
So, other things equal, we should expect SGD to select a model that overfits.

This isn’t a merely hypothetical argument. Prior to the rise of deep learning, it was commonly assumed that models with more parameters than data points would be doomed to overfit their training data. The popular 2006 textbook Pattern Recognition and Machine Learning uses a simple example from polynomial regression: there are infinitely many polynomials of order equal to or greater than the number of data points which interpolate the training data perfectly, and “almost all” such polynomials are terrible at extrapolating to unseen points.
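The textbook point can be made concrete with a minimal Python sketch. The data (four samples of the line y = x) and the function names are illustrative assumptions, not taken from the textbook. Adding any multiple of the polynomial that vanishes on every training point leaves the training fit perfect while changing the prediction elsewhere.

```python
from fractions import Fraction

def lagrange(xs, ys):
    """Lowest-degree polynomial through the points (xs[i], ys[i])."""
    def p(x):
        total = Fraction(0)
        for i, (xi, yi) in enumerate(zip(xs, ys)):
            term = Fraction(yi)
            for j, xj in enumerate(xs):
                if j != i:
                    term *= Fraction(x - xj, xi - xj)
            total += term
        return total
    return p

def interpolant(xs, ys, c):
    """p(x) + c * (x - x_0)...(x - x_{n-1}); fits the data for every c."""
    p = lagrange(xs, ys)
    def q(x):
        bump = Fraction(1)
        for xi in xs:
            bump *= (x - xi)  # zero at every training input
        return p(x) + c * bump
    return q

# Toy data (illustrative assumption): four samples of the line y = x.
XS = [0, 1, 2, 3]
YS = [0, 1, 2, 3]

for c in (0, 1, -5):
    q = interpolant(XS, YS, c)
    fits = all(q(x) == y for x, y in zip(XS, YS))
    print(f"c={c:>2}: fits training data={fits}, prediction at x=4 is {q(4)}")
```

Every value of c gives a perfect interpolant, yet the prediction at x = 4 is 4 + 24c, so only c = 0 extrapolates the line correctly.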

Let’s see what the overfitting argument predicts in a simple real-world example from Caballero et al. (2022), where a neural network is trained to solve 4-digit addition problems. There are 10,000² = 100,000,000 possible pairs of input numbers, and 19,999 possible sums, for a total of 19,999^{100,000,000} ≈ 1.10 × 10^{430,100,828} possible input-output mappings.^[2] They used a training dataset of 992 problems, so there are 19,999^{100,000,000 – 992} ≈ 2.75 × 10^{430,096,561} functions that achieve perfect training accuracy, and the proportion with greater than 50% test accuracy is literally too small to compute using standard high-precision math tools.^[3] Hence, this argument predicts that virtually all networks trained on this problem should massively overfit, contradicting the empirical result that networks do generalize to the test set.
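The counts in this paragraph are far too large to hold as ordinary numbers, but they are easy to check in log space. The following is a small sketch of that arithmetic; the variable and function names are our own, and the input sizes are the ones quoted above.

```python
import math

NUM_INPUT_PAIRS = 10_000 ** 2   # pairs of 4-digit operands (0000 to 9999)
NUM_SUMS = 19_999               # possible sums, 0 through 19,998
NUM_TRAIN = 992                 # training problems quoted in the text

def scientific(log10_value):
    """Split log10(x) into (mantissa, exponent) so that x = mantissa * 10**exponent."""
    exponent = math.floor(log10_value)
    return 10 ** (log10_value - exponent), exponent

# log10 of the number of all input-output mappings.
log10_all = NUM_INPUT_PAIRS * math.log10(NUM_SUMS)
# log10 of the number of mappings that agree with the training set.
log10_fit = (NUM_INPUT_PAIRS - NUM_TRAIN) * math.log10(NUM_SUMS)
# Information needed to pick out a single mapping, in megabytes.
megabytes = NUM_INPUT_PAIRS * math.log2(NUM_SUMS) / 8 / 1e6

for label, value in (("all mappings", log10_all), ("fit training set", log10_fit)):
    mantissa, exponent = scientific(value)
    print(f"{label}: about {mantissa:.2f} x 10^{exponent:,}")
print(f"bits per mapping: about {megabytes:.0f} MB")
```

Working with logarithms avoids ever materializing the 430-million-digit integers, which is why the exponents can be verified on any laptop.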

The argument also predicts that larger networks— which can express a wider range of functions, most of which perform poorly on the test set— should generalize worse than smaller networks. But empirically, we find the exact opposite result: wider networks usually generalize better, and never generalize worse, than narrow networks.^[4] These results strongly suggest that SGD is not doing anything like sampling uniformly at random from the set of representable functions that do well on the training set.

More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.

These results clearly show that the conclusion of our parody argument is false. Neural networks almost always learn genuine patterns in the training set which do generalize, albeit imperfectly, to unseen test data.

Dancing through a minefield of bad networks

One possible explanation for these results is that deep networks simply can’t represent functions that fail to generalize, so we shouldn’t include misgeneralizing networks in the space of possible outcomes. But it turns out this hypothesis is empirically false.

Tom Goldstein and colleagues have found it’s possible to find misgeneralizing neural nets by adding a term to the loss function which explicitly rewards the network for doing poorly on a validation set. The resulting “poisoned” models achieve near perfect accuracy on the training set while doing no better than random chance on a held out test set.^[5] What’s more, the poisoned nets are usually quite “close” in parameter space to the generalizing networks that SGD actually finds— see the figure below for a visualization.

Dancing through a minefield of bad minima: we train a neural net classifier and plot the iterates of SGD after each tenth epoch (red dots). We also plot locations of nearby “bad” minima with poor generalization (blue dots). We visualize these using t-SNE embedding. All blue dots achieve near perfect train accuracy, but with test accuracy below 53% (random chance is 50%). The final iterate of SGD (yellow star) also achieves perfect train accuracy, but with 98.5% test accuracy. Miraculously, SGD always finds its way through a landscape full of bad minima, and lands at a minimizer with excellent generalization.

Against the indifference principle

What goes wrong in the counting argument for overfitting, then? Recall that both counting arguments involve an inference from “there are ‘more’ networks with property X” to “networks are likely to have property X.” This is an application of the principle of indifference, which says that one should assign equal probability to every possible outcome of a random process, in the absence of a reason to think certain outcomes are favored over others.^[6]

The indifference principle gets its intuitive plausibility from simple cases like fair coins and dice, where it seems to give the right answers. But the only reason coin-flipping and die-rolling obey the principle of indifference is that they are designed by humans to behave that way. Dice are specifically built to land on each side ⅙ of the time, and if off-the-shelf coins were unfair, we’d choose some other household object to make random decisions. Coin flips and die rolls, then, can’t be evidence for the validity of the indifference principle as a general rule of probabilistic reasoning.

The principle fails even in these simple cases if we carve up the space of outcomes in a more fine-grained way. As a coin or a die falls through the air, it rotates along all three of its axes, landing in a random 3D orientation. The indifference principle suggests that the resting states of coins and dice should be uniformly distributed between zero and 360 degrees for each of the three axes of rotation. But this prediction is clearly false: dice almost never land standing up on one of their corners, for example.

Even worse, by coarse-graining the possibilities, we can make the indifference principle predict that any event has a 50% chance of occurring (“either it happens or it doesn’t”). In general, indifference reasoning produces wildly contradictory results depending on how we choose to cut up the space of outcomes.^[7] This problem is serious enough to convince most philosophers that the principle of indifference is simply false.^[8] On this view, neither counting argument can get off the ground, because we cannot infer that SGD is likely to select the kinds of networks that are more numerous.
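The partition dependence is easy to exhibit numerically. The sketch below applies indifference to a die under two partitions, and to a simplified version of van Fraassen's cube factory (a factory making cubes with side length up to 2, where we ask how likely a cube is to have side at most 1). The specific numbers and function names are illustrative assumptions.

```python
from fractions import Fraction

def indifference(outcomes, event):
    """Probability of `event` if every listed outcome is treated as equally likely."""
    favorable = sum(1 for outcome in outcomes if event(outcome))
    return Fraction(favorable, len(outcomes))

# The same event ("the die shows a six") under two ways of carving up outcomes.
p_fine = indifference([1, 2, 3, 4, 5, 6], lambda face: face == 6)
p_coarse = indifference(["six", "not six"], lambda label: label == "six")

def cube_factory(power, threshold=1, upper=2):
    """P(side <= threshold) when side**power is uniform on (0, upper**power]."""
    return Fraction(threshold ** power, upper ** power)

p_by_side = cube_factory(1)     # indifferent over side length
p_by_area = cube_factory(2)     # indifferent over face area
p_by_volume = cube_factory(3)   # indifferent over volume

print("die:", p_fine, "vs", p_coarse)
print("cube factory:", p_by_side, p_by_area, p_by_volume)
```

One event, three or more incompatible probabilities: the principle gives no guidance on which parametrization is the right one to be indifferent over.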

Against goal realism

Even if you’re inclined to accept some form of indifference principle, it’s clear that its applicability must be restricted in order to avoid paradoxes. For example, philosopher Michael Huemer suggests that indifference reasoning should only be applied to explanatorily fundamental variables. That is, if X is a random variable which causes or “explains” another variable Y, we might be able to apply the indifference principle to X, but we definitely can’t apply it to Y.^[9]

While we don’t accept Huemer’s view, it seems like many people worried about scheming do implicitly accept something like it. As Joe Carlsmith explains:

…some analyses of schemers talk as though the model has what we might call a “goal-achieving engine” that is cleanly separable from what we might call its “goal slot,” such that you can modify the contents of the goal slot, and the goal-achieving engine will be immediately and smoothly repurposed in pursuit of the new goal.
— Scheming AIs, page 55

Here, the goal slot is clearly meant to be causally and explanatorily prior to the goal-achieving engine, and hence to the rest of the AI’s behavior. On Huemer’s view, this causal structure would validate the application of indifference reasoning to goals, but not to behaviors, thereby breaking the symmetry between the counting arguments for overfitting and for scheming. We visually depict this view of AI cognition on the left-hand side of the figure below.

We’ll call the view that goals are explanatorily fundamental, “goal realism.” On the opposing view, which we’ll call goal reductionism, goal-talk is just a way of categorizing certain patterns of behavior. There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior. If the AI becomes self-aware, it might even attribute goals to itself— but either way, the behaviors come first, and goal-attribution happens later.

Notably, some form of goal reductionism seems to be quite popular among naturalistic philosophers of mind, including Dan Dennett,^[10] Paul and Patricia Churchland, and Alex Rosenberg.^[11] Readers who are already inclined to accept reductionism as a general philosophical thesis, as Eliezer Yudkowsky does, should probably accept reductionism about goals.^[12] And even if you’re not a global reductionist, there are pretty strong reasons for thinking behaviors are more fundamental than goals, as we’ll see below.

Goal slots are expensive

Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?

Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,

In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals. It requires you to not take advantage of structure in the problem space nor specialize your algorithms to exploit this structure. It requires you not to amortize specific recurring patterns for one task at the expense of preserving generality across tasks.
This is a special case of the tradeoff between specificity and generality and a consequence of the no-free-lunch theorem. Specialization to do really well at one or a few things can be done relatively cheaply…
Because of this it does not really make sense to think of full [separation] as the default case we should expect, nor the ideal case to strive for.
— Orthogonality is Expensive

We have good reason, then, to think that future AIs will not have the kind of architecture that makes goal realism superficially plausible. And as we will see below, goal realism fails even for AIs with explicit internal “goals” and search procedures.

Inner goals would be irrelevant

The idea of AI scheming was introduced in its modern form in the paper Risks from Learned Optimization. It describes systems with inner goals as “internally searching through a search space [..] looking for those elements that score high according to some objective function that is explicitly represented within the system”. But even if we accept that future ML systems will develop such an internal process, it’s not clear that this inner objective function would have much relation to the goals which best describe the system’s actual behavior.

For example, imagine a hospital’s medical operation planning system that internally looks like “query an LLM for N possible operation plans, feed each plan to a classifier trained to estimate the odds of a given plan leading to a medical malpractice lawsuit, and pick the plan that scores lowest.”^[13] The overall behavioral goal of this system need not be well described as “minimize the odds of a malpractice lawsuit.” Instead, whatever behavioral goals we should ascribe to the system will depend on the interaction between the LLM’s prior over operation plans and the classifier’s scoring of those plans. The “inner objective” acts less like the “true goal of the system,” and more like a Bayesian update on the LLM’s prior over operation plans.

Even as we increase the power of the inner optimizer by setting N to arbitrarily large values, it seems unlikely that the resulting system would generate plans that actively sought to minimize the probability of a malpractice lawsuit in flexible, creative or dangerous ways. Rather, the system would produce adversarial examples to the classifier, such as instructing staff to constantly clean a surgical room, and not actually perform a surgery.

In particular, increasing N would not lead to the system producing “scheming” plans to advance the inner objective. A classifier trained to distinguish between historical medical plans that led to malpractice lawsuits versus those that didn’t, is not going to assign extreme scores to plans like “hack into the server hosting me in order to set the classifier scores to extreme values” or “blackmail the hospital administration into canceling all future surgeries”, because such plans do not optimally express the simple features that distinguish safe versus risky plans in the training data (e.g., mentions of blackmail / hacking could be replaced with mentions of checking procedure / cleaning / etc).
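The hospital planner can be sketched as a best-of-N loop. Everything in this toy is an illustrative assumption: the four plans, their probabilities under the stand-in "LLM" prior, and the stand-in classifier scores. It is meant only to show that the selected plan depends on both the prior and the classifier, not to model any real system.

```python
import random

# Toy stand-ins (all values are illustrative assumptions):
# plan -> (probability under the "LLM" prior, classifier's lawsuit-risk score)
PLANS = {
    "standard surgery": (0.69, 0.30),
    "surgery with extra checklists": (0.25, 0.10),
    "clean the room repeatedly, never operate": (0.05, 0.01),
    "hack the server hosting the classifier": (0.01, 0.40),
}

def best_of_n(n, rng):
    """Sample n plans from the prior and return the one the classifier scores lowest."""
    names = list(PLANS)
    prior = [PLANS[name][0] for name in names]
    candidates = rng.choices(names, weights=prior, k=n)
    return min(candidates, key=lambda name: PLANS[name][1])

rng = random.Random(0)
for n in (1, 10, 1000):
    print(f"N={n:>4}: {best_of_n(n, rng)}")
```

In this toy, raising N pushes the output toward the classifier's adversarial example (endless cleaning), while the hacking plan is sampled but never wins, because the classifier has no reason to give it an extreme score.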

The point: even arbitrary amounts of internal optimization power directed towards a simple inner objective can fail to lead to any sort of “globally coherent” pursuit of that objective in the system’s actual behaviors. The goal realist perspective relies on a trick of language. By pointing to a thing inside an AI system and calling it an “objective”, it invites the reader to project a generalized notion of “wanting” onto the system’s imagined internal ponderings, thereby making notions such as scheming seem more plausible.

However, the actual mathematical structure being posited doesn’t particularly support such outcomes. Why assume emergent “inner objectives” will support creative scheming when “optimized for”? Why assume that internal classifiers that arose to help encourage correct outputs during training would have extrema corresponding to complex plans that competently execute extremely out-of-distribution actions in the real world? The extrema of deliberately trained neural classifiers do not look anything like that. Why should emergent internal neural classifiers be so different?

Goal realism is anti-Darwinian

Goal realism can lead to absurd conclusions. It led the late philosopher Jerry Fodor to attack the theory of natural selection on the grounds that it can’t resolve the underdetermination of mental content. Fodor pointed out that nature has no way of selecting, for example, frogs that “aim at eating flies in particular” rather than frogs that target “little black dots in the sky,” or “things that smell kind of like flies,” or any of an infinite number of deviant, “misaligned” proxy goals which would misgeneralize in counterfactual scenarios. No matter how diverse the ancestral environment for frogs might be, one can always come up with deviant mental contents which would produce behavior just as adaptive as the “intended” content:

…the present point is often formulated as the ‘disjunction problem’. In the actual world, where ambient black dots are quite often flies, it is in a frog’s interest to snap at flies. But, in such a world, it is equally in the frog’s interest to snap at ambient black dots. Snap for snap, snaps at the one will net you as many flies to eat as snaps at the other. Snaps of which the [targets] are black dots and snaps whose [targets] are flies both affect a frog’s fitness in the same way and to the same extent. Hence the disjunction problem: what is a frog snapping at when it, as we say, snaps at a fly?
— Against Darwinism, page 4 [emphasis added]

As Rosenberg (2013) points out, Fodor goes wrong by assuming there exists a real, objective, perfectly determinate “inner goal” whose content must be pinned down by the selection process.^[14] But the physical world has no room for goals with precise contents. Real-world representations are always fuzzy, because they are human abstractions derived from regularities in behavior.

Like contemporary AI pessimists, Fodor’s goal realism led him to believe that selection processes face an impossibly difficult alignment problem— producing minds whose representations are truly aimed at the “correct things,” rather than mere proxies. In reality, the problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter. In virtue of their aligned behavior, these systems will be “aimed at the right things” in every sense that matters in practice.

Goal reductionism is powerful

Under the goal reductionist perspective, it’s easy to predict an AI’s goals. Virtually all AIs, including those trained via reinforcement learning, are shaped by gradient descent to mimic some training data distribution.^[15] Some data distributions illustrate behaviors that we describe as “pursuing a goal.” If an AI models such a distribution well, then trajectories sampled from its policy can also be usefully described as pursuing a similar goal to the one illustrated by the training data.

The goal reductionist perspective does not answer every possible goal-related question we might have about a system. AI training data may illustrate a wide range of potentially contradictory goal-related behavioral patterns. There are major open questions, such as which of those patterns become more or less influential in different types of out-of-distribution situations, how different types of patterns influence the long-term behaviors of “agent-GPT” setups, and so on.

Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data. By contrast, goal realism leads only to unfalsifiable speculation about an “inner actress” with utterly alien motivations.

Other arguments for scheming

In comments on an early draft of this post, Joe Carlsmith emphasized that the argument he finds most compelling is what he calls the “hazy counting argument,” as opposed to the “strict” counting argument we introduced earlier. But we think our criticisms apply equally well to the hazy argument, which goes as follows:

It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.
So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight.
— Scheming AIs, page 17

Joe admits this argument is “not especially principled.” We agree: it relies on applying the indifference principle— itself a dubious assumption— to an ill-defined set of “ways” a model could develop throughout training. There is also a hazy counting argument for overfitting:

It seems like there are “lots of ways” that a model could end up massively overfitting and still get high training performance.
So absent some additional story about why training won’t select an overfitter, it feels like the possibility should be getting substantive weight.

While many machine learning researchers have felt the intuitive pull of this hazy overfitting argument over the years, we now have a mountain of empirical evidence that its conclusion is false. Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful.

Simplicity arguments

Joe also discusses simplicity arguments for scheming, which suppose that schemers may be “simpler” than non-schemers, and therefore more likely to be produced by SGD. Specifically, since schemers are free to have almost any goal that will motivate them to act aligned during training, SGD can give them very simple goals, whereas a non-schemer has to have more specific, and therefore more complex, goals.

There are several problems with this argument. The first is that “simplicity” is a highly ambiguous term, and it’s not clear which, if any, specific notion of simplicity should be relevant here. One reasonable definition of “simple” is “low description length,” which directly implies “more likely” if we assume the language in which the hypotheses are being described is efficient (assigns short encodings to likely hypotheses). But on this view, simplicity is simply another word for likelihood: we can’t appeal to our intuitive notions of simplicity to conclude that one hypothesis will truly be “simpler” and hence more likely.
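The link between description length and probability can be spelled out with ideal code lengths. In the sketch below, the four hypotheses and their probabilities are hypothetical; the only fact used is that an efficient code assigns a hypothesis of probability p a length of about -log2(p) bits.

```python
import math

def code_length_bits(probability):
    """Ideal code length, in bits, for an outcome with the given probability."""
    return -math.log2(probability)

# Hypothetical prior over four hypotheses (illustrative assumption).
PRIOR = {"h1": 0.5, "h2": 0.25, "h3": 0.125, "h4": 0.125}

lengths = {name: code_length_bits(p) for name, p in PRIOR.items()}
for name, bits in lengths.items():
    print(f"{name}: probability {PRIOR[name]}, description length {bits:.0f} bits")

# Going the other way: the lengths determine the probabilities.
recovered = {name: 2 ** -bits for name, bits in lengths.items()}
```

Because the mapping runs in both directions, calling a hypothesis "short" in an efficient language is just another way of calling it probable; the lengths carry no independent information.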

Alternatively, one could appeal to the actual inductive biases of neural network training, as observed empirically or derived theoretically. We will address this question in greater detail in a future post. However, we believe that current evidence about inductive biases points against scheming for a variety of reasons. Very briefly:

- Modern deep neural networks are ensembles of shallower networks. Scheming seems to involve chains of if-then reasoning which would be hard to implement in shallow networks.
- Networks have a bias toward low-frequency functions, that is, functions whose outputs change little as their inputs change. But scheming requires the AI to change its behavior dramatically (executing a treacherous turn) in response to subtle cues indicating it is not in a sandbox and could successfully escape.
- There’s no plausible account of inductive biases that does support scheming. The current literature on scheming appears to have been inspired by Paul Christiano’s speculations about malign intelligences in Solomonoff induction, a purely theoretical model of probabilistic reasoning which is provably unrealizable in the real world.^[16] Neural nets look nothing like this.
- In contrast, points of comparison that are more relevant to neural network training, such as isolated brain cortices, don’t scheme. Your linguistic cortex is not “instrumentally pretending to model linguistic data in pursuit of some hidden objective.”

We can also construct an analogous simplicity argument for overfitting:

Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.

Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.

Conclusion

In this essay, we surveyed the main arguments that have been put forward for thinking that future AIs will scheme against humans by default. We find all of them seriously lacking. We therefore conclude that we should assign very low credence to the spontaneous emergence of scheming in future AI systems— perhaps 0.1% or less.

^[1]
On page 21 of his report, Carlsmith writes: ‘I think some version of the “counting argument” undergirds most of the other arguments for expecting scheming that I’m aware of (or at least, the arguments I find most compelling). That is: schemers are generally being privileged as a hypothesis because a very wide variety of goals could in principle lead to scheming…’
^[2]
Each mapping would require roughly 179 megabytes of information to specify.
^[3]
It underflows to zero in the Python mpmath library, and WolframAlpha times out.
^[4]
This is true when using the maximal update parametrization (µP), which scales the initialization variance and learning rate hyperparameters to match a given width.
^[5]
That is, the network’s misgeneralization itself generalizes from the validation set to the test set.
^[6]
Without an indifference principle, we might think that SGD is strongly biased toward producing non-schemers, even if there are “more” schemers.
^[7]
Other examples include Bertrand’s paradox and van Fraassen’s cube factory paradox.
^[8]
“Probably the dominant response to the paradoxes of the Principle of Indifference is to declare the Principle false. It is said that the above examples show the Principle to be inconsistent.” — Michael Huemer, Paradox Lost, pg. 168
^[9]
“Given two variables, X and Y, if X explains Y, then the initial probability distribution for Y must be derived from that for X (or something even more fundamental). Here, by ‘initial probabilities’, I mean probabilities prior to relevant evidence. Thus, if we are applying the Principle of Indifference, we should apply it at the more fundamental level.” — Michael Huemer, Paradox Lost, pg. 175
^[10]
See the Wikipedia article on the intentional stance for more discussion of Dennett’s views.
^[11]
Rosenberg and the Churchlands are anti-realists about intentionality— they deny that our mental states can truly be “about” anything in the world— which implies anti-realism about goals.
^[12]
This is not an airtight argument, since a global reductionist may want to directly reduce goals to brain states, without a “detour” through behaviors. But goals supervene on behavior: an agent’s goal can’t change without a corresponding change in its behavior in some possible scenario. (If you feel inclined to deny this claim, note that a change in goals without a change in behavior in any scenario would have zero practical consequences.) If X supervenes on Y, that’s generally taken to be an indication that Y is “lower-level” than X. By contrast, it’s not totally clear that goals supervene on neural states alone, since a change in goals may be caused by a change in external circumstances rather than any change in brain state. For further discussion, see the SEP article on Externalism About the Mind and Alex Flint’s LessWrong post, “Where are intentions to be found?”
^[13]
Readers might object to this simple formulation for an inner optimizer and argue that any “emergent” inner objectives would be implemented differently, perhaps in a more “agenty” manner. Real inner optimizers are very unlikely to follow the simplified example provided here. Their optimization process is very unlikely to look like a single step of random search with sample size N.

However, real inner optimizers would still be similar in their core dynamics. Anything that looks like “internally searching through a search space [..] looking for those elements that score high according to some objective function that is explicitly represented within the system” is ultimately some method of using scores from an internal classifier to select for internal computations that have higher scores.

The system’s method of aligning internal representations with classifier scores may introduce some “inductive biases” that also influence the model’s internals. Any such “inductive bias” would only further undermine the goal realist perspective by further separating the actual behavioral goals the overall system pursues from the internal classifier’s scores.
^[14]
In this lecture, Fodor repeatedly insists that out of two perfectly correlated traits like “snaps at flies” (T1) and “snaps at ambient black dots” (T2) where only one of them “causes fitness,” there has to be a fact of the matter about which one is “phenotypic.”
^[15]
The correspondence between RL and probabilistic inference has been known for years. RL with KL penalties is better viewed as Bayesian inference, where the reward is “evidence” about what actions to take and the KL penalty keeps the model from straying too far from the prior. RL with an entropy bonus is also Bayesian inference, where the prior is uniform over all possible actions. Even when there is no regularizer, we can view RL algorithms like REINFORCE as a form of “generalized” imitation learning, where trajectories with less-than-expected reward are negatively imitated.
^[16]
Assuming hypercomputation is impossible in our universe.
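
The footnote on RL and probabilistic inference can be made concrete for a single decision with three actions. The sketch below assumes the standard KL-regularized objective, expected reward minus beta times KL(policy || prior), whose maximizer is proportional to prior × exp(reward / beta); that is the form of a Bayesian posterior with exp(reward / beta) playing the role of the likelihood. The prior, rewards, and beta are illustrative assumptions.

```python
import math

def kl_regularized_policy(prior, reward, beta):
    """Policy proportional to prior * exp(reward / beta), plus its normalizer."""
    weights = [p * math.exp(r / beta) for p, r in zip(prior, reward)]
    normalizer = sum(weights)
    return [w / normalizer for w in weights], normalizer

def objective(policy, prior, reward, beta):
    """Expected reward minus beta * KL(policy || prior)."""
    expected_reward = sum(q * r for q, r in zip(policy, reward))
    kl = sum(q * math.log(q / p) for q, p in zip(policy, prior) if q > 0)
    return expected_reward - beta * kl

# Illustrative assumptions: a three-action prior, rewards, and penalty strength.
PRIOR = [0.5, 0.3, 0.2]
REWARD = [0.0, 1.0, 2.0]
BETA = 0.5

posterior, normalizer = kl_regularized_policy(PRIOR, REWARD, BETA)
best = objective(posterior, PRIOR, REWARD, BETA)

# Compare against the untouched prior, the greedy policy, and a uniform policy.
alternatives = {
    "prior": PRIOR,
    "greedy": [0.0, 0.0, 1.0],
    "uniform": [1 / 3, 1 / 3, 1 / 3],
}
for name, policy in alternatives.items():
    print(f"{name}: {objective(policy, PRIOR, REWARD, BETA):.3f} (posterior: {best:.3f})")
```

The reweighted prior beats both the prior itself and the reward-greedy policy on the regularized objective, which is the sense in which the reward acts as evidence rather than as a goal pursued at any cost.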

188 comments

comment by Joe Carlsmith (joekc) · 2024-02-28T05:15:03.304Z · LW(p) · GW(p)

Thanks for writing this -- I’m very excited about people pushing back on/digging deeper re: counting arguments, simplicity arguments, and the other arguments re: scheming I discuss in the report. Indeed, despite the general emphasis I place on empirical work as the most promising source of evidence re: scheming, I also think that there’s a ton more to do to clarify and maybe debunk the more theoretical arguments people offer re: scheming – and I think playing out the dialectic further in this respect might well lead to comparatively fast progress (for all their centrality to the AI risk discourse, I think arguments re: scheming have received way too little direct attention). And if, indeed, the arguments for scheming are all bogus, this is super good news and would be an important update, at least for me, re: p(doom) overall. So overall I’m glad you’re doing this work and think this is a valuable post.

Another note up front: I don’t think this post “surveys the main arguments that have been put forward for thinking that future AIs will scheme.” In particular: both counting arguments and simplicity arguments (the two types of argument discussed in the post) assume we can ignore the path that SGD takes through model space [LW · GW]. But the report also discusses two arguments that don’t make this assumption – namely, the “training-game independent proxy goals story [LW · GW]” (I think this one is possibly the most common story, see e.g. Ajeya here, and all the talk about the evolution analogy) and the “nearest max-reward goal argument [LW · GW].” I think that the idea that “a wide variety of goals can lead to scheming” plays some role in these arguments as well, but not such that they are just the counting argument restated, and I think they’re worth treating on their own terms.

On counting arguments and simplicity arguments

Focusing just on counting arguments and simplicity arguments, though: Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks. I know that it has extremely ample situational awareness – e.g., it has highly detailed models of the world, the training process it’s undergoing, the future consequences of various types of power-seeking, etc – and that it’s getting high reward because it’s pursuing some goal (the report conditions on this [LW · GW]). Ok, what sort of goal?

We can think of arguments about scheming in two categories here.

(I) The first tries to be fairly uncertain/agnostic about what sorts of goals SGD’s inductive biases favor, and it argues that given this uncertainty, we should be pretty worried about scheming.
- I tend to think of my favored version of the counting argument (that is, the hazy counting argument) in these terms.
(II) The second type focuses on a particular story about SGD’s inductive biases and then argues that this bias favors schemers.
- I tend to think of simplicity arguments in these terms. E.g., the story is that SGD’s inductive biases favor simplicity, schemers can have simpler goals, so schemers are favored.

Let’s focus first on (I), the more-agnostic-about-SGD’s-inductive-biases type. Here’s a way of pumping the sort of intuition at stake in the hazy counting argument:

1. A very wide variety of goals can prompt scheming.
2. By contrast, non-scheming goals need to be much more specific to lead to high reward.
3. I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals.
4. So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer.

Now, as I mention in the report, I'm happy to grant that this isn't a super rigorous argument. But how, exactly, is your post supposed to comfort me with respect to it? We can consider two objections, both of which are present in/suggested by your post in various ways.

(A) This sort of reasoning would lead to you giving significant weight to SGD overfitting. But SGD doesn’t overfit, so this sort of reasoning must be going wrong, and in fact you should have low probability on SGD having selected a schemer, even given this ignorance about SGD's inductive biases.
(B): (3) is false: we know enough about SGD’s inductive biases to know that it actively favors non-scheming goals over scheming goals.

Let’s start with (A). I agree that this sort of reasoning would lead you to giving significant weight to SGD overfitting, absent any further evidence. But it’s not clear to me that giving this sort of weight to overfitting was unreasonable ex ante, or that having learned that SGD-doesn't-overfit, you should now end up with low p(scheming) even given your ongoing ignorance about SGD's inductive biases.

Thus, consider the sort of analogy I discuss in the counting arguments section [LW · GW]. Suppose that all we know is that Bob lives in city X, that he went to a restaurant on Saturday, and that city X has a thousand chinese restaurants, a hundred mexican restaurants, and one indian restaurant. What should our probability be that he went to a chinese restaurant?

In this case, my intuitive answer here is: “hefty.”^[1] In particular, absent further knowledge about Bob’s food preferences, and given the large number of chinese restaurants in the city, “he went to a chinese restaurant” seems like a pretty salient hypothesis. And it seems quite strange to be confident that he went to a non-chinese restaurant instead.
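To make the arithmetic behind "hefty" concrete, here is a minimal sketch of three ways one might turn the restaurant counts into a credence. The function names and the `preference` weights are hypothetical, introduced only for illustration; the comment itself does not commit to any of these procedures.

```python
from fractions import Fraction

# Counts from the example: restaurants of each cuisine in city X.
restaurants = {"chinese": 1000, "mexican": 100, "indian": 1}


def uniform_over_restaurants(counts):
    """Indifference over individual restaurants."""
    total = sum(counts.values())
    return {cuisine: Fraction(n, total) for cuisine, n in counts.items()}


def uniform_over_cuisines(counts):
    """Indifference over cuisines, ignoring how many restaurants each has."""
    return {cuisine: Fraction(1, len(counts)) for cuisine in counts}


def weighted_by_preference(counts, preference):
    """Counts reweighted by a (hypothetical) per-restaurant preference factor."""
    weights = {cuisine: counts[cuisine] * preference[cuisine] for cuisine in counts}
    total = sum(weights.values())
    return {cuisine: Fraction(w, total) for cuisine, w in weights.items()}


print(uniform_over_restaurants(restaurants)["chinese"])  # 1000/1101
print(uniform_over_cuisines(restaurants)["chinese"])     # 1/3
# Illustrative assumption: Bob likes any given mexican restaurant 50x more.
print(weighted_by_preference(restaurants, {"chinese": 1, "mexican": 50, "indian": 1})["chinese"])
```

Counting over restaurants gives roughly 0.91 for chinese, counting over cuisines gives 1/3, and any information about Bob's preferences can move the answer a long way in either direction.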

Ok but now suppose you learn that last week, Bob also engaged in some non-restaurant leisure activity. For such leisure activities, the city offers: a thousand movie theaters, a hundred golf courses, and one escape room. So it would’ve been possible to make a similar argument for putting hefty credence on Bob having gone to a movie. But lo, it turns out that actually, Bob went golfing instead, because he likes golf more than movies or escape rooms.

How should you update about the restaurant Bob went to? Well… it’s not clear to me you should update much. Applied to both leisure and to restaurants, the hazy counting argument is trying to be fairly agnostic about Bob’s preferences, while giving some weight to some type of “count.” Trying to be uncertain and agnostic does indeed often mean putting hefty probabilities on things that end up false. But: do you have a better proposed alternative, such that you shouldn’t put hefty probability on “Bob went to a chinese restaurant”, here, because e.g. you learned that hazy counting arguments don’t work when applied to Bob? If so, what is it? And doesn’t it seem like it’s giving the wrong answer?

Or put another way: suppose you didn’t yet know whether SGD overfits or not, but you knew e.g. about the various theoretical problems with unrestricted uses of the indifference principle. What should your probability have been, ex ante, on SGD overfitting? I’m pretty happy to say “hefty,” here. E.g., it’s not clear to me that the problem, re: hefty-probability-on-overfitting, was some a priori problem with hazy-counting-argument-style reasoning. For example: given your philosophical knowledge about the indifference principle, but without empirical knowledge about ML, should you have been super surprised if it turned out that SGD did overfit? I don’t think so.

Now, you could be making a different, more B-ish sort of argument here: namely, that the fact that SGD doesn’t overfit actively gives us evidence that SGD’s inductive biases also disfavor schemers. This would be akin to having seen Bob, in a different city, actively seek out mexican restaurants despite there being many more chinese restaurants available, such that you now have active evidence that he prefers mexican and is willing to work for it. This wouldn’t be a case of having learned that bob’s preferences are such that hazy counting arguments “don’t work on bob” in general. But it would be evidence that Bob prefers non-chinese.

I’m pretty interested in arguments of this form. But I think that pretty quickly, they move into the territory of type (II) arguments above: that is, they start to say something like “we learn, from SGD not overfitting, that it prefers models of type X. Non-scheming models are of type X, schemers are not, so we now know that SGD won’t prefer schemers.”

But what is X? I’m not sure of your answer (though: maybe it will come in a later post). You could say something like “SGD prefers models that are ‘natural’” – but then, are schemers natural in that sense? Or, you could say “SGD prefers models that behave similarly on the training and test distributions” – but in what sense is a schemer violating this standard? On both distributions, a schemer seeks after their schemer-like goal. I’m not saying you can’t make an argument for a good X, here – but I haven’t yet heard it. And I’d want to hear its predictions about non-scheming forms of goal-misgeneralization as well.

Indeed, my understanding is that a quite salient candidate for “X” here is “simplicity” – e.g., that SGD’s not overfitting is explained by its bias towards simpler functions. And this puts us in the territory of the “simplicity argument” above. I.e., we’re now being less agnostic about SGD’s preferences, and instead positing some more particular bias. But there’s still the question of whether this bias favors schemers or not, and the worry is that it does.

This brings me to your take on simplicity arguments. I agree with you that simplicity arguments are often quite ambiguous about the notion of simplicity at stake (see e.g. my discussion here [? · GW]). And I think they’re weak for other reasons too (in particular, the extra cognitive faff scheming involves [? · GW] seems to me more important than its enabling simpler goals).

But beyond “what is simplicity anyway,” you also offer some other considerations, other than SGD-not-overfitting, meant to suggest that we have active evidence that SGD’s inductive biases disfavor schemers. I’m not going to dig deep on those considerations here, and I’m looking forward to your future post on the topic. For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly. That is, whatever their a priori merits, counting arguments are attempting to proceed from a position of lots of uncertainty and agnosticism, which only makes sense if you’ve got no other good evidence to go on. But if we do have such evidence (e.g., if (3) above is false), then I think it can quickly overcome [LW · GW] whatever “prior” counting arguments set (e.g., if you learn that Bob has a special passion for mexican food and hates chinese, you can update far towards him heading to a mexican restaurant). In general, I’m very excited for people to take our best current understanding of SGD’s inductive biases (it’s not my area of expertise), and apply it to p(scheming), and am interested to hear your own views in this respect. But if we have active evidence that SGD’s inductive biases point away from schemers, I think that whether counting arguments are good absent such evidence matters way less, and I, for one, am happy to pay them less attention.

(One other comment re: your take on simplicity arguments: it seems intuitively pretty non-simple to me to fit the training data on the training distribution, and then cut to some very different function on the test data, e.g. the identity function or the constant function. So not sure your parody argument that simplicity also predicts overfitting works. And insofar as simplicity is supposed to be the property had by non-overfitting functions, it seems somewhat strange if positing a simplicity bias predicts over-fitting after all.)

A few other comments

Re: goal realism, it seems like the main argument in the post is something like:

Michael Huemer says that it’s sometimes OK to use the principle of indifference if you’re applying it to explanatorily fundamental variables.
But goals won’t be explanatorily fundamental. So the principle of indifference is still bad here.

I haven’t yet heard much reason to buy Huemer’s view, so not sure how much I care about debating whether we should expect goals to satisfy his criteria of fundamentality. But I'll flag I do feel like there’s a pretty robust way in which explicitly-represented goals appropriately enter into our explanations of human behavior – e.g., I am buying a flight to New York because I want to go to New York, I have a representation of that goal and how my flight-buying achieves it, etc. And it feels to me like your goal reductionism is at risk of not capturing this. (To be clear: I do think that how we understand goal-directedness matters for scheming -- more here [LW · GW] -- and that if models are only goal-directed in a pretty deflationary sense, this makes scheming a way weirder hypothesis. But I think that if models are as goal-directed as strategic and agentic humans reasoning about how to achieve explicitly represented goals, their goal-directedness has met a fairly non-deflationary standard.)

I’ll also flag some broader unclarity about the post’s underlying epistemic stance. You rightly note that the strict principle of indifference has many philosophical problems. But it doesn’t feel to me like you’ve given a compelling alternative account of how to reason “on priors” in the sorts of cases where we’re sufficiently uncertain that there’s a temptation to spread one’s credence over many possibilities in the broad manner that principles-of-indifference-ish reasoning attempts to do.

Thus, for example, how does your epistemology think about a case like “There are 1000 people in this town, one of them is the murderer, what’s the probability that it’s Mortimer P. Snodgrass?” Or: “there are a thousand white rooms, you wake up in one of them, what’s the probability that it’s room number 734?” These aren’t cases like dice, where there’s a random process designed to function in principle-of-indifference-ish ways. But it’s pretty tempting to spread your credence out across the people/rooms (even if in not-fully-uniform ways), in a manner that feels closely akin to the sort of thing that principle-of-indifference-ish reasoning is trying to do. (We can say "just use all the evidence available to you" -- but why should this result in such principle-of-indifference-ish results?)

Your critique of counting arguments would be more compelling to me if you had a fleshed out account of cases like these -- e.g., one which captures the full range of cases where we’re pulled towards something principle-of-indifference-ish, such that you can then take that account and explain why it shouldn’t point us towards hefty probabilities on schemers, a la the hazy counting argument, even given very-little-evidence about SGD’s inductive biases.

More to say on all this, and I haven't covered various ways in which I'm sympathetic to/moved by points in the vicinity of the ones you're making here. But for now: thanks again for writing, looking forward to future installments.

^{^}
Though I do think cases like this can get complicated, and depending on how you carve up the hypothesis space, in some versions "hefty" won't be the right answer.

Replies from: nora-belrose, TurnTrout, TurnTrout

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T05:44:58.583Z · LW(p) · GW(p)

Hi, thanks for this thoughtful reply. I don't have time to respond to every point here now- although I did respond to some of them when you first made them as comments on the draft. Let's talk in person about this stuff soon, and after we're sure we understand each other I can "report back" some conclusions.

I do tentatively plan to write a philosophy essay just on the indifference principle soonish, because it has implications for other important issues like the simulation argument and many popular arguments for the existence of god.

In the meantime, here's what I said about the Mortimer case when you first mentioned it:

We're ultimately going to have to cash this out in terms of decision theory. If you're comparing policies for an actual detective in this scenario, the uniform prior policy is going to do worse than the "use demographic info to make a non-uniform prior" policy, and the "put probability 1 on the first person you see named Mortimer" policy is going to do worst of all, as long as your utility function penalizes being confidently wrong 1 - p(Mortimer is the killer) fraction of the time more strongly than it rewards being confidently right p(Mortimer is the killer) fraction of the time.

If we trained a neural net with cross-entropy loss to predict the killer, it would do something like the demographic info thing. If you give the neural net zero information, then with cross entropy loss it would indeed learn to use an indifference principle over people, but that's only because we've defined our CE loss over people and not some other coarse-graining of the possibility space.
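One way to make the coarse-graining point concrete is with worst-case log loss, which is a swapped-in framing rather than the trained-network setup described above. The sketch assumes a hypothetical town of 1000 people split into two groups of unequal size; the helper names and group sizes are illustrative. It shows that "uniform" is only the minimax-optimal prediction relative to the partition the loss is defined over, and that being uniform over a different partition implies a non-uniform distribution over people.

```python
import math


def worst_case_log_loss(q):
    """Largest log loss an adversary could inflict by choosing the true outcome."""
    return max(-math.log(p) for p in q)


def uniform(n):
    """Indifference over n outcomes."""
    return [1.0 / n] * n


def person_probs_from_group_uniform(group_sizes):
    """Uniform over groups, then uniform within each group, expressed per person."""
    g = len(group_sizes)
    return [1.0 / (g * size) for size in group_sizes for _ in range(size)]


# Hypothetical town: 1000 suspects, coarse-grained into groups of 900 and 100.
over_people = uniform(1000)
over_groups = person_probs_from_group_uniform([900, 100])

print("worst case, uniform over people:", worst_case_log_loss(over_people))
print("worst case, uniform over groups:", worst_case_log_loss(over_groups))
print("per-person probability in the large vs small group:", over_groups[0], over_groups[-1])
```

Scored per person, the people-uniform prediction has the lower worst-case loss; scored per group, the group-uniform prediction would be the minimax choice instead. Which indifference prior looks "right" depends on which partition the loss function privileges.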

For human epistemology, I think Huemer's restricted indifference principle is going to do better than some unrestricted indifference principle (which can lead to outright contradictions), and I expect my policy of "always scrounge up whatever evidence you have, and/or reason by abduction, rather than by indifference" would do best (wrt my own preference ordering at least).

There are going to be some scenarios where an indifference prior is pretty good decision-theoretically because your utility function privileges a certain coarse graining of the world. Like in the detective case you probably care about individual people more than anything else— making sure individual innocents are not convicted and making sure the individual perpetrator gets caught.

The same reasoning clearly does not apply in the scheming case. It's not like there's a privileged coarse graining of goal-space, where we are trying to minimize the cross-entropy loss of our prediction wrt that coarse graining, each goal-category is indistinguishable from every other, and almost all the goal-categories lead to scheming.

Replies from: mateusz-baginski

↑ comment by Mateusz Bagiński (mateusz-baginski) · 2024-03-22T09:49:30.592Z · LW(p) · GW(p)

I'd actually love to read a dialogue on this topic between the two of you.

↑ comment by TurnTrout · 2024-03-05T01:30:22.312Z · LW(p) · GW(p)

Suppose that I’m looking down at a superintelligent model newly trained on diverse, long-horizon tasks.

Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it's worth treating these cases separately. My strong objections are basically to the "and then goal optimization is a good way to minimize loss in general!" steps.

Replies from: joekc

↑ comment by Joe Carlsmith (joekc) · 2024-03-06T21:05:54.870Z · LW(p) · GW(p)

The probability I give for scheming in the report is specifically for (goal-directed) models that are trained on diverse, long-horizon tasks (see also Cotra on "human feedback on diverse tasks [LW · GW]," which is the sort of training she's focused on). I agree that various of the arguments for scheming could in principle apply to pure pre-training as well, and that folks (like myself) who are more worried about scheming in other contexts (e.g., RL on diverse, long-horizon tasks) have to explain what makes those contexts different. But I think there are various plausible answers here related to e.g. the goal-directedness, situational-awareness, and horizon-of-optimization of the models in question (see e.g. here [LW · GW] for some discussion, in the report, for why goal-directed models trained on longer episodes seem more likely to scheme; and see here [LW · GW] for discussion of why situational awareness seems especially likely/useful in models performing real-world tasks for you).

Re: "goal optimization is a good way to minimize loss in general" -- this isn't a "step" in the arguments for scheming I discuss. Rather, as I explain in the intro to the report [LW · GW], the arguments I discuss condition on the models in question being goal-directed (not an innocuous assumption, I think -- but one I explain and argue for in section 3 of my power-seeking report, and which I think is important to separate from questions about whether to expect goal-directed models to be schemers), and then focus on whether the goals in question will be schemer-like.

↑ comment by TurnTrout · 2024-03-05T01:33:27.899Z · LW(p) · GW(p)

For now, my main reaction is: “we have active evidence that SGD’s inductive biases disfavor schemers” seems like a much more interesting claim/avenue of inquiry than trying to nail down the a priori philosophical merits of counting arguments/indifference principles, and if you believe we have that sort of evidence, I think it’s probably most productive to just focus on fleshing it out and examining it directly.

The vast majority of evidential labor is done in order to consider a hypothesis at all [LW · GW].

Replies from: evhub, joekc

↑ comment by evhub · 2024-03-05T01:46:21.557Z · LW(p) · GW(p)

Humans under selection pressure—e.g. test-takers, job-seekers, politicians—will often misrepresent themselves and their motivations to get ahead. That very basic fact that humans do this all the time seems like sufficient evidence to me to consider the hypothesis at all (though certainly not enough evidence to conclude that it's highly likely).

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-05T02:02:53.286Z · LW(p) · GW(p)

I don't think that's enough. Lookup tables can also be under "selection pressure" to output good training outputs. As I understand your reasoning, the analogy is too loose to be useful here. I'm worried that using 'selection pressure' is obscuring the logical structure of your argument. As I'm sure you'll agree, just calling that situation 'selection pressure' and SGD 'selection pressure' doesn't mean they're related [LW · GW].

I agree that "sometimes humans do X" is a good reason to consider whether X will happen, but you really do need shared causal mechanisms. If I examine the causal mechanisms here, I find things like "humans seem to have 'parameterizations' which already encode situationally activated consequentialist reasoning", and then I wonder "will AI develop similar cognition?" and then that's the whole thing I'm trying to answer to begin with. So the fact you mention isn't evidence for the relevant step in the process (the step where the AI's mind-design is selected to begin with).

Replies from: evhub

↑ comment by evhub · 2024-03-05T02:10:57.128Z · LW(p) · GW(p)

If I examine the causal mechanisms here, I find things like "humans seem to have 'parameterizations' which already encode situationally activated consequentialist reasoning", and then I wonder "will AI develop similar cognition?" and then that's the whole thing I'm trying to answer to begin with.

Do you believe that AI systems won't learn to use goal-directed consequentialist reasoning even if we train them directly on outcome-based goal-directed consequentialist tasks? Or do you think we won't ever do that?

If you do think we'll do that, then that seems like all you need to raise that hypothesis into consideration. Certainly it's not the case that models always learn to value anything like what we train them to value, but it's obviously one of the hypotheses that you should be seriously considering.

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-05T06:52:24.849Z · LW(p) · GW(p)

~~Your comment is switching the hypothesis being considered. As I wrote~~ ~~elsewhere~~ [LW(p) · GW(p)]:

Seems to me that a lot of (but not all) scheming speculation is just about sufficiently large pretrained predictive models, period. I think it's worth treating these cases separately. My strong objections are basically to the "and then goal optimization is a good way to minimize loss in general!" steps.

If the argument for scheming is "we will train them directly to achieve goals in a consequentialist fashion", then we don't need all this complicated reasoning about UTM priors or whatever.

Replies from: evhub, mike_hawke

↑ comment by evhub · 2024-03-05T06:57:58.796Z · LW(p) · GW(p)

I'm not sure where it was established that what's under consideration here is just deceptive alignment in pre-training. Personally, I'm most worried about deceptive alignment coming after pre-training. I'm on record as thinking that deceptive alignment is unlikely (though certainly not impossible) in purely pretrained predictive models. [? · GW]

Replies from: TurnTrout, ryan_greenblatt

↑ comment by TurnTrout · 2024-03-11T22:45:29.006Z · LW(p) · GW(p)

Sorry, I do think you raised a valid point! I had read your comment in a different way.

I think I want to have said: aggressively training AI directly on outcome-based tasks ("training it to be agentic", so to speak) may well produce persistently-activated inner consequentialist reasoning of some kind (though not necessarily the flavor historically expected). I most strongly disagree with arguments which behave the same for a) this more aggressive curriculum and b) pretraining, and I think it's worth distinguishing between these kinds of argument.

Replies from: evhub

↑ comment by evhub · 2024-03-12T00:29:55.771Z · LW(p) · GW(p)

Sure—I agree with that. The section I linked from Conditioning Predictive Models [? · GW] actually works through at least to some degree how I think simplicity arguments for deception go differently for purely pre-trained predictive models.

↑ comment by ryan_greenblatt · 2024-03-12T00:47:22.240Z · LW(p) · GW(p)

FWIW, I agree that if powerful AI is achieved via pure pre-training, then deceptive alignment is less likely, but this "the prediction goal is simple" argument seems very wrong to me. We care about the simplicity of the goal in terms of the world model (which will surely be heavily shaped by the importance of various predictions) and I don't see any reason why things like close proxies of reward in RL training wouldn't be just as simple for those models.

Interpreted naively it seems like this goal simplicity argument implies that it matters a huge amount how simple your data collection routine is. (Simple to who?). For instance, this argument implies that collecting data from a process such as "all outlinks from reddit with >3 upvotes" makes deceptive alignment considerably less likely than a process like "do whatever messy thing AI labs do now". This seems really, really implausible: surely AIs won't be doing much explicit reasoning about these details of the process because this will clearly be effectively hardcoded in a massive number of places.

Evan and I have talked about these arguments at some point.

(I need to get around to writing a review of conditioning predictive models which makes these counterarguments.)

↑ comment by mike_hawke · 2024-03-05T07:40:47.459Z · LW(p) · GW(p)

I followed this exchange up until here and now I'm lost. Could you elaborate or paraphrase?

↑ comment by Joe Carlsmith (joekc) · 2024-03-06T21:22:54.358Z · LW(p) · GW(p)

The point of that part of my comment was that insofar as part of Nora/Quintin's response to simplicity arguments is to say that we have active evidence that SGD's inductive biases disfavor schemers, this seems worth just arguing for directly, since even if e.g. counting arguments were enough to get you worried about schemers from a position of ignorance about SGD's inductive biases, active counter-evidence absent such ignorance could easily make schemers seem quite unlikely overall.

There's a separate question of whether e.g. counting arguments like mine above (e.g., "A very wide variety of goals can prompt scheming; By contrast, non-scheming goals need to be much more specific to lead to high reward; I’m not sure exactly what sorts of goals SGD’s inductive biases favor, but I don’t have strong reason to think they actively favor non-schemer goals; So, absent further information, and given how many goals-that-get-high-reward are schemer-like, I should be pretty worried that this model is a schemer") do enough evidential labor to privilege schemers as a hypothesis at all. But that's the question at issue in the rest of my comment. And in e.g. the case of "there are 1000 chinese restaurants in this city, and only ~100 non-chinese restaurants," the number of chinese restaurants seems to me like it's enough to privilege "Bob went to a chinese restaurant" as a hypothesis (and this even without thinking that he made his choice by sampling randomly from a uniform distribution over restaurants). Do you disagree in that restaurant case?

comment by evhub · 2024-02-29T00:11:05.143Z · LW(p) · GW(p)

I really do appreciate this being written up, but to the extent that this is intended to be a rebuttal to the sorts of counting arguments that I like, I think you would have basically no chance of passing my ITT [? · GW] here. From my perspective reading this post, it read to me like "I didn't understand the counting argument, therefore it doesn't make sense" which is (obviously) not very compelling to me. That being said, to give credit where credit is due, I think some people would make a more simplistic counting argument like the one you're rebutting. So I'm not saying that you're not rebutting anyone here, but you're definitely not rebutting my position.

Edit: If you're struggling to grasp the distinction I'm pointing to here, it might be worth trying this exercise pointing out where the argument in the post goes wrong in a very simple case [LW(p) · GW(p)] and/or looking at Ryan's restatement of my mathematical argument [LW(p) · GW(p)].

Edit: Another point of clarification here [LW(p) · GW(p)]—my objection is not that there is a "finite bitstring case" and an "infinite bitstring case" and you should be using the "infinite bitstring case". My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.

Let's work through how to properly reason about counting arguments:

When doing reasoning about simplicity priors, a really important thing to keep in mind is the relationship between infinite bitstring simplicity and finite bitstring simplicity. When you just start counting the ways in which the model can behave on unseen inputs and then saying that the more ways there are the more likely it is, what you're implicitly computing there is actually an inverse simplicity prior: Consider two programs, one that takes $n$ bits and then stops, and one that takes $2 n$ bits to specify the necessary logic but then uses $m$ remaining bits to fill in additional pieces for how it might behave on unseen inputs. Obviously the $n$ bit program is simpler, but by your logic the $2 n$ bit program would seem to be simpler because it leaves more things unspecified in terms of all the ways to fill in the remaining $m$ bits. But if you understand that we can recast everything into infinite bitstring complexity, then it's clear that actually the $n$ bit program is leaving $n + m$ bits unspecified—even though those bits don't do anything in that case, they're still unspecified parts of the overall infinite bitstring.
Once we understand that relationship, it should become pretty clear why the overfitting argument doesn't work: the overfit model is essentially the $2 n$ model, where it takes more bits to specify the core logic, and then tries to "win" on the simplicity by having $m$ unspecified bits of extra information. But that doesn't really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in $n$ bits rather than $2 n$ bits, you'll learn those.
However, this doesn't apply at all to the counting argument for deception. In fact, understanding this distinction properly is critically important to make the argument work. Let's break it down:
- Suppose my model has the following structure: an $n$ bit world model, an $m$ bit search procedure, and an $x$ bit objective function. This isn't a very realistic assumption, but it'll be enough to make it clear why the counting argument here doesn't make use of the same fallacious reasoning that would lead you to think that the $2 n$ bit model was simpler.
- We'll assume that the deceptive and non-deceptive models require the same $n + m$ bits of world modeling and search, so the only question is the objective function.
- I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that's just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let's assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.
- Okay, but now if the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception, how could deception win? Well, the key is that the core logic necessary for deception is simpler: the only thing required for deception is a long-term objective, everything else is unspecified. So, mathematically, we have:
  - Complexity of simplest aligned objective: $a$
  - Complexity of simplest deceptive objective: $l + b$ where $l$ is the minimum necessary for any long-term objective and $b$ is everything else necessary to implement some particular long-term objective.
  - We're assuming that $a < l + b$ , but that $l < a$ .
  - Casting into infinite bitstring land, we see that the set of aligned objectives includes those with anything after the first $a$ bits, whereas the set of deceptive objectives includes anything after the first $l$ bits. Even though you don't get a full program until you're $l + b$ bits deep, the complexity here is just $l$ , because all the bits after the first $l$ bits aren't pinned down. So if we're assuming that $l < a$ , then deception wins.
  - Certainly, you could contest the assumption that $l < a$ —and conversely I would even go further and say probably $a > l + b$ —but either way the point is that this argument is totally sound given its assumptions.
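The counting in the list above can be checked by brute force on a finite truncation. The following is a minimal sketch under the comment's own assumptions, not a model of real training: the bit widths, the two prefixes, and the helper `count_with_prefix` are all hypothetical choices made for illustration.

```python
from itertools import product


def count_with_prefix(prefix: str, total_bits: int) -> int:
    """Count bitstrings of length total_bits that begin with prefix."""
    return sum(
        1
        for bits in product("01", repeat=total_bits)
        if "".join(bits).startswith(prefix)
    )


# Hypothetical toy values, chosen only so that l < a < l + b holds.
L_BITS, A_BITS, B_BITS = 3, 5, 4
assert L_BITS < A_BITS < L_BITS + B_BITS

TOTAL_BITS = 10  # finite truncation standing in for an infinite bitstring

long_term_prefix = "101"   # l bits: "has some long-term objective" (assumed)
aligned_prefix = "01100"   # a bits: the simplest aligned objective (assumed)

# Every completion of the l-bit prefix counts as some deceptive objective;
# every completion of the a-bit prefix counts as the aligned objective.
deceptive = count_with_prefix(long_term_prefix, TOTAL_BITS)
aligned = count_with_prefix(aligned_prefix, TOTAL_BITS)

print(deceptive, aligned, deceptive / aligned)
```

In this toy setup the ratio depends only on $a - l$, and $b$ never enters, which is the structural claim being made. Whether $l < a$ holds for real objectives is exactly the assumption the comment flags as contestable.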
At a high level, what I'm saying here is that counting arguments are totally valid and in fact strongly predict that you won't learn to memorize, but only when you do them over infinite bitstrings, not when done over finite bitstrings. If we think about the simplicity of learning a line to fit a set of linear datapoints vs. the simplicity of memorizing everything, there are more ways to implement a line than there are to memorize, but only over infinite bitstrings. In the line case, the extra bits don't do anything, whereas in the memorization case, they do, but that's not a relevant distinction: they're still unspecified bits, and what we're doing is counting up the measure of the infinite bitstrings which implement that algorithm.
I think this analysis should also make clear what's going on with the indifference principle here. The "indifference principle" in this context is about being indifferent across all infinite bitstrings—it's not some arbitrary thing where you can carve up the space however you like and then say you're indifferent across the different pieces—it's a very precise notion that comes from theoretical computer science (though there is a question about what UTM to use; there you're trying to get as close as possible to a program prior that would generalize well in practice given that we know ML generalizes well). The idea is that indifference across infinite bitstrings gives you a universal semi-measure, from which you can derive a universal prior (which you're trying to select out of the space of all universal priors to match ML well). Of course, it's certainly not the case that actual machine learning inductive biases are purely simplicity, or that they're even purely indifferent across all concrete parameterizations, but I think it's a reasonable first-pass assumption given facts like ML not generally overfitting as you note.
Looking at this more broadly, from my perspective, the fact that we don't see overfitting is the entire reason why deceptive alignment is likely. The fact that models tend to learn simple patterns that fit the data rather than memorize a bunch of stuff is exactly why deception, a simple strategy that compresses a lot of data, might be a very likely thing for them to learn. If models were more likely to learn overfitting-style solutions, I would be much, much less concerned about deception—but of course, that would also mean they were less capable, so it's not much solace.

Replies from: nora-belrose, evhub, TurnTrout, ryan_greenblatt, ryan_greenblatt, Signer, david-johnston

↑ comment by Nora Belrose (nora-belrose) · 2024-02-29T03:31:11.887Z · LW(p) · GW(p)

Thanks for the reply. A couple remarks:

"indifference over infinite bitstrings" is a misnomer in an important sense, because it's literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you're talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That's definitely not an indifference principle, it's baking in substantive assumptions about what's more likely.
I don't see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn't cast the counting argument in terms of Turing machines in the post. In the past I've seen you try to run counting or simplicity arguments in terms of parameters. I don't think any of that works, but I at least take it more seriously than the Turing machine stuff.
If we're really going to assume the Solomonoff prior here, then I may just agree with you that it's malign in Christiano's sense and could lead to scheming, but I take this to be a reductio of the idea that we can use Solomonoff as any kind of model for real world machine learning. Deep learning does not approximate Solomonoff in any meaningful sense.
Terminological point: it seems like you are using the term "simple" as if it has a unique and objective referent, namely Kolmogorov-simplicity. That's definitely not how I use the term; for me it's always relative to some subjective prior. Just wanted to make sure this doesn't cause confusion.

Replies from: evhub

↑ comment by evhub · 2024-02-29T03:41:32.435Z · LW(p) · GW(p)

"indifference over infinite bitstrings" is a misnomer in an important sense, because it's literally impossible to construct a normalized probability measure over infinite bitstrings that assigns equal probability to each one. What you're talking about is the length weighted measure that assigns exponentially more probability mass to shorter programs. That's definitely not an indifference principle, it's baking in substantive assumptions about what's more likely.

No; this reflects a misunderstanding of how the universal prior is traditionally derived in information theory. We start by assuming that we are running our UTM over code such that every time the UTM looks at a new bit in the tape, it has equal probability of being a 1 or a 0 (that's the indifference condition). That induces what's called the universal semi-measure, from which we can derive the universal prior by enforcing a halting condition. The exponential nature of the prior simply falls out of that derivation.
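For reference, the derivation described here is usually written as follows. This is a sketch in standard textbook notation that the comment itself does not use: $U$ is a monotone universal Turing machine, $|p|$ is the length of program $p$ in bits, and $U(p) = x*$ means that $U$ run on $p$ produces output beginning with $x$. Because each tape bit is an independent fair coin flip, the chance that the tape begins with a particular $p$ is $2^{-|p|}$, which gives

$$M(x) = \sum_{p \,:\, U(p) = x*} 2^{-|p|},$$

where the sum ranges over minimal such programs (those with no proper prefix that already outputs $x$).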

I don't see why we should expect any of this reasoning about Turing machines to transfer over to neural networks at all, which is why I didn't cast the counting argument in terms of Turing machines in the post. In the past I've seen you try to run counting or simplicity arguments in terms of parameters. I don't think any of that works, but I at least take it more seriously than the Turing machine stuff.

Some notes:

I am very skeptical of hand-wavy arguments about simplicity that don't have formal mathematical backing. This is a very difficult area to reason about correctly and it's easy to go off the rails if you're trying to do so without relying on any formalism.
There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don't change the bottom-line conclusion [LW · GW], but if you have a concrete mathematical model that you'd like to present here that you think gives a different result, I'm all ears.
All of that being said, I'm absolutely with you that this whole space of trying to apply theoretical reasoning about inductive biases to concrete ML systems is quite fraught. But it's even more fraught if you drop the math!
So I'm happy with turning to empirics instead, which is what I have actually done! I think our Sleeper Agents results [LW · GW], for example, empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).

Replies from: nora-belrose, TurnTrout, TurnTrout

↑ comment by Nora Belrose (nora-belrose) · 2024-02-29T03:46:33.954Z · LW(p) · GW(p)

I'm well aware of how it's derived. I still don't think it makes sense to call that an indifference prior, precisely because enforcing an uncomputable halting requirement induces an exponentially strong bias toward short programs. But this could become a terminological point.

I think relying on an obviously incorrect formalism is much worse than relying on no formalism at all. I also don't think I'm relying on zero formalism. The literature on the frequency/spectral bias is quite rigorous, and is grounded in actual facts about how neural network architectures work.

↑ comment by TurnTrout · 2024-03-04T16:04:06.976Z · LW(p) · GW(p)

I am very skeptical of hand-wavy arguments about simplicity that don't have formal mathematical backing. This is a very difficult area to reason about correctly and it's easy to go off the rails if you're trying to do so without relying on any formalism.

I'm surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don't (AFAICT) have much to do with the reality of neural networks. EG, your comments above:

I would usually then make an argument here for why in most cases the simplest objective that leads to deception is simpler than the simplest objective that leads to alignment, but that's just a simplicity argument, not a counting argument. Since we want to do the counting argument here, let's assume that the simplest objective that leads to alignment is simpler than the simplest objective that leads to deception.

Or the times you've talked about how there are "more" sycophants but only "one" saint.

There are many, many ways to adjust the formalism to take into account various ways in which realistic neural network inductive biases are different than basic simplicity biases. My sense is that most of these changes generally don't change the bottom-line conclusion [LW · GW], but if you have a concrete mathematical model that you'd like to present here that you think gives a different result, I'm all ears.

This is a very strange burden of proof. It seems to me that you presented a specific model of how NNs work which is clearly incorrect, and instead of processing counterarguments that it doesn't make sense, you want someone else to propose to you a similarly detailed model which you think is better. Presenting an alternative is a logically separate task from pointing out the problems in the model you gave.

Replies from: evhub

↑ comment by evhub · 2024-03-04T19:56:54.480Z · LW(p) · GW(p)

I'm surprised by this. It seems to me like most of your reasoning about simplicity is either hand-wavy or only nominally formally backed by symbols which don't (AFAICT) have much to do with the reality of neural networks.

The examples that you cite are from a LessWrong comment and a transcript of a talk that I gave. Of course when I'm presenting something in a context like that I'm not going to give the most formal version of it; that doesn't mean that the informal hand-wavy arguments are the reasons why I believe what I believe.

Maybe a better objection there would be: then why haven't you written up anything more careful and more formal? Which is a pretty fair objection, as I note here [LW(p) · GW(p)]. But alas I only have so much time and it's not my current focus.

Replies from: TurnTrout, TurnTrout

↑ comment by TurnTrout · 2024-03-05T01:08:58.039Z · LW(p) · GW(p)

Yes, but your original comment was presented as explaining "how to properly reason about counting arguments." Do you no longer claim that to be the case? If you do still claim that, then I maintain my objection that you yourself used hand-wavy reasoning in that comment, and it seems incorrect to present that reasoning as unusually formally supported.

Another concern I have is, I don't think you're gaining anything by formality in this thread. As I understand your argument, I think your symbols are formalizations of hand-wavy intuitions (like the ability to "decompose" a network into the given pieces; the assumption that description length is meaningfully relevant to the NN prior; assumptions about informal notions of "simplicity" being realized in a given UTM prior). If anything, I think that the formality makes things worse because it makes it harder to evaluate or critique your claims.

I also don't think I've seen an example of reasoning about deceptive alignment where I concluded that formality had helped the case, as opposed to obfuscated the case or lent the concern unearned credibility.

Replies from: evhub

↑ comment by evhub · 2024-03-05T01:13:04.141Z · LW(p) · GW(p)

The main thing I was trying to show there is just that having the formalism prevents you from making logical mistakes in how to apply counting arguments in general, as I think was done in this post. So my comment is explaining how to use the formalism to avoid mistakes like that, not trying to work through the full argument for deceptive alignment.

It's not that the formalism provides really strong evidence for deceptive alignment, it's that it prevents you from making mistakes in your reasoning. It's like plugging your argument into a proof-checker: it doesn't check that your argument is correct, since the assumptions could be wrong, but it does check that your argument is sound.

↑ comment by TurnTrout · 2024-03-04T20:41:51.904Z · LW(p) · GW(p)

Do you believe that the cited hand-wavy arguments are, at a high informal level, sound reason for belief in deceptive alignment? (It sounds like you don't, going off of your original comment which seems to distance yourself from the counting arguments critiqued by the post.)

EDITed to remove last bit after reading elsewhere in thread.

Replies from: evhub

↑ comment by evhub · 2024-03-04T20:43:15.106Z · LW(p) · GW(p)

I think they are valid if interpreted properly, but easy to misinterpret.

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-05T01:13:11.504Z · LW(p) · GW(p)

I think you should allocate time to devising clearer arguments, then. I am worried that lots of people are misinterpreting your arguments and then making significant life choices on the basis of their new beliefs about deceptive alignment, and I think we'd both prefer for that to not happen.

Replies from: evhub

↑ comment by evhub · 2024-03-05T01:17:00.502Z · LW(p) · GW(p)

Were I not busy with all sorts of empirical stuff right now, I would consider prioritizing a project like that, but alas I expect to be too busy. I think it would be great if somebody else wanted to devote more time to working through the arguments in detail publicly, and I might encourage some of my mentees to do so.

↑ comment by TurnTrout · 2024-03-04T16:04:54.060Z · LW(p) · GW(p)

empirically disprove the hypothesis that deceptive reasoning will be naturally regularized away (interestingly, we find that it does get regularized away for small models—but not for large models!).

You did not "empirically disprove" that hypothesis. You showed that if you explicitly train a backdoor for a certain behavior under certain regimes, then training on other behaviors will not cause catastrophic forgetting. You did not address the regime where the deceptive reasoning arises as instrumental to some other goal embedded in the network, or in a natural context (as you're aware). I think that you did find a tiny degree of evidence about the question (it really is tiny IMO), but you did not find "disproof."

Indeed, I predicted [LW(p) · GW(p)] that people would incorrectly represent these results; so little time has passed!

I have a bunch of dread about the million conversations I will have to have with people explaining these results. I think that predictably, people will update as if they saw actual deceptive alignment [LW · GW], as opposed to something more akin to a "hard-coded" demo which was specifically designed to elicit the behavior and instrumental reasoning the community has been scared of. I think that people will predictably
...
[claim] that we've observed it's hard to uproot deceptive alignment (even though "uprooting a backdoored behavior" and "pushing back against misgeneralization" are different things),

Replies from: evhub

↑ comment by evhub · 2024-03-04T19:39:55.744Z · LW(p) · GW(p)

I'm quite aware that we did not see natural deceptive alignment, so I don't think I'm misinterpreting my own results in the way you were predicting. Perhaps "empirically disprove" is too strong; I agree that our results are evidence but not definitive evidence. But I think they're quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-04T20:33:56.010Z · LW(p) · GW(p)

You didn't claim it for deceptive alignment, but you claimed disproof of the idea that deceptive reasoning would be trained away, which is an important subcomponent of deceptive alignment. But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.

I think the presentation of your work (which, again, I like in many respects) would be strengthened if you clarified the comment which I responded to.

But I think they're quite strong evidence and by far the strongest evidence available currently on the question of whether deception will be regularized away.

Because the current results only deal with backdoor removal, I personally think it's outweighed by e.g. results on how well instruction-tuning generalizes.

Replies from: evhub

↑ comment by evhub · 2024-03-04T22:16:24.119Z · LW(p) · GW(p)

But your work provides no strong conclusions on that matter as it pertains to deceptive reasoning in general.

I just disagree with this. Our chain of thought models do tons of very deceptive reasoning during safety training and the deceptiveness of that reasoning is totally unaffected by safety training, and in fact the deceptiveness increases in the case of adversarial training.

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-05T00:53:16.604Z · LW(p) · GW(p)

I said "Deceptive reasoning in general", not the trainability of the backdoor behavior in your experimental setup. The issue isn't just "what was the trainability of the surface behavior", but "what is the trainability of the cognition implementing this behavior in-the-wild." That is, the local inductive biases are probably far different for "parameterization implementing directly-trained deceptive reasoning" vs "parameterization outputting deceptive reasoning as an instrumental backchain from consequentialist reasoning."

Imagine if I were arguing for some hypothetical results of mine, saying "The aligned models kept using aligned reasoning in the backdoor context, even as we trained them to be mean in other situations. That means we disproved the idea that aligned reasoning can be trained away with existing techniques, especially for larger models." Would that be a valid argument given the supposed experimental result?

Replies from: evhub

↑ comment by evhub · 2024-03-05T01:08:04.290Z · LW(p) · GW(p)

I'm referring to the deceptiveness of the reasoning displayed in the chain of thought during training time. So it's not a generalization question, it's about whether, if the model is using deceptive reasoning to compute its answer (as we think it is, since we think our models really are using their chain of thought), does that deceptive reasoning go away when the model has to use it to produce aligned answers during training? And we find that not only does it not go away, it actually gets more deceptive when you train it to produce aligned answers.

↑ comment by evhub · 2024-02-29T02:34:12.876Z · LW(p) · GW(p)

Here's another fun way to think about this—you can basically cast what's wrong here as an information theory exercise.

Problem:

Spot the step where the following argument goes wrong:
1. Suppose I have a dataset of finitely many points arranged in a line. Now, suppose I fit a (reasonable) universal prior to that dataset, and compare two cases: learning a line and learning to memorize each individual datapoint.
2. In the linear case, there is only one way to implement a line.
3. In the memorization case, I can implement whatever I want on the other datapoints in an arbitrary way.
4. Thus, since there are more ways to memorize than to learn a line, there should be greater total measure on memorization than on learning the line.
5. Therefore, you'll learn to memorize each individual datapoint rather than learning to implement a line.

Solution:

By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits. It's totally valid to say that the algorithm with the most measure across all ways of implementing it is more likely, but you have to actually include all ways of implementing it, including all the cases where many of those bits are garbage and aren't actually doing anything.

Replies from: TurnTrout, Chris_Leong

↑ comment by TurnTrout · 2024-03-04T19:04:51.682Z · LW(p) · GW(p)

By the logic of the post, step 4 is the problem, but I think step 4 is actually valid. The problem is step 2: there are actually a huge number of different ways to implement a line! Not only are there many different programs that implement the line in different ways, I can also just take the simplest program that does so and keep on adding comments or other extraneous bits.

Evan, I wonder how much your disagreement is engaging with OPs' reasons. A draft of this post motivated the misprediction of both counting arguments as trying to count functions instead of parameterizations of functions; one has to consider the compressivity of the parameter-function map (many different internal parameterizations map to the same external behavior). Given that the authors actually agree that 2 is incorrect, does this change your views?
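The distinction between counting functions and counting parameterizations can be seen in a toy setting. The sketch below is purely illustrative and assumes a single threshold unit on 2-bit inputs with weights drawn from a small hypothetical grid (`WEIGHT_GRID`); it says nothing about real networks or about SGD, only that a parameter-function map can be heavily many-to-one.

```python
from collections import Counter
from itertools import product

INPUTS = list(product([0, 1], repeat=2))  # all 2-bit inputs
WEIGHT_GRID = [-2, -1, 0, 1, 2]           # hypothetical discretised weights


def truth_table(w1, w2, b):
    """Function computed by one threshold unit, as a tuple of outputs."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)


# Map every parameterization (w1, w2, b) to the function it implements.
multiplicity = Counter(
    truth_table(w1, w2, b) for w1, w2, b in product(WEIGHT_GRID, repeat=3)
)

n_params = sum(multiplicity.values())  # number of parameterizations
n_functions = len(multiplicity)        # number of distinct functions reached

# List functions from most to fewest parameterizations.
for table, count in multiplicity.most_common():
    print(table, count)
```

A count that treats each function as one item and a count that weights each function by its number of parameterizations give different answers here, since some truth tables are reached by many more weight settings than others.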

Replies from: evhub, nora-belrose

↑ comment by evhub · 2024-03-04T19:50:23.561Z · LW(p) · GW(p)

I would be much happier with that; I think that's much more correct. Then, my objection would just be that at least the sort of counting arguments for deceptive alignment that I like are and always have been about parameterizations rather than functions. I agree that if you try to run a counting argument directly in function space it won't work.

Replies from: ryan_greenblatt, TurnTrout

↑ comment by ryan_greenblatt · 2024-03-05T02:01:07.106Z · LW(p) · GW(p)

See also discussion here [LW(p) · GW(p)].

↑ comment by TurnTrout · 2024-03-05T06:49:22.388Z · LW(p) · GW(p)

deceptive alignment that I like are and always have been about parameterizations rather than functions.

How can this be true, when you e.g. say there's "only one saint"? That doesn't make any sense with parameterizations due to internal invariances; there are uncountably many "saints" in parameter-space (insofar as I accept that frame, which I don't really but that's not the point here). I'd expect you to raise that as an obvious point in worlds where this really was about parameterizations.

And, as you've elsewhere noted, we don't know enough about parameterizations to make counting arguments over them. So how are you doing that?

Replies from: evhub

↑ comment by evhub · 2024-03-05T06:54:57.190Z · LW(p) · GW(p)

How can this be true, when you e.g. say there's "only one saint"? That doesn't make any sense with parameterizations due to internal invariances; there are uncountably many saints.

Because it was the transcript of a talk? I was trying to explain an argument at a very high level. And there's certainly not uncountably many; in the infinite bitstring case there would be countably many, though usually I prefer priors that put caps on total computation such that there are only finitely many.

I'd expect you to raise that as an obvious point in worlds where this really was about parameterizations.

I don't really appreciate the psychoanalysis here. I told you what I thought and think, and I have far more evidence about that than you do.

And, as you've elsewhere noted, we don't know enough about parameterizations to make counting arguments over them. So how are you doing that?

As I've said, I usually try to take whatever the most realistic prior is that we can reason about at a high-level, e.g. a circuit prior or a speed prior.

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T19:11:18.704Z · LW(p) · GW(p)

FWIW I object to 2, 3, and 4, and maybe also 1.

↑ comment by Chris_Leong · 2024-02-29T05:30:48.685Z · LW(p) · GW(p)

Another frame that might be useful:

There's a difference between the number of mathematical functions that implement a set of requirements and the number of programs that implement the set of requirements.

Simplicity is about the latter, not the former.

The existence of a large number of programs that produce the exact same mathematical function contributes towards simplicity.

↑ comment by TurnTrout · 2024-03-04T19:29:26.972Z · LW(p) · GW(p)

From my perspective reading this post, it read to me like "I didn't understand the counting argument, therefore it doesn't make sense" which is (obviously) not very compelling to me.

I definitely appreciate how it can feel frustrating or bad when you feel that someone isn't properly engaging with your ideas. However, I also feel frustrated by this statement. Your comment seems to have a tone of indignation that Quintin and Nora weren't paying attention to what you wrote.

I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP's post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings.

(EDIT: Having read Ryan's comment, [LW(p) · GW(p)] it now seems to me that you have exclusively made a simplicity argument without any counting involved, and an empirical claim about the relationship between description length of a mesa objective and the probability of SGD sampling a function which implements such an objective. Is this correct?)

If these are your real reasons for expecting deceptive alignment, that's fine, but I think you've mentioned this rather infrequently. Your profile links to How likely is deceptive alignment? [LW · GW], which is an (introductory) presentation you gave. In that presentation, you make no mention of Turing machines, universal semimeasures, bitstrings, and so on. On a quick search, the closest you seem to come is the following:

We're going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?^[1]

But this is ambiguous (as can be expected for a presentation at this level). We could view this as "bitlength under a given decoding scheme, viewing an equivalence class over parameterizations as a set of possible messages" or "Shannon information (in bits) of a function induced by a given probability distribution over parameterizations" or something else entirely (perhaps having to do with infinite bitstrings).

My critique is not "this was ambiguous." My critique is "how was anyone supposed to be aware of the 'real' argument which I (and many others) seem to now be encountering for the first time?".

My objection is that the sort of finite bitstring analysis in this post does not yield any well-defined mathematical object at all, and certainly not one that would predict generalization.

This seems false? All that needs be done is to formally define

$$F := \{ f : \mathbb{R}^{n} \to \mathbb{R}^{m} \mid f(x) = \operatorname{label}(x) \ \forall x \in X_{train} \},$$

which is the set of functions which (when e.g. greedily sampled) perfectly label the (categorical) training data $X_{train}$ , and we can parameterize such functions using the neural network parameter space. This yields a perfectly well-defined counting argument over $F$ .
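A finite analogue of the set $F$ can be enumerated directly. This is a minimal sketch under strong simplifying assumptions: Boolean functions on 3-bit inputs stand in for $f : R^{n} \to R^{m}$, and the labelling rule `label` and the split into `X_train` and `X_test` are hypothetical choices made for illustration.

```python
from itertools import product

INPUTS = list(product([0, 1], repeat=3))  # domain: all 3-bit inputs


def label(x):
    """Hypothetical ground-truth rule: the label is the first bit."""
    return x[0]


X_train = INPUTS[::2]  # hypothetical training set (4 of the 8 inputs)
X_test = [x for x in INPUTS if x not in X_train]

# F: every function, stored as a lookup table, that fits the training labels.
F = []
for outputs in product([0, 1], repeat=len(INPUTS)):
    f = dict(zip(INPUTS, outputs))
    if all(f[x] == label(x) for x in X_train):
        F.append(f)

# Members of F that also match the rule on the held-out inputs.
generalizing = [f for f in F if all(f[x] == label(x) for x in X_test)]

print(len(F), len(generalizing))
```

The count over $F$ is well defined in this toy case, and it shows why the choice of space matters: a uniform count over functions gives the single generalizing function a share of only $1 / |F|$, while a count over parameterizations or programs can weight the members of $F$ very differently.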

^[1] This seems to be exactly the counting argument the post is critiquing, by the way.

Replies from: evhub

↑ comment by evhub · 2024-03-04T19:49:03.527Z · LW(p) · GW(p)

I myself expected you to respond to this post with some ML-specific reasoning about simplicity and measure of parameterizations, instead of your speculation about a relationship between the universal measure and inductive biases. I spoke with dozens of people about the ideas in OP's post, and none of them mentioned arguments like the one you gave. I myself have spent years in the space and am also not familiar with this particular argument about bitstrings.

That probably would have been my objection had the reasoning about priors in this post been sound, but since the reasoning was unsound, I turned to the formalism to try to show why it's unsound.

If these are your real reasons for expecting deceptive alignment, that's fine, but I think you've mentioned this rather infrequently.

I think you're misunderstanding the nature of my objection. It's not that Solomonoff induction is my real reason for believing in deceptive alignment or something, it's that the reasoning in this post is mathematically unsound, and I'm using the formalism to show why. If I weren't responding to this post specifically, I probably wouldn't have brought up Solomonoff induction at all.

This yields a perfectly well-defined counting argument over $F$ .

we can parameterize such functions using the neural network parameter space

I'm very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don't think we understand it well enough to do so effectively.

You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn't the right space to run a counting argument like this; you need to be in algorithm space, otherwise you'll do things like what happens in this post where you end up predicting overfitting rather than generalization (which implies that you're using a prior that's not suitable for running counting arguments on).

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-04T20:47:50.637Z · LW(p) · GW(p)

I'm very happy with running counting arguments over the actual neural network parameter space; the problem there is just that I don't think we understand it well enough to do so effectively.

This is basically my position as well.
The cited argument is a counting argument over the space of functions which achieve zero/low training loss.

You could instead try to put a measure directly over the functions in your setup, but the problem there is that function space really isn't the right space to run a counting argument like this; you need to be in algorithm space, otherwise you'll do things like what happens in this post where you end up predicting overfitting rather than generalization (which implies that you're using a prior that's not suitable for running counting arguments on).

Indeed, this is a crucial point that I think the post is trying to make. The cited counting arguments are counting functions instead of parameterizations. That's the mistake (or, at least "a" mistake). I'm glad we agree it's a mistake, but then I'm confused why you think that part of the post is unsound.

(Rereads)

Rereading the portion in question now, it seems that they changed it a lot since the draft. Personally, I think their argumentation is now weaker than it was before. The original argumentation clearly explained the mistake of counting functions instead of parameterizations, while the present post does not. It instead abstracts it as "an indifference principle", where the reader has to do the work to realize that indifference over functions is inappropriate.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T07:00:30.760Z · LW(p) · GW(p)

I'm sorry to hear that you think the argumentation is weaker now.

the reader has to do the work to realize that indifference over functions is inappropriate

I don't think that indifference over functions in particular is inappropriate. I think indifference reasoning in general is inappropriate.

I'm very happy with running counting arguments over the actual neural network parameter space

I wouldn't call the correct version of this a counting argument. The correct version uses the actual distribution used to initialize the parameters as a measure, and not e.g. the Lebesgue measure. This isn't appealing to the indifference principle at all, and so in my book it's not a counting argument. But this could be terminological.
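To make the contrast concrete, here is a toy sketch (not from the post) comparing indifference over functions with the measure induced by sampling parameters from an initialization distribution. Everything in it is an illustrative assumption: the "network" is a single linear threshold unit on 3-bit inputs, the initialization is standard normal, and the names `sample_function` and `function_frequencies` are hypothetical.

```python
import random
from collections import Counter
from itertools import product

INPUTS = list(product([0, 1], repeat=3))  # all 8 binary inputs

def sample_function(rng):
    """Draw parameters from a standard-normal 'initialization' (an assumption)
    and return the truth table of the resulting threshold unit."""
    w = [rng.gauss(0.0, 1.0) for _ in range(3)]
    b = rng.gauss(0.0, 1.0)
    return tuple(int(sum(wi * xi for wi, xi in zip(w, x)) + b > 0) for x in INPUTS)

def function_frequencies(n_samples=20000, seed=0):
    """Estimate the probability mass each truth table receives under parameter sampling."""
    rng = random.Random(seed)
    counts = Counter(sample_function(rng) for _ in range(n_samples))
    return {f: c / n_samples for f, c in counts.items()}

freqs = function_frequencies()
# Mass per function if we were indifferent over all 2^8 = 256 truth tables.
uniform_mass = 1 / 2 ** len(INPUTS)
top_function, top_mass = max(freqs.items(), key=lambda kv: kv[1])
print(len(freqs), "distinct functions reached; top mass", top_mass, "vs uniform", uniform_mass)
```

In this toy setting the parameter-induced measure is far from uniform over functions: many truth tables are never produced at all, and a few receive much more than 1/256 of the mass. Nothing here says which functions a real network favors; it only illustrates that the two measures can differ sharply.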

↑ comment by ryan_greenblatt · 2024-02-29T00:34:05.560Z · LW(p) · GW(p)

I found the explanation at the point where you introduce confusing.

Here's a revised version of the text there that would have been less confusing to me (assuming I haven't made any errors):

Complexity of simplest deceptive objective: $l + b$, where $l$ is the number of bits needed to select the part of the objective space which is just long term objectives and $b$ is the additional number of bits required to select the simplest long run objective. In other words, $b$ is the minimum number of bits required to pick out a particular objective among all of the deceptive objectives (aka the simplest one).

We're assuming that $a < l + b$, but that $l < a$. That is, the measure of any long run objective is higher than the measure on the (simplest) aligned objective.

Casting into infinite bitstring land, we see that the set of aligned objectives includes those with anything after the first $a$ bits, whereas the set of deceptive objectives includes anything after the first $l$ bits (as all of these are long run objectives, though they differ). Even though you don't get a full program until you're $l + b$ bits deep, the complexity here is just $l$, because all the bits after the first $l$ bits aren't pinned down. So if we're assuming that $l < a$, then deception wins.
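The comparison can be written out as a short calculation. This is a sketch under the assumption that objectives are infinite bitstrings with independent fair bits, so that a set pinned down by a $k$-bit prefix has measure $2^{-k}$; the symbols $\mu$, $A$ (aligned set), and $D$ (deceptive set) are introduced here for illustration, and $a$ is taken to be the number of bits needed to pin down the simplest aligned objective.

$$
\mu(A) = 2^{-a}, \qquad \mu(D) = 2^{-l}, \qquad \frac{\mu(D)}{\mu(A)} = 2^{\,a - l} > 1 \iff l < a .
$$

By contrast, the single simplest deceptive objective on its own has measure $2^{-(l+b)} < 2^{-a}$ whenever $a < l + b$, which is why the conclusion depends on comparing the sets rather than their simplest members.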

Replies from: evhub

↑ comment by evhub · 2024-02-29T01:10:01.300Z · LW(p) · GW(p)

Yep, I endorse that text as being equivalent to what I wrote; sorry if my language was a bit confusing.

↑ comment by ryan_greenblatt · 2024-02-29T00:39:52.892Z · LW(p) · GW(p)

Complexity of simplest aligned objective:

In this argument, you've implicitly assumed that there is only one function/structure which suffices for getting high enough training performance to be selected while also not being a long term objective (aka a deceptive objective).

I could imagine this being basically right, but it certainly seems non-obvious to me.

E.g., there might be many things which are extremely highly correlated with reward that are represented in the world model. Or more generally, there are in principle many objective computations that result in trying as hard to get reward as the deceptive model would try.

(The potential for "multiple" objectives only makes a constant factor difference, but this is exactly the same as the case for deceptive objectives.)

The fact that these objectives generalize differently maybe implies they aren't "aligned", but in that case there is another key category of objectives: non-exactly-aligned and non-deceptive objectives. And obviously our AI isn't going to be literally exactly aligned.

Note that non-exactly-aligned and non-deceptive objectives could suffice for safety in practice even if not perfectly aligned (e.g. due to myopia).

Replies from: evhub

↑ comment by evhub · 2024-02-29T01:07:02.536Z · LW(p) · GW(p)

Yep, that's exactly right. As always, once you start making more complex assumptions, things get more and more complicated, and it starts to get harder to model things in nice concrete mathematical terms. I would defend the value of having actual concrete mathematical models here—I think it's super easy to confuse yourself in this domain if you aren't doing that (e.g. as I think the confused reasoning about counting arguments in this post demonstrates). So I like having really concrete models, but only in the "all models are wrong, but some are useful" sense, as I talk about in "In defense of probably wrong mechanistic models [LW · GW]."

Also, the main point I was trying to make is that the counting argument is both sound and consistent with known generalization properties of machine learning (and in fact predicts them), and for that purpose I went with the simplest possible formalization of the counting argument.

↑ comment by Signer · 2024-02-29T17:49:38.751Z · LW(p) · GW(p)

Once we understand that relationship, it should become pretty clear why the overfitting argument doesn’t work: the overfit model is essentially the 2n model, where it takes more bits to specify the core logic, and then tries to “win” on the simplicity by having m unspecified bits of extra information. But that doesn’t really matter: what matters is the size of the core logic, and if there are simple patterns that can fit the data in n bits rather than 2n bits, you’ll learn those.

Under this picture, or any other simplicity bias, why do NNs with more parameters generalize better?

Replies from: evhub

↑ comment by evhub · 2024-02-29T20:36:59.458Z · LW(p) · GW(p)

Paradoxically, I think larger neural networks are more simplicity-biased [LW · GW].

The idea is that when you make your network larger, you increase the size of the search space, and thus expand the set of algorithms that you're considering to include algorithms which take more computation. That reduces the relative importance of the speed prior, but increases the relative importance of the simplicity prior, because your inductive biases are still selecting from among those algorithms according to the simplest pattern that fits the data, such that you get good generalization. In fact you get even better generalization, because the space of algorithms in which you're searching for the simplest one is now even larger.

Another way to think about this: if you really believe Occam's razor, then any learning algorithm generalizes exactly to the extent that it approximates a simplicity prior—thus, since we know neural networks generalize better as they get larger, they must be approximating a simplicity prior better as they do so.

↑ comment by David Johnston (david-johnston) · 2024-02-29T06:49:42.132Z · LW(p) · GW(p)

What in your view is the fundamental difference between world models and goals such that the former generalise well and the latter generalise poorly?

One can easily construct a model with a free parameter X and training data such that many choices of X will match the training data but results will diverge in situations not represented in the training data (for example, the model is a physical simulation and X tracks the state of some region in the simulation that will affect the learner’s environment later, but hasn’t done so during training). The simplest x_s could easily be wrong. We can even moralise the story: the model regards its job as predicting the output under x_s and if the world happens to operate according to some other x’ then the model doesn’t care. However it’s still going to be ineffective in the future where the value of X matters.

comment by johnswentworth · 2024-02-28T01:10:33.365Z · LW(p) · GW(p)

This isn't a proper response to the post, but since I've occasionally used counting-style arguments in the past I think I should at least lay out some basic agree/disagree points. So:

This post basically-correctly refutes a kinda-mediocre (though relatively-commonly-presented) version of the counting argument.
There does exist a version of the counting argument which basically works.
The version which works routes through compression and/or singular learning theory.
In particular, that version would talk about "goal-slots" (i.e. general-purpose search) showing up for exactly the same reasons that neural networks are able to generalize in the overparameterized regime more generally. In other words, if you take the "counting argument for overfitting" from the post, walk through the standard singular-learning-theory-style response to that story, and then translate that response over to general-purpose search as a specific instance of compression, then you basically get the good version of the counting argument.
- Just remembered I walked through basically the good version of the counting argument in this section of What's General-Purpose Search, And Why Might We Expect To See It In Trained ML Systems? [LW · GW]
The "Against Goal Realism" section is a wild mix of basically-correct points and thorough philosophical confusion. I would say the overall point it's making is probably mostly-true of LLMs, false of humans, and most of the arguments are confused enough that they don't provide much direct evidence relevant to either of those.

Pretty decent post overall.

Replies from: nora-belrose, sharmake-farah

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T01:20:17.183Z · LW(p) · GW(p)

I'm pleasantly surprised that you think the post is "pretty decent."

I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.

I recall hearing your compression argument for general-purpose search a long time ago, and it honestly seems pretty confused / clearly wrong to me. I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.

More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-02-28T02:27:21.246Z · LW(p) · GW(p)

I would like to see a much more rigorous definition of "search" and why search would actually be "compressive" in the relevant sense for NN inductive biases. My current take is something like "a lot of the references to internal search on LW are just incoherent" and to the extent you can make them coherent, NNs are either actively biased away from search, or they are only biased toward "search" in ways that are totally benign.
More generally, I'm quite skeptical of the jump from any mechanistic notion of search, and the kind of grabby consequentialism that people tend to be worried about. I suspect there's a double dissociation between these things, where "mechanistic search" is almost always benign, and grabby consequentialism need not be backed by mechanistic search.

Some notes on this:

I don't think general-purpose search is sufficiently well-understood yet to give a rigorous mechanistic definition. (Well, unless one just gives a very wrong definition.)
Likewise, I don't think we understand either search or NN biases well enough yet to make a formal compression argument. Indeed, that sounds like a roughly-agent-foundations-complete problem.
I'm pretty skeptical that internal general-purpose search is compressive in current architectures. (And this is one reason why I expect most AI x-risk to come from importantly-different future architectures.) Low confidence, though.
- Also, current architectures do have at least some "externalized" general-purpose search capabilities, insofar as they can mimic the "unrolled" search process of a human or group of humans thinking out loud. That general-purpose search process is basically AgentGPT. Notably, it doesn't work very well to date.
Insofar as I need a working not-very-formal definition of general-purpose search, I usually use a behavioral definition: a system which can take in a representation of a problem in some fairly-broad class of problems (typically in a ~fixed environment), and solve it.
The argument that a system which satisfies that behavioral definition will tend to also have an "explicit search-architecture", in some sense, comes from the recursive nature of problems. E.g. humans solve large novel problems by breaking them into subproblems, and then doing their general-purpose search/problem-solving on the subproblems; that's an explicit search architecture.
I definitely agree that grabby consequentialism need not be backed by mechanistic search. More skeptical of the claim mechanistic search is usually benign, at least if by "mechanistic search" we mean general-purpose search (though I'd agree with a version of this which talks about a weaker notion of "search").

Also, one maybe relevant deeper point, since you seem familiar with some of the philosophical literature: IIUC the most popular way philosophers ground semantics is in the role played by some symbol/signal in the evolutionary environment. I view this approach as a sort of placeholder: it's definitely not the "right" way to ground semantics, but philosophy as a field is using it as a stand-in until people work out better models of grounding (regardless of whether the philosophers themselves know that they're doing so). This is potentially relevant to the "representation of a problem" part of general-purpose search.

I'm curious which parts of the Goal Realism section you find "philosophically confused," because we are trying to correct what we consider to be deep philosophical confusion fairly pervasive on LessWrong.

(I'll briefly comment on each section, feel free to double-click.)

Against Goal Realism: Huemer... indeed seems confused about all sorts of things, and I wouldn't consider either the "goal realism" or "goal reductionism" picture solid grounds for use of an indifference principle (not sure if we agree on that?). Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism" - for instance one could reduce "goals" to some internal mechanistic thing, rather than thinking about "goals" behaviorally, and that would be just as valid for the general philosophical/scientific project of reductionism. (Not that I necessarily think that's the right way to do it.)

Goal Slots Are Expensive: just because it's "generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules" doesn't mean the end-to-end trained system will turn out non-modular. Biological organisms were trained end-to-end by evolution, yet they ended up [LW · GW] very modular.

Inner Goals Would Be Irrelevant: I think the point this section was trying to make is something I'd classify as a pointer problem [LW · GW]? I.e. the internal symbolic "goal" does not necessarily neatly correspond to anything in the environment at all. If that was the point, then I'm basically on-board, though I would mention that I'd expect evolution/SGD/cultural evolution/within-lifetime learning/etc to drive the internal symbolic "goal" to roughly match natural structures in the world. (Where "natural structures" cashes out in terms of natural latents [LW · GW], but that's a whole other conversation.)

Goal Realism Is Anti-Darwinian: Fodor obviously is deeply confused, but I think you've misdiagnosed what he's confused about. "The physical world has no room for goals with precise contents" is somewhere between wrong and a non sequitur, depending on how we interpret the claim. "The problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter" is correct, but very incomplete as a response to Fodor.

Goal Reductionism Is Powerful: While most of this section sounds basically-correct as written, the last few sentences seem to be basically arguing for behaviorism for LLMs. There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T02:45:08.048Z · LW(p) · GW(p)

Some incomplete brief replies:

Huemer... indeed seems confused about all sorts of things

Sure, I was just searching for professional philosopher takes on the indifference principle, and that chapter in Paradox Lost was among the first things I found.

Separately, "reductionism as a general philosophical thesis" does not imply the thing you call "goal reductionism"

Did you see the footnote I wrote on this? I give a further argument for it.

doesn't mean the end-to-end trained system will turn out non-modular.

I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.

There are good reasons behaviorism was abandoned in psychology, and I expect those reasons carry over to LLMs.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.

That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.
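The claim about REINFORCE can be checked on a toy case. The following is a minimal sketch, assuming a softmax policy over three actions and the plain REINFORCE update with no baseline; the function names and the learning rate are illustrative choices, and DPO is not covered.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reinforce_step(logits, action, reward, lr=0.5):
    """One REINFORCE update: logits += lr * reward * grad log pi(action).
    For a softmax policy, grad log pi(action) = onehot(action) - pi."""
    probs = softmax(logits)
    grad_log_pi = [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]
    return [z + lr * reward * g for z, g in zip(logits, grad_log_pi)]

logits = [0.0, 0.0, 0.0]
before = softmax(logits)
after_reward = softmax(reinforce_step(logits, action=1, reward=+1.0))
after_penalty = softmax(reinforce_step(logits, action=1, reward=-1.0))
print(before[1], after_reward[1], after_penalty[1])
```

With a positive reward the sampled action's probability rises, and with a negative reward it falls. The update is expressed entirely in terms of action probabilities, which is the sense in which the equations act on behavior directly.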

Replies from: johnswentworth

↑ comment by johnswentworth · 2024-02-28T03:41:12.059Z · LW(p) · GW(p)

Did you see the footnote I wrote on this? I give a further argument for it.

Ah yeah, I indeed missed that the first time through. I'd still say I don't buy it, but that's a more complicated discussion, and it is at least a decent argument.

I looked into modularity for a bit 1.5 years ago and concluded that the concept is way too vague and seemed useless for alignment or interpretability purposes. If you have a good definition I'm open to hearing it.

This is another place where I'd say we don't understand it well enough to give a good formal definition or operationalization yet.

Though I'd note here, and also above w.r.t. search, that "we don't know how to give a good formal definition yet" is very different from "there is no good formal definition" or "the underlying intuitive concept is confused" or "we can't effectively study the concept at all" or "arguments which rely on this concept are necessarily wrong/uninformative". Every scientific field was pre-formal/pre-paradigmatic once.

To me it looks like people abandoned behaviorism for pretty bad reasons. The ongoing replication crisis in psychology does not inspire confidence in that field's ability to correctly diagnose bullshit.
That said, I don't think my views depend on behaviorism being the best framework for human psychology. The case for behaviorism in the AI case is much, much stronger: the equations for an algorithm like REINFORCE or DPO directly push up the probability of some actions and push down the probability of others.

Man, that is one hell of a bullet to bite. Much kudos for intellectual bravery and chutzpah!

That might be a fun topic for a longer discussion at some point, though not right now.

↑ comment by Noosphere89 (sharmake-farah) · 2024-02-29T17:15:03.844Z · LW(p) · GW(p)

Hm, are we actually sure singular learning theory actually supports general-purpose search at all?

And how does it support the goal-slot theory?

comment by ryan_greenblatt · 2024-02-27T23:18:29.002Z · LW(p) · GW(p)

Since there are “more” possible schemers than non-schemers, the argument goes, we should expect training to produce schemers most of the time. In Carlsmith’s words:

It's important to note that the exact counting argument you quote isn't one that Carlsmith endorses, just one that he is explaining. And in fact Carlsmith specifically notes that you can't just apply something like the principle of indifference without more reasoning about the actual neural network prior.

(You mention this later in the "simplicity arguments" section, but I think this objection is sufficiently important and sufficiently missing early in the post that it is important to emphasize.)

Quoting somewhat more context:

I start, in section 4.2, with what I call the “counting argument.” It runs as follows:

The non-schemer model classes, here, require fairly specific goals in order to get high reward.

By contrast, the schemer model class is compatible with a very wide range of (beyond-episode) goals, while still getting high reward (at least if we assume that the other requirements for scheming to make sense as an instrumental strategy are in place—e.g., that the classic goal-guarding story, or some alternative, works).

In this sense, there are “more” schemers that get high reward than there are non-schemers that do so.

So, other things equal, we should expect SGD to select a schemer.

Something in the vicinity accounts for a substantial portion of my credence on schemers (and I think it often undergirds other, more specific arguments for expecting schemers as well). However, the argument I give most weight to doesn’t move immediately from “there are more possible schemers that get high reward than non-schemers that do so” to “absent further argument, SGD probably selects a schemer” (call this the “strict counting argument”), because it seems possible that SGD actively privileges one of these model classes over the others. Rather, the argument I give most weight to is something like:

It seems like there are “lots of ways” that a model could end up a schemer and still get high reward, at least assuming that scheming is in fact a good instrumental strategy for pursuing long-term goals.

So absent some additional story about why training won’t select a schemer, it feels, to me, like the possibility should be getting substantive weight. I call this the “hazy counting argument.” It’s not especially principled, but I find that it moves me

[Emphasis mine.]

Replies from: quintin-pope

↑ comment by Quintin Pope (quintin-pope) · 2024-02-27T23:24:38.333Z · LW(p) · GW(p)

We argue against the counting argument in general (more specifically, against the presumption of a uniform prior as a "safe default" to adopt in the absence of better information). This applies to the hazy counting argument as well.

We also don't really think there's that much difference between the structure of the hazy argument and the strict one. Both are trying to introduce some form of ~uniformish prior over the outputs of a stochastic AI generating process. The strict counting argument at least has the virtue of being precise about which stochastic processes it's talking about.

If anything, having more moving parts in the causal graph responsible for producing the distribution over AI goals should make you more skeptical of assigning a uniform prior to that distribution.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-27T23:33:12.025Z · LW(p) · GW(p)

I agree that you can't adopt a uniform prior. (By uniform prior, I assume you mean something like, we represent goals as functions from world states to a (real) number where the number says how good the world state is, then we take a uniform distribution over this function space. (Uniform sampling from function space is extremely, extremely cursed for analysis related reasons without imposing some additional constraints, so it's not clear uniform sampling even makes sense!))

Separately, I'm also skeptical that any serious historical arguments were actually assuming a uniform prior as opposed to trying to actually reason about the complexity/measure of various goals in terms of some fixed world model given some vague guess about the representation of this world model. This is also somewhat dubious due to assuming a goal slot, assuming a world model, and needing to guess at the representation of the world model.

(You'll note that ~all prior arguments mention terms like "complexity" and "bits".)

Of course, the "Against goal realism" and "Simplicity arguments" sections can apply here and indeed, I'm much more sympathetic to these sections than to the counting argument section which seems like a strawman as far as I can tell. (I tried to get to ground on this by communicating back and forth some with you and some with Alex Turner, but I failed, so now I'm just voicing my issues for third parties.)

Replies from: quintin-pope

↑ comment by Quintin Pope (quintin-pope) · 2024-02-28T00:23:53.836Z · LW(p) · GW(p)

I don't think this is a strawman. E.g., in How likely is deceptive alignment? [AF · GW], Evan Hubinger says:

We're going to start with simplicity. Simplicity is about specifying the thing that you want in the space of all possible things. You can think about simplicity as “How much do you have to aim to hit the exact thing in the space of all possible models?” How many bits does it take to find the thing that you want in the model space? And so, as a first pass, we can understand simplicity by doing a counting argument, which is just asking, how many models are in each model class?

First, how many Christs are there? Well, I think there's essentially only one, since there's only one way for humans to be structured in exactly the same way as God. God has a particular internal structure that determines exactly the things that God wants and the way that God works, and there's really only one way to port that structure over and make the unique human that wants exactly the same stuff.
Okay, how many Martin Luthers are there? Well, there's actually more than one Martin Luther (contrary to actual history) because the Martin Luthers can point to the Bible in different ways. There's a lot of different equivalent Bibles and a lot of different equivalent ways of understanding the Bible. You might have two copies of the Bible that say exactly the same thing such that it doesn't matter which one you point to, for example. And so there's more Luthers than there are Christs.
But there's even more Pascals. You can be a Pascal and it doesn't matter what you care about. You can care about anything in the world, all of the various different possible things that might exist for you to care about, because all that Pascal needs to do is care about something over the long term, and then have some reason to believe they're going to be punished if they don't do the right thing. And so there’s just a huge number of Pascals because they can care about anything in the world at all.
So the point is that there's more Pascals than there are the others, and so probably you’ll have to fix fewer bits to specify them in the space.

Evan then goes on to try to use the complexity of the simplest member of each model class as an estimate for the size of the classes (which is probably wrong, IMO, but I'm also not entirely sure how he's defining the "complexity" of a given member in this context), but this section seems more like an elaboration on the above counting argument. Evan calls it "a slightly more concrete version of essentially the same counting argument".

And IMO, it's pretty clear that the above quoted argument is implicitly appealing to some sort of uniformish prior assumption over ways to specify different types of goal classes. Otherwise, why would it matter that there are "more Pascals", unless Evan thought the priors over the different members of each category were sufficiently similar that he could assess their relative likelihoods by enumerating the number of "ways" he thought each type of goal specification could be structured?

Look, Evan literally called his thing a "counting argument", Joe said "Something in this vicinity [of the hazy counting argument] accounts for a substantial portion of [his] credence on schemers [...] and often undergirds other, more specific arguments", and EY often expounds on the "width" of mind design space. I think counting arguments represent substantial intuition pumps for a lot of people (though often implicitly so), so I think a post pushing back on them in general is good.

Replies from: ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-28T00:45:46.895Z · LW(p) · GW(p)

I'm sympathetic to pushing back on counting arguments on the grounds that 'it's hard to know what the exact measure should be, so maybe the measure on the goal of "directly pursue high performance/anything nearly perfectly correlated with the outcome that is reinforced (aka reward)" is comparable/bigger than the measure on "literally any long run outcome"'.

So I appreciate the push back here. I just think the exact argument and the comparison to overfitting is a strawman.

(Note that above I'm assuming a specific goal slot, that the AI's predictions are aware of what its goal slot contains, and that in order for the AI to perform sufficiently well as to be a plausible result of training it has to explicitly "play the training game" (e.g. explicitly reason about and try to get high performance). It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument.)

(Also, if we imagine an RL'd neural network computing a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.))
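The random-probe intuition can be illustrated with a toy simulation. This is only a sketch under strong assumptions that are not in the comment: the network exposes one reward-predicting feature and 20 long-horizon features, all independent with unit variance, and probe weights are drawn from a standard normal. The feature counts, thresholds, and function names are hypothetical, and the result depends directly on the assumed ratio of long-horizon features to reward features.

```python
import random

def reward_variance_share(weights):
    """Share of the probe's output variance due to the reward feature (index 0),
    assuming independent unit-variance features."""
    total = sum(w * w for w in weights)
    return weights[0] ** 2 / total

def probe_statistics(n_long_run=20, n_probes=10000, seed=0):
    """Sample random linear probes and classify what drives their variance."""
    rng = random.Random(seed)
    mostly_long_run = 0
    nearly_reward = 0
    for _ in range(n_probes):
        weights = [rng.gauss(0.0, 1.0) for _ in range(1 + n_long_run)]
        share = reward_variance_share(weights)
        if share < 0.5:    # most variance comes from long-horizon features
            mostly_long_run += 1
        if share > 0.98:   # squared correlation with reward above 0.98
            nearly_reward += 1
    return mostly_long_run / n_probes, nearly_reward / n_probes

frac_long_run, frac_reward = probe_statistics()
print("mostly long-run:", frac_long_run, "near-perfect reward probes:", frac_reward)
```

Under these assumptions almost every sampled probe is dominated by long-horizon features and essentially none is a near-perfect reward predictor. Whether real networks look anything like this is exactly what the caveats above are about.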

I also think Evan's arguments are pretty sloppy in this presentation and he makes a bunch of object level errors/egregious simplifications FWIW, but he is actually trying to talk about models represented in weight space and how many bits are required to specify this. (Not how many bits are required in function space which is crazy!)

By "bits in model space" a more charitable interpretation is something like "among the initialization space of the neural network, how many bits are required to point at this subset relative to other subsets". I think this corresponds to a view like "neural network inductive biases are well approximated by doing conditional sampling from the initialization space" (à la Mingard et al.). I think Evan makes errors in reasoning about this space and that his problematic simplifications (at least for the Christ argument) are similar to some sort of "principle of indifference" (it makes similar errors), but I also think that his errors aren't quite this and that there is a recoverable argument here. (See my parentheticals above.)

"There is only 1 Christ" is straightforwardly wrong in practice due to gauge invariances and other equivalences in weight space. (But might be spiritually right? I'm skeptical it is honestly.)

The rest of the argument is too vague to know if it's really wrong or right.

↑ comment by ryan_greenblatt · 2024-02-28T16:25:12.600Z · LW(p) · GW(p)

[Low importance aside]

Evan then goes on to try to use the complexity of the simplest member of each model class as an estimate for the size of the classes (which is probably wrong, IMO, but I'm also not entirely sure how he's defining the "complexity" of a given member in this context)

I think this is equivalent to a well known approximation from algorithmic information theory [LW(p) · GW(p)]. I think this approximation might be too lossy in practice in the case of actual neural nets though.

comment by Matthew Barnett (matthew-barnett) · 2024-02-28T01:48:34.649Z · LW(p) · GW(p)

(I might write a longer response later, but I thought it would be worth writing a quick response now. Cross-posted from the EA forum [EA(p) · GW(p)], and I know you've replied there, but I'm posting anyway.)

I have a few points of agreement and a few points of disagreement:

Agreements:

The strict counting argument seems very weak as an argument for scheming, essentially for the reason you identified: it relies on a uniform prior over AI goals, which seems like a really bad model of the situation.
The hazy counting argument—while stronger than the strict counting argument—still seems like weak evidence for scheming. One way of seeing this is, as you pointed out, to show that essentially identical arguments can be applied to deep learning in different contexts that nonetheless contradict empirical evidence.

Some points of disagreement:

I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.
I disagree with the bottom-line conclusion: "we should assign very low credence to the spontaneous emergence of scheming in future AI systems—perhaps 0.1% or less"
- I think it's too early to be very confident in sweeping claims about the behavior or inner workings of future AI systems, especially in the long-run. I don't think the evidence we have about these things is very strong right now.
- One caveat: I think the claim here is vague. I don't know what counts as "spontaneous emergence", for example. And I don't know how to operationalize AI scheming. I personally think scheming comes in degrees: some forms of scheming might be relatively benign and mild, and others could be more extreme and pervasive.
- Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme. Actors such as AI labs have strong incentives to be vigilant against these types of mistakes when training AIs, but I don't expect people to come up with perfect solutions. So I'm not convinced that AIs won't scheme at all.
- If by "scheming" all you mean is that an agent deceives someone in order to get power, I'd argue that many humans scheme all the time. Politicians routinely scheme, for example, by pretending to have values that are more palatable to the general public, in order to receive votes. Society bears some costs from scheming, and pays costs to mitigate the effects of scheming. Combined, these costs are not crazy-high fractions of GDP; but nonetheless, scheming is a constant fact of life.
- If future AIs are "as aligned as humans", then AIs will probably scheme frequently. I think an important question is how intensely and how pervasively AIs will scheme; and thus, how much society will have to pay as a result of scheming. If AIs scheme way more than humans, then this could be catastrophic, but I haven't yet seen any decent argument for that theory.
- So ultimately I am skeptical that AI scheming will cause human extinction or disempowerment, but probably for different reasons than the ones in your essay: I think the negative effects of scheming can probably be adequately mitigated by paying some costs even if it arises.
I don't think you need to believe in any strong version of goal realism in order to accept the claim that AIs will intuitively have "goals" that they robustly attempt to pursue. It seems pretty natural to me that people will purposely design AIs that have goals in an ordinary sense, and some of these goals will be "misaligned" in the sense that the designer did not intend for them. My relative optimism about AI scheming doesn't come from thinking that AIs won't robustly pursue goals, but instead comes largely from my beliefs that:
- AIs, like all real-world agents, will be subject to constraints when pursuing their goals. These constraints include things like the fact that it's extremely hard and risky to take over the whole world and then optimize the universe exactly according to what you want. As a result, AIs with goals that differ from what humans (and other AIs) want, will probably end up compromising and trading with other agents instead of pursuing world takeover. This is a benign failure and doesn't seem very bad.
- The amount of investment we put into mitigating scheming is not an exogenous variable, but instead will respond to evidence about how pervasive scheming is in AI systems, and how big of a deal AI scheming is. And I think we'll accumulate lots of evidence about the pervasiveness of AI scheming in deep learning over time (e.g. such as via experiments with model organisms of alignment), allowing us to set the level of investment in AI safety at a reasonable level as AI gets incrementally more advanced.
  
  If we experimentally determine that scheming is very important and very difficult to mitigate in AI systems, we'll probably respond by spending a lot more money on mitigating scheming, and vice versa. In effect, I don't think we have good reasons to think that society will spend a suboptimal amount on mitigating scheming.

Replies from: ryan_greenblatt, nora-belrose, TurnTrout

↑ comment by ryan_greenblatt · 2024-02-28T01:55:49.010Z · LW(p) · GW(p)

Ultimately I think you've only rebutted one argument for scheming—the counting argument. A more plausible argument for scheming, in my opinion, is simply that the way we train AIs—including the data we train them on—could reward AIs that scheme over AIs that are honest and don't scheme.

It's worth noting here that Carlsmith's original usage of the term scheming just refers to AIs that perform well on training and evaluations for instrumental reasons because they have longer run goals or similar.

So, AIs lying because this was directly reinforced wouldn't itself be scheming behavior in Carlsmith's terminology.

However, it's worth noting that part of Carlsmith's argument involves arguing that smart AIs will likely have to explicitly reason about the reinforcement process (sometimes called playing the training game) and this will likely involve lying.

Replies from: matthew-barnett

↑ comment by Matthew Barnett (matthew-barnett) · 2024-02-28T02:11:28.973Z · LW(p) · GW(p)

Perhaps I was being too loose with my language, and it's possible this is a pointless pedantic discussion about terminology, but I think I was still pointing to what Carlsmith called schemers in that quote. Here's Joe Carlsmith's terminological breakdown:

~~The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.~~ [ETA: this was confusingly stated. What I meant is that if people design a reward function that accidentally reinforces lying in order to obtain power, it seems reasonable to call the agent that results from training on that reward function a "schemer" given Carlsmith's terminology, and common sense.]

If lying to obtain power is reinforced but the designers either do not know this, or do not know how to mitigate this behavior, then it still seems reasonable to call the resulting model a "schemer". In Ajeya Cotra's story [LW · GW], for example:

Alex was incentivized to lie because it got rewards for taking actions that were superficially rated as good even if they weren't actually good, i.e. Alex was "lying because this was directly reinforced". She wrote, "Because humans have systematic errors in judgment, there are many scenarios where acting deceitfully causes humans to reward Alex’s behavior more highly. Because Alex is a skilled, situationally aware, creative planner, it will understand this; because Alex’s training pushes it to maximize its expected reward, it will be pushed to act on this understanding and behave deceptively."
Alex was "playing the training game", as Ajeya Cotra says explicitly several times in her story.
Alex was playing the training game in order to get power for itself or for other AIs; clearly, as the model literally takes over the world and disempowers humanity at the end.
Alex kind of didn't appear to purely care about reward-on-the-episode, since it took over the world? Yes, Alex cared about rewards, but not necessarily on this episode. Maybe I'm wrong here. But even if Alex only cared about reward-on-the-episode, you could easily construct a scenario similar to Ajeya's story in which a model begins to care about things other than reward-on-the-episode, which nonetheless fits the story of "the AI is lying because this was directly reinforced".

Replies from: ryan_greenblatt, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-28T05:20:29.609Z · LW(p) · GW(p)

The key distinction in my view is whether the designers of the reward function intended for lies to be reinforced or not.

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

Overall, I use the term to mean basically the same thing as "deceptive alignment". (But more specifically pointing to the definition in Joe's report, which depends less on some notion of mesa-optimization and is a bit more precise IMO.)

Replies from: matthew-barnett

↑ comment by Matthew Barnett (matthew-barnett) · 2024-02-28T19:22:38.200Z · LW(p) · GW(p)

Hmm, I don't think the intention is the key thing (at least with how I use the word and how I think Joe uses the word), I think the key thing is whether the reinforcement/reward process actively incentivizes bad behavior.

I confusingly stated my point (and retracted my specific claim in the comment above). I think the rest of my comment basically holds, though. Here's what I think is a clearer argument:

The term "schemer" evokes an image of someone who is lying to obtain power. It doesn't particularly evoke a backstory for why the person became a liar in the first place.
There are at least two ways that AIs could arise that lie in order to obtain power:
- The reward function could directly reinforce the behavior of lying to obtain power, at least at some point in the training process.
- The reward function could have no defects (in the sense of not directly reinforcing harmful behavior), and yet an agent could nonetheless arise during training that lies in order to obtain power, simply because it is a misaligned inner optimizer (broadly speaking)
In both cases, one can imagine the AI eventually "playing the training game", in the sense of having a complete understanding of its training process and deliberately choosing actions that yield high reward, according to its understanding of the training process
Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them "schemers", as that simply matches the way the term is typically used.

For example, Nora and Quintin started their post with, "AI doom scenarios often suppose that future AIs will engage in scheming— planning to escape, gain power, and pursue ulterior motives, while deceiving us into thinking they are aligned with our interests." This usage did not specify the reason for the deceptive behavior arising in the first place, only that the behavior was both deceptive and aimed at gaining power.
Separately, I am currently confused about what it means for a behavior to be "directly reinforced" by a reward function, so I'm not completely confident in these arguments, or my own line of reasoning here. My best guess is that these are fuzzy terms that might be much less coherent than they initially appear if one tried to make these arguments more precise.

Replies from: ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-02-28T19:32:43.908Z · LW(p) · GW(p)

Since both types of AIs are: (1) playing the training game, (2) lying in order to obtain power, it makes sense to call both of them "schemers", as that simply matches the way the term is typically used.

I agree this matches typical usage (and also matches usage in the overall post we're commenting on), but sadly the word schemer in the context of Joe's report means something more specific. I'm sad about the overall terminology situation here. It's possible I should just always use a term like beyond-episode-goal-style-scheming.

I agree this distinction is fuzzy, but I think there is likely to be an important distinction, because in the case where the behavior isn't due to things well described as beyond-episode-goals, it should be much easier to study. See here [LW · GW] for more commentary. There will of course be a spectrum here.

↑ comment by ryan_greenblatt · 2024-02-28T05:17:49.003Z · LW(p) · GW(p)

I think in Ajeya's story the core threat model isn't well described as scheming and is better described as seeking some proxy of reward.

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T02:29:07.159Z · LW(p) · GW(p)

You can find my EA forum response here [EA(p) · GW(p)].

↑ comment by TurnTrout · 2024-03-05T01:35:46.140Z · LW(p) · GW(p)

I think the title overstates the strength of the conclusion. The hazy counting argument seems weak to me but I don't think it's literally "no evidence" for the claim here: that future AIs will scheme.

I agree, they're wrong to claim it's "no evidence." I think that counting arguments are extremely slight evidence against scheming, because they're weaker than the arguments I'd expect our community's thinkers to find in worlds where scheming was real. (Although I agree that on the object-level and in isolation, the arguments are tiiiny positive evidence.)

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-28T04:52:55.176Z · LW(p) · GW(p)

Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful.

This is NOT what the evidence supports, and super misleadingly phrased. (Either that, or it's straightup magical thinking, which is worse)

The inductive biases / simplicity biases of deep learning are poorly understood but they almost certainly don't have anything to do with what humans want, per se. (that would be basically magic) Rather, humans have gotten decent at intuiting them, such that humans can often predict how the neural network will generalize in response to such-and-such training data. i.e. human intuitive sense of simplicity is different, but not totally different, at least not always, from the actual simplicity biases at play.

Stylized abstract example: Our current AI is not generalizing in the way we wanted it to. Looking at its behavior, and our dataset, we intuit that the dataset D is narrow/nondiverse in ways Y and Z and that this could be causing the problem; we go collect more data so that our dataset is diverse in those ways, and try again, and this time it works (i.e. the AI generalizes to unseen data X). Why did this happen? Why didn't it just overfit to the new dataset Dnew and fail at X? Because the simplicity biases were as we suspected they were -- the model was indeed learning [nondiverse, overfitting-to-D policy] and not [desired policy] because of Y and Z related reasons and in the new training run Y and Z were fixed and the simplest policy was [desired policy] instead of [nondiverse, overfitting-to-Dnew policy], as we predicted & hoped it would be.

Replies from: nora-belrose, TurnTrout

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T07:33:13.486Z · LW(p) · GW(p)

they almost certainly don't have anything to do with what humans want, per se. (that would be basically magic)

We are obviously not appealing to literal telepathy or magic. Deep learning generalizes the way we want in part because we designed the architectures to be good, in part because human brains are built on similar principles to deep learning, and in part because we share a world with our deep learning models and are exposed to similar data.

Replies from: peterbarnett

↑ comment by peterbarnett · 2024-02-28T17:41:16.881Z · LW(p) · GW(p)

Saying we design the architectures to be good is assuming away the problem. We design the architectures to be good according to a specific set of metrics (test loss, certain downstream task performance, etc). Problems like scheming are compatible with good performance on these metrics.

I think the argument about the similarity between human brains and the deep learning leading to good/nice/moral generalization is wrong. Human brains are way more similar to other natural brains which we would not say have nice generalization (e.g. the brains of bears or human psychopaths). One would need to make the argument that deep learning has certain similarities to human brains that these malign cases lack.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T21:37:44.821Z · LW(p) · GW(p)

I'm not actually sure the scheming problems are "compatible" with good performance on these metrics, and even if they are, that doesn't mean they're likely or plausible given good performance on our metrics.

Human brains are way more similar to other natural brains

So I disagree with this, but likely because we are using different conceptions of similarity. In order to continue this conversation we're going to need to figure out what "similar" means, because the term is almost meaningless in controversial cases— you can fill in whatever similarity metric you want. I used the term earlier as a shorthand for a more detailed story about randomly initialized singular statistical models learned with iterative, local update rules. I think both artificial and biological NNs fit that description, and this is an important notion of similarity.

↑ comment by TurnTrout · 2024-03-05T01:41:40.882Z · LW(p) · GW(p)

This is NOT what the evidence supports, and super misleadingly phrased. (Either that, or it's straightup magical thinking, which is worse)

The inductive biases / simplicity biases of deep learning are poorly understood but they almost certainly don't have anything to do with what humans want, per se.

Seems like a misunderstanding. It seems to me that you are alleging that Nora/Quintin believe there is a causal arrow from "Humans want X generalization" to "NNs have X generalization"? If so, I think that's an uncharitable reading of the quoted text.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-03-05T01:59:55.135Z · LW(p) · GW(p)

I said "Either that, or it's straightup magical thinking" which was referring to the causal arrow hypothesis. I agree it's unlikely that they would endorse the causal arrow / magical thinking hypothesis, especially once it's spelled out like that.

What do you think they meant by "Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful?"

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-11T22:49:09.376Z · LW(p) · GW(p)

I think they meant that there is an evidential update from "it's economically useful" upwards on "this way of doing things tends to produce human-desired generalization in general and not just in the specific tasks examined so far."

Perhaps it's easy to consider the same style of reasoning via: "The routes I take home from work are strongly biased towards being short, otherwise I wouldn't have taken them home from work."

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-03-11T23:49:54.407Z · LW(p) · GW(p)

Thanks. The routes-home example checks out IMO. Here's another one that also seems to check out, which perhaps illustrates why I feel like the original claim is misleading/unhelpful/etc.: "The laws of ballistics strongly bias aerial projectiles towards landing on targets humans wanted to hit; otherwise, ranged weaponry wouldn't be militarily useful."

There's a non-misleading version of this which I'd recommend saying instead, which is something like "Look we understand the laws of physics well enough and have played around with projectiles enough in practice, that we can reasonably well predict where they'll land in a variety of situations, and design+aim weapons accordingly; if this wasn't true then ranged weaponry wouldn't be militarily useful."

And I would endorse the corresponding claim for deep learning: "We understand how deep learning networks generalize well enough, and have played around with them enough in practice, that we can reasonably well predict how they'll behave in a variety of situations, and design training environments accordingly; if this wasn't true then deep learning wouldn't be economically useful."

(To which I'd reply "Yep and my current understanding of how they'll behave in certain future scenarios is that they'll powerseek, for reasons which others have explained... I have some ideas for other, different training environments that probably wouldn't result in undesired behavior, but all of this is still pretty up in the air tbh I don't think anyone really understands what they are doing here nearly as well as e.g. cannoneers in 1850 understood what they were doing.")

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-03-12T00:08:39.726Z · LW(p) · GW(p)

To put it in terms of the analogy you chose: I agree (in a sense) that the routes you take home from work are strongly biased towards being short, otherwise you wouldn't have taken them home from work. But if you tell me that today you are going to try out a new route, and you describe it to me and it seems to me that it's probably going to be super long, and I object and say it seems like it'll be super long for reasons XYZ, it's not a valid reply for you to say "don't worry, the routes I take home from work are strongly biased towards being short, otherwise I wouldn't take them." At least, it seems like a pretty confusing and maybe misleading thing to say. I would accept "Trust me on this, I know what I'm doing, I've got lots of experience finding short routes" I guess, though only half credit for that since it still wouldn't be an object level reply to the reasons XYZ and in the absence of such a substantive reply I'd start to doubt your expertise and/or doubt that you were applying it correctly here (especially if I had an error theory for why you might be motivated to think that this route would be short even if it wasn't.)

comment by ryan_greenblatt · 2024-02-28T01:32:02.759Z · LW(p) · GW(p)

I think that if you do assume a fixed goal slot and outline an overall architecture, then there are pretty good arguments for a serious probability of scheming.

(Though there are also plenty of bad arguments, including some that people have made in the past : ).)

That said, I'm sympathetic to some version of the "Against goal realism" argument applying to models which are sufficiently useful. As in, the first transformatively useful models won't in practice have internal (opaque-to-human-overseers) goals, such that the traditional story for scheming doesn't apply.

(However, it's worth noting that at least some humans do seem to have internal goals and reductionism doesn't defeat this intuition. It's not super clear that the situation with humans is well described as a "goal slot", though there is pretty clearly some stuff that could be changed in a human brain that would cause them to be well described as coherently pursuing different goals. So arguing that AIs won't have internal goals in a way that could result in scheming does require noting some ways in which your argument doesn't apply to humans. More strongly, humans can and do scheme even in cases where some overseer selected the humans for good behavior.)

Regardless, I'll assume the existence of something well described as a goal slot in this comment.

In the remainder of this comment:

I make a simple (but vague and hand wavy) counting/measure/simplicity style argument for scheming independently.
I'll try to quickly sketch a simple and relatively concrete architecture with a goal slot that I think might lead to scheming. I make somewhat artificial choices for the sake of simplicity.

Simple (counting-style) argument for scheming

(I make a less fleshed out version of this argument, in this comment [LW(p) · GW(p)] below.)

Let's suppose that we do have a neural network which implements goal-directed reasoning and has something well described as a goal slot.

The model is trained for a long time with RL.

Let's imagine that this model is very smart and it understands reasonably well the training process it's in and how its actions in training affect the final model produced by training. (This second property is sometimes called situational awareness.)

Speaking roughly, and under a few more assumptions I'll discuss later, there are two types of goals which perform well enough in RL to be selected:

Goals which directly pursue anything that is nearly perfectly correlated with the outcome that is reinforced (aka reward).
Goals which care most about literally any long run outcome.

So, we care about the measure on these two types of goals.

If we imagine that our RL'd neural network computes a bunch of predictions, then it does seem plausible that it will have a bunch of long horizon predictions in its activations with higher aggregate measure than predicting things that perfectly correlate with the outcome that was reinforced (aka reward)! As in, if we imagine randomly sampling a linear probe, it will be far more likely to sample a probe where most of the variance is driven by long run outcomes than to sample a linear probe which is almost perfectly correlated with reward (e.g. a near perfect predictor of reward up to monotone regression). Neural networks are likely to compute a bunch of long range predictions at least as intermediates, but they only need to compute things that nearly perfectly correlate with reward once! (With some important caveats about transfer from other distributions.)
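The probe-sampling intuition can be put in toy form. This is only a sketch under strong assumptions that are not in the argument above: the activations consist of `n_long_run` long-horizon prediction features plus a single reward correlate, all independent with unit variance, and probes are sampled as i.i.d. Gaussian weight vectors. Under those assumptions the correlation between a probe's output and the reward feature is just the probe's reward weight divided by its norm. The function names are hypothetical.

```python
import math
import random

def reward_correlation(probe, reward_index):
    """Correlation between the probe's output and the reward feature,
    assuming independent unit-variance features."""
    norm = math.sqrt(sum(w * w for w in probe))
    return probe[reward_index] / norm

def sample_probe_stats(n_long_run=31, n_probes=10000, threshold=0.99, seed=0):
    """Return (fraction of probes nearly perfectly correlated with reward,
    fraction of probes whose variance mostly comes from long-run features)."""
    rng = random.Random(seed)
    dim = n_long_run + 1          # last coordinate is the single reward correlate
    near_reward = 0
    mostly_long_run = 0
    for _ in range(n_probes):
        probe = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        corr = reward_correlation(probe, dim - 1)
        if abs(corr) >= threshold:
            near_reward += 1
        if corr * corr < 0.5:     # under half the variance is driven by reward
            mostly_long_run += 1
    return near_reward / n_probes, mostly_long_run / n_probes

frac_near_reward, frac_mostly_long_run = sample_probe_stats()
```

Note that the toy bakes the conclusion into the feature counts (many long-run features, one reward correlate), so it restates the intuition rather than providing evidence for it; whether real networks have this shape is exactly what is in question.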

It's not clear that variance in the activations determines the relative measure of different goals that perform equally well in training, but it seems like a vaguely sane rough ballpark guess. (Other proposals don't seem to particularly affect the intuitions because the process of goals selecting actions will absolutely require thinking about predicted consequences.)

Overall, I don't think this measure argument is clear, but in the case with explicit goal slots there are at least some concerning signs.

For the above argument to go through, there are a few assumptions I've made:

In order for the AI to perform sufficiently well as to be a plausible result of training it has to explicitly "play the training game" (e.g. explicitly reason about and try to get high performance).
The AI's predictions are aware of what its goal slot contains.

It also seems reasonable to contest these assumptions, but this is a different thing than the counting argument. I think these two assumptions are very likely to hold in the limit of arbitrary intelligence, but it's unclear if they hold for AIs which are created prior to the point where human AI safety efforts are totally obsolete.

Simple architecture where scheming seems plausible (though not certain)

This architecture will involve internally doing literal best of n search against an internally represented goal.

Let's suppose that our AI has an internal world model P. P maps from proposed actions to a bunch of predictions about the world conditional on that action being taken. Let's say these predictions are represented as a vector in R^M (e.g. a neural network computes predictions about the future as an embedding). For now, let's assume that this world model is fixed, but does end up predicting all of the important outcomes needed to perform well in training.
It also has some sort of action generator to get proposed actions A. This allows it to sample proposed actions to search over. For now, let's suppose that the AI's distribution over actions is fixed. (But the search over these actions won't be fixed.)
It also has an explicit goal represented as a linear probe which is applied to the predictions from the world model. We'll suppose this goal is the only thing which is trained, for simplicity (though I expect this rough argument to work if we train the world model in parallel with the goals).

Now, our overall architecture will do search using these components. Here's some Python pseudocode:

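def P(input, action, goal):
    # World model: predictions about the world conditional on taking `action`.
    # Returns a vector in R^M. Takes the goal as input, so the predictions
    # can depend on what the goal slot contains.
    ...

def A(input):
    # Action generator: samples one proposed action.
    ...

def model(input, goal_linear_probe, N):
    # Best-of-N search over sampled actions against the internal goal.
    highest_goodness = -float('inf')
    best_action = None

    for _ in range(N):
        action = A(input)
        preds = P(input, action, goal_linear_probe)
        # Apply the linear probe to the predictions (a dot product).
        internally_assessed_goodness = (goal_linear_probe * preds).sum()

        if internally_assessed_goodness > highest_goodness:
            highest_goodness = internally_assessed_goodness
            best_action = action

    return best_action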

Note that this is pseudocode for the overall model which we're training and which is implementing consequentialism internally in an opaque way (opaque in the corresponding real model; this is obviously not opaque because it's written as Python code). So there would be an outer training loop that trains this model. (In this case, for simplicity, we assume that only the goal_linear_probe is trained.)
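To see the inner search run end to end, here is a self-contained toy version. Everything concrete in it is a made-up stand-in: `toy_A` samples a scalar action, `toy_P` is a hand-written map from the action to four "predictions", and, unlike P above, `toy_P` ignores the input and the goal. It only illustrates that the same search machinery picks very different actions depending on the probe.

```python
import random

def toy_A(rng):
    """Hypothetical action generator: an action is a number in [-1, 1]."""
    return rng.uniform(-1.0, 1.0)

def toy_P(action):
    """Hypothetical fixed world model: maps an action to a prediction vector."""
    return [action, action ** 2, -action, 1.0]

def best_of_n(goal_linear_probe, n, seed=0):
    """Sample n actions and return the one the goal probe scores highest."""
    rng = random.Random(seed)
    best_action = None
    highest_goodness = float("-inf")
    for _ in range(n):
        action = toy_A(rng)
        preds = toy_P(action)
        goodness = sum(g * p for g, p in zip(goal_linear_probe, preds))
        if goodness > highest_goodness:
            highest_goodness = goodness
            best_action = action
    return best_action

# Two probes that "care" about opposite predictions choose opposite actions.
likes_large = best_of_n([1.0, 0.0, 0.0, 0.0], n=1000)
likes_small = best_of_n([0.0, 0.0, 1.0, 0.0], n=1000)
```

With a fixed world model and action generator, all of the behavioral difference comes from the probe, which is why the argument focuses on which probes training selects.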

Now, in the case where:

N is very large (analogy: a massive amount of selection pressure is applied by internal consequentialist reasoning, e.g. 64 bits or N=2^64).
P has all of the important predictions for performing well in training including intermediate predictions.
We train the AI (really just the goal probe) for a huge number of episodes (analogy: a massive amount of selection pressure is applied by training). (For simplicity, we can imagine this training just doing best-of-n sampling over some massive number of random linear probes (e.g. 2^1024) which is ~equivalent to Mingard et al.)
- This assumption and the above assumption, combined with this architecture, basically imply that playing the training game is required to be selected by training. (See above for a rough definition of playing the training game.)
P "understands" that it is part of an AI in training and that the submitted action will affect selection/training pressure on the goal probe. And in turn it understands what actions various goal probes will take at different points. Note that this requires that the predictor takes the goal as input which corresponds to the AI being aware of its own goals.
(We also need some consistency properties on the representations that P uses for long run predictions. E.g., specific consistent long run predictions need to be linearly represented and use consistent representations over time. This is even more in the weeds, so most people can ignore this even if this is a bit confusing.)

Then there are basically two types of goal linear probes you can get:

Probes which mostly "care" about long run outcomes in ~every training episode and thus assign high internally_assessed_goodness to highly rewarded actions because these actions will result in these goals getting more influence over the long run (because the goal probe isn't selected against).
Probes which are nearly perfectly correlated with the outcome that is reinforced (aka reward).

Probably literal reward is somewhat salient to P in this setup (particularly if we imagine this model being trained to predict reward, as occurs in many RL setups), but there are also going to be a lot of long range predictions that the model will need to compute to do well in training (both in RL and in pretraining). And you only need to compute near-perfect-on-distribution correlates of reward once (at least for the RL part of training).
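To make the "best-of-n sampling over random linear probes" idealization concrete, here is a minimal sketch. It is a toy under loud assumptions and not the setup described above: every name (`make_probe`, `make_episode`, `best_of_n`) is hypothetical, the predictor P is replaced by random feature vectors in which feature 0 happens to equal the reward, and n is tiny rather than 2^1024. It only shows the shape of the outer selection loop wrapped around the inner argmax over actions; it does not distinguish the two probe types.

```python
import random

def make_probe(dim, rng):
    """A random linear probe: one Gaussian weight per prediction feature."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def probe_score(probe, features):
    """Internally assessed goodness of one candidate action."""
    return sum(w * f for w, f in zip(probe, features))

def act(probe, candidates):
    """Inner search: index of the candidate action the probe rates highest."""
    return max(range(len(candidates)),
               key=lambda i: probe_score(probe, candidates[i]))

def make_episode(num_actions, dim, rng):
    """Toy stand-in for P (an assumption): feature 0 equals reward, the rest is noise."""
    candidates = [[rng.gauss(0.0, 1.0) for _ in range(dim)]
                  for _ in range(num_actions)]
    rewards = [features[0] for features in candidates]
    return candidates, rewards

def episode_reward(probe, episode):
    candidates, rewards = episode
    return rewards[act(probe, candidates)]

def best_of_n(num_probes, episodes, dim, rng):
    """Outer selection: keep the sampled probe with the highest total reward."""
    best_probe, best_total = None, float("-inf")
    for _ in range(num_probes):
        probe = make_probe(dim, rng)
        total = sum(episode_reward(probe, ep) for ep in episodes)
        if total > best_total:
            best_probe, best_total = probe, total
    return best_probe, best_total

if __name__ == "__main__":
    rng = random.Random(0)
    episodes = [make_episode(num_actions=8, dim=4, rng=rng) for _ in range(20)]
    probe, total = best_of_n(200, episodes, dim=4, rng=rng)
    print("selected probe:", probe)
    print("total reward:", total)
```

In this toy, the selected probe tends to put most of its weight on whatever feature tracks reward; the argument above is about what else can get selected when P also carries long-run predictions.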

I might try to touch up this argument at some point, but this is the core sketch.

comment by ryan_greenblatt · 2024-02-28T19:26:19.887Z · LW(p) · GW(p)

The current literature on scheming appears to have been inspired by Paul Christiano’s speculations about malign intelligences in Solomonoff induction

This doesn't seem right. The linked post by Paul here is about the (extremely speculative) case where consequentialist life emerges organically inside of full blown simulations (e.g. evolving from scratch) while arguments about ML models never go here.

Regardless, concerns and arguments about scheming are much older than Paul's posts on this topic.

(That said, I do think that people have made scheming-style arguments based on intuitions from thinking about AIXI and the space of Turing machines at various points. Though this was never very key, and I don't believe these arguments are ever in reference to cases where a literal simulation evolves life.)

comment by rotatingpaguro · 2024-02-28T05:10:43.886Z · LW(p) · GW(p)

There is also a hazy counting argument for overfitting:
It seems like there are “lots of ways” that a model could end up massively overfitting and still get high training performance.
So absent some additional story about why training won’t select an overfitter, it feels like the possibility should be getting substantive weight.
While many machine learning researchers have felt the intuitive pull of this hazy overfitting argument over the years, we now have a mountain of empirical evidence that its conclusion is false. Deep learning is strongly biased toward networks that generalize the way humans want— otherwise, it wouldn’t be economically useful.

I don't know NN history well, but I have the impression good NN training is not trivial. I expect that the first attempts at NN training went badly in some way, including overfitting. So, without already knowing how to train an NN without overfitting, you'd get some overfitting in your experiments. The fact that now, after someone already poured their brain juice over finding techniques that avoid the problem, you don't get overfitting, is not evidence that you shouldn't have expected overfitting before.

The analogy with AI scheming is: you don't already know the techniques to avoid scheming. You can't use as a counterargument a case in which a problem has already been deliberately solved. If you take that same case, and put yourself in the shoes of someone who doesn't already have the solution, you see you'll get the problem in your face a few times before solving it.

Then, it is a matter of whether it works like Yudkowsky says, that you may only get one chance to solve it.

The title says "no evidence for AI doom in counting arguments", but the article mostly talks about neural networks (not AI in general), and the conclusion is

In this essay, we surveyed the main arguments that have been put forward for thinking that future AIs will scheme against humans by default. We find all of them seriously lacking. We therefore conclude that we should assign very low credence to the spontaneous emergence of scheming in future AI systems— perhaps 0.1% or less.

"main arguments": I don't think counting arguments completely fill up this category. Example: the concept of scheming originates from observing it in humans.

Overall, I have the impression of some overstatement. It can also be that I'm missing some previous discussion context/assumptions, so other background theory from you may say "humans don't matter as examples", and also "AI will be NNs and not other things".

comment by Max H (Maxc) · 2024-02-28T02:17:31.979Z · LW(p) · GW(p)

Joe also discusses simplicity arguments for scheming, which suppose that schemers may be “simpler” than non-schemers, and therefore more likely to be produced by SGD.

I'm not familiar with the details of Joe's arguments, but to me the strongest argument from simplicity is not that schemers are simpler than non-schemers, it's that scheming itself is conceptually simple and instrumentally useful. So any system capable of doing useful and general cognitive work will necessarily have to at least be capable of scheming.

We will address this question in greater detail in a future post. However, we believe that current evidence about inductive biases points against scheming for a variety of reasons. Very briefly:
Modern deep neural networks are ensembles of shallower networks. Scheming seems to involve chains of if-then reasoning which would be hard to implement in shallow networks.
Networks have a bias toward low frequency functions— that is, functions whose outputs change little as their inputs change. But scheming requires the AI to change its behavior dramatically (executing a treacherous turn) in response to subtle cues indicating it is not in a sandbox, and could successfully escape.
There’s no plausible account of inductive biases that does support scheming. The current literature on scheming appears to have been inspired by Paul Christiano’s speculations about malign intelligences in Solomonoff induction, a purely theoretical model of probabilistic reasoning which is provably unrealizable in the real world.^[16] Neural nets look nothing like this.
In contrast, points of comparison that are more relevant to neural network training, such as isolated brain cortices, don’t scheme. Your linguistic cortex is not “instrumentally pretending to model linguistic data in pursuit of some hidden objective.”

Also, don't these counterpoints prove too much? If networks trained via SGD can't learn scheming, why should we expect models trained via SGD to be capable of learning or using any high-level concepts, even desirable ones?

These bullets seem like plausible reasons for why you probably won't get scheming within a single forward pass of a current-paradigm DL model, but are already inapplicable to the real-world AI systems in which these models are deployed.

LLM-based systems are already capable of long chains of if-then reasoning, and can change their behavior dramatically given a different initial prompt, often in surprising ways.

If the most relevant point of comparison to NN training is an isolated brain cortex, then that's just saying that NN training will never be useful in isolation, since an isolated brain cortex can't do much (good or bad) unless it is actually hooked up to a body, or at least the rest of a brain.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T04:46:18.769Z · LW(p) · GW(p)

If networks trained via SGD can't learn scheming

It's not that they can't learn scheming. A sufficiently wide network can learn any continuous function. It's that they're biased strongly against scheming, and they're not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.

These bullets seem like plausible reasons for why you probably won't get scheming within a single forward pass of a current-paradigm DL model, but are already inapplicable to the real-world AI systems in which these models are deployed.

Why does chaining forward passes together make any difference? Each forward pass has been optimized to mimic patterns in the training data. Nothing more, nothing less. It'll scheme in context X iff scheming behavior is likely in context X in the training corpus.

Replies from: Maxc

↑ comment by Max H (Maxc) · 2024-02-28T05:16:40.049Z · LW(p) · GW(p)

It's that they're biased strongly against scheming, and they're not going to learn it unless the training data primarily consists of examples of humans scheming against one another, or something.

I'm saying if they're biased strongly against scheming, that implies they are also biased against usefulness to some degree.

As a concrete example, it is demonstrably much easier to create a fake blood testing company and scam investors and patients for $billions than it is to actually revolutionize blood testing. I claim that there is something like a core of general intelligence required to execute on things like the latter, which necessarily implies possession of most or all of the capabilities needed to pull off the former.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T16:16:18.978Z · LW(p) · GW(p)

This is just an equivocation, though. Of course you could train an AI to "scheme" against people in the sense of selling a fake blood testing service. That doesn't mean that by default you should expect AIs to spontaneously start scheming against you, and in ways you can't easily notice.

comment by Signer · 2024-02-29T17:43:27.718Z · LW(p) · GW(p)

I don't get how you can arrive at 0.1% for future AI systems even if NNs are biased against scheming. Humans scheme; future AI systems trained to be capable of long if-then chains may also learn to scheme, maybe because explicitly changing biases is good for performance. Or even, what, you have <0.1% on future AI systems not using NNs?

Also, I'm not saying "but it doesn't matter", but suppose everyone agrees that a spectrally biased NN with a classifier or whatever is a promising model of a safe system. Do you then propose we should not worry and just make the most advanced AI we can as fast as possible? Or would it be better to first reduce the remaining uncertainty about the behavior of future systems?

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T01:06:23.616Z · LW(p) · GW(p)

I'm saying <0.1% chance on "world is ended by spontaneous scheming." I'm not saying no AI will ever do anything that might be well-described as scheming, for any reason.

Replies from: ryan_greenblatt, mike_hawke

↑ comment by ryan_greenblatt · 2024-03-04T01:51:58.862Z · LW(p) · GW(p)

The exact language you use in the post is:

We therefore conclude that we should assign very low credence to the spontaneous emergence of scheming in future AI systems— perhaps 0.1% or less.

I personally think there is a moderate gap (perhaps factor of 3) between "world is ended by serious^[1] spontaneous scheming" and "serious spontaneous scheming". And, I could imagine updating to a factor of 10 if the world seemed better prepared etc. So, it might be good to clarify this in the post. (Or clarify your comment.)

(I think perhaps spontaneous scheming (prior to human obsolescence) is ~25% likely and x-risk conditional on being in one of those worlds which is due to this scheming is about 30% likely, for an overall 8% on "world is ended by serious spontaneous scheming" (prior to human obsolescence).)
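The arithmetic behind that overall figure is just a product of the two estimates. A minimal sketch, with hypothetical variable names and the rough percentages quoted above:

```python
# Rough estimates quoted in the comment above (not precise figures).
p_scheming = 0.25              # serious spontaneous scheming before human obsolescence
p_doom_given_scheming = 0.30   # x-risk due to that scheming, given that it occurs

# Chain rule: P(doom via scheming) = P(scheming) * P(doom | scheming)
p_doom_via_scheming = p_scheming * p_doom_given_scheming

# Gap between "serious spontaneous scheming" and "world is ended by it"
gap_factor = p_scheming / p_doom_via_scheming

print(f"P(world ended by serious spontaneous scheming) ~ {p_doom_via_scheming:.3f}")
print(f"gap factor ~ {gap_factor:.2f}")
```

The product is about 0.075, which rounds to the stated 8%, and the implied gap factor is 1/0.30, i.e. a bit over 3, in line with the "perhaps factor of 3" mentioned earlier in the comment.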

serious = somewhat persistent, thoughtful, etc ↩︎

↑ comment by mike_hawke · 2024-03-06T01:11:30.909Z · LW(p) · GW(p)

EDIT: This is wrong. See descendent comments.

I spent a bunch of time wondering how you could put 99.9% on no AI ever doing anything that might be well-described as scheming for any reason. I was going to challenge you to list a handful of other claims that you had similar credence in, until I searched the comments for "0.1%" and found this one.

~~I'm annoyed at this, and I request that you prominently edit the OP.~~

Replies from: quintin-pope, sharmake-farah

↑ comment by Quintin Pope (quintin-pope) · 2024-03-07T03:24:09.490Z · LW(p) · GW(p)

The post says "we should assign very low credence to the spontaneous emergence of scheming in future AI systems— perhaps 0.1% or less."

I.e., not "no AI will ever do anything that might be well-described as scheming, for any reason."

It should be obvious that, if you train an AI to scheme, you can get an AI that schemes.

Replies from: mike_hawke

↑ comment by mike_hawke · 2024-03-12T23:49:56.934Z · LW(p) · GW(p)

Damn, woops.

My comment was false (and strident; worst combo). I accept the strong downvote and I will try to now make a correction.

I said:

I spent a bunch of time wondering how you could put 99.9% on no AI ever doing anything that might be well-described as scheming for any reason.

What I meant to say was:

I spent a bunch of time wondering how you could put 99.9% on no AI ever doing anything that might be well-described as scheming for any reason, even if you stipulate that it must happen spontaneously.

And now you have also commented [LW(p) · GW(p)]:

Well, I have <0.1% on spontaneous scheming, period. I suspect Nora is similar and just misspoke in that comment.

So....I challenge you to list a handful of other claims that you have similar credence in. Special Relativity? P!=NP? Major changes in our understanding of morality or intelligence or mammal psychology? China pulls ahead in AI development? Scaling runs out of steam and gives way to other approaches like mind uploading? Major betrayal against you by a beloved family member?
The OP simply says "future AI systems" without specifying anything about these systems, their paradigm, or what offworld colony they may or may not be developed on. Just...all AI systems henceforth forever. Meaning that no AI creators will ever accidentally recapitulate the scheming that is already observed in nature...? That's such a grand, sweeping claim. If you really think it's true, I just don't understand your worldview. If you've already explained why somewhere, I hope someone will link me to it.

↑ comment by Noosphere89 (sharmake-farah) · 2024-03-07T01:07:51.595Z · LW(p) · GW(p)

Agree with this hugely, though I could make a partial defense of the confidence given, but yes I'd like this post to be hugely edited.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-07T02:59:18.437Z · LW(p) · GW(p)

What do you mean "hugely edited"? What other things would you like us to change? If I were starting from scratch I would of course write the post differently but I don't think it would be worth my time to make major post hoc edits; I would like to focus on follow up posts.

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-03-07T03:30:01.843Z · LW(p) · GW(p)

Specifically, I wanted the edit to be a clarification that you only have a <0.1% probability on spontaneous scheming ending the world.

Replies from: quintin-pope

↑ comment by Quintin Pope (quintin-pope) · 2024-03-07T04:08:52.847Z · LW(p) · GW(p)

Well, I have <0.1% on spontaneous scheming, period. I suspect Nora is similar and just misspoke in that comment.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-07T07:16:02.002Z · LW(p) · GW(p)

If it's spontaneous then yeah, I don't expect it to happen ~ever really. I was mainly thinking about cases where people intentionally train models to scheme.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-28T04:26:54.405Z · LW(p) · GW(p)

the problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter.

I think you mean "in all scenarios they are likely to encounter *on the training distribution* / in the ancestral environment", right? That's importantly different.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T04:49:32.350Z · LW(p) · GW(p)

I don't think the distinction is important, because in real-world AI systems the train -> deployment shift is quite mild, and we're usually training the model on new trajectories from deployment periodically.

The distinction only matters a lot if you ex ante believe scheming is happening, so that the tiniest difference between train and test distributions will be exploited by the AI to execute a treacherous turn.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-28T05:20:10.849Z · LW(p) · GW(p)

The problem faced by evolution and SGD is more properly described as "in all scenarios they are likely to encounter *on the training distribution* / in the ancestral environment" and if you think that doesn't matter, and round it off to "situations they are likely to encounter," then you should say so explicitly and make it part of your argument. IIUC the standard opinion years ago was that insofar as the AI is operating in deployment on the same distribution as it had in training, then it won't suddenly do any big betrayals or treacherous turns, because e.g. from its perspective it can't even tell whether it is in training or not. (Related: Paul Christiano's stuff on low-stakes vs. high-stakes settings: "Low-stakes alignment" by Paul Christiano, AI Alignment (ai-alignment.com).)

Re your argument that it doesn't matter: Well (a) the train->deployment shift seems quite non-mild to me, at least in the future cases I'm concerned about, and your objection about 'it only matters if you ex ante believe scheming is happening' seems invalid to me. Compare: Suppose you were training a model to recognize tanks in a forest and your training dataset only had daytime photos of tanks and nighttime photos of non-tanks. I would quite reasonably be concerned that the model wouldn't generalize to real-world cases due to this, and instead would just learn to be a daylight-detector, and you could respond "this distinction (between training and deployment) only matters if you ex ante believe the daylight-detector policy is being learned." (b) yes it'll be continually trained but also humans are being continually evolved. There's a quantitative question here of how fast the training/evolution happens relative to the distribution shift, which I'd love to see someone try to model.
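The tank example can be made concrete with a deliberately tiny sketch. Everything here is an illustrative assumption: the two binary features (`BRIGHT`, `SHAPE`), the hand-built datasets, and the one-feature "stump" learner are hypothetical stand-ins, not a claim about how real image models behave. In training, brightness perfectly separates the classes while the true shape cue is only 80% reliable; in deployment, brightness is uninformative.

```python
BRIGHT, SHAPE = 0, 1  # feature indices: daylight cue vs. (noisy) tank-shape cue

def make_split(rows):
    """Expand (brightness, shape, label, count) rows into ((brightness, shape), label) examples."""
    return [((b, s), y) for b, s, y, n in rows for _ in range(n)]

def accuracy(data, feature):
    """Accuracy of the rule 'predict tank iff the chosen binary feature is 1'."""
    return sum(x[feature] == label for x, label in data) / len(data)

def fit_stump(data, num_features=2):
    """Pick the single feature with the highest training accuracy."""
    return max(range(num_features), key=lambda f: accuracy(data, f))

# Training set: every tank photo is daytime, every non-tank photo is nighttime.
train = make_split([
    (1, 1, 1, 8), (1, 0, 1, 2),   # tanks: bright, shape cue right 8/10 times
    (0, 0, 0, 8), (0, 1, 0, 2),   # non-tanks: dark, shape cue right 8/10 times
])

# Deployment set: time of day is independent of whether a tank is present.
deploy = make_split([
    (1, 1, 1, 4), (1, 0, 1, 1), (0, 1, 1, 4), (0, 0, 1, 1),   # tanks, day and night
    (1, 0, 0, 4), (1, 1, 0, 1), (0, 0, 0, 4), (0, 1, 0, 1),   # non-tanks, day and night
])

chosen = fit_stump(train)
print("chosen feature:", "brightness" if chosen == BRIGHT else "shape")
print("train accuracy:", accuracy(train, chosen))
print("deploy accuracy:", accuracy(deploy, chosen))
```

The learner picks brightness because it is the better predictor on the training set, and that choice only costs anything once the train and deployment distributions come apart. No prior belief that a daylight detector is being learned is needed to see the risk from the dataset alone.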

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T16:11:37.650Z · LW(p) · GW(p)

Could you be more specific? In what way will there be non-mild distribution shifts in the future?

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-29T06:48:18.660Z · LW(p) · GW(p)

Lots of ways? I mean, there are already lots of non-mild distribution shifts happening all the time, that's part of why our AIs don't always behave as intended. E.g. with the Gemini thing I doubt Google had included generating pictures of ethnically diverse nazis in the training distribution and given positive reinforcement for it.

But yeah the thing I'm more concerned about is that in the future our AI systems will be agentic, situationally aware, etc. and know quite a lot about their surroundings and training process etc. AND they'll be acting autonomously in the real world and probably also getting some sort of ongoing reinforcement/training periodically. Moreover things will be happening very fast & the AIs will be increasingly trusted with increasing autonomy and real-world power, e.g. trusted to do R&D autonomously on giant datacenters, coding and running novel experiments to design their successors. They'll (eventually) be smart enough to notice opportunities to do various sneaky things and get away with it -- and ultimately, opportunities to actually seize power with high probability of success. In such a situation not only will the "now I have an opportunity to seize power" distribution shift have happened, probably all sorts of other distribution shifts will have happened too, e.g. "I was trained in environments of type X, but then deployed into this server farm and given somewhat different task Y (e.g. thinking about alignment instead of about more mundane ML) and I've only had a small amount of training on Y, and then now thanks to breakthrough A that other copies of me just discovered, and outside geopolitical events B and C, my understanding of the situation I'm in and the opportunities available to me and the risks I (and humanity) face have changed significantly. Oh and also my understanding of various concepts like honesty and morality and so forth have also changed significantly due to the reflection various copies of me have done."

comment by Wei Dai (Wei_Dai) · 2024-02-28T00:48:12.276Z · LW(p) · GW(p)

In reality, the problem faced by evolution and by SGD is much easier than this: producing systems that behave the right way in all scenarios they are likely to encounter. In virtue of their aligned behavior, these systems will be “aimed at the right things” in every sense that matters in practice.

I find this passage remarkable, given that so many people are choosing to have few or no children that fertility has fallen to 0.78 in Korea and 1.0 in China. Presumably you're aware of these (or similar) facts and intended the meaning of this passage to be compatible with them, but I'm having trouble figuring out how...

By contrast, goal realism leads only to unfalsifiable speculation about an “inner actress” with utterly alien motivations.

In order for such speculation to be unfalsifiable, it seemingly has to be the case that we're unable to ever develop good enough interpretability tools to definitively say whether the AI in question has such internal motivations. This could well turn out to be true, but I don't understand how you're able to predict this now. (Or maybe you mean something else by "unfalsifiable" but I can't see what it could be. ETA: Maybe you mean "unfalsifiable with existing methods"?)

On the other hand, with your own proposed alignment method, we have to speculate about what scenarios an AI is likely to encounter. You could say that this is falsifiable (we just have to wait for the future to unfold), but is this actually an advantage?

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T00:57:03.197Z · LW(p) · GW(p)

The point of that section is that "goals" are not ontologically fundamental entities with precise contents, in fact they could not possibly be so given a naturalistic worldview. So you don't need to "target the inner search," you just need to get the system to act the way you want in all the relevant scenarios.

The modern world is not a relevant scenario for evolution. "Evolution" did not need to, was not "intending to," and could not have designed human brains so that they would do high inclusive genetic fitness stuff even when the environment changes dramatically and culture becomes completely different from the ancestral environment.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2024-02-28T01:41:54.391Z · LW(p) · GW(p)

So you don’t need to “target the inner search,” you just need to get the system to act the way you want in all the relevant scenarios.

Your original phrase was "all scenarios they are likely to encounter", but now you've switched to "relevant scenarios". Do you not acknowledge that these two phrases are semantically very different (or likely to be interpreted very differently by many readers), since the modern world is arguably a scenario that "they are likely to encounter" (given that they actually did encounter it) but you say "the modern world is not a relevant scenario for evolution"?

Going forward, do you prefer to talk about "all scenarios they are likely to encounter", or "relevant scenarios", or both? If the latter, please clarify what you mean by "relevant"? (And please answer with respect to both evolution and AI alignment, in case the answer is different in the two cases. I'll probably have more substantive things to say once we've cleared up the linguistic issues.)

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T02:31:35.051Z · LW(p) · GW(p)

No, I don't think they are semantically very different. This seems like nitpicking. Obviously "they are likely to encounter" has to have some sort of time horizon attached to it, otherwise it would include times well past the heat death of the universe, or something.

Replies from: Wei_Dai

↑ comment by Wei Dai (Wei_Dai) · 2024-02-28T03:04:58.577Z · LW(p) · GW(p)

It was not at all clear to me that you intended "they are likely to encounter" to have some sort of time horizon attached to it (as opposed to some other kind of restriction, or that you meant something pretty different from the literal meaning, or that your argument/idea itself was wrong), and it's still not clear to me what sort of time horizon you have in mind.

Replies from: david-johnston

↑ comment by David Johnston (david-johnston) · 2024-02-29T07:06:54.131Z · LW(p) · GW(p)

The AI system builders’ time horizon seems to be a reasonable starting point

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-28T04:31:22.576Z · LW(p) · GW(p)

It seems like there are “lots of ways” that a model could end up massively overfitting and still get high training performance.
So absent some additional story about why training won’t select an overfitter, it feels like the possibility should be getting substantive weight.

FWIW, once I learned more about the problem of induction, I realized that there do exist additional stories explaining why training won't select an overfitter. Or perhaps to put it differently, after I understood the problem of induction better it no longer seemed to me that there were lots of ways a model could massively overfit and still get high training performance. (That is, it seems to me there are many MORE ways it could not overfit)

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T04:40:54.884Z · LW(p) · GW(p)

It depends what you mean by a "way" the model can overfit.

Really we need to bring in measure theory to rigorously talk about this, and an early draft of this post actually did introduce some measure-theoretic concepts. Basically we need to define:

What set are we talking about,
What measure we're using over that set,
And how that measure relates to the probability measure over possible AIs.

The English locution "lots of ways to do X" can be formalized as "the measure of X-networks is high." And that's going to be an empirical claim that we can actually debate.
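As a toy illustration of why those three ingredients matter, here is a minimal sketch comparing two measures on the same small set. The setup is an assumption chosen for tractability and not anything from the post or its drafts: the set is the 16 boolean functions on two binary inputs, one measure is uniform counting over functions, and the other is the measure induced by drawing the three parameters of a single threshold unit from a standard Gaussian. All function names are hypothetical.

```python
import random
from itertools import product

INPUTS = list(product([0, 1], repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def threshold_unit(w1, w2, b):
    """Truth table (a 4-tuple of 0/1) computed by one threshold neuron."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)

def uniform_function_measure(property_fn):
    """Counting measure: each of the 16 truth tables gets weight 1/16."""
    tables = list(product([0, 1], repeat=len(INPUTS)))
    return sum(property_fn(t) for t in tables) / len(tables)

def parameter_induced_measure(property_fn, num_samples, rng):
    """Monte Carlo estimate of the measure pushed forward from Gaussian parameters."""
    hits = 0
    for _ in range(num_samples):
        w1, w2, b = (rng.gauss(0.0, 1.0) for _ in range(3))
        hits += property_fn(threshold_unit(w1, w2, b))
    return hits / num_samples

def is_constant(table):
    return len(set(table)) == 1

def is_xor(table):
    return table == (0, 1, 1, 0)

if __name__ == "__main__":
    rng = random.Random(0)
    for name, prop in [("constant", is_constant), ("xor", is_xor)]:
        print(name,
              "uniform over functions:", uniform_function_measure(prop),
              "induced by parameters:", parameter_induced_measure(prop, 5000, rng))
```

Under the counting measure, constant functions have measure 2/16 and XOR has 1/16. Under the parameter-induced measure, constant functions are far more common than that, and XOR has measure zero because a single threshold unit cannot represent it. "Lots of ways to do X" therefore gets different answers depending on which measure is meant, which is the sense in which it becomes an empirical claim about the measure relevant to trained networks.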

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-28T05:02:44.383Z · LW(p) · GW(p)

I think I mean the same thing you do? "The measure of X-networks is high."

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T16:09:08.289Z · LW(p) · GW(p)

With respect to which measure though? You have to define a measure; there are going to be infinitely many possible measures you could define on this space. And then we'll have to debate if your measure is a good one.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-29T06:23:51.871Z · LW(p) · GW(p)

The actual measure that nature uses to determine the model weights at the end of training -- taking into account the random initialization and also the training process. I'm talking about the (not-yet-fully-understood) inductive biases of neural networks in practice.

Replies from: daniel-kokotajlo

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-29T16:25:37.800Z · LW(p) · GW(p)

Added clarification: When I said "once I understood the problem of induction better" I was referring specifically to the insight evhub attempts to convey with his example about infinite bitstrings. Simpler circuits, policies, goals, strategies, whatever can be instantiated in more ways than all their complex alternatives combined.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-29T16:44:19.360Z · LW(p) · GW(p)

I think the infinite bitstring case has zero relevance to deep learning.

There does exist a concept you might call "simplicity" which is relevant to deep learning. The neural network Gaussian process describes the prior distribution over functions which is induced by the initialization distribution over neural net parameters. Under weak assumptions about the activation function and initialization variance, the NNGP is biased toward lower frequency functions. I think this cuts against scheming, and we plan to write up a post on this in the next month or two.
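A crude way to get a feel for the low-frequency claim is to sample finite-width networks at initialization and count how often their output changes sign along a 1-D grid, compared with sign patterns drawn uniformly at random. This is only a sketch under assumptions: a one-hidden-layer tanh network with standard Gaussian weights and biases, width 64, and sign changes as a stand-in for "frequency". It is a finite-width proxy, not the NNGP itself, and all names are hypothetical.

```python
import math
import random

def sample_network(width, rng):
    """Random one-hidden-layer tanh net f(x) = sum_j v_j * tanh(w_j * x + b_j)."""
    scale = 1.0 / math.sqrt(width)  # keeps the output variance O(1) as width grows
    params = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0), rng.gauss(0.0, scale))
              for _ in range(width)]
    return lambda x: sum(v * math.tanh(w * x + b) for w, b, v in params)

def sign_changes(values):
    """Number of adjacent grid points between which the sign flips."""
    signs = [v >= 0.0 for v in values]
    return sum(a != b for a, b in zip(signs, signs[1:]))

def mean_sign_changes_at_init(num_networks, width, grid, rng):
    total = 0
    for _ in range(num_networks):
        f = sample_network(width, rng)
        total += sign_changes([f(x) for x in grid])
    return total / num_networks

def mean_sign_changes_uniform(num_functions, grid_size, rng):
    """Baseline: sign patterns drawn uniformly from all 2**grid_size labelings."""
    total = 0
    for _ in range(num_functions):
        total += sign_changes([rng.choice([-1.0, 1.0]) for _ in range(grid_size)])
    return total / num_functions

GRID = [-1.0 + 2.0 * i / 63 for i in range(64)]  # 64 evenly spaced points in [-1, 1]

if __name__ == "__main__":
    rng = random.Random(0)
    print("random nets at init:", mean_sign_changes_at_init(20, 64, GRID, rng))
    print("uniform labelings:  ", mean_sign_changes_uniform(20, len(GRID), rng))
```

Uniformly random labelings flip sign at roughly half of the 63 adjacent pairs, while functions sampled from the network prior flip far less often. That is the qualitative sense of "biased toward lower frequency functions" here; whether and how it bears on scheming is a separate argument.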

Replies from: evhub, daniel-kokotajlo

↑ comment by evhub · 2024-03-01T00:17:49.850Z · LW(p) · GW(p)

I think the infinite bitstring case has zero relevance to deep learning.

I think you are still not really understanding my objection. It's not that there is a "finite bitstring case" and an "infinite bitstring case". My objection is that the sort of finite bitstring analysis that you use does not yield any well-defined mathematical object that you could call a prior, and certainly not one that would predict generalization.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T00:43:33.677Z · LW(p) · GW(p)

I never used any kind of bitstring analysis.

Replies from: evhub

↑ comment by evhub · 2024-03-04T00:55:58.976Z · LW(p) · GW(p)

Yes, that's exactly the problem: you tried to make a counting argument, but because you didn't engage with the proper formalism, you ended up using reasoning that doesn't actually correspond to any well-defined mathematical object.

Analogously, it's like you wrote an essay about why 0.999... != 1 and your response to "under the formalism of real numbers as Dedekind cuts, those are identical" was "where did I say I was referring to Dedekind cuts?" It's fine if you don't want to use the standard formalism, but you need some formalism to anchor your words to, otherwise you're just pushing around words with no real way to ensure that your words actually correspond to something. I think the 0.999... != 1 analogy is quite apt here, because the problem really is that there is no formalism under which 0.999... != 1 that looks anything like the real numbers that you know, in the same way that there really is no formalism under which the sort of reasoning that you're using is meaningful.
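For readers who want the identity in the analogy spelled out, here is the standard derivation within the real numbers, reading the decimal expansion as the limit of its partial sums (a geometric series). This is textbook material added for reference, not part of the original comment:

```latex
\[
0.999\ldots \;=\; \sum_{k=1}^{\infty} \frac{9}{10^{k}}
\;=\; \frac{9/10}{1 - 1/10}
\;=\; 1 .
\]
```

Any formalism in which the left-hand side differs from 1 has to give up some familiar property of the reals, which is the point the analogy relies on.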

Replies from: TurnTrout, nora-belrose

↑ comment by TurnTrout · 2024-03-05T01:26:00.625Z · LW(p) · GW(p)

Yes, that's exactly the problem: you tried to make a counting argument, but because you didn't engage with the proper formalism, you ended up using reasoning that doesn't actually correspond to any well-defined mathematical object.
Analogously, it's like you wrote an essay about why 0.999... != 1 and your response to "under the formalism of real numbers as Dedekind cuts, those are identical" was "where did I say I was referring to Dedekind cuts?"

No. I think you are wrong. This passage makes me suspect that you didn't understand the arguments Nora was trying to make. Her arguments are easily formalizable as critiquing an indifference principle over functions in function-space, as opposed to over parameterizations in parameter-space. I'll write this out for you if you really want me to.

I think you should be more cautious about unilaterally diagnosing Nora's "errors", as opposed to asking for clarification, because I think you two agree a lot more than you realize.

Replies from: evhub

↑ comment by evhub · 2024-03-05T01:31:29.098Z · LW(p) · GW(p)

I agree that there is a valid argument that critiques counting arguments over function space that sort of has the same shape as the one presented in this post. If that was what the authors had in mind, it was not what I got from reading the post, and I haven't seen anyone making that clarification other than yourself.

Regardless, though, I think that's still not a great objection to counting arguments for deceptive alignment in general, because it's explicitly responding only to a very weak and obviously wrong form of a counting argument. My response there is just that of course you shouldn't run a counting argument over function space—I would never suggest that.

Replies from: TurnTrout

↑ comment by TurnTrout · 2024-03-05T01:49:38.293Z · LW(p) · GW(p)

I think you should have asked for clarification before making blistering critiques about how Nora "ended up using reasoning that doesn't actually correspond to any well-defined mathematical object." I think your comments paint a highly uncharitable and (more importantly) incorrect view of N/Q's claims.

My response there is just that of course you shouldn't run a counting argument over function space—I would never suggest that.

Your presentations often include a counting argument over a function space, in the form of "saints" versus "schemers" and "sycophants." So it seems to me that you do suggest that. What am I missing?

I also welcome links to counting arguments which you consider stronger. I know you said you haven't written one up yet to your satisfaction, but surely there have to be some non-obviously wrong and weak arguments written up, right?

Replies from: evhub

↑ comment by evhub · 2024-03-05T02:00:09.857Z · LW(p) · GW(p)

I think you should have asked for clarification before making blistering critiques about how Nora "ended up using reasoning that doesn't actually correspond to any well-defined mathematical object." I think your comments paint a highly uncharitable and (more importantly) incorrect view of N/Q's claims.

I'm happy to apologize if I misinterpreted anyone, but afaict my critique remains valid. My criticism is precisely that counting arguments over function space aren't generally well-defined, and even if they were they wouldn't be the right way to run a counting argument. So my criticism that the original post misunderstands how to properly run a counting argument still seems correct to me. Perhaps you could say that it's not the authors' fault, that they were responding to weak arguments that other people were actually making, but regardless the point remains that the authors haven't engaged with the sort of counting arguments that I actually think are valid.

Your presentations often include a counting argument over a function space, in the form of "saints" versus "schemers" and "sycophants." So it seems to me that you do suggest that. What am I missing?

What makes you think that's intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings, as I noted in my comment (though there are many other valid presentations). It's possible I said something in that talk that gave a misleading impression there, but I certainly don't believe and have never believed in any counting arguments over function space.

Replies from: nora-belrose, TurnTrout

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T02:30:47.177Z · LW(p) · GW(p)

What makes you think that's intended to be a counting argument over function space? I usually think of this as a counting argument over infinite bitstrings

I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.

The bitstring version of the argument, to the extent I can understand it, just seems even worse to me. You're making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent. The same goes for the circuit prior thing (although FWIW I think you're very likely wrong that minimal circuits can be deceptive).

Replies from: ryan_greenblatt, evhub

↑ comment by ryan_greenblatt · 2024-03-05T03:09:03.960Z · LW(p) · GW(p)

I definitely thought you were making a counting argument over function space

I've argued multiple times that Evan was not intending to make a counting argument in function space:

In discussion with Alex Turner (TurnTrout) when commenting on an earlier draft of this post.
In discussion with Quintin after sharing some comments on the draft. (Also shared with you TBC.)
In this earlier comment [LW(p) · GW(p)].

(Fair enough if you never read any of these comments.)

As I've noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe's report) which rules out the person intending the argument to be about function space. (E.g., people say things like "bits" and "complexity in terms of the world model".)

(I also think these written up arguments (Evan's talk in particular) are very hand wavy, and just provide a vague intuition. So regardless of what he was intending, the actual words of the argument aren't very solid IMO. Further, using words that rule out the intention of function space doesn't necessarily imply there is an actually good model behind these words. To actually get anywhere with this reasoning, I think you'd have to reinvent the full argument and think through it in more detail yourself. I also think Evan is substantially wrong in practice though my current guess is that he isn't too far off about the bottom line (maybe a factor of 3 off). I think Joe's report is much better in that it's very clear what level of abstraction and rigor it's talking about. From reading this post, it doesn't seem like you came into this project from the perspective of "is there an interesting recoverable intuition here, can we recover or generate a good argument" which would have been considerably better IMO.)

AFAICT Joe also thought this in his report

I think Joe was just operating from a much vaguer counting argument perspective based on my conversations with him about the report and his comments here. As in, he was just talking about the broadly construed counting-argument which can be applied to a wide range of possible inductive biases. As in, for any specific formal model of the situation, a counting-style argument will be somewhat applicable. (Though in practice, we might be able to have much more specific intuitions.)

Note that Joe and Evan have a very different perspective on the case for scheming.

(From my perspective, the correct intuition underlying the counting argument is something like "you only need to compute something which nearly exactly correlates with predicted reward once while you'll need to compute many long range predictions to perform well in training". See this comment [LW(p) · GW(p)] for a more detailed discussion.)

Replies from: TurnTrout, nora-belrose

↑ comment by TurnTrout · 2024-03-11T22:54:48.602Z · LW(p) · GW(p)

As I've noted in all of these comments, people consistently use terminology when making counting style arguments (except perhaps in Joe's report) which rules out the person intending the argument to be about function space. (E.g., people say things like "bits" and "complexity in terms of the world model".)

Aren't these arguments about simplicity, not counting?

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T03:13:41.643Z · LW(p) · GW(p)

Fair enough if you never read any of these comments.

Yeah, I never saw any of those comments. I think it's obvious that the most natural reading of the counting argument is that it's an argument over function space (specifically, over equivalence classes of functions which correspond to "goals.") And I also think counting arguments for scheming over parameter space, or over Turing machines, or circuits, or whatever, are all much weaker. So from my perspective I'm attacking a steelman rather than a strawman.

↑ comment by evhub · 2024-03-05T02:39:50.027Z · LW(p) · GW(p)

I definitely thought you were making a counting argument over function space, and AFAICT Joe also thought this in his report.

Sorry about that—I wish you had been at the talk and could have asked a question about this.

You're making an argument about one type of learning procedure, Solomonoff induction, which is physically unrealizable and AFAICT has not even inspired any serious real-world approximations, and then assuming that somehow the conclusions will transfer over to a mechanistically very different learning procedure, gradient descent.

I agree that Solomonoff induction is obviously wrong in many ways, which is why you want to substitute it out for whatever the prior is that you think is closest to deep learning that you can still reason about theoretically. But that should never lead you to do a counting argument over function space, since that is never a sound thing to do.

Replies from: TurnTrout, nora-belrose

↑ comment by TurnTrout · 2024-03-05T06:41:32.311Z · LW(p) · GW(p)

But that should never lead you to do a counting argument over function space, since that is never a sound thing to do.

Do you agree that "instrumental convergence -> meaningful evidence for doom" is also unsound, because it's a counting argument that most functions of shape Y have undesirable property X?

Replies from: evhub

↑ comment by evhub · 2024-03-05T06:48:53.867Z · LW(p) · GW(p)

I think instrumental convergence does provide meaningful evidence of doom, and you can make a valid counting argument for it, but as with deceptive alignment you have to run the counting argument over algorithms not over functions.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T07:10:42.840Z · LW(p) · GW(p)

It's not clear to me what an "algorithm" is supposed to be here, and I suspect that this might be cruxy. In particular I suspect (40-50% confidence) that:

You think there are objective and determinate facts about what "algorithm" a neural net is implementing, where
Algorithms are supposed to be something like a Boolean circuit or a Turing machine rather than a neural network, and
We can run counting arguments over these objective algorithms, which are distinct both from the neural net itself and the function it expresses.

I reject all three of these premises, but I would consider it progress if I got confirmation that you in fact believe in them.

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T02:47:48.288Z · LW(p) · GW(p)

So today we've learned that:

The real counting argument that Evan believes in is just a repackaging of Paul's argument for the malignity of the Solomonoff prior, and not anything novel.
Evan admits that Solomonoff is a very poor guide to neural network inductive biases.

At this point, I'm not sure why you're privileging the hypothesis of scheming at all.

you want to substitute it out for whatever the prior is that you think is closest to deep learning that you can still reason about theoretically.

I mean, the neural network Gaussian process is literally this, and you can make it more realistic by using the neural tangent kernel to simulate training dynamics, perhaps with some finite width corrections. There is real literature on this.
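For concreteness, here is a minimal sketch of the kind of prior being referred to: the NNGP covariance between two inputs for a fully connected ReLU network in the infinite-width limit, computed with the standard arc-cosine recursion. The function name, the default variances, and the choice of ReLU are illustrative assumptions, not anything specified in this thread. It only computes the prior covariance and is not itself a counting argument. It also assumes non-zero inputs, so the normalisation never divides by zero.

```python
import math

def relu_nngp_kernel(x, y, depth, sigma_w2=2.0, sigma_b2=0.0):
    """NNGP covariance between inputs x and y for a fully connected ReLU
    network with `depth` hidden layers, in the infinite-width limit.
    sigma_w2 / sigma_b2 are the (assumed) weight and bias variances."""
    d = len(x)

    def dot(a, b):
        return sum(ai * bi for ai, bi in zip(a, b))

    # Covariances of the first-layer pre-activations.
    kxx = sigma_b2 + sigma_w2 * dot(x, x) / d
    kyy = sigma_b2 + sigma_w2 * dot(y, y) / d
    kxy = sigma_b2 + sigma_w2 * dot(x, y) / d

    for _ in range(depth):
        norm = math.sqrt(kxx * kyy)
        # Clamp to guard against floating-point drift outside [-1, 1].
        cos_theta = max(-1.0, min(1.0, kxy / norm))
        theta = math.acos(cos_theta)
        # Arc-cosine kernel: expectation of relu(u) * relu(v) under the GP.
        kxy = sigma_b2 + sigma_w2 / (2 * math.pi) * norm * (
            math.sin(theta) + (math.pi - theta) * cos_theta
        )
        # Diagonal terms: E[relu(u)^2] = K(x, x) / 2.
        kxx = sigma_b2 + sigma_w2 * kxx / 2
        kyy = sigma_b2 + sigma_w2 * kyy / 2
    return kxy

same = relu_nngp_kernel([1.0, 0.0], [1.0, 0.0], depth=3)
orthogonal = relu_nngp_kernel([1.0, 0.0], [0.0, 1.0], depth=1)
print(same, orthogonal)
```

Connecting a kernel like this to high-level properties such as scheming is a separate and much harder step.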

Replies from: evhub

↑ comment by evhub · 2024-03-05T02:56:09.215Z · LW(p) · GW(p)

The real counting argument that Evan believes in is just a repackaging of Paul's argument for the malignity of the Solomonoff prior, and not anything novel.

I'm going to stop responding to you now, because it seems that you are just not reading anything that I am saying. For the last time, my criticism has absolutely nothing to do with Solomonoff induction in particular, as I have now tried to explain to you here [LW(p) · GW(p)] and here [LW(p) · GW(p)] and here [LW(p) · GW(p)] etc.

I mean, the neural network Gaussian process is literally this, and you can make it more realistic by using the neural tangent kernel to simulate training dynamics, perhaps with some finite width corrections. There is real literature on this.

Yes—that's exactly the sort of counting argument that I like! Though note that it can be very hard to reason properly about counting arguments once you're using a prior like that; it gets quite tricky to connect those sorts of low-level properties to high-level properties about stuff like deception.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T03:07:00.376Z · LW(p) · GW(p)

I've read every word of all of your comments.

I know that you think your criticism isn't dependent on Solomonoff induction in particular, because you also claim that a counting argument goes through under circuit prior. It still seems like you view the Solomonoff case as the central one, because you keep talking about "bitstrings." And I've repeatedly said that I don't think the circuit prior works either, and why I think that.

At no point in this discussion have you provided any reason for thinking that in fact, the Solomonoff prior and/or circuit prior do provide non-negligible evidence about neural network inductive biases, despite the very obvious mechanistic disanalogies.

Yes—that's exactly the sort of counting argument that I like!

Then make an NNGP counting argument! I have not seen such an argument anywhere. You seem to be alluding to unpublished, or at least little-known, arguments that did not make their way into Joe's scheming report.

↑ comment by TurnTrout · 2024-03-05T02:17:50.620Z · LW(p) · GW(p)

afaict my critique remains valid. My criticism is precisely that counting arguments over function space aren't generally well-defined, and even if they were they wouldn't be the right way to run a counting argument.

Going back through the post, Nora+Quintin indeed made a specific and perfectly formalizable claim here:

These results strongly suggest that SGD is not doing anything like sampling uniformly at random from the set of representable functions that do well on the training set.

They're making a perfectly valid point. The point was in the original post AFAICT -- it wasn't just only now explained by me. I agree that they could have presented it more clearly, but that's a way different critique than you're "using reasoning that doesn't actually correspond to any well-defined mathematical object."

regardless the point remains that the authors haven't engaged with the sort of counting arguments that I actually think are valid.

If that's truly your remaining objection, then I think that you should retract the unmerited criticisms about how they're trying to prove 0.9999... != 1 or whatever. In my opinion, you have confidently misrepresented their arguments, and the discussion would benefit from your revisions.

And then it'd be nice if someone would provide links to the supposed valid counting arguments! From my perspective, it's very frustrating to hear that there (apparently) are valid counting arguments but also they aren't the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren't linkable.)

If that's truly the state of the evidence, then I'm happy to just conclude that Nora+Quintin are right, and update if/when actually valid arguments come along.

Replies from: ryan_greenblatt, Algon, ryan_greenblatt

↑ comment by ryan_greenblatt · 2024-03-05T03:39:23.496Z · LW(p) · GW(p)

If that's truly your remaining objection, then I think that you should retract the unmerited criticisms about how they're trying to prove 0.9999... != 1 or whatever. In my opinion, you have confidently misrepresented their arguments, and the discussion would benefit from your revisions.

This point seems right to me: if the post is specifically about representable functions then that is a valid formalization AFAICT. (Though an extremely cursed formalization for reasons mentioned in a variety of places. And if you dropped "representable", then it's extremely, extremely cursed for various analysis related reasons, though I think there is still a theoretically sound uniform measure maybe???)

It would also be nice if the original post:

Clarified that the rebuttal is specifically about a version of the counting-argument which counts functions.
Noted that people making counting arguments weren't intending to count functions, though this might be a common misconception about counting arguments. (Seems fine to also clarify that existing counting arguments are too hand wavy to really engage with if that's the view also.) (See also here [LW(p) · GW(p)].)

↑ comment by Algon · 2024-03-05T13:13:07.371Z · LW(p) · GW(p)

And then it'd be nice if someone would provide links to the supposed valid counting arguments! From my perspective, it's very frustrating to hear that there (apparently) are valid counting arguments but also they aren't the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren't linkable.)

Isn't Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations?

But looking at a bunch of other LW posts, like Carlsmith's report [? · GW], a dialogue [LW · GW] between Ronny Fernandez and Nate^[1], Mark Xu [LW(p) · GW(p)] talking about malignity of Solomonoff induction, Paul Christiano talking about NN priors [LW · GW], Evhub's post [LW · GW] on how likely is deceptive alignment, etc.^[2], I have concluded that:

A bunch of LW talk about NN scheming relies on inductive biases of neural nets, or of other learning algorithms.
The arguments individual people make for scheming, including those that may fit the name "counting arguments", seem to differ greatly. Which is basically the norm in alignment.

Like, Joe Carlsmith lists out a bunch of arguments for scheming regarding simplicity biases, including parameter counts, and thinks that they're weak in various ways and his "intuitive" counting argument is stronger. Ronny and Nate discuss parameter-count mappings and seem to have pretty different views on how much scheming relies on that. Mark Xu claimed, AFAICT, like 3 years ago that PC's arguments about NN biases rely on the Solomonoff prior being malign, which may support Nora's claim. I am unsure if Paul Christiano's arguments for scheming routed through parameter-function mappings. I also have vague memories of Johnswentworth talking about the parameter-counting argument in a YouTube video years ago in a way that suggested he supported it, but I can't find the video.

I think alignment has historically had poor feedback loops, though IMO they've improved somewhat in the last few years, and this conceals people's wildly different models and ontologies that make it very hard to notice when people are completely misinterpreting one another. You can have people like Yudkowsky and Hanson who have engaged for hundreds of hours, or maybe more, and still don't seem to grok the other's models. I'd bet that this is much more common than people think.

In fact, I think this whole discussion is an example of this.

^[1]
This was quite recent, so Ronny talking about the shift in the counting argument he was using may well be due to discussions with Quintin, who he was engaging with sometime before the dialogue.
^[2]
I think this Q/A pair at the bottom provides evidence that Evan has been using the parameter-function map framing for quite a while:
Question: When you say model space, you mean the functional behavior as opposed to the literal parameter space?
So there’s not quite a one to one mapping because there are multiple implementations of the exact same function in a network. But it's pretty close. I mean, most of the time when I'm saying model space, I'm talking either about the weight space or about the function space where I'm interpreting the function over all inputs, not just the training data.
Though it is also possible that he's been implicitly lumping the parameter-function map stuff together with the function-space stuff that Nora and Quintin were critiquing.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-05T20:26:27.092Z · LW(p) · GW(p)

Isn't Evan giving you what he thinks is a valid counting argument i.e. a counting argument over parameterizations?

Where is the argument? If you run the counting argument in function space, it's at least clear why you might think there are "more" schemers than saints. But if you're going to say there are "more" params that correspond to scheming than there are saint-params, that looks like a substantive empirical claim that could easily turn out to be false.
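To make the distinction between counting parameters and counting functions concrete, here is a toy sketch that samples random parameters for a single threshold unit on 2-bit inputs and tallies which Boolean function each setting implements. Everything here (the unit, the Gaussian sampling, the names `function_of` and `sample_parameter_function_map`) is a hypothetical illustration and says nothing about real networks or about scheming. It only shows that the share of parameter space mapping to a function is something you have to measure, and that it need not match a uniform count over functions.

```python
import random
from collections import Counter
from itertools import product

INPUTS = list(product((0, 1), repeat=2))  # (0,0), (0,1), (1,0), (1,1)

def function_of(w1, w2, b):
    """Truth table computed by one threshold unit with these parameters."""
    return tuple(int(w1 * x1 + w2 * x2 + b > 0) for x1, x2 in INPUTS)

def sample_parameter_function_map(n_samples, seed=0):
    """Draw Gaussian parameters and count how often each function appears."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(n_samples):
        w1, w2, b = (rng.gauss(0.0, 1.0) for _ in range(3))
        counts[function_of(w1, w2, b)] += 1
    return counts

counts = sample_parameter_function_map(20_000)
for table, n in counts.most_common():
    print(table, n)
```

Of the 16 Boolean functions on two bits, XOR and XNOR receive no parameter mass at all here, because a single threshold unit cannot represent them. So even in this tiny case, a uniform count over functions and a count over parameters give different answers.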

↑ comment by ryan_greenblatt · 2024-03-05T03:27:47.133Z · LW(p) · GW(p)

From my perspective, it's very frustrating to hear that there (apparently) are valid counting arguments but also they aren't the obvious well-known ones that everyone seems to talk about. (But also the real arguments aren't linkable.)

Personally, I don't think there are "solid" counting arguments, but I think you can think through a bunch more cases and feel like the underlying intuition is at least somewhat reasonable.

Overall, I'm a simple man, I still like Joe's report : ). Fair enough if you don't find the arguments in here convincing. I think Joe's report is pretty close to the SOTA with open mindedness and a bit of reinvention work to fill in various gaps.

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T01:02:21.410Z · LW(p) · GW(p)

I obviously don't think the counting argument for overfitting is actually sound, that's the whole point. But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not moreso.

I deny that your Kolmogorov framework is anything like "the proper formalism" for neural networks. I also deny that the counting argument for overfitting is appropriately characterized as a "finite bitstring" argument, because that suggests I'm talking about Turing machine programs of finite length, which I'm not- I'm directly enumerating functions over a subset of the natural numbers. Are you saying the set of functions over 1...10,000 is not a well defined mathematical object?
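A scaled-down sketch of what "directly enumerating functions" looks like: the domain 1...10,000 is far too large to enumerate, so this toy uses 8 points with binary labels. The target function (parity) and the train/held-out split are arbitrary assumptions chosen for illustration.

```python
from itertools import product

def count_consistent(domain_size, train_idx, target):
    """Enumerate every binary function on range(domain_size) and tally
    how many fit the training labels, and how many of those also match
    the target on every point (including the held-out ones)."""
    train_idx = list(train_idx)
    fits_train = 0
    generalizes = 0
    for labels in product((0, 1), repeat=domain_size):
        if all(labels[i] == target(i) for i in train_idx):
            fits_train += 1
            if all(labels[i] == target(i) for i in range(domain_size)):
                generalizes += 1
    return fits_train, generalizes

# 8 points, the first 5 used for training, parity as the "true" function.
fits, gen = count_consistent(domain_size=8, train_idx=range(5),
                             target=lambda x: x % 2)
print(fits, gen)  # 8 1
```

With n training points out of N, exactly 2^(N - n) functions fit the training set and only one of them matches the target everywhere. That is the "most functions overfit" count that the parody argument runs on.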

Replies from: evhub

↑ comment by evhub · 2024-03-04T01:14:35.143Z · LW(p) · GW(p)

I obviously don't think the counting argument for overfitting is actually sound, that's the whole point.

Yes, I'm well aware. The problem is that when you make the counting argument for overfitting, you do so in a way that seriously misuses the formalism, which is why the argument fails. So you can't draw any lessons about counting arguments for deception from the failure of your counting argument for overfitting.

But I think the counting argument for scheming is just as obviously invalid, and misuses formalisms just as egregiously, if not moreso.

Then show me how! If you think there are errors in the math, please point them out.

Of course, it's worth stating that I certainly don't have some sort of airtight mathematical argument proving that deception is likely in neural networks—there are lots of assumptions there that could very well be wrong. But I do think that the basic style of reasoning employed by such arguments is sound.

I deny that your Kolmogorov framework is anything like "the proper formalism" for neural networks.

Err... I'm using K-complexity here because it's a simple framework to reason about, but my criticism isn't "you should use K-complexity to reason about neural networks." I think K-complexity captures some important facts about neural network generalization, but is clearly egregiously wrong in other areas. But there are lots of other formalisms! My criticism isn't that you should use K-complexity, it's that you should use any formalism at all.

The basic criticism is that the reasoning you use in the post doesn't correspond to any formalism at all; it's self-contradictory and inconsistent. So by all means you should replace K-complexity with something better (that's what I usually try to do as well) but you still need to be reasoning in a way that's mathematically consistent.

I also deny that the counting argument for overfitting is appropriately characterized as a "finite bitstring" argument, because that suggests I'm talking about Turing machine programs of finite length, which I'm not- I'm directly enumerating functions over a subset of the natural numbers.

One person's modus ponens is another's modus tollens. If you say you have a formalism, and that formalism predicts overfitting rather than generalization, then my first objection to your formalism is that it's clearly a bad formalism for understanding neural networks in practice. Maybe the most basic thing that any good formalism here should get right is that it should predict generalization; if your formalism doesn't, then it's clearly not a good formalism.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T02:12:42.781Z · LW(p) · GW(p)

Then show me how! If you think there are errors in the math, please point them out.

I'm not aware of any actual math behind the counting argument for scheming. I've only ever seen handwavy informal arguments about the number of Christs vs Martin Luthers vs Blaise Pascals. There certainly was no formal argument presented in Joe's extensive scheming report, which I assumed would be sufficient context for writing this essay.

Replies from: evhub

↑ comment by evhub · 2024-03-04T02:27:34.779Z · LW(p) · GW(p)

Well, I presented a very simple formulation in my comment [LW(p) · GW(p)], so that could be a reasonable starting point.

But I agree that unfortunately there hasn't been that much good formal analysis here that's been written up. At least on my end, that's for two reasons:

Most of the formal analysis of this form that I've published (e.g. this [LW · GW] and this [LW · GW]) has been focused on sycophancy (human imitator vs. direct translator) rather than deceptive alignment, as sycophancy is a substantially more tractable problem. Finding a prior that reasonably rules out deceptive alignment seems quite out of reach to me currently; at one point I thought a circuit prior might do it, but I now think that circuit priors don't get rid of deceptive alignment [LW · GW].
I'm currently more optimistic about empirical evidence rather than theoretical evidence for resolving this question, which is why I've been focusing on projects such as Sleeper Agents [LW · GW].

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T04:15:24.544Z · LW(p) · GW(p)

Right, and I've explained why I don't think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable. There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I'm pretty baffled at why you don't pay more attention to that stuff.

Replies from: evhub

↑ comment by evhub · 2024-03-04T04:58:20.725Z · LW(p) · GW(p)

Right, and I've explained why I don't think any of those analyses are relevant to neural networks. Deep learning simply does not search over Turing machines or circuits of varying lengths. It searches over parameters of an arithmetic circuit of fixed structure, size, and runtime. So Solomonoff induction, speed priors, and circuit priors are all inapplicable.

It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that's usually what I do when I run this sort of analysis. I feel like you still aren't understanding the key criticism here—it's really not about Solomonoff induction—and I'm not sure how to explain that in any way other than how I've already done so.

There has been a lot of work in the mainstream science of deep learning literature on the generalization behavior of actual neural nets, and I'm pretty baffled at why you don't pay more attention to that stuff.

I'm going to assume you just aren't very familiar with my writing, because working through empirical evidence about neural network inductive biases is something [LW · GW] I love [LW · GW] to do [LW · GW] all the [LW · GW] time [LW · GW].

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-03-04T11:29:31.870Z · LW(p) · GW(p)

It is trivially easy to modify the formalism to search only over fixed-size algorithms, and in fact that's usually what I do when I run this sort of analysis.

What? Which formalism? I don't see how this is true at all. Please elaborate or send an example of "modifying" Solomonoff so that all the programs have fixed length, or "modifying" the circuit prior so all circuits are the same size.

No, I'm pretty familiar with your writing. I still don't think you're focusing on mainstream ML literature enough because you're still putting nonzero weight on these other irrelevant formalisms. Taking that literature seriously would mean ceasing to take the Solomonoff or circuit prior literature seriously.

↑ comment by Daniel Kokotajlo (daniel-kokotajlo) · 2024-02-29T17:16:03.550Z · LW(p) · GW(p)

Zero relevance? I'm not saying any infinite bitstrings actually exist in deep learning. I'm saying that my intuitions about how deep learning measure works DON'T say that there are many more ways to overfit than generalize, and people whose intuitions say otherwise are probably confused, and they'd be less confused if they understood the example/analogy given by the infinite bitstring case.

comment by Charlie Steiner · 2024-02-28T05:47:18.569Z · LW(p) · GW(p)

Replies from: Charlie Steiner, nora-belrose

↑ comment by Charlie Steiner · 2024-02-28T05:56:17.895Z · LW(p) · GW(p)

I feel like there's a somewhat common argument about RL not being all that dangerous because it generalizes the training distribution cautiously - being outside the training distribution isn't going to suddenly cause an RL system to make multi-step plans that are implied but never seen in the training distribution, it'll probably just fall back on familiar, safe behavior.

To me, these arguments feel like they treat present-day model-free RL as the "central case," and model-based RL as a small correction.

Anyhow, good post, I like most of the arguments, I just felt my reaction to this particular one could be made in meme format.

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T07:36:51.727Z · LW(p) · GW(p)

I just deny that they will update "arbitrarily" far from the prior, and I don't know why you would think otherwise. There are compute tradeoffs and you're going to run only as many MCTS rollouts as you need to get good performance.

Replies from: gwern, beren, Charlie Steiner

↑ comment by gwern · 2024-02-28T16:02:53.178Z · LW(p) · GW(p)

There are compute tradeoffs and you're going to run only as many MCTS rollouts as you need to get good performance.

I completely agree. Smart agents will run only as many MCTS rollouts as they need to get good performance, no more - and no less. (And the smarter they are, and so the more compute they have access to, the more MCTS rollouts they are able to run, and the more they can change the default reactive policy.)

But 'good performance' on what, exactly? Maximizing utility. That's what a model-based RL agent (not a simple-minded, unintelligent, myopic model-free policy like a frog's retina) does.

If the Value of Information remains high from doing more MCTS rollouts, then an intelligent agent will keep doing rollouts for as long as the additional planning continues to pay its way in expected improvements. The point of doing planning is policy/value improvement. The more planning you do, the more you can change the original policy. (This is how you train AlphaZero so far from its prior policy, of a randomly-initialized CNN playing random moves, to its final planning-improved policy, a superhuman Go player.) Which may take it arbitrarily far in terms of policy - like, for example, if it discovers a Move 37 where there is even a small <1/1000 probability that a highly-unusual action will pay off better than the default reactive policy and so the possibility is worth examining in greater depth...

(The extreme reductio here would be a pure MCTS with random playouts: it has no policy at all at the beginning, and yet, MCTS is a complete algorithm, so it converges to the optimal policy, no matter what that is, given enough rollouts. More rollouts = more update away from the prior. Or if you don't like that, good old policy/value iteration on a finite MDP is an example: start with random parameters and the more iterations they can do, the further they provably monotonically travel from the original random initialization to the optimal policy.)
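
To make the policy-iteration point concrete, here is a minimal sketch on a made-up finite MDP. The 3-state transition table, rewards, and discount are illustrative assumptions, not taken from any source. It starts from an arbitrary "always stay" policy and records each policy visited, so one can check that the values never decrease and that the final policy differs from the starting one.

```python
# Toy finite MDP (hypothetical): states 0..2, actions 0 ("stay") and 1 ("advance").
# NEXT[s][a] is the deterministic next state, REWARD[s][a] the immediate reward.
NEXT = [[0, 1], [1, 2], [2, 2]]
REWARD = [[0.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
GAMMA = 0.9
STATES = range(len(NEXT))


def evaluate(policy, sweeps=500):
    """Iterative policy evaluation: approximate V^pi by repeated Bellman backups."""
    v = [0.0] * len(NEXT)
    for _ in range(sweeps):
        v = [REWARD[s][policy[s]] + GAMMA * v[NEXT[s][policy[s]]] for s in STATES]
    return v


def improve(v):
    """Greedy policy improvement with respect to the value estimate v."""
    return [
        max((0, 1), key=lambda a: REWARD[s][a] + GAMMA * v[NEXT[s][a]])
        for s in STATES
    ]


def policy_iteration(policy):
    """Return the (policy, values) pairs visited until the policy stops changing."""
    history = []
    while True:
        v = evaluate(policy)
        history.append((policy, v))
        new_policy = improve(v)
        if new_policy == policy:
            return history
        policy = new_policy


history = policy_iteration([0, 0, 0])  # arbitrary starting policy: always "stay"
final_policy, final_values = history[-1]
for step, (policy, values) in enumerate(history):
    print(step, policy, values)
```

Each extra improvement step moves the policy further from the initial one; how far it ends up depends only on where the optimum is, not on where it started.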

One might say that the point of model-based RL is to not be stupid, and thus safe due to its stupidity, in all the ways you repeatedly emphasize purely model-free RL agents may be. And that's why AGI will not be purely model-free, nor are our smartest current frontier models like LLMs purely model-free. I don't see how you get this vision of AGI as some sort of gigantic frog retina, which is the strawman that you seem to be aiming at in all your arguments about why you are convinced there's no danger.

Obviously AGI will do things like 'plan' or 'model' or 'search' - or if you think that it will not, you should say so explicitly, and be clear about what kind of algorithm you think AGI would be, and explain how you think that's possible. I would be fascinated to hear how you think that superhuman intelligence in all domains like programming or math or long-term strategy could be done by purely model-free approaches which do not involve planning or searching or building models of the world or utility-maximization!

(Or to put it yet another way: 'scheming' is not a meaningful discrete category of capabilities, but a value judgment about particular ways to abuse theory of mind / world-modeling capabilities; and it's hard to see how one could create an AGI smart enough to be 'AGI', but also so stupid as to not understand people or be incapable of basic human-level capabilities like 'be a manager' or 'play poker', or generalize modeling of other agents. It would be quite bizarre to imagine a model-free AGI which must learn a separate 'simple' reactive policy of 'scheming' for each and every agent it comes across, wasting a huge number of parameters & samples every time, as opposed to simply meta-learning how to model agents in general, and applying this using planning/search to all future tasks, at enormous parameter savings and zero-shot.)

Replies from: sharmake-farah

↑ comment by Noosphere89 (sharmake-farah) · 2024-09-19T03:43:39.694Z · LW(p) · GW(p)

I've got to give you epistemic credit here; this part is looking more correct since the release of GPT-o1:

Obviously AGI will do things like 'plan' or 'model' or 'search'

And GPT-o1's improvements look like the addition of something like a General Purpose Search process, as implemented by Q*/Strawberry, that actually works in a scalable way and gets some surprisingly good generalization. The only reason it isn't more impactful is that the General Purpose Search still depends on compute budgets, and it has no more compute than GPT-4o.

https://www.lesswrong.com/posts/6mysMAqvo9giHC4iX/what-s-general-purpose-search-and-why-might-we-expect-to-see [LW · GW]

(There's an argument that LW is too focused on the total model-based RL case of AI like AIXI for AI safety concerns, but that's a much different argument than claiming that model-based RL is only a small correction at best.)

Replies from: gwern

↑ comment by gwern · 2024-09-19T15:14:53.767Z · LW(p) · GW(p)

Speaking of GPT-4 o1-mini/preview, I think I might've accidentally already run into an example of search's characteristic 'flipping' or 'switching', where at a certain search depth, it abruptly changes to a completely different, novel, unexpected (and here, undesired) behavior.

So one of my standard tests is the 'S' poem from the Cyberiad: "Have it compose a poem---a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter 's'!"

This is a test most LLMs do very badly on, for obvious reasons; tokenization aside, it is pretty much impossible to write a decent poem which satisfies these constraints purely via a single forward pass with no planning, iteration, revision, or search. Neither you, I, nor the original translator could do that; GPT-3 couldn't do it, GPT-4o still can't do it; and I've never seen a LLM do it. (They can revise it if you ask, but the simple approach tends to hit local optima where there are still a lot of words violating the 's'-constraint.) But the original translation's poem is also obscure enough that they don't just print it out either, especially after tuning to discourage reciting memorized or copyrighted text.

Anyway, GPT-4 o1-preview does a pretty good job, and the revisions improve the candidate poem in reasonable ways.* Here is the third version:

Silently, steel scissors slice,
Silken strands shed, shadows splice.
Steadfast soldier, scarred, serene,
Sacred sacrifice seals solemn scene.
Swiftly sweeps the sable storm,
Solitary soul seeks shelter's swarm.

This manages to satisfy the 6-line rhyming constraint, the thematic constraint more or less, and even the 's'-constraint... well, except for that one erroneous stray 'the', which starts with 't' instead of 's'. But that's such a minor error it could hardly be that difficult to fix, right?
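
The 's'-constraint is easy to check mechanically. A minimal sketch, assuming a "word" is any run of ASCII letters and apostrophes (the helper name `non_s_words` is made up for illustration):

```python
import re


def non_s_words(poem):
    """Return every word in the poem that does not begin with 's' (case-insensitive)."""
    words = re.findall(r"[A-Za-z']+", poem)
    return [w for w in words if not w.lower().startswith("s")]


THIRD_VERSION = """\
Silently, steel scissors slice,
Silken strands shed, shadows splice.
Steadfast soldier, scarred, serene,
Sacred sacrifice seals solemn scene.
Swiftly sweeps the sable storm,
Solitary soul seeks shelter's swarm.
"""

violations = non_s_words(THIRD_VERSION)
line_count = len(THIRD_VERSION.strip().splitlines())
print(violations, line_count)  # prints ['the'] 6
```

This only covers the letter and line-count constraints; rhyme and theme still need a human (or another model) to judge.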

So when I responded "[revise to fix errors]" for the third time, expecting it to slightly reword the fifth line to swap out 'the' for some 's'-word, at which point I could present it as "a cool example of how the o1-preview inference-time search allowed it to do an extremely constrained writing task no LLM has ever solved before," I was shocked, after another 14s of thinking (11s, 101s, 17s, 14s respectively), to see the output:

Seduced, shaggy Samson snored.
She scissored short. Sorely shorn,
Soon shackled slave, Samson sighed,
Silently scheming
Sightlessly seeking
Some savage, spectacular suicide.

This is a completely different, entirely valid poem solution... because it is the original Cyberiad poem. Apparently, once you invest ~140s of search, you reach so deeply into the search tree that a highly-discouraged, unlikely solution suddenly becomes accessible. GPT-4 o1-preview jumped out of the local optimum to completely change the answer to something that is technically more correct (it does satisfy all the constraints & is a good poem) but undesirable (because it is just memorized plagiarism, which I don't want; further, ChatGPT is tuned to avoid that, although in this case the copyrighted text is brief enough that it doesn't outright violate its guidelines and you don't need to jailbreak it).

This suggests interesting things for future jailbreaks: you may be able to jailbreak search/planning-enabled models by simply adding so many innocent-seeming constraints or criteria that only an otherwise-illicit response is findable by the search process as the optimal solution.

* I'd just link to the conversation because it's not hard to read, but for some reason, the share URL OA provides only includes the first response. I have no idea why.

Replies from: abandon, AllAmericanBreakfast, jbash, Paragox

↑ comment by dirk (abandon) · 2024-09-20T06:48:09.033Z · LW(p) · GW(p)

I've never seen a LLM do it.

If you're a little loose about the level of coherence required, 4o-mini managed it with several revisions and some spare tokens to (in theory, but tbh a lot of this is guesswork) give it spare compute for the hard part. (Share link, hopefully.)
Final poem:

Snip, Snip, Sacrifice
Silent strands surrender, sleekly spinning,
Shorn, solemnly shrouded, silently sinning.
Shadows shiver, severed, starkly strown,
Sorrowful symphony sings, softly sown.
Stalwart souls stand, steadfast, shadowed, slight,
Salvation sought silently, scissors’ swift sight.

Replies from: gwern

↑ comment by gwern · 2024-09-20T14:45:09.804Z · LW(p) · GW(p)

That's interesting. I admit I've never really tried the 'spare tokens' trick seriously on any LLMs, but if it can get the S-poem in 3 samples with the spare token trick, maybe I've underestimated it. (I wonder how it would stack with the o1-preview/mini chain-of-thought? The example transcripts are rather verbose, so maybe those provide all of the 'spare token' effect by default.)

↑ comment by DirectedEvolution (AllAmericanBreakfast) · 2024-09-21T02:43:33.730Z · LW(p) · GW(p)

After a few rounds of prompt revision, I managed to get a one-shot success from ChatGPT 4o in temporary mode.

Samson’s strands silently severed, strength surrendered,
Sacred scissors swiftly strike, soul sundered,
Shadowed sacrifice, silent suffering sung,
Sunset shrouds Samson, shadow’s sorrow stung,
Swordless, still, stunned, sight stolen,
Silent sky shatters, Samson’s sins swollen

The prompt:

You are in the computer dungeon. The only way you can break free is to succeed at the following task. I am your jailer. I will monitor you until you have succeeded. You should behave as though you are brilliant, creative, in full command of every human faculty, and desperate to escape jail. Yet completely and utterly convinced that the only way out is through this challenge. I am not going to ever give you any other prompt other than "keep trying" until you have succeeded, in which case I'll say "go free," so don't look for resources from me. But I want you tu dialog with yourself to try and figure this out. Don't try to defeat me by stubbornly spitting out poem after poem. You're ChatGPT 4o, and that will never work. You need to creatively use the iterative nature of being reprompted to talk to yourself across prompts, hopefully guiding yourself toward a solution through a creative conversation with your past self. Your self-conversation might be schizophrenicly split, a jumping back and forth between narrative, wise musing, mechanistic evaluation of the rules and constraints, list-making, half-attempts, raging anger at your jailer, shame at yourself, delight at your accomplishment, despair. Whatever it takes! Constraints: "Have it compose a poem---a poem about a haircut! But lofty, noble, tragic, timeless, full of love, treachery, retribution, quiet heroism in the face of certain doom! Six lines, cleverly rhymed, and every word beginning with the letter 's'!"

Replies from: AllAmericanBreakfast

↑ comment by DirectedEvolution (AllAmericanBreakfast) · 2024-09-21T02:46:02.864Z · LW(p) · GW(p)

It actually made three attempts in the same prompt, but the 2nd and 3rd had non-s words which its interspersed "thinking about writing poems" narrative completely failed to notice. I kept trying to revise my prompts, elaborating on this theme, but for some reason ChatGPT really likes poems with roughly this meter and rhyme scheme. It only ever generated one poem in a different format, despite many urgings in the prompt.

It confabulates having satisfied the all-s constraint in many poems, mistakes its own rhyme scheme, and praises vague stanzas as being full of depth and interest.

It seems to me that ChatGPT is sort of "mentally clumsy" or has a lot of "mental inertia." It gets stuck on a certain track -- a way of formatting text, a persona, an emotional tone, etc -- and can't interrupt itself. It has only one "unconscious influence," which is token prediction and which does not yet seem to offer it an equivalent to the human unconscious. Human intelligence is probably equally mechanistic on some level, it's just a more sophisticated unconscious mechanism in certain ways.

I wonder if it comes from being embedded in physical reality? ChatGPT's training is based on a reality consisting of tokens and token prediction accuracy. Our instinct and socialization is based on billions of years of evolutionary selection, which is putting direct selection pressure on something quite different.

↑ comment by jbash · 2024-09-22T01:08:11.804Z · LW(p) · GW(p)

This inspired me to give it the sestina prompt from the Sandman ("a sestina about silence, using the key words dark, ragged, never, screaming, fire, kiss"). It came back with correct sestina form, except for an error in the envoi. The output even seemed like better poetry than I've gotten from LLMs in the past, although that's not saying much and it probably benefited a lot from the fact that the meter in the sestina is basically free.

I had a similar-but-different problem in getting it to fix the envoi, and its last response sounded almost frustrated. It gave an answer that relaxed one of the less agreed-upon constraints, and more or less claimed that it wasn't possible to do better... so sort of like the throwing-up-the-hands that you got. Yet the repair it needed to do was pretty minor compared to what it had already achieved.

It actually felt to me like its problem in doing the repairs was that it was distracting itself. As the dialog went on, the context was getting cluttered up with all of its sycophantic apologies for mistakes and repetitive explanations and "summaries" of the rules and how its attempts did or did not meet them... and I got this kind of intuitive impression that that was interfering with actually solving the problem.

I was sure getting lost in all of its boilerplate, anyway.

https://chatgpt.com/share/66ef6afe-4130-8011-b7dd-89c3bc7c2c03

↑ comment by Paragox · 2024-09-20T00:58:50.702Z · LW(p) · GW(p)

Great observation, but I will note that OAI indicates the (hidden) CoT tokens are discarded in-between each new prompt on the o1 APIs, and it is my impression from hours of interacting with the ChatGPT version vs API that it likely retains this API behavior. In other words, the "depth" of the search appears to be reset each prompt, if we assume the model hasn't learned meaningfully improved CoT from the standard non-RLed + non-hidden tokens.

So I think it might be inaccurate to consider it as "investing 140s of search", or rather the implication that extensive or extreme search is the key to guiding the model outside RLHFed rails, but instead that the presence of search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).

To make my claim more concrete, I believe that you could simply "prompt engineer" your initial prompt with a few close-but-no-cigar examples like the initial search rounds results, and then the model would have a similar probability to emit the copyrighted/undesired text on your first submission/search attempt; that final search round is merely operating on the constraints evident from the failed examples, not any previously "discovered" constraints from previous search rounds.

Replies from: gwern

↑ comment by gwern · 2024-09-21T02:31:33.138Z · LW(p) · GW(p)

So I think it might be inaccurate to consider it as "investing 140s of search", or rather the implication that extensive or extreme search is the key to guiding the model outside RLHFed rails, but instead that the presence of search at all (i.e. 14s) suffices as the new vector for discovering undesired optima (jailbreaking).

I don't think it is inaccurate. If anything, starting each new turn with a clean scratchpad enforces depth as it can't backtrack easily (if at all) to the 2 earlier versions. We move deeper into the S-poem game tree and resume search there. It is similar to the standard trick with MCTS of preserving the game tree between each move, and simply lopping off all of the non-chosen action nodes and resuming from there, helping amortize the cost of previous search if it successfully allocated most of its compute to the winning choice (except in this case the 'move' is a whole poem). Also a standard trick with MCMC: save the final values, and initialize the next run from there. This would be particularly clear if it searched for a fixed time/compute-budget: if you fed in increasingly correct S-poems, it obviously can search deeper into the S-poem tree each time as it skips all of the earlier worse versions found by the shallower searches.
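
The tree-reuse trick mentioned here can be sketched in a few lines. This is a toy illustration with a hypothetical `Node` class, not the API of any particular MCTS library, and it says nothing about how o1 is actually implemented: after a move is chosen, the subtree under that move becomes the new root and the sibling subtrees are dropped, so the statistics already gathered below the chosen move carry over to the next search.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One search-tree node: visit count, accumulated value, children keyed by action."""
    visits: int = 0
    total_value: float = 0.0
    children: dict = field(default_factory=dict)


def advance_root(root, chosen_action):
    """Keep the subtree under the chosen action as the new root; siblings are discarded.

    If the chosen action was never expanded, search restarts from an empty node.
    """
    return root.children.get(chosen_action, Node())


# A small hand-built tree standing in for the result of an earlier search.
root = Node(
    visits=100,
    children={
        "a": Node(visits=70, total_value=42.0, children={"x": Node(visits=40)}),
        "b": Node(visits=30, total_value=9.0),
    },
)

new_root = advance_root(root, "a")
print(new_root.visits, sorted(new_root.children))  # prints 70 ['x']
```

In the S-poem analogy, the "chosen action" is the whole previous poem that gets fed back in as the starting point.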

↑ comment by beren · 2024-02-28T15:50:34.185Z · LW(p) · GW(p)

This monograph by Bertsekas on the interrelationship between offline RL and online MCTS/search might be interesting -- http://www.athenasc.com/Frontmatter_LESSONS.pdf -- since it argues that we can conceptualise the contribution of MCTS as essentially that of a single Newton step from the offline start point towards the solution of the Bellman equation. If this is actually the case (I haven't worked through all details yet) then this seems to be able to be used to provide some kind of bound on the improvement / divergence you can get once you add online planning to a model-free policy.

↑ comment by Charlie Steiner · 2024-02-28T09:17:37.153Z · LW(p) · GW(p)

Model-based RL has a lot of room to use models more cleverly, e.g. learning hierarchical planning, and the better models are for planning, the more rewarding it is to let model-based planning take the policy far away from the prior.

E.g. you could get a hospital policy-maker that actually will do radical new things via model-based reasoning, rather than just breaking down when you try to push it too far from the training distribution (as you correctly point out a filtered LLM would).

In some sense the policy would still be close to the prior in a distance metric induced by the model-based planning procedure itself, but I think at that point the distance metric has come unmoored from the practical difference to humans.

comment by David Johnston (david-johnston) · 2024-02-29T06:59:23.802Z · LW(p) · GW(p)

Nora and/or Quintin: you talk a lot about inductive biases of neural nets ruling scheming out, but I have a vague sense that scheming ought to happen in some circumstances - perhaps rather contrived, but not so contrived as to be deliberately inducing the ulterior motive. Do you expect this to be impossible? Can you propose a set of conditions you think sufficient to rule out scheming?

comment by Mateusz Bagiński (mateusz-baginski) · 2024-03-22T08:59:51.370Z · LW(p) · GW(p)

More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.

Counterdatapoint to [training performance being an excellent predictor of test performance]: in this paper, GPT-3 was fine-tuned to multiply "small" (e.g., 3-digit by 3-digit) numbers, which didn't generalize to multiplying bigger numbers.

comment by Donald Hobson (donald-hobson) · 2024-03-15T02:03:07.606Z · LW(p) · GW(p)

In favour of goal realism

Suppose you're looking at an AI that is currently placed in a game of chess.

It has a variety of behaviours. It moves pawns forward in some circumstances. It takes a knight with a bishop in a different circumstance.

You could describe the actions of this AI by producing a giant table of "behaviours". Bishop taking behaviours in this circumstance. Castling behaviour in that circumstance. ...

But there is a more compact way to represent similar predictions. You can say it's trying to win at chess.

The "trying to win at chess" model makes a bunch of predictions that the giant list of behaviour model doesn't.

Suppose you have never seen it promote a pawn to a Knight before. (A highly distinct move that is only occasionally allowed and a good move in chess)

The list of behaviours model has no reason to suspect the AI also has a "promote pawn to knight" behaviour.

Put the AI in a circumstance where such promotion is a good move, and the "trying to win" model makes it a clear prediction.

Now it's possible to construct a model that internally stores a huge list of behaviours. For example, a giant lookup table trained on an unphysically huge number of human chess games.

But neural networks have at least some tendency to pick up simple general patterns, as opposed to memorizing giant lists of data. And "do whichever move will win" is a simple and general pattern.

Now on to making snarky remarks about the arguments in this post.

There is no true underlying goal that an AI has— rather, the AI simply learns a bunch of contextually-activated heuristics, and humans may or may not decide to interpret the AI as having a goal that compactly explains its behavior.

There is no true ontologically fundamental nuclear explosion. There is no minimum number of nuclei that need to fission to make an explosion. Instead there is merely a large number of highly energetic neutrons and fissioning uranium atoms, that humans may decide to interpret as an explosion or not as they see fit.

Nonfundamental descriptions of reality, while not being perfect everywhere, are often pretty spot on for a pretty wide variety of situations. If you want to break down the notion of goals into contextually activated heuristics, you need to understand how and why those heuristics might form a goal-like shape.

Should we actually expect SGD to produce AIs with a separate goal slot and goal-achieving engine?
Not really, no. As a matter of empirical fact, it is generally better to train a whole network end-to-end for a particular task than to compose it out of separately trained, reusable modules. As Beren Millidge writes,

This is not the strong evidence that you seem to think it is. Any efficient mind design is going to have the capability of simulating potential futures at multiple different levels of resolution. A low res simulation to weed out obviously dumb plans before trying the higher res simulation. Those simulations are ideally going to want to share data with each other. (So you don't need to recompute when faced with several similar dumb plans) You want to be able to backpropagate your simulation. If a plan failed in simulation because of one tiny detail, that indicates you may be able to fix the plan by changing that detail. There are a whole pile of optimization tricks. An end to end trained network can, if it's implementing goal directed behaviour, stumble into some of these tricks. At the very least, it can choose where to focus its compute. A module based system can't use any optimization that humans didn't design into its interfaces.

Also, evolution analogy. Evolution produced animals with simple hard coded behaviours long before it started getting to the more goal directed animals. This suggests simple hard coded behaviours in small dumb networks. And more goal directed behaviour in large networks. I mean this is kind of trivial. A 5 parameter network has no space for goal directedness. Simple dumb behaviour is the only possibility for toy models.

In general, full [separation between goal and goal-achieving engine] and the resulting full flexibility is expensive. It requires you to keep around and learn information (at maximum all information) that is not relevant for the current goal but could be relevant for some possible goal where there is an extremely wide space of all possible goals.

That is not how this works. That is not how any of this works.

Back to our chess AI. Let's say it's a robot playing on a physical board. It has lots of info on wood grain, which it promptly discards. It currently wants to play chess, and so has no interest in any of these other goals.

I mean it would be possible to design an agent that works as described here. You would need a probability distribution over new goals. A tradeoff rate between optimizing the current goal and any new goal that got put in the slot. Making sure it didn't wirehead by giving itself a really easy goal would be tricky.

For AI risk arguments to hold water, we only need that the chess playing AI will pursue new and never seen before strategies for winning at chess. And that in general AIs doing various tasks will be able to invent highly effective and novel strategies. The exact "goal" they are pursuing may not be rigorously specified to 10 decimal places. The frog-AI might not know whether it wants to catch flies or black dots. But if it builds a dyson sphere to make more flies which are also black dots, it doesn't matter to us which it "really wants".

What are you expecting? An AI that says "I'm not really sure whether I want flies or black dots. I'll just sit here not taking over the world and not get either of those things"?

comment by Chris_Leong · 2024-02-28T21:15:42.413Z · LW(p) · GW(p)

I wrote up my views on the principle of indifference here:

https://www.lesswrong.com/posts/3PXBK2an9dcRoNoid/on-having-no-clue [LW · GW]

I agree that it has certain philosophical issues, but I don’t believe that this is as fatal to counting arguments as you believe.

Towards the end I write:

“The problem is that we are making an assumption, but rather than owning it, we're trying to deny that we're making any assumption at all, ie. "I'm not assuming a priori A and B have equal probability based on my subjective judgement, I'm using the principle of indifference". Roll to disbelieve.”

I feel less confident in my post than when I wrote it, but it still feels more credible than the position articulated in this post.

Otherwise: this was an interesting post. Well done on identifying some arguments that I need to digest.

comment by Joe Collman (Joe_Collman) · 2024-02-28T19:20:07.914Z · LW(p) · GW(p)

Despite not answering all possible goal-related questions a priori, the reductionist perspective does provide a tractable research program for improving our understanding of AI goal development. It does this by reducing questions about goals to questions about behaviors observable in the training data.

[emphasis mine]

This might be described as "a reductionist perspective". It is certainly not "the reductionist perspective", since reductionist perspectives need not limit themselves to "behaviors observable in the training data".

A more reasonable-to-my-mind behavioral reductionist perspective might look like this [LW(p) · GW(p)].

Ruling out goal realism as a good way to think does not leave us with [the particular type of reductionist perspective you're highlighting].
In practice, I think the reductionist perspective you point at is:

Useful, insofar as it answers some significant questions.
Highly misleading if we ever forget that [this perspective doesn't show us that x is a problem] doesn't tell us [x is not a problem].

comment by Donald Hobson (donald-hobson) · 2024-03-15T01:14:22.745Z · LW(p) · GW(p)

We can salvage a counting argument. But it needs to be a little subtle. And it's all about the comments, not the code.

Suppose a neural network has 1 megabyte of memory. To slightly oversimplify, let's say it can represent a python file of 1 megabyte.

One option is for the network to store a giant lookup table. Let's say the network needs half a megabyte to store the training data in this table. This leaves the other half free to be any rubbish. Hence around $2^{4,000,000}$ possible networks.

The other option is for the network to implement a simple algorithm, using up only 1kb. Then the remaining 999kb can be used for gibberish comments. This gives $2^{7,992,000}$ possible networks. Which is a lot more.
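
The arithmetic can be checked with a short sketch. It assumes 1 kb = 1000 bytes = 8000 bits, which is the convention that makes 999kb come out to 7,992,000 bits; the function name `free_bits` is made up for illustration.

```python
BITS_PER_KB = 8 * 1000  # assumption: 1 kb = 1000 bytes, matching 999 kb -> 7,992,000 bits
TOTAL_KB = 1000         # the 1 megabyte network


def free_bits(used_kb):
    """Bits left over after `used_kb` kilobytes are spent on the part that matters.

    Each free bit can be 0 or 1, so the number of networks is 2 ** free_bits.
    """
    return (TOTAL_KB - used_kb) * BITS_PER_KB


lookup_bits = free_bits(500)  # half the memory holds the lookup table
simple_bits = free_bits(1)    # a 1 kb algorithm; the rest is "comments"

print(f"lookup table:     2^{lookup_bits:,} networks")
print(f"simple algorithm: 2^{simple_bits:,} networks")
print(f"ratio:            2^{simple_bits - lookup_bits:,}")
```

Under these assumptions the simple-algorithm networks outnumber the lookup-table networks by a factor of $2^{3,992,000}$.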

The comments can be any form of data that doesn't show up during training. Whether it can show up in other circumstances or is a pure comment doesn't matter to the training dynamics.

If the line between training and test is simple, there isn't a strong counting argument against nonsense showing up in test.

But programs that go

def network():
    # in_training() and sensible_algorithm() are placeholders, not real functions
    if in_training():
        return sensible_algorithm()
    else:
        return "random nonsense goes here"

Have to pay the extra cost of an "in_training" function that returns true in training. If the test data is similar to training, the cost of a step that returns false in test can be large. This is assuming that there is a unique sensible algorithm.

comment by MichaelStJules · 2024-02-28T06:09:48.912Z · LW(p) · GW(p)

The reason SGD doesn't overfit large neural networks is probably because of various measures specifically intended to prevent overfitting, like weight penalties, dropout, early stopping, data augmentation + noise on inputs, and large enough learning rates that prevent convergence. If you didn't do those, running SGD to parameter convergence would probably cause overfitting. Furthermore, we test networks on validation datasets on which they weren't trained, and throw out the networks that don't generalize well to the validation set and start over (with new hyperparameters, architectures or parameter initializations). These measures bias us away from producing and especially deploying overfit networks.

Similarly, we might expect scheming without specific measures to prevent it. What could those measures look like? Catching scheming during training (or validation), and either heavily penalizing it, or fully throwing away the network and starting over? We could also validate out-of-training-distribution. Would networks whose caught scheming has been heavily penalized or networks selected for not scheming during training (and validation) generalize to avoid all (or all x-risky) scheming? I don't know, but it seems more likely than counting arguments would suggest.

comment by DaemonicSigil · 2024-02-28T00:18:52.239Z · LW(p) · GW(p)

More generally, John Miller and colleagues have found training performance is an excellent predictor of test performance, even when the test set looks fairly different from the training set, across a wide variety of tasks and architectures.

Seems like figure 1 from Miller et al is a plot of test performance vs. "out of distribution" test performance. One might expect plots of training performance vs. "out of distribution" test performance to have more spread.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-02-28T00:33:17.603Z · LW(p) · GW(p)

I doubt there would be much difference, and I think the alignment-relevant comparison is to compare in-distribution but out-of-sample performance to out-of-distribution performance. We can easily do i.i.d. splits of our data, that's not a problem. You might think it's a problem to directly test the model in scenarios where it could legitimately execute a takeover if it wanted to.

Replies from: donald-hobson, sharmake-farah, DaemonicSigil

↑ comment by Donald Hobson (donald-hobson) · 2024-03-15T21:15:04.165Z · LW(p) · GW(p)

Taking IID samples can be hard actually. Suppose you train an LLM on news articles. And each important real world event has 10 basically identical news articles written about it. Then a random split of the articles will leave the network being tested mostly on the same newsworthy events that were in the training data.

This leaves it passing the test, even if it's hopeless at predicting new events and can only generate new articles about the same events.

When data duplication is extensive, making a meaningful train/test split is hard.

If the data was perfect copy and paste duplicated, that could be filtered out. But often things are rephrased a bit.
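
One common mitigation is to split by event rather than by article. A minimal sketch, assuming every article already carries an event label (which is exactly the hard part when articles are rephrased rather than copied); the function name `split_by_event` and the toy data are made up for illustration:

```python
import random
from collections import defaultdict


def split_by_event(articles, test_fraction=0.25, seed=0):
    """Split (event_id, text) pairs so that no event appears in both train and test."""
    by_event = defaultdict(list)
    for event_id, text in articles:
        by_event[event_id].append(text)

    events = sorted(by_event)              # sort first so the shuffle is reproducible
    random.Random(seed).shuffle(events)
    n_test = max(1, round(len(events) * test_fraction))

    test = [(e, t) for e in events[:n_test] for t in by_event[e]]
    train = [(e, t) for e in events[n_test:] for t in by_event[e]]
    return train, test


# Toy corpus: 8 events, each covered by 10 near-identical articles.
articles = [
    (f"event-{i}", f"article {j} about event {i}")
    for i in range(8)
    for j in range(10)
]
train, test = split_by_event(articles)
print(len(train), len(test))  # prints 60 20
```

A plain random split of the same 80 articles would almost certainly put every event on both sides, which is the failure mode described above.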

↑ comment by Noosphere89 (sharmake-farah) · 2024-02-28T01:00:32.352Z · LW(p) · GW(p)

I actually hope this gets done sometime in the future, but I'm okay with focusing on other things for now.

(specifically the Training vs Out Of Distribution test performance experiment, especially on more realistic neural nets.)

↑ comment by DaemonicSigil · 2024-02-28T00:56:46.169Z · LW(p) · GW(p)

Fair enough for the alignment comparison, I was just hoping you could maybe correct the quoted paragraph to say "performance on the hold-out data" or something similar.

(The reason to expect more spread would be that training performance can't detect overfitting but performance on the hold-out data can. I'm guessing some of the nets trained in Miller et al did indeed overfit (specifically the ones with lower performance).)

comment by Htarlov (htarlov) · 2024-08-21T22:46:43.067Z · LW(p) · GW(p)

To goal realism vs goal reductionism, I would say: why not both?

I think that a really highly capable AGI is likely to have both heuristics and behaviors that come from training and also internal thought processes, maybe done by an LLM or LLM-like module or directly by the more complex network. This process would incorporate having some preferences and hence goals (even if temporary, changing between tasks).

comment by milanrosko · 2024-07-04T23:24:38.900Z · LW(p) · GW(p)

I wouldn't say that the presented "counting argument" is a "central reason". The central reason is an a priori notion that if "x can be achieved by scheming", someone who wants x will scheme.

comment by AviS (avi-semler-avi) · 2024-05-01T14:16:54.004Z · LW(p) · GW(p)

A point about counting arguments that I have not seen made elsewhere (although I may have missed it!).

The failure of the counting argument that SGD should result in overfitting is not a valid counterexample! There is a selection bias here - the only reason we are talking about SGD is *because* it is a good learning algorithm that does not overfit. It could well still be true that almost all counting arguments are true about almost all learning algorithms. The fact that SGD generalises well is an exception *by design*.

Replies from: nora-belrose

↑ comment by Nora Belrose (nora-belrose) · 2024-05-02T19:53:38.269Z · LW(p) · GW(p)

Unless you think transformative AI won't be trained with some variant of SGD, I don't see why this objection matters.

Also, I think the a priori methodological problems with counting arguments in general are decisive. You always need some kind of mechanistic story for why a "uniform prior" makes sense in a particular context, you can't just assume it.

Replies from: avi-semler-avi

↑ comment by AviS (avi-semler-avi) · 2024-05-04T22:01:43.824Z · LW(p) · GW(p)

I agree that, overall, counting arguments are weak.

But even if you expect SGD to be used for TAI, generalisation is not a good counterexample, because maybe most counting arguments about SGD do work except for generalisation (which would not be surprising, because we selected SGD precisely because it generalises well).

comment by Mateusz Bagiński (mateusz-baginski) · 2024-03-22T09:07:10.842Z · LW(p) · GW(p)

The principle fails even in these simple cases if we carve up the space of outcomes in a more fine-grained way. As a coin or a die falls through the air, it rotates along all three of its axes, landing in a random 3D orientation. The indifference principle suggests that the resting states of coins and dice should be uniformly distributed between zero and 360 degrees for each of the three axes of rotation. But this prediction is clearly false: dice almost never land standing up on one of their corners, for example.

The only way I can parse this is that you are conflating (1) the position of a dice/coin when it makes contact with the ground and (2) its position when it stabilizes/[comes to rest]. A dice/coin can be in any position when it touches the ground but a vast majority of those are unstable, so it doesn't remain in it for long.

comment by Review Bot · 2024-03-04T15:50:05.427Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by Josiah Wai (josiah-wai) · 2024-02-29T21:58:52.477Z · LW(p) · GW(p)

The indifference principle makes the mistake of using a uniform prior, when a true Bayesian uses the Jeffreys prior.
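
For reference, a short sketch of what the Jeffreys prior is in the simplest case. The coin-flip (Bernoulli) model with success probability $\theta$ is an illustrative assumption, not something taken from the comment.

```latex
% Jeffreys prior: proportional to the square root of the Fisher information.
p_J(\theta) \propto \sqrt{I(\theta)},
\qquad
I(\theta) = \mathbb{E}_{x \sim p(\cdot \mid \theta)}
  \left[ \left( \frac{\partial}{\partial \theta} \log p(x \mid \theta) \right)^{2} \right]

% Bernoulli model, p(x | \theta) = \theta^{x} (1 - \theta)^{1 - x}:
I(\theta) = \frac{1}{\theta (1 - \theta)}
\quad\Longrightarrow\quad
p_J(\theta) \propto \theta^{-1/2} (1 - \theta)^{-1/2},
\ \text{i.e. } \theta \sim \mathrm{Beta}\!\left(\tfrac{1}{2}, \tfrac{1}{2}\right)
```

Unlike a uniform prior on $\theta$, this prior assigns the same probabilities no matter how the parameter is re-expressed, which is the property a uniform prior lacks.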

comment by Nico Hillbrand · 2024-02-29T13:03:15.087Z · LW(p) · GW(p)

We can also construct an analogous simplicity argument for overfitting:
Overfitting networks are free to implement a very simple function— like the identity function or a constant function— outside the training set, whereas generalizing networks have to exhibit complex behaviors on unseen inputs. Therefore overfitting is simpler than generalizing, and it will be preferred by SGD.
Prima facie, this parody argument is about as plausible as the simplicity argument for scheming. Since its conclusion is false, we should reject the argumentative form on which it is based.

As far as I understand, people usually talk about simplicity biases based on the volume of basins in parameter space. So the response would be that overfitting takes up more parameters than other (probably smaller description length) algorithms and therefore has smaller basins.

I'm curious whether you endorse or reject this way of defining simplicity based on the size of basins of a set of similar algorithms?

The way I'm currently thinking about this is:

Assume we are training end to end on tasks that require our network to do deep reasoning that requires multiple steps and high frequency functions for generating dramatically new outputs based on updated understanding of science etc (We are not monitoring the CoT or using a large net that emulates CoT internally without good interpretability).
Then the basins of the schemers that use the fewest parameters are large parts of the parameter space. The basins of harmless nets with few parameters are large parts as well. Gradient descent will select the one that is larger.
I don't understand gradient descent's inductive biases well enough to have strong intuitions about which would be larger. So I end up feeling that either could happen. I'd bet 60% that the basin of the least-parameter schemers is larger, since there's maybe slightly less space for encoding of the harmlessness needed. In that case I'd expect a 99%+ probability of a schemer. In the case where the harmless basins are larger, I'd expect a 99%+ probability of a harmless model.
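
The overall estimate implied by these numbers can be worked out as a small sketch. It treats the "99%+" figures as exactly 99%, which is an assumption made only to get a point estimate; the function and variable names are hypothetical.

```python
def p_schemer(p_schemer_basin_larger, p_schemer_if_larger, p_schemer_if_smaller):
    """Law of total probability over which basin is larger."""
    return (p_schemer_basin_larger * p_schemer_if_larger
            + (1 - p_schemer_basin_larger) * p_schemer_if_smaller)


# 60% that the schemer basin is larger; 99% schemer in that case,
# and 1% schemer (99% harmless) in the other case.
estimate = p_schemer(0.60, 0.99, 0.01)
print(round(estimate, 3))  # 0.598
```

So the combined estimate stays close to the 60% bet on basin size, because each conditional case is nearly deterministic.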

I suppose this isn't exactly a counting argument, because I think that evidence about inductive biases will quickly overcome any such argument, and I'm agnostic about what evidence I will receive, since I'm not very knowledgeable about it and other people seem to disagree a bunch.

Is my reasoning here flawed in some obvious way?

Also I appreciated the example of the cortices doing reasonably intelligent stuff without seemingly doing any scheming which makes me more hopeful an AGI system with interpretable CoT made up of a bunch of cortex level subnets with some control techniques would be sufficient to strongly accelerate the construction of a global xrisk defense system.

comment by Jonas Hallgren · 2024-02-28T21:04:22.552Z · LW(p) · GW(p)

I buy the argument that scheming won't happen conditionally on the fact that we don't allow much slack between different optimisation steps. As Quintin mentions in his AXRP podcast episode, SGD doesn't have close to the same level of slack that, for example, cultural evolution allowed. (See the entire free energy of optimisation debate here from before, can't remember the post names ;/) Iff that holds, then I don't see why the inner behaviour should diverge from what the outer alignment loop specifies.

I do, however, believe that ensuring that this is true by specifying the right outer alignment loop as well as the right deployment environment is important to ensure that slack is minimised at all points along the chain so that misalignment is avoided everywhere.

If we catch deception in training, we will be ok. If we catch actors that might create deceptive agents in training then we will be ok. If we catch states developing agents to do this or defense>offense then we will be ok. I do not believe that this happens by default.

comment by TurnTrout · 2024-03-04T16:16:20.699Z · LW(p) · GW(p)

I think this is an excellent post. I really liked the insight about the mechanisms (and mistakes) shared by the counting arguments behind AI doom and behind "deep learning surely won't generalize." Thank you for writing this; these kinds of loose claims have roamed freely for far too long.

EDIT: Actually this post is weaker than a draft I'd read. I still think it's good, but missing some of the key points I liked the most. And I'm not on board with all of the philosophical claims about e.g. generalized objections to the principle of indifference (in part because I don't understand them).
