Gradient descent doesn't select for inner search
post by Ivan Vendrov (ivan-vendrov) · 2022-08-13T04:15:19.373Z · LW · GW · 23 comments
TL;DR: Gradient descent won't select for inner search processes because they're not compute & memory efficient.
Slightly longer TL;DR: A key argument for mesa-optimization is that as we search over programs, we will select for "search processes with simple objectives", because they are simpler or more compact than alternative less dangerous programs. This argument is much weaker when your program search is restricted to programs that use a fixed amount of compute, and you're not optimizing strongly for low description length - e.g. gradient descent in modern deep learning systems. We don't really know what shape of programs gradient descent selects for in realistic environments, but they are much less likely to involve search than commonly believed.
Note on terminology (added in response to comments): By "search" I mean here a process that evaluates a number of candidates before returning the best one; what Abram Demski calls "selection" in Selection vs Control [LW · GW] . The more candidates considered, the more "search-like" a process is - with gradient descent and A* being central examples, and a thermostat being a central counter-example.
Recap: compression argument for inner optimizers
Here's the argument from Risks From Learned Optimization [LW · GW]: [emphasis mine]
In some tasks, good performance requires a very complex policy. At the same time, base optimizers are generally biased in favor of selecting learned algorithms with lower complexity. Thus, all else being equal, the base optimizer will generally be incentivized to look for a highly compressed policy.
One way to find a compressed policy is to search for one that is able to use general features of the task structure to produce good behavior, rather than simply memorizing the correct output for each input. A mesa-optimizer is an example of such a policy. From the perspective of the base optimizer, a mesa-optimizer is a highly-compressed version of whatever policy it ends up implementing: instead of explicitly encoding the details of that policy in the learned algorithm, the base optimizer simply needs to encode how to search for such a policy. Furthermore, if a mesa-optimizer can determine the important features of its environment at runtime, it does not need to be given as much prior information as to what those important features are, and can thus be much simpler.
and even more forceful phrasing from John Wentworth [LW(p) · GW(p)]:
We don't know that the AI will necessarily end up optimizing reward-button-pushes or smiles; there may be other similarly-compact proxies which correlate near-perfectly with reward in the training process. We can probably rule out "a spread of situationally-activated computations which steer its actions towards historical reward-correlates", insofar as that spread is a much less compact policy-encoding than an explicit search process + simple objective(s).
Compactness, Complexity, and Compute
At face value, it does seem like we're selecting programs for simplicity. The Deep Double Descent paper showed us that gradient descent training in the overparametrized regime (i.e. the regime of all modern deep models) favors simpler models. But is this notion of simplicity the same as "compactness" or "complexity"? Evan seems to think so [LW · GW], I'm less sure. Let's dive into the different notions of complexity here.
The most commonly used notion of program complexity is Kolmogorov complexity (or description length), basically just "length of the program in some reference programming language". This definition seems natural... but, critically, it assumes away all computational constraints. K-complexity doesn't care if your program completes in a millisecond or runs until the heat death of the universe. This makes it a particularly perverse notion of complexity to use when interacting with the real, computationally constrained world.
Why do we use Kolmogorov complexity at all, then? One reason it's so intuitive is that it's the primary notion of complexity that human programmers use in their work. Programs that humans write are selected to be short, hence easy to write and analyze. Memory and compute use are often secondary concerns, because in most programming contexts human time is so much more valuable than computer time.
A simplistic model for how humans do program search is "What's the shortest program I can implement that solves the problem exactly?". The answer is usually some form of explicit search with a simple objective, because with no computational constraints that's always the shortest description length program (as an extreme example, bogosort is probably the lowest-description-length sorting procedure).
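For instance, bogosort fits in a handful of lines, but its expected runtime grows factorially with the input size; a minimal sketch, with the usual caveat that no one would run this in practice:

```python
import random

def is_sorted(xs):
    return all(a <= b for a, b in zip(xs, xs[1:]))

def bogosort(xs):
    # The "shortest-description" search: shuffle until sorted.
    # Expected number of shuffles is n!, so the compact description
    # is paid for entirely in compute.
    xs = list(xs)
    while not is_sorted(xs):
        random.shuffle(xs)
    return xs

print(bogosort([3, 1, 2]))  # [1, 2, 3], eventually
```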
In contrast, the way modern deep learning does program search is "I can tweak the parameters of a massively parallel computation involving at most 50 serial steps, 100 million multiply-adds per step, and 100 kilobytes of memory to store intermediate results; what parameters get me the best approximate solution given this compute budget?". Here it seems much less clear that the resulting program is going to resemble search; it might well be a diffuse bundle of heuristics, or something else that humans don't have good intuitions for.
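A minimal sketch of that contrast (toy sizes, not a real architecture): every setting of the parameters costs exactly the same compute, so there is no budget to be recovered by encoding a search instead of a policy.

```python
import numpy as np

def policy(params, observation):
    # A fixed feedforward computation: the same number of serial steps and
    # the same number of multiply-adds no matter what function the
    # parameters happen to encode.
    h = observation
    for W, b in params:                # params is a list of (weight, bias) pairs
        h = np.maximum(0.0, h @ W + b)
    return h

# Gradient descent tweaks the values inside `params`; it never gets to trade
# a longer-running program for a shorter description.
rng = np.random.default_rng(0)
params = [(0.01 * rng.normal(size=(64, 64)), np.zeros(64)) for _ in range(4)]
print(policy(params, rng.normal(size=64)).shape)  # same cost for every parameter setting
```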
What's going on here? Once we stop assuming compute is infinite and instead fix a budget, you can no longer rely on the "compression trick" that lets you replace an actual policy with a search for the policy. Search isn't free; it costs you compute and memory that could have been spent running a better policy, and gradient descent doesn't select for low Kolmogorov complexity nearly enough (if at all) to compensate.
Some caveats to this conclusion:
- This argument only applies to inner search, rather than any form of inner optimization. In particular, gradient descent might select for programs that perform control, even hierarchical control (see selection vs control [LW · GW]). It could select for other features of optimization such as "goal-directed behavior" and "internal representations of outcomes", like a thermostat that diffs the current temperature with the desired temperature to generate the next action. (Thanks to Richard Ngo for helping me clarify this point.)
- You might still select for search in certain domains with few exploitable regularities (like RL agents in randomized mazes) where there are no shortcuts so there's little or no compute penalty to doing search. In such domains, the simplicity bias could prove decisive and the model will indeed learn an inner search process.
- The standard example of mesa-optimization is humans w.r.t. evolution, but note that evolution also selects for low Kolmogorov complexity because of the genomic bottleneck. Current gradient descent doesn't seem to have such a bottleneck (on the contrary, models are massively over-parametrized), but it's possible that future methods for training neural nets will have one (e.g. because of bandwidth constraints in massively-federated training setups).
- Added August 16: John Wentworth responds [LW · GW] with two more arguments for why we'll select for inner search, the existence of general-purpose heuristics and the recursive structure of search.
(Speculative) Implications for interpretability and safety
If most of the danger of inner optimization comes from search and planning[1], this line of reasoning has some unexpected implications. In AI safety we generally think of large black-box models as dangerous, and small interpretable models as safer. But the bias towards explicit search with simple objectives turns out to result from a human bias towards mechanistically interpretable (i.e. low-Kolmogorov-complexity) programs, a bias that often comes at significant computational cost.
If what we really care about when we talk about interpretability is predicting generalization errors, mechanistic interpretability may not be what we want. There is hideous, unpredictable complexity lurking in mechanistically simple programs such as the 3n+1 algorithm and the Rule 110 cellular automaton. The black-box models that gradient descent converges on may turn out to be simpler and more predictable in the sense that really matters. An "agent AI" trained end-to-end with gradient descent may be safer in practice than a "tool AI" that generates C code given a problem description, because the latter shares human programmers' bias towards explicit search over simple, brittle objectives. Which is where much of the danger lies [LW · GW].
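As a small illustration, the 3n+1 map is about as mechanistically transparent as programs get, yet whether the loop below halts for every positive integer is a famous open problem:

```python
def collatz_steps(n):
    # Mechanistically trivial, behaviorally unpredictable: no one has proven
    # that this loop terminates for every positive integer n.
    steps = 0
    while n != 1:
        n = 3 * n + 1 if n % 2 else n // 2
        steps += 1
    return steps

print([collatz_steps(n) for n in range(1, 11)])  # wildly irregular step counts
```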
[1] Unclear how true this is - in the limit of capabilities, both search and control processes could be lethal. But intuitively, for a fixed level of capabilities, it seems like search processes are more brittle and likely to mis-generalize catastrophically than control processes.
23 comments
Comments sorted by top scores.
comment by tailcalled · 2022-08-13T09:34:41.276Z · LW(p) · GW(p)
Does it need to be either "pure search" or "no search"?
My expectation would be that in the limit it learns a ton of heuristics about what usually works, and learns to do a much more efficient search using those heuristics. This would especially be the case if e.g. capabilities researchers give the nets extra options to speed up the search (for instance, it's totally possible to embed a few steps of gradient descent into the inference of a neural network, since gradient descent is differentiable - I don't know if that will eventually be shown to help improve capabilities, but if it will, capabilities researchers would presumably do it).
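A minimal sketch of the idea, assuming a toy quadratic inner objective (the specifics here are illustrative, not anything current systems are known to do): each unrolled step is an ordinary differentiable operation, so an autodiff framework could backpropagate through the embedded optimizer.

```python
import numpy as np

def unrolled_inner_gd(A, b, steps=5, lr=0.1):
    # A few steps of gradient descent embedded in a forward pass, on the
    # inner objective 0.5 * z^T A z - b^T z (whose gradient is A z - b).
    # Every operation is differentiable, so gradients from the output can
    # flow back into A and b, which a network could itself produce.
    z = np.zeros_like(b)
    for _ in range(steps):
        z = z - lr * (A @ z - b)
    return z

A = np.eye(3) * 2.0
b = np.array([1.0, -1.0, 0.5])
print(unrolled_inner_gd(A, b))  # a partially optimized z after only 5 steps
```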
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-13T20:10:01.434Z · LW(p) · GW(p)
Agreed that "search" is not a binary but more like a continuum, where we might call a program more "search-like" if it is enumerating possible actions and evaluating their consequences, and less "search-like" if it is directly mapping representations of inputs to actions. The argument in this post is that gradient descent (unlike evolution, and unlike human programmers) doesn't select much for "search-like" programs. If we take depth-first search as a central example of search, and a thermostat as the paradigmatic non-search program, gradient descent will select for something more like the thermostat.
it's totally possible to embed a few steps of gradient descent into the inference of a neural network, since gradient descent is differentiable
Agreed, and networks may even be learning something like this already! But in my ontology I wouldn't call an algorithm that performs, say, 5 steps of gradient descent over a billion-parameter space and then outputs an action very "search-like"; the "search" part is generating a tiny fraction of the optimization pressure, relative to whatever process sets up the initial state and the error signal.
Maybe this is just semantics, because for high levels of capability search and control are not fundamentally different (what you're pointing to with "much more efficient search" - an infinitely efficient search is just optimal control, you never even consider suboptimal actions!). But it does seem like for a fixed level of capabilities search is more brittle, somehow, and more likely to misgeneralize catastrophically.
comment by Ramana Kumar (ramana-kumar) · 2022-08-19T06:38:41.069Z · LW(p) · GW(p)
How does this square with Are minimal circuits deceptive? [AF · GW]
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-19T19:42:00.322Z · LW(p) · GW(p)
Mostly orthogonal:
- Evan's post argues that if search is computationally optimal (in the sense of being the minimal circuit) for a task, then we can construct a task where the minimal circuit that solves it is deceptive.
- This post argues against (a version of) Evan's premise: search is not in fact computationally optimal in the context of modern tasks and architectures, so we shouldn't expect gradient descent to select for it.
Other relevant differences are
- gradient descent doesn't actually select for low time complexity / minimal circuits; it holds time & space complexity fixed, while selecting for low L2 norm. But I think you could probably do a similar reduction for L2 norm as Evan does for minimal circuits. The crux is in the premise.
- I think Evan is using a broader definition of search than I am in this post, closer to John Wentworth's definition of search as "general problem solving algorithm".
- Evan is doing worst-case analysis (can we completely rule out the possibility of deception by penalizing time complexity?) whereas I'm focusing on the average or default case.
comment by johnswentworth · 2022-08-14T01:34:09.044Z · LW(p) · GW(p)
What exactly do you think the word "search" means?
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-15T00:54:17.706Z · LW(p) · GW(p)
See my answer to tailcalled:
a program is more "search-like" if it is enumerating possible actions and evaluating their consequences
I'm curious if you mean something different by search when you say that we're likely to find policies that look like an "explicit search process + simple objective(s)".
↑ comment by johnswentworth · 2022-08-15T01:17:52.179Z · LW(p) · GW(p)
Yeah, that's definitely not what I mean by search (nor what I think others mean by search, in the context of AI and inner agents).
Roughly speaking, a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective. For instance, a gradient descent algorithm takes in an objective, and returns a point which scores well on the objective, for a very broad class of possible objectives; gradient descent is therefore a search method.
Enumerating possible actions and evaluating their consequences is one way to do general search, but it's wildly inefficient; I would typically refer to that as "brute force search". Gradient descent does better by leveraging backprop and gradients; approximately none of the algorithmic work done by gradient descent comes from direct evaluation of the consequences of actions. And there are many other tricks one can use too - like memoization on subsearches, or A*-style heuristic search, or (one meta-level up from A*) relaxation-based methods to discover heuristics. The key point is that these tricks are all very general purpose: they work on a very wide variety of search problems, and therefore produce general-purpose search algorithms which are more efficient than brute force (at least on realistic problems).
More advanced general-purpose search methods seem to rely relatively little on enumerating possible actions and evaluating their consequences. By the time we get to human-level search capabilities, we see human problem-solvers spend most of their effort on nontrivial problems thinking about subproblems, abstractions and analogies rather than thinking directly about particular solutions.
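To make that type signature concrete, a rough sketch (a toy quadratic stands in for the "broad class of objectives"): the objective comes in as an argument, and the same procedure works unchanged across problems.

```python
import numpy as np

def gradient_descent_search(grad, x0, steps=1000, lr=0.01):
    # General-purpose search in this sense: given (the gradient of) an
    # objective, return a point that scores well on it. Nothing here is
    # specific to any one problem.
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad(x)
    return x

# Any differentiable objective will do, e.g. a shifted quadratic:
grad = lambda x: 2.0 * (x - 3.0)
print(gradient_descent_search(grad, x0=np.zeros(4)))  # ~[3, 3, 3, 3]
```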
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-15T03:48:14.099Z · LW(p) · GW(p)
I agree that A* and gradient descent are central examples of search; for realistic problems these algorithms typically evaluate the objective on millions of candidates before returning an answer.
In contrast, human problem solvers typically do very little state evaluation - perhaps evaluating a few dozen possibilities directly, and relying (as you said) on abstractions and analogies instead. I would call this type of reasoning "not very search-like".
On the far end we have algorithms like Gauss-Jordan elimination, which just compute the optimal solution directly without evaluating any possibilities. Calling them "search algorithms" seems quite strange.
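A rough sketch of the contrast (no pivoting, illustrative only): the direct solve never proposes or scores a candidate solution, it just transforms the system until the answer falls out.

```python
import numpy as np

def gauss_jordan_solve(A, b):
    # Directly computes the solution of A x = b by row reduction;
    # no candidate solutions are ever generated or evaluated.
    M = np.hstack([A.astype(float), b.reshape(-1, 1).astype(float)])
    n = len(b)
    for i in range(n):
        M[i] = M[i] / M[i, i]                 # normalize pivot (assumes nonzero pivots)
        for j in range(n):
            if j != i:
                M[j] = M[j] - M[j, i] * M[i]  # eliminate column i from other rows
    return M[:, -1]

A = np.array([[2.0, 1.0], [1.0, 3.0]])
b = np.array([3.0, 5.0])
print(gauss_jordan_solve(A, b))  # [0.8, 1.4], matching np.linalg.solve(A, b)
```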
a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective.
This appears to be a description of any optimization process, not just search - in particular it would include Gauss-Jordan elimination. I guess your ontology has "optimization" and "search" as synonyms, whereas mine has search as a (somewhat fuzzy) proper subset of optimization. Anyways, to eliminate confusion I'll use Abram Demski's term selection [LW · GW] in future. Also added a terminology note to the post.
↑ comment by johnswentworth · 2022-08-15T05:41:52.493Z · LW(p) · GW(p)
My ontology indeed has search and a narrow notion of optimization as approximately synonyms; they differ only somewhat in type signature and are easily interchangeable. Conceptually, both take in an objective, and return something which scores highly on the objective. (This is narrower than e.g. Flint's notion of "optimization" [LW · GW]; in that ontology it might be called a "general-purpose optimizer" instead.)
Anyway, insofar as any of this is relevant to the arguments for mesa-optimization, it's the notion of search/optimization as general problem solving which applies there.
↑ comment by Quintin Pope (quintin-pope) · 2022-08-15T04:26:45.672Z · LW(p) · GW(p)
Roughly speaking, a general search process is something which takes in a specification of some problem or objective (from a broad class of possible problems/objectives), and returns a plan which solves the problem or scores well on the objective.
This description of “search” seems far too broad. E.g., it seems to include things like lookup tables, and AFAICT, literally every way to solve problems or satisfy objectives?
Seems like “search” should mean something more specific than “problem solving”.
↑ comment by johnswentworth · 2022-08-15T05:35:34.696Z · LW(p) · GW(p)
It excludes methods specific to a small number of problems. Search is about general problem solving.
Anyway, IIUC this is how the term "search" has historically been used in AI, it is also the notion of "search" which is relevant to the arguments for mesaoptimization [? · GW] in Risks From Learned Optimization, it is also the notion of search which is relevant to general intelligence being A Thing, it is the notion of search which is relevant to the possibility of an ML system suddenly grokking "general search" and thereby undergoing a rapid increase in capabilities, etc.
comment by Vladimir_Nesov · 2022-08-13T04:37:16.351Z · LW(p) · GW(p)
we will select for "explicit search processes with simple objectives"
The actual argument is that small descriptions give higher parameter space volume, and so the things we find are those with short descriptions (low Kolmogorov complexity). The thing with a short description is the whole mesa-optimizer, not just its goal. This is misleading for goals because low Kolmogorov complexity doesn't mean low "complexity" in many other senses, so an arbitrary goal with low Kolmogorov complexity would actually be much more "complicated" than the intended base objective. In particular, it probably cares about the real world outside an episode and is thus motivated to exhibit deceptively aligned behavior.
I think "explicit search" is similarly misleading, because most short programs (around a given behavior) are not coherent decision theoretic optimizers. Search would only become properly explicit after the mesa-optimizer completes its own agent foundations alignment research program and builds itself a decision theory based corrigible successor. A mesa-optimizer only needs to pursue the objective of aligned behavior (or whatever it's being selected for), and whether that tends to be its actual objective or the instrumental objective of deceptive alignment is a toss-up, either would do for it to be selected. But in either case, it doesn't need to be anywhere ready to pursue a coherent goal of its own (as aligned behavior is also not goal directed behavior in a strong sense).
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-13T04:51:49.352Z · LW(p) · GW(p)
Agreed on "explicit search" being a misleading phrase, I'll replace it with just "search" when I'm referring to learned programs.
small descriptions give higher parameter space volume, and so the things we find are those with short descriptions
I don't think I understand this. GPT-3 is a thing we found, which has 175B parameters, what is the short description of it?
↑ comment by Vladimir_Nesov · 2022-08-13T05:00:24.734Z · LW(p) · GW(p)
I mean relatively short, as in the argument for why overparametrized models generalize. They still do get to ~memorize all training data, but anything else comes at a premium, reduces probability of getting selected for models whose behavior depends on those additional details. (This use of "short" as meaning "could be 500 gigabytes" was rather sloppy/misleading of me, in a comment about sloppy/misleading use of words...)
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-13T05:14:22.418Z · LW(p) · GW(p)
Ah I think that's the crux - I believe the overparametrized regime finds generalizing models because gradient descent finds functions that have low function norm, not low description length. I forget the paper that showed this for neural nets but here's a proof for logistic regression.
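A hedged numerical sketch of that intuition in the simplest overparametrized setting, linear regression with more parameters than data points (this is not the logistic-regression result itself): gradient descent initialized at zero converges to the minimum-L2-norm interpolant.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 50                      # 5 data points, 50 parameters: overparametrized
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Gradient descent on squared error, starting from zero.
w = np.zeros(d)
for _ in range(20_000):
    w -= 0.005 * X.T @ (X @ w - y)

# The minimum-L2-norm interpolating solution, in closed form.
w_min_norm = X.T @ np.linalg.solve(X @ X.T, y)

print(np.allclose(X @ w, y, atol=1e-4))       # True: both interpolate the data
print(np.allclose(w, w_min_norm, atol=1e-4))  # True: gradient descent found the min-norm one
```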
↑ comment by Vladimir_Nesov · 2022-08-13T05:41:02.181Z · LW(p) · GW(p)
I'm thinking of a setting where shortest descriptions of behavior determine sets of models that exhibit matching behavior (possibly in a coarse-grained way, so distances in behavior space are relevant). This description-model relation could be arbitrarily hard to compute, so it's OK for shortest descriptions to be shortest programs or something ridiculous like that. This gives a partition of the model/parameter space according to the mapping from models to shortest descriptions of their behavior. I think shorter shortest descriptions (simpler behaviors) fill more volume in the parameter/model space, have more models whose behavior is given by those descriptions (this is probably the crux; e.g. it's false if behaviors are just models themselves and descriptions are exact).
Gradient descent doesn't interact with descriptions or the description-model relation in any way, but since it selects models ~based on behavior, and starts its search from a random point in the model space, it tends to select behaviors from larger elements of the partition of the space of models that correspond to simpler behaviors with shorter shortest descriptions.
This holds at every step of gradient descent, not just when it has already learned something relevant. The argument is that whatever behavior is selected, it is relatively simple, compared to other behaviors that could've been selected by the same selection process. Further training just increases the selection pressure.
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-13T07:01:32.567Z · LW(p) · GW(p)
Yeah I think you need some additional assumptions on the models and behaviors, which you're gesturing at with the "matching behaviors" and "inexact descriptions". Otherwise it's easy to find counterexamples: imagine the model is just a single N x N matrix of parameters, then in general there is no shorter description length of the behavior than the model itself.
Yes there are non-invertible (you might say "simpler") behaviors which each occupy more parameter volume than any given invertible behavior, but random matrices are almost certainly invertible so the actual optimization pressure towards low description length is infinitesimal.
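A quick illustrative check of the "almost certainly invertible" claim for Gaussian random matrices:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 100
full_rank = [np.linalg.matrix_rank(rng.normal(size=(N, N))) == N for _ in range(20)]
print(all(full_rank))  # True: every sampled matrix was invertible
```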
comment by Lauro Langosco · 2022-08-15T21:11:31.125Z · LW(p) · GW(p)
I think you overestimate the importance of the genomic bottleneck. It seems unlikely that humans would have been as successful as we are if we were... the alternative to the kind of algorithm that does search, which you don't really describe.
Performing search to optimize an objective seems really central to our (human's) capabilities, and if you want to argue against that I think you should say something about what an algorithm is supposed to look like that is anywhere near as capable as humans but doesn't do any search.
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-15T23:50:31.538Z · LW(p) · GW(p)
I disagree that performing search is central to human capabilities relative to other species. The cultural intelligence hypothesis seems much more plausible: humans are successful because our language and ability to mimic allow us to accumulate knowledge and coordinate at massive scale across both space and time. Not because individual humans are particularly good at thinking or optimizing or performing search. (Not sure what the implications of this are for AI).
You're right though, I didn't say much about alternative algorithms other than point vaguely in the direction of hierarchical control. I mostly want to warn people not to reason about inner optimizers the way they would about search algorithms. But if it helps, I think AlphaStar is a good example of an algorithm that is superhuman in a very complex strategic domain but is very likely not doing anything like "evaluating many possibilities before settling on an action". In contrast to AlphaZero (with rollouts), which considers tens of thousands of positions before selecting an action. AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search so it might have similar mis-generalization properties?
↑ comment by Lauro Langosco · 2022-08-16T15:13:20.044Z · LW(p) · GW(p)
(Note that I'm not making a claim about how search is central to human capabilities relative to other species; I'm just saying search is useful in general. Plausibly also for other species, though it is more obvious for humans)
From my POV, the "cultural intelligence hypothesis" is not a counterpoint to importance of search. It's obvious that culture is important for human capabilities, but it also seems obvious to me that search is important. Building printing presses or steam engines is not something that a bundle of heuristics can do, IMO, without gaining those heuristics via a long process of evolutionary trial-and-error. And it seems important that humans can build steam engines without generations of breeding better steam-engine-engineers.
Re AlphaStar and AlphaZero: I've never played Starcraft, so I don't have good intuitions for what capabilities are needed. But on the definitions of search that I use, the AlphaZero policy network definitely performs search. In fact out of current systems it's probably the one that most clearly performs search!
...Now I'm wondering whether our disagreement just comes from having different definitions of search in mind. Skimming your other comments above, it seems like you take a more narrow view of search = literally iterating through solutions and picking a good one. This is fine by me definitionally, but I don't think the fact that models will not learn search(narrow) is very interesting for alignment, or has the implications that you list in the post? Though ofc I might still be misunderstanding you here.
↑ comment by Ivan Vendrov (ivan-vendrov) · 2022-08-16T15:56:04.881Z · LW(p) · GW(p)
Yeah it's probably definitions. With the caveat that I don't mean the narrow "literally iterates over solutions", but roughly "behaves (especially off the training distribution) as if it's iterating over solutions", like Abram Demski's term selection. [LW · GW]
↑ comment by Vladimir_Nesov · 2022-08-16T09:44:00.567Z · LW(p) · GW(p)
AlphaZero (just the policy network) I'm more confused about... I expect it still isn't doing search, but it is literally trained to imitate the outcome of a search so it might have similar mis-generalization properties?
This suggests that the choice of decision theory that amplifies a decision making model (in the sense of IDA [LW · GW]/HCH [LW(p) · GW(p)], or just the way MCTS is used in training AlphaZero) might influence robustness of its behavior far off-distribution, even if its behavior around the training distribution is not visibly sensitive to choice of decision theory used for amplification.
Though perhaps this sense of "robustness" is not very appropriate, and a better one should be explicitly based on reflection/extrapolation from behavior in familiar situations [LW(p) · GW(p)], with the expectation that all models fail to be robust sufficiently far off-distribution (in the crash space [LW(p) · GW(p)]), and new models must always be prepared in advance of going there.
↑ comment by Noosphere89 (sharmake-farah) · 2022-08-16T00:08:37.116Z · LW(p) · GW(p)
My thinking is that one of the biggest reasons humans managed to dominate is basically 3x more brainpower combined with ways to get rid of the heat necessary to support brainpower, which requires sweating all over the body.
Essentially it's the scaling hypothesis applied to biological systems.
And since intelligence can be used for any goal, it's not surprising that intelligence's main function was cultural.