"The Bitter Lesson", an article about compute vs human knowledge in AI

post by the gears to ascension (lahwran) · 2019-06-21T17:24:50.825Z · LW · GW · 14 comments

This is a link post for http://www.incompleteideas.net/IncIdeas/BitterLesson.html

The Bitter Lesson

Rich Sutton

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

In computer chess, the methods that defeated the world champion, Kasparov, in 1997, were based on massive, deep search. At the time, this was looked upon with dismay by the majority of computer-chess researchers who had pursued methods that leveraged human understanding of the special structure of chess. When a simpler, search-based approach with special hardware and software proved vastly more effective, these human-knowledge-based chess researchers were not good losers. They said that "brute force" search may have won this time, but it was not a general strategy, and anyway it was not how people played chess. These researchers wanted methods based on human input to win and were disappointed when they did not.

A similar pattern of research progress was seen in computer Go, only delayed by a further 20 years. Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale. Also important was the use of learning by self play to learn a value function (as it was in many other games and even in chess, although learning did not play a big role in the 1997 program that first beat a world champion). Learning by self play, and learning in general, is like search in that it enables massive computation to be brought to bear. Search and learning are the two most important classes of techniques for utilizing massive amounts of computation in AI research. In computer Go, as in computer chess, researchers' initial effort was directed towards utilizing human understanding (so that less search was needed) and only much later was much greater success had by embracing search and learning.

In speech recognition, there was an early competition, sponsored by DARPA, in the 1970s. Entrants included a host of special methods that took advantage of human knowledge---knowledge of words, of phonemes, of the human vocal tract, etc. On the other side were newer methods that were more statistical in nature and did much more computation, based on hidden Markov models (HMMs). Again, the statistical methods won out over the human-knowledge-based methods. This led to a major change in all of natural language processing, gradually over decades, where statistics and computation came to dominate the field. The recent rise of deep learning in speech recognition is the most recent step in this consistent direction. Deep learning methods rely even less on human knowledge, and use even more computation, together with learning on huge training sets, to produce dramatically better speech recognition systems. As in the games, researchers always tried to make systems that worked the way the researchers thought their own minds worked---they tried to put that knowledge in their systems---but it proved ultimately counterproductive, and a colossal waste of researchers' time, when, through Moore's law, massive computation became available and a means was found to put it to good use.

In computer vision, there has been a similar pattern. Early methods conceived of vision as searching for edges, or generalized cylinders, or in terms of SIFT features. But today all this is discarded. Modern deep-learning neural networks use only the notions of convolution and certain kinds of invariances, and perform much better.

This is a big lesson. As a field, we still have not thoroughly learned it, as we are continuing to make the same kind of mistakes. To see this, and to effectively resist it, we have to understand the appeal of these mistakes. We have to learn the bitter lesson that building in how we think we think does not work in the long run. The bitter lesson is based on the historical observations that 1) AI researchers have often tried to build knowledge into their agents, 2) this always helps in the short term, and is personally satisfying to the researcher, but 3) in the long run it plateaus and even inhibits further progress, and 4) breakthrough progress eventually arrives by an opposing approach based on scaling computation by search and learning. The eventual success is tinged with bitterness, and often incompletely digested, because it is success over a favored, human-centric approach.

One thing that should be learned from the bitter lesson is the great power of general purpose methods, of methods that continue to scale with increased computation even as the available computation becomes very great. The two methods that seem to scale arbitrarily in this way are search and learning.

The second general point to be learned from the bitter lesson is that the actual contents of minds are tremendously, irredeemably complex; we should stop trying to find simple ways to think about the contents of minds, such as simple ways to think about space, objects, multiple agents, or symmetries. All these are part of the arbitrary, intrinsically-complex, outside world. They are not what should be built in, as their complexity is endless; instead we should build in only the meta-methods that can find and capture this arbitrary complexity. Essential to these methods is that they can find good approximations, but the search for them should be by our methods, not by us. We want AI agents that can discover like we can, not which contain what we have discovered. Building in our discoveries only makes it harder to see how the discovering process can be done.

14 comments


comment by gjm · 2019-06-22T01:47:02.176Z · LW(p) · GW(p)

What he says about computer go seems wrong to me.

Two things came together to make really strong computer go players.

The first was Monte Carlo tree search. Yes, that name has "search" in it, but it amounted to making programs less search-y, for two reasons. First, the original versions of MCTS did evaluations at the leaves of their trees by random full-game playouts, which took substantial time, so the trees were smaller than for earlier tree-searching programs. Second, the way MCTS propagates results up the tree moves away from the minimax paradigm and allows for the possibility that some tree nodes that seem to arise only from inferior moves (and hence should be irrelevant, minimaximally) actually have useful information in them about the position.

The second was the use of big neural networks for evaluation instead of those Monte Carlo playouts. This amounts to having a really fancy static evaluation function, and (with the use of a policy net/head as in AlphaGo and its successors) a highly selective search that chooses what moves to include in the search according to how promising they look.
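
For concreteness, here is a minimal MCTS sketch in Python (an assumed illustration of the selection/expansion/rollout/backpropagation loop on a toy Nim-like game; the game, names, and constants are mine, not anything from AlphaGo or the post):

```python
import math
import random

class Node:
    def __init__(self, stones, to_move, parent=None, move=None):
        self.stones = stones      # stones left in the pile
        self.to_move = to_move    # player about to move (0 or 1)
        self.parent = parent
        self.move = move          # the move that led to this node
        self.children = []
        self.visits = 0
        self.wins = 0.0           # wins, from the viewpoint of the player who just moved into this node

    def untried_moves(self):
        tried = {c.move for c in self.children}
        return [m for m in range(1, min(3, self.stones) + 1) if m not in tried]

    def ucb_child(self, c=1.4):
        # Pick the child maximizing UCB1: average win rate plus an exploration bonus.
        return max(self.children,
                   key=lambda ch: ch.wins / ch.visits
                                  + c * math.sqrt(math.log(self.visits) / ch.visits))

def rollout(stones, to_move):
    # Random full-game playout: players alternate taking 1-3 stones; taking the last stone wins.
    while True:
        stones -= random.randint(1, min(3, stones))
        if stones == 0:
            return to_move
        to_move = 1 - to_move

def mcts(root_stones, to_move, iterations=2000):
    root = Node(root_stones, to_move)
    for _ in range(iterations):
        node = root
        # 1. Selection: walk down through fully expanded, non-terminal nodes.
        while node.stones > 0 and not node.untried_moves():
            node = node.ucb_child()
        # 2. Expansion: add one untried child, if the node is non-terminal.
        if node.stones > 0:
            m = random.choice(node.untried_moves())
            child = Node(node.stones - m, 1 - node.to_move, parent=node, move=m)
            node.children.append(child)
            node = child
        # 3. Simulation: random playout from the new node (or score the terminal node).
        winner = rollout(node.stones, node.to_move) if node.stones > 0 else 1 - node.to_move
        # 4. Backpropagation: update visit and win counts along the path back to the root.
        while node is not None:
            node.visits += 1
            if winner != node.to_move:   # the player who moved into this node won
                node.wins += 1.0
            node = node.parent
    # Recommend the most-visited move at the root.
    return max(root.children, key=lambda ch: ch.visits).move

print(mcts(10, 0))  # how many stones to take from a pile of 10
```

The random playouts at the leaves are the part that AlphaGo-style programs later supplemented or replaced with a learned value network, as described above.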

So when Sutton says this

Enormous initial efforts went into avoiding search by taking advantage of human knowledge, or of the special features of the game, but all those efforts proved irrelevant, or worse, once search was applied effectively at scale.

that seems misleading; while it's true that the currently best go programs minimize their use of human knowledge, it's not by "applying search effectively at scale" that they do it.

Now, this does all fit into the broader pattern of "leveraging computation". Fair enough, I guess, but what else would you expect? That one can play superhumanly strong go by doing only a few thousand elementary steps of computation per move? That applying more computation wouldn't help? (There's a reason why human players, when they want to play well, take more time to think.)

comment by dxu · 2019-06-22T06:00:29.237Z · LW(p) · GW(p)
Now, this does all fit into the broader pattern of "leveraging computation". Fair enough, I guess, but what else would you expect?

It also fits into the pattern of (as you yourself pointed out) minimizing human knowledge during the construction of these programs, allowing them to tease out the features of the problem space on their own. The claim here is that as computing power increases, domain-agnostic approaches (i.e. approaches that do not require programmers to explicitly encode human-created heuristics) will increasingly outperform domain-specific approaches (which do rely on externally encoded human knowledge).

This is a non-trivial claim! For example, it wasn't at all obvious prior to January 2017 that traditional chess engines (whose static evaluation functions are filled with human-programmed heuristics) could be overtaken by a pure learning-based approach, and yet the AlphaZero paper came out and showed it was possible. If the larger claim is true, then that might suggest directions for further research--in particular, approaches that abstract away large parts of a problem may have more success than approaches that focus on the details of the problem structure.

comment by gjm · 2019-06-23T13:37:24.375Z · LW(p) · GW(p)

Yes, that's fair. (Though I'm not sure about the terms "domain-agnostic" and "domain-specific"; e.g., the AlphaZero approach seems to work well for a wide variety of board games played on "regular" boards but would need substantial modification to apply even to other board games and isn't obviously applicable at all to anything that isn't more or less a board game.)

comment by gwern · 2019-12-02T02:27:14.618Z · LW(p) · GW(p)

And MuZero, which applies to ALE very well?

comment by gjm · 2019-12-02T17:22:55.363Z · LW(p) · GW(p)

MuZero seems to deserve to be called domain-agnostic more than AlphaZero does, yes.

(For anyone else who doesn't immediately recognize the abbreviation: ALE is the "Arcade Learning Environment".)

comment by the gears to ascension (lahwran) · 2019-06-21T17:48:11.357Z · LW(p) · GW(p)

My own thoughts on the topic of AI, as related to this:

I currently expect that the first strongly general AI will be trained very haphazardly, using lots of human knowledge akin to parenting, and will still have very significant "idiot-savant" behaviors. I expect we'll need to use an approach similar to DeepMind's StarCraft AI for the first version: that is, reaching past what current tools can do individually or automatically, and hacking them together in a complex training system built for the specific purpose. However, I think at this point we're getting pretty close in terms of the capabilities of the individual components. If a transformer network were the only module in a system, but the training setup produced training data that required the transformer to become a general agent, I currently think it would be capable of the sort of abstracted, variable-based consequential planning that MIRI folks talk about being dangerous.

comment by Vaniver · 2020-12-12T16:34:21.219Z · LW(p) · GW(p)

I'm confused about whether or not it makes sense to put a linkpost in the review, but it seems worth considering for The Bitter Lesson. I think this might be the 'most important lesson of the 2010s', let alone 2019 specifically; I don't think the insight is unique to Sutton, but I think the combination of "obvious to unbiased onlookers" and "resisted by biased insiders" feels very familiar from the LW perspective, and it has often been useful as a handle for me when thinking about the future or my personal plans. It's the sort of thing that I expect I will look back on and think "yes, I was taking it seriously, but I should have been taking it even more seriously."

comment by eternal_neophyte · 2019-06-24T14:17:32.092Z · LW(p) · GW(p)

It only makes sense to talk about "search" in the context of a *search space*; and all extant search algorithms / learning methods involve searching through a comparatively simple space of structures, such as the space of weights on a deep neural network or the space of board-states in Go and Chess. Defining these spaces is pretty trivial. As we move on to attack more complex domains, such as abstract mathematics, or philosophy or procedurally generated music or literature which stands comparison to the best products of human genius, the problem of even /defining/ the search space in which you intend to leverage search-based techniques becomes massively involved.

comment by dxu · 2019-06-24T17:26:00.379Z · LW(p) · GW(p)
all extant search algorithms / learning methods involve searching through a comparatively simple space of structures, such as the space of weights on a deep neural network [...] As we move on to attack more complex domains, such as abstract mathematics, or philosophy or procedurally generated music or literature which stands comparison to the best products of human genius, the problem of even /defining/ the search space in which you intend to leverage search-based techniques becomes massively involved.

Since deep neural networks are known to be Turing-complete, I don't think it's appropriate to characterize them as a "comparatively simple" search space (unless of course you hold that "more complex domains" such as abstract mathematics, philosophy, music, literature, etc. are actually uncomputable).

comment by eternal_neophyte · 2019-06-24T20:17:02.341Z · LW(p) · GW(p)

So is brainf*ck, & like NNs bf programs are simple in the sense of being trivial to enumerate and hence search through. Defining a search space for a complex domain is equivalent to defining a subspace of BF programs or NNs which could and probably does have a highly convoluted, warped separating surface. In the context of deep learning your ability to approximate that surface is limited by your ability to encode it as a loss function.

comment by dxu · 2019-06-24T20:26:19.073Z · LW(p) · GW(p)
Defining a search space for a complex domain is equivalent to defining a subspace of BF programs or NNs which could and probably does have a highly convoluted, warped separating surface.

The task of locating points in such subspaces is what optimization algorithms (including search algorithms) are meant to address. The goal isn't to "define" your search space in such a way that only useful solutions to the problem are included (if you could do that, you wouldn't have a problem in the first place!); the point is to have a search space general enough to encompass all possible solutions, and then converge on useful solutions using some kind of optimization.
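
As a toy sketch of that point (a hypothetical example with made-up names and a made-up objective, not anything from the thread): the space below is deliberately generic, all bit strings of a fixed length, and a simple hill climber guided only by an objective function converges on the useful region without the space itself being tailored to the problem.

```python
import random

# Hypothetical objective: count how many bits match a hidden target pattern.
# In a real problem this is whatever "useful" means for the domain.
def objective(bits):
    target = [i % 2 for i in range(len(bits))]
    return sum(b == t for b, t in zip(bits, target))

def hill_climb(n=32, steps=5000):
    # The search space is deliberately generic: every bit string of length n.
    current = [random.randint(0, 1) for _ in range(n)]
    for _ in range(steps):
        candidate = current[:]
        candidate[random.randrange(n)] ^= 1          # flip one random bit
        if objective(candidate) >= objective(current):
            current = candidate                      # keep moves that don't hurt
    return current, objective(current)

best, score = hill_climb()
print(score)  # converges to 32: the optimizer finds the useful region on its own
```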

EDIT: There is an analogue in machine learning to the kind of problem you seemed to be gesturing at when you mentioned "more complex domains"--namely, the problem of how to choose a good objective function to optimize. It's true that for more abstract domains, it's harder to define a criterion (or set of criteria) that we want our optimizer to satisfy, and this is (to a first approximation) a large part of the AI alignment problem. But there's a significant difference between choosing an objective function and "defining your search space" (whatever that means), and the latter concept doesn't have much use as far as I can see.

comment by eternal_neophyte · 2019-06-24T21:11:02.753Z · LW(p) · GW(p)
for more abstract domains, it's harder to define a criterion (or set of criteria) that we want our optimizer to satisfy

Yes.

But there's a significant difference between choosing an objective function and "defining your search space" (whatever that means), and the latter concept doesn't have much use as far as I can see.

If you don't know what it means, how do you know that it's significantly different from choosing an "objective function" and why do you feel comfortable in making a judgment about whether or not the concept is useful?

In any case, to define a search space is to provide a spanning set of production rules which allow you to derive all elements in the target set. For example, Peano arithmetic provides a spanning set of rules for arithmetic computations, and hence defines (in one particular way) the set of computations a search algorithm can search through in order to find arithmetic derivations satisfying whatever property. Similarly, the rules of chess define the search space for valid board-state sequences in games of chess. For neural networks, it could be defining a set of topologies, or a set of composition rules for layering networks together; and in a looser sense a loss function induces a "search space" on network weights, insofar as it practically excludes certain regions of the error surface from the region of space any training run is ever likely to explore.
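
As a small sketch of that idea (again a hypothetical illustration, not something from the comment): a spanning set of production rules generates arithmetic expressions over a few atoms, and a brute-force search then looks through the space those rules define for an element with a desired property.

```python
# Toy "spanning set of production rules": start from atomic expressions and
# repeatedly combine pairs with an operator. The rules define the space; the
# search then looks through it for an element with a desired property.
ATOMS = ["1", "2", "3"]
OPS = ["+", "*"]

def expand(exprs):
    # Production rule: any two existing expressions joined by any operator.
    return [f"({a}{op}{b})" for a in exprs for b in exprs for op in OPS]

def search(target_value, max_depth=3):
    space = list(ATOMS)
    for _ in range(max_depth):
        for expr in space:
            if eval(expr) == target_value:   # the property we are searching for
                return expr
        space = space + expand(space)        # grow the space by applying the rules
    return None

print(search(7))  # finds an expression such as "(1+(2*3))"
```

Swapping out the atoms or the combination rule changes the search space; the search procedure itself stays the same.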

comment by dxu · 2019-06-24T22:16:41.203Z · LW(p) · GW(p)
If you don't know what it means, how do you know that it's significantly different from choosing an "objective function" and why do you feel comfortable in making a judgment about whether or not the concept is useful?

Because words tend to mean things, and when you use the phrase "define a search space", the typical meaning of those words does not bring to mind the same concept as the phrase "choose an objective function". (And the concept it does bring to mind is not very useful, as I described in the grandparent comment.)

Now, perhaps your contention is that these two phrases ought to bring to mind the same concept. I'd argue that this is unrealistic, but fine; it serves no purpose to argue whether I think you used the right phrase when you did, in fact, clarify what you meant later on:

in a looser sense a loss function induces a "search space" on network weights, insofar as it practically excludes certain regions of the error surface from the region of space any training run is ever likely to explore.

All right, I'm happy to accept this as an example of defining (or "inducing") a search space, though I would maintain that it's not a very obvious example (and I think you would agree, considering that you prefixed it with "in a looser sense"). But then it's not at all obvious what your original objection to the article is! To quote your initial comment:

It only makes sense to talk about "search" in the context of a *search space*; and all extant search algorithms / learning methods involve searching through a comparatively simple space of structures, such as the space of weights on a deep neural network or the space of board-states in Go and Chess. As we move on to attack more complex domains, such as abstract mathematics, or philosophy or procedurally generated music or literature which stands comparison to the best products of human genius, the problem of even /defining/ the search space in which you intend to leverage search-based techniques becomes massively involved.

Taken at face value, this seems to be an argument that the original article overstates the importance of search-based techniques (and potentially other optimization techniques as well), because there are some problems to which search is inapplicable, owing to the lack of a well-defined search space. This is a meaningful objection to make, even though I happen to think it's untrue (for reasons described in the grandparent comment).

But if by "lack of a well-defined search space" you actually mean "the lack of a good objective function", then it's not clear to me where you think the article errs. Not having a good objective function for some domains certainly presents an obstacle, but this is not an issue with search-based optimization techniques; it's simply a consequence of the fact that you're dealing with an ill-posed problem. Since the article makes no claims about ill-posed problems, this does not seem like a salient objection.

comment by eternal_neophyte · 2019-06-25T08:56:23.594Z · LW(p) · GW(p)
it's not a very obvious example

I honestly regret that I didn't make it as clear as I possibly could the first time around, but expressing original, partially developed ideas is not the same thing as reciting facts about well-understood concepts that have been explained and re-explained many times. Flippancy is needlessly hostile.

there are some problems to which search is inapplicable, owing to the lack of a well-defined search space

If not wholly inapplicable, then not performant, yes. Though the problem isn't that the search space is not defined at all, but that the definitions which are easiest to give are also the least helpful (to return to the previous example, in the Platonic realm there exists a brainf*ck program that implements an optimal map from symptoms to diagnoses - good luck finding it). As the original author points out, there's a tradeoff between knowledge and the need for brute force. It may be that you can have an agent synthesize knowledge by consolidating the results of a brute-force search into a formal representation which an agent can then use to tune or reformulate the search space previously given to fit some particular purposes; but this is quite a level of sophistication above pure brute force.

Edit:

this is not an issue with search-based optimization techniques; it's simply a consequence of the fact that you're dealing with an ill-posed problem

If the problems of literature or philosophy were not in some sense "ill-posed", they would also be dead subjects. The 'general' part in AGI would seem to imply some capacity for dealing with vague, partially defined ideas in useful ways.