No free lunch theorem is irrelevant

post by Catnee (Dmitry Savishchev) · 2022-10-04T00:21:54.790Z · LW · GW · 7 comments

I often see people cite the "no free lunch theorem" as a counterargument to the possibility of AGI. I think this is a very bad argument. The theorem is formulated for the problem of predicting uniformly random data. Just open Wikipedia! In my opinion, this is another example of people using a result from mathematics simply because it sounds relevant, without understanding its original meaning and limitations. What makes them think it applies to AGI?
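
For reference, the optimization version of the theorem (Wolpert and Macready, 1997) says, roughly, that when you sum over all possible objective functions, every black-box search algorithm produces the same distribution of observed cost values:

```latex
% No-free-lunch theorem for search (Wolpert & Macready, 1997):
% for any two algorithms a_1, a_2 and any number of evaluations m,
\sum_{f} P\!\left(d_m^y \mid f, m, a_1\right) \;=\; \sum_{f} P\!\left(d_m^y \mid f, m, a_2\right)
% where d_m^y is the sequence of cost values observed after m evaluations,
% and the sum runs over all functions f : X \to Y.
```

That sum over all f is the uniform-distribution assumption doing all the work.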

The argument sounds something like "well, since there is such a theorem, we will never be able to create an algorithm that can do many tasks." This is complete nonsense. To tie the theorem to real systems at all, you have to add a caveat: a restriction on some physical measure of the model, such as the number of parameters, disk space, total equipment weight, training time, total budget, whatever. In that case, well, yes: if we have a general algorithm that solves many problems, then for any single problem we can probably use the same resources to build an algorithm that performs better. The problem is that nothing says how much better. Any improvement counts.

This is no longer the original theorem, but I think it is close to what people have in mind when they say "no free lunch theorem".

If our general algorithm looks "kind of meh" compared to specialized algorithms, what prevents us from increasing the resource budget and bundling all our specialized algorithms into one big one? Yes, the new one could in turn be split back into specialized algorithms, but sooner or later we hit diminishing returns. You cannot do arithmetic more accurately than 100% of the time, you cannot predict a protein's structure more accurately than the structure it actually turns out to have, and so on. By increasing the available resources, distilling and amplifying, we can make a general-purpose algorithm as good as we like at any task: as close to the maximum of the metric as we need.
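
To make the uniform-distribution point concrete, here is a minimal sketch (a toy example; the setup and names are invented purely for illustration): on a four-point search space with binary objective values, two fixed-order searches tie exactly when averaged over all possible objective functions, and stop tying as soon as the distribution over functions has any structure.

```python
import itertools
import statistics

X = range(4)   # a tiny search space
Y = (0, 1)     # binary objective values

def evals_to_find_max(order, f):
    """How many evaluations a fixed-order black-box search needs
    before it sees the best value present in f."""
    best = max(f)
    for i, x in enumerate(order, start=1):
        if f[x] == best:
            return i
    return len(order)

ascending = list(X)              # evaluates 0, 1, 2, 3
descending = list(reversed(X))   # evaluates 3, 2, 1, 0
all_functions = list(itertools.product(Y, repeat=len(X)))  # all 16 f: X -> Y

# Uniform distribution over ALL functions: the two searches tie exactly.
for name, order in (("ascending", ascending), ("descending", descending)):
    avg = statistics.mean(evals_to_find_max(order, f) for f in all_functions)
    print(f"uniform prior,    {name}: {avg:.3f}")

# A structured distribution: only functions whose maximum appears in the
# first half of the space. Now the ascending search wins on average.
structured = [f for f in all_functions if max(f) in f[:2]]
for name, order in (("ascending", ascending), ("descending", descending)):
    avg = statistics.mean(evals_to_find_max(order, f) for f in structured)
    print(f"structured prior, {name}: {avg:.3f}")
```

Real-world tasks are of the second kind: the distribution of problems we actually care about is nothing like uniform over all possible functions.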

7 comments

comment by Donald Hobson (donald-hobson) · 2022-10-04T12:50:20.898Z · LW(p) · GW(p)

If our general algorithm looks "kind of meh" compared to specialized algorithms, what prevents us from increasing the resource budget and bundling all our specialized algorithms into one big one?

Consider an algorithm specialized to chess, as compared to a more general algorithm that plays arbitrary games.

The chess-specific algorithm can have a large advantage, for the same amount of compute, by containing precomputed data specific to chess, for example a table of good opening moves.

All that precomputation must have happened somewhere, possibly in human brains. So there exists a general algorithm that could be told the rules of chess, do a large amount of precomputation, and then play equally good chess with the same resources as the specialized chess algorithm.

No algorithm can contain precomputed values for all possible games, as there are infinitely many / exponentially many of those. 

Replies from: Dmitry Savishchev
comment by Catnee (Dmitry Savishchev) · 2022-10-04T16:00:34.010Z · LW(p) · GW(p)

I don't understand how this contradicts anything. As soon as you relax some of the physical constraints, you can start piling up precomputation/memory/budget/volume/whatever. If you spend all of it on one task, then, well, you should get higher performance than any approach that doesn't focus on that one thing. Or you can build an algorithm that outperforms anything you've made before, given enough of any unconstrained resource.

Precomputation is just another resource.
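
A minimal sketch of that point (a toy, with tic-tac-toe standing in for chess, and all names below invented for illustration): a general minimax solver, told only the rules, can spend its precomputation budget to build the same kind of lookup table a specialized program would ship with, after which choosing a move is effectively a table lookup.

```python
from functools import lru_cache

LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
         (0, 3, 6), (1, 4, 7), (2, 5, 8),
         (0, 4, 8), (2, 4, 6)]

def winner(board):
    """Return 'X' or 'O' if someone has three in a row, else None."""
    for a, b, c in LINES:
        if board[a] != "." and board[a] == board[b] == board[c]:
            return board[a]
    return None

@lru_cache(maxsize=None)
def solve(board, player):
    """Minimax value of `board` with `player` to move (+1 good for X, -1 for O)."""
    w = winner(board)
    if w == "X":
        return 1
    if w == "O":
        return -1
    moves = [i for i, c in enumerate(board) if c == "."]
    if not moves:
        return 0  # draw
    nxt = "O" if player == "X" else "X"
    values = [solve(board[:i] + player + board[i + 1:], nxt) for i in moves]
    return max(values) if player == "X" else min(values)

def best_move(board, player):
    """Optimal move; once the cache is filled this is effectively a table lookup."""
    nxt = "O" if player == "X" else "X"
    moves = [i for i, c in enumerate(board) if c == "."]
    choose = max if player == "X" else min
    return choose(moves, key=lambda i: solve(board[:i] + player + board[i + 1:], nxt))

# The "precomputation budget": solving the empty board caches every reachable
# position, i.e. the general solver builds its own opening book from the rules.
solve("." * 9, "X")
print(best_move("." * 9, "X"))  # subsequent move choices are cheap lookups
```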

comment by Morpheus · 2022-10-04T01:27:39.775Z · LW(p) · GW(p)

Reading Wolpert's paper (the no-free-lunch theorem for optimization; Sci-Hub is your friend), I got the impression that he may himself be in large part responsible for people misunderstanding the implications of his theorem, because he does not understand them himself:

First, if the practitioner has knowledge of problem characteristics, but does not incorporate them into the optimization algorithm, then P(f) is effectively uniform. (Recall that P(f) can be viewed as a statement concerning the practitioner's choice of optimization algorithms.) In such a case, the NFL theorems establish that there are no formal assurances that the algorithm chosen will be at all effective.

Wolpert does not actually justify this assumption in his paper, but here are some of the interesting results he derives from it:

In Section VII we present an introduction to the alternative approach to the formal analysis of optimization in which problems are held fixed and one looks at properties across the space of algorithms. Since these results hold in general, they hold for any and all optimization problems and thus are independent of the types of problems one is more or less likely to encounter in the real world. In particular, these results show that there is no a priori justification for using a search algorithm’s observed behavior to date on a particular cost function to predict its future behavior on that function.

I think in the real world you can use induction. I am going to stick my neck out and claim that Wolpert uses induction himself.

For an in-depth critique of the weird assumptions in the paper, see here.

comment by TW123 (ThomasWoodside) · 2022-10-04T01:11:41.239Z · LW(p) · GW(p)

I think it's tangentially relevant in certain cases. Here's something I wrote in another context; I think it's useful for understanding what we really mean when we say "intelligence."

We consider humans intelligent not because they do better on all possible optimization problems (they don’t, due to the no free lunch theorem), but because they do better on a subset of problems that are actually encountered in the real world. For instance, humans have particular cognitive architectures that allow us to understand language well, and language is something that humans need to understand. This can be seen even more clearly with “cognitive biases”, which are errors in logical reasoning that are nevertheless advantageous for surviving in the ancestral human world. Recently, people have tried to rid themselves of such biases, because of their belief that they no longer help in the modern world, a perfect example of the fact that human intelligence is highly domain-dependent.

We can't do arithmetic nearly as well as machines, but we don't view them as intelligent, because we can do many more things than they can, more flexibly and better. The machine might reply that it can do many things as well: it can quickly multiply 3040 by 2443, but also 42323 by 3242, and 379 by 305, and an absolutely huge number of other such calculations. The human might respond that these are all "just multiplication problems"; the machine might say that human problems are "just thriving-on-planet-earth problems". When we say intelligence, we essentially just mean "ability to excel at thriving-on-planet-earth problems," which requires knowledge and cognitive architectures that are specifically good at thriving on planet earth. Thinking too much about "generality" tends to confuse; instead, consider performance on thriving-on-planet-earth problems, or particular subsets of those problems.

Agree it's mostly not relevant.

Replies from: Morpheus
comment by Morpheus · 2022-10-04T01:53:12.504Z · LW(p) · GW(p)

I think I disagree that generality is irrelevant (it only seems irrelevant because NFL theorems use unreasonable priors). If your problem has any structure at all, i.e. your environment is not maximally random, then you can use Occam's razor and make sense of your environment. No need for the "real world". The paper on universal intelligence is great, by the way, if formalizing intelligence seems interesting.

Replies from: Morpheus
comment by Morpheus · 2022-10-04T02:00:25.417Z · LW(p) · GW(p)

To spell out how this applies to your comment: if you use a reasonable prior, then intelligence is well-defined, and some things are smarter than others. For example, GPT-3 is smarter than GPT-2.
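
For reference, this is exactly what the universal-intelligence paper mentioned above (Legg and Hutter) formalizes: an agent π is scored by its expected reward across all computable environments, weighted by a simplicity prior rather than a uniform one.

```latex
% Legg-Hutter universal intelligence of an agent \pi:
\Upsilon(\pi) = \sum_{\mu \in E} 2^{-K(\mu)} \, V_{\mu}^{\pi}
% E:            the set of computable environments
% K(\mu):       the Kolmogorov complexity of environment \mu
% V_{\mu}^{\pi}: the agent's expected total reward in \mu
```

Because the weights 2^{-K(μ)} are far from uniform, agents can be strictly ranked by Υ, which is the sense in which some things are smarter than others.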