post by [deleted]

Comments sorted by top scores.

comment by gwern · 2020-02-18T01:40:06.057Z · LW(p) · GW(p)

> Of course, most of us would be very skeptical. Not just because insights of that magnitude are rarely ever discovered by a single person or small team of people, but also because it's hard to see how there could be a simple core to image classification. The reason why you can recognize a cat is not because cats are simple things in thingspace and are therefore easily identifiable; it's because there are a bunch of things that make cats cat-like, and you understand a lot about the world. Current image classifiers recognize cats because they have learned a bunch of features: whiskers, ears, legs, fur, eyes, tails etc. and they leverage this learned knowledge to identify cats. Humans recognize cats because they have learned a bunch of information about animals, bodies, moving objects, and some domain specific information about cats, and they leverage this learned knowledge to identify cats. Either way, there's no way around the fact that you need to know a lot in order to understand what is and isn't a cat. Image classification just isn't the type of thing that should be easily compressible, because by compressing it, you lose important learned information that can be used to identify features of the world. In fact, I think we can say the same about many areas of intelligence.

According to you, the entire field of model distillation & compression, whose paradigmatic use-case is compressing image classification CNNs down to sizes like 10% or 1% (or less) and running them on your smartphone, which is not even that hard in practice, is impossible and cannot exist. That seems a little puzzling.

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2020-02-18T01:52:27.686Z · LW(p) · GW(p)

My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data. Distilled CNNs are still CNNs and thus the argument follows.

My point was that you couldn't do better than just memorizing the features that make up a cat. I should clarify that I do think deep neural networks often carry a lot of wasted information (though I believe removing some of it incurs a cost in robustness). The question is whether future insights will allow us to do much better than what we currently do.

Replies from: gwern
comment by gwern · 2020-02-18T14:24:08.586Z · LW(p) · GW(p)

> My understanding was that distilling CNNs worked more-or-less by removing redundant weights, rather than by discovering a more efficient form of representing the data.

No. That might describe sparsification, but it doesn't describe distillation, and in either case, it's shameless goalpost moving - by handwaving away all the counterexamples, you're simply no-true-Scotsmanning progress. 'Oh, Transformers? They aren't real performance improvements because they just learn "good representations of the data". Oh, model sparsification and compression and distillation? They aren't real compression because they're just getting rid of "wasted information".'
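
For concreteness, here is a minimal sketch (mine, with made-up logits) of the standard distillation objective: a separate, smaller student network is trained to match the teacher's temperature-softened output distribution, rather than having weights pruned out of the teacher.

```python
import numpy as np

def softened_probs(logits, T=4.0):
    """Temperature-softened softmax; higher T exposes more of the teacher's 'dark knowledge'."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)   # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=4.0):
    """Cross-entropy between the teacher's and the student's softened outputs.

    The student is a different, smaller network trained from scratch to imitate
    the teacher's output distribution; no weights are removed from the teacher.
    """
    p_teacher = softened_probs(teacher_logits, T)
    log_p_student = np.log(softened_probs(student_logits, T) + 1e-12)
    return float(-(p_teacher * log_p_student).sum(axis=-1).mean())

# Illustrative logits for one image and three classes:
teacher_logits = np.array([[6.0, 2.0, -1.0]])   # big teacher CNN
student_logits = np.array([[5.0, 1.5, -0.5]])   # small student network
print(distillation_loss(student_logits, teacher_logits))
```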

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2020-02-18T16:08:24.663Z · LW(p) · GW(p)

I removed this post because you convinced me it was sufficiently ill-composed. I still strongly disagree, though, because I don't see how you could agree with the person in the analogy. And again, CNNs still seem pretty good at representing data to me, and it's still unclear why model distillation disproves this.

comment by johnswentworth · 2020-02-18T06:54:47.908Z · LW(p) · GW(p)

> Suppose someone told you that they had an ingenious idea for a new algorithm that would classify images with identical performance to CNNs, but with 1% the overhead memory costs. They explain that CNNs are using memory extremely inefficiently; image classification has a simple core, and when you discover this core, you can radically increase the efficiency of your system. If someone said this, what would your reaction be?

My reaction would be "sure, that sounds like exactly the sort of thing that happens from time to time". In fact, if you replace the word "memory" with either "data" or "compute", then this has already happened with the advent of transformer architectures just within the past few years, on the training side of things.

Reducing costs for some use-case (compute, data, memory, whatever) by multiple orders of magnitude is the default thing I expect to happen when someone comes up with an interesting new algorithm. One such algorithm was backpropagation. CNNs themselves were another. It shouldn't be surprising at this point.
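
As a back-of-the-envelope illustration of the CNN case (my numbers, purely illustrative): weight sharing in a convolutional layer cuts the parameter count by many orders of magnitude relative to a fully connected layer over the same image.

```python
# Rough parameter counts for one layer on a 224x224 RGB image (illustrative only).
H, W, C_in, C_out, K = 224, 224, 3, 64, 3

# Fully connected layer mapping the whole image to 64 same-sized feature maps:
dense_params = (H * W * C_in) * (H * W * C_out)

# 3x3 convolution producing the same 64 feature maps, thanks to weight sharing:
conv_params = K * K * C_in * C_out + C_out   # weights plus biases

print(f"dense: {dense_params:,} parameters")   # ~483 billion
print(f"conv:  {conv_params:,} parameters")    # 1,792
print(f"ratio: ~{dense_params // conv_params:,}x")
```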

And search? You really want to tell me that there aren't faster reasonably-general-purpose search algorithms (i.e. about as general as backprop + gradient descent) awaiting discovery? Or that faster reasonably-general-purpose search algorithms wouldn't lead to a rapid jump in AI/ML capabilities?

Replies from: matthew-barnett, matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2020-02-18T07:15:50.893Z · LW(p) · GW(p)

I don't think it's impossible. I have wide uncertainty about timelines, and I certainly think that parts of our systems can get much more efficient; I should have made this clearer in the post. What I'm skeptical of is a catch-all general efficiency gain, stemming from some core insight into rationality, that suddenly makes systems much more efficient.

Replies from: johnswentworth
comment by johnswentworth · 2020-02-18T07:34:09.634Z · LW(p) · GW(p)

Imagine a search algorithm that finds local minima, similar to gradient descent, but has faster big-O performance than gradient descent. (For instance, an efficient roughly-n^2 matrix multiplication algorithm would likely yield such a thing, by making true Newton steps tractable on large systems - assuming it played well with sparsity.) That would be a general efficiency gain, and would likely stem from some sudden theoretical breakthrough (e.g. on fast matrix multiplication). And it is exactly the sort of thing which tends to come from a single person/team - the gradual theoretical progress we've seen on matrix multiplication is not the kind of breakthrough which makes the whole thing practical; people generally think we're missing some key idea which will make the problem tractable.
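
A toy sketch of the two update rules (mine, on an assumed quadratic objective) makes the comparison concrete: the Newton step's cost is dominated by a linear solve in the Hessian, which is exactly where a faster matrix-multiplication algorithm would bite.

```python
import numpy as np

def gradient_step(x, grad_f, lr=0.1):
    """Plain gradient descent: cheap per step, but many steps may be needed."""
    return x - lr * grad_f(x)

def newton_step(x, grad_f, hess_f):
    """True Newton step: solve H d = -g. The linear solve dominates the cost
    (roughly matrix-multiplication cost), which is why a ~n^2 matmul would
    make this practical at much larger scales."""
    return x + np.linalg.solve(hess_f(x), -grad_f(x))

# Assumed toy quadratic f(x) = 0.5 x^T A x - b^T x; Newton finds its minimum in one step.
rng = np.random.default_rng(0)
M = rng.normal(size=(50, 50))
A = M @ M.T / 50 + np.eye(50)          # symmetric positive definite
b = rng.normal(size=50)
grad_f = lambda x: A @ x - b
hess_f = lambda x: A

x0 = np.zeros(50)
print(np.linalg.norm(grad_f(newton_step(x0, grad_f, hess_f))))   # ~1e-14: done in one step
print(np.linalg.norm(grad_f(gradient_step(x0, grad_f))))         # still far from zero
```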

Replies from: matthew-barnett
comment by Matthew Barnett (matthew-barnett) · 2020-02-18T07:43:27.708Z · LW(p) · GW(p)

> some sudden theoretical breakthrough (e.g. on fast matrix multiplication)

These sorts of ideas seem possible, and I'm not willing to discard them as improbable just yet. I think a way to imagine my argument is that I'm saying, "Hold on, why are we assuming that this is the default scenario? I think we should be skeptical by default." And so, in general, counterarguments of the form "But it could be wrong because of this" aren't great, because something being possible does not imply that it's likely.

comment by Matthew Barnett (matthew-barnett) · 2020-02-18T07:04:02.427Z · LW(p) · GW(p)

> My reaction would be "sure, that sounds like exactly the sort of thing that happens from time to time".

Insights trickle in slowly. Over the long run, you can see vast efficiency improvements. But this seems unrealistically fast. You would really believe that a single person or team did something like that, which, if true, would completely and radically reshape the field of computer vision, because "it happens from time to time"?

> In fact, if you replace the word "memory" with either "data" or "compute", then this has already happened with the advent of transformer architectures just within the past few years, on the training side of things.

Transformers are impressive, but how much of their usefulness comes from being more efficient at representing the data? Not orders of magnitude more, I'd argue. OpenAI recently did this comparison to LSTMs, and this was their result.

comment by Steven Byrnes (steve2152) · 2020-02-18T16:03:36.207Z · LW(p) · GW(p)

Here are three areas where I think I have a different perspective:

  1. I think we should be careful to quantify how compute-efficient, how compressible, etc., these systems are. I agree that vision relies on lots of heuristics. Vision is not possible in 100 lines of Python code with no ML! But are we talking about megabytes of heuristics, or gigabytes, or terabytes? I don't know.

  2. It also seems to me that we can plausibly declare vision, audio, etc. to be "compute-efficient enough"—that explicit, high-level reasoning is the only bottleneck, and that the other 6 boxes are "solved problems", or close enough anyway.

  3. I think we shouldn't confuse two types of "simple": A "simple core learning algorithm" can still learn arbitrarily complicated heuristics from data.
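
As a toy illustration of point 3 (my example, not from the post): the learning algorithm below is a handful of lines, yet the classifier it produces is exactly as complicated as the data it has absorbed.

```python
import numpy as np

# The entire "learning algorithm": remember the training data, vote among neighbours.
def knn_predict(X_train, y_train, x_query, k=5):
    dists = np.linalg.norm(X_train - x_query, axis=1)   # distance to every stored example
    nearest = np.argsort(dists)[:k]                     # indices of the k closest
    return np.bincount(y_train[nearest]).argmax()       # majority vote

# The learned decision rule, by contrast, is as messy as whatever generated the labels.
rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(500, 2))
# An arbitrarily complicated labelling rule standing in for "the real world":
y_train = (np.sin(5 * X_train[:, 0]) * np.cos(7 * X_train[:, 1]) > 0).astype(int)

print(knn_predict(X_train, y_train, np.array([0.3, -0.4])))   # prediction determined by the stored data
```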

I don't think we currently know of an algorithm which would be an AGI if only we had enough compute. Yes we have search algorithms, but I don't think we have data structures that can encode the kind of arbitrary abstract understanding of the world that is needed, and I don't think we have a way to build and sort through those data structures. (But I do think there are ideas out there that are getting a lot closer.) Supposedly, the human brain does about as much calculation as a supercomputer. But we can't take a supercomputer and have it be a remote-work software engineer, doing all the things that software engineers do, like answering emails and debugging code. (Equivalently, we can't take a normal computer and have it be a software engineer running in slow-motion.)

Therefore, I think we have to say that we're not putting our computer cycles to the maximally-efficient use in creating intelligence.

The line of research I would most expect to lead to AGI soon would be algorithms explicitly trying to emulate (at a high level) the algorithms of human intelligence. While these tend to be remarkably sample efficient (just as humans can learn a new word from a single exposure), they are not particularly easy computations to do today (they're vaguely related to probabilistic programming), and they are also substantially different from ResNets and the other most popular ML approaches. Are they "compute efficient"? I wouldn't know how to answer that, because I think you can only fairly compare it to other algorithms "doing AGI", and I don't think there are any other such algorithms today. I would expect a pretty simple core learning algorithm [LW · GW], that is (at first) painfully slow and inefficient to run, and which creates a not-at-all-simple mess of heuristics as it learns about the world.
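
To make the sample-efficiency point concrete, here is a classic toy example of Bayesian concept learning (the "size principle"), not a sketch of the particular research program described above: a learner that scores hypotheses by prior times likelihood can lock onto a word's meaning from a single labelled example.

```python
# Hypothetical concept hierarchy; the names and sets are made up for illustration.
hypotheses = {
    "dalmatians": {"rex", "spot"},
    "dogs":       {"rex", "spot", "fido", "lassie"},
    "animals":    {"rex", "spot", "fido", "lassie", "tom", "tweety"},
}

def posterior(example, hypotheses):
    """P(h | example) is proportional to prior(h) * 1/|h| when the example fits h, else 0.
    A uniform prior is assumed; the 1/|h| likelihood is the 'size principle'."""
    scores = {name: (1.0 / len(members) if example in members else 0.0)
              for name, members in hypotheses.items()}
    total = sum(scores.values())
    return {name: score / total for name, score in scores.items()}

# One exposure: someone points at Rex and uses an unfamiliar word.
print(posterior("rex", hypotheses))
# The tightest hypothesis containing the example gets most of the probability mass,
# which is one way a learner can generalize from a single labelled example.
```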

For example, one fast-takeoff-ish scenario I might imagine would be that someone gets AGI-level algorithms just barely working inefficiently on small toy problems, and then gets a 1000× speedup by translating the algorithms to CUDA-C++ / FPGA / whatever (or opening a collaboration with google, or finding a better implementation, or whatever), and the result takes people by surprise—maybe even including the programmers, and certainly people outside the group.