Can you get AGI from a Transformer?

post by steve2152 · 2020-07-23T15:27:51.712Z · score: 65 (27 votes) · LW · GW · 16 comments

Contents

  Introduction
    Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.
    Background Claim 2: We can't just brush aside Background Claim 1 by appealing to the "Bitter Lesson"
  Generative-model-based information processing: Background and motivation
    Predicting inputs with generative models (a.k.a. “analysis by synthesis” or “probabilistic programming”)
    What’s so great about generative-model-based processing?
      Feature 1: Better and better results with a longer search
      Feature 2: Generative models are simpler & easier to learn than discriminative models
      Feature 3: A generative-model-based approach is more sample-efficient and out-of-distribution-generalizable
      Feature 4: Foresight, counterfactuals, deliberation, etc.
    So, at the end of the day, is the generative-model-based approach essential for superintelligent AGI?
  Back to Transformers and GPT-N
      So here’s my hypothesis: this whole complicated decentralized asynchronous probabilistic-generative-model-inference process is more-or-less exactly what the GPT-3 Transformer learns to approximate.
      So, does that mean GPT-3 is a human-like intelligence? No!! There’s a big difference!
  Conclusion
16 comments

Introduction

I want to share my thoughts about the calculations that Transformers (such as GPT-3) do, and the calculations that I think are required for general intelligence, and how well they line up, and what I think GPT-3 is doing under the hood, and why I think an arbitrary transformer-based GPT-N might be incapable of doing certain tasks that are seemingly essential for a system to qualify as an AGI.

Epistemic status: Very low confidence, to the point that I almost decided to delete this without posting it. I think some of my opinions here are very unpopular, and I would love any feedback or discussion.

Before we get into it, I want to make a couple background claims. The point here is basically to argue that the question “Can you get general intelligence by sufficiently scaling up a Transformer?” is worth asking, and does not have an answer of “Obviously yes, duh!!!” You can skip this part if you already agree with me on that.

Background Claim 1: There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

By “information processing” I mean anything from sorting algorithms to data compression, random access memories, hash tables, whatever.

Let’s take Monte Carlo Tree Search (MCTS) as an example. AlphaZero does MCTS because DeepMind engineers explicitly programmed it to do MCTS—not because a generic RNN or other deep learning system spontaneously discovered, during gradient descent, that MCTS is a good idea.
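For readers who haven't looked inside MCTS: the part that decides which branch to explore next is a few lines of ordinary, explicit code. The sketch below uses the textbook UCB1 rule as a stand-in; it is not AlphaZero's exact variant, and the function names, the `exploration` constant, and the toy visit counts are all assumptions made for illustration.

```python
import math

def ucb1_score(mean_value, visits, parent_visits, exploration=1.4):
    """Textbook UCB1: average value plus a bonus for rarely visited branches."""
    if visits == 0:
        return float("inf")  # always try an unvisited branch first
    return mean_value + exploration * math.sqrt(math.log(parent_visits) / visits)

def select_child(children, exploration=1.4):
    """children: list of (name, total_value, visits). Returns the name to explore next."""
    parent_visits = sum(visits for _, _, visits in children)

    def score(child):
        _, total_value, visits = child
        mean_value = total_value / visits if visits else 0.0
        return ucb1_score(mean_value, visits, parent_visits, exploration)

    return max(children, key=score)[0]

# Hypothetical statistics for three moves at one node of the search tree.
children = [("a", 6.0, 10), ("b", 1.0, 2), ("c", 0.0, 0)]
next_move = select_child(children)
```

With `exploration=0.0` the rule becomes purely greedy; raising it shifts effort toward branches that have been tried less often.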

Now, in principle, DNNs are universal function approximators, and more to the point, RNNs are Turing complete. So an RNN can emulate any other algorithm, including MCTS. But that doesn’t mean it can emulate it efficiently!

Let’s say we take a generic (PyTorch default) RNN, and train it such that it is incentivized to discover and start using MCTS. Assuming that the gradient flows converge to MCTS (a big "if"!), I believe (low confidence) that its only method for actually executing the MCTS involves: 

This is absurdly inefficient when compared to MCTS written by a DeepMind engineer and compiled to run directly on bare hardware with appropriate parallelization. Like, maybe, factor-of-a-million inefficient; this is not the kind of inefficiency where you can just shrug it off and wait a year or two for Moore's law to take care of it.

MCTS is just one example. Again, you can open up your algorithms textbook and find thousands of ways to process information. What fraction of these can be implemented reasonably well in the form of DNN-type matrix multiplications / ReLUs / etc.? I expect <<100%. If any such type of information processing is essential for AGI, then we should expect that we won’t get AGI in a pure DNN.

(We could still get it in a DNN-plus-other-stuff, e.g. DNN-plus-MCTS, DNN-plus-random-access-memory, etc.)

Background Claim 2: We can't just brush aside Background Claim 1 by appealing to the "Bitter Lesson"

Rich Sutton’s Bitter Lesson says that "general methods that leverage computation" have tended to outperform methods that "leverage...human knowledge of the domain". I strongly agree with this. As I've argued—most recently here [LW · GW]—I think that the neocortex (or pallium in birds and lizards) uses a "general method that leverages computation" to understand the world, act, reason, etc. And I think we could absolutely get AGI using that method. By contrast—again in agreement with Rich Sutton—I think it’s much less likely that we would get AGI from, say, hand-coded rules about proper reasoning, knowledge graphs, etc. Although who knows, I guess.

Now, read it carefully: Rich Sutton's claim is that the best way to solve any given problem will be a "general method that leverages computation". The claim is not the reverse, i.e., if you have any "general method that leverages computation", it will solve every problem, if only you leverage enough computation!

A fully-connected feedforward neural net is a “general method that leverages computation”, but it can't do what a GAN or MuZero or Transformer does, no matter how much computation you leverage. It still has to be an appropriate method, with appropriate learning algorithm, parametrization, dataset, etc.

So, just like a fully-connected feedforward neural net is not the right "general method" for NLP, I think it's at least open to question whether the more general paradigm of matrix-multiplications-and-ReLUs (possibly attached to other simple components like MCTS, RAM, etc.) is definitely the right "general method" for general intelligence. Maybe it is, maybe it isn't! It’s an open question.

(And by the way, I freely acknowledge that this paradigm can do lots of amazing things, including things that I probably wouldn't have expected it to be able to do, if I didn't already know.)

OK, so far, all this is just to plant the general idea that DNNs do a particular type of information processing, and that there’s no guarantee that it’s the right type of information processing for AGI, merely a suggestion from the fact that it is a form of information processing that can do lots of amazing things that seem to be related to AGI.

Generative-model-based information processing: Background and motivation

Moving on, now I'll switch gears and talk more specifically about a certain type of information processing that I think might be critical for AGI, and after that I'll talk about how that relates to the information processing in DNNs (and more specifically, Transformers like GPT-3).

Predicting inputs with generative models (a.k.a. “analysis by synthesis” or “probabilistic programming”)

I’ll introduce the idea of analysis-by-synthesis via an unusually simple example: Towards the First Adversarially Robust NN Model on MNIST. MNIST is a famous collection of low-resolution images of handwritten digits, and the task is to recognize the digit from the image pixels. The traditional deep learning method for solving MNIST is: You do gradient descent to find a ConvNet model whose input is pixel values and whose output is a probability distribution for which digit it is. In this paper here, however, they go the other direction: They build a generative model for pictures representing the digit “0”, a different generative model for the digit “1”, etc. Then, when presented with a new digit image, they do ten calculations in parallel: How compatible is this image with the “0” generative model? How compatible is it with the “1” generative model? … They then pick the hypothesis that best fits the data.
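To make the "one generative model per class, pick the best fit" idea concrete, here is a minimal sketch. It is not the paper's method: as a simplifying assumption it uses fixed per-pixel Bernoulli templates on made-up 3x3 images, and the class names and probabilities are invented for illustration.

```python
import math

# Hypothetical 3x3 binary "images" standing in for MNIST digits.
# Each class has its own generative model: the probability that each pixel is on.
TEMPLATES = {
    "bar_vertical":   [0.1, 0.9, 0.1,
                       0.1, 0.9, 0.1,
                       0.1, 0.9, 0.1],
    "bar_horizontal": [0.1, 0.1, 0.1,
                       0.9, 0.9, 0.9,
                       0.1, 0.1, 0.1],
}

def log_likelihood(image, pixel_probs):
    """Log-probability that this generative model would produce `image`."""
    total = 0.0
    for pixel, p in zip(image, pixel_probs):
        total += math.log(p if pixel else 1.0 - p)
    return total

def classify(image, models=TEMPLATES):
    """Score the image under every class model, then pick the best-fitting one."""
    scores = {label: log_likelihood(image, probs) for label, probs in models.items()}
    best = max(scores, key=scores.get)
    return best, scores

noisy_vertical = [0, 1, 0,
                  0, 1, 1,   # one pixel flipped by "noise"
                  0, 1, 0]
label, scores = classify(noisy_vertical)
```

The per-class scores are useful in their own right: if every model scores badly, that is a signal that the input looks like none of the known classes.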

Suggestively, their classification method winds up unusually human-like—it makes similar mistakes and has similar confusions as humans (assuming the paper is correct), unlike those funny examples in the adversarial perturbation literature.

As in this example, analysis-by-synthesis is any algorithm that searches through a collection / space of generative models for a model that best fits the input data (or better yet, with time-varying inputs, a model that predicts the input data that will arrive next). I’ve seen two other analysis-by-synthesis digit-recognition papers: Josh Tenenbaum & coworkers using classic probabilistic programming; and Dileep George & coworkers using a brain-inspired approach (and see also a nice series of blog posts introducing the latter).

The latter captures, I think, an essential aspect of the neocortex, in that there’s a whole society of generative models. Some of these generative models are at a very low level, and make predictions (with attached confidence levels) that certain sensory inputs will or won't be active. But more often the generative models make predictions that other generative models will or won’t be active. (These predictions can all be functions of time, or of other parameters—I'm leaving aside lots of details). To oversimplify, imagine that if the “bird” generative model is active, it also sends out a low-confidence prediction that the “is flying” generative model is also active, and that the "noun" generative model is active, and so on. This leads to not only hierarchical generative models, but also other forms of compositionality. For example, there’s a "purple jar" generative model that more-or-less just predicts that the "purple" generative model and the "jar" generative model are active simultaneously. This works because the "purple" and "jar" models make (by-and-large) compatible predictions—they agree with each other on some predictions, and in other cases they’re making predictions about different things. By contrast, there is no “stationary dancing” composite model—those two models make contradictory predictions. When two mutually-incompatible generative models are active simultaneously, they kinda fight each other for dominance. The algorithm underlying that "fight" is at least vaguely analogous to message-passing in a probabilistic graphical model. See more of my thoughts at Gary Marcus vs Cortical Uniformity [LW · GW] and Predictive coding = RL + SL + Bayes + MPC [LW · GW].
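A heavily oversimplified sketch of the "purple jar" versus "stationary dancing" point: treat each generative model as a set of predictions, and allow two models to be composed only if they never contradict each other. The dictionary representation and the feature names are assumptions for illustration; confidence levels, time-dependence, and the "fight for dominance" are all left out.

```python
# Each hypothetical generative model maps a feature to the value it predicts.
MODELS = {
    "purple":     {"color": "purple"},
    "jar":        {"shape": "jar", "is_moving": False},
    "stationary": {"is_moving": False},
    "dancing":    {"is_moving": True},
}

def conflicts(a, b):
    """Features on which two models make contradictory predictions."""
    return sorted(k for k in a.keys() & b.keys() if a[k] != b[k])

def compose(name_a, name_b, models=MODELS):
    """Merge two models' predictions, or return None if they contradict each other."""
    a, b = models[name_a], models[name_b]
    if conflicts(a, b):
        return None
    return {**a, **b}

purple_jar = compose("purple", "jar")                  # compatible: predictions merge
stationary_dancing = compose("stationary", "dancing")  # contradictory: no composite
```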

What’s so great about generative-model-based processing?

Well, you might say, OK, analysis-by-synthesis is an interesting approach. But is it essential for AGI? Or is it just one of many ways to do things? I think it’s at least plausibly essential. Consider some of its features...

Feature 1: Better and better results with a longer search

Example: You look at a picture of a camouflaged animal, hidden so well that you can’t find it at first. But after staring for a minute, it snaps into place, and you recognize it.

Example: Your code has a syntax error on line 12. You stare at the line for a few seconds, and then finally see the problem.

Since we’re doing a search over a space of generative models, if we don’t find a good match immediately, we can notice our confusion and keep searching. The space of generative models is astronomically large; you can always search more, given more time.
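Here is a toy sketch of that "anytime" property: a search that keeps the best hypothesis found so far can only improve as its budget grows. Random guessing stands in for a real search procedure, and the scoring function, target vector, and budget are all hypothetical.

```python
import random

def fit(hypothesis, target):
    """Hypothetical score: negative squared error between a candidate and the data."""
    return -sum((h - t) ** 2 for h, t in zip(hypothesis, target))

def anytime_search(target, budget, seed=0):
    """Random search that always remembers the best hypothesis seen so far."""
    rng = random.Random(seed)
    best, best_score = None, float("-inf")
    history = []  # best score after each step
    for _ in range(budget):
        candidate = [rng.uniform(-1.0, 1.0) for _ in target]
        score = fit(candidate, target)
        if score > best_score:
            best, best_score = candidate, score
        history.append(best_score)
    return best, history

target = [0.3, -0.7, 0.5]
_, history = anytime_search(target, budget=2000)
short_run, long_run = history[99], history[-1]  # quality after 100 vs. 2000 steps
```

Stopping early is always allowed; the only cost is a possibly worse answer.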

Feature 2: Generative models are simpler & easier to learn than discriminative models

Example: You are asked to prove a simple theorem for homework. You spend a few minutes thinking through different possible lines of attack, and eventually see an approach that will work.

Example: You are asked to invent a microscope with higher resolution and speed than what exists today. You think through a wide array of many different possible configurations of lenses, lasers, filters, polarizers, galvos, sample-holders, etc. Eventually you come up with a configuration that meets all requirements.

In both these cases, the forward / generative direction has a (relatively) simple structure: If you differentiate both sides, what happens? By contrast, the reverse / discriminative direction is almost arbitrarily complex: At the end, you proved the theorem; what was the first step? A model that could immediately answer the endless variety of such reverse-direction questions seems like it would be hopelessly complex to learn!

In fact, probably the only way that the reverse model could be learned is by first developing a good generative model, running it through in lots of configurations, and training a reverse model on that. And something like that is indeed entirely possible! In humans, there's a kind of memorization / chunking that allows the system to "cache" the results of runs through the generative models, so that the corresponding reverse-lookup can be very fast in the future. Actually, this is not so far from how AlphaZero works too—in that case, the MCTS self-play functions as the generative model. 
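The "run the forward model in lots of configurations, then cache the results for fast reverse lookup" idea can be sketched in a few lines. The forward model below (a ratio of two made-up optical parameters) is purely hypothetical; the point is only the direction of the computation.

```python
from itertools import product

def forward(lens_mm, aperture_mm):
    """Hypothetical forward model: a configuration goes in, an outcome comes out."""
    return round(lens_mm / aperture_mm, 2)

def build_reverse_cache(lens_options, aperture_options):
    """Run the forward model over many configurations and index them by outcome."""
    cache = {}
    for lens_mm, aperture_mm in product(lens_options, aperture_options):
        outcome = forward(lens_mm, aperture_mm)
        cache.setdefault(outcome, []).append((lens_mm, aperture_mm))
    return cache

cache = build_reverse_cache(lens_options=[25, 50, 100], aperture_options=[5, 10, 20])

# Reverse question: "which configurations give an outcome of 5.0?"
configs_for_5 = cache.get(5.0, [])
```

The reverse lookup is fast, but it only covers outcomes that the forward runs happened to produce; anything else still needs a fresh search.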

Feature 3: A generative-model-based approach is more sample-efficient and out-of-distribution-generalizable

This is related to Feature 2. Since generative models are simpler (less information content) than reverse / discriminative models, they can be learned more quickly. Empirically, I think this is borne out by those three digit-recognition papers I cited above, and by the fact that humans can do zero-shot transfer learning in all kinds of situations. (Of course, humans need to spend years cultivating a healthy and diverse society of generative models; the sample-efficiency I'm talking about is after that.)

Another way to think about it: things in the world are, in reality, generated by highly structured hierarchical-and-compositional-generative-model-type processes. If we want a good inductive bias, then we need to structure our analysis in a parallel way: we need to go looking for those kinds of generative models until we find ones that explain the data. Better inductive bias means better out-of-distribution generalization and fewer samples required.

Example: You read about a new concept you hadn't heard of—Quantum Capacitance. You find it confusing at first, but after reading a couple descriptions and thinking about it, you find a good mental model of it—a certain way to visualize it, to relate it to other known concepts, etc. Having internalized that concept, you can then go use it and build on it in the future.

In this example, you built a good generative model of Quantum Capacitance despite few or even no concrete examples of it. How? For one thing, you're not building it from scratch, you're snapping together a few pieces of generative models you've already learned, and your textbook helps you figure out which ones, by describing the concept using analogies, figurative spatial language, etc. For another thing, you are trying to slot this concept into a dense web of previously-established physics-related generative models, each of which will protest loudly if they see something that contradicts them. In other words, there are only so many ways to create this concept in a way that doesn't contradict the things about physics you have already come to believe. Third, you can test any number of proposed generative models against the same known examples, sorta vaguely like off-policy replay learning or something. So anyway, with all these ingredients, it's a highly constrained problem, and so you can find the right model with few if any concrete examples.

Feature 4: Foresight, counterfactuals, deliberation, etc.

A society of generative models is a powerful thing...

I could go on, but you get the idea.

So, at the end of the day, is the generative-model-based approach essential for superintelligent AGI?

I mean, it kind of feels that way to me, but maybe I’m anthropomorphizing.

Back to Transformers and GPT-N

OK, you’ve read all this way, now how does this relate to GPT-3? Will GPT-N be an AGI or not??

Now, when the brain does its probabilistic generative model inference thing, it does it by a complicated, decentralized process. There’s a feedforward pass that activates some of the most promising generative models. Then the models do some message-passing back and forth. Contradictory models fight it out. New models get awakened. Redundant models go away, etc. Multi-step models, extended through time, walk through their steps.

So here’s my hypothesis: this whole complicated decentralized asynchronous probabilistic-generative-model-inference process is more-or-less exactly what the GPT-3 Transformer learns to approximate.

As a special case: I feel pretty comfortable imagining that a single trained Transformer layer can approximate a single step of message-passing in a probabilistic graphical model inference algorithm.

Now, I haven’t thought this through in any amount of detail, but so far this hypothesis seems at least plausible to me. For example, the Transformer has residual connections that allow later iterations to create small increments in the activations, which is what often happens with later stages of message-passing. It has a self-attention mechanism that seems well-suited to representing the sparsely-connected structure of a probabilistic graphical model. It has the positional information that it needs to figure out how far along the various multi-step generative models have progressed. GPT-2 has 12 layers and GPT-3 has 96, enough for quite a lot of sequential processing, which is important to capture this asynchronous process as it unfolds in time.
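To illustrate the "later iterations make small increments" point, here is a toy sketch of iterative inference on a five-node chain, written in residual form (new state = old state + increment). It is a simple Jacobi-style update for a chain of coupled variables, chosen as a stand-in; it is not full belief propagation and certainly not an actual Transformer layer, and the coupling strength and evidence values are assumptions.

```python
def message_passing_step(beliefs, evidence, coupling=1.0):
    """One synchronous round: each node moves toward what its evidence and neighbors predict."""
    n = len(beliefs)
    deltas = []
    for i in range(n):
        neighbors = [beliefs[j] for j in (i - 1, i + 1) if 0 <= j < n]
        target = (evidence[i] + coupling * sum(neighbors)) / (1.0 + coupling * len(neighbors))
        deltas.append(target - beliefs[i])
    # Residual form: the new state is the old state plus an increment.
    new_beliefs = [b + d for b, d in zip(beliefs, deltas)]
    return new_beliefs, max(abs(d) for d in deltas)

def run_layers(evidence, num_layers):
    """Apply the same update `num_layers` times, recording the size of each increment."""
    beliefs = [0.0] * len(evidence)
    increments = []
    for _ in range(num_layers):
        beliefs, step_size = message_passing_step(beliefs, evidence)
        increments.append(step_size)
    return beliefs, increments

evidence = [1.0, 0.0, 0.0, 0.0, 1.0]
beliefs, increments = run_layers(evidence, num_layers=12)
```

In this toy setup the increments shrink from round to round, so most of the change to the state happens in the early rounds and the later ones only fine-tune it.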

I like this hypothesis a lot because it's consistent with my strong belief that GPT-3 has captured human-like concepts and can search through them and manipulate them and combine them in a very human-like way.

So, does that mean GPT-3 is a human-like intelligence? No!! There’s a big difference!

Let me offer a more specific list of my hypothesized fundamental deficiencies in Transformers as AGIs.

First, for the reason mentioned above, I think the sample efficiency is bound to be dramatically worse for training a Transformer versus training a real generative-model-centric system. And this makes it difficult or impossible for it to learn or create concepts that humans are not already using.

For example, I am confident that GPT-N, after reading tons of text where people use some random concept (Quantum Capacitance, say), can gradient-descent its way to properly using that concept, and to properly integrating it into its (implicit) world-model. But if GPT-N had never heard of Quantum Capacitance, and saw one good explanation in its training data, I’m pretty skeptical that it would be able to use the concept properly. And I'm even more skeptical that it could invent that concept from scratch.

To clarify, I’m talking here about sample-inefficiency in modifying the weights to form a more sophisticated permanent understanding of the world. By contrast, I think there is little doubt that it has the ability to alter its activations in a quick and flexible and sample-efficient way in response to thought-provoking input text. But that only takes you so far! That cannot take you more than a couple steps of inferential distance away from the span of concepts frequently used by humans in the training data. 

Second, the finite number of Transformer layers puts a ceiling on the quality of the generative-model-search process, the time spent deliberating, etc. As I mentioned above, humans can stretch their capabilities by thinking a little bit longer and harder. However, if you have a Transformer that more-or-less simulates the first 100 (or whatever) milliseconds of the neocortex's generative-model-search process, then that's all you can ever get.
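The contrast between a fixed-depth computation and one that can keep going is easy to show with a generic refinement loop. Bisection on a one-dimensional problem is used here only as an analogy for "one more step of deliberation"; the functions, the depth of 6, and the tolerance are all hypothetical choices.

```python
def refine(lo, hi, f):
    """One refinement step (bisection): halve the interval that brackets a root of f."""
    mid = (lo + hi) / 2.0
    return (lo, mid) if f(lo) * f(mid) <= 0 else (mid, hi)

def fixed_depth(f, lo, hi, depth):
    """Always spends exactly `depth` steps, however hard the problem is."""
    for _ in range(depth):
        lo, hi = refine(lo, hi, f)
    return (lo + hi) / 2.0

def adaptive(f, lo, hi, tolerance):
    """Keeps refining until the answer is good enough; returns (answer, steps used)."""
    steps = 0
    while hi - lo > tolerance:
        lo, hi = refine(lo, hi, f)
        steps += 1
    return (lo + hi) / 2.0, steps

def f(x):
    return x * x - 2.0  # the root is sqrt(2)

shallow = fixed_depth(f, 1.0, 2.0, depth=6)
deep, steps_used = adaptive(f, 1.0, 2.0, tolerance=1e-9)
```

The fixed-depth version has a hard accuracy ceiling set by `depth`, while the adaptive version just spends more steps when the tolerance is tighter.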

Third, because the Transformer is a kind of information processing imitating a different kind of information processing, I generally expect edge cases where the imitation breaks down, leading to weird inductive biases, crazy out-of-distribution behavior, etc. I’m not too sure about this one though.

Maybe there are other things too.

Then of course there are also the obvious things like GPT-3 being disembodied, not having a reward signal, not having visual or spatial input channels, etc. These are all plausibly very important, but I don't want to emphasize them too much, because I don't think they're fundamental architectural limitations; I think they're more likely just things that OpenAI hasn't gotten around to doing yet.

Conclusion

These three deficiencies I'm hypothesizing seem like pretty serious roadblocks to AGI—especially the one where I claim that it can't form new concepts and add them permanently to its world-model.

That said, it’s entirely possible that these deficiencies don’t matter, or can be worked around. (Or that I'm just wrong.)

But for the moment, I continue to consider it very possible that Transformers specifically, and DNN-type processing more generally (matrix multiplications, ReLUs, etc.), for all their amazing powers, will eventually be surpassed in AGI-type capabilities by a different kind of information processing, more like probabilistic programming and message-passing, and also more like the neocortex (but, just like DNNs, still based on relatively simple, general principles, and still requiring an awful lot of compute).

I recognize that this kind of probabilistic programming stuff is not as "hot" as DNNs right now, but it's not neglected either; it's a pretty active area of CS research, moving forward each year.

But I dunno. As I mentioned, I'm not confident about any of this, and I am very interested in discussion and feedback here. :-)

16 comments

Comments sorted by top scores.

comment by johnswentworth · 2020-07-23T17:54:34.585Z · score: 22 (9 votes) · LW(p) · GW(p)

This was a really interesting read. I'm glad you decided to post it.

I find all of the pieces fairly interesting and plausible in isolation, but I think the post either underexplains or understates the case for how it all fits together. As written, the post roughly says:

  • first section: there are probably some kinds of computation which can't be efficiently supported (though such computations might not actually matter)
  • second section: here's one particular kind of computation which seems pretty central to how humans think (though I don't really see a case made here that it's necessary)
  • third section: some things which might therefore be difficult

I do think the case can be strengthened a lot, especially for low sample efficiency and the difficulty of inventing new concepts. Here's a rough outline of the argument I would make:

  • Sample efficiency is all about how well we approximate Bayesian reasoning. This is one of the few places where both the theory and the empirics have a particularly strong case for Bayes.
  • Bayesian reasoning, on the sort of problems we're talking about, means generative models. So if we want sample efficiency, then generative models (or a good approximation thereof) are a necessary element.
  • GPT-style models do not have "basin of attraction"-style convergence, where they learn the general concept of generative models and can then easily create new generative models going forward. They have to converge to each new generative model the hard way.

That last step is the one I'd be most uncertain about, but it's also a claim which is "just math", so it could be checked by either analysis or simulation if we know how the models in question are embedded.

comment by Gurkenglas · 2020-07-23T19:25:21.731Z · score: 7 (4 votes) · LW(p) · GW(p)

Re claim 1: If you let it use the page as a scratch pad, you can also let it output commands to a command line interface so it can outsource these hard-to-emulate calculations to the CPU.

comment by Davidmanheim · 2020-07-24T09:10:07.989Z · score: 4 (2 votes) · LW(p) · GW(p)

I'm unsure that GPT-3 can output, say, an IPython notebook to get the values it wants.

That would be really interesting to try...

comment by Vaniver · 2020-07-23T15:40:06.015Z · score: 7 (4 votes) · LW(p) · GW(p)

There are types of information processing that cannot be cast in the form of Deep Neural Net (DNN)-type calculations (matrix multiplications, ReLUs, etc.), except with an exorbitant performance penalty.

Sure... but humans can't do those either, without an exorbitant performance penalty! Does this imply that humans alone aren't general intelligences (and thus the threshold we should be worried about is lower), or that they're not actually important for general intelligence?

will eventually be surpassed in AGI-type capabilities by a different kind of information processing

"Surpassed" seems strange to me; I'll bet that the first AGI system will have a very GPT-like module, that will be critical to its performance, that will nevertheless not be "the whole story." Like, by analogy to AlphaGo, the interesting thing was the structure they built around the convnets, but I don't think it would have worked nearly as well without the convnets.

comment by steve2152 · 2020-07-23T16:02:31.060Z · score: 8 (5 votes) · LW(p) · GW(p)

Sure... but humans can't do those either, without an exorbitant performance penalty!

Well, a big part of this post is an argument that the human neocortex is doing a different type of information processing than a DNN, with the neocortex's algorithms being more similar to the algorithms underlying probabilistic programming, message-passing, etc. Therefore I don't accept the premise that in general, if a DNN can't do a certain type of information processing efficiently, then neither can the human brain.

Do you think DNNs and human brains are doing essentially the same type of information processing? If not, how did you conclude "humans can't do those either"? Thanks!

comment by Douglas Summers-Stay (douglas-summers-stay) · 2020-07-23T17:45:36.454Z · score: 5 (3 votes) · LW(p) · GW(p)

Regarding "thinking a problem over"-- I have seen some examples where on some questions that GPT-3 can't answer correctly off the bat, it can answer correctly when the prompt encourages a kind of talking through the problem, where its own generations bias its later generations in such a way that it comes to the right conclusion in the end. This may undercut your argument that the limited number of layers prevents certain kinds of problem solving that need more thought?

comment by steve2152 · 2020-07-23T18:13:57.484Z · score: 6 (3 votes) · LW(p) · GW(p)

Yeah, maybe. Well, definitely to some extent.

I guess I would propose that "using the page as a scratchpad" doesn't help with the operation "develop an idea, chunk it, and then build on it". The problem is that the chunking and building-on-the-chunk have to happen sequentially. So maybe it can (barely) develop an idea, chunk it, and write it down. Then you turn the Transformer back on for the next word, with the previous writing as an additional input, and maybe it takes 30 Transformer layers just to get back to where it was, i.e. having re-internalized that concept from before. And then there aren't enough layers left to build on it... Let alone build a giant hierarchy of new chunks-inside-chunks.

So I think that going more than 1 or 2 "steps" of inferential distance beyond the concepts represented in the training data requires that the new ideas get put into the weights, not just the activations.

I guess you could fine-tune the network on its own outputs, or something. I don't think that would work, but who knows.

comment by ESRogs · 2020-08-12T01:11:34.821Z · score: 4 (2 votes) · LW(p) · GW(p)

Then you turn the Transformer back on for the next word, with the previous writing as an additional input, and maybe it takes 30 Transformer layers just to get back to where it was, i.e. having re-internalized that concept from before.

Are the remaining 66 layers not enough to build on the concept? What if we're talking about GPT-N rather than GPT-3, with T >> 96 total layers, such that it can use M layers to re-internalize the concept and T-M layers to build on it?

Aren't our brains having to do something like that with our working memory?

comment by steve2152 · 2020-08-14T13:40:24.303Z · score: 4 (2 votes) · LW(p) · GW(p)

I agree that, the more layers you have in the Transformer, the more steps you can take beyond the range of concepts and relations-between-concepts that are well-represented in the training data.

If you want your AGI to invent a new gadget, for example, there might be 500 insights involved in understanding and optimizing its operation—how the components relate to each other, what happens in different operating regimes, how the output depends on each component, what are the edge cases, etc. etc. And these insights are probably not particularly parallelizable; rather, you need to already understand lots of them to figure out more. I don't know how many Transformer layers it takes to internalize a new concept, or how many Transformer layers you can train, so I don't know what the limit is, only that I think there's some limit. Unless the Transformer has recurrency I guess, then maybe all bets are off? I'd have to think about that more.

Aren't our brains having to do something like that with our working memory?

Yeah, definitely. We humans need to incorporate insights into our permanent memory / world-model before we can effectively build on them.

This is analogous to my claim that we need to get new insights somehow out of GPT-N's activations and into its weights, before it can effectively build on them.

Maybe the right model is a human and GPT-N working together. GPT-N has some glimmer of an insight, and the human "gets it", and writes out 20 example paragraphs that rely on that insight, and then fine-tunes GPT-N on those paragraphs. Now GPT-N has that insight incorporated into its weights, and we go back, with the human trying to coax GPT-N into having more insights, and repeat.

I dunno, maybe. Just brainstorming. :-)

comment by gwern · 2020-07-23T17:23:06.580Z · score: 5 (6 votes) · LW(p) · GW(p)

I'm going to stop at your very first claim and observe: MuZero.

comment by LGS · 2020-07-24T01:13:18.482Z · score: 6 (5 votes) · LW(p) · GW(p)

You are aware that MuZero has tree search hardcoded into it, yes? How does that contradict claim 1?

comment by gwern · 2020-07-24T02:49:22.044Z · score: 1 (8 votes) · LW(p) · GW(p)

MuZero does not do MCTS and still outperforms.

comment by LGS · 2020-07-24T03:05:27.762Z · score: 18 (13 votes) · LW(p) · GW(p)

It does do (a variant of) MCTS. Check it out for yourself. The paper is here:

https://arxiv.org/pdf/1911.08265.pdf

Appendix B, page 12:

"We now describe the search algorithm used by MuZero. Our approach is based upon Monte-Carlo tree search with upper confidence bounds, an approach to planning that converges asymptotically to the optimal policy in single agent domains and to the minimax value function in zero sum games [22]."

comment by Pongo · 2020-07-23T21:09:55.114Z · score: 4 (3 votes) · LW(p) · GW(p)

To check, I think you're saying:

You believe it is inefficient to simulate some algorithms with a DNN. But the algorithms only matter for performance on various tasks. You give an example where a hard-to-simulate algorithm was used in an AI system, but when we got rid of the hard-to-simulate algorithm, it performed comparably well.

comment by ESRogs · 2020-08-12T01:00:06.619Z · score: 4 (2 votes) · LW(p) · GW(p)

I recognize that this kind of probabilistic programming stuff is not as "hot" as DNNs right now, but it's not neglected either; it's a pretty active area of CS research, moving forward each year.

Would the generative models not themselves be deep neural networks?

comment by steve2152 · 2020-08-14T13:16:00.333Z · score: 4 (2 votes) · LW(p) · GW(p)

Yeah, probably. I gave this simple example where they build 10 VAEs to function as 10 generative models, each of which is based on a very typical deep neural network. The inference algorithm is still a bit different from a typical MNIST model, because the answer is not directly output, but comes from MAP inference, or something like that.

I don't think that particular approach is scalable because there's a combinatorial explosion of possible things in the world, which need to be matched by a combinatorial explosion of possible generative models to predict them. So you need an ability to glue together models ("compositionality", although it's possible that I'm misusing that term). For example, compositionality in time ("Model A happens, and then Model B happens"), or compositionality in space ("Model A and Model B are both active, with a certain spatial relation"), or compositionality in features ("Model A is predicting the object's texture and Model B is predicting its shape and Model C is predicting its behavior"), etc.

(In addition to being able to glue them together, you also need an algorithm that searches through the space of possible ways to glue them together, to find the right glued-together generative model that fits a certain input, in a computationally-efficient way.)

It's not immediately obvious how to take typical deep neural network generative models and glue them together like that. Of course, I'm sure there are about 10 grillion papers on exactly that topic that I haven't read. So I don't know, maybe it's possible. 

What I have been reading is papers trying to work out how the neocortex does it. My favorite examples for vision are probably currently this one from Dileep George and this one from Randall O'Reilly. Note that the two are not straightforwardly compatible with each other—this is not a well-developed field, but rather lots of insights that are gradually getting woven together into a coherent whole. Or at least that's how it feels to me.

Are these neocortical models "deep neural networks"?

Well, they're "neural" in a certain literal sense :-) I think the neurons in those two papers are different but not wildly different than the "neurons" in PyTorch models, more-or-less using the translation "spike frequency in biological neurons" <--> "activation of PyTorch 'neurons'". However, this paper proposes a computation done by a single biological neuron which would definitely require quite a few PyTorch 'neurons' to imitate. They propose that this computation is important for learning temporal sequences, which is one form of compositionality, and I suspect it's useful for the other types of compositionality as well.

They're "deep" in the sense of "at least some hierarchy, though typically 2-5 layers (I think) not 50, and the hierarchy is very loose, with lots of lateral and layer-skipping and backwards connections". I heard a theory that the reason that ResNets need 50+ layers to do something vaguely analogous to what the brain does in ~5 (loose) layers is that the brain has all these recurrent connections, and you can unroll a recurrent network into a feedforward network with more layers. Plus the fact that one biological neuron is more complicated than one PyTorch neuron. I don't really know though...