Predictive Coding has been Unified with Backpropagation

post by lsusr · 2021-04-02T21:42:12.937Z · LW · GW · 51 comments

Contents

  Hebbian theory
  Unification
51 comments

Artificial Neural Networks (ANNs) are based around the backpropagation algorithm. The backpropagation algorithm allows you to perform gradient descent on a network of neurons. When we feed training data through an ANN, we use the backpropagation algorithm to tell us how the weights should change.

ANNs are good at inference problems. Biological Neural Networks (BNNs) are good at inference too. ANNs are built out of neurons. BNNs are built out of neurons too. It makes intuitive sense that ANNs and BNNs might be running similar algorithms.

There is just one problem: BNNs are physically incapable of running the backpropagation algorithm.

We do not know quite enough about biology to say it is impossible for BNNs to run the backpropagation algorithm. However, "a consensus has emerged that the brain cannot directly implement backprop, since to do so would require biologically implausible connection rules"[1].

The backpropagation algorithm has three steps.

  1. Flow information forward through a network to compute a prediction.
  2. Compute an error by comparing the prediction to a target value.
  3. Flow the error backward through the network to update the weights.
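
Here is a minimal sketch of those three steps in numpy, on a toy two-layer network (the layer sizes, tanh activation, and squared-error loss are illustrative choices, not anything specified in the post):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy two-layer network: x -> h = tanh(W1 x) -> y_pred = W2 h
W1 = rng.normal(size=(4, 3)) * 0.1
W2 = rng.normal(size=(1, 4)) * 0.1
x = rng.normal(size=(3, 1))       # one training input
y_target = np.array([[1.0]])      # its target value
lr = 0.1

# 1. Flow information forward through the network to compute a prediction.
h = np.tanh(W1 @ x)
y_pred = W2 @ h

# 2. Compute an error by comparing the prediction to the target value.
error = y_pred - y_target         # gradient of 0.5 * ||y_pred - y_target||^2

# 3. Flow the error backward through the network to update the weights.
grad_W2 = error @ h.T
delta_h = (W2.T @ error) * (1 - h ** 2)   # chain rule through tanh
grad_W1 = delta_h @ x.T
W2 -= lr * grad_W2
W1 -= lr * grad_W1
```

Step 3 is the part that requires information to flow backward through the same connections that carried the forward pass.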

The backpropagation algorithm requires information to flow forward and backward along the network. But biological neurons are one-directional. An action potential goes from the cell body down the axon to the axon terminals to another cell's dendrites. An action potential never travels backward from a cell's axon terminals to its cell body.

Hebbian theory

Predictive coding is the idea that BNNs generate a mental model of their environment and then transmit only the information that deviates from this model. Predictive coding considers error and surprise to be the same thing. Hebbian theory is a specific mathematical formulation of predictive coding.
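
For reference, the classic Hebbian update ("neurons that fire together wire together") is purely local: a weight changes based only on the activity of the two neurons it connects,

$\Delta w_{ij} = \eta \, x_i \, y_j$,

where $x_i$ is the presynaptic activity, $y_j$ the postsynaptic activity, and $\eta$ a learning rate. Predictive-coding learning rules have this same local product form, with prediction errors playing the role of one of the activities.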

Predictive coding is biologically plausible. It operates locally. There are no separate prediction and training phases which must be synchronized. Most importantly, it lets you train a neural network without sending action potentials backwards.

Predictive coding is easier to implement in hardware. It is locally-defined; it parallelizes better than backpropagation; it continues to function when you cut its substrate in half. (Corpus callosotomy is used to treat epilepsy.) Digital computers break when you cut them in half. Predictive coding is something evolution could plausibly invent.

Unification

The paper Predictive Coding Approximates Backprop Along Arbitrary Computation Graphs[1:1] "demonstrate[s] that predictive coding converges asymptotically (and in practice rapidly) to exact backprop gradients on arbitrary computation graphs using only local learning rules." The authors have unified predictive coding and backpropagation into a single theory of neural networks. Predictive coding and backpropagation are separate hardware implementations of what is ultimately the same algorithm.
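
To make the claim concrete, here is a rough sketch of the flavor of algorithm involved (my own simplified formulation of predictive coding on a two-layer network, not the exact variant in the paper): every layer keeps a value node and a local prediction error, the values relax toward an energy minimum using only neighbouring errors, and at (approximate) equilibrium those errors play the role of backprop's deltas.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(z):  return np.tanh(z)
def fp(z): return 1 - np.tanh(z) ** 2

# Toy layer sizes and weights (illustrative only)
sizes = [3, 4, 1]
W = [rng.normal(size=(sizes[l + 1], sizes[l])) * 0.1 for l in range(2)]
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])

# Value nodes: input clamped to the data, output clamped to the target,
# the hidden layer initialised by a forward pass.
v = [x, f(W[0] @ x), y]
eta_v, eta_w = 0.1, 0.05

def errors(v, W):
    # Local prediction errors eps_l = v_l - f(W_l v_{l-1})
    return [None] + [v[l] - f(W[l - 1] @ v[l - 1]) for l in (1, 2)]

# Inference phase: relax the free (hidden) values by descending the energy
# F = sum_l 0.5 * ||v_l - f(W_l v_{l-1})||^2, using only neighbouring errors.
for _ in range(200):
    eps = errors(v, W)
    dF_dv1 = eps[1] - W[1].T @ (eps[2] * fp(W[1] @ v[1]))
    v[1] = v[1] - eta_v * dF_dv1

# Learning phase: each weight update uses only the pre-synaptic values and the
# post-synaptic error, i.e. a local, Hebbian-style rule.
eps = errors(v, W)
for l in (1, 2):
    W[l - 1] = W[l - 1] + eta_w * (eps[l] * fp(W[l - 1] @ v[l - 1])) @ v[l - 1].T
```

The paper's result is that, for schemes of roughly this shape, the equilibrium errors match (up to the approximations discussed there) the gradients that backprop would have computed.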

There are two big implications of this.


  1. Source is available on arxiv. ↩︎ ↩︎

51 comments

Comments sorted by top scores.

comment by abramdemski · 2021-04-06T17:14:59.981Z · LW(p) · GW(p)

I have not dug into the math in the paper yet, but the surprising thing from my current perspective is: backprop is basically for supervised learning, while Hebbian learning is basically for unsupervised learning. In particular, Hebbian learning has been touted as an (inefficient but biologically plausible) algorithm for PCA. How can you chain a bunch of PCAs together and get gradient descent?
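
For concreteness, the PCA claim is usually about Oja's rule, a stabilized Hebbian update whose weight vector converges to the top principal component of the input distribution. A minimal sketch, with toy data and an illustrative step size:

```python
import numpy as np

rng = np.random.default_rng(0)

# Zero-mean toy data whose top principal component is (roughly) the x-axis.
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])

w = rng.normal(size=2)
eta = 0.01

for x in X:
    y = w @ x                    # a single linear "neuron"
    w += eta * y * (x - y * w)   # Oja's rule: Hebbian term y*x plus a decay that keeps ||w|| near 1

print(w / np.linalg.norm(w))     # approaches +/-(1, 0), the first principal component
```

Chaining units like this gives something PCA-flavored, which is part of why it is surprising that it could amount to supervised gradient descent.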

Aside from that, here's what I understood from the paper so far.

  • By predictive coding, they basically mean: take the structure of the computation graph (eg, the structure of the NN) and interpret it as a gaussian bayes net instead.
  • Calculating learning using only local information follows, therefore, from the general fact that bayes nets let you efficiently compute gradient descent (and some other rules, such as the EM algorithm) using only local information, so you don't have to perform automatic differentiation on the whole network.
  • So the local computation of gradient descent isn't surprising: it's standard for graphical models, it's just unusual for NNs. This is one reason why graphical models might be a better model of the brain than artificial neural networks.
  • The contribution of this paper is the nice correspondence between Gaussian bayes nets and NN backprop. I'm not really sure this should be exciting. It's not like it's useful for anything. If we were really excited about local learning rules, well, we already had some. 
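
To make the first bullet concrete (my paraphrase of the usual construction, not a quote from the paper): treat each layer's value as Gaussian around the prediction computed from the layer below,

$v_\ell \sim \mathcal{N}(f(W_\ell v_{\ell-1}), \sigma^2 I)$,

so the negative log-likelihood is a sum of local squared prediction errors,

$F = \sum_\ell \frac{1}{2\sigma^2} \lVert v_\ell - f(W_\ell v_{\ell-1}) \rVert^2 + \text{const}$,

and the gradients $\partial F / \partial v_\ell$ and $\partial F / \partial W_\ell$ involve only the errors at layers $\ell$ and $\ell+1$, which is the "local information" in question.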

Maybe the tremendous success of backprop lends some fresh credibility to bayes nets due to this correspondence. IE, maybe we are supposed to make an inference like: "I know backprop on NNs can be super effective, so I draw the lesson that learning for bayes nets (at least Gaussian bayes nets) can also be super effective, at a 100x slowdown [LW(p) · GW(p)]." But this should have already been plausible, I claim. The machine learning community didn't really put bayes nets and NNs side by side and find bayes nets horribly lacking in learning capacity. Rather, I think the 100x slowdown was the primary motivator: bayes nets eliminate the need for an extra automatic differentiation step, but at the cost of a more expensive inference algorithm.

In particular, someone might take this as evidence that the brain uses Gaussian networks in particular, because we now know Gaussian approximates backprop, and we know backprop is super effective. I think this would be a mistaken inference: I don't think this provides much evidence that Gaussian bayes nets are especially intelligent compared to other Bayes nets.

On the other hand, the simplicity of the math for the Gaussian case does provide some evidence. It seems more plausible that the brain uses Gaussian bayes nets than, say, particle filters.

Replies from: Ilio
comment by Ilio · 2023-08-09T13:04:38.746Z · LW(p) · GW(p)

the surprising thing from my current perspective is: backprop is basically for supervised learning, while Hebbian learning is basically for unsupervised learning

That’s either a poetic analogy or a factual error. For example, autoencoders belong to unsupervised learning and are trained through backprop.

Replies from: abramdemski
comment by abramdemski · 2023-08-10T16:07:49.429Z · LW(p) · GW(p)

I don't think my reasoning was particularly strong there, but the point is less "how can you use gradient descent, a supervised-learning tool, to get unsupervised stuff????" and more "how can you use Hebbian learning, an unsupervised-learning tool, to get supervised stuff????" 

Autoencoders transform unsupervised learning into supervised learning in a specific way (by framing "understand the structure of the data" as "be able to reconstruct the data from a smaller representation").

But the reverse is much less common. EG, it would be a little weird to apply clustering (an unsupervised learning technique) to a supervised task. It would be surprising to find out that doing so was actually equivalent to some pre-existing supervised learning tool. (But perhaps not as surprising as I was making it out to be, here.)

Replies from: Ilio
comment by Ilio · 2023-08-11T02:26:47.787Z · LW(p) · GW(p)

"how can you use Hebbian learning, an unsupervised-learning tool, to get supervised stuff????"

Thanks for the clarification. I guess the key insight for this is: they’re both Turing complete.

"be able to reconstruct the data from a smaller representation"

Doesn’t this sound like the thalamus includes a smaller representation than the cortices?

it would be a little weird to apply clustering (an unsupervised learning technique) to a supervised task.

Actually this is one form of feature engineering; I’m confident you can find many examples on Kaggle! Yes, you’re most probably right that this is telling us something important, in the same way that it’s telling us something important that in some sense all NP-complete problems are arguably the same problem.

comment by Gunnar_Zarncke · 2021-04-04T00:00:19.272Z · LW(p) · GW(p)

On page 8 at the end of section 4.1:

Due to the need to iterate the vs until convergence, the predictive coding network had roughly a 100x greater computational cost than the backprop network.

This seems to imply that artificial NNs are 100x more computationally efficient (at the cost of not being able to grow, and probably lower fault tolerance, etc.). Still, I'm updating toward simulating a brain requiring much less compute than the brain's neuron count would indicate.

Replies from: interstice, samshap
comment by interstice · 2021-04-04T19:58:30.804Z · LW(p) · GW(p)

That's assuming that the brain is using predictive coding to implement backprop, whereas it might instead be doing something that is more computationally efficient given its hardware limitations. (Indeed, the fact that it's so inefficient should make you update that it's not likely for the brain to be doing it)

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2021-04-04T20:33:21.942Z · LW(p) · GW(p)

Partly, yes. But partly the computation could be the cheap part compared to the things it's trading off against (ability to grow, fault tolerance, ...). It is also possible that the brain's architecture allows it to include a wider range of inputs that backprop might not be able to model (or not efficiently so).

comment by samshap · 2021-04-05T14:07:54.639Z · LW(p) · GW(p)

I think that's premature. This is just one (digital, synchronous) implementation of one model of BNN that can be shown to converge on the same result as backprop. In a neuromorphic implementation of this circuit, the convergence would occur on the same time scale as the forward propagation.

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2021-04-05T15:25:28.024Z · LW(p) · GW(p)

Well, another advantage of the BNN is of course the high parallelism. But that doesn't change the computational cost (number of FLOPS required), it just spreads it out in parallel.  

comment by Gurkenglas · 2021-04-03T20:47:43.099Z · LW(p) · GW(p)

If they set to 1 they converge in a single backward pass, since they then calculate precisely backprop. Setting to less than that and perhaps mixing up the pass order merely obfuscates and delays this process, but converges because any neuron without incorrect children has nowhere to go but towards correctness. And the entire convergence is for a single input! After which they manually do a gradient step on the weights as usual.

[Preliminary edit: I think this was partly wrong. Replicating...]

It's neat that you can treat activations and parameters by the same update rule, but then you should actually do it. Every "tick", replace the input and label and have every neuron update its parameters and data in lockstep, where every neuron can only look at its neighbors. Of course, this only has a chance of working if the inputs and labels come from a continuous stream, as they would if the input were the output of another network. They also notice the possibility of continuous data. And then one could see how its performance degrades as one speeds up the poor brain's environment :).

: Which has to be in backward order and has to be done once more after the v update line.

Replies from: lsusr
comment by lsusr · 2021-04-04T05:13:27.016Z · LW(p) · GW(p)

Of course, this only has a chance of working if the inputs and labels come from a continuous stream, as they would if the input were the output of another network.

Predictive processing is thus well-suited for BNNs because the real-time sensory data of a living organism, including sensory data preprocessed by another network, is a continuous stream.

comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-03T07:51:19.734Z · LW(p) · GW(p)

I think there's another big implication of this:

Current neural nets seem to require more data (e.g. games of Starcraft) to reach the same level of performance as a human adult. There are different hypotheses as to why this is:

--Human brains are bigger (bigger NN's require less data, it seems?)

--Human brains have a ton of pre-training from which they can acquire good habits and models which generalize to new tasks/situations (we call this "childhood" and "education"). GPT-3 etc. show that something similar works for neural nets, maybe we just need to do more of this with bigger brains.

--Human brains have tons of instincts built in (priors?)

--Human brains have special architectures, various modules that interact in various ways (priors?)

--Human brains don't use Backprop; maybe they have some sort of even-better algorithm

I've heard all of these hypotheses seriously maintained by various people. If this is true it rules out the last one.

Replies from: abramdemski, lsusr
comment by abramdemski · 2021-04-06T17:22:12.030Z · LW(p) · GW(p)

How does it "rule out" the last one??

It does provide a small amount of evidence against it, because it shows one specific algorithm is "basically backprop". Maybe you're saying this is significant evidence, because we have some evidence that predictive coding is also the algorithm the brain actually uses.

But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn't the obvious conclusion from our observations be: humans don't use backprop, but rather, use more data-efficient algorithms?

I'll grant, I'm now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-06T20:28:00.462Z · LW(p) · GW(p)

I guess I was thinking: Brains use predictive coding, and predictive coding is basically backprop, so brains can't be using something dramatically better than backprop. You are objecting to the "brains use predictive coding" step? Or are you objecting that only one particular version of predictive coding is basically backprop?

But we also know there are algorithms which are way more data-efficient than NNs (while being more processing-power intensive). So wouldn't the obvious conclusion from our observations be: humans don't use backprop, but rather, use more data-efficient algorithms?

Are you referring to Solomonoff Induction and the like? I think the "brains use more data-efficient algorithms" is an obvious hypothesis but not an obvious conclusion--there are several competing hypotheses, outlined above. (And I think the evidence against it is mounting, this being one of the key pieces.)

I'll grant, I'm now quite curious how the scaling argument works out. Is it plausible that human-brain-sized NNs are as data-efficient as humans?

In terms of bits/pixels/etc., humans see plenty of data in their lifetime, a bit more than the scaling laws would predict IIRC. But the scaling laws (as interpreted by Ajeya, Rohin, etc.) are about the amount of subjective time the model needs to run before you can evaluate the result. If we assume for humans it's something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it's longer, then the gap in data-efficiency grows.
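
A quick sanity check of the 10^9 figure (my numbers, only meant to pin down the order of magnitude):

```python
seconds_per_year = 365 * 24 * 60 * 60        # about 3.2e7
datapoints = 30 * seconds_per_year           # ~9.5e8: one evaluate-and-update per second over ~30 years
```

so the estimate stays around 10^9 whether you count only waking hours or assume a somewhat shorter or longer lifetime.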

Some issues though. One, the scaling laws might not be the same for all architectures. Maybe if your context window is bigger, or you use recurrence, or whatever, the laws are different. Too early to tell, at least for me (maybe others have more confident opinions, I'd love to hear them!). Two, some data is higher-quality than other data, and plausibly human data is higher-quality than the stuff GPT-3 was fed--e.g. humans deliberately seek out data that teaches them stuff they want to know, instead of just dully staring at a firehose of random stuff. Three, it's not clear how to apply this to humans anyway. Maybe our neurons are updating a hundred times a second or something.

I'd be pretty surprised if a human-brain-sized Transformer was able to get as good as a human at most important human tasks simply by seeing a firehose of 10^9 images or context windows of internet data. But I'd also be pretty surprised (10%) if the scaling laws turn out to be so universal that we can't get around them; if it turns out that transformative tasks really do require a NN at least the size of a human brain trained for at least 10^14 steps or so where each step involves running the NN for at least a subjective week. (Subjective second, I'd find more plausible. Or subjective week (or longer) but with fewer than 10^14 steps.)

Replies from: abramdemski
comment by abramdemski · 2021-04-07T03:57:34.854Z · LW(p) · GW(p)

You are objecting to the "brains use predictive coding" step? Or are you objecting that only one particular version of predictive coding is basically backprop?

Yeah, somewhere along that spectrum. Generally speaking, I'm skeptical of claims that we know a lot about the brain.

Are you referring to Solomonoff Induction and the like?

I was more thinking of genetic programming.

I think the "brains use more data-efficient algorithms" is an obvious hypothesis but not an obvious conclusion--there are several competing hypotheses, outlined above.

I agree with this. 

(And I think the evidence against it is mounting, this being one of the key pieces.)

(I still don't see why.)

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-07T14:29:41.870Z · LW(p) · GW(p)
Yeah, somewhere along that spectrum. Generally speaking, I'm skeptical of claims that we know a lot about the brain.
"(And I think the evidence against it is mounting, this being one of the key pieces.)"
(I still don't see why.)

--I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."

--We are at an impasse here I guess--I think there's mounting evidence that brains use predictive coding and mounting evidence that predictive coding is like backprop. I agree it's not conclusive but this paper seems to be pushing in that direction and there are others like it IIRC. I'm guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am... perhaps because the other hypotheses on my list are less plausible to you?

Replies from: abramdemski, TekhneMakre
comment by abramdemski · 2021-04-07T16:37:33.211Z · LW(p) · GW(p)

--I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."

To give my position somewhat more detail:

  • I think the methods of neuroscience are mostly not up to the task. This is based on the paper which applied neuroscience methods to try to reverse-engineer the CPU.
  • I think what we have are essentially a bunch of guesses about functionality based on correlations and fairly blunt interventional methods (lesioning), combined with the ideas we've come up with about what kinds of algorithms the brain might be running (largely pulling from artificial intelligence for ideas).

I'm guessing you just are significantly more skeptical of both predictive coding and the predictive coding --> backprop link than I am... perhaps because the other hypotheses on my list are less plausible to you?

It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.) However:

  1. There are a lot of different algorithms resembling belief prop. Sticking within the big tent of "variational methods", there are a lot of different variational objectives, which result in different algorithms. The brain could be using a variation which we're unfamiliar with. This could result in significant differences from backprop. (I'm still fond of Hinton's analogy between contrastive divergence and dreaming, for example. It's a bit like saying that dreams are GAN-generated adversarial examples, and the brain trains to anti-learn these examples during the night, which results in improved memory consolidation and conceptual clarity during the day. Isn't that a nice story?)
  2. There are a lot of graphical models besides Bayesian networks. Many of them are "basically the same", but for example SPNs (sum-product networks) are very different. There's a sense in which Bayesian networks assume everything is neatly organized into variables already, while SPNs don't. Also, SPNs are fundamentally faster, so the convergence step in the paper (the step which makes predictive coding 100x slower than belief prop) becomes fast. So SPNs could be a very reasonable alternative, which might not amount to backprop as we know it.
  3. I think it could easily be that the neocortex is explained by some version of predictive coding, but other important elements of the brain are not. In particular, I think the numerical logic of reinforcement learning isn't easily and efficiently captured via graphical models. I could be ignorant here, but what I know of attempts to fit RL into a predictive-processing paradigm ended up using multiplicative rewards rather than additive (so, you multiply in the new reward rather than adding), simply because adding up a bunch of stuff isn't natural in graphical models. I think that's a sign that it's not the right paradigm.
  4. Radical Probabilism / Logical Uncertainty / Logical Induction makes it generally seem pretty probable, almost necessary, that there's also some "non-Bayesian" stuff going on in the brain (ie generalized-bayesian, ie non-bayesian updates). This doesn't seem well-described by predictive coding. This could easily be enough to ruin the analogy between the brain and backprop.
  5. And finally, reiterating the earlier point: there are other algorithms which are more data-efficient than backprop. If humans appear to be more efficient than backprop, then it seems plausible that humans are using a more data-efficient algorithm.

As for the [predictive coding -> backprop] link, well, that's not a crux for me right now, because I was mainly curious why you think such a link, if true, would be evidence against "the brain uses something else that backprop". I think I understand why you would think that, now, sans what the mounting evidence is.

I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency? If so, I have no objection to the hypothesis that the brain uses something more-or-less equivalent to gradient descent.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-30T09:00:35.958Z · LW(p) · GW(p)

Thanks for this reply!

--I thought the paper about the methods of neuroscience applied to computers was cute, and valuable, but I don't think it's fair to conclude "methods are not up to the task." But you later said that "It makes a lot of sense to me that the brain does something resembling belief propagation on bayes nets. (I take this to be the core idea of predictive coding.)" so you aren't a radical skeptic about what we can know about the brain so maybe we don't disagree after all.

1 - 3: OK, I think I'll defer to your expertise on these points.

4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn't mean that the brain isn't running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN's trained via backprop might also stumble across similar networks which would then do similarly cool stuff.

I think my main crux is the question: (for some appropriate architecture, ie, not necessarily transformers) do human-brain-sized networks, with human-like opportunities for transfer learning, achieve human-level data-efficiency?

Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN's do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)

Replies from: abramdemski
comment by abramdemski · 2021-04-30T13:47:19.200Z · LW(p) · GW(p)

4, 5: Whoa whoa, just because we humans do some non-bayesian stuff and some better-than-backprop stuff doesn't mean that the brain isn't running pure bayes nets or backprop-approximation or whatever at the low level! That extra fancy cool stuff we do could be happening at a higher level of abstraction. Networks in the brain learned via backprop-approximation could themselves be doing the logical induction stuff and the super-efficient-learning stuff. In which case we should expect that big NN's trained via backprop might also stumble across similar networks which would then do similarly cool stuff.

How could this address point #5? If GD is slow, then GD would be slow to learn faster learning methods.

All of the following is intended as concrete examples against the pure-bayes-brain hypothesis, not as evidence against the brain doing some form of GD:

  1. One thing the brain could be doing under the hood is some form of RL using value-prop. This is difficult to represent in Bayes nets. The attempts I've seen end up making reward multiplicative rather than additive across time, which makes sense because Bayes nets are great at multiplying things but not so great at representing additive structures. I think this is OK (we could regard it as an exponential transform of usual reward) until we want to represent temporal discounting. Another problem with this is: representing via graphical models means representing the full distribution over reward values, rather than a point estimate. But this is inefficient compared with regular tabular RL.
  2. Another thing the brain could be doing under the hood is "memory-network" style reasoning which learns a policy for utilizing various forms of memory (visual working memory, auditory working memory, episodic memory, semantic memory...) for reasoning. Because this is fundamentally about logical uncertainty (being unsure about the outcome at the end of some mental work), it's not very well-represented by Bayesian models. It probably makes more sense to use (model-free) RL to learn how to use WM.

 Of course both of those objections could be overcome with a specific sort of work, showing how to represent the desired algorithm in bayes nets.

As for GD:

  • My back of the envelope calculation suggests that GPT-3 has trained on 7 orders of magnitude more data than a 10yo has experienced in their lifetime. Of course a different NN architecture (+ different task, different loss functions, etc) could just be that much more efficient than transformers; but overall, this doesn't look good for the human-GD hypothesis.
    • Maybe your intention is to argue that we use GD with a really good prior, though! This seems much harder to dismiss.
  • Where does the gradient come from? [LW · GW] Providing a gradient is a difficult problem which requires intelligence.
    • Even if the raw loss(/reward) function is simple and fixed, it's difficult to turn that into a gradient for learning, because you don't know how to attribute punishment/loss to specific outputs (actions or cognitive acts). The dumb method, policy-gradient, is highly data inefficient due to attributing reward/punishment to all recent actions (frequently providing spurious gradients which adjust weights up/down noisily).
    • But, quite possibly, the raw loss/reward function is not simple/fixed, but rather, requires significant inference itself. An example of this is imprinting. 

The last two sub-points only argue "against GD" in so far as you mean to suggest that the brain "just uses GD" (where "just" is doing a lot of work). My claim there is that more learning principles are needed (for example, model-based learning) to understand what's going on.

Indeed. Your crux is my question and my crux is your question. (My crux was: Does the brain, at the low level, use something more or less equivalent to the stuff modern NN's do at a low level? From this I hoped to decide whether human-brain-sized networks could have human-level efficiency)

Since the brain is difficult to pin down but ML experiments are not, I would think the more natural direction of inference would be to check the scaling laws and see whether it's plausible that the brain is within the same regime.

Replies from: anonce
comment by anonce · 2023-07-31T17:40:52.325Z · LW(p) · GW(p)

Thanks for the great back-and-forth! Did you guys see the first author's comment [LW(p) · GW(p)]? What are the main updates you've had re this debate now that it's been a couple years?

Replies from: abramdemski
comment by abramdemski · 2023-08-08T16:46:05.886Z · LW(p) · GW(p)

I have not thought about these issues too much in the intervening time. Re-reading the discussion, it sounds plausible to me that the evidence is compatible with roughly brain-sized NNs being roughly as data-efficient as humans. Daniel claims: 

If we assume for humans it's something like 1 second on average (because our brains are evaluating-and-updating weights etc. on about that timescale) then we have a mere 10^9 data points, which is something like 4 OOMs less than the scaling laws would predict. If instead we think it's longer, then the gap in data-efficiency grows.

I think the human observation-reaction loop is closer to ten times that fast, which results in a 3 OOM difference. This sounds like a gap which is big, but could potentially be explained by architectural differences or other factors, thus preserving a possibility like "human learning is more-or-less gradient descent". Without articulating the various hypotheses in more detail, this doesn't seem like strong evidence in any direction.

Did you guys see the first author's comment [LW(p) · GW(p)]? 

Not before now. I think the comment had a relatively high probability in my world, where we still have a poor idea of what algorithm the brain is running, and a low probability in Daniel's world, where evidence is zooming in on predictive coding as the correct hypothesis. Some quotes which I think support my hypothesis better than Daniel's:

If we (speculatively) associate alpha/beta waves with iterations in predictive coding,

This illustrates how we haven't pinned down the mechanical parts of algorithms. What this means is that speculation about the algorithm of the brain isn't yet causally grounded -- it's not as if we've been looking at what's going on and can build up a firm abstract picture of the algorithm from there, the way you might successfully infer rules of traffic by watching a bunch of cars. Instead, we have a bunch of different kinds of information at different resolutions, which we are still trying to stitch together into a coherent picture.

While it's often claimed that predictive coding is biologically plausible and the best explanation for cortical function, this isn't really all that clear cut. Firstly, predictive coding itself actually has a bunch of implausibilities. Predictive coding suffers from the same weight transport problem as backprop, and secondly it requires that the prediction and prediction error neurons are 1-1 (i.e. one prediction error neuron for every prediction neuron) which is way too precise connectivity to actually happen in the brain. I've been working on ways to adapt predictive coding around these problems as in this paper (https://arxiv.org/pdf/2010.01047.pdf), but this work is currently very preliminary and its unclear if the remedies proposed here will scale to larger architectures.

This directly addresses the question of how clear-cut things are right now, while also pointing to many concrete problems the predictive coding hypothesis faces. The comment continues on that subject for several more paragraphs. 

The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning -- just one that has backprop as a subroutine. Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single 'particle' following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs. 

This paragraph supports my picture that hypotheses about what the brain is doing are still largely being pulled from ML, which speaks against the hypothesis of a growing consensus about what the brain is doing, and also illustrates the lack of direct looking-at-the-brain-and-reporting-what-we-see. 

On the other hand, it seems quite plausible that this particular person is especially enthusiastic about analogizing ML algorithms and the brain, since that is what they work on; in which case, this might not be so much evidence about the state of neuroscience as a whole. Some neuroscientist could come in and tell us why all of this stuff is bunk, or perhaps why Predictive Coding is right and all of the other ideas are wrong, or perhaps why the MCMC thing is right and everything else is wrong, etc etc.

But I take it that Daniel isn't trying to claim that there is a consensus in the field of neuroscience; rather, he's probably trying to claim that the actual evidence is piling up in favor of predictive coding. I don't know. Maybe it is. But this particular domain expert doesn't seem to think so, based on the SSC comment.

comment by TekhneMakre · 2021-04-07T15:24:00.105Z · LW(p) · GW(p)
--Human brains have special architectures, various modules that interact in various ways (priors?)
--Human brains don't use Backprop; maybe they have some sort of even-better algorithm

This is a funny distinction to me. These things seem like two ends of a spectrum (something like, the physical scale of "one unit of structure"; predictive coding is few-neuron-scale, modules are big-brain-chunk scale; in between, there's micro-columns, columns, lamina, feedback circuits, relays, fiber bundles; and below predictive coding there's the rules for dendrite and synapse change).


I wouldn't characterize my own position as "we know a lot about the brain." I think we should taboo "a lot."
I think there's mounting evidence that brains use predictive coding

Are you saying, there's mounting evidence that predictive coding screens off all lower levels from all higher levels? Like all high-level phenomena are the result of predictive coding, plus an architecture that hooks up bits of predictive coding together?

comment by lsusr · 2021-04-03T08:21:02.458Z · LW(p) · GW(p)

Daniel Kokotajlo was the person who originally pointed me to this article. Thank you!

There is no question that human brains have tons of instincts built-in. But there is a hard limit on how much information a single species' instincts can contain. It is implausible that human beings' cognitive instincts contain significantly more information than the human genome (750 megabytes). I expect our instincts contain much less.

Human brains definitely have special architectures too, like the hippocampus. The critical question is how important these special architectures are. Are our special architectures critical to general intelligence or are they just speed hacks? If they are speed hacks then we can outrace them by building a bigger computer or writing more efficient algorithms.

There is no doubt that humans transmit more cultural knowledge than other animals. This has to do with language. (More specifically, I think our biology underpinning language hit a critical point around 50,000 years ago.) Complex grammar is not present in any non-human animal. Wernicke's area is involved. Wernicke's area could be a special architecture.

How important are the above human advantages? I believe that taking a popular ANN architecture and merely scaling it up will not enable a neural network to compete with humans at StarCraft with equal quantities of training data. If, in addition, the ANN is not allowed to utilize transfer learning then I am willing to publicly bet money on this prediction. (The ANN must be restricted to a human rate of actions-per-second. The ANN does not get to play via an API or similar hand-coded preprocessor. If the ANN watches videos of other players then that counts towards its training data.)

Replies from: daniel-kokotajlo, TekhneMakre
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-03T11:48:03.075Z · LW(p) · GW(p)

If the ANN can't use transfer learning, that's pretty unfair, since the human can. (It's not like a baby can play Starcraft straight out of the womb; humans can learn Starcraft but only after years of pre-training on diverse data in diverse environments)

Replies from: lsusr
comment by lsusr · 2021-04-03T18:01:03.579Z · LW(p) · GW(p)

Good point. Transfer learning is allowed but it still counts towards the total training data where "training data" is now everything a human can process over a lifetime.

Replies from: noah-walton
comment by Noah Walton (noah-walton) · 2021-04-12T15:19:29.822Z · LW(p) · GW(p)

Once you add this condition, are current state-of-the-art Starcraft-learning ANNs still getting more training data than humans?

comment by TekhneMakre · 2021-04-07T15:16:23.122Z · LW(p) · GW(p)
It is implausible that human beings' cognitive instincts contain significantly more information than the human genome (750 megabytes). I expect our instincts contain much less.

Our instincts contain pointers to learning from other humans, which contain lots of cognitive info. The pointer is small, but that doesn't mean the resulting organism is algorithmically that simple.

Replies from: daniel-kokotajlo
comment by Daniel Kokotajlo (daniel-kokotajlo) · 2021-04-07T17:57:27.399Z · LW(p) · GW(p)

That seems plausible, but AIs can have pointers to learning from other humans too. E.g. GPT-3 read the Internet, if we were making some more complicated system it could evolve pointers analogous to the human pointers. I think.

comment by lumenwrites · 2021-04-04T06:23:13.643Z · LW(p) · GW(p)

You guys will probably find this Slate Star Codex post interesting:

https://slatestarcodex.com/2017/09/05/book-review-surfing-uncertainty/

Scott summarizes the Predictive Processing theory, explains it in a very accessible way (no math required), and uses it to explain a whole bunch of mental phenomena (attention, imagination, motor behavior, autism, schizophrenia, etc.)

Can someone ELI5/TLDR this paper for me, explain in a way more accessible to a non-technical person?

- How does backprop work if the information can't flow backwards?
- In Scott's post, he says that when lower-level sense data contradicts high-level predictions, high-level layers can override lower-level predictions without you noticing it. But if low-level sense data has high confidence/precision, the higher levels notice it and you experience "surprise". Which one of those is equivalent to the backprop error? Is it low-level predictions being overridden, or high-level layers noticing the surprise, or something else, like changing the connections between neurons to train the network and learn from the error somehow?

Replies from: samshap
comment by samshap · 2021-04-05T14:33:52.767Z · LW(p) · GW(p)

TLDR for this paper: There is a separate set of 'error' neurons that communicate backwards. Their values converge on the appropriate backpropagation terms.

A large error at the top levels corresponds to 'surprise', while a large error at the lower levels corresponds more to the 'override'.

comment by harsimony · 2021-04-03T15:56:58.399Z · LW(p) · GW(p)

Is it known whether predictive coding is easier to train than backprop? Local learning rules seem like they would be more parallelizable.

Every Model Learned by Gradient Descent Is Approximately a Kernel Machine seems relevant to this discussion:

We show, however, that deep networks learned by the standard gradient descent algorithm are in fact mathematically approximately equivalent to kernel machines, a learning method that simply memorizes the data and uses it directly for prediction via a similarity function (the kernel). This greatly enhances the interpretability of deep network weights, by elucidating that they are effectively a superposition of the training examples. The network architecture incorporates knowledge of the target function into the kernel.
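
For reference, the form shown in that paper is roughly (my paraphrase) $y(x) \approx \sum_i a_i \, K(x, x_i) + b$, where $K$ is the "path kernel" (the dot product of model gradients $\nabla_w y(x) \cdot \nabla_w y(x_i)$, accumulated along the trajectory gradient descent takes through weight space) and the $a_i$ are loss-derivative terms accumulated along the same path. The caveat in the reply below is about those $a_i$ not being constants.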

Replies from: rhaps0dy, lsusr
comment by Adrià Garriga-alonso (rhaps0dy) · 2021-04-03T16:16:57.526Z · LW(p) · GW(p)

Not particularly relevant, I think, but interesting nonetheless.

A first drawback of this paper is that its conclusion assumes that the NN underneath trains with gradient flow (GF), which is the continuous-time version of gradient descent (GD). This is a good assumption if the learning rate is very small, and the resulting GD dynamics closely track the GF differential equation.

This does not seem to be the case in practice. Larger initial learning rates help get better performance (https://arxiv.org/abs/1907.04595), so people use them in practice. If what people use in practice was well-approximated by GF, then smaller learning rates would give the same result. You can use another differential equation that does seem to approximate GD fairly well (http://ai.stanford.edu/blog/neural-mechanics/), but I don't know if the math from the paper still works out.

Second, as the paper points out, the kernel machine learned by GD is a bit strange in that the coefficients $a_i$ for weighing different $K(x, x_i)$ depend on $x$. Thus, the resulting output function is not in the reproducing kernel Hilbert space of the kernel that is purported to describe the NN. As a result, as kernel machines go, it's pretty weird. I expect that a lot of the analysis about the output of the learning process (learning theory etc) assumes that the $a_i$ do not depend on the test input $x$.

Replies from: harsimony
comment by harsimony · 2021-04-03T17:03:18.826Z · LW(p) · GW(p)

Good point!

Do you know of any work that applies similar methods to study the equivalent kernel machine learned by predictive coding?

Replies from: rhaps0dy
comment by Adrià Garriga-alonso (rhaps0dy) · 2021-04-03T22:53:53.519Z · LW(p) · GW(p)

I don't, and my best guess is that nobody has done it :) The paper you linked is extremely recent.

You'd have to start by finding an ODE model of predictive coding, which I suppose is possible taking limit of the learning rate to 0.
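
A minimal version of what that limit might look like (my reading of the suggestion, not something worked out in either paper): keep the predictive-coding energy $F = \sum_\ell \tfrac{1}{2}\lVert v_\ell - f(W_\ell v_{\ell-1})\rVert^2$ and replace the discrete relaxation and weight steps with their gradient-flow ODEs,

$\dot{v}_\ell = -\partial F / \partial v_\ell, \qquad \dot{W}_\ell = -\partial F / \partial W_\ell$,

then ask whether an analogue of the path-kernel analysis goes through for that coupled system.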

comment by lsusr · 2021-04-03T18:28:54.208Z · LW(p) · GW(p)

Due to the need to iterate the vs until convergence, the predictive coding network had roughly a 100x greater computational cost than the backprop network.

The paper claims that predictive coding takes more compute. I agree that predictive coding ought to be more parallelizable. If you are using a GPU then backpropagation is already sufficiently parallelizable. However, it may be that neuromorphic hardware could parallelize better than a GPU, thus producing an increase in compute power that outstrips the 100x greater computational cost of the algorithm itself.

Replies from: samshap
comment by samshap · 2021-04-05T14:40:07.675Z · LW(p) · GW(p)

Kind of. Neuromorphic hardware doesn't buy you too much benefit for generic feedforward networks, but it dramatically reduces the expense of convergence. Since the 100x in this paper derives from iterating until the network converges, a neuromorphic implementation (say on Loihi) would directly eliminate that cost.

comment by samshap · 2021-04-03T03:28:51.439Z · LW(p) · GW(p)

Thanks for sharing!

Two comments:

  • There seem to be a couple of sign errors in the manuscript. (Probably worth reaching out to the authors directly)
  • Their predictive coding algorithm holds the vhat values fixed during convergence, which actually implies a somewhat different network topology than the more traditional one shown in your figure.
Replies from: lsusr, Chris_Leong
comment by lsusr · 2021-04-03T08:25:51.349Z · LW(p) · GW(p)

What are the sign errors?

Replies from: samshap
comment by samshap · 2021-04-03T13:01:28.170Z · LW(p) · GW(p)

Right side of equation 2. Also the v update step in algorithm 1 should have a negative sign (the text version earlier on the same page has it right).

comment by Chris_Leong · 2021-04-03T07:08:00.431Z · LW(p) · GW(p)

vhat?

Replies from: lsusr
comment by lsusr · 2021-04-03T07:45:07.529Z · LW(p) · GW(p)

hat = ^, the circumflex in $\hat{v}$.

comment by Chris van Merwijk (chrisvm) · 2021-04-05T14:54:29.152Z · LW(p) · GW(p)

I suspect a better title would be "Here is a proposed unification of a particular formalization of predictive coding, with backprop"

comment by anonce · 2023-07-31T17:37:35.343Z · LW(p) · GW(p)

The paper's first author, beren [LW · GW], left a detailed comment on the ACX linkpost, painting a more nuanced and uncertain (though possibly outdated by now?) picture. To quote the last paragraph:

"The brain being able to do backprop does not mean that the brain is just doing gradient descent like we do to train ANNs. It is still very possible (in my opinion likely) that the brain could be using a more powerful algorithm for inference and learning -- just one that has backprop as a subroutine. Personally (and speculatively) I think it's likely that the brain performs some highly parallelized advanced MCMC algorithm like Hamiltonian MCMC where each neuron or small group of neurons represents a single 'particle' following its own MCMC path. This approach naturally uses the stochastic nature of neural computation to its advantage, and allows neural populations to represent the full posterior distribution rather than just a point prediction as in ANNs."

One of his subcomments went into more detail on this point.

comment by ryan_b · 2021-04-14T20:37:08.646Z · LW(p) · GW(p)

I see in the citations for this they already have the Neural ODE paper and libraries.  Which means the whole pipeline also has access to all our DiffEq tricks.

In terms of timelines, this seems bad: unless there are serious flaws with this line of work, we just plugged our very best model of how the only extant general intelligences work into our fastest-growing field of capability.

comment by Chris_Leong · 2021-04-03T07:23:15.742Z · LW(p) · GW(p)

I'm confused. The bottom diagram seems to also involve bidirectional flow of information.

Replies from: lsusr
comment by lsusr · 2021-04-03T07:53:48.925Z · LW(p) · GW(p)

The black circles represent neurons. The red triangles represent activations (action potentials). Action potentials' information content is shared between presynaptic neurons and postsynaptic neurons because activations are transmitted from the presynaptic neuron to the postsynaptic neuron.

The black arrows in the bottom diagram denote the physical creation of action potentials. The red arrows denote intra-neuron calculation of the gradient. Keep in mind that each neuron knows both the action potential it generates itself and the action potentials sent to it.

Replies from: Chris_Leong
comment by Chris_Leong · 2021-04-04T01:00:49.785Z · LW(p) · GW(p)

Oh, that makes a lot more sense. Is delta v1 hat the change of v1 rather than an infinitesimal? (Asking because if it were, then it'd be easier to understand how it is calculated.)

comment by Ilio · 2023-08-09T12:53:43.698Z · LW(p) · GW(p)

"a consensus has emerged that the brain cannot directly implement backprop, since to do so would require biologically implausible connection rules"

There’s no such emerging consensus; that’s actually from F. Crick’s personal opinions in the nineties. Here’s a more recent take:

https://www.cell.com/trends/cognitive-sciences/fulltext/S1364-6613(19)30012-9?_returnURL=https%3A%2F%2Flinkinghub.elsevier.com%2Fretrieve%2Fpii%2FS1364661319300129%3Fshowall%3Dtrue

comment by George3d6 · 2021-04-04T16:19:33.698Z · LW(p) · GW(p)

There is no relationship between neurons and the "neurons" of an ANN. It's just a naming mishap at this point.

Replies from: samshap
comment by samshap · 2021-04-05T14:53:03.160Z · LW(p) · GW(p)

Incorrect. Perceptrons are a low fidelity (but still incredibly useful!) rate-encoded model of individual neurons.

Replies from: aa-m-sa
comment by Aaro Salosensaari (aa-m-sa) · 2021-04-05T21:13:27.387Z · LW(p) · GW(p)

Sure, but statements like

>ANNs are built out of neurons. BNNs are built out of neurons too.

are imprecise, and possibly imprecise enough to also be incorrect, if it turns out that biological neurons do something important that perceptrons don't. Without making the exact arguments and presenting evidence for the respects in which the perceptron model is useful, it is quite easy to bake conclusions along the lines of "this algorithm for ANNs is a good model of biology" into the assumption that "both are built out of neurons".