Parameter vs Synapse?

post by Bjartur Tómas · 2021-03-11T15:30:59.745Z · LW · GW · 4 comments

In terms of AI timelines, the biggest question I haven't seen addressed is the computational equivalence of a synapse vs a parameter in modern neural nets. This seems like a very important input for any prediction on when we will have human-level AI.

Moravec's comparison of the retina with the then-current edge-detection methods is of little use if we assume that AI will be built using modern learning methods. Feature-engineered code is plausibly much more compute-efficient than learned policies on tasks that are comprehensible to human programmers, and Moravec compared the retina with the former.

To pump this intuition: if we assume that 1 synapse == 1 parameter, then Moravec's estimates are more than 6 orders of magnitude too low. The largest models we are able to train, on computers more powerful than those Moravec predicted would be needed for AGI, have at most 10 billion parameters, which is about as many synapses as the retina has. It would be more accurate, though, to say that we are memory-bandwidth constrained, since we train neural networks at a much higher temporal resolution than human brains operate at.
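A rough sketch of the parameter-count side of this comparison, under the 1 synapse == 1 parameter assumption. The 10-billion-parameter figure is the one given above; the whole-brain synapse count of about 1e14 is a commonly cited ballpark that is assumed here and is uncertain by roughly an order of magnitude. The helper name `orders_of_magnitude_gap` is made up for illustration. Note that this measures a gap in parameter count only, which is a different quantity from the gap in Moravec's compute estimate.

```python
import math


def orders_of_magnitude_gap(target: float, current: float) -> float:
    """Return log10(target / current), i.e. how many factors of 10 separate them."""
    return math.log10(target / current)


# From the post: the largest trainable models have at most ~10 billion parameters.
LARGEST_MODEL_PARAMS = 10e9

# ASSUMPTION: ballpark whole-brain synapse count; not a figure from the post.
ASSUMED_BRAIN_SYNAPSES = 1e14

gap = orders_of_magnitude_gap(ASSUMED_BRAIN_SYNAPSES, LARGEST_MODEL_PARAMS)
print(f"Parameter gap if 1 synapse == 1 parameter: ~{gap:.1f} orders of magnitude")
```

If a parameter turns out to be worth more or less than a synapse, the gap shifts by the log10 of that exchange rate, which is why the exchange rate matters so much for timelines.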

A very interesting question, then, is how many parameters a modern learned model needs in order to have capabilities plausibly similar to those of a human retina. This seems hard to investigate but not impossible. If we look at the first few layers of a conv net, for example, they seem to be doing the sort of edge detection that the retina does.
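As a starting point for that kind of investigation, one can at least count the parameters in the early layers of a conv net. The sketch below uses the standard formula for a 2D convolution layer (kernel height × kernel width × input channels × output channels, plus one bias per output channel). The three layer shapes are hypothetical, chosen only for illustration, and are not taken from any particular published model. Counting parameters says nothing by itself about whether the layers match retinal capability.

```python
def conv2d_param_count(kernel_size: int, in_channels: int,
                       out_channels: int, bias: bool = True) -> int:
    """Number of learned parameters in a square-kernel 2D convolution layer."""
    weights = kernel_size * kernel_size * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# HYPOTHETICAL early layers: (kernel_size, in_channels, out_channels).
HYPOTHETICAL_EARLY_LAYERS = [(7, 3, 64), (3, 64, 64), (3, 64, 128)]

per_layer = [conv2d_param_count(k, c_in, c_out)
             for k, c_in, c_out in HYPOTHETICAL_EARLY_LAYERS]
total_early_params = sum(per_layer)

print("parameters per layer:", per_layer)
print("total in early layers:", total_early_params)
```

One caveat on the comparison: convolution weights are shared across image positions, whereas each physical synapse is a separate connection, so a raw parameter count may understate the number of synapse-like connections a conv layer implements.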

I think this would be a high-leverage thing to investigate, as both parameter > synapse and parameter < synapse are not implausible at first glance. Some people think the brain is doing something mysterious and better than backpropagation. Some people think it is just doing a worse, Rube Goldberg approximation of backpropagation. Artificial spiking neural networks perform really poorly compared to standard methods, which may imply parameter > synapse, etc.

If we assume AGI will come from scaling up current methods (rather than, as Moravec predicted, very efficient programs that imitate the aggregate function of thousand-neuron assemblies) then this question is very pertinent to any prediction. Is anyone here aware of any work on this?

4 comments


comment by gwern · 2019-12-29T05:36:36.905Z · LW(p) · GW(p)

Drexler's recent AI whitepaper had some arguments in a similar vein about functional equivalence and necessary compute and comparing CNNs with the retina or visual cortex, so you might want to look at that.

comment by Gerald Monroe (gerald-monroe) · 2021-03-12T09:12:05.142Z · LW(p) · GW(p)

I have one thing to add that may not be fully appreciated elsewhere, but is clearly true.

When an action potential rushes through an axon to a synapse, which fine details do the laws of physics even allow to interfere? Some of these endless arguments about how complex a neuron can be remind me of the audio reproduction arguments. Succinctly, the information you see at the end of a cable (any cable, including one in a brain) is a pulse made up of sine-wave components, with the highest-frequency component set by the bandwidth. Square peaks are always rounded off.
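The rounding-off of square edges can be illustrated with a Fourier partial sum. This is a minimal sketch of an idealized band-limited square wave, not a model of an axon; the function names and the choice of a unit-period square wave are assumptions made for illustration. Keeping only harmonics up to `max_harmonic` stands in for a channel with limited bandwidth.

```python
import math


def band_limited_square(t: float, max_harmonic: int) -> float:
    """Fourier partial sum of a unit square wave with period 1,
    keeping only odd harmonics up to max_harmonic."""
    return (4 / math.pi) * sum(
        math.sin(2 * math.pi * k * t) / k
        for k in range(1, max_harmonic + 1, 2)
    )


def edge_slope(max_harmonic: int, dt: float = 1e-6) -> float:
    """Numerical slope of the rising edge at t = 0 (the wave is 0 there)."""
    return band_limited_square(dt, max_harmonic) / dt


for max_harmonic in (1, 9, 99):
    print(f"max harmonic {max_harmonic:>2}: edge slope ~ {edge_slope(max_harmonic):.1f}")
```

Differentiating the sum term by term gives a slope at t = 0 of 8 per retained harmonic, so the edge is never vertical for finite bandwidth, and its steepness grows in proportion to the number of harmonics kept. Detail finer than the bandwidth allows simply does not arrive at the far end.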

What this means is that any factor that doesn't affect the timing of the synapse enough for the firing of the next synapse to be affected has zero net effect. 

Similar arguments apply to the effects of signal to noise for learning and other long term adjustments for a synapse.

What this means is that a digital equivalent need only duplicate the signal part. It may turn out that 1 bit of resolution is more than enough.

comment by Steven Byrnes (steve2152) · 2019-12-29T21:56:30.001Z · LW(p) · GW(p)

See AI Impacts articles on 'Human Level Hardware' if you haven't already. I haven't dug into it myself, but I agree that your question is a good one.

A simpler related question that I don't know the answer to off-hand is: what prevents trillion-parameter NNs? Does the training-data requirement scale with network size? (In which case, I don't expect this to be a problem for long, because I expect we'll find algorithms with human-level data efficiency before we get AGI.) Or is it just the limited memory capacity per GPU, and the hassle / overhead / cost of parallelization? (In which case, again, I expect we'll get dramatically more parallelizable algorithms in the near future.)
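A back-of-the-envelope sketch of the memory side of that question. It assumes 32-bit floats and a plain Adam-style optimizer that stores a weight, a gradient, and two moment estimates per parameter; the 32 GB per-GPU figure is a hypothetical value chosen for illustration, and activations, batch data, and communication buffers are ignored entirely, so real requirements would be higher.

```python
import math


def training_memory_bytes(n_params: float, bytes_per_value: int = 4,
                          values_per_param: int = 4) -> float:
    """Memory for parameter-sized state only.

    values_per_param=4 models weight + gradient + two Adam moment estimates;
    values_per_param=1 models the weights alone.
    """
    return n_params * bytes_per_value * values_per_param


N_PARAMS = 1e12                    # the trillion-parameter network in question
ASSUMED_GPU_MEMORY_BYTES = 32e9    # HYPOTHETICAL per-GPU memory, for illustration

weights_only = training_memory_bytes(N_PARAMS, values_per_param=1)
with_optimizer_state = training_memory_bytes(N_PARAMS)
gpus_needed = math.ceil(with_optimizer_state / ASSUMED_GPU_MEMORY_BYTES)

print(f"weights alone:        {weights_only / 1e12:.0f} TB")
print(f"with optimizer state: {with_optimizer_state / 1e12:.0f} TB")
print(f"GPUs just to hold it: {gpus_needed}")
```

Under these assumptions the model state alone has to be split across hundreds of devices, which is where the hassle and overhead of parallelization come in.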