Parameter vs Synapse?
post by Bjartur Tómas · 2021-03-11T15:30:59.745Z · LW · GW · 4 comments
In terms of AI timelines, the biggest question I haven't seen addressed is the computational equivalence of a synapse vs a parameter in modern neural nets. This seems like a very important input for any prediction on when we will have human-level AI.
Moravec's comparison of the retina with the then-current edge-detection methods is of little use if we assume that AI will be built using modern learning methods. Feature-engineered code is plausibly much more compute-efficient than learned policies on tasks that are comprehensible to human programmers, and Moravec compared the retina with feature-engineered code.
To pump this intuition: if we assume that 1 synapse == 1 parameter, then Moravec's estimates are more than 6 orders of magnitude too low. The largest models we are able to train, on computers more powerful than those Moravec predicted would be needed for AGI, have at most 10 billion parameters, which is about as many synapses as the retina has. It would be more accurate, though, to say that we are memory-bandwidth constrained, since we train neural networks at a much higher temporal resolution than human brains.
A very interesting question, then, is how many parameters a modern learned model needs in order to have capabilities plausibly similar to those of a human retina. This seems hard to investigate but not impossible. If we look at the first few layers of a conv net, for example, they seem to be doing the sort of edge detection that the retina does.
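As a rough way to start on the conv-net side of that comparison, the sketch below counts the learned parameters in a first convolutional layer and sets them next to a hand-engineered Sobel edge detector. The layer shapes (an AlexNet-style 3-to-96-channel 11x11 layer and a VGG-style 3-to-64-channel 3x3 layer) are illustrative assumptions, not figures from this post, and the helper name `conv2d_param_count` is made up for the example.

```python
def conv2d_param_count(in_channels, out_channels, kernel_size, bias=True):
    """Number of learned parameters in a standard 2D convolution layer."""
    kernel_h, kernel_w = kernel_size
    # One kernel_h x kernel_w filter per (input channel, output channel) pair.
    weights = out_channels * in_channels * kernel_h * kernel_w
    # Optionally one bias term per output channel.
    return weights + (out_channels if bias else 0)


# Hand-engineered edge detector: two fixed 3x3 Sobel kernels, nothing learned.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]
sobel_coefficients = sum(len(row) for kernel in (SOBEL_X, SOBEL_Y) for row in kernel)

# Illustrative first-layer shapes (assumptions, see the text above).
alexnet_style = conv2d_param_count(3, 96, (11, 11))
vgg_style = conv2d_param_count(3, 64, (3, 3))

print("Sobel coefficients:", sobel_coefficients)
print("AlexNet-style first layer parameters:", alexnet_style)
print("VGG-style first layer parameters:", vgg_style)
```

Under these assumptions the first layer holds on the order of thousands to tens of thousands of parameters, against 18 fixed coefficients for the Sobel pair. Whether any such layer actually matches the retina's capabilities is exactly the open question; the count only shows how cheap the part of a conv net that looks like edge detection is in parameter terms.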
I think this would be a high-leverage thing to investigate, as both parameter > synapse and parameter < synapse are not implausible at first glance. Some people think the brain is doing something mysterious and better than backpropagation. Some people think it is just doing a worse, Rube Goldberg approximation of backpropagation. Artificial spiking neural networks perform really poorly compared to standard methods, which may imply parameter > synapse, etc.
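One way to see why the direction of the inequality matters is to treat it as an exchange rate and look at how the required model size moves. The sketch below is only that, a sketch: the 10 billion figure is the one used in this post, the whole-brain synapse count of 1e14 is a rough outside assumption, and the exchange rates are arbitrary illustrative values. It does not try to reproduce the comparison with Moravec's compute estimates.

```python
import math


def equivalent_parameters(synapses, params_per_synapse):
    """Parameters needed to match a synapse count at a given exchange rate."""
    return synapses * params_per_synapse


def orders_of_magnitude_gap(required_params, available_params):
    """How many powers of ten separate the required and available model sizes."""
    return math.log10(required_params / available_params)


BRAIN_SYNAPSES = 1e14        # assumption: rough whole-brain synapse count
LARGEST_MODEL_PARAMS = 1e10  # the post's figure for the largest trainable models

# Rates below 1 mean a parameter is worth more than a synapse (parameter > synapse);
# rates above 1 mean a synapse is worth more than a parameter (parameter < synapse).
for rate in (0.01, 1.0, 100.0):
    needed = equivalent_parameters(BRAIN_SYNAPSES, rate)
    gap = orders_of_magnitude_gap(needed, LARGEST_MODEL_PARAMS)
    print(f"{rate:g} params per synapse: {needed:.0e} params needed, gap {gap:+.1f} OOM")
```

The point of the exercise is that each factor of ten of uncertainty in the exchange rate shifts the gap by a full order of magnitude, so even a crude empirical bound on the rate would move timeline estimates noticeably.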
If we assume AGI will come from scaling up current methods (rather than, as Moravec predicted, very efficient programs that imitate the aggregate function of thousand-neuron assemblies) then this question is very pertinent to any prediction. Is anyone here aware of any work on this?
Comments sorted by top scores.