$250 prize for checking Jake Cannell's Brain Efficiency

post by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-26T16:21:06.035Z · LW · GW · 170 comments

This is to announce a $250 prize for spotchecking [? · GW] or otherwise in-depth reviewing Jacob Cannell's technical claims concerning thermodynamic & physical limits on computation, and the claim of biological efficiency of the brain, in his post Brain Efficiency: Much More Than You Wanted To Know [LW · GW].

I've been quite impressed by Jake's analysis ever since it came out. I have been puzzled why there has been so little discussion about his analysis since if true it seems to be quite important. That said, I have to admit I personally cannot assess whether the analysis is correct. This is why I am announcing this prize.

Whether Jake's claims concerning DOOM & FOOM [LW · GW] really follow from his analysis is up for debate. Regardless, to me it seems to have large implications for how the future might go and what future AI will look like.

  1. I will personally judge whether I think an entry warrants a prize.[1] 
     
  2. If you are also interested in seeing this situation resolved, I encourage you to increase the prize pool! 

 

EDIT: some clarifications
- You are welcome to discuss DOOM & FOOM and the relevance or lack thereof of Jake's analysis, but note I will only consider (spot)checking of Jacob Cannell's technical claims.
- In case of multiple serious entries I will do my best to fairly split the prize money.
- Note I will not be judging who will be right. Instead, I will judge whether the entry has seriously engaged with Jacob Cannell's technical claims in a way that moves the debate forward. That is, I will award points for 'pushing the depth of the debate tree' beyond what it was before.
- By technical claims I mean to encompass all technical claims made in the brain efficiency post, broadly construed, as well as claims made by Jacob Cannell in other posts/comments.
These claims especially include: limits to energy efficiency, interconnect losses, the Landauer limit, convection vs blackbody radiation, claims concerning the effective working memory of the human brain versus that of computers, the end of Moore's law, CPU vs GPU vs neuromorphic chips, etc.
Here's Jacob Cannell's own summary of his claims:

1.) Computers are built out of components which are also just simpler computers, which bottoms out at the limits of miniaturization in minimal molecular sized (few nm) computational elements (cellular automata/tiles). Further shrinkage is believed impossible in practice due to various constraints (overcoming these constraints if even possible would require very exotic far future tech).

2.) At this scale the Landauer bound represents the ambient temperature dependent noise (which can also manifest as a noise voltage). Reliable computation at speed is only possible using non-trivial multiples of this base energy, for the simple reasons described by Landauer and elaborated on in the other refs in my article.

3.) Components can be classified as computing tiles or interconnect tiles, but the latter is simply a computer which computes the identity but moves the input to an output in some spatial direction. Interconnect tiles can be irreversible or reversible, but the latter has enormous tradeoffs in size (i.e. optical) and/or speed or other variables and is thus not used by brains or GPUs/CPUs.

4.) Fully reversible computers are possible in theory but have enormous negative tradeoffs in size/speed due to 1.) the need to avoid erasing bits throughout intermediate computations, 2.) the lack of immediate error correction (achieved automatically in dissipative interconnect by erasing at each cycle) leading to error build up which must be corrected/erased (costing energy), 3.) high sensitivity to noise/disturbance due to 2

And the brain vs computer claims:

5.) The brain is near the pareto frontier for practical 10W computers, and makes reasonably good tradeoffs between size, speed, heat and energy as a computational platform for intelligence

6.) Computers are approaching the same pareto frontier (although currently in a different region of design space) - shrinkage is nearing its end

 

 

  1. ^

    As an example, DaemonicSigil's recent post [LW · GW] is in the right direction.
    However, after reading Jacob Cannell's response I did not feel the post seriously engaged with the technical material; it retreated to the much weaker claim that maybe exotic reversible computation could break the limits that Jacob posits, which I found unconvincing. The original post is quite clear that the limits are only for non-exotic computing architectures.

170 comments

Comments sorted by top scores.

comment by jacob_cannell · 2023-04-26T18:07:22.845Z · LW(p) · GW(p)

I support this and will match the $250 prize.

Here are the central background ideas/claims:

1.) Computers are built out of components which are also just simpler computers, which bottoms out at the limits of miniaturization in minimal molecular sized (few nm) computational elements (cellular automata/tiles). Further shrinkage is believed impossible in practice due to various constraints (overcoming these constraints if even possible would require very exotic far future tech).

2.) At this scale the Landauer bound represents the ambient temperature dependent noise (which can also manifest as a noise voltage). Reliable computation at speed is only possible using non-trivial multiples of this base energy, for the simple reasons described by Landauer and elaborated on in the other refs in my article.

3.) Components can be classified as computing tiles or interconnect tiles, but the latter is simply a computer which computes the identity but moves the input to an output in some spatial direction. Interconnect tiles can be irreversible or reversible, but the latter has enormous tradeoffs in size (i.e. optical) and/or speed or other variables and is thus not used by brains or GPUs/CPUs.

4.) Fully reversible computers are possible in theory but have enormous negative tradeoffs in size/speed due to 1.) the need to avoid erasing bits throughout intermediate computations, 2.) the lack of immediate error correction (achieved automatically in dissipative interconnect by erasing at each cycle) leading to error build up which must be corrected/erased (costing energy), 3.) high sensitivity to noise/disturbance due to 2

And the brain vs computer claims:

5.) The brain is near the pareto frontier for practical 10W computers, and makes reasonably good tradeoffs between size, speed, heat and energy as a computational platform for intelligence

6.) Computers are approaching the same pareto frontier (although currently in a different region of design space) - shrinkage is nearing its end

Replies from: johnswentworth, None
comment by johnswentworth · 2023-04-26T18:58:01.126Z · LW(p) · GW(p)

FWIW, I basically buy all of these, but they are not-at-all sufficient to back up your claims about how superintelligence won't foom (or whatever your actual intended claims are about takeoff). Insofar as all this is supposed to inform AI threat models, it's the weakest subclaims necessary to support the foom-claims which are of interest, not the strongest subclaims.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-26T19:06:57.968Z · LW(p) · GW(p)

I basically buy all of these, but they are not-at-all sufficient to back up your claims about how superintelligence won't foom

Foom isn't something that EY can prove beyond doubt or I can disprove beyond doubt, so this is a matter of subjective priors and posteriors.

If you were convinced of foom inevitability before, these claims are unlikely to convince you of the opposite, but they do undermine EY's argument:

  • they support the conclusion that the brain is reasonably pareto-efficient (greatly undermining EY's argument that evolution and the brain are grossly inefficient, as well as his analysis confidence),
  • they undermine nanotech as a likely source of large FOOM gains, and
  • they weaken EY's claim of huge software FOOM gains (because the same process which optimized the brain's hardware platform optimized the wiring/learning algorithms over the same time frame).
Replies from: johnswentworth
comment by johnswentworth · 2023-04-26T20:49:50.610Z · LW(p) · GW(p)

The four claims you listed as "central" at the top of this thread don't even mention the word "brain", let alone anything about it being pareto-efficient.

It would make this whole discussion a lot less frustrating for me (and probably many others following it) if you would spell out what claims you actually intend to make about brains, nanotech, and FOOM gains, with the qualifiers included. And then I could either say "ok, let's see how well the arguments back up those claims" or "even if true, those claims don't actually say much about FOOM because...", rather than this constant probably-well-intended-but-still-very-annoying jumping between stronger and weaker claims.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T00:10:46.364Z · LW(p) · GW(p)

Ok, fair - those are more like background ideas/claims, so I reworded that and added 2.

Replies from: johnswentworth
comment by johnswentworth · 2023-04-27T00:26:09.198Z · LW(p) · GW(p)

Thanks!

Also, I recognize that I'm kinda grouchy about the whole thing and that's probably coming through in my writing, and I appreciate a lot that you're responding politely and helpfully on the other side of that. So thank you for that too.

comment by [deleted] · 2023-04-26T23:34:02.840Z · LW(p) · GW(p)

Jacob something really bothers me about your analysis.

Are you accounting for the brain's high error rate? Efficiently getting the wrong answer a high percent of the time isn't useful; it slashes the number of bits of precision on every calculation and limits system performance.

If every synapse only has an effective 4 bits of precision, the lower-order bits being random noise, it would limit throughput through the system and prevent human judgement, possibly on matters where the delta is smaller than 1/16. It would explain humans ignoring risks smaller than a few percent or having trouble making a decision between close alternatives.

(And this is true for any analog precision level obviously)

It would mean a digital system with a few more bits of precision and fewer redundant synapses could significantly outperform a human brain at the same power level.

Note I also have a ton of skillpoints in this area: I have worked on analog data acquisition and control systems and filters for several years and work on inference accelerators now. (And a master's in CS / bachelor's in CE.)

Note that due to my high skillpoints here I also disagree with Yudkowsky on foom, but for a different set of reasons, also tied to the real world. Like you, I have noticed a shortage of inference compute - if an ASI existed today there aren't enough of the right kind of accelerators to outthink the bulk of humans. (I have some numbers on this I can edit into this post if you show interest.)

Remember, Wikipedia says Yudkowsky didn't even go to high school, and I can find no reference to him building anything in the world of engineering in his life - just writing sci-fi and the Sequences. So it may be a case where he's blind to certain domains and doesn't know what he doesn't know.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T19:40:45.338Z · LW(p) · GW(p)

There is extensive work in DL on bit precision reduction, the industry started at 32b, moved to 16b, is moving to 8b, and will probably end up at 4b or so, similar to the brain.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-28T10:47:54.389Z · LW(p) · GW(p)

For my Noob understanding: what is bit precision exactly?

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T13:40:12.549Z · LW(p) · GW(p)

Just the number of bits used to represent a quantity. The complexity of multiplying numbers is nonlinear in bit complexity, so 32b multipliers are much more expensive than 4b multipliers. Analog multipliers are more efficient in various respects at low signal to noise ratio equivalent to low bit precision, but blow up quickly (exponentially) with a crossover near 8 bits or so last I looked.
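
To make the scaling concrete, here is a minimal sketch (my own illustration, not from the thread), assuming a naive array-multiplier design in which a b-bit multiplier needs on the order of b^2 adder cells:

```python
# Rough illustration: a naive b x b array multiplier uses ~b^2 adder cells,
# so digital multiplier cost grows roughly quadratically with bit precision.

def array_multiplier_cells(bits: int) -> int:
    """Approximate cell count of a naive b-bit array multiplier (~b^2)."""
    return bits * bits

for b in (4, 8, 16, 32):
    cells = array_multiplier_cells(b)
    print(f"{b:>2}-bit multiplier: ~{cells:>4} cells "
          f"({cells // array_multiplier_cells(4)}x the 4-bit cost)")
```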

comment by Steven Byrnes (steve2152) · 2023-04-26T21:50:53.697Z · LW(p) · GW(p)

I certainly don’t expect any prize for this, but…

why there has been so little discussion about his analysis since if true it seems to be quite important

…I can at least address this part from my perspective.

Some of the energy-efficiency discussion (particularly interconnect losses) seems wrong to me, but it seems not to be a crux for anything, so I don’t care to spend time looking into it and arguing about it. If a silicon-chip AGI server were 1000× the power consumption of a human brain, with comparable performance, its electricity costs would still be well below my local minimum wage [LW(p) · GW(p)]. So who cares? And the world will run out of GPUs long before it runs out of the electricity needed to run them. And making more chips (or brains-in-vats or whatever) is a far harder problem than making enough solar cells to power them, and that remains true even if we substantially sacrifice energy-efficiency for e.g. higher speed.

If we (or an AI) master synthetic biology and can make brains-in-vats, tended and fed by teleoperated robots, then we (or the AI) can make whole warehouses of millions of them, each far larger (and hence smarter) than would be practical in humans who had to schlep their brains around the savannah, and they can have far better cooling systems (liquid-cooled with 1°C liquid coolant coming out of the HVAC system, rather than blood-temperature which is only slightly cooler than the brain), and each can have an ethernet/radio connection to a distant teleoperated robot body, etc. This all works even when I’m assuming “merely brain efficiency”. It doesn’t seem important to me whether it’s possible to do even better than that.

Likewise, the post argues that existing fabs are pumping out the equivalent of 5 million (5000 maybe? See thread below.) brains per year, which to me seems like plenty for AI takeover—cf. the conquistadors [LW · GW], or Hitler / Stalin taking over a noticeable fraction of humanity with a mere 1 brain each. Again, maybe there’s room for improvement in chip tech / efficiency compared to today, or maybe not, it doesn’t really seem to matter IMO.

Another thing is: Jacob & I agree that “the cortex/cerebellum/BG/thalamus system is a generic universal learning system”, but he argues that this system isn’t doing anything fundamentally different from the MACs and ReLUs and gradient descent that we know and love from deep learning, and I think he’s wrong, but I don’t want to talk about it for infohazard reasons. Obviously, you have no reason to believe me. Oh well. We’ll find out sooner or later. (I will point out this paper arguing that correlations between DNN-learned-model-activations and brain-voxel-activations is weaker evidence than it seems. The paper is mostly about vision but also has an LLM discussion in Section 5.) Anyway, there are a zillion important model differences that are all downstream of that core disagreement, e.g. how many GPUs it will take for human-level capabilities, how soon and how gradually-vs-suddenly we’ll get human-level capabilities, etc. And hence I have a hard time discussing those too ¯\_(ツ)_/¯

Jacob & I have numerous other AI-risk-relevant disagreements too, but they didn’t come up in the “Brain Efficiency” post.

Replies from: jacob_cannell, alexander-gietelink-oldenziel
comment by jacob_cannell · 2023-04-27T17:00:00.237Z · LW(p) · GW(p)

If a silicon-chip AGI server were 1000× the power consumption of a human brain, with comparable performance, its electricity costs would still be well below my local minimum wage. So who cares? And the world will run out of GPUs long before it runs out of the electricity needed to run them. And making more chips (or brains-in-vats or whatever) is a far harder problem than making enough solar cells to power them, and that remains true even if we substantially sacrifice energy-efficiency for e.g. higher speed.

I largely agree with this, except I will note that energy efficiency is extremely important for robotics, which is partly why robotics lags and will continue to lag until we have more neuromorphic computing.

But also, again, the entire world produces less than 5 TW currently, so if we diverted all world energy to running 20 kW AGIs, that would only result in a population of 250M AGIs. But yes, given that Nvidia produces only a few hundred thousand high-end GPUs per year, GPU production is by far the current bottleneck.
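
A quick check of the arithmetic here, taking the comment's own figures (5 TW of world power, 20 kW per AGI) as given:

```python
# Sanity check using the figures as stated in this comment.
world_power_W = 5e12   # "less than 5 TW" of world power production
agi_power_W = 20e3     # hypothetical 20 kW per AGI
print(f"Upper bound on AGI population: {world_power_W / agi_power_W:,.0f}")  # 250,000,000
```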

If we (or an AI) master synthetic biology and can make brains-in-vats

Yes but they take too long to train. The whole advantage of silicon AI is faster speed of thought (at the cost of enormous energy use).

Likewise, the post argues that existing fabs are pumping out the equivalent of 5 million brains per year, which to me seems like plenty for AI takeover

Err where? My last estimate is a few hundred thousand high-end GPUs per year, and currently it takes well more than one GPU to equal one brain (although that comparison is more complex).

Another thing is: Jacob & I agree that “the cortex/cerebellum/BG/thalamus system is a generic universal learning system”, but he argues that this system isn’t doing anything fundamentally different from the MACs and ReLUs and gradient descent that we know and love from deep learning, and I think he’s wrong, but I don’t want to talk about it for infohazard reasons.

Not quite: GPTs just do a bit more than MACs and ReLUs; they also have softmax, normalization, transpose, etc. And in that sense the toolkit is complete; it's more about what you implement with it and how efficient it is - but it's obviously a universal circuit toolkit.

But in general I do think there are approaches likely to exceed the current GPT paradigm, and more to learn/apply from the brain, but further discussion in that direction should be offline.

Replies from: steve2152, steve2152
comment by Steven Byrnes (steve2152) · 2023-04-27T17:33:44.150Z · LW(p) · GW(p)

Yes but they take too long to train.

I stand by brains-in-vats being relevant in at least some doom scenarios, notwithstanding the slow training. For example, I sometimes have arguments like:

ME: A power-seeking AGI might wipe out human civilization with a super-plague plus drone strikes on the survivors.

THEM: Even if the AGI could do that, it wouldn’t want to, because it wants to survive into the indefinite future, and that’s impossible without having humans around to manufacture chips, mine minerals, run the power grid, etc.

ME: Even if the AGI merely had access to a few dexterous teleoperated robot bodies and its own grid-isolated solar cell, at first, then once it wipes out all the humans, it could gradually (over decades) build its way back to industrial civilization.

THEM: Nope. Fabs are too labor-intensive to run, supply, and maintain. The AGI could scavenge existing chips but it could never make new ones. Eventually the scavenge-able chips would all break down and the AGI would be dead. The AGI would know that, and therefore it would never wipe out humanity in the first place.

ME: What about brains-in-vats?!

(I have other possible responses too—I actually wouldn’t concede the claim that nanofab is out of the question—but anyway, this is a context where brains-in-vats are plausibly relevant.)

I presume you’re imagining different argument chains, in which case, yeah, brains-in-vats that need 10 years to train might well not be relevant. :)

comment by Steven Byrnes (steve2152) · 2023-04-27T17:16:28.297Z · LW(p) · GW(p)

> Likewise, the post argues that existing fabs are pumping out the equivalent of 5 million brains per year, which to me seems like plenty for AI takeover

Err where

In brain efficiency [LW · GW] you wrote “Nvidia - the single company producing most of the relevant flops today - produced roughly 5e21 flops of GPU compute in 2021, or the equivalent of about 5 million brains, perhaps surpassing the compute of the 3.6 million humans born in the US. With 200% growth in net flops output per year from all sources it will take about a decade for net GPU compute to exceed net world brain compute.”

…Whoops. I see. In this paragraph you were talking about FLOP/s, whereas you think the main constraint is memory capacity, which cuts it down by [I think you said] 3 OOM? But I think 5000 brains is enough for takeover too. Again, Hitler & Stalin had one each.

I will strike through my mistake above, sorry about that.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T17:24:28.258Z · LW(p) · GW(p)

Oh I see. Memory capacity does limit the size of a model you can fit on a reasonable number of GPUs, but flops and bandwidth constrain the speed. In Brain Efficiency I was just looking at total net compute counting all GPUs; more recently I was counting only flagship GPUs (as the small consumer GPUs aren't used much for AI due to low RAM).

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-26T22:26:08.982Z · LW(p) · GW(p)

I encourage you to share your knowledge concerning energy-efficiency & interconnect losses! I will split the prize between all serious entries.  

(to me the supposed implications for DOOM & FOOM are not so interesting. fwiw I probably agree with what you say here, including and especially your last paragraph)

Replies from: steve2152
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-04-27T22:34:29.861Z · LW(p) · GW(p)

I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin.  I have skimmed "Brain Efficiency" though not checked any numbers, and not seen anything inside it which seems to address this sanity check.

Replies from: jacob_cannell, Eliezer_Yudkowsky, Veedrac
comment by jacob_cannell · 2023-04-27T23:05:12.072Z · LW(p) · GW(p)

The first step in reducing confusion is to look at what a synaptic spike does. It is the equivalent of - in terms of computational power - an ANN 'synaptic spike', which is a memory read of a weight, a low-precision MAC (multiply accumulate), and a weight memory write (various neurotransmitter plasticity mechanisms). Some synapses probably do more than this - nonlinear decoding of spike times for example - but that's a start. This is all implemented in a device of pretty minimal-looking size. The memory read/write is local, but it also needs to act as an amplifier to some extent, to reduce noise and push the signal farther down the wire. An analog multiplier uses many charge carriers to get a reasonable SNR, which compares to all the charge carriers across a digital multiplier, including interconnect.

So with that background you can apply the Landauer analysis [LW(p) · GW(p)] to get base bit energy, then estimate the analog MAC energy cost [LW(p) · GW(p)] (or equivalent digital MAC, but the digital MAC is much larger so there are size/energy/speed tradeoffs), and finally consider the probably dominant interconnect cost [LW(p) · GW(p)]. I estimate the interconnect cost alone at perhaps a watt.
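
For readers following along, here is a minimal numeric sketch of the first step (the base bit energy), assuming only the standard Landauer formula kT ln 2 at ~300 K and the ~1 eV-per-bit-op figure the post uses for reliable switching at speed:

```python
import math

k_B = 1.380649e-23      # Boltzmann constant, J/K
T = 300.0               # ambient temperature, K
eV = 1.602176634e-19    # joules per electron-volt

landauer_J = k_B * T * math.log(2)   # minimum energy to erase one bit at T
print(f"Landauer bound at 300 K: {landauer_J:.2e} J ({landauer_J / eV:.3f} eV)")

# Reliable, fast, irreversible switching is argued to need a non-trivial
# multiple of this; ~1 eV per bit-op is the ballpark figure used in the post.
print(f"1 eV per op is ~{eV / landauer_J:.0f}x the Landauer minimum")
```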

A complementary approach is to compare to projected upcoming end-of-CMOS-scaling tech as used in research accelerator designs and see that you end up getting similar numbers (also discussed in the article).

The brain, like current CMOS tech, is completely irreversible. Reversible computation is possible in theory but is exotic like quantum computation requiring near zero temp, and may not be practical at scale in a noisy environment like the earth, for the reasons outlined by Cavin/Zhirnov here and discussed in a theoretical cellular model by Tiata here - basically, fully reversible computers rapidly forget everything as noise accumulates. Irreversible computers like brains and GPUs erase all thermal noise at every step, and pay the hot iron price to do so.

Replies from: Eliezer_Yudkowsky, Maxc
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-04-29T01:25:01.821Z · LW(p) · GW(p)

This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T04:03:33.445Z · LW(p) · GW(p)

See my reply here [LW(p) · GW(p)] which attempts to answer this. In short, if you accept that the synapse is doing the equivalent of all the operations involving a weight in a deep learning system (storing the weight, momentum gradient etc in minimal viable precision, multiplier for forward, back, and weight update, etc), then the answer is a more straightforward derivation from the requirements. If you are convinced that the synapse is only doing the equivalent of a single-bit AND operation, then obviously you will reach the conclusion that it is many OOM wasteful, but it is easy to demolish any notion that it is merely doing something so simple.[1]


  1. There are of course many types of synapses which perform somewhat different computations and thus have different configurations, sizes, energy costs, etc. I am mostly referring to the energy/compute-dominant cortical pyramidal synapses. ↩︎

Replies from: Eliezer_Yudkowsky, TekhneMakre
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-05-07T01:42:04.210Z · LW(p) · GW(p)

Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.

comment by TekhneMakre · 2023-05-04T09:26:17.742Z · LW(p) · GW(p)

Is your point that the amount of neurotransmitter is precisely meaningful (so that spending some energy/heat on pumping one additional ion is doing on the order of a bit of "meaningful work")?

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-04T16:03:09.195Z · LW(p) · GW(p)

I'm not sure what you mean precisely by "precisely meaningful", but I do believe we actually know enough about how neural circuits and synapses work[1] such that we have some confidence that they must be doing something similar to their artificial analogs in DL systems.

So this minimally requires:

  1. storage for a K-bit connection weight in memory
  2. (some synapses) nonlinear decoding of B-bit incoming neural spike signal (timing based)
  3. analog 'multiplication'[2] of incoming B-bit neural signal by K-bit weight
  4. weight update from local backpropagating hebbian/gradient signal or equivalent

We know from DL that K and B do not need to be very large, but the optimal values are well above 1 bit, and more importantly the long-term weight storage (equivalent of gradient EMA/momentum) drives most of the precision demand, as it needs to accumulate many noisy measurements over time. From DL it looks like you want around 8-bit at least for long-term weight param storage, even if you can sample down to 4-bit or a bit lower for forward/backwards passes.

So that just takes a certain amount of work, and if you map out the minimal digital circuits in a maximally efficient hypothetical single-electron tile technology you really do get something on order 1e5 minimal 1eV units or more[3]. Synapses are also efficient in the sense that they grow/shrink to physically represent larger/smaller logical weights using more/less resources in the optimal fashion.
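
As a toy illustration of the storage-precision point (my own sketch, not a model of an actual synapse): a higher-precision slow weight accumulates many noisy gradient samples over time, while only a coarsely quantized copy is needed for the forward pass.

```python
import numpy as np

rng = np.random.default_rng(0)

def quantize(x, bits):
    """Round x (clipped to [-1, 1]) onto a coarse grid with ~2^bits levels."""
    levels = 2 ** (bits - 1)
    return np.round(np.clip(x, -1.0, 1.0) * levels) / levels

w_slow = 0.3              # higher-precision stored weight (the ~8-bit role)
lr = 0.01
for _ in range(1000):
    grad = 0.1 + rng.normal(scale=0.5)   # noisy per-step gradient estimate
    w_slow -= lr * grad                  # accumulated at higher precision
w_fast = quantize(w_slow, bits=4)        # low-precision copy for the forward pass

print(f"slow weight: {w_slow:+.4f}, 4-bit forward weight: {w_fast:+.4f}")
```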

I have also argued on the other side of this - there are some DL researchers who think the brain does many many OOM more computation than it would seem, but we can rule that out with the same analysis.


  1. To those with the relevant background knowledge in DL, accelerator designs, and the relevant neuroscience. ↩︎

  2. The actual synaptic operations are non-linear and more complex, but do something like the equivalent work of analog multiplication, and can't be doing dramatically more or less. ↩︎

  3. This is not easy to do either and requires knowledge of the limits of electronics. ↩︎

Replies from: TekhneMakre
comment by TekhneMakre · 2023-05-04T16:16:20.226Z · LW(p) · GW(p)

Thanks! (I'm having a hard time following your argument as a whole, and I'm also not trying very hard / being lazy / not checking the numbers; but I appreciate your answers, and they're at least fleshing out some kind of model that feels useful to me. )

comment by Max H (Maxc) · 2023-04-27T23:40:42.531Z · LW(p) · GW(p)

From the synapses section:

Thus the brain is likely doing on order 1e14 to 1e15 low-medium precision multiply-adds per second.

I don't understand why this follows, and suspect it is false. Most of these synaptic operations are probably not "correct" multiply-adds of any precision - they're actually more random, noisier functions that are approximated or modeled by analog MACs with particular input ranges.

And even if each synaptic operation really is doing the equivalent of an arbitrary analog MAC computation, that doesn't mean that these operations are working together to do any kind of larger computation or cognition in anywhere close to the most efficient possible way.

Similar to how you can prune and distill large artificial models without changing their behavior much, I expect you could get rid of many neurons in the brain without changing the actual computation that it performs much or at all.

It seems like you're modeling the brain as performing some particular exact computation where every bit of noise is counted as useful work. The fact that the brain may be within a couple OOM of fundamental thermodynamic limits of computing exactly what it happens to compute seems not very meaningful as a measure of the fundamental limit of useful computation or cognition possible given particular size and energy specifications.

Replies from: bhauth, jacob_cannell
comment by bhauth · 2023-05-01T03:22:59.389Z · LW(p) · GW(p)

I made a post [LW · GW] which may help explain the analogy between spikes and multiply-accumulate operations.

comment by jacob_cannell · 2023-04-28T04:45:26.792Z · LW(p) · GW(p)

I could make the exact same argument about some grad student's first DL experiment running on a GPU, on multiple levels.

I also suspect you could get rid of many neurons in their DL model without changing the computation; I suspect they aren't working together to do any kind of larger cognition in anywhere close to the most efficient possible way.

It's also likely they may not even know how to use the tensorcores efficiently, and even if they did, the tensorcores waste most of their compute multiplying by zeros or near-zeroes, regardless of how skilled/knowledgeable the DL practitioner is.

And yet knowing all this, we still count flops in the obvious way, as counting "hypothetical fully utilized flops" is not an easy or useful quantity to measure, discuss, and compare.

Utilization of the compute resources is a higher level software/architecture efficiency consideration, not a hardware efficiency measure.

Replies from: Maxc
comment by Max H (Maxc) · 2023-04-28T12:41:20.155Z · LW(p) · GW(p)

And yet knowing all this, we still count flops in the obvious way, as counting "hypothetical fully utilized flops" is not an easy or useful quantity to measure, discuss, and compare.

Given a CPU capable of a specified number of FLOPs at a specified precision, I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].

Not so for brains, for at least a couple of reasons:

  • An individual neuron can't necessarily perform an arbitrary multiply / add / accumulate operation, at any particular precision. It may be modeled by an analog MAC of a specified precision over some input range.
  • The software / architecture point above. For many artificial computations we care about, we can apply both micro (e.g. assembly code optimization) and macro (e.g. using a non-quadratic algorithm for matrix multiplication) optimization to get pretty close to the theoretical limit of efficiency. Maybe the brain is already doing the analog version of these kinds of optimizations in some cases. Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors, it's another reason why the FLOPs to SYNOPs comparison is not apples-to-apples.
  1. ^

    modulo some concerns about I/O, generation, checking, and CPU manufacturers inflating their benchmark numbers

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T13:36:44.923Z · LW(p) · GW(p)

I actually can take arbitrary floats at that precision and multiply or add them in arbitrary ways at the specified rate[1].

And? DL systems just use those floats to simulate large NNs, and a good chunk of recent progress has resulted from moving down to lower precision from 32b to 16b to 8b and soon 4b or lower, chasing after the brain's carefully tuned use of highly energy efficient low precision ops.

Intelligence requires exploring a circuit space, simulating circuits. The brain is exactly the kind of hardware you need to do that with extreme efficiency given various practical physical constraints.

GPUs/accelerators can match the brain in raw low precision op/s useful for simulating NNs (circuits), but use far more energy to do so and more importantly are also extremely limited by memory bandwidth which results in an extremely poor 100:1 or even 1000:1 alu:mem ratio, which prevents them from accelerating anything other than matrix matrix multiplication, rather than the far more useful sparse vector matrix multiplication.

Yes, this is somewhat of a separate / higher-level consideration, but if neurons are less repurposable and rearrangeable than transistors,

This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.

Replies from: Maxc
comment by Max H (Maxc) · 2023-04-28T13:50:16.825Z · LW(p) · GW(p)

This is just nonsense. A GPU can not rearrange its internal circuitry to change precision or reallocate operations. A brain can and does by shrinking/expanding synapses, growing new ones, etc.

Give me some floats and I can make a GPU do matrix multiplication, or sparse matrix multiplication, or many other kinds of computations at a variety of precisions across the entire domain of floats at that precision.

A brain is (maybe) carrying out a computation which is modeled by a particular bunch of sparse matrix multiplications, in which the programmer has much less control over the inputs, domain, and structure of the computation.

The fact that some process (maybe) irreducibly requires some number of FLOPs to simulate faithfully is different from that process being isomorphic to that computation itself.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T14:09:19.870Z · LW(p) · GW(p)

Intelligence requires exploring and simulating a large circuit space - ie by using something like gradient descent on neural networks. You can use a GPU to do that inefficiently or you can create custom nanotech analog hardware like the brain.

The brain emulates circuits, and current AI systems on GPUs simulate circuits inspired by the brain's emulation.

Replies from: Maxc
comment by Max H (Maxc) · 2023-04-28T18:26:27.332Z · LW(p) · GW(p)

Intelligence requires exploring and simulating a large circuit space - ie by using something like gradient descent on neural networks.

I don't think neuroplasticity is equivalent to architecting and then doing gradient descent on an artificial neural network. That process is more analogous to billions of years of evolution, which encoded most of the "circuit exploration" process in DNA. In the brain, some of the weights and even connections are adjusted at "runtime", but the rules for making those connections are necessarily encoded in DNA.

(Also, I flatly don't buy that any of this is required for intelligence.)

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-05-07T01:40:28.617Z · LW(p) · GW(p)

Further item of "these elaborate calculations seem to arrive at conclusions that can't possibly be true" - besides the brain allegedly being close to the border of thermodynamic efficiency, despite visibly using tens of thousands of redundant physical ops in terms of sheer number of ions and neurotransmitters pumped; the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible, so presumably at the Limit of the Possible themselves.

This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate, instead of digital transistor operations about them, even if you otherwise use actual real-world physical hardware.  Sounds right to me; it would make no sense for such a vastly redundant digital computation of such a simple physical quantity to be anywhere near the borders of efficiency!  https://spectrum.ieee.org/analog-ai

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-07T05:48:53.385Z · LW(p) · GW(p)

I'm not sure why you believe "the same calculations claim that modern GPUs are approaching brain efficiency, the Limit of the Possible". GPUs require at least on order ~1e-11 J to fetch a single 8-bit value from GDDRX RAM (1e-19 J/b/nm (interconnect wire energy [LW(p) · GW(p)]) * 1 cm * 8 bits), so around ~1 kW, or 100x the brain, for 1e14 of those per second, not even including flop energy cost (the brain doesn't have much more efficient wires; it just minimizes that entire cost by moving the memory - synapses/weights - as close as possible to the compute... by merging them). I do claim that Moore's Law is ending and not delivering much further increase in CMOS energy efficiency (and essentially zero increase in wire energy efficiency), but GPUs are far from the optimal use of CMOS towards running NNs.
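
A quick numeric check of the figures in this paragraph, using the comment's own interconnect-energy assumption:

```python
# Reproducing the comment's arithmetic with its stated assumptions.
wire_J_per_bit_nm = 1e-19   # interconnect wire energy, J/bit/nm
distance_nm = 1e7           # ~1 cm from off-chip GDDR RAM, in nm
bits = 8                    # one 8-bit value

fetch_J = wire_J_per_bit_nm * distance_nm * bits
print(f"Energy per 8-bit off-chip fetch: {fetch_J:.0e} J")   # 8e-12 J, i.e. ~1e-11 J

rate = 1e14                 # 1e14 fetches per second, a brain-like synaptic-op rate
print(f"Power at that rate: ~{fetch_J * rate:.0f} W")        # ~800 W, on the order of 1 kW
```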

This source claims 100x energy efficiency from substituting some basic physical analog operations for multiply-accumulate,

That sounds about right, and indeed I roughly estimate the minimal energy for 8-bit analog MAC at the end of the synapse section, with 4 ref examples from the research lit:

We can also compare the minimal energy prediction for 8-bit equivalent analog multiply-add to the known and predicted values for upcoming efficient analog accelerators, which mostly have energy efficiency in the range reported in these references[1][2][3][4] for < 8-bit precision, with the higher reported values similar to the brain estimate here, but only for < 4-bit precision[5]. Analog devices can not be shrunk down to few-nm sizes without sacrificing SNR and precision; their minimal size is determined by the need for a large number of carriers, on order 2^(cb) for equivalent bit precision b, with c ~ 2, as discussed earlier.

The more complicated part of comparing these is how/whether to include the cost of reading/writing a synapse/weight value from RAM across a long wire, which is required for full equivalence to the brain. The brain as a true RNN is doing Vector Matrix multiplication, whereas GPUs/Accelerators instead do Matrix Matrix multiplication to amortize the cost of expensive RAM fetches. VM mult can simulate MM mult at no extra cost, but MM mult can only simulate VM mult at huge inefficiency proportional to the minimal matrix size (determined by ALU/RAM ratio, ~1000:1 now at low precision). The full neuromorphic or PIM approach instead moves the RAM next to the processing elements, and is naturally more suited to VM mult.


  1. Bavandpour, Mohammad, et al. "Mixed-Signal Neuromorphic Processors: Quo Vadis?" 2019 IEEE SOI-3D-Subthreshold Microelectronics Technology Unified Conference (S3S). IEEE, 2019. gs-link ↩︎

  2. Chen, Jia, et al. "Multiply accumulate operations in memristor crossbar arrays for analog computing." Journal of Semiconductors 42.1 (2021): 013104. gs-link ↩︎

  3. Li, Huihan, et al. "Memristive crossbar arrays for storage and computing applications." Advanced Intelligent Systems 3.9 (2021): 2100017. gs-link ↩︎

  4. Li, Can, et al. "Analogue signal and image processing with large memristor crossbars." Nature electronics 1.1 (2018): 52-59. gs-link ↩︎

  5. Mahmoodi, M. Reza, and Dmitri Strukov. "Breaking POps/J barrier with analog multiplier circuits based on nonvolatile memories." Proceedings of the International Symposium on Low Power Electronics and Design. 2018. gs-link ↩︎

Replies from: Eliezer_Yudkowsky, alexander-gietelink-oldenziel
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-05-07T19:53:21.170Z · LW(p) · GW(p)

Okay, if you're not saying GPUs are getting around as efficient as the human brain, without much more efficiency to be eked out, then I straightforwardly misunderstood that part.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-07T08:40:00.596Z · LW(p) · GW(p)

Could you elaborate on your last paragraph about matrix-matrix multiplication versus vector-matrix multiplication? What does this have to do with the RAM being next to the processing units?

(As a general note, I think it would be useful for people trying to follow along if you would explain some of the technical terms you are using. Not everybody is a world expert in GPU design! E.g. PIM, CMOS, MAC, etc.)

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-07T18:52:36.956Z · LW(p) · GW(p)

Matrix Matrix Mult of square matrices of dim N uses ~2N^3 ALU ops and ~3N^2 MEM ops, so it has an arithmetic intensity of ~N (ALU:MEM ratio).

Vector Matrix Mult of dim N uses ~2N^2 ALU and ~3N^2 MEM, for an arithmetic intensity of ~1.

A GPU has an ALU:MEM ratio of about 1000:1 (for lower precision tensorcore ALU), so it is inefficient at vector matrix mult by a factor of about 1000 vs matrix matrix mult. The high ALU:MEM ratio is a natural result of the relative wire lengths: very short wire distances to shuffle values between FP units in a tensorcore vs very long wire distances to reach a value in off chip RAM.
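
A minimal sketch of these op counts (my own illustration, using the naive algorithms and the counting convention above; N is the matrix dimension):

```python
# Arithmetic intensity (ALU ops per MEM op) for the two cases described above.

def matmat_ops(N):
    alu = 2 * N**3     # each of N^2 outputs needs N multiplies + N adds
    mem = 3 * N**2     # read A, read B, write C
    return alu, mem

def vecmat_ops(N):
    alu = 2 * N**2     # one multiply + one add per weight
    mem = 3 * N**2     # touch each weight, an input, and an accumulator
    return alu, mem

N = 4096
for name, (alu, mem) in (("matrix-matrix", matmat_ops(N)), ("vector-matrix", vecmat_ops(N))):
    print(f"{name}: arithmetic intensity ~{alu / mem:.1f}")
```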

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-18T17:33:45.628Z · LW(p) · GW(p)

What is ALU and MEM exactly? And what is the significance of the ALU:MEM ratio?

Replies from: ege-erdil
comment by Ege Erdil (ege-erdil) · 2023-05-19T10:47:05.825Z · LW(p) · GW(p)

The GPU needs numbers to be stored in registers inside the GPU before it can do operations on them. A memory operation (what Jacob calls MEM) is when you load a particular value from memory into a register. An arithmetic operation is when you do an elementary arithmetic operation such as addition or multiplication on two values that have already been loaded into registers. These are done by the arithmetic-logic unit (ALU) of the processor so are called ALU ops.

Because a matrix multiplication of two N×N matrices only involves 2N^2 distinct floating point numbers as input, and writing the result back into memory is going to cost you another N^2 memory operations, the total MEM ops cost of a matrix multiplication of two matrices of size N×N is ~3N^2. In contrast, if you're using the naive matrix multiplication algorithm, computing each entry in the output matrix takes you N additions and N multiplications, so you end up with ~2N^3 ALU ops needed.

The ALU:MEM ratio is important because if your computation is imbalanced relative to what is supported by your hardware then you'll end up being bottlenecked by one of them and you'll be unable to exploit the surplus resources you have on the other side. For instance, if you're working with a bizarre GPU that has a 1:1 ALU:MEM ratio, whenever you're only using the hardware to do matrix multiplications you'll have enormous amounts of MEM ops capacity sitting idle, because you don't have the ALU capacity to keep it busy.
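
To make the bottleneck point concrete, here is a toy roofline-style check (the hardware numbers are illustrative assumptions, not any specific GPU):

```python
# Which resource limits a kernel, given hardware ALU and MEM throughputs?

def bottleneck(alu_ops, mem_ops, hw_alu_per_s, hw_mem_per_s):
    t_alu = alu_ops / hw_alu_per_s
    t_mem = mem_ops / hw_mem_per_s
    return "ALU-bound" if t_alu >= t_mem else "MEM-bound"

N = 4096
# Assumed GPU-like hardware with a ~1000:1 ALU:MEM throughput ratio.
hw = dict(hw_alu_per_s=1e15, hw_mem_per_s=1e12)
print("matrix-matrix:", bottleneck(2 * N**3, 3 * N**2, **hw))   # ALU-bound
print("vector-matrix:", bottleneck(2 * N**2, 3 * N**2, **hw))   # MEM-bound
```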

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-19T13:35:32.399Z · LW(p) · GW(p)

This is helpful, thanks a ton Ege!

comment by Veedrac · 2023-04-28T14:57:28.802Z · LW(p) · GW(p)

The section you were looking for is titled ‘Synapses’.

https://www.lesswrong.com/posts/xwBuoE9p8GE7RAuhd/brain-efficiency-much-more-than-you-wanted-to-know#Synapses [LW(p) · GW(p)]

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2023-04-29T01:18:04.397Z · LW(p) · GW(p)

And it says:

So true 8-bit equivalent analog multiplication requires about 100k carriers/switches

This just seems utterly wack.  Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit?  And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by... an axon terminal with a handful of synaptic vesicles dumping 10,000 neurotransmitter molecules to flood around a dendritic terminal (molecules which will later need to be irreversibly pumped back out), which in turn depolarizes and starts flooding thousands of ions into a cell membrane (to be later pumped out) in order to transmit the impulse at 1m/s?  That's the most thermodynamically efficient a physical cognitive system can possibly be?  This is approximately the most efficient possible way to turn all those bit erasures into thought?

This sounds like physical nonsense that fails a basic sanity check.  What am I missing?

Replies from: jacob_cannell, Maxc, Veedrac, DaemonicSigil
comment by jacob_cannell · 2023-04-29T03:44:26.090Z · LW(p) · GW(p)

And "analog multiplication down to two decimal places" is the operation that is purportedly being carried out almost as efficiently as physically possible by

I am not certain it is being carried out "almost as efficiently as physically possible", assuming you mean thermodynamic efficiency (even accepting you meant thermodynamic efficiency only for irreversible computation); my belief is more that the brain and its synaptic elements are reasonably efficient in a pareto tradeoff sense [LW · GW].

But any discussion around efficiency must make some starting assumptions about what computations the system may be performing. We now have a reasonable amount of direct and indirect evidence - direct evidence from neuroscience, indirect evidence from DL - that allows us some confidence that the brain is conventional (irreversible, non quantum), and is basically very similar to an advanced low power DL accelerator built out of nanotech replicators. (and the clear obvious trend in hardware design is towards the brain)

So starting with that frame ..

Having any physical equivalent of an analog multiplication fundamentally requires 100,000 times the thermodynamic energy to erase 1 bit?

A synaptic op is the equivalent of reading an 8b-ish weight from memory, 'multiplying' by the incoming spike value, propagating the output down the wire, updating neurotransmitter receptors (which store not just the equivalent of the weight, but the bayesian distribution params on the weight, equivalent to gradient momentum etc), back-propagating spike (in some scenarios), spike decoding (for nonlinear spike timing codes), etc.

It just actually does a fair amount of work, and if you actually query the research literature to see how many transistors that would take, it is something like 10k to 100k or more, each of which minimally uses 1 eV per op (times ~10 for interconnect), according to the best micromodels of circuit limits (Cavin/Zhirnov).

The analog multiplier and gear is very efficient (especially in space) for low SNRs, but it scales poorly (exponentially) with bit precision (equivalent SNR). From the last papers I recall, 8b is the crossover point where digital wins in energy and perhaps size; below that analog dominates. There are numerous startups working on analog hardware to replace GPUs for low-bit-precision multipliers, chasing the brain, but it's extremely difficult and IMHO may not be worth it without nanotech.
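
A small sketch of the exponential-scaling claim, using the carrier-count relation from the post (roughly 2^(2b) carriers for b-bit-equivalent SNR, since averaging N carriers only improves SNR as sqrt(N)):

```python
# Carriers needed for b-bit-equivalent analog SNR: roughly 2^(c*b) with c ~ 2.

def analog_carriers(bits, c=2):
    return 2 ** (c * bits)

for b in (2, 4, 8, 12):
    print(f"{b:>2}-bit equivalent: ~{analog_carriers(b):,} carriers")
# 8-bit gives ~65,536, roughly matching the "about 100k carriers/switches"
# figure quoted earlier in the thread; the blow-up past 8 bits is why
# digital multipliers win at higher precision.
```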

in order to transmit the impulse at 1m/s?

The brain only runs at 100 Hz, and the axon conduction velocity is optimized just so that every brain region can connect to distal regions without significant delay (delay on the order of a millisecond or so).

So the real question is then just why 100 Hz - which I also answer in Brain Efficiency. If you have a budget of 10 W you can spend that running a very small NN very fast or a very large NN at lower speeds, and the latter seems more useful for biology. Digital minds obviously can spend the energy cost to run at fantastic speeds - and GPT-4 was only possible because its NN can run vaguely ~10000x faster than the brain (for training).

I'll end with an interesting quote from Hinton[1]:

The separation of software from hardware is one of the foundations of Computer Science and it has many benefits. It makes it possible to study the properties of programs without worrying about electrical engineering. It makes it possible to write a program once and copy it to millions of computers. If, however, we are willing to abandon immortality it should be possible to achieve huge savings in the energy required to perform a computation and in the cost of fabricating the hardware that executes the computation. We can allow large and unknown variations in the connectivity and non-linearities of different instances of hardware that are intended to perform the same task and rely on a learning procedure to discover parameter values that make effective use of the unknown properties of each particular instance of the hardware. These parameter values are only useful for that specific hardware instance, so the computation they perform is mortal: it dies with the hardware.


  1. The Forward-Forward Algorithm -section 8 ↩︎

comment by Max H (Maxc) · 2023-04-29T02:26:54.178Z · LW(p) · GW(p)

I think the quoted claim is actually straightforwardly true? Or at least, it's not really surprising that actual precise 8 bit analog multiplication really does require a lot more energy than the energy required to erase one bit.

I think the real problem with the whole section is that it conflates the amount of computation required to model synaptic operation with the amount of computation each synapse actually performs.

These are actually wildly different types of things, and I think the only thing it is justifiable to conclude from this analysis is that (maybe, if the rest of it is correct) it is not possible to simulate the operation of a human brain at synapse granularity, using much less than 10W and 1000 cm^3. Which is an interesting fact if true, but doesn't seem to have much bearing on the question of whether the brain is close to an optimal substrate for carrying out the abstract computation of human cognition.

(I expanded a little on the point about modeling a computation vs. the computation itself in an earlier sibling reply [LW(p) · GW(p)].)

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T03:58:21.887Z · LW(p) · GW(p)

Or at least, it's not really surprising that actual precise 8 bit analog multiplication

I'm not sure what you mean by "precise 8 bit analog multiplication", as analog is not precise in the way digital is. When I say 8-bit analog equivalent, I am talking about an analog operation that has SNR equivalent to quantized 8-bit digital, which is near the maximum useful range for analog multiplier devices, and near the upper range of estimates of synaptic precision.

Replies from: Maxc
comment by Max H (Maxc) · 2023-04-29T04:31:55.677Z · LW(p) · GW(p)

I was actually imagining some kind of analogue to an 8 bit Analog-to-digital converter. Or maybe an op amp? My analog circuits knowledge is very rusty.

But anyway, if you draw up a model of some synapses as an analog circuit with actual analog components, that actually illustrates one of my main objections pretty nicely: neurons won't actually meet the same performance specifications of that circuit, even if they behave like or are modeled by those circuits for specific input ranges and a narrow operating regime.

An actual analog circuit has to meet precise performance specifications within a specified operating domain, whether it is comprised of an 8 bit or 2 bit ADC, a high or low quality op amp, etc. 

If you draw up a circuit made out of neurons, the performance characteristics and specifications it meets will probably be much more lax. If you relax specifications for a real analog circuit in the same way, you can probably make the circuit out of much cheaper and lower-energy component pieces.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T05:15:28.426Z · LW(p) · GW(p)

An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.

The brain doesn't have that constraint, so it can to some extent learn to exploit the nuances of each specific subcircuit or device. This is almost obviously superior in terms of low level circuit noise tolerance and space and energy efficiency, and is seen by some as the ultimate endpoint of Moore's Law - see hinton's Forward Forward section 8 I quoted here [LW(p) · GW(p)]

Regardless, if you see a neurotransmitter synapse system that is using 10k carriers or whatever flooding through variadic memory-like protein gates, such that deep simulations of the system indicate it is doing something similar-ish to analog multiplication with SNR equivalent to 7 bits or whatnot, and you have a bunch of other neuroscience and DL experiments justifying that interpretation, then that is probably what it is doing.

It is completely irrelevant whether it's a 'proper' analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.

Replies from: Maxc
comment by Max H (Maxc) · 2023-04-29T12:57:39.297Z · LW(p) · GW(p)

An actual CMOS analog circuit only has to meet those precision performance specifications because it is a design which must be fabricated and reproduced with precision over and over.

Mass production is one reason, but another reason this distinction actually is important is that they are performance characteristics of the whole system, not its constituent pieces. For both analog and digital circuits, these performance characteristics have very precise meanings.

Let's consider flop/s for digitial circuits.

If I can do 1M flop/s, that means roughly, every second, you can give me 2 million floats, and I can multiply them together pairwise and give you 1 million results, 1 second later. I can do this over and over again every second, and the values can be arbitrarily distributed over the entire domain of floating point numbers at a particular precision.[1]

"Synaptic computations" in the brain, as you describe them, do not have any of these properties. The fact that 10^15 of them happen per second is not equivalent or comparable to 10^15 flop/s, because it is not a performance characteristic of the system as a whole.

By analogy: suppose you have some gas particles in a container, and you'd like to simulate their positions. Maybe the simulation requires 10^15 flop/s to simulate in real time, and there is provably no more efficient way to run your simulation.

Does that mean the particles themselves are doing 10^15 flop/s? No!

Saying the brain does "10^15 synaptic operations per second" is a bit like saying the particles in the gas are doing "10^15 particle operations per second".

The fact that, in the case of the brain, the operations themselves are performing some kind of useful work that looks like a multiply-add, and that this is maybe within an OOM of some fundamental efficiency limit, doesn't mean you can coerce the types arbitrarily to say the brain itself is efficient as a whole.

As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is "near the limit of thermodynamic efficiency" in the sense that every microjoule of energy it uses is required to make it actually work. Cooling + overclocking might let you push things a bit, but you'll never be able to match the performance of re-designing the underlying transistors entirely at a smaller process (which often involves a LOT more than just shrinking individual transistors).

It is completely irrelevant whether it's a 'proper' analog multiplication that would meet precise performance specifications in a mass produced CMOS device. All that matters here is its equivalent computational capability.

Indeed, Brains and digital circuits have completely different computational capabilities and performance characteristics. That's kind of the whole point.

  1. ^

    If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T13:33:09.859Z · LW(p) · GW(p)

(If I do this with a CPU, I might have full control over which pairs are multiplied. If I have an ASIC, the pair indices might be fixed. If I have an FPGA, they might be fixed until I reprogram it.)

The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use. An FPGA is somewhere in between.

The brain uses active rewiring (and synapse growth/shrinkage) to physically adapt the hardware, which has the flexibility of an FPGA for the purposes of deep learning, but the efficiency of an ASIC.

As a less vacuous analogy, you could do a bunch of analysis on an individual CMOS gate from the 1980s, and find, perhaps, that it is "near the limit of thermodynamic efficiency"

Or you could make the same argument about a pile of rocks, or a GPU as I noticed earlier. The entire idea of computation is a map-territory enforcement: it always requires a mapping between a logical computation and physics.

If you simply assume - as you do - that the brain isn't computing anything useful (as equivalent to deep learning operations, which I believe is overwhelmingly supported by the evidence), then you can always claim that, but I see no reason to pay attention whatsoever. I suspect you simply haven't spent the requisite many thousands of hours reading the right DL and neuroscience.

Replies from: Veedrac, Maxc
comment by Veedrac · 2023-04-29T18:39:17.915Z · LW(p) · GW(p)

The only advantage of a CPU/GPU over an ASIC is that the CPU/GPU is programmable after device creation. If you know what calculation you want to perform you use an ASIC and avoid the enormous inefficiency of the CPU/GPU simulating the actual circuit you want to use

This has a kernel of truth but it is misleading. There are plenty of algorithms that don't naturally map to circuits, because a step of an algorithm in a circuit costs space, whereas a step of an algorithm in a programmable computer costs only those bits required to encode the task. The inefficiency of dynamic decode can be paid for with large enough algorithms. This is most obvious when considering large tasks on very small machines.

It is true that neither GPUs nor CPUs seem particularly Pareto optimal for their broad set of tasks, versus a cleverer clean-sheet design, and it is also true that for any given task you could likely specialize a CPU or GPU design for it somewhat easily for at least marginal benefit. But I also think this is not the default way your comment would be interpreted.

comment by Max H (Maxc) · 2023-04-29T14:56:07.254Z · LW(p) · GW(p)

If you simply assume - as you do - that the brain isn't computing anything useful


I do not assume this, but I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 "synaptic computations".

Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question, which is perfectly fine as a matter of topic choice, but you then can't make any claims which depend on such an understanding.

Absent such an understanding, you can still make apples-to-apples comparisons about overall performance characteristics between digital and biological systems. But those _must_ be grounded in an actual precise performance metric of the system as a whole, if they are to be meaningful at all.

Component-wise analysis is not equivalent to system-wide analysis, even if your component-wise analysis is precise and backed by a bunch of neuroscience results and intuitions from artificial deep learning.

FYI for Jacob and others, I am probably not going to further engage directly with Jacob, as we seem to be mostly talking past each other, and I find his tone ("this is just nonsense", "completely irrelevant", "suspect you simply haven't spent...", etc.) and style of argument to be tiresome.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T15:29:25.521Z · LW(p) · GW(p)

I am claiming that something remains to be shown, namely, that human cognition irreducibly requires any of those 10^15 "synaptic computations".

Obviously it requires some of those computations, but in my ontology the question of how many is clearly a software efficiency question. The fact that an A100 can do ~1e15 low precision op/s (with many caveats/limitations) is a fact about the hardware that tells you nothing about how efficiently any specific A100 may be utilizing that potential. I claim that the brain can likewise do very roughly 1e15 synaptic ops/s, but that questions of utilization of that potential towards intelligence are likewise circuit/software efficiency questions (which I do address in some of my writing, but it is specifically out of scope for this particular question of synaptic hardware.)

Showing such a thing necessarily depends on an understanding of the nature of cognition at the software / algorithms / macro-architecture level. Your original post explicitly disclaims engaging with this question,

My original post does engage with this some in the circuit efficiency [LW(p) · GW(p)] section. I draw the circuit/software distinction around architectural prior and learning algorithms (genetic/innate) vs acquired knowledge/skills (cultural).

I find his tone ("this is just nonsense",

I used that in response to you saying "but if neurons are less repurposable and rearrangeable than transistors,", which I do believe is actually nonsense, because neural circuits literally dynamically rewire themselves, which allows the flexibility of FPGAs (for circuit learning) combined with the efficiency of ASICs, and transistors are fixed circuits not dynamically modifiable at all.

If I was to try and steelman your position, it is simply that we can not be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.

To answer that question, I have provided some of the relevant arguments in my past writing. But at this point, the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, the petering out of Moore's law shrinkage, and the fact that the brain remains above the efficiency of our best accelerators together shift the burden entirely onto you to write up detailed analysis/arguments as to how you can explain these facts.

Replies from: Maxc, Maxc
comment by Max H (Maxc) · 2023-04-29T16:14:53.157Z · LW(p) · GW(p)

To answer that question, I have provided some of the relevant arguments in my past writing. But at this point, the enormous success of DL (which I predicted well in advance) towards AGI, the great extent to which it has reverse engineered the brain, the petering out of Moore's law shrinkage, and the fact that the brain remains above the efficiency of our best accelerators together shift the burden entirely onto you to write up detailed analysis/arguments as to how you can explain these facts.

 

I think there's just not that much to explain, here - to me, human-level cognition just doesn't seem that complicated or impressive in an absolute sense - it is performed by a 10W computer designed by a blind idiot god [LW · GW], after all.

The fact that current DL paradigm methods inspired by its functionality have so far failed to produce artificial cognition of truly comparable quality and efficiency seems more like a failure of those methods than a success, at least so far. I don't expect this trend to continue in the near term (which I think we agree on), and grant you some Bayes points for predicting it further in advance.

comment by Max H (Maxc) · 2023-04-29T15:38:27.035Z · LW(p) · GW(p)

If I was to try and steelman your position, it is simply that we can not be sure how efficiently the brain utilizes the potential of its supposed synaptic computational power.

I was actually referring to the flexibility and re-arrangability at design time here. Verilog and Cadence can make more flexible use of logic gates and transistors than the brain can make of neurons during a lifetime, and the design space available to circuit designers using these tools is much wider than the one available to evolution.

comment by Veedrac · 2023-04-29T04:52:48.678Z · LW(p) · GW(p)

A sanity check of a counterintuitive claim can be that the argument to the claim implies things that seem unjustifiable or false. It cannot be that the conclusion of the claim itself is unjustifiable or false, except inasmuch as you are willing to deny the possibility to be convinced of that claim by argument at all.

(To avoid confusion, this is not in response to the latter portion of your comment about general cognition.)

comment by DaemonicSigil · 2023-04-29T04:38:07.343Z · LW(p) · GW(p)

If you read carefully, Brain Efficiency does actually have some disclaimers to the effect that it's discussing the limits of irreversible computing using technology that exists or might be developed in the near future. See Jacob's comment here for examples: https://www.lesswrong.com/posts/mW7pzgthMgFu9BiFX/the-brain-is-not-close-to-thermodynamic-limits-on?commentId=y3EgjwDHysA2W3YMW [LW(p) · GW(p)]

In terms of what the actual fundamental thermodynamic limits are, Jacob and I still disagree by a factor of about 50. (Basically, Jacob thinks the dissipated energy needs to be amped up in order to erase a bit with high reliability. I think that while there are some schemes where this is necessary, there are others where it is not, and high-reliability erasure is possible with an energy per bit approaching $k_B T \ln 2$. I'm still working through the math to check that I'm actually correct about this, though.)

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T04:43:33.735Z · LW(p) · GW(p)

If you read Landauer's paper carefully, he analyzes 3 sources of noise, and $k_B T \ln 2$ is something like the speed of light for bit energy: only achieved at a useless 50% error rate and/or at glacial speeds.

Replies from: DaemonicSigil, DaemonicSigil
comment by DaemonicSigil · 2023-04-29T04:57:00.120Z · LW(p) · GW(p)

That's only for the double well model, though, and only for erasing by lifting up one of the wells. I didn't see a similar theorem proven for a general system. So the crucial question is whether it's still true in general. I'll get back to you eventually on that, I'm still working through the math. It may well turn out that you're right.

Replies from: jacob_cannell, alexander-gietelink-oldenziel
comment by jacob_cannell · 2023-04-29T15:35:54.891Z · LW(p) · GW(p)

I believe the double well model - although it sounds somewhat specific at a glance - is actually a fully universal conceptual category over all relevant computational options for representing a bit.

You can represent a bit with dominoes, in which case the two bistable states are up/down, you can represent it with few electron quantum dots in one of two orbital configs, or larger scale wire charge changes, or perhaps fluid pressure waves, or ..

The exact form doesn't matter, as a bit always requires a binary classification between two partitions of device microstates, which leads to the success probability being some exponential function of switching energy over noise energy. It's equivalent to a binary classification task for Maxwell's demon.
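
To make "success probability is some exponential function of switching energy over noise energy" concrete, here is a minimal sketch assuming the simple barrier model $p_{err} \approx e^{-E_b/kT}$ (the model choice and unit prefactor are assumptions for illustration, not taken from Landauer's paper):

```python
# Illustration only: switching energy per bit needed for a target error rate,
# assuming p_err ~ exp(-E_b / kT). The model and unit prefactor are assumptions.
import math

k_B = 1.380649e-23            # J/K
T = 300.0                     # K (room temperature)
kT = k_B * T
landauer = kT * math.log(2)   # ~2.87e-21 J

for p_err in (0.5, 1e-3, 1e-15, 1e-25):
    E_b = kT * math.log(1.0 / p_err)
    print(f"p_err={p_err:.0e}: E_b = {E_b:.2e} J = {E_b / landauer:.1f} x kT ln2")
```

Under that assumed model, a 50% error rate corresponds to exactly $k_B T \ln 2$, while pushing the error rate down to ~$10^{-15}$ costs roughly 50x that, which is the ballpark of the factor disputed upthread.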

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-29T07:43:32.855Z · LW(p) · GW(p)

Let me know how much time you need to check the math. I'd like to give you the option to make an entry for the prize.

Replies from: DaemonicSigil, DaemonicSigil
comment by DaemonicSigil · 2023-05-07T00:34:08.910Z · LW(p) · GW(p)

Finished, the post is here: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit [LW · GW]

Summary of the conclusions is that energy on the order of $k_B T \ln 2$ should work fine for erasing a bit with high reliability, and the roughly 50x larger figure claimed by Jacob is not a fully universal limit.

comment by DaemonicSigil · 2023-05-01T19:08:32.419Z · LW(p) · GW(p)

Sorry for the slow response, I'd guess 75% chance that I'm done by May 8th. Up to you whether you want to leave the contest open for that long.

comment by DaemonicSigil · 2023-05-07T00:31:51.502Z · LW(p) · GW(p)

Okay, I've finished checking my math and it seems I was right. See post here for details: https://www.lesswrong.com/posts/PyChB935jjtmL5fbo/time-and-energy-costs-to-erase-a-bit [LW · GW]

comment by johnswentworth · 2023-04-26T17:32:38.645Z · LW(p) · GW(p)

(Copied with some minor edits from here [LW(p) · GW(p)].)

Jacob's argument in the Density and Temperature [LW · GW] section of his Brain Efficiency post basically just fails.

Jacob is using a temperature formula for blackbody radiators, which is basically irrelevant to the temperature of realistic compute substrates - brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip). The obvious law to use instead would just be the standard thermal conduction law: heat flow per unit area proportional to temperature gradient.

Jacob's analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).

Put all that together, and a more sensible formula would be:

$$c_1 R T = Q = c_2 A \frac{T - T_{env}}{R}$$

... where:

  • $R$ is radius of the system
  • $A$ is surface area of thermal contact
  • $Q$ is heat flow out of system
  • $T$ is system temperature
  • $T_{env}$ is environment temperature (e.g. blood or heat sink temperature)
  • $c_1, c_2$ are constants with respect to system size and temperature

(Of course a spherical approximation is not great, but we're mostly interested in change as all the dimensions scale linearly, so the geometry shouldn't matter for our purposes.)

First key observation: all the $R$'s cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick. So, overall, equilibrium temperature stays the same as the system scales down.

So in fact scaling down is plausibly free, for purposes of heat management. (Though I'm not highly confident that would work in practice. In particular, I'm least confident about the temperature gradient scaling with inverse system size, in practice.)

On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don't even need to go to liquid helium for that.

In terms of scaling, our above formula says that $T$ will scale proportionally to $T_{env}$. Halve the environment temperature, halve the system temperature. And that result I do expect to be pretty robust (for systems near Jacob's interconnect Landauer limit), since it just relies on temperature scaling of the Landauer limit plus heat flow being proportional to temperature delta.
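
A small numeric check of the two claims above (my own sketch, not John's), plugging arbitrary assumed constants into $c_1 R T = c_2 A (T - T_{env})/R$ with $A = 4\pi R^2$:

```python
# Sketch: equilibrium temperature from c1*R*T = c2*A*(T - T_env)/R with A = 4*pi*R^2.
# c1 and c2 are arbitrary assumed constants; only the scaling behaviour matters.
import math

def equilibrium_temp(R, T_env, c1=1.0, c2=0.5):
    # c1*R*T = c2*4*pi*R^2*(T - T_env)/R simplifies to c1*T = 4*pi*c2*(T - T_env),
    # so R drops out of the algebra entirely.
    k = 4 * math.pi * c2
    return k * T_env / (k - c1)      # needs k > c1 for a physical solution

for R in (1.0, 0.5, 0.25):
    print(f"R={R}:   T = {equilibrium_temp(R, T_env=310.0):.1f} K")    # unchanged as R shrinks

for T_env in (310.0, 155.0, 77.0):
    print(f"T_env={T_env}: T = {equilibrium_temp(1.0, T_env):.1f} K")  # T proportional to T_env
```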

Replies from: AllAmericanBreakfast, johnswentworth, jacob_cannell, ADifferentAnonymous, jacob_cannell
comment by DirectedEvolution (AllAmericanBreakfast) · 2023-04-26T19:41:54.856Z · LW(p) · GW(p)

I'm going to make this slightly more legible, but not contribute new information.

Note that downthread, Jacob says:

the temp/size scaling part is not one of the more core claims so any correction there probably doesn't change the conclusion much.

So if your interest is in Jacob's arguments as they pertain to AI safety, this chunk of Jacob's writings is probably not key for your understanding and you may want to focus your attention on other aspects.

Both Jacob and John agree on the obvious fact that active cooling is necessary for both the brain and for GPUs and a crucial aspect of their design.

Jacob:

Humans have evolved exceptional heat dissipation capability using the entire skin surface for evaporative cooling: a key adaption that supports both our exceptional long distance running ability, and our oversized brains...

Current 2021 gpus have a power density approaching  W / , which severely constrains the design to that of a thin 2D surface to allow for massive cooling through large heatsinks and fans...

John:

... brains, chips, and probably future compute substrates are all cooled by conduction through direct contact with something cooler (blood for the brain, heatsink/air for a chip)..

On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing. According to this meta-analysis, the average temperature delta between e.g. brain and blood is at most ~2.5 C, so even liquid nitrogen would be enough to achieve ~100x larger temperature delta if the system were at the same temperature as the brain; we don't even need to go to liquid helium for that.

Where they disagree is on two points:

  • Whether temperature of GPUs/brains scales with their surface area
  • Tractability of dealing with higher temperatures in scaled-down computers with active cooling

Jacob applies the Stefan-Boltzmann Law for black body radiators. In this model, temperature output scales with both energy and surface area:

$$j = \sigma T^4 \quad\Longleftrightarrow\quad T = \left(\frac{j}{\sigma}\right)^{1/4}$$

Where $j$ is the power per unit surface area in W/m², and $\sigma$ is the Stefan-Boltzmann constant.
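
For a sense of the numbers this relation produces, here is a small sketch with assumed inputs; the 10 MW figure echoes the GHz-brain line quoted later in this thread, and the 0.06 m² effective surface area is a rough assumption of mine:

```python
# Illustration: blackbody equilibrium temperature for a given radiated power.
# Power and surface area below are assumed/illustrative numbers.
SIGMA = 5.670374419e-8        # Stefan-Boltzmann constant, W / (m^2 K^4)

def blackbody_temp(power_w, area_m2):
    j = power_w / area_m2     # power per unit surface area, W/m^2
    return (j / SIGMA) ** 0.25

# ~10 MW dissipated over an assumed ~0.06 m^2 surface, as in the GHz-brain example:
print(f"{blackbody_temp(10e6, 0.06):.0f} K")   # ~7400 K, above the Sun's ~5800 K surface
```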

In comments, he rationalizes this choice by saying:

SB law describes the relationship to power density of a surface and corresponding temperature; it just gives you an idea of the equivalent temperature sans active cooling... That section was admittedly cut a little short, if I had more time/length it would justify a deeper dive into the physics of cooling and how much of a constraint that could be on the brain. You're right though that the surface power density already describes what matters for cooling.

And downthread, he says:

I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).

John advocates an alternative formula for heat flow:

Put all that together, and a more sensible formula would be:

$$c_1 R T = Q = c_2 A \frac{T - T_{env}}{R}$$

... where:

  • $R$ is radius of the system
  • $A$ is surface area of thermal contact
  • $Q$ is heat flow out of system
  • $T$ is system temperature
  • $T_{env}$ is environment temperature (e.g. blood or heat sink temperature)
  • $c_1, c_2$ are constants with respect to system size and temperature

$R$ cancels out. I'm also going to move $A$ over to the other side, ignore the constants for our conceptual purposes, and cut out the middle part of the equation, leaving us with:

$$\frac{Q}{A} \propto T - T_{env}$$

In language, the heat flow out of the brain/GPU and into its cooling system (i.e. blood, a heatsink) is proportional to (area of contact) x (temperature difference).

At first glance, this would appear to also show that as you scale down, heat flow out of the system will decrease because there'll be less available area for thermal contact. The key point is whether or not power consumption stays the same as you scale down.

Here is Jacob's description of what happens to power consumption in GPUs as you scale down:

Current 2021 gpus have a power density approaching  W / , which severely constrains the design to that of a thin 2D surface...

This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore's Law by a factor of D increases transistor density by a factor of $D^2$, but at best only increases 2d off-chip wire density by a factor of D, and doesn't directly help reduce wire energy cost at all.

And here is John's model, where he clearly and crucially disagrees with Jacob on whether scaling down affects power consumption by shortening wires (relevant text is bolded in the quote above and below).

If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.

So in fact scaling down is plausibly free, for purposes of heat management...

John also speaks to our ability to upgrade the cooling system:

On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing.

Jacob doesn't really talk about the limits of our ability to cool GPUs by upgrading the cooling system in this section, talking only of the thin 2D design of GPUs being motivated by a need to achieve "massive cooling through large heatsinks and fans." Ctrl+F does not find the words "nitrogen" and "helium" in his post, and only the version of John's comment in DaemonicSigil's rebuttal to Jacob [LW · GW] contains those terms. I am not sure if Jacob has expanded on his thoughts on the limits of higher-performance cooling elsewhere in his many comment replies.

So as far as I can tell, this is where the chain of claims and counter-claims is parked for now: a disagreement over how power consumption changes as wires are shortened, and a disagreement over how practical it is for better cooling to allow further miniaturization even if scaling down does result in decreased heat flows and thus higher temperatures inside of the GPU. I expect there might be disagreement over whether scaling down will permit thinning of the surface (as John tentatively proposes).

Note that I am not an expert on these specific topics, although I have a biomedical engineering MS - my contribution here is gathering relevant quotes and attempting to show how they relate to each other in a way that's more convenient than bouncing back and forth between posts. If I have made mistakes, please correct me and I will update this comment. If it's fundamentally wrong, rather than having a couple local errors, I'll probably just delete it as I don't want to add noise to the discussion.

Replies from: M. Y. Zuo, alexander-gietelink-oldenziel, jacob_cannell
comment by M. Y. Zuo · 2023-04-26T21:55:19.157Z · LW(p) · GW(p)

Strongly upvoted for taking the effort to sum up the debate between these two.

Just a brief comment from me, this part:

If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick.

Only makes sense in the context of a specified temperature range and wire material. I'm not sure if it was specified elsewhere or not.

A trivial example: a superconducting wire at 50 K will certainly not have its power consumption halved by scaling down by a factor of 2, since its consumption is already practically zero (though not perfectly zero).

Replies from: johnswentworth
comment by johnswentworth · 2023-04-27T00:30:33.077Z · LW(p) · GW(p)

This is all assuming that the power consumption for a wire is at-or-near the Landauer-based limit Jacob argued in his post.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-26T22:22:51.644Z · LW(p) · GW(p)

Thank you for this effort. I will probably end up allocating a share of the prize money for effortposts like these too.

comment by jacob_cannell · 2023-04-28T22:19:11.104Z · LW(p) · GW(p)

Thank you for the effort in organizing this conversation. I want to clarify a few points.

Around the very beginning of the density & temperature section [LW · GW] I wrote:

but wire volume requirements scale linearly with dimension. So if we ignore all the machinery required for cellular maintenance and cooling, this indicates the brain is at most about 100x larger than strictly necessary (in radius), and more likely only 10x larger.

However, even though the wiring energy scales linearly with radius, the surface area power density which crucially determines temperature scales with the inverse squared radius, and the minimal energy requirements for synaptic computation are radius invariant.

Radius there refers to brain radius, not wire radius. Unfortunately there are two meanings of 'wiring energy' or 'wire energy'. By 'wiring energy' above, I hope the context makes clear that I meant the total energy used by brain wiring/interconnect, not the 'wire energy' in terms of energy per bit per nm, which is more of a fixed constant that depends on wire design tradeoffs.

So my model was/is that if we assume you could just take the brain and keep the same amount of compute (neurons/synapses/etc) but somehow shrink the entire radius by a factor of D, this would decrease total wiring energy by the same factor D by just shortening all the wires in the obvious way.

However, the surface power density scales with radius as $1/r^2$ (at fixed power), so the net effect is that surface power density from interconnect scales with $1/r$, ie it increases by a factor of D as you shrink by a factor of D, which thereby increases your cooling requirement (in terms of net heat flow) by the same factor D. But since the energy use of synaptic computation does not change, that term quickly dominates, scaling with $1/r^2$ and thus $D^2$.
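
A short sketch of this bookkeeping (the 10 W / 10 W split between interconnect and synaptic power is an arbitrary assumption; only the scaling with shrink factor D matters):

```python
# Sketch: interconnect power scales with radius, synaptic power does not, and
# surface area scales with radius squared. Baseline numbers are arbitrary.
def surface_power_density(D, p_interconnect=10.0, p_synapse=10.0, area=1.0):
    power = p_interconnect / D + p_synapse   # wires shorten with radius, synapses don't
    surface = area / D**2                    # surface area falls as the square of radius
    return power / surface

base = surface_power_density(1)
for D in (1, 2, 4, 10):
    print(f"shrink x{D}: surface power density x{surface_power_density(D) / base:.1f}")
```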

In the section you quoted where I say:

This in turn constrains off-chip memory bandwidth to scale poorly: shrinking feature sizes with Moore's Law by a factor of D increases transistor density by a factor of $D^2$, but at best only increases 2d off-chip wire density by a factor of D, and doesn't directly help reduce wire energy cost at all.

Now I have moved to talking about 2D microchips, and "wire energy" here means the energy per bit per nm, which again doesn't scale with device size. Also the D here is scaling in a somewhat different way - it is referring to reducing the size of all devices as in normal Moore's Law shrinkage while holding the total chip size constant, increasing device density.

Looking back at that section I see numerous clarifications I would make now, and I would also perhaps focus more on the surface power density as a function of size, and perhaps analyze cooling requirements. However I think it is reasonably clear from the document that shrinking the brain radius by a factor of X increases the surface power density (and thus cooling requirements in terms of coolant flow at fixed coolant temp) from synaptic computation by $X^2$ and from interconnect wiring by $X$.

In practice digital computers are approaching the limits of miniaturization and tend to be 2D for fast logic chips in part for cooling considerations as I describe. The Cerebras wafer for example represents a monumental engineering advance in terms of getting power into and pumping heat out of a small volume, but they still use a 2D chip design, not 3D, because 2D allows you dramatically more surface area for pumping in power and out heat than a 3D design, at the sacrifice of much worse interconnect geometry scaling in terms of latency and bandwidth.

We can make 3D chips today and do, but that tends to be most viable for memory rather than logic, because memory has far lower power density (and the brain being neuromorphic is more like a giant memory chip with logic sprinkled around right next to each memory unit).

comment by johnswentworth · 2023-04-26T17:38:36.073Z · LW(p) · GW(p)

(Note that this, in turn, also completely undermines the claims about optimality of speed in the next section. Those claims ultimately ground out in high temperatures making high clock speeds prohibitive, e.g. this line:

Scaling a brain to GHz speeds would increase energy and thermal output into the 10MW range, and surface power density to  / , with temperatures well above the surface of the sun

)

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-26T18:21:51.991Z · LW(p) · GW(p)

For extra clarification, that should perhaps read " with (uncooled) temperatures well above ..." (ie isolated in vacuum).

comment by jacob_cannell · 2023-04-26T18:20:05.403Z · LW(p) · GW(p)

I think you may be misunderstanding why I used the blackbody temp - I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space). So when I (or the refs I link) mention "temperatures greater than the surface of the sun" for the surface of some CMOS processor, it is not because we actually believe your GPU achieves that temperature (unless you have some critical cooling failure or short circuit, in which case it briefly achieves a very high temperature before melting somewhere).

So in fact scaling down is plausibly free, for purposes of heat management. (Though I'm not highly confident that would work in practice. In particular, I'm least confident about the temperature gradient scaling with inverse system size, in practice.)

I think this makes all the wrong predictions and so is likely wrong, but I will consider it more.

On top of that, we could of course just use a colder environment, i.e. pump liquid nitrogen or even liquid helium over the thing.

Of course - not really relevant for the brain, but that is an option for computers. Obviously you aren't gaining thermodynamic efficiency by doing so - you pay extra energy to transport the heat.

All that being said, I'm going to look into this more and if I feel a correction to the article is justified I will link to your comment here with a note. But the temp/size scaling part is not one of the more core claims so any correction there probably doesn't change the conclusion much.

Replies from: johnswentworth
comment by johnswentworth · 2023-04-26T18:53:54.174Z · LW(p) · GW(p)

I think you may be misunderstanding why I used the blackbody temp - I (and the refs I linked) use that as a starting point to indicate the temp the computing element would achieve without convective cooling (ie in vacuum or outer space).

There's a pattern here which seems-to-me to be coming up repeatedly (though this is the most legible example I've seen so far). There's a key qualifier which you did not actually include in your post, which would make the claims true. But once that qualifier is added, it's much more obvious that the arguments are utterly insufficient to back up big-sounding claims like:

Thus even some hypothetical superintelligence, running on non-exotic hardware, will not be able to think much faster than an artificial brain running on equivalent hardware at the same clock rate.

Like, sure, our hypothetical superintelligence can't build highly efficient compute which runs in space without any external cooling machinery. So, our hypothetical superintelligence will presumably build its compute with external cooling machinery, and then this vacuum limit just doesn't matter.

You could add all those qualifiers to the strong claims about superintelligence, but then they will just not be very strong claims. (Also, as an aside, I think the wording of the quoted section is not the claim you intended to make, even ignoring qualifiers? The quote is from the speed section, but "equivalent hardware at the same clock rate" basically rules out any hardware speed difference by construction. I'm responding here to the claim which I think you intended to argue for in the speed section.)


Obviously you aren't gaining thermodynamic efficiency by doing so - you pay extra energy to transport the heat.

Note that you also potentially save energy by running at a lower temperature, since the Landauer limit scales down with temperature. I think it comes out to roughly a wash: operate at 10x lower temperature, and power consumption can drop by 10x (at Landauer limit), but you have to pay 9x the (now reduced) power consumption in work to pump that heat back up to the original temperature. So, running at lower temperature ends up energy-neutral if we're near thermodynamic limits for everything.
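
A quick arithmetic check of the "roughly a wash" claim, assuming the Landauer limit at the cold temperature plus ideal Carnot heat-pump work to reject the heat at the hot ambient temperature:

```python
# Sketch: Landauer energy per bit erased at T_cold, plus ideal (Carnot) heat-pump
# work to reject that heat at T_hot. Real pumps would do worse than Carnot.
import math

k_B = 1.380649e-23

def total_energy_per_bit(T_cold, T_hot=300.0):
    q_cold = k_B * T_cold * math.log(2)               # heat dumped at T_cold
    pump_work = q_cold * (T_hot - T_cold) / T_cold    # Carnot work to lift it to T_hot
    return q_cold + pump_work

for T_cold in (300.0, 30.0, 3.0):
    e = total_energy_per_bit(T_cold)
    print(f"T_cold={T_cold:5.1f} K: {e:.3e} J/bit = {e / (k_B * 300.0 * math.log(2)):.2f} x kT_hot ln2")
```

At the Carnot limit the total always comes out to $k_B T_{hot} \ln 2$ per bit, i.e. exactly a wash.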

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-26T19:16:24.492Z · LW(p) · GW(p)

The 'big-sounding' claim you quoted makes more sense only with the preceding context you omitted:

Conclusion: The brain is a million times slower than digital computers, but its slow speed is probably efficient for its given energy budget, as it allows for a full utilization of an enormous memory capacity and memory bandwidth. As a consequence of being very slow, brains are enormously circuit cycle efficient. Thus even some hypothetical superintelligence, running on non-exotic hardware, will not be able to think much faster than an artificial brain running on equivalent hardware at the same clock rate.

Because of its slow speed, the brain is super-optimized for intelligence per clock cycle. So digital superintelligences can think much faster, but to the extent they do so they are constrained to be brain-like in design (ultra optimized for low circuit depth). I have a decade old post analyzing/predicting this here [LW · GW], and today we have things like GPT4 which imitate the brain but run 1000x to 10000x faster during training, and thus excel at writing.

comment by ADifferentAnonymous · 2023-04-26T18:28:11.436Z · LW(p) · GW(p)

I agree the blackbody formula doesn't seem that relevant, but it's also not clear what relevance Jacob is claiming it has. He does discuss that the brain is actively cooled. So let's look at the conclusion of the section:

Conclusion: The brain is perhaps 1 to 2 OOM larger than the physical limits for a computer of equivalent power, but is constrained to its somewhat larger than minimal size due in part to thermodynamic cooling considerations.

If the temperature-gradient-scaling works and scaling down is free, this is definitely wrong. But you explicitly flag your low confidence in that scaling, and I'm pretty sure it wouldn't work.* In which case, if the brain were smaller, you'd need either a hotter brain or a colder environment.

I think that makes the conclusion true (with the caveat that 'considerations' are not 'fundamental limits').

(My gloss of the section is 'you could potentially make the brain smaller, but it's the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table').

* I can provide some hand-wavy arguments about this if anyone wants.

Replies from: johnswentworth
comment by johnswentworth · 2023-04-26T19:04:34.502Z · LW(p) · GW(p)

My gloss of the section is 'you could potentially make the brain smaller, but it's the size it is because cooling is expensive in a biological context, not necessarily because blind-idiot-god evolution left gains on the table'

I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn't apply to engineered compute hardware. More generally, the brain is probably efficient relative to lots of constraints which don't apply to engineered compute hardware. A hypothetical AI designing hardware will have different constraints.

Either Jacob needs to argue that the same limiting constraints carry over (in which case hypothetical AI can't readily outperform brains), or he does not have a substantive claim about AI being unable to outperform brains. If there's even just one constraint which is very binding for brains, but totally tractable for engineered hardware, then that opens the door to AI dramatically outperforming brains.

Replies from: jacob_cannell, ADifferentAnonymous
comment by jacob_cannell · 2023-05-02T15:59:49.190Z · LW(p) · GW(p)

I tentatively buy that, but then the argument says little-to-nothing about barriers to AI takeoff. Like, sure, the brain is efficient subject to some constraint which doesn't apply to engineered compute hardware.

The main constraint at minimal device sizes is the thermodynamic limit for irreversible computers, so the wire energy constraint is dominant there.

However the power dissipation/cooling ability for a 3D computer only scales with the surface area ($\propto r^2$), whereas compute device count scales with volume ($\propto r^3$) and interconnect scales somewhere in between.

The point of the temperature/cooling section was just to show that shrinking the brain by a factor of X (if possible given space requirements of wire radius etc) would increase surface power density by a factor of $X^2$, but would only decrease wire length & energy by X and would not decrease synapse energy at all.

2D chips scale differently of course: the surface area and heat dissipation tend to both scale with $r^2$. Conventional chips are already approaching miniaturization limits and will dissipate too much power at full activity, but that's a separate investigation. 3D computers like the brain can't run that hot given any fixed tech ability to remove heat per unit surface area. 2D computers are also obviously worse in many respects, as long range interconnect bandwidth (to memory) only scales with $r$ rather than the $r^2$ of compute, which is basically terrible compared to a 3D system where compute density and long-range interconnect scale with $r^3$ and $r^2$ respectively.
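
A compact sketch of the scaling exponents being compared here, for a system whose linear dimension doubles (the exponents follow the argument above; the absolute numbers are meaningless):

```python
# Illustration of how compute, long-range interconnect bandwidth, and heat removal
# scale with linear size for 2D chips vs 3D (brain-like) systems.
SCALING_EXPONENTS = {
    "2D chip":   {"compute": 2, "long-range interconnect": 1, "heat removal": 2},
    "3D system": {"compute": 3, "long-range interconnect": 2, "heat removal": 2},
}

for geometry, exponents in SCALING_EXPONENTS.items():
    gains = ", ".join(f"{name} x{2**e}" for name, e in exponents.items())
    print(f"{geometry}: doubling linear size gives {gains}")
```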

comment by ADifferentAnonymous · 2023-04-26T21:55:12.535Z · LW(p) · GW(p)

Had it turned out that the brain was big because blind-idiot-god left gains on the table, I'd have considered it evidence of more gains lying on other tables and updated towards faster takeoff.

Replies from: johnswentworth
comment by johnswentworth · 2023-04-26T22:58:58.157Z · LW(p) · GW(p)

I mean, sure, but I doubt that e.g. Eliezer thinks evolution is inefficient in that sense.

Basically, there are only a handful of specific ways we should expect to be able to beat evolution in terms of general capabilities, a priori:

  • Some things just haven't had very much time to evolve, so they're probably not near optimal. Broca's area would be an obvious candidate, and more generally whatever things separate human brains from other apes.
  • There's ways to nonlocally redesign the whole system to jump from one local optimum to somewhere else.
  • We're optimizing against an environment different from the ancestral environment, or structural constraints different from those faced by biological systems, such that some constraints basically cease to be relevant. The relative abundance of energy is one standard example of a relaxed environmental constraint; the birth canal as a limiting factor on human brain size during development or the need to make everything out of cells are standard examples of relaxed structural constraints.
    • One particularly important sub-case of "different environment": insofar as the ancestral environment mostly didn't change very quickly, evolution didn't necessarily select heavily for very generalizable capabilities. The sphex wasp behavior is a standard example. A hypothetical AI designer would presumably design/select for generalization directly.

(I expect that Eliezer would agree with roughly this characterization, by the way. It's a very similar way-of-thinking to Inadequate Equilibria [? · GW], just applied to bio rather than econ.) These kinds of loopholes leave ample space to dramatically improve on the human brain.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T01:05:07.515Z · LW(p) · GW(p)

Interesting - I think I disagree most with 1. The neuroscience seems pretty clear that the human brain is just a scaled up standard primate brain; the secret sauce is just language (I discuss this now and again in some posts and in my recent part 2). In other words, nothing new about the human brain has had much time to evolve; all evolution did was tweak a few hyperparams, mostly around size and neoteny (training time): very much like GPT-N scaling (which my model predicted).

Basically human technology beats evolution because we are not constrained to use self replicating nanobots built out of common locally available materials for everything. A jet airplane design is not something you can easily build out of self replicating nanobots - it requires too many high energy construction processes and rare materials spread across the earth.

Microchip fabs and their outputs are the pinnacle of this difference - requiring rare elements across the periodic table, massively complex global supply chains and many steps of intricate high energy construction/refinement processes all throughout.

What this ends up buying you mostly is very high energy densities - useful for engines, but also for fast processors.

Replies from: johnswentworth
comment by johnswentworth · 2023-04-27T16:34:07.293Z · LW(p) · GW(p)

Yeah, the main changes I'd expect in category 1 are just pushing things further in the directions they're already moving, and then adjusting whatever else needs to be adjusted to match the new hyperparameter values.

One example is brain size: we know brains have generally grown larger in recent evolutionary history, but they're locally-limited by things like e.g. birth canal size. Circumvent the birth canal, and we can keep pushing in the "bigger brain" direction.

Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction, the various physiological problems those variants can cause need to be offset by other simultaneous changes, which is the sort of thing a designer can do a lot faster than evolution can. (And note that, given how much the Ashkenazi dominated the sciences in their heyday, that's the sort of change which could by itself produce sufficiently superhuman performance to decisively outperform human science/engineering, if we can go just a few more standard deviations along the same directions.)

... but I do generally expect that the "different environmental/structural constraints" class is still where the most important action is by a wide margin. In particular, the "selection for generality" part is probably pretty big game, as well as selection pressures for group interaction stuff like language (note that AI potentially allows for FAR more efficient communication between instances), and the need for learning everything from scratch in every instance rather than copying, and generally the ability to integrate quantitatively much more information than was typically relevant or available to local problems in the ancestral environment.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T19:47:00.824Z · LW(p) · GW(p)

Circumvent the birth canal, and we can keep pushing in the "bigger brain" direction.

Chinchilla scaling already suggests the human brain is too big for our lifetime data, and multiple distant lineages with few natural limits on size (whales, elephants) ended up plateauing around the same OOM of brain neuron and synapse counts.
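
To make the kind of comparison being invoked here concrete (philh questions downthread whether it is principled at all), here is a rough sketch in which every conversion factor is an assumption for illustration, apart from the Chinchilla ~20 tokens-per-parameter rule of thumb:

```python
# Very rough illustration of a Chinchilla-style comparison for the brain.
# Synapse count, tokens-per-second, and the param<->synapse equivalence are all
# loose assumptions; only the ~20 tokens/param ratio is from the Chinchilla paper.
synapses = 1e14                          # order-of-magnitude synapse count
chinchilla_tokens_per_param = 20         # LLM rule of thumb from the Chinchilla paper

optimal_tokens = synapses * chinchilla_tokens_per_param

lifetime_seconds = 30 * 365 * 24 * 3600  # ~30 years of experience
assumed_tokens_per_second = 10           # crude sensory/linguistic data-rate assumption
lifetime_tokens = lifetime_seconds * assumed_tokens_per_second

print(f"Chinchilla-optimal data for {synapses:.0e} params: ~{optimal_tokens:.0e} tokens")
print(f"Assumed lifetime data: ~{lifetime_tokens:.0e} tokens, "
      f"~{optimal_tokens / lifetime_tokens:.0e}x less than 'optimal'")
```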

Or, another example is the genetic changes accounting for high IQ among the Ashkenazi. In order to go further in that direction,

Human intelligence in terms of brain arch priors also plateaus; the Ashkenazi were just selected a bit more strongly towards that plateau. Intelligence also has neoteny tradeoffs, resulting in numerous ecological niches in tribes - faster to breed often wins.

Replies from: philh, Vivek, alexander-gietelink-oldenziel
comment by philh · 2023-05-01T13:39:28.316Z · LW(p) · GW(p)

Chinchilla scaling already suggests the human brain is too big for our lifetime data

So I haven't followed any of the relevant discussion closely, apologies if I'm missing something, but:

IIUC Chinchilla here references a paper talking about tradeoffs between how many artificial neurons a network has and how much data you use to train it; adding either of those requires compute, so to get the best performance where do you spend marginal compute? And the paper comes up with a function for optimal neurons-versus-data for a given amount of compute, under the paradigm we're currently using for LLMs. And you're applying this function to humans.

If so, a priori this seems like a bizarre connection for a few reasons, any one of which seems sufficient to sink it entirely:

  • Is the paper general enough to apply to human neural architecture? By default I would have assumed not, even if it's more general than just current LLMs.
  • Is the paper general enough to apply to human training? By default I would have assumed not. (We can perhaps consider translating the human visual field to a number of bits and taking a number of snapshots per second and considering those to be training runs, but... is there any principled reason not to instead translate to 2x or 0.5x the number of bits or snapshots per second? And that's just the amount of data, to say nothing of how the training works.)
  • It seems you're saying "at this amount of data, adding more neurons simply doesn't help" rather than "at this amount of data and neurons, you'd prefer to add more data". That's different from my understanding of the paper but of course it might say that as well or instead of what I think it says.

To be clear, it seems to me that you don't just need the paper to be giving you a scaling law that can apply to humans, with more human neurons corresponding to more artificial neurons and more human lifetime corresponding to more training data. You also need to know the conversion functions, to say "this (number of human neurons, amount of human lifetime) corresponds to this (number of artificial neurons, amount of training data)" and I'd be surprised if we can pin down the relevant values of either parameter to within an order of magnitude.

...but again, I acknowledge that you know what you're talking about here much more than I do. And, I don't really expect to understand if you explain, so you shouldn't necessarily put much effort into this. But if you think I'm mistaken here, I'd appreciate a few words like "you're wrong about the comparison I'm drawing" or "you've got the right idea but I think the comparison actually does work" or something, and maybe a search term I can use if I do feel like looking into it more.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-01T18:04:33.690Z · LW(p) · GW(p)

Thanks for your contribution. I would also appreciate a response from Jake. 

comment by Vivek Hebbar (Vivek) · 2023-04-27T21:12:31.920Z · LW(p) · GW(p)

Human intelligence in terms of brain arch priors also plateaus

Why do you think this?

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-28T10:50:15.204Z · LW(p) · GW(p)

For my understanding: what is a brain arch?

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T13:41:46.481Z · LW(p) · GW(p)

The architectural design of a brain, which I think of as a prior on the weights, so I sometimes call it the architectural prior. It is encoded in the genome and is the equivalent of the high level PyTorch code for a deep learning model.

comment by jacob_cannell · 2023-05-02T16:22:01.138Z · LW(p) · GW(p)

Jacob's analysis in that section also fails to adjust for how, by his own model in the previous section, power consumption scales linearly with system size (and also scales linearly with temperature).

If we fix the neuron/synapse/etc count (and just spread them out evenly across the volume) then the length and thus power consumption of interconnect scale linearly with radius $r$, but the power consumption of compute units (synapses) doesn't scale at all. Surface power density therefore scales with $1/r$ from interconnect and $1/r^2$ from the fixed synaptic compute.

First key observation: all the R's cancel out. If we scale down by a factor of 2, the power consumption is halved (since every wire is half as long), the area is quartered (so power density over the surface is doubled), and the temperature gradient is doubled since the surface is half as thick

This seems rather obviously incorrect to me:

  1. There is simply a maximum amount of heat/entropy any particle of coolant fluid can extract, based on the temperature difference between the coolant particle and the compute medium
  2. The maximum flow of coolant particles scales with the surface area.
  3. Given a fixed compute temperature limit, coolant temp, and coolant pump rate, there is thus a limit on the device radius

But obviously I do agree the brain is nowhere near the technological limits of active cooling in terms of entropy removed per unit surface area per unit time, but that's also mostly irrelevant because you expend energy to move the heat and the brain has a small energy budget of 20W. Its coolant budget is proportional to its compute budget.

Moreover as you scale the volume down the coolant travels a shorter distance and has less time to reach equilibrium temp with the compute volume and thus extract the max entropy (but not sure how relevant that is at brain size scales).
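
A minimal sketch of the bookkeeping behind point 1, using $Q = \dot{m} c_p \Delta T$ with rough assumed values for cerebral blood flow and blood heat capacity:

```python
# Illustration: heat a coolant stream can carry away, Q = m_dot * c_p * delta_T.
# Cerebral blood flow, density, and heat capacity are rough assumed values.
blood_flow_m3_per_s = 750e-6 / 60        # ~750 mL/min, rough
blood_density = 1050.0                   # kg/m^3
blood_c_p = 3600.0                       # J/(kg K)

m_dot = blood_flow_m3_per_s * blood_density       # kg/s
watts_per_kelvin = m_dot * blood_c_p

brain_power = 20.0                                # W
delta_T = brain_power / watts_per_kelvin
print(f"Removing {brain_power} W needs only ~{delta_T:.2f} K of coolant temperature rise")
```

Under these assumed numbers the brain's ~20 W is carried off with well under a degree of coolant temperature rise, consistent with the point that the binding constraint is the energy budget rather than heat removal per unit area.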

comment by Eli Tyre (elityre) · 2023-04-26T20:43:25.508Z · LW(p) · GW(p)

I would contribute $75 to the prize. : ) 

comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-04-26T16:53:27.144Z · LW(p) · GW(p)

I think Jake is right that the brain is very energy efficient (disclaimer: I'm currently employed by Jake and respect his ideas highly.)  I'm pretty sure though that the question about energy efficiency misses the point. There are other ways to optimize the brain, such as improving axonal transmission speed from the current range 0.5 - 10 meters/sec to more like the speed of electricity through wires ~250,000,000 meters per second. Or adding abilities the mammalian brain does not have, such as the ability to add new long range neurons connecting distal parts of the brain. We can reconfigure the long range neurons we have, but not add new ones. So basically, I come down on the other side of his conclusion in his recent post. I think rapid recursive self-improvement through software changes is indeed possible, and a risk we should watch out for.

Replies from: qv^!q, jacob_cannell
comment by qvalq (qv^!q) · 2023-04-28T17:24:40.406Z · LW(p) · GW(p)

disclaimer

This might be the least disclamatory disclaimer I've ever read.

I'd even call it a claimer.

Replies from: nathan-helm-burger
comment by Nathan Helm-Burger (nathan-helm-burger) · 2023-04-28T22:39:42.600Z · LW(p) · GW(p)

hah, yeah. What I'm trying to get at is something like, "My ability to objectively debate this person in public is likely hampered in ways not clearly observable to me by the fact that I am working closely with them on their projects and have a lot of shared private knowledge and economic interests with them. Please keep these limitations in mind while reading my comments."

comment by jacob_cannell · 2023-04-26T17:40:03.459Z · LW(p) · GW(p)

There are other ways to optimize the brain, such as improving axonal transmission speed from the current range 0.5 - 10 meters/sec to more like the speed of electricity through wires ~250,000,000 meters per second.

I agree this is the main obvious improvement of digital minds, and speculated on some implications here [LW · GW] a decade ago. But if it requires even just 1 kW of energy flowing through GPUs to match one human brain, then using all of current world power output towards GPUs would still not produce more equivalent brain power than humanity (world power output ~ 4 TW, and GPU production would have to increase by OOMs).

You could use all of world energy output to have a few billion human-speed AGIs, or a few million that think 1000x faster, etc.
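
The arithmetic behind that, using the ~1 kW-per-brain-equivalent and ~4 TW figures assumed in the comment above:

```python
# Back-of-envelope using the ~1 kW per brain-equivalent and ~4 TW figures above,
# and assuming a 1000x-faster mind costs 1000x the power.
world_power_w = 4e12
watts_per_brain_equivalent = 1e3

human_speed_agis = world_power_w / watts_per_brain_equivalent
print(f"~{human_speed_agis:.0e} human-speed AGIs")                      # ~4e9
print(f"or ~{human_speed_agis / 1000:.0e} AGIs thinking 1000x faster")  # ~4e6
```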

Replies from: Vivek, Bjartur Tómas, Vivek
comment by Vivek Hebbar (Vivek) · 2023-04-27T08:55:29.243Z · LW(p) · GW(p)

You could use all of world energy output to have a few billion human speed AGI, or a millions that think 1000x faster, etc.

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??  The difference between top scientists and average humans seems to be something like "software" (Einstein isn't using 2x the watts or neurons).  So then it should be totally possible for each of the "millions of human-level AIs" to be equivalent to Einstein.  Couldn't a million Einstein-level scientists running at 1000x speed beat all human scientists combined?
And, taking this further, it seems that some humans are at least 100x more productive at science than others, despite the same brain constraints.  Then why shouldn't it be possible to go further in that direction, and have someone 100x more productive than Einstein at the same flops?  And if this is possible, it seems to me like whatever efficiency constraints the brain is achieving cannot be a barrier to foom, just as the energy efficiency (and supposed learning optimality?) of the average human brain does not rule out Einstein more than 100x-ing them with the same flops.

Replies from: jacob_cannell, Vivek
comment by jacob_cannell · 2023-04-27T17:03:16.510Z · LW(p) · GW(p)

Isn't it insanely transformative to have millions of human-level AIs which think 1000x faster??

Yes it will be transformative.

GPT models already think 1000x to 10000x faster - but only for the learning stage (absorbing knowledge), not for inference (thinking new thoughts).

comment by Vivek Hebbar (Vivek) · 2023-04-27T08:59:19.394Z · LW(p) · GW(p)

Of course, my argument doesn't pin down the nature or rate of software-driven takeoff, or whether there is some ceiling.  Just that the "efficiency" arguments don't seem to rule it out, and that there's no reason to believe that science-per-flop has a ceiling near the level of top humans.

comment by Tomás B. (Bjartur Tómas) · 2023-04-28T15:55:21.525Z · LW(p) · GW(p)

The whole "compute greater than humanity" thing does not seem like a useful metric. It's just completely not necessary to exceed total human compute to dis-empower humans.  We parallelize extremely poorly. And given how recent human civilization at this scale is and how adversarial humans are towards each other,  it would be surprising if we used our collective compute in even a remotely efficient way. Not to mention the bandwidth limitations.  

The summed compute of conquistador brains was much less than those they dis-empowered. The summed compute of slaughterhouse worker brains is vastly less than that of the chickens they slaughter in a single month! 

I don't think this point deserves any special salience at all. 

comment by Vivek Hebbar (Vivek) · 2023-04-27T08:38:31.354Z · LW(p) · GW(p)

In your view, is it possible to make something which is superhuman (i.e. scaled beyond human level), if you are willing to spend a lot on energy, compute, engineering cost, etc?

comment by Max H (Maxc) · 2023-04-29T13:10:18.757Z · LW(p) · GW(p)

I made some long comments below about why I think the whole Synapses [LW(p) · GW(p)] section is making an implicit type error that invalidates most of the analysis. In particular, claims like this:

Thus the brain is likely doing on order $10^{14}$ to $10^{15}$ low-medium precision multiply-adds per second.


Are incorrect or at least very misleading, because they're implicitly comparing "synaptic computation" to "flop/s", but "synaptic computation" is not a performance metric of the system as a whole.

My most recent comment is here [LW(p) · GW(p)], which I think mostly stands on its own. This thread starts here [LW(p) · GW(p)], and an earlier, related thread starts here [LW(p) · GW(p)].

If others agree that my basic objection in these threads is valid, but find the presentation in the most recent comment is still confusing, I might expand it into a full post.

comment by Vaniver · 2023-04-26T20:24:22.434Z · LW(p) · GW(p)

It's been years since I looked into it and I don't think I have access to my old notes, so I don't plan to make a full entry. In short, I think the claim of "brains operate at near-maximum thermodynamic efficiency" is true. (I don't know where Eliezer got 6 OoM but I think it's wrong, or about some nonobvious metric [edit: like the number of generations used to optimize].)

I should also reiterate that I don't think it's relevant to AI doom arguments [LW(p) · GW(p)]. I am not worried about a computer that can do what I can do with 10W; I am worried about a computer that can do more than what I can do with 10 kW (or 10 MW or so on).

[EDIT: I found one of the documents that I thought had this and it didn't, and I briefly attempted to run the calculation again; I think 10^6 cost reduction is plausible for some estimates of how much computation the brain is using, but not others.]

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-26T22:23:46.023Z · LW(p) · GW(p)

I encourage you to participate. I will split the prize money between all serious entries, weighted by my subjective estimate of their respective Shapley value.

comment by Steven Byrnes (steve2152) · 2023-04-26T23:44:37.654Z · LW(p) · GW(p)

This comment is about interconnect losses, based on things I learned from attending a small conference on energy-efficient electronics at UC Berkeley in 2013. I can’t immediately find my notes or the handouts so am going off memory.

Eli Yablonovitch kicked off the conference with the big picture. It’s all about interconnect losses, he said. The formula is ½CV² from charging and discharging the (unintentional / stray) "capacitor" in which one "plate" is the interconnect wire and the other "plate" is any other conductive stuff in its vicinity.

There doesn’t seem to be any plan for how to dramatically reduce the stray capacitance C, he said, so we really just need to get V lower (or of course switch to optical interconnects—see this comment of mine [LW(p) · GW(p)], referring to a project whose earlier stages was the topic of one of the conference talks).
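For a sense of scale, here is a minimal numeric sketch of that ½CV² loss (the capacitance-per-length and voltage below are assumed, representative values, not numbers from the talk):

```python
# Sketch of interconnect switching loss, E = 1/2 * C * V^2 per transition.
C_per_m = 2e-10   # F/m -- assumed stray capacitance per unit length (representative)
V = 1.0           # volts -- assumed logic swing
L = 1e-3          # 1 mm of wire
E = 0.5 * C_per_m * L * V**2
print(f"{E:.1e} J per bit per mm (~{E / 1.602e-19:.1e} eV)")   # ~1e-13 J, ~6e5 eV
# The loss scales as V^2, which is why lowering V (or going optical) is the
# main lever discussed above.
```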

The challenge is that conventional transistors need V to be much higher than kT/e, where e is the electron charge, because the V is forming an electrostatic barrier that is supposed to block electrons, even when those electrons might be randomly thermally excited sometimes. The relevant technical term here is “subthreshold swing”. There is a natural (temperature-dependent) limit to subthreshold swing in normal transistors, based on thermal excitation over the barrier—the “thermionic limit” of 60mV/decade at room temperature. He might have said that actual transistors still had room for improvement before hitting that thermionic limit, but anyway the focus of the conference was to do much better than the thermionic limit.

There were a bunch of talks covering different possible approaches. All seemed sound in principle, none seemed ready for primetime, and I haven’t checked what if anything has come of them since 2013.

  • You can "increase q" and thereby decrease kT/q. Wait, what? Recall, the electrostatic barrier is qV, where q is the electric charge of the thing climbing the barrier. In a transistor, the thing climbing the barrier is a single individual electron. But one of the presenters was trying to develop mechanical (!) (NEMS) contacts that would switch a connection on and off the old-fashioned way—by actual physical contact between two conductors. The trick is that a mechanical cantilever could easily have a charge of 10e or 100e, and therefore a quite low voltage could actuate it without thermal noise being an issue. Obviously they were having a heck of a time with reliability, stiction and so on, but they suggested that there were no fundamental barriers.
  • You can have step-up / step-down voltage converters (I hesitate to use the word “transformers”) between the switches/transistors and the interconnects. I.e., we can send a high-current-at-low-voltage through the wires, then step it up to a low-current-at-high-voltage when it arrives at a transistor gate. Doesn't violate any laws of physics, but how do we do that at nanometer scale? I recall two talks in this category. One involved a stack of two nanofabricated mechanically-coupled piezoelectric blocks of different sizes, in a larger mechanical box. The other involved a ferroelectric layer, which under certain circumstances could act like a capacitor in series whose capacitance was negative. (For the latter, I vaguely recall that I started out skeptical but looked into it and decided that the theory was sound.)
  • You can just lower the operating temperature to below ambient (e.g. liquid nitrogen) while redesigning the chip for a correspondingly lower operating voltage, hence lowering kT/e. I think Eli said that the math did in fact work out—the energy costs of cooling below ambient could be more than paid back. (I think it doesn’t help “in the ideal limit”, but I think he said it would help with actual chips, or something.)
  • You can use a different kind of switch that doesn’t involve electrons being thermally excited over a barrier. I recall one talk in this category. It concerned quantum tunneling transistors (a cousin of “backwards diodes”, I think). I recall that their preliminary experimental results did not show subthreshold swing better than the thermionic limit. I think I wasn’t convinced that their approach could beat the thermionic limit even in theory. I believe they weren’t quite sure either. They said “how sharp is the semiconductor band edge” was a basic physics question that, to their dismay, theoretical physicists seemed to have never looked into.

Anyway, that’s my understanding of interconnect losses.

Jacob’s discussion of interconnect losses is quite different, and doesn’t even mention ½CV². There was some elaboration in this thread [LW(p) · GW(p)]. I think that Jacob’s weird (from my perspective) formula was roughly consistent with what you get from ½CV² if you assume that the V is the normal voltage as constrained by the thermionic limit, which is what it is today and what most integrated circuit people would assume it will always be. But at least on paper, some of the above ideas would be able to lower interconnect losses by a lot, like I presume at least 1-2 OOM, if anyone can get them to actually work in practice.

Replies from: jacob_cannell, jacob_cannell
comment by jacob_cannell · 2023-04-27T04:28:54.575Z · LW(p) · GW(p)

So I predict in advance these approaches will fail, or succeed only through using some reversible mechanism (with attendant tradeoffs).

If you accept the Landauer analysis then the only question that remains for nano devices (where interconnect tiles are about the same size as your compute devices) is why you would ever use irreversible copy-tiles for interconnect instead of reversible move-tiles. It really doesn't matter whether you are using ballistic electrons or electron waves or mechanical rods, you just get different variations of ways to represent a bit (which still mostly looks like a relay, but the form isn't especially relevant).

A copy tile copies a bit from one side to the other. It has an internal memory state M (1 bit), and it takes an input bit from, say, the left and produces an output bit on the right. Its logic table looks like:

O I M
1 1 0
1 1 1
0 0 0
0 0 1

In other words, every cycle it erases whatever leftover bit it was storing and copies the input bit to the output, so it always erases one bit. This predicts nanowire energy near exactly, and there is a reason Cavin et al. use it.
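As a rough numeric illustration of what the copy-tile picture implies (a sketch only; the ~1 eV per reliable bit and the ~1 nm tile size are the assumptions stated above, not independently derived here):

```python
# Irreversible copy-tile interconnect: ~1 bit erased per tile per relayed bit.
eV = 1.602e-19                  # J
E_per_bit_per_tile = 1.0 * eV   # assumed ~1 eV per reliable high-speed bit
tile_nm = 1.0                   # assumed ~1 nm tile (molecular scale)
E_per_mm = E_per_bit_per_tile * (1e6 / tile_nm)   # tiles per mm of wire
print(f"{E_per_mm * 1e15:.0f} fJ per bit per mm")  # ~160 fJ/mm
# Same order as the ~81 fJ/mm figure quoted later in this thread.
```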

But why do that instead of just move a bit? That is the part which I think is less obvious.

I believe it has to do with the difficulties of noise buildup. The copy device doesn't allow any error to accumulate at all. Your bits can be right on your reliability threshold (1 eV or whatever depending on the required reliability and speed tradeoffs), and error doesn't accumulate regardless of wire length, because you are erasing at every step.

The reversible move device seems much better - and obviously is for energy efficiency - but it accumulates a bit of noise on the Landauer scale at every cycle, because of various thermal/quantum noise sources, as you are probably aware: your device is always coupled to a thermal bath, or still subject to cosmic rays even in outer space, and it is producing its own heat regardless, at least for error correction. And if you aren't erasing noise, then you are accumulating noise.
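Here is a deliberately crude toy version of that accumulation argument (a sketch only: it assumes an independent per-step thermal flip probability given by a Boltzmann factor at the stated bit energy, as a stand-in for the real analog noise processes):

```python
import math

kT = 4.14e-21     # J at ~300 K
eV = 1.602e-19    # J
for E_bit_eV in (1.0, 0.5, 0.25):
    p_flip = math.exp(-E_bit_eV * eV / kT)   # assumed per-step thermal flip probability
    n_steps = 10**6                          # ~1 mm of ~1 nm steps with no erasure
    p_survive = (1.0 - p_flip) ** n_steps
    print(f"{E_bit_eV:4} eV: per-step flip ~{p_flip:.1e}, "
          f"survival over 1e6 steps ~{p_survive:.3f}")
# ~1 eV bits survive a million un-erased steps; ~0.25 eV bits are essentially
# guaranteed to be corrupted. Erasing (restoring) at each step avoids the compounding.
```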

Edit: After writing this out I just stumbled on this paper by Siamak Taati[1] which makes the same argument about exponential noise accumulation much more formally. It looks like fully reversible computers are as challenging as scaling quantum computers. Quantum computers are naturally reversible and have all the same noise accumulation issues, resulting in quick decoherence - so you end up trying to decouple them from the environment as much as possible (temperatures near absolute zero).

You can also have interconnect through free particle transmission as in lasers/optics, but that of course doesn't completely avoid the noise accumulation issue. Optical interconnect also just greatly increases the device size which is obviously a huge downside but helps further reduce energy losses by just massively scaling up the interaction length or equivalent tile size.


  1. Reversible cellular automata in presence of noise rapidly forget everything∗ ↩︎

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-27T13:06:12.147Z · LW(p) · GW(p)

Your whole reply here just doesn’t compute for me. An interconnect is a wire. We know how wires work. They have resistance-per-length, and capacitance-per-length, and characteristic impedance, and Johnson noise, and all the other normal things about wires that we learned in EE 101. If the wire is very small—even down to nanometers—it’s still a wire, it’s just a wire with a higher resistance-per-length (both for the obvious reason of lower cross-sectional area, and because of surface scattering and grain-boundary scattering).

I don’t know why you’re talking about “tiles”. Wires are not made of tiles, right? I know it’s kinda rude of me to not engage with your effortful comment, but I just find it very confusing and foreign, right from the beginning.

If it helps, here is the first random paper I found about on-chip metal interconnects. It treats them exactly like normal (albeit small!) metal wires—it talks about resistance, resistivity, capacitance, current density, and so on. That’s the kind of analysis that I claim is appropriate.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T17:12:18.910Z · LW(p) · GW(p)

Your whole reply here just doesn’t compute for me. An interconnect is a wire. We know how wires work. They have resistance-per-length, and capacitance-per-length, and characteristic impedance, and Johnson noise, and all the other normal things about wires that we learned in EE 101

None of those are fundamental - all those rules/laws are derived - or should be derivable - from simpler molecular/atomic level simulations.

I don’t know why you’re talking about “tiles”. Wires are not made of tiles, right?

A wire carries a current and can be used to power devices, and/or it can be used to transmit information - bits. In the latter usage, noise analysis is crucial.

Let me state a chain of propositions to see where you disagree:

  1. The Landauer energy/bit/noise analysis is correct (so high speed reliable bits correspond to ~1eV).
  2. The analysis applies to computers of all scales, down to individual atoms/molecules.
  3. For a minimal molecular nanowire, the natural tile size is the electron radius.
  4. An interconnect (wire) tile can be reversible or irreversible.
  5. Reversible tiles rapidly accumulate noise/error à la Taati and so aren't used for nanoscale interconnect in brains or computers.

From 1 - 4 we can calculate the natural wire energy, as it's just 1 electron charge per 1 electron radius, and it reproduces the wire equation near exactly (recall that other thread in Brain Efficiency).

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-27T18:14:17.778Z · LW(p) · GW(p)

Let’s consider a ≤1mm wire on a 1GHz processor. Given the transmission line propagation speed, we can basically assume that the whole wire is always at a single voltage. I want to treat the whole wire as a unit. We can net add charge from the wire, anywhere in the wire, and the voltage of the whole wire will go up. Or we can remove charge from the wire, anywhere in the wire, and the voltage of the whole wire will go down.

Thus we have a mechanism for communication. We can electrically isolate the wire, and I can stand at one end of the wire, and you can stand at the other. I pull charge off of the wire at my end, and you notice that the voltage of the whole wire has gone down. And then I add charge into the wire, and you notice that the voltage of the whole wire has gone up. So now we’re communicating. And this is how different transistors within a chip communicate with each other, right?

I don’t think electron radius is relevant in this story. And there are no “tiles”. And this is irreversible. (When we bring the whole wire from low voltage to high voltage or vice-versa, energy is irrecoverably dissipated.) And the length of the wire only matters insofar as that changes its capacitance, resistance, inductance, etc. There will be voltage fluctuations (that depend on the frequency band, characteristic impedance, and ohmic losses), but I believe that they’re negligibly small for our purposes (normal chips are sending maybe 600 mV signals through the interconnects, so based on ½CV² we should get 2 OOM lower interconnect losses by “merely” going to 60 mV, whereas the Johnson noise floor at 1GHz is <<1mV I think). The loss involved in switching the whole wire from high voltage to low voltage or vice versa is certainly going to be >>1eV.
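As a quick numeric sketch of the V² scaling and the >>1eV claim (the per-length capacitance below is an assumed, representative value for on-chip wiring, not a number taken from this thread):

```python
C_per_m = 2e-10     # F/m -- assumed representative stray capacitance per unit length
L = 1e-3            # 1 mm of wire
eV = 1.602e-19      # J
for V in (0.6, 0.06):
    E = 0.5 * C_per_m * L * V**2
    print(f"V = {V*1e3:3.0f} mV: {E:.1e} J per switch of the wire (~{E/eV:.1e} eV)")
# 600 mV -> ~3.6e-14 J (~2e5 eV); 60 mV -> 100x less, since the loss goes as V^2.
# Either way, switching the whole 1 mm wire costs vastly more than 1 eV.
```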

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T19:00:33.183Z · LW(p) · GW(p)

I'm still not sure where you disagree with my points 1-5, but I'm guessing 3?

The relevance of 3 is that your wire is made of molecules with electron orbitals, each of which is a computer subject to the Landauer analysis. To send a bit very reliably across just one electron-radius length of wire requires about 1eV (not exactly, but using the equations). So for a minimal nanowire of single-electron width that corresponds to 1V, but a wider wire can naturally represent a bit using more electrons and a lower voltage.

Either way, every individual electron-radius-length segment of the wire is a computer tile which must either 1.) copy a bit, and thus erase a bit at a cost on the order of 1eV, or 2.) move a bit without erasure, and thus accumulate noise à la Taati.

So if we plug in those equations it near exactly agrees [LW(p) · GW(p)] with the spherical cow wire model of nanowires, and you get about 81 fJ/mm.

The only way to greatly improve on this is to increase the interaction distance (and thus tile size) - which requires the electrons move a much larger distance before interacting in the relay chain. That doesn't seem very feasible for conventional wires made of a dense crystal lattice but obviously is possible for non relay based interconnect like photonics (with its size disadvantage).

So in short, at the nanoscale it's better to model interconnect as molecular computers, not macro wires. Do you believe Cavin/Zhirnov are incorrect?

Specifically the tile model[1], and also more generally the claim that adiabatic interconnect basically doesn't work at the nanolevel for conventional computers due to noise accumulation[2], agreeing with Taati:

The presence of thermal noise dictates that an energy barrier is needed to preserve a binary state. Therefore, all electronic devices contain at least one energy barrier to control electron flow. The barrier properties determine the operating characteristics of electronic devices. Furthermore, changes in the barrier shape require changes in charge density. Operation of all charge transport devices includes charging/discharging capacitances to change barrier height. We analyze energy dissipation for several schemes of charging capacitors. A basic assumption of Reversible Computing is that the computing system is completely isolated from the thermal bath. An isolated system is a mathematical abstraction never perfectly realized in practice. Errors due to thermal excitations are equivalent to information erasure, and thus computation dissipates energy. Another source of energy dissipation is due to the need of measurement and control. To analyze this side of the problem, the Maxwell's Demon is a useful abstraction. We hold that apparent "energy savings" in models of adiabatic circuits result from neglecting the total energy needed by other parts of the system to implement the circuit.


  1. Science and Engineering beyond Moore's Law ↩︎

  2. ENERGY BARRIERS, DEMONS, AND MINIMUM ENERGY OPERATION OF ELECTRONIC DEVICES ↩︎

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-27T19:43:55.836Z · LW(p) · GW(p)

Here’s a toy model. There’s a vacuum-gap coax of length L. The inside is a solid cylindrical wire of diameter D and resistivity ρ. The outside is grounded, and has diameter Dₒ=10×D. I stand at one end and you stand at the other end. The inside starts out at ground. Your end is electrically isolated (open-circuit). If I want to communicate the bit “1” to you, then I raise the voltage at my end to V=+10mV, otherwise I lower the voltage at my end to V=–10mV.

On my end, the energy I need to spend is E = ½CV², where C = 2πε₀L/ln(Dₒ/D) = 2πε₀L/ln(10) is the capacitance of the coax.
On your end, you’re just measuring a voltage so the required energy is zero in principle.

The resistivity ρ and diameter D don’t enter this equation, as it turns out, although they do affect the timing. If D is as small as 1nm, that’s fine, as long as the wire continues to be electrically conductive (i.e. satisfy ohm’s law).
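A numeric sketch of this toy model, using only the standard vacuum-coax capacitance formula and the numbers given above:

```python
import math

eps0 = 8.854e-12                                 # F/m
V = 10e-3                                        # the 10 mV signal above
C_per_m = 2 * math.pi * eps0 / math.log(10.0)    # vacuum coax with Do/D = 10
E_per_mm = 0.5 * C_per_m * 1e-3 * V**2           # 1/2 C V^2 for 1 mm of line
print(f"energy: {E_per_mm:.1e} J per mm")               # ~1.2e-18 J/mm
print(f"ratio:  {81e-15 / E_per_mm:,.0f}x below 81 fJ/mm")  # tens of thousands
```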

Anyway, I have now communicated 1 bit to you with 60,000× less energy expenditure than your supposed limit of 81 fJ/mm. But I don’t see anything going wrong here. Do you? Like, what law of physics or assumption am I violating here?

a wider wire can naturally represent a bit using more electrons and a lower voltage

I don’t think it’s relevant, but for what it’s worth, 1nm³ of copper contains 90 conduction electrons.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-27T19:52:11.333Z · LW(p) · GW(p)

This may be obvious - but this fails to predict the actual wire energy, whereas my preferred model does. So if this model is correct - why does it completely fail to predict interconnect wire energy despite an entire engineering industry optimizing such parameters? Where do you believe the error is?

My first guess is perhaps you are failing to account for the complex error/noise buildup per unit length of wire. A bit is an approximation of a probability distribution. So you start out with a waveform on one end of the wire which minimally can represent 1 bit against noise (well maybe not even that - your starting voltage seems unrealistic), but then it quickly degrades to something which cannot.

Actually, looking back at the old thread [LW(p) · GW(p)], I believe you are incorrect that 10mV is realistic for anything near a nanowire. You need to increase your voltage by 100x or use an enormous number of charge carriers, which isn't possible for a nanowire (and is just a different way to arrive at 1eV per computational relay bit).

And in terms of larger wires, my model from brain efficiency actually comes pretty close to predicting actual wire energy for large copper wires - see this comment [LW(p) · GW(p)].

Kwa estimates 5e-21 J/nm, which is only 2x the lower Landauer bound and corresponds to a ~75% bit probability (although the uncertainty in these estimates is probably around 2x itself). My explanation is that such very low bit energies approaching the lower Landauer limit are possible, but only with complex error correction - which is exactly what Ethernet/InfiniBand cards are doing. But that is obviously not viable for nanoscale interconnect.

Or put another way - why do you believe that Cavin/Zhirnov are incorrect?

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-27T23:40:10.363Z · LW(p) · GW(p)
  • The easiest way to actuate an electronic switch is to use a voltage around 20kT/q ≈ 500mV (where 20 is to get way above the noise floor).
  • The most efficient way to send information down a wire is to use a voltage around 20·√(kT·Z₀·Δf) ≈ 0.3 mV (where 20 is to get way above the noise floor and Z₀ is the wire’s characteristic impedance which is kinda-inevitably somewhat lower than the 377Ω impedance of free space, typically 50-100Ω in practice).

So there’s a giant (>3 OOM) mismatch.

The easy way to deal with that giant mismatch is to ignore it. Just use the same 500mV voltage for both the switches and the wires, even though that entails wasting tons and tons of power unnecessarily in the latter—specifically 6.5 orders of magnitude more interconnect losses than if the voltage were tailored to the wire properties.

The hard way to deal with that giant mismatch is to make billions of nano-sized weird stacks of piezoelectric blocks so that each transistor gate has its own little step-up voltage-converter, or other funny things like that as in my top comment.

But people aren’t doing it the “hard way”, they’re doing it the “easy way”, and always have been.

Given that this is in fact the strategy, we can start doing fermi estimates about interconnect losses. We have V ≈ 20kT/q, C ≈ ε₀ × L (where L = typical device dimension), and if we ask how much loss there is in a “square tile” it would be ½CV²/L ≈ 200(kT/q)²·ε₀ ≈ 1.2e-21 J/nm, which isn’t wildly far from Kwa’s estimate that you cite.
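Evaluating that estimate directly (just plugging in ε₀ and kT/q at room temperature):

```python
eps0 = 8.854e-12          # F/m
kT_over_q = 0.0259        # V at ~300 K
loss_per_m = 200 * kT_over_q**2 * eps0    # (1/2)(eps0*L)(20 kT/q)^2 / L
print(f"{loss_per_m * 1e-9:.1e} J per bit per nm")   # ~1.2e-21 J/nm, as above
```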

So in summary, I claim that Kwa gets reasonable numbers (compared to actual devices) by implicitly / effectively assuming somewhere-or-other that wire voltage is high enough to also simultaneously be adequate for a transistor gate voltage, even though such a high voltage is not remotely necessary for the wire to function well as a wire. Maybe he thinks otherwise, and if he does, I think he’s wrong. ¯\_(ツ)_/¯

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T04:39:06.749Z · LW(p) · GW(p)

To be clear, Kwa did not provide a model - so estimate is not really the right word. He provided a link to the actual wire consumption of some current coaxial Ethernet, did the math, and got ~5e-21 J/bit/nm, which is near the lower bound I predicted based on the Landauer analysis - a level which only works using sophisticated error correction codes (which require entire chips). You obviously can't use a whole CPU or ASIC for error correction for every little nanowire interconnect, so interconnect wires need to be closer to the 1eV/nm wire energy to have reliability. So your most efficient model could approach that lower level, but only using some big bulky mechanism - if not error correction coding then perhaps the billions of piezoelectric blocks.

Now you could believe that I had already looked up all those values and knew that, but actually I did not. I did of course test the Landauer model on a few examples, and then just wrote it in as it seemed to work.

So I predict that getting below the ~2e-21 J/bit/nm limit at room temp is impossible for irreversible electronic relay-based communication (systems that send signals relayed through electrons on dense crystal lattices).

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-28T16:29:20.194Z · LW(p) · GW(p)

If you want to know the noise in a wire, you pull out your EE 101 textbook and you get formulas like V_noise = √(4kT·Z₀·f), where Z₀ is the wire’s characteristic impedance and f is the frequency bandwidth. (Assuming the wire has a low-impedance [voltage source] termination on at least one side, as expected in this context.) Right? (I might be omitting a factor of 2 or 4? Hmm, actually I’m a bit unsure about various details here. Maybe in practice the noise would be similar to the voltage source noise, which could be even lower. But OTOH there are other noise sources like cross-talk.) The number of charge carriers is not part of this equation, and neither is the wire diameter. If we connect one end of the wire to a +10mV versus -10mV source, that’s 1000× higher than the wire’s voltage noise, even averaging over as short as a nanosecond, so error correction is unnecessary, right?
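A numeric version of that comparison (a sketch; Z₀ = 75 Ω and a 1 GHz bandwidth are assumed representative values, and both the with- and without-factor-4 forms are shown given the uncertainty noted above):

```python
import math

kT = 4.14e-21    # J at ~300 K
Z0 = 75.0        # ohms -- assumed characteristic impedance
df = 1e9         # Hz -- assumed bandwidth (~1 ns averaging)
v_noise_4 = math.sqrt(4 * kT * Z0 * df)   # textbook Johnson form
v_noise_1 = math.sqrt(kT * Z0 * df)       # same thing without the factor of 4
print(f"noise ~ {v_noise_1*1e6:.0f}-{v_noise_4*1e6:.0f} uV rms")
print(f"a 10 mV signal sits ~{10e-3/v_noise_4:.0f}-{10e-3/v_noise_1:.0f}x above it")
# Either way the 10 mV swing is hundreds of times above the thermal noise floor
# for a ~1 ns averaging window.
```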

I feel like your appeal to “big bulky mechanism” is special pleading. I don’t think Landauer’s analysis concluded “…therefore there is an inevitable energy dissipation of kT per bit erasure, oh unless you have a big bulky mechanism involving lots and lots of electrons, in which case energy dissipation can be as low as you like”. Right? Or if there’s a formula describing how “Landauer’s limit for interconnects” gets progressively weaker as the wire gets bigger, then what’s that formula? And why isn’t a 1nm-diameter wire already enough to get to the supposed large-wire-limit, given that copper has 90 conduction electrons per nm³?

Hmm, I think I should get back to my actual job now. You’re welcome to reply, and maybe other people will jump in with opinions. Thanks for the interesting discussion! :)

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-28T19:08:42.097Z · LW(p) · GW(p)

This is frustrating for me as I have already laid out my core claims [LW(p) · GW(p)] and you haven't clarified which (if any) you disagree with. Perhaps you are uncertain - that's fine, and I can kind of guess based on your arguments, but it still means we are talking past each more than I'd prefer.

If we connect one end of the wire to a +10mV versus -10mV source, that’s 1000× higher than the wire’s voltage noise, even averaging over as short as a nanosecond, so error correction is unnecessary, right?

It doesn't matter whether you use 10mV or 0.015mV as in your example above, as the Landauer analysis bounds the energy of a bit, not the voltage. For high reliability interconnect you need ~1eV, which could be achieved in theory by one electron at one volt, but using 10mV would require ~100 electron charges and 0.015mV would require almost 10^5 electron charges, the latter of which doesn't seem viable for nanowire interconnect, and doesn't change the energy per bit requirements regardless.

The wire must use ~1eV to represent and transmit one bit (for high reliability interconnect) to the receiving device across the wire exit surface, regardless of the wire width.

Now we notice that we can divide the wire in half, and the first half is also a wire which must transmit to the 2nd half, so now we know it must use at least 2eV to transmit a bit across both sections, each of which we can subdivide again, resulting in 4eV .. and so on until you naturally bottom out at the minimal wire length of one electron radius.
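The charge-carrier counts above follow from requiring q·V ≈ 1 eV per bit; a quick sketch (the 1 eV target is itself the assumption under discussion):

```python
for V in (1.0, 10e-3, 0.015e-3):          # volts
    n = 1.0 / V                           # charges needed so that (n*e)*V = 1 eV
    print(f"V = {V:g} V -> ~{n:,.0f} electron charges")
# 1 V -> 1 charge; 10 mV -> 100 charges; 0.015 mV -> ~67,000 charges (almost 10^5).
```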

Hmm, I think I should get back to my actual job now

Agreed - this site was designed to nerdsnipe us away from creating AGI ;)

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-04-29T16:18:01.240Z · LW(p) · GW(p)

Gah, against my better judgment I’m gonna carry on for at least one more reply.

This is frustrating for me as I have already laid out my core claims [LW(p) · GW(p)] and you haven't clarified which (if any) you disagree with.

I think it’s wrong to think of a wire as being divided into a bunch of tiles each of which should be treated like a separate bit.

Back to the basic Landauer analysis: Why does a bit-copy operation require kT of energy dissipation? Because we go from four configurations (00,01,10,11) to two (00,11). Thermodynamics says we can’t reduce the number of microstates overall, so if the number of possible chip states goes down, we need to make up for it by increasing the temperature (and hence number of occupied microstates) elsewhere in the environment, i.e. we need to dissipate energy / dump heat.

OK, now consider a situation where we’re transferring information by raising or lowering the voltage on a wire. Define V(X) = voltage of the wire at location X and V(X+1nm) = voltage of the wire at location X+1nm (or whatever the supposed “tile size” is). As it turns out, under practical conditions and at the level of accuracy that matters, V(X) = V(X+1nm) always. No surprise—wires are conductors, and conductors oppose voltage gradients. There was never a time when we went from more microstates to fewer microstates, because there was never a time when V(X) ≠ V(X+1nm) in the first place. They are yoked together, always equal to each other. They are one bit, not two. For example, we don’t need an energy barrier preventing V(X) from contaminating the state of V(X+1nm) or whatever; in fact, that’s exactly the opposite of what we want.

(Nitpicky side note: I’m assuming that, when we switch the wire voltage between low and high, we do so by ramping it very gradually compared to (1nm / speed of light). This will obviously be the case in practice. Then V(X) = V(X+1nm) even during the transient as the wire voltage switches.)

The thing you’re proposing is, to my ears, kinda like saying that the voltage of each individual atom within a single RAM capacitor plate is 1 bit, and it just so happens that all those “bits” within a single capacitor plate are equal to each other at any given time, and since there’s billions of atoms on the one capacitor plate it must take billions of dissipative copy operations every time we flip that one RAM bit.

None of those are fundamental - all those rules/laws are derived - or should be derivable - from simpler molecular/atomic level simulations.

I’m confident that I can walk through any of the steps to get from the standard model of particle physics, to Bloch waves and electron scattering, to the drift-diffusion equation and then ohm's law, and to the telegrapher’s equations, and to Johnson noise and all the other textbook formulas for voltage noise on wires. (Note that I kinda mangled my discussions of voltage noise above, in various ways; I’m happy to elaborate but I don’t think that’s a crux here.)

Whereas “wires should be modeled as a series of discrete tiles with dissipative copy operations between them” is not derivable from fundamental physics, I claim. In particular, I don’t think there is any first-principles story behind your assertion that “the natural tile size is the electron radius”. I think it’s telling that “electron radius” is not a thing that I recall ever being mentioned in discussions of electrical conduction, including numerous courses that I’ve taken and textbooks that I’ve read in solid-state physics, semiconductor physics, nanofabrication, and electronics. Honestly I’m not even sure what you mean by “electron radius” in the first place.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-04-29T17:43:25.241Z · LW(p) · GW(p)

I think it’s wrong to think of a wire as being divided into a bunch of tiles each of which should be treated like a separate bit.

Why? Does not each minimal length of wire need to represent and transmit a bit? Does the Landauer principle somehow not apply at the micro or nanoscale?

It is not the case that the wire represents a single bit, stretched out across the length of the wire, as I believe you will agree. Each individual section of wire stores and transmits different individual bits in the sequence chain at each moment in time, such that the number of bits on the wire is a function of length.

As it turns out, under practical conditions and at the level of accuracy that matters, V(X) = V(X+1nm) always.

Only if the wire is perfectly insulated from the external environment - which is perhaps our crux. If the wire is in a noisy conventional environment, it accumulates noise on the Landauer scale at each nanoscale transmission step, and at the minimal Landauer bit energy scale this noise rapidly collapses the bit representation (it decays to noise) exponentially quickly unless erased (because the Landauer energy scale is defined as the minimal bit energy reasonably distinguishable from noise, so it has no room for additional error).

There was never a time when we went from more microstates to fewer microstates, because there was never a time when V(X) ≠ V(X+1nm) in the first place.

I don't believe this is true in practice as again any conventional system is not perfectly reversible unless (unrealistically) there is no noise coupling.

The thing you’re proposing is, to my ears, kinda like saying that the voltage of each individual atom within a single RAM capacitor plate is 1 bit, and it just so happens that all those “bits” within a single capacitor plate are equal to each other at any given time, and since there’s billions of atoms on the one capacitor plate

I'm not sure how you got that? There are many ways to represent a bit, and for electronic relay systems the bit representation is distributed over some small fraction of the electrons moving between outer orbitals. The bit representation is a design constraint in terms of a conceptual partition of microstates, and as I already stated earlier you can represent a tiny Landauer-energy bit using partitions of an almost unlimited number of atoms and their microstates (at least cross-sectionally for an interconnect wire, but for density reasons the wires need to be thin).

I sometimes use single electron examples, as those are relevant for nanoscale interconnect, and nanoscale computational models end up being molecule sized cellular automata where bits are represented by few electron gaps (but obviously not all electrons participate).

Whereas “wires should be modeled as a series of discrete tiles with dissipative copy operations between them” is not derivable from fundamental physics, I claim

Do you not believe that wires can be modeled as smaller units, recursively down to the level of atoms?

And I clearly do not believe that wires are somehow only capable of dissipative copy operations in theory. In theory they are perfectly capable of non-dissipative reversible move operations, but in practice that has 1.) never been successfully achieved in any conventional practical use that I am aware of, and 2.) is probably impossible in practical use without exotic noise isolation, given the terminal rapid noise buildup problems I mentioned (I have some relevant refs in earlier comments).

In particular, I don’t think there is any first-principles story behind your assertion that “the natural tile size is the electron radius”.

The Landauer principle doesn't suddenly stop applying at the nanoscale; it bounds atoms and electrons at all sizes and scales. The wire equations are just abstractions; the reality at the nanoscale should be better modeled by a detailed nanoscale cellular automaton. By "electron radius" I meant the de Broglie wavelength, which I'm using as a reasonable but admittedly vague-ish guess for the interaction distance (the smallest distance scale at which we can model it as a cellular automaton switching between distinct bit states, which I admit is not a concept I can yet tightly define, but I derive it from studies of the absolute minimum spacing between compute elements due to QM electron de Broglie wavelength effects, and I expect it's close to the directional mean free path length but haven't checked), so for an interconnect wire I used ~1.23nm at 1 volt, from this thread [LW(p) · GW(p)]:

The Landauer/Tile model predicts in advance a natural value of this parameter will be 1 electron charge per 1 volt per 1 electron radius, i.e. 1.602e-19 F / 1.23 nm, or 1.3026e-10 F/m.

Naturally it's not a fixed quantity, as it depends on the electron energy and thus voltage, the thermal noise, etc., but it doesn't seem like that can make a huge difference for room temp conventional wires. (This page estimates a wavelength of 8 angstrom or 0.8nm for typical metals, so fairly close.) I admit that my assertion that the natural interaction length (and thus cellular automaton scale) is the electron de Broglie wavelength seems ad hoc, but I believe it is justifiable and very much seems to make the right predictions so far.
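For what it's worth, the arithmetic in the quoted prediction checks out as stated:

```python
e = 1.602e-19      # C, one electron charge
d = 1.23e-9        # m, the assumed ~1.23 nm interaction length
print(f"{e / (1.0 * d):.4e} F/m")   # 1 charge per volt per 1.23 nm ~= 1.30e-10 F/m
```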

But in that sense I should reassert that my model applies most directly only to any device which conveys bits relayed through electrons exchanging orbitals, as that is the generalized electronic cellular automata model, and wires should not be able to beat that bound. But if there is some way to make the interaction distance much much larger - for example via electrons moving ballistically OOM greater than the ~1 nm atomic scale before interacting, then the model will break down.

So what would cause you to update?

For me, I will update immediately if someone can find a single example of a conventional wire communication device (room temp etc) which has been measured to transmit information using energy confidently less than ~2e-21 J/bit/nm. In your model this doesn't seem super hard to build.

Replies from: bhauth, steve2152
comment by bhauth · 2023-05-15T11:42:45.152Z · LW(p) · GW(p)

But in that sense I should reassert that my model applies most directly only to any device which conveys bits relayed through electrons exchanging orbitals, as that is the generalized electronic cellular automata model, and wires should not be able to beat that bound. But if there is some way to make the interaction distance much much larger - for example via electrons moving ballistically OOM greater than the ~1 nm atomic scale before interacting, then the model will break down.

The mean free path of conduction electrons in copper at room temperature is ~40 nm. Cold pure metals can have much greater mean free paths. Also, a copper atom is ~0.1 nm, not ~1 nm.

comment by Steven Byrnes (steve2152) · 2023-04-30T21:46:36.196Z · LW(p) · GW(p)

For me, I will update immediately if someone can find a single example of a conventional wire communication device (room temp etc) which has been measured to transmit information using energy confidently less than ~2e-21 J/bit/nm. In your model this doesn't seem super hard to build.

I guess we could buy a 30-meter cat8 ethernet cable, send 40Gbps of data through it, coil up the cable very far away from both the transmitter and the receiver, and put that coil into a thermally-insulated box (or ideally, a calorimeter), and see if the heat getting dumped off the cable is less than 2.4 watts, right? I think that 2.4 watts is enough to be pretty noticeable without special equipment.

My expectation is… Well, I’m a bit concerned that I’m misunderstanding ethernet specs, but it seems that there are 4 twisted pairs with 75Ω characteristic impedance, and the voltage levels go up to ±1V. That would amount to a power flow of up to 4V²/Z=0.05W. The amount dissipated within the 30-meter cable is of course much less than that, or else there would be nothing left for the receiver to measure. So my prediction for the thermally-insulated box experiment above is “the heat getting dumped off the ethernet cable will be well under 0.05W (unless I’m misunderstanding the ethernet specs)”.

(Update: I struck-through the intensifiers “much” and “well” in the previous paragraph. Maybe they’re justified, but I’m not 100% sure and they’re unnecessary for my point anyway. See bhauth reply below.)  
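The two numbers in this proposed experiment, made explicit (a sketch using only the figures quoted above):

```python
# Heat implied by ~2e-21 J/bit/nm over 30 m of cable at 40 Gbps:
E_bit_nm = 2e-21                         # J/bit/nm, the threshold under discussion
P_threshold = E_bit_nm * 40e9 * 30e9     # (bits/s) * (length in nm)
print(f"threshold heat: {P_threshold:.1f} W")        # ~2.4 W

# Upper bound on power launched into the cable, from the spec as read above:
V, Z0, pairs = 1.0, 75.0, 4
print(f"launched power: {pairs * V**2 / Z0:.3f} W")  # ~0.05 W
```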

what would cause you to update?

I can easily imagine being convinced by a discussion that talks about wires in the way that I consider “normal”, like if we’re interested in voltage noise then we use the Johnson noise formula (or shot noise or crosstalk noise or whatever it is), or if we’re interested in the spatial profile of the waves then we use the telegrapher’s equations and talk about wavelength, etc.

For example, you wrote “it accumulates noise on the landauer scale at each nanoscale transmission step, and at the minimal landauer bit energy scale this noise rapidly collapses the bit representation (decays to noise) exponentially quickly”. I think if this were a real phenomenon, we should be able to equivalently describe that phenomenon using the formulas for electrical noise that I can find in the noise chapter of my electronics textbook. People have been sending binary information over wires since 1840, right? I don’t buy that there are important formulas related to electrical noise that are not captured by the textbook formulas. It’s an extremely mature field. I once read a whole textbook on transistor noise, it just went on and on about every imaginable effect.

As another example, you wrote:

It is not the case that the wire represents a single bit, stretched out across the length of the wire, as I believe you will agree. Each individual section of wire stores and transmits different individual bits in the sequence chain at each moment in time, such that the number of bits on the wire is a function of length.

Again, I want to use conventional wire formulas here. Let’s say:

  • It takes 0.1 nanosecond for the voltage to swing from low to high (thanks to the transistor’s own capacitance for example)
  • The interconnect has a transmission line signal velocity comparable to the speed of light
  • We’re talking about a 100μm-long interconnect.

Then you can do the math: the entire interconnect will be for all intents and purposes at a uniform voltage throughout the entire voltage-switching process. If you look at a graph of the voltage as a function of position, it will look like a flat horizontal line at each moment, and that horizontal line will smoothly move up or down over the course of the 0.1 ns swing. It won’t look like a propagating wave.
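The timescales behind that claim, as a sketch (the signal velocity is assumed to be about half the speed of light, consistent with "comparable to the speed of light" above):

```python
v_signal = 0.5 * 3e8            # m/s -- assumed transmission-line velocity
t_transit = 100e-6 / v_signal   # end-to-end time over the 100 um interconnect
t_ramp = 0.1e-9                 # the 0.1 ns voltage swing
print(f"transit ~{t_transit*1e12:.2f} ps, ramp ~{t_ramp*1e12:.0f} ps "
      f"(~{t_ramp/t_transit:.0f}x slower)")
# The swing is well over 100x slower than end-to-end propagation, so the wire
# sits at an effectively uniform voltage throughout the transition, as described.
```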

As a meta-commentary, you can see what’s happening here—I don’t think the thermal de Broglie wavelength is at all relevant in this context, nor the mean free path, and instead I’m trying to shift discussion to “how wires work”.

non-dissipative reversible move operations

One of the weird things in this discussion from my perspective is that you’re OK with photons carrying information with less than 2e-21 J/bit/nm energy dissipation but you’re not OK with wires carrying information with less than 2e-21 J/bit/nm energy dissipation. But they’re not so different in my perspective—both of those things are fundamentally electromagnetic waves traveling down transmission lines. Obviously the frequency is different and the electromagnetic mode profile is different, but I don’t see how those are relevant.

Replies from: spxtr, bhauth, jacob_cannell
comment by spxtr · 2023-05-17T20:07:51.898Z · LW(p) · GW(p)

I don’t think the thermal de Broglie wavelength is at all relevant in this context, nor the mean free path, and instead I’m trying to shift discussion to “how wires work”.

This is the crux of it. I made the same comment here [LW(p) · GW(p)] before seeing this comment chain.

People have been sending binary information over wires since 1840, right? I don’t buy that there are important formulas related to electrical noise that are not captured by the textbook formulas. It’s an extremely mature field.

Also a valid point. @jacob_cannell [LW · GW] is making a strong claim: that the energy lost by communicating a bit is the same scale as the energy lost by all other means, by arbitrarily dividing by 1 nm so that the units can be compared. If this were the case, then we would have known about it for a hundred years. Instead, it is extremely difficult to measure the extremely tiny amounts of heat that are actually generated by deleting a bit, such that it's only been done within the last decade.

This arbitrary choice leads to a dramatically overestimated heat cost of computation, and it ruins the rest of the analysis.

@Alexander Gietelink Oldenziel [LW · GW], for whatever it is worth, I, a physicist working in nanoelectronics, recommend @Steven Byrnes [LW · GW]  for the $250. (Although, EY's "it's wrong because it's obviously physically wrong" is also correct. You don't need to dig into details to show that a perpetual motion machine is wrong. You can assert it outright.)

Replies from: ege-erdil, jacob_cannell
comment by Ege Erdil (ege-erdil) · 2023-05-19T12:08:57.725Z · LW(p) · GW(p)

For what it's worth, I think both sides of this debate appear strangely overconfident in claims that seem quite nontrivial to me. When even properly interpreting the Landauer bound is challenging due to a lack of good understanding of the foundations of thermodynamics, it seems like you should be keeping a more open mind before seeing experimental results.

At this point, I think the remarkable agreement between the wire energies calculated by Jacob and the actual wire energies reported in the literature is too good to be a coincidence. However, I suspect the agreement might be the result of some dimensional analysis magic as opposed to his model actually being good. I've been suspicious of the de Broglie wavelength-sized tile model of a wire since the moment I first saw it, but it's possible that there's some other fundamental length scale that just so happens to be around 1 nm and therefore makes the formulas work out.

People have been sending binary information over wires since 1840, right? I don’t buy that there are important formulas related to electrical noise that are not captured by the textbook formulas. It’s an extremely mature field.

The Landauer limit was first proposed in 1961, so the fact that people have been sending binary information over wires since 1840 seems to be irrelevant in this context.

comment by jacob_cannell · 2023-05-18T10:09:41.049Z · LW(p) · GW(p)

arbitrarily dividing by 1 nm so that the units can be compared

1 nm is somewhat arbitrary but around that scale is a sensible estimate for minimal single electron device spacing ala Cavin/Zhirnov. If you haven’t actually read those refs you should - as they justify that scale and the tile model.

This arbitrary choice leads to a dramatically overestimated heat cost of computation, and

This is just false, unless you are claiming you have found some error in the Cavin/Zhirnov papers. It's also false in the sense that the model makes reasonable predictions. I'll just finish my follow-up post, but using the mean free path as the approximate scale does make sense for larger wires and leads to fairly good predictions for a wide variety of wires, from on-chip interconnect to coax cable Ethernet to axon signal conduction.

Replies from: spxtr
comment by spxtr · 2023-05-18T16:49:03.758Z · LW(p) · GW(p)

1 nm is somewhat arbitrary but around that scale is a sensible estimate for minimal single electron device spacing ala Cavin/Zhirnov. If you haven’t actually read those refs you should - as they justify that scale and the tile model.

They use this model to figure out how to pack devices within a given area and estimate their heat loss. It is true that heating of a wire is best described with a resistivity (or parasitic capacitance) that scales as 1/L. If you want to build a model out of tiles, each of which is a few nm on a side (because the FETs are roughly that size), then you are perfectly allowed to do so. IMO the model is a little oversimplified to be particularly useful, but it's physically reasonable at least.

This is just false, unless you are claiming you have found some error in the cavin/zhirnov papers.

No, the papers are fine. They don't say what you think they say. They are describing ordinary resistive losses and such. In order to compare different types of interconnects running at different bitrates, they put these losses in units of energy/bit/nm. This has no relation to Landauer's principle.

Resistive heat loss in a wire is fundamentally different than heat loss from Landauer's principle. I can communicate 0 bits of information across a wire while losing tons of energy to resistive heat, by just flowing a large constant current through it.

It’s also false in the sense that the model makes reasonable predictions.

As pointed out by Steven Byrnes, your model predicts excess heat loss in a well-understood system. In my linked comment, I pointed out another way that it makes wrong predictions.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-19T06:14:35.093Z · LW(p) · GW(p)

Resistive heat loss in a wire is fundamentally different than heat loss from Landauer's principle. I can communicate 0 bits of information across a wire while losing tons of energy to resistive heat, by just flowing a large constant current through it.

Of course - as I pointed out in my reply here [LW(p) · GW(p)].

As pointed out by Steven Byrnes, your model predicts excess heat loss in a well-understood system.

False. I never at any point modeled the resistive heat/power loss for flowing current through a wire sans communication. It was Byrnes who calculated the resistive loss for a coax cable, and got a somewhat wrong result (for wire communication bit energy cost), whereas the tile model (using mean free path for larger wires) somehow outputs the correct values for actual coax cable communication energy use as shown here [LW(p) · GW(p)].

Replies from: spxtr
comment by spxtr · 2023-05-19T15:52:27.217Z · LW(p) · GW(p)

Please respond to the meat of the argument.

  1. Resistive heat loss is not the same as heat loss from Landauer's principle. (you agree!)
  2. The Landauer limit is an energy loss per bit flip, with units energy/bit. This is the thermodynamic minimum (with irreversible computing). It is extremely small and difficult to measure. It is unphysical to divide it by 1 nm to model an interconnect, because signals do not propagate through wires by hopping from electron to electron.
  3. The Cavin/Zhirnov paper you cite does not concern the Landauer principle. It models ordinary dissipative interconnects. Due to a wide array of engineering optimizations, these elements tend to have similar energy loss per bit per mm, however this is not a fundamental constraint. This number can be basically arbitrarily changed by multiple orders of magnitude.
  4. You claim that your modified Landauer energy matches the Cavin/Zhirnov numbers, but this is a nonsense comparison because they are different things. One can be varied by orders of magnitude while the other cannot. Because they are different heat sources, their heat losses add.
  5. We have known how wires work for a very long time. There is a thorough and mature field of physics regarding heat and information transport in wires. If we were off by a factor of 2 in heat loss (what you are claiming, possibly without knowing so) then we would have known it long ago. The Landauer principle would not be a very esoteric idea at the fringes of computation and physics, it would be front and center necessary to understand heat dissipation in wires. It would have been measured a hundred years ago.

I'm not going to repeat this again. If you ignore the argument again then I will assume bad faith and quit the conversation.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-20T04:37:14.432Z · LW(p) · GW(p)

I'm really not sure what your argument is if this is the meat, and moreover I don't really feel morally obligated to respond given that you have not yet acknowledged that my model already made roughly correct predictions and that Byrnes's model of wire heating under passive current load is way off theoretically and practically. Interconnect wire energy comes from charging and discharging the wire capacitance, not from resistive loss for passive constant (unmodulated) current flow.

  1. The Landauer limit connects energy to probability of state transitions, and is more general than erasure. Reversible computations still require energies that are multiples of this bound for reliability. It is completely irrelevant how signals propagate through the medium - whether by charging wire capacitance as in RC interconnect, or through changes in drift velocity, or phonons, or whatever. As long as the medium has thermal noise, the Landauer/Boltzmann relationship applies.

  2. Cavin/Zhirnov absolutely cite and use the Landauer principle for bit energy.

  3. I make no such claim as I'm not using a "modified Landauer energy".

  4. I'm not making any claims of novel physics or anything that disagrees with known wire equations.

If we were off by a factor of 2 in heat loss (what you are claiming, possibly without knowing so)

Comments like this suggest you don't have a good model of my model. The actual power usage of actual devices is a known hard fact and coax cable communication devices have actual power usage within the range my model predicted - that is a fact. You can obviously use the wire equations (correctly) to precisely model that power use (or heat loss)! But I am more concerned with the higher level general question of why both human engineering and biology - two very separate long running optimization processes - converged on essentially the same wire bit energy.

Replies from: spxtr, DaemonicSigil
comment by spxtr · 2023-05-20T06:17:21.921Z · LW(p) · GW(p)

Ok, I will disengage. I don't think there is a plausible way for me to convince you that your model is unphysical.

I know that you disagree with what I am saying, but from my perspective, yours is a crackpot theory. I typically avoid arguing with crackpots, because the arguments always proceed basically how this one did. However, because of apparent interest from others, as well as the fact that nanoelectronics is literally my field of study, I engaged. In this case, it was a mistake.

Sorry for wasting our time.

Replies from: alexander-gietelink-oldenziel, lahwran, ege-erdil
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-05-20T07:55:02.616Z · LW(p) · GW(p)

Dear spxtr,

Things got heated here. I and many others are grateful for your effort to share your expertise. Is there a way in which you would feel comfortable continuing to engage?

Remember that for the purposes of the prize pool there is no need to convince Cannell that you are right. In fact I will not judge veracity at all just contribution to the debate (on which metric you're doing great!)

Dear Jake,

This is the second person in this thread that has explicitly signalled the need to disengage. I also realize this is charged topic and it's easy for it to get heated when you're just honestly trying to engage.

Best, Alexander

Replies from: spxtr
comment by spxtr · 2023-05-20T20:21:53.138Z · LW(p) · GW(p)

Hi Alexander,

I would be happy to discuss the physics related to the topic with others. I don't want to keep repeating the same argument endlessly, however.

Note that it appears that EY had a similar experience of repeatedly not having their point addressed:

I'm confused at how somebody ends up calculating that a brain - where each synaptic spike is transmitted by ~10,000 neurotransmitter molecules (according to a quick online check), which then get pumped back out of the membrane and taken back up by the synapse; and the impulse is then shepherded along cellular channels via thousands of ions flooding through a membrane to depolarize it and then getting pumped back out using ATP, all of which are thermodynamically irreversible operations individually - could possibly be within three orders of magnitude of max thermodynamic efficiency at 300 Kelvin.  I have skimmed "Brain Efficiency" though not checked any numbers, and not seen anything inside it which seems to address this sanity check.

Then, after a reply:

This does not explain how thousands of neurotransmitter molecules impinging on a neuron and thousands of ions flooding into and out of cell membranes, all irreversible operations, in order to transmit one spike, could possibly be within one OOM of the thermodynamic limit on efficiency for a cognitive system (running at that temperature).

Then, after another reply:

Nothing about any of those claims explains why the 10,000-fold redundancy of neurotransmitter molecules and ions being pumped in and out of the system is necessary for doing the alleged complicated stuff.

Then, nothing more (that I saw, but I might have missed comments. this is a popular thread!).

:), spxtr

comment by the gears to ascension (lahwran) · 2023-05-20T16:07:19.142Z · LW(p) · GW(p)

If this is your field but you don't have the mood for pedagogy when someone from another field has strong opinions (which is emotionally understandable), I'm curious what learning material you'd recommend working through to find your claims obvious; is a whole degree needed? Are there individual textbooks or classes, or even individual lectures?

Replies from: spxtr
comment by spxtr · 2023-05-20T19:57:01.510Z · LW(p) · GW(p)

It depends on your background in physics.

For the theory of sending information across wires, I don't think there is any better source than Shannon's "A Mathematical Theory of Communication."

I'm not aware of any self-contained sources that are enough to understand the physics of electronics. You need to have a very solid grasp of E&M, the basics of solid state, and at least a small amount of QM. These subjects can be pretty unintuitive. As an example of the nuance even in classical E&M, and an explanation of why I keep insisting that "signals do not propagate in wires by hopping from electron to electron," see this youtube video.

You don't actually need all of that in order to argue that the brain cannot be efficient from a thermodynamic perspective. EY does not understand the intricacies of nanoelectronics (probably), but he correctly stated that the final result from the original post cannot be correct, because obviously you can imagine a computation machine that is more thermodynamically efficient than pumping tens of thousands of ions across membranes and back. This intuition probably comes from some thermodynamics or statistical mechanics books.

Replies from: adele-lopez-1
comment by Adele Lopez (adele-lopez-1) · 2023-05-20T20:35:55.289Z · LW(p) · GW(p)

What is the most insightful textbook about nanoelectronics you know of, regardless of how difficult it may be?

Or for another question trying to get at the same thing: if only one book about nanoelectronics were to be preserved (but standard physics books would all be fine still), which one would you want it to be? (I would be happy with a pair of books too, if that's an easier question to answer.)

Replies from: spxtr
comment by spxtr · 2023-05-20T20:57:13.556Z · LW(p) · GW(p)

I come more from the physics side and less from the EE side, so for me it would be Datta's "Electronic Transport in Mesoscopic Systems", assuming the standard solid state books survive (Kittel, Ashcroft & Mermin, L&L stat mech, etc). For something closer to EE, I would say "Principles of Semiconductor Devices" by Zeghbroeck because it is what I have used and it was good, but I know less about that landscape.

comment by Ege Erdil (ege-erdil) · 2023-05-20T12:44:44.870Z · LW(p) · GW(p)

I strongly disapprove of your attitude in this thread. You haven't provided any convincing explanation of what's wrong with Jacob's model beyond saying "it's unphysical".

I agree that the model is very suspicious and in some sense doesn't look like it should work, but at the same time, I think there's obviously more to the agreement between his numbers and the numbers in the literature than you're giving credit for. Your claim that there's no fundamental energy/bit/length bound on information transmission through resistive materials (where the length scale could depend on the material in ways Jacob has already discussed) is unsupported and doesn't seem to rest on any serious analysis.

You can't blame Jacob for not engaging with your arguments because you haven't made any arguments. You've just said that his model is unphysical, which I agree with and presumably he would also agree with to some extent. However, by itself, that's not enough to show that there is no bound on information transmission which roughly has the form Jacob is talking about, and perhaps for reasons that are not too dissimilar from the ones he's conjectured.

comment by DaemonicSigil · 2023-05-20T06:11:25.871Z · LW(p) · GW(p)

I could be wrong here, but I think the "well-understood" physics principles that spxtr is getting at are the Shannon-Hartley Theorem and Johnson-Nyquist noise. My best guess at how one would use these to derive a relationship between power consumption, bit rate, and temperature is as follows:

The power of the Johnson-Nyquist noise goes as P_noise = kT·Δf, where Δf is the bandwidth. So we're interpreting the units of kT as W/Hz. Interestingly, for power output, the resistance in the circuit is irrelevant. Larger resistance means more voltage noise and less current noise, but the overall power multiplies out to be the same.

Next, the Shannon-Hartley theorem says that the channel capacity is:

C = B·log2(1 + S/N)

Where C is the bitrate (units are bits per second), B is the bandwidth, and S, N are the power levels of signal and noise. Then the energy cost to send a bit (we'll call it E_bit) is:

E_bit = S/C = S / (B·log2(1 + S/N))

Based on Johnson-Nyquist, we have a noise level of N = kT·B, so overall the energy cost per bit should be:

E_bit = S / (B·log2(1 + S/(kT·B)))

Define a dimensionless x = S/(kT·B). Then we have:

E_bit = kT · x/log2(1 + x)

Since x must be positive, the minimum value for the dimensionless part x/log2(1+x) is ln 2 (approached as x → 0). So this gives a figure of kT·ln 2 per bit for the entire line, assuming resistance isn't too large. Interestingly, this is the same number as the Landauer limit itself, something I wasn't expecting when I started writing this.

I think one reason your capacitor charging/discharging argument didn't stop this number from coming out so small is that information can travel as pulses along the line that don't have to charge and discharge the entire thing at once. They just have to contain enough energy to charge the local area they happen to be currently occupying.
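A minimal numerical sketch of the derivation above, assuming room temperature (300 K): the dimensionless factor x/log2(1+x) approaches ln 2 as the signal power goes to zero, recovering ~kT·ln 2 per bit.

```python
import numpy as np

k_B = 1.380649e-23  # Boltzmann constant, J/K
T = 300.0           # K, assumed room temperature

for x in [10.0, 1.0, 0.1, 1e-3, 1e-6]:
    # energy per bit from Shannon-Hartley with Johnson noise kT*B
    E_bit = k_B * T * x / np.log2(1.0 + x)
    print(f"x = {x:>8g}: E/bit = {E_bit:.3e} J")

print(f"kT*ln(2) = {k_B * T * np.log(2):.3e} J  (Landauer limit at 300 K)")
```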

Replies from: ege-erdil, spxtr, jacob_cannell
comment by Ege Erdil (ege-erdil) · 2023-05-20T12:37:00.629Z · LW(p) · GW(p)

The problem with this model is that it would apply equally as well regardless of how you're transmitting information on an electromagnetic field, or for that matter, any field to which the equipartition theorem applies.

If your field looks like lots of uncoupled harmonic oscillators joined together once you take Fourier transforms, then each harmonic oscillator is a quadratic degree of freedom, and each picks up thermal noise on the order of ~ kT because of the equipartition theorem. Adding these together gives you Johnson noise in units of power. Shannon-Hartley is a mathematical theorem that has nothing to do with electromagnetism in particular, so it will also apply in full generality here.

You getting the bitwise Landauer limit as the optimum is completely unsurprising if you look at the ingredients that are going into your argument. We already know that we can beat Jacob's wire energy bounds by using optical transmission, for example. The part your calculation fails to address is what happens if we attempt to drive this transmission by moving electrons around inside a wire made of an ordinary resistive material such as copper.

It seems to me that in this case we should expect a bound that has dimensions energy/bit/length and not energy/bit, and such a bound basically has to look like Jacob's bound by dimensional analysis, modulo the length scale of 1 nm being correct.

Replies from: DaemonicSigil, spxtr
comment by DaemonicSigil · 2023-05-21T00:42:16.327Z · LW(p) · GW(p)

Yeah, I agree that once you take into account resistance, you also get a length scale. But that characteristic length is going to be dependent on the exact geometry and resistance of your transmission line. I don't think it's really possible to say that there's a fundamental constant of ~1nm that's universally implied by thermodynamics, even if we confine ourselves to talking about signal transmission by moving electrons in a conductive material.

For example, take a look at this chart:

(source) At 1GHz, we can see that:

  1. There's a wide spread of possible levels of attenuation for different cable types. Note the log scale.

  2. A typical level of attenuation is 10dB over 100 ft. If the old power requirement per bit was about kT·ln 2, this new power requirement is about 10·kT·ln 2. Then presumably to send the signal another 100ft, we'd have to pay another ~10·kT·ln 2. Call it ~100·kT to account for inefficiencies in the signal repeater. So this gives us a cost on the order of 1 kT per foot rather than per nanometer! (A rough numerical sketch of this estimate follows below.)
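A rough numerical sketch of the per-foot estimate in point 2; the repeater overhead factors here are assumptions, since the chart only gives the ~10 dB/100 ft attenuation:

```python
import numpy as np

k_B, T = 1.380649e-23, 300.0
E_rx_min = k_B * T * np.log(2)          # minimum received energy per bit (kT ln 2)
atten_dB = 10.0                         # dB per 100 ft, from the chart
loss_factor = 10 ** (atten_dB / 10)     # factor-of-10 power loss per 100 ft segment

E_tx = loss_factor * E_rx_min           # energy launched per bit per 100 ft segment
for repeater_overhead in [1, 10]:       # assumed repeater inefficiency factor
    per_foot = repeater_overhead * E_tx / 100.0
    print(f"repeater overhead {repeater_overhead:>2}x: "
          f"{per_foot:.2e} J/bit/ft  (~{per_foot / (k_B * T):.2f} kT per foot)")
```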

Replies from: jacob_cannell, ege-erdil
comment by jacob_cannell · 2023-05-21T07:38:27.709Z · LW(p) · GW(p)

That linked article and graph seems to be talking about optical communication (waveguides), not electrical.

There's nothing fundamental about ~1nm, it's just a reasonable rough guess of max tile density. For thicker interconnect it seems obviously suboptimal to communicate bits through maximally dense single electron tiles.

But you could imagine single electron tile devices with anisotropic interconnect tiles, where a single electron moves between two precise slots separated by some greater distance, and then ask what the practical limit on that separation distance is: it ends up being the mean free path.

MFP also naturally determines material resistivity/conductivity.

So anisotropic tiles with length scale around mean free path is about the best one could expect from irreversible communication over electronic wires, and actual electronic wire signaling in resistive wires comes close to that bound such that it is an excellent fit for actual wire energies. This makes sense as we shouldn't expect random electron motion in wires to beat single electron cellular automata that use precise electron placement.

The equations you are using here seem to be a better fit for communication in superconducting wires where reversible communication is possible.

Replies from: DaemonicSigil
comment by DaemonicSigil · 2023-05-21T17:12:10.135Z · LW(p) · GW(p)

That linked article and graph seems to be talking about optical communication (waveguides), not electrical.

Terminology: A waveguide has a single conductor, example: a box waveguide. A transmission line has two conductors, example: a coaxial cable.

Yes most of that page is discussing waveguides, but that chart ("Figure 5. Attenuation vs Frequency for a Variety of Coaxial Cables") is talking about transmission lines, specifically coaxial cables. In some sense even sending a signal through a transmission line is unavoidably optical, since it involves the creation and propagation of electromagnetic fields. But that's also kind of true of all electrical circuits.

Anyways, given that this attenuation chart should account for all the real-world resistance effects and it says that I only need to pay an extra factor of 10 in energy to send a 1GHz signal 100ft, what's the additional physical effect that needs to be added to the model in order to get a nanometer length scale rather than a centimeter length scale?

Replies from: jacob_cannell, jacob_cannell
comment by jacob_cannell · 2023-05-24T13:27:57.355Z · LW(p) · GW(p)

See my reply here [LW(p) · GW(p)].

Using steady state continuous power attenuation is incorrect for EM waves in a coax transmission line. It's the difference between the small power required to maintain drift velocity against frictive resistance vs the larger energy required to accelerate electrons up to the drift velocity from zero for each bit sent.

comment by jacob_cannell · 2023-05-21T23:37:41.755Z · LW(p) · GW(p)

In some sense none of this matters because if you want to send a bit through a wire using minimal energy, and you aren't constrained much by wire thickness or the requirement of a somewhat large encoder/decoder devices, you can just skip the electron middleman and use EM waves directly - ie optical.

I don't have any strong fundamental reason why you couldn't use reversible signaling through a wave propagating down a wire - it is just another form of wave as you point out.

The Landauer bound still applies of course, it just determines the energy involved rather than dissipated. If the signaling mechanism is irreversible, then the best that can be achieved is on order ~1e-21 J/bit/nm. (10x landauer bound for minimal reliability over a long wire, but distance scale of about 10 nm from the mean free path of metals). Actual coax cable wire energy is right around that level, which suggests to me that it is irreversible for whatever reason.

comment by spxtr · 2023-05-21T03:29:00.360Z · LW(p) · GW(p)

The part your calculation fails to address is what happens if we attempt to drive this transmission by moving electrons around inside a wire made of an ordinary resistive material such as copper.

I have a number floating around in my head. I'm not sure if it's right, but I think that at GHz frequencies, electrons in typical wires are moving sub picometer distances (possibly even femtometers?) per clock cycle.

The underlying intuition is that electron charge is "high" in some sense, so that 1. adding or removing a small number of electrons corresponds to a huge amount of energy (remove 1% of electrons from an apple and it will destroy the Earth in its explosion!) and 2. moving the electrons in a metal by a tiny distance (sub picometer) can lead to large enough electric fields to transmit signals with high fidelity.

Feel free to check these numbers, as I'm just going by memory.

 

The end result is that we can transmit signals with high fidelity by moving electrons many orders of magnitude less distance than their mean free path, which means intuitively it can be done more or less loss-free. This is not a rigorous calculation, of course.

Replies from: ege-erdil
comment by Ege Erdil (ege-erdil) · 2023-05-21T13:56:51.137Z · LW(p) · GW(p)

I have a number floating around in my head. I'm not sure if it's right, but I think that at GHz frequencies, electrons in typical wires are moving sub picometer distances (possibly even femtometers?) per clock cycle.

The absolute speed of conduction band electrons inside a typical wire should be around 1e6 m/s at room temperature. At GHz frequencies, the electrons are therefore moving distances comparable to 1 mm per clock cycle.

If you look at the average velocity, i.e. the drift velocity, then that's of course much slower and the electrons will be moving much more slowly in the wire - the distances you quote should be of the right order of magnitude in this case. But it's not clear why the drift velocity of electrons is what matters here. By Maxwell, you only care about electron velocity on the average insofar as you're concerned with the effects on the EM field, but actually, the electrons are moving much faster so could be colliding with a lot of random things and losing energy in the process. It's this effect that has to be bounded, and I don't think we can actually bound it by a naive calculation that assumes the classical Drude model or something like that.

If someone worked all of this out in a rigorous analysis I could be convinced, but your reasoning is too informal for me to really believe it.

Replies from: spxtr
comment by spxtr · 2023-05-21T17:26:24.002Z · LW(p) · GW(p)

Ah, I was definitely unclear in the previous comment. I'll try to rephrase.

When you complete a circuit, say containing a battery, a wire, and a light bulb, a complicated dance has to happen for the light bulb to turn on. At near the speed of light, electric and magnetic fields around the wire carry energy to the light bulb. At the same time, the voltage throughout the wire establishes itself at the values you would expect from Ohm's law and Kirchhoff's rules and such. At the same time, electrons throughout the wire begin to feel a small force from an electric field pointing along the direction of the wire, even if the wire has bends and such. These fields and voltages, outside and inside the wire, are the result of a complicated, self-consistent arrangement of surface charges on the wire.

See this youtube video for a nice demonstration of a nonintuitive result of this process. The video cites this paper among others, which has a nice introduction and overview.

The key point is that establishing these surface charges and propagating the signal along the wire amounts to moving an extremely small amount of electric charge. In that youtube video he asserts without citation that the electrons move "the radius of a proton" (something like a femtometer) to set up these surface charges. I don't think it's always so little, but again I don't remember where I got my number from. I can try to either look up numbers or calculate it myself if you'd like.

Signals (low vs high voltages, say) do not propagate through circuits by hopping from electron to electron within a wire. In a very real sense they do not even propagate through the wire, but through electric and magnetic fields around and within the wire. This broad statement is also true at high frequencies, although there the details become even more complicated.

To maybe belabor the point: to send a bit across a wire, we set the voltage at one side high or low. That voltage propagates across the wire via the song and dance I just described. It is the heat lost in propagating this voltage that we are interested in for computing the energy of sending the bit over, and this heat loss is typically extremely small, because the electrons barely have to move and so they lose very little energy to collisions.

Replies from: ege-erdil
comment by Ege Erdil (ege-erdil) · 2023-05-21T18:41:58.583Z · LW(p) · GW(p)

I'm aware of all of this already, but as I said, there seems to be a fairly large gap between this kind of informal explanation of what happens and the actual wire energies that we seem to be able to achieve. Maybe I'm interpreting these energies in a wrong way and we could violate Jacob's postulated bounds by taking an Ethernet cable and transmitting 40 Gbps of information at a long distance, but I doubt that would actually work.

I'm in a strange situation because while I agree with you that the tile model of a wire is unphysical and very strange, at the same time it seems to me intuitively that if you tried to violate Jacob's bounds by many orders of magnitude, something would go wrong and you wouldn't be able to do it. If someone presented a toy model which explained why in practice we can get wire energies down to a certain amount that is predicted by the model while in theory we could lower them by much more, I think that would be quite persuasive.

Replies from: spxtr
comment by spxtr · 2023-05-21T22:18:35.281Z · LW(p) · GW(p)

Maybe I'm interpreting these energies in a wrong way and we could violate Jacob's postulated bounds by taking an Ethernet cable and transmitting 40 Gbps of information at a long distance, but I doubt that would actually work.

Ethernet cables are twisted pair and will probably never be able to go that fast. You can get above 10 GHz with rigid coax cables, although you still have significant attenuation.

Let's compute heat loss in a 100 m LDF5-50A, which evidently has 10.9 dB/100 m attenuation at 5 GHz. This is very low in my experience, but it's what they claim.

Say we put 1 W of signal power at 5 GHz in one side. Because of the 10.9 dB attenuation, we receive 94 mW out the other side, with 906 mW lost to heat.

The Shannon-Hartley theorem says that we can compute the capacity of the wire as C = B·log2(1 + S/N), where B is the bandwidth, S is received signal power, and N is noise power.

Let's assume Johnson noise. These cables are rated up to 100 C, so I'll use that temperature, although it doesn't make a big difference.

If I plug in 5 GHz for B, 94 mW for S, and kTB for N, then I get a channel capacity of about 160 Gbit/s.

The heat lost is then 906 mW / 160 Gbit/s ≈ 5.7e-12 J/bit over the 100 m cable, or ~0.05 fJ/bit/mm. Quite low compared to Jacob's ~10 fJ/mm "theoretical lower bound."

One free parameter is the signal power. The heat loss over the cable is linear in the signal power, while the channel capacity is sublinear, so lowering the signal power reduces the energy cost per bit. It is 10 fJ/bit/mm at about 300 W of input power, quite a lot!

Another is noise power. I assumed Johnson noise, which may be a reasonable assumption for an isolated coax cable, but not for an interconnect on a CPU. Adding an order of magnitude or two to the noise power does not substantially change the final energy cost per bit (0.05 goes to 0.07), however I doubt even that covers the amount of noise in a CPU interconnect.

Similarly, raising the cable attenuation to 50 dB/100 m does not even double the heat loss per bit. Shannon's theorem still allows a significant capacity. It's just a question of whether or not the receiver can read such small signals.

 

The reason that typical interconnects in CPUs and the like tend to be in the realm of 10-100 fJ/bit/mm is because of a wide range of engineering constraints, not because there is a theoretical minimum. Feel free to check my numbers of course. I did this pretty quickly.
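A short script reproducing this calculation (bandwidth, temperature, input power and the 94 mW received power are the values from this comment; the rest are physical constants):

```python
import numpy as np

k_B = 1.380649e-23
T = 373.0            # K (the 100 C cable rating)
B = 5e9              # Hz bandwidth
P_in = 1.0           # W of signal power in
P_out = 94e-3        # W received after the quoted attenuation
L_mm = 100e3         # 100 m of cable, in mm

P_heat = P_in - P_out                    # ~906 mW dissipated along the cable
N = k_B * T * B                          # Johnson noise power in the band
C = B * np.log2(1 + P_out / N)           # Shannon capacity, bits/s

print(f"capacity     : {C / 1e9:.0f} Gbit/s")
print(f"heat per bit : {P_heat / C:.2e} J/bit over 100 m")
print(f"per mm       : {P_heat / C / L_mm * 1e15:.3f} fJ/bit/mm")
```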

Replies from: jacob_cannell, ege-erdil
comment by jacob_cannell · 2023-05-21T23:48:45.268Z · LW(p) · GW(p)

The heat lost is then [..] 0.05 fJ/bit/mm. Quite low compared to Jacob's ~10 fJ/mm "theoretical lower bound."

In the original article I discuss interconnect wire energy, not a "theoretical lower bound" for any wire energy communication method - and immediately point out reversible communication methods (optical, superconducting) that do not dissipate the wire energy.

Coax cable devices seem to use [LW(p) · GW(p)] around 1 to 5 fJ/bit/mm at a few W of power, or a few OOM more than your model predicts here - so I'm curious what you think that discrepancy is, without necessarily disagreeing with the model.

I describe a simple model of wire bit energy for EM wave transmission in coax cable here [LW(p) · GW(p)] which seems physically correct but also predicts a bit energy distance range somewhat below observed.

Replies from: spxtr
comment by spxtr · 2023-05-22T02:38:39.990Z · LW(p) · GW(p)

Active copper cable at 0.5W for 40G over 15 meters is ~1e-21 J/bit/nm, assuming it actually hits 40G at the max length of 15m.

I can't access the linked article, but an active cable is not simple to model because its listed power includes the active components. We are interested in the loss within the wire between the active components.

This source has specs for a passive copper wire capable of up to 40G @5m using <1W, which works out to ~5e-21 J/bit/nm, or a bit less.

They write <1 W for every length of wire, so all you can say is <5 fJ/mm. You don't know how much less. They are likely writing <1 W for comparison to active wires that consume more than a W. Also, these cables seem to have a powered transceiver built-in on each end that multiplex out the signal to four twisted pair 10G lines.

Compare to 10G from here, which may use up to 5W to hit up to 10G at 100M, for ~5e-21 J/bit/nm.

Again, these have a powered transceiver on each end.

So for all of these, all we know is that the sum of the losses of the powered components and the wire itself are of order 1 fJ/mm. Edit: I would guess that probably the powered components have very low power draw (I would guess 10s of mW) and the majority of the loss is attenuation in the wire.

 

The numbers I gave essentially are the theoretical minimum energy loss per bit per mm of that particular cable at that particular signal power. It's not surprising that multiple twisted pair cables do worse. They'll have higher attenuation, lower bandwidth, the standard transceivers on either side require larger signals because they have cheaper DAC/ADCs, etc. Also, their error correction is not perfect, and they don't make full use of their channel capacity. In return, the cables are cheap, flexible, standard, etc.

There's nothing special about kT/1 nm.

comment by Ege Erdil (ege-erdil) · 2023-05-21T23:08:38.282Z · LW(p) · GW(p)

I think this calculation is fairly convincing pending an answer from Jacob. You should have probably just put this calculation at the top of the thread, and then the back-and-forth would probably not have been necessary. The key parameter that is needed here is the estimate of a realistic attenuation rate for a coaxial cable, which was missing from DaemonicSigil's original calculation that was purely information-theoretic.

As an additional note here, if we take the same setup you're using and treat the input power P_in as a free parameter, then the energy per bit per distance is given by

E(P_in) = 0.906·P_in / (L · B · log2(1 + 0.094·P_in/(kT·B)))

in units of J/bit/mm (with L = 1e5 mm, B = 5 GHz, T = 373 K). This does not have a global optimum for P_in > 0 because it's strictly increasing, but we can take a limit to get the theoretical lower bound

lim (P_in → 0) E(P_in) = (0.906/0.094) · kT·ln(2) / L ≈ 3e-25 J/bit/mm

which is much lower than what you calculated, though to achieve this you would be sending information very slowly - indeed, infinitely slowly in the limit P_in → 0.
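A quick numerical check of this limit, reusing the bandwidth, temperature, received-power fraction and cable length from the calculation above:

```python
import numpy as np

k_B, T, B = 1.380649e-23, 373.0, 5e9
L_mm = 100e3
frac_rx = 0.094                          # fraction of input power received

def energy_per_bit_per_mm(P_in):
    P_rx = frac_rx * P_in
    C = B * np.log2(1 + P_rx / (k_B * T * B))   # Shannon capacity, bits/s
    return (P_in - P_rx) / C / L_mm             # dissipated energy per bit per mm

for P in [1.0, 1e-3, 1e-6, 1e-9, 1e-12]:
    print(f"P_in = {P:>6g} W: {energy_per_bit_per_mm(P):.2e} J/bit/mm")

limit = (1 - frac_rx) / frac_rx * k_B * T * np.log(2) / L_mm
print(f"P_in -> 0 limit: {limit:.2e} J/bit/mm")
```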

Replies from: jacob_cannell, spxtr
comment by jacob_cannell · 2023-05-24T10:24:59.069Z · LW(p) · GW(p)

I am skeptical that steady state direct current flow attenuation is the entirety of the story (and indeed it seems to underestimate actual coax cable wire energy of ~1e-21 to 5e-21 J/bit/nm by a few OOM).

For coax cable the transmission is through a transverse (AC) wave that must accelerate a quantity of electrons linearly proportional to the length of the cable. These electrons rather rapidly dissipate this additional drift velocity energy through collisions (resistance), and the entirety of the wave energy is ultimately dissipated.

This seems different than sending continuous DC power through the wire where the electrons have a steady state drift velocity and the only energy required is that to maintain the drift velocity against resistance. For wave propagation the electrons are instead accelerated up from a drift velocity of zero for each bit sent. It's the difference between the energy required to accelerate a car up to cruising speed and the power required to maintain that speed against friction.

transverse wave

If we take the bit energy to be E_b, then there is a natural EM wavelength of λ = hc/E_b, so λ ≈ (1.24 um·eV)/E_b, which works out to ~1um for ~1eV. Notice that using a lower frequency / longer wavelength seems to allow one to arbitrarily decrease the bit energy distance scale, but it turns out this just increases the dissipative loss.

So an initial estimate of the characteristic bit energy distance scale here is ~1eV/bit/um or ~1e-22 J/bit/nm. But this is obviously an underestimate as it doesn't yet include the effect of resistance (and skin effect) during wave propagation.

The bit energy of one wavelength is implemented through electron peak drift velocity on the order of v ~ sqrt(2·E_b/(N·m_e)), where N is the number of carrier electrons in one wavelength wire section. The relaxation time, or mean time between thermal collisions, with a room temp thermal velocity of around ~1e5 m/s and the mean free path of ~40 nm in copper, is τ ~ 4e-13s. Meanwhile the inverse frequency or timespan of one wavelength is around 3e-14 s for an optical frequency 1eV wave, and is ~1e-9 s for a more typical (much higher amplitude) gigahertz frequency wave. So it would seem that resistance is quite significant on these timescales.

Very roughly the gigahertz 1e-9s period wave requires about 5 oom more energy per wavelength due to dissipation which cancels out the 5 oom larger distance scale. Each wavelength section loses about half of the invested energy every τ ~ 4e-13 seconds, so maintaining the bit energy of E_b requires roughly input power of ~E_b/τ for the 1/f duration of the wave, which cancels out the effect of the longer wavelength distance, resulting in a constant bit energy distance scale independent of wavelength/frequency (naturally there are many other complex effects that are wavelength/frequency dependent but they can't improve the bit energy distance scale).

For a low frequency (long wavelength) with τ << 1/f:

~ 1eV / 10um ~ 1e-23 J/bit/nm

If you take the bit energy down to the minimal landauer limit of ~0.01 eV this ends up about equivalent to your lower limit, but I don't think that would realistically propagate.

A real wave propagation probably can’t perfectly transfer the bit energy over longer distances and has other losses (dielectric loss, skin effect, etc), so vaguely guesstimating around 100x loss would result in ~1e-21 J/bit/nm. The skin effect alone perhaps increases resistance by roughly 10x at gigahertz frequencies. Coax devices also seem constrained to use specific lower gigahertz frequences and then boost the bitrate through analog encoding, so for example 10-bit analog increases bitrate by 10x at the same frequency but requires about 1024X more power, so that is 2 OOM less efficient per bit.

Notice that the basic energy distance scale of ~E_b per mean free path is derived via the relaxation time τ = λ_mfp / v_t, where λ_mfp is the mean free path and v_t is the thermal noise velocity (around ~1e5 m/s for room temp electrons).

Coax cable doesn't seem to have any fundamental advantage over waveguide optical, so I didn't consider it at all in brain efficiency. It requires wires of about the same width several OOM larger than minimal nanoscale RC interconnect and largish sending/receiving devices as in optics/photonics.
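For reference, a quick computation of the numbers used above, taking the quoted ~40 nm mean free path and ~1e5 m/s thermal velocity, and the illustrative 1 eV bit energy:

```python
h = 6.626e-34        # Planck constant, J*s
c = 3.0e8            # speed of light, m/s
eV = 1.602e-19       # J per eV

mfp = 40e-9          # m, mean free path in copper (value quoted above)
v_thermal = 1e5      # m/s, room-temp electron thermal velocity (value quoted above)
tau = mfp / v_thermal
print(f"relaxation time tau ~ {tau:.1e} s")                  # ~4e-13 s

E_bit = 1.0 * eV                                             # illustrative 1 eV bit energy
wavelength = h * c / E_bit
print(f"EM wavelength for 1 eV ~ {wavelength * 1e6:.2f} um")  # ~1.2 um
print(f"naive E_b per wavelength ~ {E_bit / (wavelength * 1e9):.1e} J/bit/nm")
```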

Replies from: DaemonicSigil
comment by DaemonicSigil · 2023-05-28T03:33:42.828Z · LW(p) · GW(p)

This is very different than sending continuous power through the wire where the electrons have a steady state drift velocity and the only energy required is that to maintain the drift velocity against resistance. For wave propagation the electrons are instead accelerated up from a drift velocity of zero for each bit sent. It's the difference between the energy required to accelerate a car up to cruising speed and the power required to maintain that speed against friction.

Electrons are very light so the kinetic energy required to get them moving should not be significant in any non-contrived situation I think? The energy of the magnetic field produced by the current would tend to be much more of an important effect.

As for the rest of your comment, I'm not confident enough I understand the details of your argument be able to comment on it in detail. But from a high level view, any effect you're talking about should be baked into the attenuation chart I linked in this comment [LW(p) · GW(p)]. This is the advantage of empirically measured data. For example, the skin-effect (where high frequency AC current is conducted mostly in the surface of a conductor, so the effective resistance increases the higher the frequency of the signal) is already baked in. This effect is (one of the reasons) why there's a positive slope in the attenuation chart. If your proposed effect is real, it might be contributing to that positive slope, but I don't see how it could change the "1 kT per foot" calculation.

Replies from: jacob_cannell
comment by jacob_cannell · 2023-05-30T18:04:05.230Z · LW(p) · GW(p)

Electrons are very light so the kinetic energy required to get them moving should not be significant in any non-contrived situation I think? The energy of the magnetic field produced by the current would tend to be much more of an important effect.

My current understanding is that the electric current energy transmits through electron drift velocity (and I believe that is the standard textbook understanding, although I admit I have some questions concerning the details). The magnetic field is just a component of the EM waves which propagate changes in electron KE between electrons (the EM waves implement the connections between masses in the equivalent mass-spring system).

I'm not sure how you got "1 kT per foot" but that seems roughly similar to the model upthread I am replying to from spxtr that got 0.05 fJ/bit/mm or 5e-23 J/bit/nm. I attempted to derive an estimate from the lower level physics thinking it might be different but it ended up in the same range - and also off by the same 2 OOM vs real data. But I mention that skin effect could plausibly increase power by 10x in my lower level model, as I didn't model it nor use measured attenuation values at all. The other OOM probably comes from analog SNR inefficiency.

The part of this that is somewhat odd at first is the exponential attenuation. That does show up in my low lever model where any electron kinetic energy in the wire is dissipated by about 50% due to thermal collisions every ~ 4e-13 seconds (that is the important part from mean free path / relaxation time). But that doesn't naturally lead to a linear bit energy distance scale unless that dissipated energy is somehow replaced/driven by the preceding section of waveform.

So if you sent E_b as a single large infinitesimal pulse down a wire of length d, the energy you get on the other side is ~E_b·e^(-d/λ_a) for some attenuation length λ_a that works out to about 0.1 mm or something, as it's on the order of the propagation speed times τ, not meters. I believe if your chart showed attenuation in the 100THZ regime, on the scale of λ_a, it would be losing 50% per 0.1 mm instead of per meter.

We know that resistance is linear, not exponential - which I think arises from long steady flow where every τ ~ 4e-13 seconds half the electron kinetic energy is dissipated, but this total amount is linear with wire section length. The relaxation time then just determines what steady mean electron drift velocity (current flow) results from the dissipated energy.

So when τ is much less than the wave period you still lose about half of the wave energy every τ seconds, but that can be spread out over a much larger wavelength section (and indeed at gigahertz frequencies this model roughly predicts the correct 50% attenuation distance scale of ~10m or so).

Replies from: DaemonicSigil
comment by DaemonicSigil · 2023-06-01T06:40:34.846Z · LW(p) · GW(p)

There are two types of energy associated with a current we should distinguish. Firstly there's the power flowing through the circuit, then there's energy associated with having current flowing in a wire at all. So if we're looking at a piece of extension cord that's powering a lightbulb, the power flowing through the circuit is what's making the lightbulb shine. This is governed by the equation P = IV. But there's also some energy associated with having current flowing in a wire at all. For example, you can work out what the magnetic field should be around a wire with a given amount of current flowing through it and calculate the energy stored in the magnetic field. (This energy is associated with the inductance of the wire.) Similarly, the kinetic energy associated with the electron drift velocity is also there just because the wire has current flowing through it. (This is typically a very small amount of energy.)

To see that these types have to be distinct, think about what happens when we double the voltage going into the extension cord and also double the resistance of the lightbulb it's powering. Current stays the same, but with twice the voltage we now have twice the power flowing to the light bulb. Because current hasn't changed, neither has the magnetic field around the wire, nor the drift velocity. So the energy associated with having a current flowing in this wire is unchanged, even though the power provided to the light bulb has doubled. The important thing about the drift velocity in the context of P = IV is that it moves charge. We can calculate the potential energy associated with a charge in a wire as E = qV, and then taking the time derivative gives the power equation. It's true that drift velocity is also a velocity, and thus the charge carriers have kinetic energy too, but this is not the energy that powers the light bulb.

In terms of exponential attenuation, even DC through resistors gives exponential attenuation if you have a "transmission line" configuration of resistors that look like this:

https://math.ou.edu/~npetrov/infinite-chain-of-resistors.png

So exponential attenuation doesn't seem too unusual or surprising to me.
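To put rough numbers on the earlier point that the drift kinetic energy is tiny next to the magnetic field energy of the same current, here is a sketch with an assumed 1 A in a ~1 mm² copper wire (both values are illustrative assumptions):

```python
import numpy as np

n = 8.5e28           # conduction electron density of copper, 1/m^3
e = 1.602e-19        # C
m_e = 9.109e-31      # kg
mu0 = 4e-7 * np.pi   # vacuum permeability

I = 1.0              # A, assumed current
A = 1e-6             # m^2, assumed ~1 mm^2 cross-section

v_drift = I / (n * e * A)                        # drift velocity
KE_per_m = n * A * 0.5 * m_e * v_drift**2        # drift kinetic energy per metre of wire
B_per_m = mu0 * I**2 / (16 * np.pi)              # internal magnetic field energy per metre

print(f"drift velocity      : {v_drift:.1e} m/s")
print(f"kinetic energy / m  : {KE_per_m:.1e} J/m")
print(f"magnetic energy / m : {B_per_m:.1e} J/m (internal field only)")
```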

comment by spxtr · 2023-05-21T23:26:54.008Z · LW(p) · GW(p)

Indeed, the theoretical lower bound is very, very low.

Do you think this is actually achievable with a good enough sensor if we used this exact cable for information transmission, but simply used very low input energies?

The minimum is set by the sensor resolution and noise. A nice oscilloscope, for instance, will have, say, 12 bits of voltage resolution and something like 10 V full scale, so ~2 mV minimum voltage. If you measure across a 50 Ohm load then the minimum received power you can see is (2 mV)^2 / 50 Ω ≈ 8e-8 W. This is an underestimate, but that's the idea.

comment by spxtr · 2023-05-20T07:06:16.961Z · LW(p) · GW(p)

This is the right idea, but in these circuits there are quite a few more noise sources than Johnson noise. So, it won't be as straightforward to analyze, but you'll still end up with essentially a relatively small (compared to L/nm) constant times kT.

comment by jacob_cannell · 2023-05-21T13:53:32.994Z · LW(p) · GW(p)

I think one reason your capacitor charging/discharging argument didn't stop this number from coming out so small is that information can travel as pulses along the line that don't have to charge and discharge the entire thing at once.

Sure information can travel that way in theory, but it doesn't work out in practice for dissipative resistive (ie non superconducting) wires. Actual on chip interconnect wires are 'RC wires' which do charge/discharge the entire wire to send a bit. They are like a pipe which allows electrons to flow from some source to a destination device, where that receiving device (transistor) is a capacitor which must be charged to a bit energy E_b. The Johnson thermal noise on a capacitor is just the same Landauer Boltzmann noise of ~kT. The wire geometry aspect ratio (width/length) determines the speed at which the destination capacitor can be charged up to the bit energy E_b.

The only way for the RC wire to charge the distant receiver capacitor is by charging the entire wire, leading to the familiar RC wire capacitance energy, which is also very close to the landauer tile model energy using mean free path as the tile size (for the reasons i've articulated in various previous comments).
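A back-of-the-envelope sketch of this RC picture, assuming a typical on-chip capacitance per length of ~0.2 pF/mm (an assumed figure, not a number from the post); at ~1 V this lands in the same range as the ~100 fJ/bit/mm interconnect figures discussed elsewhere in the thread:

```python
C_per_mm = 0.2e-12   # F per mm of wire (~0.2 fF/um), assumed typical on-chip value

for V in [1.0, 0.5, 0.1]:
    E_per_mm = 0.5 * C_per_mm * V**2   # energy to charge the wire once per bit
    print(f"V = {V:>3} V: ~{E_per_mm * 1e15:6.1f} fJ/bit/mm")
```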

Replies from: DaemonicSigil
comment by DaemonicSigil · 2023-05-21T16:50:15.834Z · LW(p) · GW(p)

Yeah, to be clear I do agree that your model gives good empirical results for on-chip interconnect. (I haven't checked the numbers myself, but I believe you that they match up well.) (Though I don't necessarily buy that the 1nm number is related to atom spacing in copper or anything like that. It probably has more to do with the fact that scaling down a transmission line while keeping the geometry the same means that the capacitance per unit length is constant. The idea you mention in your other comment about it somehow falling out of the mean free path also seems somewhat plausible.)

Anyway, I don't think my argument would apply to chip interconnect. At 1GHz, the wavelength is going to be about a foot, which is still wider than any interconnect on the microchip will be long. And we're trying to send a single bit along the line using a DC voltage level, rather than some kind of fancy signal wave. So your argument about charging and discharging the entire line should still apply in this case. My comment would mostly apply to Steven Byrnes's ethernet cable example, rather than microchip interconnect.

comment by bhauth · 2023-05-15T11:40:51.191Z · LW(p) · GW(p)

The amount dissipated within the 30-meter cable is of course much less than that, or else there would be nothing left for the receiver to measure.

Signals decay exponentially and dissipation with copper cables can be ~50dB. At high frequencies, most of the power is lost.

Replies from: steve2152
comment by Steven Byrnes (steve2152) · 2023-05-15T12:01:35.744Z · LW(p) · GW(p)

Sure, I guess the "much less" was a guess; I should have just said "less" out of an abundance of caution.

Before writing that comment, I had actually looked for a dB/meter versus frequency plot for cat8 Ethernet cable and couldn't find any. Do you have a ref? It's not important for this conversation, I'm just curious. :)

comment by jacob_cannell · 2023-04-30T23:20:42.911Z · LW(p) · GW(p)

The 'tile' or cellular automata wire model fits both on-chip copper interconnect wire energy and brain axon wire energy very well. It is more obvious why it fits axon signal conduction as that isn't really a traditional voltage propagation in a wire, it's a propagation of ion cellular automata state changes. I'm working on a better writeup and I'll look into how the wire equations could relate. If you have some relevant link to physical limits of communication over standard electrical wires, that would of course be very interesting/relevant.

My expectation is… Well, I’m a bit concerned that I’m misunderstanding ethernet specs, but it seems that there are 4 twisted pairs with 75Ω characteristic impedance, and the voltage levels go up to ±1V. That would amount to a power flow of up to 4V²/Z=0.05W.

I'm guessing this is probably the correct equation for the resistive loss, but irreversible communication requires doing something dumb like charging and discharging/dissipating (or equivalent) every clock cycle, which is OOM greater than the resistive loss (which would be appropriate for a steady current flow).

Do you have a link to the specs you were looking at? As I'm seeing a bunch of variation in 40G capable cables. Also 40Gb/s is only the maximum transmission rate, actual rate may fall off with distance from what I can tell.

The first reference I can find From this website is second hand but:

When the data rate required for interconnection is less than 5 Gbps, the passive copper cable is usually used for interconnection in data center. However, they can only support 40G transmission over really short distance.

Active copper cable can support 40G transmission over copper cable up to 15 meters with QSFP+ connector embedded with electronics. In the battle over transmission distance, optical active cable wins without doubt.

The connectors attached with AOC and active copper cable are the main reason why the two cables can support 40G transmission over longer distance than that of passive copper cable. AOC which can support the longest 40G transmission distance is with the highest power consumption—more than 2W. The power consumption for active copper cable is only 440mW. However, passive copper cable requires no power during the transmission.

Active copper cable at 0.5W for 40G over 15 meters is ~1e-21 J/bit/nm, assuming it actually hits 40G at the max length of 15m.

This source has specs for a passive copper wire capable of up to 40G @5m using <1W, which works out to ~5e-21 J/bit/nm, or a bit less.

Compare to 10G from here, which may use up to 5W to hit up to 10G at 100M, for ~5e-21 J/bit/nm.

One of the weird things in this discussion from my perspective is that you’re OK with photons carrying information with less than 2e-21 J/bit/nm energy dissipation but you’re not OK with wires carrying information with less than 2e-21 J/bit/nm energy dissipation.

I do think I have a good explanation in the cellular automata model, and I'll just put my full response in there, but basically it's the difference between using fermions vs bosons to propagate bits through the system. Photons as bosons are more immune to EM noise perturbations and in typical use have much longer free path length (distance between collisions). One could of course use electrons ballistically to get some of those benefits but they are obviously slower and 'noisier'.

comment by jacob_cannell · 2023-05-21T16:16:16.315Z · LW(p) · GW(p)

The challenge is that conventional transistors need V to be much higher than kT/e, where e is the electron charge, because the V is forming an electrostatic barrier that is supposed to block electrons, even when those electrons might be randomly thermally excited sometimes. The relevant technical term here is “subthreshold swing”. There is a natural (temperature-dependent) limit to subthreshold swing in normal transistors, based on thermal excitation over the barrier—the “thermionic limit” of 60mV/decade at room temperature.

The thermionic voltage of ~20mV is just another manifestation of the landauer/boltzmann noise scale. Single/few electron devices need to use large multiples of this voltage for high reliability, many electron devices can use smaller multiples. I use this in the synapse section "minimal useful Landauer Limit voltage of ~70mV" and had guessed at the concept before being aware of the existing term "thermionic limit".

comment by spxtr · 2023-05-15T07:10:29.702Z · LW(p) · GW(p)

The post is making somewhat outlandish claims about thermodynamics. My initial response was along the lines of "of course this is wrong. Moving on." I gave it another look today. In one of the first sections I found (what I think is) a crucial mistake. As such, I didn't read the rest. I assume it is also wrong.

The original post said:

A non-superconducting electronic wire (or axon) dissipates energy according to the same Landauer limit per minimal wire element. Thus we can estimate a bound on wire energy based on the minimal assumption of 1 minimal energy unit E_b per bit per fundamental device tile, where the tile size for computation using electrons is simply the probabilistic radius or De Broglie wavelength of an electron[7:1] [LW · GW], which is conveniently ~1nm for 1eV electrons, or about ~3nm for 0.1eV electrons. Silicon crystal spacing is about ~0.5nm and molecules are around ~1nm, all on the same scale.

Thus the fundamental (nano) wire energy is: ~1 E_b/nm, with E_b in the range of 0.1eV (low reliability) to 1eV (high reliability).

The predicted wire energy is ~1e-19 J/bit/nm or ~100 fJ/bit/mm for semi-reliable signaling at 1V with E_b = 1eV, down to ~10 fJ/bit/mm at 100mV with complex error correction, which is an excellent fit for actual interconnect wire energy[8] [LW · GW][9] [LW · GW][10] [LW · GW][11] [LW · GW], [...]

The measured/simulated interconnect wire energies from the citations in the realm of 10s-100s of fJ/bit/mm are a result of physical properties of the interconnects. These include things like resistances (they're very small wires) and stray capacitances. In principle, these numbers could be made basically arbitrarily worse by using smaller (cross sectional) interconnects, more resistive materials, or tighter packing of components. They can also be made significantly better, especially if you're allowed to consider alternative materials. Importantly, this loss can be dramatically reduced by reducing the operating voltage of the system, but some components do not work well at lower voltages, so there's a tradeoff.

... and we're supposed to compare that number with ~E_b/bit/nm. I might be willing to buy the wide range of E_b, but the choice of de Broglie wavelength as "minimal wire element" is completely arbitrary. The author seems to know this, because they give a few more examples of length scales that are around a nanometer. I can do that too: The spacing of conduction electrons in copper (n^(-1/3)) is roughly 0.2 nm. The mean free path of electrons in copper is a few nm. None of that matters, because signals do not propagate through wires by hopping from electron to electron. There is a complicated dance of electric field, magnetic field, conducting electrons as well as dielectrics that all work together to make signals move. The equation ~E_b per bit per nm is asserting none of that matters, and it is simply unphysical. Sorry, there's not a nice way to put that.

The author seems to want us to compare the two equations, but they are truly two different things. I can communicate the same information in a circuit (bits per second held fixed) but dramatically vary the cited "excellent fit" numbers by orders of magnitude by changing their material or lowering their voltage.

The Landauer energy is very, very small compared to just about every other energy that we care about. It is basically a non-factor in all but a few very esoteric experiments. 20 meV is a decent chunk of energy if it is given to every electron in a system, but it is extremely small as a one-off. It is absurd to think that computers, with their resistive wires and leaky gates, are anywhere near perfect efficiency. It is even more absurd to think that brains, squishy beasts that literally physically pump ions across membranes and back, are anywhere near that limit.
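For scale, here is kT·ln(2) next to the switching energy of a small capacitive node at logic voltages, assuming an illustrative ~1 fF node capacitance (an assumed figure, not one from the post):

```python
import numpy as np

k_B, T = 1.380649e-23, 300.0
landauer = k_B * T * np.log(2)
print(f"kT*ln(2)           = {landauer:.2e} J")

C_node = 1e-15       # F, assumed small on-chip node capacitance
for V in [1.0, 0.3]:
    E_switch = 0.5 * C_node * V**2   # energy to charge the node once
    print(f"0.5*C*V^2 at {V:.1f} V = {E_switch:.2e} J  (~{E_switch / landauer:,.0f}x Landauer)")
```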

 

Looking through the comments, I now see that another user has correctly pointed out the same mistakes. See here [LW(p) · GW(p)], and the comments under that. Give them the $250. EY also pointed out the absurdity of brains being considered anywhere near efficient. Nice work!

comment by Ilio · 2023-04-27T19:13:31.091Z · LW(p) · GW(p)

Fwiw I did spotcheck this post at the time, although I did not share it then (bad priors). Here it goes:

Yes, it's probably approximately right, but you need to buy that its assumptions are the right ones. However, these assumptions also make the question somewhat unimportant for EA purposes, because even if the brain is one of the most efficient for its specs, including a few pounds and watts, you could still believe doom or foom could happen with the same specs, except with a few tons and megawatts, or for some other specs (quantum computers, somewhat soon, or something else, somewhat maybe).

Edit: I just noticed this somewhat copies Vaniver's answer above, so let's add something more: why I think it's not the most interesting set of assumptions.

First, to me the brain is primarily optimized for robustness of its construction plan: for working across a large set of species that all inherit basically the same construction plan, and for multiple (allelic) variants of these plans (with sexual competition, etc). Yes, this is probably compatible with optimizing energy in the long run, but not enough to invent rolling balls, if you see what I mean.

Second, and perhaps more important, it assumes that the brain is doing a hard computation. On that, we really have no idea. Like most cognitive neuroscientists from the nineties, I once started a presentation with the widely-accepted trope that our brain is the most complicated thing bla bla. And yes, there are reasons to think that. On the other hand, if resnet-50 can predict most of the variance in neural hemodynamics while viewing the same picture, then maybe current GPUs are not that far from the effective computational power of billions of neurons. This shouldn't be that surprising: after all, biological neurons are not optimized for handling 64-bit precision and gigahertz clocks.

Replies from: Ilio
comment by Ilio · 2023-06-24T14:04:11.894Z · LW(p) · GW(p)

Follow-up: excellent new material from SB, who provides a concrete research avenue for showing that physics allows more than what JC's assumptions allow. However, the most interesting part might be Jacob providing his best (imho) point for why we can't reject his assumptions so easily.

So I observe the fact that human engineering and biology have ended up on the same pareto surface for interconnect space & energy efficiency - despite being mostly unrelated optimization processes using very different materials - as evidence of a hard pareto surface rather than being mere coincidence.

Very good point indeed, unless someone could explain this coincidence using cheaper assumptions.

https://www.lesswrong.com/posts/YihMH7M8bwYraGM8g/my-side-of-an-argument-with-jacob-cannell-about-chip [LW · GW]

comment by DirectedEvolution (AllAmericanBreakfast) · 2023-04-26T17:49:36.574Z · LW(p) · GW(p)

I think it would be helpful if you specified a little more precisely which claims of Jake's you want spot checked. His posts are pretty broad and not everything in even the ones about efficiency are just about efficiency.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2023-04-26T22:29:22.279Z · LW(p) · GW(p)

All technical claims made in the brain efficiency post, broadly construed. Including and especially limits to energy efficiency, interconnect losses, Landauer limit, convection vs blackbody radiation, claims concerning the effective working memory of the human brain versus that of computers, end of Moore's law, CPU vs GPU vs neuromorphic chips, etc etc

Replies from: AllAmericanBreakfast
comment by DirectedEvolution (AllAmericanBreakfast) · 2023-04-26T23:21:39.218Z · LW(p) · GW(p)

Thank you, that is very helpful!

comment by Archimedes · 2023-04-27T23:19:44.524Z · LW(p) · GW(p)

I'm not interested in the prize, but as long as we're spot-checking, this paragraph bothered me:

It turns out that spreading out the communication flow rate budget over a huge memory store with a slow clock rate is fundamentally more powerful than a fast clock rate over a small memory store. One obvious reason: learning machines have a need to at least store their observational history. A human experiences a sensory input stream at a bitrate of about 10^6 bps (assuming maximal near-lossless compression) for about 10^9 seconds over typical historical lifespan, for a total of about 10^15 bits. The brain has about 2∗10^14 synapses that store roughly 5 bits each, for about 10^15 bits of storage. This is probably not a coincidence.

The idea of bitrate * lifespan = storage capacity makes sense for a VHS or DVD but human memory is completely different. Only a tiny percentage of all our sensory input stream makes it through memory consolidation into long-term memory. Sensory memory is also only one type of memory alongside others like semantic, episodic, autobiographical, and procedural (this list is neither mutually exclusive nor collectively exhaustive). Brains are very good at filtering the sensory stream to focus on salient information and forgetting the vast majority of the rest. This results in memory that is highly compressed (and highly lossy).

Because brain memory is so different than simple storage, this napkin math is roughly analogous to computing body mass from daily intake of air, water, and food mass and neglecting how much actually gets stored versus ultimately discarded. You can do the multiplication but it's not very meaningful without additional information like retention ratios.

comment by Steven Benfield (steven-benfield) · 2024-01-04T23:05:37.455Z · LW(p) · GW(p)

I sort of dismiss the entire argument here because, based on my understanding, the brain determines the best possible outcomes given a set of beliefs (aka experiences), and, based on some boolean logic around sense of self, others, and reality, the resulting actions are derived from quantum wave function collapse given the belief set, current stimulus, and possible actions. I'm not trying to prove here why I believe they are quantum, except to say that to think otherwise is to say quantum effects are not part of nature and not part of evolution. And that seems to be what would need to be proven, given how efficient evolution is and how electrical and non-centered our brains are. So determining how many transistors would be needed and how much computational depth is needed is sort of moot if we are going to assume a Newtonian brain, since I think we're solving for the wrong problem. AI will also beat us with normal cognition, but emotion and valence only come with the experiences we have and the beliefs we have about them, and that will be the problem with AI until we approach the problem correctly.

comment by green_leaf · 2023-05-01T23:58:43.410Z · LW(p) · GW(p)

I hope that my two comments[1] [LW(p) · GW(p)][2] [LW(p) · GW(p)] helped you save $250.