Have you heard about MIT's "liquid neural networks"? What do you think about them?

post by Ppau · 2023-05-09T20:16:27.280Z · LW · GW · 3 comments

This is a question post.

I came across this video by MIT CSAIL.

Here is the article they are talking about: https://www.science.org/doi/10.1126/scirobotics.adc8892

This team claims to have achieved driving tasks that previously required 10,000 neurons while using only 19, by means of "liquid neural networks" inspired by worm neurology.

They say this innovation brings massive improvements in performance, especially in embedded systems, but also in interpretability, since the reduced number of neurons makes the system much more human-readable. In particular, the system's attention would be much easier to track; this would open the door to safety certifications for high-stakes applications.

Having tested driving and flying tasks in different conditions and environments, they also claim that their system is vastly better at out-of-distribution zero-shot tasks.

So basically, they believe they have made very substantial steps in pretty much every dimension that matters, both for performance and for safety.

As far as I can tell these are very serious researchers, but doesn't that sound a bit too good to be true? I have no expertise in machine learning and I haven't seen any third-party opinions on this yet, so I'm having a hard time making up my mind.

I'd be curious to hear your takes!

Answers

answer by Dave Orr · 2023-05-09T23:10:39.181Z · LW(p) · GW(p)

I think this is real, in the sense that they got the results they are reporting and this is a meaningful advance. Too early to say if this will scale to real-world problems, but it seems super promising, and I would hope and expect that Waymo and its competitors are seriously investigating this, or will be soon.

Having said that, it's totally unclear how you might apply this to LLMs, the AI du jour. One of the main innovations in liquid networks is that they are continuous rather than discrete, which is good for very high-bandwidth tasks like vision. Our eyes are technically discrete in that retinal cells fire discretely, but I think the best interpretation of them at scale is much more like a continuous system. Similarly for hearing, where the AI analog is speech recognition.
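
For intuition about what "continuous" means here, below is a minimal sketch (my own illustration, not the authors' code; all parameter names are made up) of a liquid time-constant style neuron: the hidden state follows an ODE whose effective time constant depends on the input, and a computer approximates it with small discrete steps.

```python
import numpy as np

def ltc_step(x, u, dt, tau, W_in, W_rec, b, A):
    """One explicit-Euler step of a liquid time-constant style cell (rough sketch).

    x: hidden state (n,), u: input (m,), dt: integration step size.
    The gate f depends on both the input and the state, so the effective
    time constant of each neuron changes with the incoming signal --
    the "liquid" part.
    """
    f = np.tanh(W_rec @ x + W_in @ u + b)   # input/state-dependent gate
    dxdt = -(1.0 / tau + f) * x + f * A     # ODE right-hand side
    return x + dt * dxdt                    # discrete approximation of the flow

# Toy usage: 19 hidden units driven by a 32-dimensional feature stream.
rng = np.random.default_rng(0)
n, m = 19, 32
x = np.zeros(n)
params = dict(
    tau=np.ones(n),                   # base time constants
    W_in=rng.normal(0, 0.1, (n, m)),
    W_rec=rng.normal(0, 0.1, (n, n)),
    b=np.zeros(n),
    A=np.ones(n),                     # resting/target levels
)
for _ in range(100):
    u = rng.normal(size=m)            # stand-in for perception features
    x = ltc_step(x, u, dt=0.05, **params)
```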

But language is not really like that. Words are mostly discrete -- you typically want to process text at the token level (~= words), or sometimes wordpieces or even letters, but it's not that sensible to think of text as continuous. So it's not obvious how to apply liquid NNs to text understanding/generation.

Research opportunity!

But it'll be a while, if ever, before continuous networks work for language.

comment by Ppau · 2023-05-10T15:24:11.066Z · LW(p) · GW(p)

Thanks for your answer! Very interesting.

I didn't know about the continuous nature of LNNs; I would have thought that you needed different hardware (maybe an analog computer?) to handle continuous values.

Maybe it could work for generative networks for images or music, which seem less discrete than written language.

Replies from: dave-orr
comment by Dave Orr (dave-orr) · 2023-05-10T20:49:13.125Z · LW(p) · GW(p)

I mean, computers aren't technically continuous and neither are neural networks, but if your time step is small enough they are continuous-ish. It's interesting that that's enough.
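
As a tiny toy illustration of that (my own example, unrelated to the paper): a discrete solver tracks a continuous ODE as closely as you like once the step size is small enough.

```python
import numpy as np

# Solve dx/dt = -x from x(0) = 1 with explicit Euler steps and compare
# against the exact solution x(1) = exp(-1).
def euler(dt, t_end=1.0):
    x = 1.0
    for _ in range(int(t_end / dt)):
        x += dt * (-x)
    return x

exact = np.exp(-1.0)
print(abs(euler(0.1) - exact))    # ~0.02 error with a coarse step
print(abs(euler(0.001) - exact))  # ~2e-4 error with a fine step
```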

I agree music would be a good application for this approach.

comment by awg · 2023-05-10T00:04:17.655Z · LW(p) · GW(p)

Then again... the output of an LLM is a stream of tokens (yeah?). I wonder what applications LTCs could have as a post-processor for LLM output? No idea what I'm really talking about though.

Replies from: mishka
comment by mishka · 2023-05-10T12:12:50.179Z · LW(p) · GW(p)

Not quite. The actual output is the map from tokens to probabilities, and only then does one sample a token from that distribution.

So, LLMs are more continuous in this sense than is apparent at first, but time is discrete in LLMs (a discrete step produces the next map from tokens to probabilities, and then samples from that).
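
A minimal sketch of that output step (illustrative only; no particular model's API is assumed):

```python
import numpy as np

def sample_next_token(logits, temperature=1.0, rng=None):
    """Turn raw scores over the vocabulary into a probability map,
    then sample a single token id from it (the discrete step)."""
    rng = rng or np.random.default_rng()
    z = logits / temperature
    z = z - z.max()                         # numerical stability
    probs = np.exp(z) / np.exp(z).sum()     # the continuous object: a distribution
    return rng.choice(len(probs), p=probs)

logits = np.array([2.0, 0.5, -1.0, 0.1])    # toy vocabulary of 4 tokens
next_id = sample_next_token(logits)
```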

Of course, when one thinks about spoken language, time is continuous for audio, so there is still some temptation to use continuous models in connection with language :-) who knows... :-)

Replies from: awg
comment by awg · 2023-05-10T15:20:19.108Z · LW(p) · GW(p)

Ah aha! Thank you for that clarification!

answer by the gears to ascension · 2023-05-10T15:38:16.027Z · LW(p) · GW(p)

This is pure capabilities, and yes, it's a big deal.

comment by Christopher King (christopher-king) · 2023-05-10T18:25:40.944Z · LW(p) · GW(p)

If it works out-of-distribution, that's a huge deal for alignment! Especially if alignment generalizes farther than capabilities. Then you can just throw something like imitative amplification at it and it is probably aligned (assuming that "does well out-of-distribution" implies that the mesa-optimizers are tamed).

Replies from: red75prime, lahwran
comment by red75prime · 2023-05-10T18:43:14.462Z · LW(p) · GW(p)

I have low confidence in this, but I would guess that it (OOD generalization by "liquid" networks) works well in differentiable continuous domains (like low-level motion planning) by exploiting the natural smoothness of the system. So I wouldn't get my hopes up about its universal applicability.

comment by the gears to ascension (lahwran) · 2023-05-11T05:56:26.282Z · LW(p) · GW(p)

it's built out of an optimizer, why would that tame inner optimizers? perhaps it makes them explicit, because now the whole thing is a loss function, but the iterative inference can't be shut off and still get functional behavior

Replies from: christopher-king
comment by Christopher King (christopher-king) · 2023-05-11T12:58:33.385Z · LW(p) · GW(p)

That's just part of the definition of "works out of distribution". Scenarios where inner optimizers become AGI or something are out of distribution from training.

3 comments

comment by Stephen Fowler (LosPolloFowler) · 2023-05-11T06:13:11.989Z · LW(p) · GW(p)

I have to dispute the idea that "fewer neurons" = "more human-readable". If the fewer neurons are performing a more complex task, it won't necessarily be easier to interpret.

Replies from: shayne-o-neill
comment by Shayne O'Neill (shayne-o-neill) · 2023-07-13T12:31:09.227Z · LW(p) · GW(p)

Definitely. The lower the neuron-to-'concepts' ratio is, the more superposition is required to represent everything. That said, with the continuous-function nature of LNNs, these seem like the wrong abstraction for language. Image models? Maybe. Audio models? Definitely. Tokens and/or semantic data? That doesn't seem practical.
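
A toy illustration of that ratio point (numbers made up): if you pack more feature directions than you have neurons, reading any one feature back picks up interference from the others, which is what makes small, heavily superposed layers hard to interpret.

```python
import numpy as np

rng = np.random.default_rng(0)
n_neurons, n_features = 4, 10            # more concepts than neurons

# Give each feature a random unit direction in the 4-dim activation space.
dirs = rng.normal(size=(n_features, n_neurons))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)

activation = dirs[0]                     # the layer "represents" feature 0
readout = dirs @ activation              # project onto every feature direction
print(readout.round(2))                  # feature 0 reads ~1.0; the rest are
                                         # nonzero interference, not cleanly 0
```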

comment by Charlie Steiner · 2023-05-13T04:32:19.725Z · LW(p) · GW(p)

I just skimmed the video, but it seems like there's more salesmanship than there is explanation of what the network is doing, how its capabilities would compare to using e.g. a small RNN, and how far it actually generalizes.

Remember that self-driving cars first appeared in the 1980s - lane-keeping is actually a very simple task if you only need 99% reliability. I don't think their demos are super informative about the utility of this architecture for complicated tasks.

So I'd be interested if you looked into it more and think that my first impression is unfair.