Cephalopods are highly intelligent but also have a very short lifecycle and a very large number of offspring - making them a prime target for an artificial breeding program for intelligence uplifting.
E.g. the lifecycle of many octopi is about 1-2 years. A 15-year program could potentially breed for very significant behavioural changes.
For instance, dog and horse breeds can be made within 10 generations.
Google says the female giant Pacific octopus lays between 120k and 400k eggs at the end of her life. Giant Pacific octopi live about 3-5 years.
I asked Claude to make some back-of-the-envelope calculations:
>>>I'll use the breeder's equation to estimate the potential response to selection for intelligence in octopuses. The breeder's equation states:
R = h² × S
where:
- R is the response to selection (change in trait mean per generation)
- h² is the narrow-sense heritability of the trait
- S is the selection differential (difference between selected parents and population mean)
Let's make some reasonable assumptions for these parameters in octopuses:
- Heritability (h²): For cognitive traits in animals, heritability typically ranges from 0.2 to 0.6. Let's use h² = 0.4 as a moderate estimate for octopus intelligence.
- Selection differential (S): With 120,000-400,000 eggs per female, we could be extremely selective. If we select the top 0.1% of individuals (still giving us 120-400 individuals), we could achieve a selection differential of about 3 standard deviations.
Let's calculate the response to selection per generation:
R = 0.4 × 3 = 1.2 standard deviations per generation
For a breeding program over 15 years with Giant Pacific Octopuses (3-5 year lifecycle):
- Using a 4-year lifecycle: 15 ÷ 4 = 3.75 generations (let's round to 4 generations)
- Total expected gain: 4 × 1.2 = 4.8 standard deviations
For octopuses with a shorter 1-2 year lifecycle:
- Using a 1.5-year lifecycle: 15 ÷ 1.5 = 10 generations
- Total expected gain: 10 × 1.2 = 12 standard deviations
This represents a substantial shift in the distribution of intelligence. For context, the difference in IQ between average humans and those considered profoundly gifted is about 4 standard deviations. A shift of 12 standard deviations would be extraordinary.
However, several factors would likely limit these theoretical gains:
- Selection plateaus as genetic variation is depleted
- Pleiotropy (genes affecting multiple traits) may create unfavorable trade-offs
- The trait may reach biological/physiological limits
- Inbreeding depression could become an issue with intense selection
Even with these limitations, the potential for significant intelligence enhancement in octopuses through selective breeding appears substantial, especially for species with shorter lifecycles.
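The arithmetic in the quoted estimate can be reproduced with a short sketch. The function names are hypothetical, and `selection_intensity` assumes the trait is normally distributed and measured without error, which the quoted text does not claim. Under that assumption the mean of the top 0.1% sits about 3.37 SD above the population mean, so the quoted "about 3 standard deviations" is slightly conservative. Without rounding 3.75 generations up to 4, the 4-year lifecycle case gives 4.5 SD rather than 4.8 SD.

```python
from statistics import NormalDist


def selection_intensity(top_fraction):
    """Mean of the top `top_fraction` of a standard normal, in SD units.

    Assumes a normally distributed trait and perfectly accurate selection.
    """
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1.0 - top_fraction)
    return std_normal.pdf(threshold) / top_fraction


def response_per_generation(h2, s):
    """Breeder's equation: R = h^2 * S."""
    return h2 * s


def total_gain(h2, s, years, generation_time):
    """Naive cumulative gain: no plateaus, no inbreeding depression."""
    generations = years / generation_time
    return generations * response_per_generation(h2, s)


intensity = selection_intensity(0.001)     # top 0.1% of individuals
gain_short = total_gain(0.4, 3.0, 15, 1.5) # 1.5-year lifecycle
gain_long = total_gain(0.4, 3.0, 15, 4.0)  # 4-year lifecycle, unrounded

print(f"selection intensity for top 0.1%: {intensity:.2f} SD")
print(f"15-year gain, 1.5-year lifecycle: {gain_short:.1f} SD")
print(f"15-year gain, 4-year lifecycle:   {gain_long:.1f} SD")
```

The linear extrapolation is the weakest part of the estimate: it ignores every limiting factor listed above, so the outputs are better read as upper bounds.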
Alexander reacting to Claude's back-of-the-envelope calculation:
Narrow-sense heritability is probably higher for intelligence. Indeed, at this level of selection one would have to worry about inbreeding depression, selection plateaus, etc.
My best guess is that 6 SD would be possible IF one were actually able to accurately select the top 0.1%.
This is probably quite hard. GWAS data for humans currently does not allow for this kind of precision. Doing the selection accurately, i.e. estimating the gradient, is the main rate-limiting step [as it is in deep learning!]. One would need to construct psychometrically valid tests for cephalopods and run them at scale.
How much is 6 SD? Octopi might be about as intelligent as a dog (uncertain about this). 6 SD would be quite insane, and would naively plausibly push them to the upper end of dolphin/chimpanzee intelligence. The main limiter in my mind is that octopi are not a natively social species and do not have a long enough lifecycle to do significant learning, so this might not actually lead to intelligence uplifting. Additionally, there is the issue that octopi don't have vocal cords, so they would need to communicate differently.
Ofc, like the famous Soviet silver fox breeding program, one could separately select for sociability [which might be more important for effective intelligence. Iirc wolves usually outperform dogs cognitively, yet some dog breeds are generally considered more intelligent in a relevant sense].
I wouldn't claim to be an expert on the UK system but from talking with colleagues at UCL it seems to be the case that French positions are more secure and given out earlier [and this was possibly a bigger difference in the past]. I am not entirely sure about the number 32. Anecdotally, I would say many of the best people I know did not obtain tenure this early. This is something that may also vary by field - some fields are more popular, better funded because of [perceived] practical applications.
Mathematics is very different from other fields. For instance: it is more long-tailed, benefits from 'deep research, deep ideas' far more than other fields, is difficult to parallelize, has ultimate ground truth [proofs], and in a large fraction of subfields [e.g. algebraic geometry, homotopy theory ...] the amount of prerequisite knowledge is very large,[1] has many specialized subdisciplines, there are no empirical
All these factors suggest that the main relevant factor of production is how many positions exist that allow intellectual freedom and are secure at a young age, plus whether they are occupied by talented people.
[1] e.g. it often surprises outsiders that in certain subdisciplines of mathematics even very good PhD students will often struggle reading papers at the research frontier - even after four years of specialized study.
Yes I use LLMs in my writing [not this comment] and I strongly encourage others to do so too.
This is the age of Cyborgism. Jumping on making use of the new capabilities opening up will likely be key to getting alignment right. AI is coming, whether you like it or not.
There is also a mundane reason: I have an order of magnitude more ideas than I can write down. Using LLMs allows me to write an essay in 30 min which otherwise would take half a day.
Sure happy to disagree on this one.
Fwiw, the French dominance isn't confined to Bourbakist topics. E.g. Pierre-Louis Lions won one of the French Fields Medals and is the world's most cited mathematician, with a speciality in PDEs. Some of his work investigates the notion of general nonsmooth ("viscosity") solutions for the general Hamilton-Jacobi(-Bellman) equation both numerically and analytically. It's based on a vast generalization of the subgradient calculus ("nonsmooth" calculus), and is very directly related to good numerical approximation schemes.
Certainly for many/most other subjects the French system is not so good. E.g. for ML all that theory is mostly a waste.
Those are some good points certainly.
The UK/US system typically gives tenure around ~40, typically after ~two postdocs and an assistant -> associate -> full prof progression.
In the French system a typical case might land an effectively tenured job at 30. Since 30-40 is a decade of peak creativity for scientists in general, and mathematicians in particular, I would say this is highly
Laurent Lafforgue is a good example. Iirc he published almost nothing for seven years after his PhD until the work that he did for the Fields Medal. He wouldn't have gotten a job in the American system.
He is an extreme example but generically having many more effectively tenured positions at a younger age means that mathematicians feel the freedom to doggedly pursue important, but perhaps obscure-at-present, research bets.
My point is primarily that the selection is at 20, instead of at 18. It's not about training per se, although here too the French system has an advantage. Paris has ~14 universities, a number of grandes écoles, research labs, etc., a large fraction of which do serious research mathematics. Paris consequently has the largest and most diverse assortment of advanced coursework in the world. I don't believe there is any place in the US that compares [I've researched this in detail in the past].
Why Do the French Dominate Mathematics?
France has an outsized influence in the world of mathematics despite having significantly fewer resources than countries like the United States. With approximately 1/6th of the US population and 1/10th of its GDP, and French being less widely spoken than English, France's mathematical achievements are remarkable.
This dominance might surprise those outside the field. Looking at prestigious recognitions, France has won 13 Fields Medals compared to the United States' 15, a nearly equal achievement despite the vast difference in population and resources. Other European nations lag significantly behind, with the UK having 8, Russia/Soviet Union 6/9, and Germany 2.
France's mathematicians are similarly overrepresented in other mathematics prizes and honors, confirming this is not merely a statistical anomaly.
I believe two key factors explain France's exceptional performance in mathematics while remaining relatively average in other scientific disciplines:
1. The "Classes Préparatoires" and "Grandes Écoles" System
The French educational system differs significantly from others through its unique "classes préparatoires" (preparatory classes) and "grandes écoles" (elite higher education institutions).
After completing high school, talented students enter these intensive two-year preparatory programs before applying to the grandes écoles. Selection is rigorously meritocratic, based on performance in centralized competitive examinations (concours). This system effectively postpones specialization until age 20 rather than 18, allowing for deeper mathematical development during a critical cognitive period.
The École Normale Supérieure (ENS) stands out as the most prestigious institution for mathematics in France. An overwhelming majority of France's top mathematicians—including most Fields Medalists—are alumni of the ENS. The school provides an ideal environment for mathematical talent to flourish with small class sizes, close mentorship from leading mathematicians, and a culture that prizes abstract thinking.
This contrasts with other countries' approaches:
- Germany traditionally lacked elite-level mathematical training institutions (though the University of Bonn has recently emerged as a center of excellence)
- The United States focuses on mathematics competitions for students under 18, but these competitions often emphasize problem-solving skills that differ significantly from those required in mathematical research
The intellectual maturation between ages 18 and 20 is profound, and the French system capitalizes on this critical developmental window.
2. Career Stability Through France's Academic System
France offers significantly more stable academic positions than many other countries. Teaching positions throughout the French system, while modestly compensated, effectively provide tenure and job security.
This stability creates an environment where mathematicians can focus on deep, long-term research without the publish-or-perish pressure common in other academic systems. In mathematics particularly, where breakthroughs often require years of concentrated thought on difficult problems, this freedom to think without immediate productivity demands is invaluable.
While this approach might be less effective in experimental sciences requiring substantial resources and team management, for mathematics—where the primary resource is time for thought—it has proven remarkably successful.
ADHD is about Voluntary vs Involuntary Actions
The way I conceptualize ADHD is as a constraint on the quantity and magnitude of voluntary actions I can undertake. When others discuss actions and planning, their perspective often feels foreign to me—they frame it as a straightforward conscious choice to pursue or abandon plans. For me, however, initiating action (especially longer-term, less immediately rewarding tasks) is better understood as "submitting a proposal to a capricious djinn who may or may not fulfill the request." The more delayed the gratification and the longer the timeline, the less likely the action will materialize.
After three decades inhabiting my own mind, I've found that effective decision-making has less to do with consciously choosing the optimal course and more with leveraging my inherent strengths (those behaviors I naturally gravitate toward, largely outside my conscious control) while avoiding commitments that highlight my limitations (those things I genuinely intend to do and "commit" to, but realistically never accomplish).
ADHD exists on a spectrum rather than as a binary condition. I believe it serves an adaptive purpose—by restricting the number of actions under conscious voluntary control, those with ADHD may naturally resist social demands on their time and energy, and generally favor exploration over exploitation.
Society exerts considerable pressure against exploratory behavior. Most conventional advice and social expectations effectively truncate the potential for high-variance exploration strategies. While one approach to valuable exploration involves deliberately challenging conventions, another method simply involves burning bridges to more traditional paths of success.
I use LLMs throughout my personal and professional life. The productivity gains are immense. Yes, hallucination is a problem, but it's just like spam/ads/misinformation on Wikipedia/the internet - a small drawback that doesn't obviate the ginormous potential of the internet/LLMs.
I am 95% certain you are leaving value on the table.
I do agree straight LLMs are not generally intelligent (in the sense of universal intelligence/AIXI) and therefore not completely comparable to humans.
This was basically my model since i first started paying attention to modern AI
Curious why did you think differently before ? :)
yeahh i'm afraid I have too many other obligations right now to give an elaboration that does it justice.
otoh i'm in the Bay and we should definitely catch up sometime!
Yes sorry Eli, I meant to write out a more fully fleshed out response but unfortunately it got stuck in drafts.
The tl;dr is that I feel this perspective is singling out Sam Altman as some uniquely machiavellian actor in a way I find naive /misleading and ultimately maybe unhelpful.
I think in general I'm skeptical of the intense focus on individuals & individual tech companies that LW/EA has developed recently. Frankly, it feels more rooted in savannah-brained tribalism & human interest than an even-keeled analysis of what factors are actually important, neglected and tractable.
The idea I associate with scalable oversight is weaker models overseeing stronger models (probably) combined with safety-by-debate. Is that the same or different from "recursive techniques for reward generation"?
Currently, this general class of ideas seems to me the most promising avenue for achieving alignment for vastly superhuman AI ('superintelligence').
I want to be able to describe agents that do not have (vNM, geometric, other) rational preferences because of incompleteness or inconsistency but self-modify to become so.
Eg. In vNM utility theory there is a fairly natural weakening one can do which is ask for a vNM-style representation theorem after dropping transitivity.
[ Incidentally, there is some interesting math here having to do with conservative vs nonconservative vector fields and potential theory all the way to Hodge theory. ]
does JB support this ?
I'm confused since in vNM we start with a preference order over probability distributions. But in JB it's over propositions?
Is there a benchmark in which SAEs clearly, definitely outperform standard techniques?
this seems concerning. Can somebody ELI5 what's going on here?
Inconsistent and incomplete preferences are necessary for descriptive agent foundations.
In vNM preference theory an inconsistent preference can be described as cyclic preferences that can be moneypumped.
How to see this in JB ?
Is Tesla currently overvalued ?
P/e ratio is 188. The ceo has made himself deeply unpopular with many potential customers. Latest sales figures don't look good. Chinese competitors sell more total cars and seem to have caught up in terms of tech.
Happy to see this.
I have some very basic questions:
How can I see inconsistent preferences within the Jeffrey Bolker framework? What about incomplete preferences ?
Is there any relation you can imagine with imprecise probability / infraprobability, i.e. Knightian uncertainty?
I'm actually curious about a related problem.
One of the big surprises of the deep learning revolution has been the universality of gradient descent optimization.
How large is the class of optimization problems that we can transform into a gradient descent problem of some kind? My suspicion is that it's a very large class; perhaps there is even a general way to transform any problem into a gradient descent optimization problem?
The natural thing that comes to mind is to consider gradient descent of Lagrangian energy functionals in (optimal) control theory.
Can somebody ELI5 how much I should update on the recent SAE = dead salmon news?
On priors I would expect the SAE bear news to be overblown. 50% of mechinterp is SAEs - a priori, it seems unlikely to me that so many talented people went astray. But I'm an outsider and curious about alternate views.
God is live and we have birthed him.
It's still wild to me that highly cited papers in this space can make such elementary errors.
Thank you for writing this post Dmitry. I've only skimmed the post but clearly it merits a deeper dive.
I will now describe a powerful, central circle of ideas I've been obsessed with this past year that I suspect is very close to the way you are thinking.
Free energy functionals
There is a very powerful, very central idea whose simplicity is somehow lost in physics obscurantism, which I will call, for lack of a better word, 'tempered free energy functionals'.
Let us be given a loss function $L$ [physicists will prefer to think of this as an energy function/Hamiltonian]. The idea is that one considers a functional $F_{L}(\beta): \Delta(\Omega) \to \mathbb{R}$ taking a distribution $p$ and sending it to $L(p) + \beta H(p)$, where $\beta\in \mathbb{R}$ is the inherent coolness or inverse temperature.
We are now interested in minimizers of this functional. The functional will typically be convex (e.g. if $L(p)=KL(q||p)$, the KL-divergence, or $L(p)= N L_N(p)$, the empirical loss at $N$ data points) so it has a minimum. This is the tempered Bayesian posterior/Boltzmann distribution at inverse temperature $\beta$.
I find the physics terminology inherently confusing. So instead of the mysterious word temperature, just think of $\beta$ as a variable that controls the tradeoff between loss and inherent simplicity bias/noise. In other words, $\beta$ controls the inherent noise.
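A minimal numerical sketch of the minimizer claim on a finite set $\Omega$. It assumes one common sign convention, $F(p) = \mathbb{E}_p[L] - H(p)/\beta$ with $H$ the Shannon entropy; reading the functional above this way is an assumption on my part, and the loss values are made up for illustration. Under that convention the minimizer is the Boltzmann distribution $p^*(x) \propto e^{-\beta L(x)}$, with minimum value $-\log Z / \beta$.

```python
import math
import random


def gibbs(losses, beta):
    """Boltzmann distribution p(x) proportional to exp(-beta * L(x))."""
    weights = [math.exp(-beta * loss) for loss in losses]
    z = sum(weights)
    return [w / z for w in weights]


def entropy(p):
    """Shannon entropy in nats."""
    return -sum(q * math.log(q) for q in p if q > 0)


def free_energy(p, losses, beta):
    """Assumed convention: expected loss minus entropy / beta."""
    expected_loss = sum(q * loss for q, loss in zip(p, losses))
    return expected_loss - entropy(p) / beta


def random_distribution(n, rng):
    """A random point of the probability simplex (for comparison only)."""
    raw = [rng.random() + 1e-12 for _ in range(n)]
    total = sum(raw)
    return [r / total for r in raw]


losses = [0.0, 0.5, 1.0, 3.0]  # hypothetical loss values on a 4-point space
beta = 2.0

p_star = gibbs(losses, beta)
f_star = free_energy(p_star, losses, beta)
log_z = math.log(sum(math.exp(-beta * loss) for loss in losses))

rng = random.Random(0)
competitors = [random_distribution(len(losses), rng) for _ in range(1000)]
best_competitor = min(free_energy(q, losses, beta) for q in competitors)

print(f"F(p*) = {f_star:.6f}, -log Z / beta = {-log_z / beta:.6f}")
print(f"best random competitor: {best_competitor:.6f}")
```

Raising `beta` concentrates `p_star` on the low-loss points, and lowering it pushes `p_star` toward uniform, which is the loss versus noise tradeoff described above.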
SLT of course describes the free energy functional when evaluated at this minimizer as a function of $N$ through the Watanabe free energy functional.
Another piece of the story is that the [continuum limit of] stochastic gradient Langevin descent at a given noise level is equivalently gradient descent along the free energy functional [at the given noise level, in the Wasserstein metric].
Rate-distortion theory
Instead of a free energy functional we can better think of it as a complexity-accuracy functional.
This is the basics of rate-distortion theory. I note that there is a very important but little known purely algorithmic version of this theory. See here for an expansive breakdown on more of these ideas.
Working in this generality it can be shown that every phase transition diagram is possible. There are also connections with Natural Abstractions/ sufficient statistics and time complexity.
Like David Holmes I am not an expert in tropical geometry so I can't give the best case for why tropical geometry may be useful. Only a real expert putting in serious effort can make that case.
Let me nevertheless respond to some of your claims.
- PL functions are quite natural for many reasons. They are simple. They naturally appear as minimizers of various optimization procedures, see e.g. the discussion in section 5 here.
- Polynomials don't satisfy the padding argument and architectures based on them therefore will typically fail to have the correct simplicity bias.
As for
1. "Algebraic geometry isn't good at dealing with deep composition of functions, and especially approximate composition." I agree a typical course in algebraic geometry will not much consider composition of functions but that doesn't seem to me a strong argument for the contention that the tools of algebraic geometry are not relevant here. Certainly, more sophisticated methods beyond classical scheme theory may be important [likely involving something like PROPs] but ultimately I'm not aware of any fundamental obstruction here.
2. >>
I don't agree with the contention that algebraic geometry is somehow not suited for questions of approximation. E.g. the Weil conjectures are really an approximate/average statement about points of curves over finite fields. The same objection you make could have been made about singularity theory before we knew about SLT.
I agree with you that a probabilistic perspective on ReLUs/ piece-wise linear functions is probably important. It doesn't seem unreasonable to me in the slightest to consider some sort of tempered posterior on the space of piecewise linear functions. I don't think this invalidates the potential of polytope-flavored thinking.
>> Tropical geometry is an interesting, mysterious and reasonable field in mathematics, used for systematically analyzing the asymptotic and "boundary" geometry of polynomial functions and solution sets in high-dimensional spaces, and related combinatorics (it's actually closely related to my graduate work and some logarithmic algebraic geometry work I did afterwards). It sometimes extends to other interesting asymptotic behaviors (like trees of genetic relatedness). The idea of applying this to partially linear functions appearing in ML is about as silly as trying to see DNA patterns in the arrangement of stars -- it's a total type mismatch.
Shots fired! :D Afaik I'm the only tropical geometry stan in alignment so let me reply to this spicy takedown here.
It's quite plausible to me that thinking in terms of polytopes and convexity is a reasonable and potentially powerful lens on understanding neural networks. Despite the hyperconfident and strong language in this post it seems you agree.
Is it then unreasonable to think that tropical geometry may be relevant too? I don't think so.
Perhaps your contention is that tropical geometry is more than just thinking in terms of polytopes but specifically the algebraic geometric flavored techniques. Perhaps. I don't feel strongly about that. If it's matroids that are most relevant, rather than toric varieties and tropicalized Grassmannians, then so be it.
The basic tropical perspective on deep learning begins by observing ReLU neural networks as 'tropical rational functions', i.e. decomposing the underlying map $f$ of your ReLU neural network as a difference of convex piecewise-linear functions $f=g-h$. This decomposition isn't unique, but possibly still quite useful.
As is mentioned in the text, convex piecewise-linear functions are much easier to analyze than general piecewise-linear functions so this decomposition may prove advantageous.
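For the simplest case the decomposition $f=g-h$ can be written down directly. The sketch below is illustrative only, with hypothetical names and random weights: it takes a one-hidden-layer ReLU network with scalar input, $f(x)=\sum_i a_i \,\mathrm{ReLU}(w_i x + b_i)$, and splits the units by the sign of the output weight $a_i$. Each part is a non-negative combination of convex functions and hence convex. Deeper networks need a recursive version of the same idea, which is not shown here.

```python
import random


def relu(x):
    return max(0.0, x)


def network(x, params):
    """One-hidden-layer ReLU network: sum of a * relu(w * x + b)."""
    return sum(a * relu(w * x + b) for a, w, b in params)


def convex_parts(params):
    """Split units by the sign of the output weight a.

    Returns parameter lists for g and h, both with non-negative output
    weights, such that network = g - h.
    """
    positive = [(a, w, b) for a, w, b in params if a > 0]
    negative = [(-a, w, b) for a, w, b in params if a < 0]
    return positive, negative


rng = random.Random(0)
params = [
    (rng.uniform(-1, 1), rng.uniform(-1, 1), rng.uniform(-1, 1))
    for _ in range(8)
]
g_params, h_params = convex_parts(params)


def g(x):
    return network(x, g_params)


def h(x):
    return network(x, h_params)


xs = [i / 10 for i in range(-50, 51)]
max_error = max(abs(network(x, params) - (g(x) - h(x))) for x in xs)
print(f"max |f - (g - h)| on the grid: {max_error:.2e}")
```

The non-uniqueness mentioned above is visible here too: adding the same convex function to both `g` and `h` gives another valid decomposition.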
Another direction that may be of interest in this context is the nonsmooth calculus and especially its extension the quasi-differential calculus.
"as silly as trying to see DNA patterns in the arrangement of stars -- it's a total type mismatch"
This statement feels deeply overconfident to me. Whether or not tropical geometry may be relevant to understanding real neural networks can only really be resolved by having a true domain expert 'commit to the bit' and research this deeply.
This kind of idle speculation seems not so useful to me.
You are probably aware of this but there is indeed a mathematical theory of degeneracy/ multiplicity in which multiplicity/degeneracy in the parameter-function map of neural networks is key to their simplicity bias. This is singular learning theory.
The connection between degeneracy [SLT] and simplicity [algorithmic information theory] is surprisingly, delightfully simple. It's given by the padding/deadcode argument.
Beautifully argued, Dmitry. Couldn't agree more.
I would also note that I consider the second problem of interpretability basically the central problem of complex systems theory.
I consider the first problem a special case of the central problem of alignment. It's very closely related to the 'no free lunch' problem.
Well 2-3 shitposters and one gwern.
Who would be so foolish to short gwern? Gwern the farsighted, gwern the prophet, gwern for whom entropy is nought, gwern augurious augustus
Thanks for the sleuthing.
The thing is - last time I heard about OpenAI rumors it was Strawberry.
The unfortunate fact of life is that too many times OpenAI shipping has surpassed all but the wildest speculations.
Yes, this should be an option in the form.
Does clicking on HERE work for you?
Fair enough.
Thanks for reminding me about V-information. I am not sure how much I like this particular definition yet - but this direction of inquiry seems very important imho.
Those people will probably not see this so won't reply.
What I can tell you is that in the last three months I went through a phase transition in my AI use and I regret not doing this ~1 year earlier.
It's not that I didn't use AI daily before for mundane tasks or writing emails, it's not that I didn't try a couple times to get it to solve my thesis problem (it doesn't get it) - it's that I failed to reframe my thinking from asking "can AI do X?" to "how can I reengineer and refactor my own workflow, even the questions I am working on, so as to maximally leverage AI?"
See also geometric rationality.
Hope this will be answered in a later post, but why should I care about the permanent for alignment ?
skills issue.
prep for the model that is coming tomorrow not the model of today
Mmm. You are entering the Cyborg Era. The only ideas you may take to the next epoch are those that can be uploaded to the machine intelligence.
Are there any plans to have written materials in parallel?
meta note that I would currently recommend against spending much time with Watanabe's original texts for most people interested in SLT. Good to be aware of the overall outlines but much of what most people would want to know is better explained elsewhere [e.g. I would recommend first reading most posts with the SLT tag on LessWrong before doing a deep dive in Watanabe]
meta note *
if you do insist on reading Watanabe, I highly recommend you make use of AI assistance. I.e. download a pdf, cut it down into chapters and upload to your favorite LLM.
John, you know coding theory much better than I do so I am inclined to defer to your superior knowledge.
Now behold the awesome power of gpt-Pro
Let’s unpack the question in pieces:
1. Is ZIP (a.k.a. DEFLATE) “locally decodable” or not?
- Standard ZIP files are typically not “locally decodable” in the strictest sense—i.e., you cannot start decoding exactly at the byte corresponding to your region of interest and reconstruct just that portion without doing some earlier decoding.
The underlying method, DEFLATE, is indeed based on LZ77 plus Huffman coding. LZ77 uses back‐references to prior segments of the uncompressed data, which forces you to at least partially decode from an earlier point to resolve those references.
However, many ZIP files are stored in blocks or chunks—sometimes called “deflate blocks.” So if you only need data near the start of a block (rather than in the middle of one), you can skip previous blocks and just decode that chunk. Still, you cannot trivially jump to an arbitrary byte offset inside the block without re-initializing the LZ77 state from somewhere earlier.
2. If ZIP is “based on Lempel–Ziv,” can we easily modify it for local decodability?
- Because Lempel–Ziv (LZ) methods revolve around referencing earlier uncompressed substrings, one must store enough information (or replicate the references with separate “partial dictionaries”) if one wants truly local decodability.
- One can certainly modify an LZ-based compression scheme so that each block is smaller, or that references never cross block boundaries, etc. That would allow a more local decoding property: you only decode from the start of the local block rather than from the entire file.
- This usually comes with tradeoffs. If your blocks are small, you lose compression efficiency because you can’t take advantage of references bridging across large swaths of data. So making a compression format “locally decodable” in a fine‐grained way often lowers the overall compression ratio.
3. Is Lempel–Ziv (and thus ZIP) lossless?
- Yes. Standard LZ77 (or LZ78, LZMA, etc.) is a lossless compression scheme: you can reconstruct the exact original data bit for bit. The DEFLATE algorithm used in ZIP is also lossless.
4. Does the picture change when doing lossy compression?
- Yes, quite a bit. In lossy compression (like JPEG for images or MP3/AAC for audio), parts of the original data are thrown away. Often you can still decode small “chunks” of the file—e.g., many lossy audio or video formats do decode in frames or segments. But the format internals (e.g., DCT coefficients in JPEG, wavelet transforms, etc.) differ a lot from LZ-based bit-by-bit exactness.
- In lossy formats that use transform blocks (like 8×8 pixel blocks in JPEG), you can decode each block locally, but you still have boundary effects, optional overlap between blocks, etc. So “local decodability” there can be simpler in some ways (each block is more independent), but it’s also less “exact” because we accept errors from the original data.
Short Summary
- ZIP/DEFLATE is not strictly locally decodable because LZ77 references require that you decode from (at least) the beginning of the current block.
- You can modify an LZ-based scheme (for instance, by using smaller blocks or restricting cross‐block references) to make it more locally decodable, but that typically sacrifices some compression ratio.
- LZ methods and ZIP are lossless.
- With lossy compression, “local decodability” can be easier in the sense that each block may be decoded without referencing data too far outside itself—but at the cost that you’re not reconstructing the original data exactly.
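The block-size tradeoff in the summary can be illustrated with Python's standard `zlib` module, which implements DEFLATE streams (not the ZIP container itself). In this sketch each block is compressed as an independent stream, so back-references never cross block boundaries and a single byte can be read by decompressing only its own block. The helper names, the synthetic data and the 1024-byte block size are all arbitrary choices for illustration.

```python
import zlib


def compress_blocks(data, block_size):
    """Compress each fixed-size block as an independent zlib stream."""
    return [
        zlib.compress(data[start:start + block_size])
        for start in range(0, len(data), block_size)
    ]


def read_byte(blocks, block_size, offset):
    """Read one byte of the original data by decoding only one block."""
    block = zlib.decompress(blocks[offset // block_size])
    return block[offset % block_size]


# Synthetic, highly repetitive data (hypothetical log-like records).
data = b"".join(
    f"record {i:06d}: the quick brown fox jumps over the lazy dog\n".encode()
    for i in range(2000)
)

block_size = 1024
blocks = compress_blocks(data, block_size)
whole = zlib.compress(data)
blockwise_size = sum(len(block) for block in blocks)

print(f"original:            {len(data)} bytes")
print(f"single stream:       {len(whole)} bytes")
print(f"independent blocks:  {blockwise_size} bytes")
print(f"byte 12345 via one block: {read_byte(blocks, block_size, 12345)}")
```

Shrinking `block_size` makes reads more local but increases the total compressed size, because every block pays its own header cost and loses the references into earlier data.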
You May Want to Know About Locally Decodable Codes
In AI alignment and interpretability research, there's a compelling intuition that understanding equals compression. The idea is straightforward: if you truly understand a system, you can describe it more concisely by leveraging that understanding. This philosophy suggests that better interpretability techniques for neural networks should yield better compression of their behavior or parameters.
jake_mendel asks: if understanding equals compression, then shouldn't ZIP compression of neural network weights count as understanding? After all, ZIP achieves remarkable compression ratios on neural network weights - likely better than any current interpretability technique. Yet intuitively, having a ZIP file of weights doesn't feel like understanding at all! We wouldn't say we've interpreted a neural network just because we've compressed its weights into a ZIP file.
Compressing a bit string means finding a code for that string, and the study of such codes is the central topic of both algorithmic and Shannon information theory. Just compressing the set of weights as small as possible is too naive - we probably want to impose additional properties on the codes.
One crucial property we might want is "local decodability": if you ask a question about any specific part of the original neural network, you should be able to answer it by examining only a small portion of the compressed representation. You shouldn't need to decompress the entire thing just to understand one small aspect of how the network operates. This matches our intuitions about human understanding - when we truly understand something, we can answer specific questions about it without having to review everything we know.
A Locally Decodable Code (LDC) is a special type of error-correcting code that allows recovery of any single bit of the original message by querying only a small number of bits of the encoded message, even in the presence of some corruption. This property stands in stark contrast to ZIP compression: to recover a specific piece of a compressed file you generally have to decode the compressed stream sequentially from its start up to that point, and a single corrupted bit can garble everything after it. ZIP compression is not locally decodable.
There's a fundamental tension between how compact an LDC can be (its rate) and how many bits you need to query to decode a single piece of information (query complexity). You can't make an LDC that only needs to look at one position, and if you restrict yourself to two queries, your code length must grow exponentially with message size.
This technical tradeoff might reflect something deeper about the nature of understanding. Perhaps true understanding requires both compression (representing information concisely) and accessibility (being able to quickly retrieve specific pieces of information), and there are fundamental limits to achieving both simultaneously.
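To make the two-query case concrete, here is a minimal sketch using the Hadamard code, the textbook example of a 2-query LDC. The function names (`hadamard_encode`, `local_decode_bit`, `decode_bit_majority`) are made up for illustration, and the majority vote is just one simple way to boost confidence, not part of the definition. Note the exponential blow-up: a k-bit message becomes a 2^k-bit codeword.

```python
import random

def hadamard_encode(message_bits):
    """Encode k bits as 2**k bits: position a stores <message, a> mod 2."""
    k = len(message_bits)
    msg = sum(bit << i for i, bit in enumerate(message_bits))
    # The parity of popcount(msg & a) is the inner product over GF(2).
    return [bin(msg & a).count("1") % 2 for a in range(2 ** k)]

def local_decode_bit(codeword, i, rng):
    """Guess message bit i from exactly two queries to the codeword.

    <m, a> + <m, a xor e_i> = m_i over GF(2), so the XOR of the two queried
    positions equals bit i whenever both positions are uncorrupted.
    """
    a = rng.randrange(len(codeword))
    b = a ^ (1 << i)
    return codeword[a] ^ codeword[b]

def decode_bit_majority(codeword, i, rng, trials=101):
    """Repeat the two-query decoder and take a majority vote (an add-on)."""
    votes = sum(local_decode_bit(codeword, i, rng) for _ in range(trials))
    return int(votes > trials // 2)

if __name__ == "__main__":
    rng = random.Random(0)
    message = [1, 0, 1, 1, 0, 0, 1, 0]
    codeword = hadamard_encode(message)      # 8 bits -> 256 bits
    corrupted = list(codeword)
    for pos in rng.sample(range(len(corrupted)), 12):  # flip about 5% of positions
        corrupted[pos] ^= 1
    print([decode_bit_majority(corrupted, i, rng) for i in range(len(message))])
```

Each individual guess reads only two positions, no matter how long the message is. If a fraction δ of the codeword is corrupted, a single guess is wrong with probability at most 2δ, since it fails only when one of its two queries hits a flipped position.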
Loving this!
But one thing this model likely predicts is that a better model for a NN than a single linear regression model is a collection of qualitatively different linear regression models at different levels of granularity. In other words, depending on how sloppily you chop your data manifold up into feature subspaces, and how strongly you use the "locality" magnifying glass on each subspace, you'll get a collection of different linear regression behaviors; you then predict that at every level of granularity, you will observe some combination of linear and nonlinear learning behaviors.
A couple things that come to mind.
- Linear features = sufficient statistics of exponential families?
- simplest case is case of Gaussians and covariance matrix (which comes down to linear regression)
- formalized by the Pitman-Koopman-Darmois theorem
- exponential families are a fairly good class but are not closed under hierarchical structure. The basic example: a mixture of Gaussians is not an exponential family, i.e. it is not described in terms of just linear regression.
- The centrality of ReLU neural networks.
- Understanding ReLU neural networks is probably 80-90% of understanding NN architectures. At sufficient scale, pure MLPs have the same or better scaling laws than transformers.
- There are several lines of evidence that gradient descent has an inherent bias towards splines/piecewise linear functions/tropical polynomials. See e.g. here and references therein.
- Serious analysis of ReLU neural networks can be done through tropical methods. A key paper is here. You say:
"very cool piece of the analysis here is locally modelling ReLU learning as building a convex function as a max of linear functions (and explaining why non-ReLU learning should exhibit a softer version of the same behavior). This is a somewhat "shallow" point of view on learning, but probably captures a nontrivial part of what's going on, and this predicts that every new weight update only has local effect -- i.e., is felt in a significant way only by a small number of datapoints (the idea being that if you're defining a convex function as the max of a bunch of linear functions, shifting one of the linear functions will only change the values in places where this particular linear function was dominant). The way I think about this phenomenon is that it's a good model for "local learning", i.e., learning closer to memorization on the memorization-generalization spectrum that only updates the behavior on a small cluster of similar datapoints (e.g. the LLM circuit that completes "Barack" with "Obama"). "
I suspect the notions one should be looking at are the activation polytope and activation fan in Section 5 of the paper. The hypothesis would be something about efficiently learnable features having a 'locality' constraint on these activation polytopes, i.e., they are 'small' and 'active on only a few data points'.
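The "local effect" claim in the quoted passage is easy to check on a toy example. Below is a minimal sketch, assuming a one-dimensional convex function written as a max of four affine pieces; the slopes, intercepts, and the helper names `max_affine` and `update_footprint` are all made up for illustration and are not taken from the paper. It shifts one piece and records which inputs actually see a different function value.

```python
def max_affine(x, slopes, intercepts):
    """Return (value, dominant piece) for f(x) = max_j (slopes[j] * x + intercepts[j])."""
    values = [a * x + b for a, b in zip(slopes, intercepts)]
    best = max(range(len(values)), key=values.__getitem__)
    return values[best], best

def update_footprint(xs, slopes, intercepts, piece, delta):
    """Shift one piece's intercept by delta and report where f changes.

    Returns two boolean lists over xs: whether f(x) changed, and whether the
    shifted piece was dominant at x before or after the update.
    """
    new_intercepts = list(intercepts)
    new_intercepts[piece] += delta
    changed, dominant = [], []
    for x in xs:
        before, win_before = max_affine(x, slopes, intercepts)
        after, win_after = max_affine(x, slopes, new_intercepts)
        changed.append(after != before)
        dominant.append(piece in (win_before, win_after))
    return changed, dominant

if __name__ == "__main__":
    xs = [i / 100 for i in range(-300, 301)]   # toy "datapoints" on [-3, 3]
    slopes = [-2.0, -0.5, 0.5, 2.0]            # hypothetical pieces
    intercepts = [-2.0, 0.0, 0.0, -2.0]
    changed, dominant = update_footprint(xs, slopes, intercepts, piece=2, delta=0.3)
    print("fraction of points whose output changed:", sum(changed) / len(xs))
    print("all changes lie where the piece is dominant:",
          all(d for c, d in zip(changed, dominant) if c))
```

Every point whose output changes is one where the shifted piece is the maximum either before or after the update; everywhere else the function is untouched. This is the one-dimensional version of an update being "felt" only by a small cluster of datapoints.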
People are not thinking clearly about AI-accelerated AI research. This comment by Thane Ruthenis is worth amplifying.
I'm very skeptical of AI being on the brink of dramatically accelerating AI R&D.
My current model is that ML experiments are bottlenecked not on software-engineer hours, but on compute. See Ilya Sutskever's claim here:
95% of progress comes from the ability to run big experiments quickly. The utility of running many experiments is much less useful.
What actually matters for ML-style progress is picking the correct trick, and then applying it to a big-enough model. If you pick the trick wrong, you ruin the training run, which (a) potentially costs millions of dollars, (b) wastes the ocean of FLOP you could've used for something else.
And picking the correct trick is primarily a matter of research taste, because:
- Tricks that work on smaller scales often don't generalize to larger scales.
- Tricks that work on larger scales often don't work on smaller scales (due to bigger ML models having various novel emergent properties).
- Simultaneously integrating several disjunctive incremental improvements into one SotA training run is likely nontrivial/impossible in the general case.
So 10x'ing the number of small-scale experiments is unlikely to actually 10x ML research, along any promising research direction.
And, on top of that, I expect that AGI labs don't actually have the spare compute to do that 10x'ing. I expect it's all already occupied 24/7 running all manner of smaller-scale experiments, squeezing whatever value out of them that can be squeezed out. (See e.g. the Superalignment team's struggle to get access to compute: that suggests there isn't an internal compute overhang.)
Indeed, an additional disadvantage of AI-based researchers/engineers is that their forward passes would cut into that limited compute budget. Offloading the computations associated with software engineering and experiment oversight onto the brains of mid-level human engineers is potentially more cost-efficient.
As a separate line of argumentation: Suppose that, as you describe it in another comment, we imagine that AI would soon be able to give senior researchers teams of 10x-speed 24/7-working junior devs, to whom they'd be able to delegate setting up and managing experiments. Is there a reason to think that any need for that couldn't already be satisfied?
If it were an actual bottleneck, I would expect it to have already been solved: by the AGI labs just hiring tons of competent-ish software engineers. They have vast amounts of money now, and LLM-based coding tools seem competent enough to significantly speed up a human programmer's work on formulaic tasks. So any sufficiently simple software-engineering task should already be done at lightning speeds within AGI labs.
In addition: the academic-research and open-source communities exist, and plausibly also fill the niche of "a vast body of competent-ish junior researchers trying out diverse experiments". The task of keeping senior researchers up-to-date on openly published insights should likewise already be possible to dramatically speed up by tasking LLMs with summarizing them, or by hiring intermediary ML researchers to do that.
So I expect the market for mid-level software engineers/ML researchers to be saturated.
So, summing up:
- 10x'ing the ability to run small-scale experiments seems low-value, because:
  - The performance of a trick at a small scale says little (one way or another) about its performance on a bigger scale.
  - Integrating a scalable trick into the SotA-model tech stack is highly nontrivial.
  - Most of the value and insight comes from full-scale experiments, which are bottlenecked on compute and senior-researcher taste.
- AI likely can't even 10x small-scale experimentation, because that's also already bottlenecked on compute, not on mid-level engineer-hours. There's no "compute overhang"; all available compute is already in use 24/7.
  - If it weren't the case, there's nothing stopping AGI labs from hiring mid-level engineers until they are no longer bottlenecked on their time; or tapping academic research/open-source results.
  - AI-based engineers would plausibly be less efficient than human engineers, because their inference calls would cut into the compute that could instead be spent on experiments.
- If so, then AI R&D is bottlenecked on research taste, system-design taste, and compute, and there's relatively little non-AGI-level models can contribute to it. Maybe a 2x speed-up, at most, somehow; not a 10x'ing.
For what it's worth, I do think that observers who find themselves to be highly unique along important axes should rationally increase their credence in simulation hypotheses.
I probably shouldn't have used the free energy terminology. Does "complexity-accuracy tradeoff" work better?
To be clear, I very much don't mean these things as a metaphor. I am thinking there may be an actual numerical complexity-accuracy tradeoff, some elaboration of Watanabe's "free energy" formula, that actually describes these tendencies.
Sorry, these words are not super meaningful to me. Would you be able to translate this from physics-speak?