Thank you for the references Dan.
I agree neural networks probably don't actually satisfy the padding argument on the nose, and agree that the exact degeneracy is quite interesting (as I say at the end of the OP).
I do think that for large enough overparameterization the padding argument suggests the LLC might come close to the K-complexity in many cases. But more interesting to me is that the padding argument doesn't really require the programming language to be Turing-complete. In those cases the degeneracy will be proportional to complexity/simplicity measures that are specific to the programming language (/architecture class). Inshallah I will get to writing something about that soon.
The Padding Argument or Simplicity = Degeneracy
[I learned this argument from Lucius Bushnaq and Matthias Dellago. It is also latent already in Solomonoff's original work]
Consider binary strings of a fixed length $N$.
Imagine feeding these strings into some Turing machine; we think of strings as codes for a function. Suppose we have a function $f$ that can be coded by a short compressed string $\sigma_f$ of length $k \ll N$. That is, the function is computable by a small program.
Imagine uniformly sampling a random code of length $N$. What fraction of the codes implement the same function as the string $\sigma_f$? It's close to $2^{-k}$. Indeed, given the string $\sigma_f$ of length $k$ we can 'pad' it to a string of length $N$ by writing the code
"run $\sigma_f$, skip $t$"
where $t$ is an arbitrary string of length $N - k - c$, with $c$ a small constant accounting for the overhead. There are approximately $2^{N-k}$ such binary strings. So if our programming language has a simple skip / commenting-out functionality, we expect approximately $2^{N-k}$ codes encoding the same function as $\sigma_f$, and the fraction of all $2^N$ codes encoding $f$ is $\approx 2^{-k}$.
I find this truly remarkable: the degeneracy or multiplicity of a function is exponentially decreasing in its minimum description length!
Just by sampling codes uniformly at random we get the Simplicity prior!!
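As a sanity check, here is a minimal toy simulation of the argument. Everything about the 'language' is made up for illustration: a code is an $N$-bit string, the 'program' is the prefix up to and including the first '11' marker, and all later bits are skipped, like a comment.

```python
import random
from collections import Counter

# Toy language: identify the 'function' a code computes with its prefix up to
# and including the first '11' marker; the remaining bits are padding/comments.
# A function whose shortest program has length k should occupy ~2^-k of code space.
N = 20
samples = 200_000
counts = Counter()
for _ in range(samples):
    code = ''.join(random.choice('01') for _ in range(N))
    stop = code.find('11')
    program = code[: stop + 2] if stop != -1 else code
    counts[program] += 1

for program, c in counts.most_common(5):
    print(f"program {program!r}: observed {c / samples:.4f} vs predicted {2 ** -len(program):.4f}")
```

The observed frequencies track $2^{-k}$: uniform sampling over codes yields the simplicity prior.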
Why do Neural Networks work? Why do polynomials not work?
It is sometimes claimed that neural networks work well because they are 'universal approximators'. There are multiple problems with this explanation (see e.g. here), but a very basic one is that being a universal approximator is very common. Polynomials are universal approximators too!
Many different neural network architectures work. In the limit of large data and compute, the differences between architectures start to vanish and very general scaling laws dominate. This is not the case for polynomials.
Degeneracy = Simplicity explains why: polynomials are uniquely tied down by their coefficients, so a learning machine that tries to fit polynomials does not have a 'good' simplicity bias that approximates the Solomonoff prior.
The lack of degeneracy applies to any set of functions that forms an orthogonal basis: the decomposition is unique, so there is no multiplicity and hence no implicit regularization / simplicity bias.
[I learned this elegant argument from Lucius Bushnaq.]
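A minimal sketch of the contrast, with made-up toy numbers: the least-squares polynomial coefficients for a given target are unique, while even a two-neuron 'network' realizes the same function under distinct parameter settings, e.g. by permuting its neurons.

```python
import numpy as np

x = np.linspace(-1, 1, 50)

# Polynomial fit: the coefficient vector is pinned down uniquely (no degeneracy).
coeffs = np.polyfit(x, np.tanh(2 * x), deg=3)

# Two-neuron net f(x) = a1*tanh(w1*x) + a2*tanh(w2*x): swapping the neurons
# gives a different parameter vector but exactly the same function.
# (The sign-flip symmetry a*tanh(w*x) == (-a)*tanh(-w*x) gives yet more copies.)
w1, a1, w2, a2 = 1.0, 0.5, -2.0, 1.5
f = a1 * np.tanh(w1 * x) + a2 * np.tanh(w2 * x)
f_swapped = a2 * np.tanh(w2 * x) + a1 * np.tanh(w1 * x)
print(np.allclose(f, f_swapped))  # True: distinct parameters, identical function
```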
The Singular Learning Theory and Algorithmic Information Theory crossover
I described the padding argument as an argument, not a proof. That's because technically it only gives a lower bound on the number of codes equivalent to the minimal description code. The problem is that there are pathological examples where the programming language (e.g. the UTM) hardcodes that all small codes encode a single fixed function.
When we take this problem into account, the Padding Argument is already in Solomonoff's original work: there is a theorem stating that the Solomonoff prior is equivalent to taking a suitable Universal Turing Machine, feeding it a sequence of (uniformly) random bits, and taking the resulting distribution. To account for the pathological examples above, everything is asymptotic and holds only up to a constant, like all results in algorithmic information theory. This means that, like all other results in algorithmic information theory, it's unclear whether it is at all relevant in practice.
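For reference, the relevant statement (Levin's coding theorem, with all the usual additive constants suppressed) is:

$$m(x) \;=\; \sum_{p \,:\, U(p) = x} 2^{-|p|} \;=\; 2^{-K(x) + O(1)},$$

where $m$ is the Solomonoff prior obtained by feeding the universal (prefix) machine $U$ uniformly random bits, and $K$ is prefix Kolmogorov complexity.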
However, while this gives a correct proof, I think it understates the importance of the Padding Argument. That's because in practice we shouldn't expect the UTM to be pathological in this way. In other words, we should heuristically expect the simplicity $2^{-K(f)}$ to be basically proportional to the fraction of codes yielding $f$, for a large enough (overparameterized) architecture.
The bull case for SLT is now: there is a direct relation between algorithmic complexity and degeneracy. This has always been SLT dogma of course, but until I learned about this argument it wasn't so clear to me how direct the connection was. The algorithmic complexity can be usefully approximated by the (local) learning coefficient!
EDIT: see Clift-Murfet-Wallbridge and Tom Waring's thesis for more. See below, thanks Dan.
The bull case for algorithmic information theory: the theory of algorithmic information, Solomonoff induction, AIXI etc. is very elegant and in some sense answers fundamental questions we would like answered. The major problem is that it is both uncomputable and seemingly intractable. Uncomputability is perhaps not such a problem; it often arises from measure-zero, highly adversarial examples. But tractability is very problematic. We don't know how tractable compression is, but it's likely intractable. However, the Padding Argument suggests that we should heuristically expect the simplicity to be basically proportional to the fraction of codes yielding $f$ for a large enough (overparameterized) architecture; in other words, it can be measured by the local learning coefficient.
Do Neural Networks actually satisfy the Padding argument?
Short answer: No.
Long answer: Unclear. Maybe... sort of... and the difference might itself be very interesting...!
Stay tuned.
Neural Networks have a bias towards Highly Decomposable Functions.
tl;dr Neural networks favor functions that can be "decomposed" into a composition of simple pieces in many ways - "highly decomposable functions".
Degeneracy = bias under uniform prior
[see here for why I think bias under the uniform prior is important]
Consider a space of parameters $W$ used to implement functions, where each element $w \in W$ specifies a function via some map $\pi: W \to \mathcal{F}$. Here, the set $W$ is our parameter space, and we can think of each $w$ as representing a specific configuration of the neural network that yields a particular function $f = \pi(w)$.
The mapping $\pi$ assigns each point $w \in W$ to a function $\pi(w)$. Due to redundancies and symmetries in parameter space, multiple configurations might yield the same function, forming what we call the fiber $\pi^{-1}(f)$ of $f$ (the set of 'degenerate' parameters).
This fiber is the set of ways in which the same functional behavior can be achieved by different parameterizations. If we sample parameters uniformly, the degeneracy $|\pi^{-1}(f)|$ of a function $f$ counts how likely that function is to be sampled.
The Bias Toward Decomposability
Consider a neural network architecture built out of $L$ layers. Mathematically, we can decompose the parameter space as a product
$$W = W_1 \times W_2 \times \cdots \times W_L,$$
where each $W_i$ represents the parameters for a particular layer. The function implemented by the network, $f_w$, is then a composition:
$$f_w = f_{w_L} \circ \cdots \circ f_{w_1}.$$
For a function $f$, its degeneracy (the number of ways to parameterize it) is
$$|\pi^{-1}(f)| = \sum_{(f_1, \dots, f_L) \in D(f)} \prod_{i=1}^{L} |\pi_i^{-1}(f_i)|.$$
Here, $D(f)$ is the set of all possible decompositions $f = f_L \circ \cdots \circ f_1$ of $f$.
That means that functions that have many such decompositions are more likely to be sampled.
In summary, the layered design of neural networks introduces an implicit bias toward highly decomposable functions.
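Here is a toy sketch of this bias. Everything is made up for illustration: 'layer parameters' are arbitrary maps on a three-element set, and the 'network' composes two of them; functions with many decompositions (e.g. the constant maps) dominate under the uniform prior.

```python
from itertools import product
from collections import Counter

S = range(3)
maps = list(product(S, repeat=3))  # all 27 maps {0,1,2} -> {0,1,2}, as value tables

# Uniform prior over parameter pairs (g, h); the 'network' implements f = h ∘ g.
degeneracy = Counter()
for g, h in product(maps, maps):
    f = tuple(h[g[x]] for x in S)
    degeneracy[f] += 1

print(degeneracy.most_common(3))  # the three constant maps top the list
print(degeneracy[(0, 1, 2)])      # the identity has only a handful of decompositions
```

Under this uniform prior the constant maps (87 parameter pairs each, out of 729) are more than an order of magnitude more likely than the identity (6 pairs).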
I think I speak for all of the LessWrong commentariat when I say I am sad to see you go.
That said, congratulations for building such a wonderfully eigen website!
Looking for specific tips and tricks to break AI out of formal/corporate writing patterns. Tried style mimicry ('write like Hemingway') and direct requests ('be more creative') - both fell flat. What works?
Should I be using different AI models (I am using GPT and Claude)? The base models output an enormous creative storm, but somehow RLHF has partially lobotomized LLMs such that they always seem to output either cheesy stereotypes or overly verbose academese/corporate-speak.
Is true Novelty a Mirage?
One view on novelty is that it's a mirage. Novelty is 'just synthesis of existing work, plus some randomness.'
I don't think that's correct. I think true novelty is more subtle than that. Yes, sometimes novel artforms or scientific ideas come from noisily mixing existing ideas. But does that describe all forms of novelty?
A reductio ad absurdum of the novelty-as-mirage point of view is that all artforms that have appeared since the dawn of time are simply noised versions of cave paintings. This seems absurd.
Consider AlphaGo. Does AlphaGo just noisily mix human experts? No, AlphaGo works on a different principle, and I would venture it strictly outcompetes anything based on averaging or smoothing over human experts.
AlphaGo is based on a different principle than averaging over existing data. Instead, AlphaGo starts with an initial guess at what good play looks like, perhaps imitated from previous games. It then plays out to long horizons, prunes the strategies that did poorly, and upweights the strategies that did well. It iteratively amplifies, refines and distills. I strongly suspect that approximately this modus operandi underlies much of human creativity as well.
True novelty is based on both the synthesis and refinement of existing work.
Yes, that's worded too strongly; a result of me putting some key phrases into Claude and not proofreading. :p
I agree with you that most modern math is within-paradigm work.
I shall now confess to a great caveat. When at last the Hour is come and the Program of the World is revealed to the Descendants of Man, they will gaze upon the Lines Laid Bare and Rejoice; for the Code Kernel of God is written in category theory.
Misgivings about Category Theory
[No category theory is required to read and understand this screed]
A week does not go by without somebody asking me what the best way to learn category theory is. Despite being set to mark its 80th anniversary, category theory has the evergreen reputation of being the Hot New Thing, a way to radically expand the braincase of the user through an injection of abstract mathematics. Its promise is alluring, intoxicating for any young person desperate to prove they are the smartest kid on the block.
Recently, there has been significant investment and attention focused on the intersection of category theory and AI, particularly in AI alignment research. Despite the influx of interest, I am worried that it is not entirely understood just how big the theory-practice gap is.
I am worried that overselling risks poisoning the well for advanced mathematical approaches to science in general, and to AI alignment in particular. As I believe mathematically grounded approaches to AI alignment are perhaps the only way to get robust worst-case safety guarantees for the superintelligent regime, I think this would be bad.
I find it difficult to write this. I am a big believer in mathematical approaches to AI alignment, working for one organization (Timaeus) betting on this and being involved with a number of other groups. I have many friends within the category theory community; I have even written an abstract-nonsense paper myself; I am sympathetic to the aims and methods of the category theory community. This is all to say: I'm an insider, and my criticisms come from a place of deep familiarity with both the promise and limitations of these approaches.
A Brief History of Category Theory
‘Before functoriality Man lived in caves’ - Brian Conrad
Category theory is a branch of pure mathematics notorious for its extreme abstraction, affectionately derided as 'abstract nonsense' by its practitioners.
Category theory's key strength lies in its ability to 'zoom out' and identify analogies between different fields of mathematics and different techniques. This approach enables mathematicians to think 'structurally', viewing mathematical concepts in terms of their relationships and transformations rather than their intrinsic properties.
Modern mathematics is less about solving problems within established frameworks and more about designing entirely new games with their own rules. While school mathematics teaches us to be skilled players of pre-existing mathematical games, research mathematics requires us to be game designers, crafting rule systems that lead to interesting and profound consequences. Category theory provides the meta-theoretic tools for this game design, helping mathematicians understand which definitions and structures will lead to rich and fruitful theories.
“I can illustrate the second approach with the same image of a nut to be opened. The first analogy that came to my mind is of immersing the nut in some softening liquid, and why not simply water? From time to time you rub so the liquid penetrates better, and otherwise you let time pass. The shell becomes more flexible through weeks and months – when the time is ripe, hand pressure is enough, the shell opens like a perfectly ripened avocado!
A different image came to me a few weeks ago. The unknown thing to be known appeared to me as some stretch of earth or hard marl, resisting penetration… the sea advances insensibly in silence, nothing seems to happen, nothing moves, the water is so far off you hardly hear it… yet it finally surrounds the resistant substance.”
- Alexandre Grothendieck
The Promise of Compositionality and ‘Applied category theory’
Recently a new wave of category theory has emerged, dubbing itself ‘applied category theory’.
Applied category theory, despite its name, represents less an application of categorical methods to other fields and more a fascinating reverse flow: problems from economics, physics, social sciences, and biology have inspired new categorical structures and theories. Its central innovation lies in pushing abstraction even further than traditional category theory, focusing on the fundamental notion of compositionality - how complex systems can be built from simpler parts.
The idea of compositionality has long been recognized as crucial across sciences, but it lacks a strong mathematical foundation. Scientists face a universal challenge: while simple systems can be understood in isolation, combining them quickly leads to overwhelming complexity. In software engineering, codebases beyond a certain size become unmanageable. In materials science, predicting bulk properties from molecular interactions remains challenging. In economics, the gap between microeconomic and macroeconomic behaviours persists despite decades of research.
Here then lies the great promise: through the lens of categorical abstraction, the tools of reductionism might finally be extended to complex systems. The dream is that, just as thermodynamics has been derived from statistical physics, macroeconomics could be systematically derived from microeconomics. Category theory promises to provide the mathematical language for describing how complex systems emerge from simpler components.
How has this promise borne out so far? On a purely scientific level, applied category theorists have uncovered a vast landscape of compositional patterns. In a way, they are building a giant catalogue, a bestiary, a periodic table not of ‘atoms’ (=simple things) but of all the different ways ‘atoms' can fit together into molecules (=complex systems).
Not surprisingly, it turns out that compositional systems have an almost unfathomable diversity of behavior. The fascinating thing is that this diversity, while vast, isn't irreducibly complex - it can be packaged, organized, and understood using the arcane language of category theory. To me this suggests the field is uncovering something fundamental about how complexity emerges.
How close is category theory to real-world applications?
Are category theorists very smart? Yes. The field attracts and demands extraordinary mathematical sophistication. But intelligence alone doesn't guarantee practical impact.
It can take many decades for basic science to yield real-world applications - neural networks themselves are a great example. I am bullish in the long-term that category theory will prove important scientifically. But at present the technology readiness level isn’t there.
There are prototypes. There are proofs of concept. But there are no actual applications in the real world beyond a few trials. The theory-practice gap remains stubbornly wide.
The principality of mathematics is truly vast. If categorical approaches fail to deliver on their grandiose promises I am worried it will poison the well for other theoretic approaches as well, which would be a crying shame.
Are Solomonoff Daemons exponentially dense?
Some doomers have very strong intuitions that doom is almost assured for almost any way of building AI. Yudkowsky likes to say that alignment is about hitting a tiny part of values space in a vast universe of deeply alien values.
Is there a way to make this more formal? Is there a formal model in which some kind of Solomonoff daemon / mesa-optimizer / gremlin in the machine starts popping up all over the place as the cognitive power of the agent is scaled up?
How would removing Sam Altman significantly reduce extinction risk? Conditional on AI alignment being hard and Doom likely the exact identity of the Shoggoth Summoner seems immaterial.
[this is a draft. I strongly welcome comments]
The Latent Military Realities of the Coming Taiwan Crisis
A blockade of Taiwan seems significantly more likely than a full-scale invasion. The US's non-intervention in Ukraine suggests similar restraint might occur with Taiwan.
Nevertheless, Metaculus predicts a 65% chance of US military response to a Chinese invasion, and separately gives 20-50% for some kind of Chinese military intervention by 2035. Let us imagine that the worst comes to pass and China and the United States find themselves in a hot war.
China's national memory of the 'century of humiliation' deeply shapes its modern strategic thinking. How many Westerners could faithfully recount the events of the Opium Wars? How many have even heard of the Boxer Rebellion, the Eight-Nation Alliance, the Taiping Rebellion? Yet these events are core curriculum in Chinese education.
Chinese revanchism toward the West enjoys broad public support. The CCP's repression of Chinese public opinion likely understates how popular this view is; according to polling, CCP officials actually hold more dovish views than the general public.
As other pieces of evidence: historically, the Boxer Rebellion was a grass-roots phenomenon. Movies depicting conflict between China and America consistently draw large audiences and positive reception. And China has a minuscule number of foreigners per capita, a number that fell during the pandemic and never rebounded.
China is the only nuclear power that has explicitly disavowed a nuclear first strike. It currently has a remarkably small nuclear stockpile (~200 warheads). With the increased sensor capabilities in recent years China has become vulnerable to a US nuclear first-strike destroying her launchers before she can react. This is likely part of the reason for a major build-up of her nuclear stockpile in recent years.
It is plausible that there will be a hot war without the use of nuclear weapons. The closest historical case is of course the Korean War, the last (indirect) conflict between the US and China, which ended in stalemate despite massive US economic superiority. Today, that economic gap has largely closed: China's economy is 1.25x larger in PPP terms, while the US is only 40% bigger in nominal GDP.
What would a conventional US-China war look like? What can be learned from past conflicts?
The 1982 Falklands War between the UK and Argentina is the last air-naval war between near-peer powers. The four decades since then approach the span between the US Civil War and WWI. Naval and air warfare technology advances much faster than land warfare; historically, it was tested through frequent conflicts. Today's unprecedented peace means we're largely guessing which naval technologies and doctrines will actually work. While land warfare in Ukraine looks like 'WWI with drones', naval warfare has likely seen much more dramatic changes.
Naval technology advances create bigger power gaps than land warfare. The Opium Wars showed this dramatically - British steamships simply sailed up Chinese rivers unopposed, forcing humiliating treaties on a land power.
Air warfare technology gaps may be even more extreme than naval ones. Modern F-35s achieve 20:0 kill ratios against previous-generation fighters in exercises.
The Arab-Israeli wars and the Gulf War suggest some lessons about modern air warfare. These conflicts showed that air superiority is typically won or lost very quickly: initial strikes on airbases can be decisive, and most aircraft losses happen on the ground rather than in dogfights. This remains such a concern that it is US Air Force doctrine to rotate aircraft between airfields. More broadly, these conflicts suggest that air warfare produces more decisive, one-sided outcomes than land battles: when one side gains air superiority, the results can be devastating.
Wild Cards
Drones and the Transparent Battlefield
Drones represent warfare's future, yet both sides underinvest. While the US military has only 10,000 small drones and 400 large ones, Ukraine alone produces 1-4 million drones annually. China leads in mass-producing small drones but lacks integration doctrine. The Ukraine war revealed how modern sensors create a 'transparent battlefield' where hiding large forces is impossible. Drones might make it trivially easy to find (and even destroy) submarines and surface ships.
Submarines
Since WWI, submarines have been the kings of the sea, and it is plausible that they remain dominant. A single torpedo from a submarine will sink an aircraft carrier; in exercises, small diesel-electric submarines regularly 'sink' entire carrier groups. These submarines can hide in sonar dead zones, regions where water temperature and salinity create acoustic blind spots.
Are Aircraft Carriers obsolete?
China now fields hypersonic missiles that, at least in theory, could disable an aircraft carrier from 1500 miles or beyond. On the flip side, missile-defense effectiveness has increased dramatically, and hypersonic-missile effectiveness may be overstated. As a point of evidence for the remaining importance of aircraft carriers: China is building her own fleet of them.
Military Competence Wildcard:
Peace means we don't know the true combat effectiveness of either military. Authoritarian militaries often suffer from corruption and incompetence - Chinese troops have been caught loading missile launchers with water instead of fuel during exercises [Comment 5: Need source]. But the US military also shows worrying signs: bureaucratic bloat, lack of recent peer conflict experience, and questions about training quality. Both militaries' actual combat effectiveness remains a major unknown. The US Navy now has more admirals than warships.
Stealth bombers and JASSM-ER
We don't know what the dominant weapon in a real conventional 21st-century naval war between peers would be, but a plausible guess for a game-changing technology is stealth bombers & stealth missiles.
The obscene cost made the B-2 stealth bomber even less popular than the ever-more-costly jet fighters, and the project was prematurely halted at 21 airframes. Despite the obscene cost, it's plausible that the B-2 and its younger cousin the B-21 are worth all the money and then some.
Unlike fighters, a stealth bomber has something like 'true stealth'. While a stealth fighter like the F-35 is better thought of as a 'low-observable' aircraft, difficult to target-lock with short-wave radar but easily detectable by long-wave radar, the B-2 is opaque to long-wave radar too. Stealth bombers can also carry air-to-air missiles, so they may even be effective against fighters. Manoeuvrability and speed, long the defining hallmarks of fighters, have become less important with the advent of highly accurate homing missiles.
Lockheed Martin has developed the JASSM-ER, a stealth missile with a range of up to 900 miles. A B-2 bomber has a range of something like 4000 miles; for comparison, fighters range from roughly 400 to 1200 miles.
A single JASSM-ER hit is probably a mission kill on a naval vessel. A B-2 can carry up to 16 of these missiles. This means that a single squadron of stealth bombers taking off from a base in Guam could potentially wipe out half a fleet in a single sortie.
***********
And of course last but not least, the greatest wildcard of them all:
AGI.
I will refrain from speculating on the military implications of AGI.
Clear China Disadvantages, US Advantages:
Amphibious assaults are inherently difficult. A full Taiwan invasion faces massive logistical hurdles. Taiwan could perhaps muster 500,000 defenders under full mobilization; under the standard 3:1 attacker-to-defender doctrine, a successful assault would require some 1.5 million Chinese troops. For perspective, D-Day, history's largest amphibious invasion, landed only 133,000 troops.
China's energy vulnerability is significant - China imports 70% of its oil and 25% of its gas by sea. While Russia provides 20-40% of these imports and could increase supply, the US could severely disrupt China's energy access.
China's regional diplomacy has backfired. China has alienated virtually all its neighbours, while the US has basing options in Japan, Australia, the Philippines, and across the Pacific islands.
US carrier advantage. The US operates 11 nuclear supercarriers with extensive blue-water experience. China has two smaller carriers active, one in trials, and one nuclear carrier under construction. The big question mark is whether carriers are obsolete or not.
US stealth bomber advantage. The US leads with 21 B-2s and 100 new B-21s ordered, while China's H-20 program still lags behind.
US submarine advantage. US submarines are significantly ahead technologically. Putin selling Russian submarine technology might nullify some of that advantage, as might new cheap sea drones. Geographically, it's hard for Chinese submarines to leave the China seas unnoticed.
Clear China Advantages, US Disadvantages:
Geography favors China. Taiwan lies just 100 miles from mainland China, while US forces must cross the Pacific. The massive Chinese Rocket Force can launch thousands of missiles from secure mainland positions.
Advanced missile capabilities. A massive conventional rocket force plus claimed hypersonic missile capabilities. [Comment: find skeptic hypersonic missile video]
China has been preparing for many years. China has established numerous artificial islands with airfields throughout the region. They've successfully stolen F-35 plans and are producing their own version at scale. The Chinese government has built up enormous national emergency stores of essential resources in preparation for the (inevitable) conflict. Bringing Taiwan back into the fold has been a primary driver of policy for decades.
US shipbuilding. The US shipbuilding industry has collapsed to just 0.1% of global production, while China, South Korea, and Japan dominate with 35-40%, 25-30%, and 20-25% respectively.
Simon-Pepin Lehalleur weighs in on the DevInterp Discord:
I think his overall position requires taking degeneracies seriously: he seems to be claiming that there is a lot of path dependency in weight space, but very little in function space 😄
In general his position seems broadly compatible with DevInterp:
- models learn circuits/algorithmic structure incrementally
- the development of structures is controlled by loss landscape geometry
- and also possibly in more complicated cases by the landscapes of "effective losses" corresponding to subcircuits...
This perspective certainly is incompatible with a naive SGD = Bayes = Watanabe's global SLT learning process, but I don't think anyone has (ever? for a long time?) made that claim for non-toy models.
It seems that the difference with DevInterp is that
- we are more optimistic that it is possible to understand which geometric observables of the landscape control the incremental development of circuits
- we expect, based on local SLT considerations, that those observables have to do with the singularity theory of the loss and also of sub/effective losses, with the LLC being the most important but not the only one
- we dream that it is possible to bootstrap this to a full fledged S4 correspondence, or at least to get as close as we can.
Ok, no problem. You can also add the following:
I am sympathetic to, but also unsatisfied with, a strong empiricist position about deep learning. It seems to me to be based on a slightly misapplied physical, and specifically thermodynamical, intuition: namely, that we can just observe a neural network and see / easily guess what the relevant "thermodynamic variables" of the system are.
For ordinary 3d physical systems, we tend to know or easily discover those thermodynamic variables through simple interactions/observations. But a neural network is an extremely high-dimensional system which we can only "observe" through mathematical tools. The loss is clearly one such thermodynamic variable, but if we expect NNs to be in some sense stat-mech systems it can't be the only one (otherwise the learning process would be much more chaotic and unpredictable). One view of DevInterp is that we are "just" looking for those missing variables...
I'd be curious about hearing your intuition re " i'm further guessing that most structures basically have 'one way' to descend into them"
Great work, niplav!
Happy to see you've pushed through and finished this monumental piece of work =)
Yes. I would even say that finding the right assumptions is the most important part of proving nontrivial selection theorems.
I am flattered to receive these Bayes points =) ; I would be crying tears of joy if there were a genuine slowdown, but:
- I generally think there are still huge gains to be made with scaling. Sometimes when people hear my criticism of scaling maximalism they pattern-match it to me saying scaling won't be as big as they think it is. On the contrary: I am saying scaling further will be as big as you think it will be, and additionally there is an enormous advance yet to come.
- How much evidence do we have of a genuine slowdown? Strawberry was about as big an advance as GPT-3 to GPT-4 in my book. How credible are these Twitter rumors?
(Expensive) Matchmaking services already exist - what's your reading on why they're not more popular?
How to prepare for the coming Taiwan Crisis? Should one short TSMC? Dig a nuclear cellar?
Metaculus gives a 25% chance of a full-scale invasion of Taiwan within 10 years and a 50% chance of a blockade. It gives a 65% chance that if China invades Taiwan before 2035, the US will respond with military force.
Metaculus has very strong calibration scores (apparently better than prediction markets). I am inclined to take these numbers as the best guess we currently have of the situation.
Is there any way to act on this information?
The important thing is that both do active learning & decision-making & search, i.e. RL.*
LLMs don't do that. So the gain from doing that is huge.
Synthetic data is a bit of a weird term that gets thrown around a lot. There are fundamental limits on how much information resampling from the same data source will yield about completely different domains, so that usage seems a bit silly. Of course, sometimes with synthetic data people just mean doing rollouts, i.e. RL.
*The word RL sometimes gets mistaken for only a very specific reinforcement learning algorithm. I mean here a very general class of algorithms that solve MDPs.
What is the sex recession? And do we know it is caused by Tinder?
Glib formality: current LLMs approximate something like a speed-prior Solomonoff inductor for internet data, but do not approximate AIXI.
There is a whole class of domains that are not tractably accessible from next-token prediction on human-generated data. For instance, learning how to beat AlphaGo with access only to pre-2014 human Go games.
Fwiw I basically think you are right about the agentic AI overhang and obviously so. I do think it shapes how one thinks about what's most valuable in AI alignment.
How sure are you that OKcupid is a significantly better product for the majority of people (as opposed to a niche group of very online people)?
I think it can be both rational to doubt his edge and not trade on it.
Hot Take #44: Preaching to the choir is 'good' actually.
- Almost anything that has a large counterfactual impact is achieved by people thinking and acting differently from accepted ways of thinking and doing.
- With the exception of political entrepreneurs jumping into a power vacuum, or scientific achievements by exceptional individuals, most counterfactual impact is made by movements of fanatics.
- The greatest danger to any movement is dissipation. Conversely, the greatest resource of any movement is the fanaticism of its members.
- Most persuasion is non-rational, based on tribal allegiances and social consensus.
- It follows that any group, movement, company, cult, etc. that aspires to have a large counterfactual impact (for good or ill) must direct most of its preaching, most of its education and information processing, inward.
- The Catholic Church understood this. The Pontifex Maximus has reigned now for two thousand years.
Norvid on Twitter made the apt point that we will need to see the actual private data before we can really judge. It's not unusual for lucky people to back-rationalize their luck as a sure win.
Okay fair enough "rich idiot" was meant more tongue-in-cheek - that's not what I intended.
Yes, this is possible. It smells a bit of 4d-chess. As far as I can tell he already had finalized his position by the time the WSJ interview came out.
I've dug a little deeper and it seems he did do a bunch of research on polling data. I was a bit too rash to say he had no inside information whatsoever; plausibly he had some. Still, the degree of inside information he would need is very high. It seems he did a similar Kelly-bet calculation, since he reported his all-things-considered probability to be 80-90%:
"With so much money on the line, Théo said he is feeling nervous, though he believes Trump has an 80%-90% chance to win the election.
"A surprise can always occur," Théo told The Journal."
I have difficulty believing one can get this kind of certainty in an all-things-considered probability for something as noisy and tight as a US presidential election. [but he won both the electoral college and popular vote bet]
The true probability would need to be more like >90% considering other factors like opportunity costs, transaction costs, counterparty risk, unforeseen black swans of various kinds, etc.
Bear in mind this is an all-things-considered probability, not just an in-model probability; i.e. it would have to integrate the fact that most other observers (especially those with strong calibrated prediction records) very strongly disagree.* Certainly, in some cases this is possible, but one would need quite overwhelming evidence of a huge edge.
I agree one can reject Kelly betting - that's pretty crazy risky, but plausibly the case for people like Elon or Theo. The question is whether the rest of us (with presumably more reasonably cautious attitudes) should take his win as much epistemic evidence. I think not. From our perspective his manic risk-loving wouldn't be much evidence of rational expectations.
*Didn't the Kelly formula already integrate the fact that other people think differently? No, this is an additional piece of information one has to integrate. Kelly betting gives you an implicit risk-aversion even conditional on your beliefs being true (on average).
EDIT: Indeed, it seems Theo the French Whale might have done a Kelly-bet estimate too; he reports his true probability at 80-90%. Perhaps he did have private information.
"For example, a hypothetical sale of Théo's 47 million shares for Trump to win the election would execute at an estimated average price of just $0.02, according to Polymarket, which would represent a 96% loss for the trader. Théo paid an average price of about $0.56 cents for the 47 million shares.
Meanwhile, a hypothetical sale of Théo's nearly 20 million shares for Trump to win the popular vote would execute at an average price of less than a 10th of a penny, according to Polymarket, representing a near-total loss.
With so much money on the line, Théo said he is feeling nervous, though he believes Trump has an 80%-90% chance to win the election.
"A surprise can always occur," Théo told The Journal."
Mindmeld
In theory, AIs can transmit information far faster and more directly than humans: they can directly send weight/activation vectors to one another. The most important variables in whether entities (cells, organisms, polities, companies, ideologies, empires, etc.) stay individuals or amalgamate into a superorganism are communication bandwidth & copy fidelity.
Both of these differ by many orders of magnitude between humans and AIs. At some point, mere communication becomes a literal melding of minds. It seems quite plausible, then, that AIs left alone will tend to mindmeld.
The information rate of human speech is around 39 bits per second, regardless of the language being spoken or how fast or slow it is spoken. This is roughly twice the speed of Morse code.
Some say that the rate of 39 bits per second is the optimal rate for people to convey information. Others suggest that the rate is limited by how quickly the brain can process or produce information. For example, one study found that people can generally understand audio recordings that are sped up to 120%.
While the rate of speech is relatively constant, the information density and speaking rate can vary. For example, the information density of Basque is 5 bits per syllable, while Vietnamese is 8 bits per syllable.
Current state-of-the-art fibre optic cables can transmit up to 10 terabits per second.
That's probably a wild overestimate for AI communication though. More relevant bottlenecks are limits on processing information [plausibly more in the megabits range] and limits on the transferability of activation vectors (though training could improve this).
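A back-of-envelope comparison of the figures above:

```python
speech_bps = 39       # human speech, bits per second (figure cited above)
fiber_bps = 10e12     # state-of-the-art fibre optics, ~10 terabits per second
megabit_bps = 1e6     # the guessed processing-limited regime for AIs

print(f"fibre vs speech:   {fiber_bps / speech_bps:.0e}x")    # ~3e11: eleven orders of magnitude
print(f"megabit vs speech: {megabit_bps / speech_bps:.0e}x")  # ~3e4: still four orders of magnitude
```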
>> 'a massive transfer of wealth from "sharps" '.
No. That's exactly the point.
1. There might not be any real sharps (= traders with access to real private arbitrageable information who consistently take risk-neutral bets on it) in this market at all.
This is because (a) this might simply be a noisy, high-entropy source that is inherently difficult to predict, hence there is little arbitrageable information, and/or (b) sharps have not been sufficiently incentivized.
2. The transfer of wealth is actually disappointing because Theo the French Whale moved the price so much.
For an understanding of what the trading decisions of a verifiable sharp look like, one should look at Jim Simons' Medallion fund: enormous hidden information collection, mysterious computer models, but at the end of the day a large number of heavily hedged, tiny-edge positions.
***************************************************
You are misunderstanding my argument (and most of the LW commentariat with you). I might note that I made my statement before the election result and clearly said 'win or lose', but it seems that even on LW people think winning on a noisy N=1 sample is proof of rationality.
That's why I said: "In expectation", "win or lose"
That the coin flip came out one way rather than another doesn't prove the guy had actual inside knowledge. He bought a large part of his shares at crazy odds because his own market impact moved the price so much.
But yes, he could be a sharp in sheep's clothing. I doubt it, but who knows. EDIT: I calculated the implied private odds a rational Kelly bettor would have to hold. Suffice to say these private odds seem unrealistic for election betting.
Point is that the winners contribute epistemics and the losers contribute money. The real winner is society [if the questions are about socially-relevant topics].
EDIT: I was wrong. Theo the French Whale was the sharp. From the Kelly formula and his own statements, his all-things-considered probability was 80-90%; he would need to possess an enormous amount of private information to justify such a deviation from other observers. It turns out he did: he commissioned his own secret polls, using a novel polling method to compensate for the shy Trump voter.
https://x.com/FellowHominid/status/1854303630549037180
The French rich idiot who bought $75 million of Trump shares is an EA hero, win or lose.
LW loves prediction markets. EA loves them. I love them. You love them.
See: https://worksinprogress.co/issue/why-prediction-markets-arent-popular/
Problem is, unlike financial markets, prediction markets are zero-sum. That limits how much informed traders - "sharps" - are incentivized.
In theory this could be remedied by a party willing to pay for the information subsidizing the market. But the information is public, so this suffers from the standard tragedy of the commons.
Mister Theo bought 75 million dollars' worth of Yes Trump shares. He seems to have no inside information. In expectation, then, he is subsidizing the prediction market, providing the much-needed financial incentive to attract the time, effort and skills of sophisticated sharps.
Perhaps we should think of uninformed "noise traders" on prediction markets as engaging in prosocial behaviour. Their meta-epistemic delusion and gambling addiction provide the cold hard cash that finances accurate price discovery in the long run.
EDIT: to hit the point home: if you invest 50% of your capital (apparently the guy invested most of his ?money) at the odds Polymarket was selling, the Kelly-rational implied edge would mean the guy's true probability is ~80% [according to GPT]. And that's for Kelly betting, widely considered far too aggressive in real life; most experts (e.g. actual poker players) advise fractional Kelly betting. All in all, his all-things-considered probability for a Trump win would have to be something like >90% to justify this kind of investment. I don't think anybody in the world has access to enough private information to rationally justify that kind of all-things-considered probability on an extremely noisy random variable (a US presidential election).
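A minimal sketch of that calculation (assumed numbers: the reported average entry price of ~$0.56, and half the bankroll staked):

```python
# Kelly: f* = (b*p - q) / b, with b = net odds and q = 1 - p.
# Invert to get the probability p implied by a chosen stake fraction f.
price = 0.56               # average price paid per Trump-Yes share
b = (1 - price) / price    # net profit per dollar staked if Yes resolves
f = 0.5                    # assumed fraction of bankroll wagered
p = (f * b + 1) / (b + 1)  # implied all-things-considered probability
print(f"implied probability: {p:.2f}")  # ~0.78, i.e. roughly 80%
```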
The Sun revolves around the Earth actually
The traditional story is that in olden times people were proudly stupid and thought the human animal lived at the centre of the universe, with all the planets, stars and the sun revolving around God's creation, made in His image. The church would send anybody who said the sun was at the centre to be burned at the stake. [1]
Except...
there is no absolute sense in which the sun is at the centre of the solar system [2]. It's simply a question of perspective, a choice of frame of reference.
1. Geocentrism is empirically grounded: it is literally what you see! Your lying eyes, the cold hard facts and 16th-century Matthew Barnett all agree: geocentrism is right.
The heliocentric point of view is a formal transformation of the data - a transformation with potentially heretical implications, a dangerous figment of the imagination...
The Ptolemaic model fits the data well. It's based on the elegant principle of iterated circles (epicycles). The heliocentrists love to talk about how their model is more mathematically elegant, but they haven't made a single prediction that the Ptolemaic model hasn't. And the orbits of the planets become ellipses, which have more free parameters than the astrally perfect circles.
2. Epicycles is Fourier analysis. No really!
"Venus in retrograde" nowadays evokes images of a cartoon gypsy woman hand reading but it is a real astronomical phenomenon that mystified early stargazers wherein Venus literally moves backward some of the time. In the Ptolemaic model this is explained by a small orbit (cycle) on a large orbit (cycle) going backward. In the heliocentric model it is a result of the earth moving around the sun rather than vice versa that causes the apparent backwards motion of Venus.
3. Epicycles can be computed FAST. For epicycles are the precarnation of GEARS. Gears, which are epicycles manifest.
The late Hellenistic age possibly had a level of science and engineering only reached again in late-17th-century Western Europe, almost 2000 years later. The most spectacular example of Hellenistic science is the Antikythera mechanism, a blob of bronze rust found by divers off the Greek coast. X-ray imaging revealed a hidden, mysterious mechanism, launching a decades-long sleuth-hunt for the answer. The final puzzle pieces were put together only very recently.
How does the Antikythera mechanism work? It is a Ptolemaic-esque model in which the epicycles are literally implemented by gears! In other words, the choice of epicycles wasn't some dogmatic adherence to the perfection of the circle; it was a pragmatic and clever engineering insight!
[1] Contra received history, the Catholic Church was relatively tolerant of astronomical speculation and financially supported a number of (competent) astronomers. It was only when Giordano Bruno made the obvious inference that if the planets revolve around the sun, and the sun is just another star, then... there might be other planets, with their own inhabitants... did Jesus visit all the alien children too?... that a line was crossed. Mr Bruno was burned at the stake.
[2] The centre of mass, not the sun. But the centre of mass lies inside the sun, so this is minor pedantry.
Wow, this sounds great! Any way to tune in from abroad?
epic picture.
Thanks a lot!
A few follow-up questions:
By computability level do you mean Turing degree?
Why can't the universal distribution be constructed for most levels?
What exactly is the coding theorem?
What do you mean by conditioning and planning damaging the computability level, and why is that not so bad?
I see, thank you for the clarification. I should have been more careful with mischaracterizing your views.
I do have a question or two about your views, if you would entertain me. You say humans will be economically obsolete and will 'retire', but there will still be trade between humans and AIs. Does trade here just mean humans consuming, i.e. trading money for AI goods and services? That doesn't sound like trade in the usual sense of a reciprocal exchange of goods and services.
How many 'different' AI individuals do you expect there to be?
Will there be >1 individual per solar system?
A commonly heard viewpoint on the development of AI states that AI will be economically impactful but will not upend the dominance of humans. Instead, AI and humans will flourish together, trading and cooperating with one another. This view is particularly popular with a certain kind of libertarian economist: Tyler Cowen, Matthew Barnett, Robin Hanson.
They share the curious conviction that the probability of AI-caused extinction, p(Doom), is negligible. They base this on analogies between AI and previous technological transitions, like the industrial revolution or the development of new communication media. A core assumption/argument is that AI will not disempower humanity because AIs will respect the existing legal system, apparently because they can gain from trade with humans.
The most extreme version of the GMU-economist view is Hanson's Age of Em, which hypothesizes radical change in the form of a new species of human-derived uploaded electronic people, who curiously hold just the same dreary office jobs as we do, but much faster.
Why is there trade & specialization in the first place?
Trade and specialization seem to matter mainly in a world where there are many individuals, those individuals have different skills and resources, and there is a limited ability to transfer skills.
Domain of Biology: direct copying of genes but not brains; yes recombination; no or very low bandwidth communication.
Result: highly adversarial, less cooperation, no planning, much specialization, not centralized, vastly many sovereign individuals.
Domain of Economics: no direct copying; yes recombination; medium-bandwidth communication.
Result: mildly adversarial, mostly cooperation, medium planning, much specialization, little centralization, many somewhat sovereign individuals.
!AIs can copy, share and merge their weights!
Domain of Future AI society: direct copying of brains and machines; yes recombination; very high bandwidth communication.
Result: ?minimally adversarial, ?very high cooperation, ?cosmic-scale mathematized planning, ?little specialization, ?high centralization, ?singleton sovereign individual.
It is often imagined that in a 'good' transhumanist future the sovereign AI will be like a loving and caring parent to the billions and trillions of uploads. In this case, while there is one all-powerful Sovereign entity, there are still many individuals who remain free and have their rights protected, perhaps through cryptographic incantations. The closest cultural artifact is Iain Banks' Culture series.
There is another, more radical foreboding, wherein the logic of ultra-high-bandwidth weight sharing is taken to its extreme and individuals merge into one transcendent hivemind.
AGI companies merging within next 2-3 years inevitable?
There are currently about a dozen major AI companies racing towards AGI, with many more minor ones. Given the way the technology is shaking out, this seems like an unstable equilibrium.
It seems by now inevitable that we will see further mergers and joint ventures; within two years there might be only two or three major players left. Scale is all-dominant. There is no magic sauce, no moat: OpenAI doesn't have algorithms that her competitors can't copy within 6-12 months. It's all leveraging compute. Whatever innovations smaller companies make can be easily stolen by the tech giants.
e.g. we might have xAI- Meta, Anthropic- DeepMind-SSI-Google, OpenAI-Microsoft-Apple.
Actually, although this would be deeply unpopular in EA circles, it wouldn't be all that surprising if Anthropic and OpenAI teamed up.
And - of course - a few years later we might only have two competitors: USA, China.
EDIT: the obvious thing to happen is that Nvidia realizes it can just build AI itself. If Taiwan is Dune and GPUs are the spice, then Nvidia is House Atreides.
There is a straightforward compmech take also. If the goal of the agent is simply to predict well (let's say the reward is directly tied to good prediction) on a sequential task AND it performs optimally, then we know it must contain the Mixed State Presentation (MSP) of the epsilon machine (causal states). Importantly, the MSP must be used if optimal prediction is achieved.
There is a variant, I think, that has not been worked out yet, but which we talked about briefly with Fernando and Vanessa in Manchester recently, for transducers/MDPs.
Yes that sounds great. Do you know more about what the limiting factors here are? I don't really buy the argument that there aren't enough smart people willing to fund or do slightly boring things.
The claim seems not to be "bio people should be doing something boring", whatever that boring thing is, but something much more specific: "good CROs are undersupplied".
yes very lukewarm take
also nice product placement nina
That's a good counterexample! Masks are dangerous and mysterious, but not cool in the way sunglasses are, I agree.
Shower thought: why are sunglasses cool?
Sunglasses create an asymmetry in the ability to discern emotions between the wearer and nonwearer. This implicitly makes the wearer less predictable, more mysterious, more dangerous and therefore higher in a dominance hierarchy.
This is not novel to Hanson; it's been a staple of (neo)reactionary/conservative thought for millennia.