Examples of Highly Counterfactual Discoveries?

post by johnswentworth · 2024-04-23T22:19:19.399Z · LW · GW · 16 comments

This is a question post.

Contents

  Answers
    118 kromem
    36 DirectedEvolution
    31 Garrett Baker
    23 Alexander Gietelink Oldenziel
    23 CronoDAS
    22 Jesse Hoogland
    19 Thomas Kwa
    16 Carl Feynman
    14 cousin_it
    14 lukehmiles
    12 cubefox
    11 junk heap homotopy
    9 Alexander Gietelink Oldenziel
    9 francis kafka
    9 johnswentworth
    6 FreakTakes
    6 Mateusz Bagiński
    5 transhumanist_atom_understander
    5 Jonas Hallgren
    5 Niclas Kupper
    3 martinkunev
    3 AnthonyC
    3 Jordan Taylor
    -2 shminux
16 comments

The history of science has tons of examples of the same thing being discovered multiple times independently; Wikipedia has a whole list of examples here. If your goal in studying the history of science is to extract the predictable/overdetermined component of humanity's trajectory, then it makes sense to focus on such examples.

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out. After all, if someone else would have figured it out shortly after anyways, then the discovery probably wasn't very counterfactually impactful.

Alas, nobody seems to have made a list of highly counterfactual scientific discoveries, to complement Wikipedia's list of multiple discoveries.

To that end: what are some examples of discoveries which nobody else was anywhere close to figuring out?

A few tentative examples to kick things off:

(Feel free to debate any of these, as well as others' examples.)

Answers

answer by kromem · 2024-04-23T23:16:56.017Z · LW(p) · GW(p)

Lucretius in De Rerum Natura in 50 BCE seemed to have a few that were just a bit ahead of everyone else.

Survival of the fittest (book 5):

"In the beginning, there were many freaks. Earth undertook Experiments - bizarrely put together, weird of look Hermaphrodites, partaking of both sexes, but neither; some Bereft of feet, or orphaned of their hands, and others dumb, Being devoid of mouth; and others yet, with no eyes, blind. Some had their limbs stuck to the body, tightly in a bind, And couldn't do anything, or move, and so could not evade Harm, or forage for bare necessities. And the Earth made Other kinds of monsters too, but in vain, since with each, Nature frowned upon their growth; they were not able to reach The flowering of adulthood, nor find food on which to feed, Nor be joined in the act of Venus.

For all creatures need Many different things, we realize, to multiply And to forge out the links of generations: a supply Of food, first, and a means for the engendering seed to flow Throughout the body and out of the lax limbs; and also so The female and the male can mate, a means they can employ In order to impart and to receive their mutual joy.

Then, many kinds of creatures must have vanished with no trace Because they could not reproduce or hammer out their race. For any beast you look upon that drinks life-giving air, Has either wits, or bravery, or fleetness of foot to spare, Ensuring its survival from its genesis to now."

Trait inheritance from both parents that could skip generations (book 4):

"Sometimes children take after their grandparents instead, Or great-grandparents, bringing back the features of the dead. This is since parents carry elemental seeds inside – Many and various, mingled many ways – their bodies hide Seeds that are handed, parent to child, all down the family tree. Venus draws features from these out of her shifting lottery – Bringing back an ancestor’s look or voice or hair. Indeed These characteristics are just as much the result of certain seed As are our faces, limbs and bodies. Females can arise From the paternal seed, just as the male offspring, likewise, Can be created from the mother’s flesh. For to comprise A child requires a doubled seed – from father and from mother. And if the child resembles one more closely than the other, That parent gave the greater share – which you can plainly see Whichever gender – male or female – that the child may be."

Objects of different weights will fall at the same rate in a vacuum (book 2):

“Whatever falls through water or thin air, the rate Of speed at which it falls must be related to its weight, Because the substance of water and the nature of thin air Do not resist all objects equally, but give way faster To heavier objects, overcome, while on the other hand Empty void cannot at any part or time withstand Any object, but it must continually heed Its nature and give way, so all things fall at equal speed, Even though of differing weights, through the still void.”

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me. In hindsight, they nailed so many huge topics that didn't end up emerging again for millennia that it was surely not mere chance, and the fact that they successfully hit so many nails on the head without the hammer we use today indicates (at least to me) that there's value to looking closer at their methodology.

Which was also super simple:

Step 1: Entertain all possible explanations for things, not prematurely discounting false negatives or embracing false positives.

Step 2: Look for where single explanations can explain multiple phenomena.

While we have a great methodology for testable hypotheses, the scientific method isn't very useful for untestable fields or topics. And in those cases, I suspect better understanding and appreciation for the Epicurean methodology might yield quite successful 'counterfactual' results (it's served me very well throughout the years, especially coupled with the identification of emerging research trends in things that can be evaluated with the scientific method).

comment by Garrett Baker (D0TheMath) · 2024-04-24T04:14:17.602Z · LW(p) · GW(p)

A precursor to Lucretius's thoughts on natural selection is Empedocles. Far fewer of his writings survive, but he is clearly a precursor to Lucretius's position. Lucretius himself cites & praises Empedocles on this subject.

Replies from: kromem
comment by kromem · 2024-04-25T01:17:39.018Z · LW(p) · GW(p)

Do you have a specific verse where you feel like Lucretius praised him on this subject? I only see that he praises him relative to other elementalists before tearing him and the rest apart for what he sees as erroneous thinking regarding their prior assertions around the nature of matter, saying:

"Yet when it comes to fundamentals, there they meet their doom. These men were giants; when they stumble, they have far to fall:"

(Book 1, lines 740-741)

I agree that he likely was a precursor to the later thinking in suggesting a compository model of life starting from pieces which combined into forms later on, but the lack of the source material makes it hard to truly assign credit.

It's kind of like how the Greeks claimed atomism originated with the much earlier Mochus of Sidon, but we credit Democritus because we don't have proof of Mochus at all but we do have the former's writings. We don't even so much credit Leucippus, Democritus's teacher, as much as his student for the same reasons, similar to how we refer to "Plato's theory of forms" and not "Socrates' theory of forms."

In any case, Lucretius oozes praise for Epicurus, comparing him to a god among men, and while he does say Empedocles was far above his contemporaries saying the same things he was, he doesn't seem overly deferential to his positions as much as criticizing the shortcomings in the nuances of their theories with a special focus on theories of matter. I don't think there's much direct influence on Lucretius's thinking around proto-evolution, even if there's arguably plausible influence on Epicurus's which in turn informed Lucretius.

Replies from: D0TheMath
comment by Garrett Baker (D0TheMath) · 2024-04-25T17:20:39.102Z · LW(p) · GW(p)

[edit: nevermind I see you already know about the following quotes. There's other evidence of the influence in Sedley's book I link below]

In De Rerum Natura around line 716:

Add, too, whoever make the primal stuff Twofold, by joining air to fire, and earth To water; add who deem that things can grow Out of the four- fire, earth, and breath, and rain; As first Empedocles of Acragas, Whom that three-cornered isle of all the lands Bore on her coasts, around which flows and flows In mighty bend and bay the Ionic seas, Splashing the brine from off their gray-green waves. Here, billowing onward through the narrow straits, Swift ocean cuts her boundaries from the shores Of the Italic mainland. Here the waste Charybdis; and here Aetna rumbles threats To gather anew such furies of its flames As with its force anew to vomit fires, Belched from its throat, and skyward bear anew Its lightnings' flash. And though for much she seem The mighty and the wondrous isle to men, Most rich in all good things, and fortified With generous strength of heroes, she hath ne'er Possessed within her aught of more renown, Nor aught more holy, wonderful, and dear Than this true man. Nay, ever so far and pure The lofty music of his breast divine Lifts up its voice and tells of glories found, That scarce he seems of human stock create.

Or for a more modern translation from Sedley's Lucretius and the Transformation of Greek Wisdom

Of these [sc. the four-element theorists] the foremost is Empedocles of Acragas, born within the three-cornered terrestrial coasts of the island [Sicily] around which the Ionian Sea, flowing with its great windings, sprays the brine from its green waves, and from whose boundaries the rushing sea with its narrow strait divides the coasts of the Aeolian land with its waves. Here is destructive Charybdis, and here the rumblings of Etna give warning that they are once more gathering the wrath of their flames so that her violence may again spew out the fire flung from her jaws and hurl once more to the sky the lightning flashes of flame. Although this great region seems in many ways worthy of admiration by the human races, and is said to deserve visiting for its wealth of good things and the great stock of men that fortify it, yet it appears to have had in it nothing more illustrious than this man, nor more holy, admirable, and precious. What is more, the poems sprung from his godlike mind call out and expound his illustrious discoveries, so that he scarcely seems to be born of mortal stock.

comment by Lukas_Gloor · 2024-04-24T12:54:24.578Z · LW(p) · GW(p)

Very cool! I used to think Hume was the most ahead of his time, but this seems like the same feat if not better.

Replies from: dr_s
comment by dr_s · 2024-04-25T07:46:45.889Z · LW(p) · GW(p)

Democritus also has a decent claim to that for being the first to imagine atoms and materialism altogether.

Replies from: kromem
comment by kromem · 2024-04-26T01:43:57.865Z · LW(p) · GW(p)

Though the Greeks actually credited the idea to an even earlier Phoenician, Mochus of Sidon.

Though when it comes to antiquity credit isn't really "first to publish" as much as "first of the last to pass the survivorship filter."

comment by francis kafka (kingofthenerdz3) · 2024-04-24T02:43:37.870Z · LW(p) · GW(p)

Have you read Michel Serres's The Birth of Physics? He suggests that the Epicureans and Lucretius in particular have worked out a serious theory of physics that's closer to thermodynamics and fluid mechanics than Newtonian physics

comment by Q Home · 2024-04-24T06:00:46.888Z · LW(p) · GW(p)

Often I see people dismiss the things the Epicureans got right with an appeal to their lack of the scientific method, which has always seemed a bit backwards to me.

The most important thing, I think, is not even hitting the nail on the head, but knowing (i.e. really acknowledging) that a nail can be hit in multiple places. If you know that, the rest is just a matter of testing.

Replies from: CuriousMeta
comment by CuriousMeta · 2024-04-30T07:04:45.345Z · LW(p) · GW(p)

~Don't aim for the correct solution, (first) aim for understanding the space of possible solutions

answer by DirectedEvolution · 2024-04-24T04:32:08.121Z · LW(p) · GW(p)

A singleton is hard to verify unless there was a long period of time after its discovery during which it was neglected, as in the case of Mendel.

Yet if your discovery is neglected in this way, the context in which it is eventually rediscovered matters as well. In Mendel's case, his laws were rediscovered by several other scientists decades later. Mendel got priority, but it still doesn't seem like his accomplishment had much of a counterfactual impact.

In the case of Shannon, Einstein, etc, it's possible their fields were "ripe and ready" for what they accomplished - as perhaps evidenced by the fact that their discoveries were accepted - and that they were simply plugged in enough to their research communities during a period of faster global dissemination of knowledge that any hot-on-heels competitors never quite got a chance to publish. But I don't know enough about these cases to be confident.

I can think of a couple cases in which I might be convinced of this sort of counterfactual impact from a scientific singleton:

  • All peers in a small, tight-knit research community explicitly stated none of them were even close (though even this is hard to trust - are they being gracious? how do they know their own students wouldn't have figured it out in another year's time?). Do we have any such testimonials for Shannon, Einstein, etc?
  • The discovery was actually lost, then discovered and immediately appreciated for its significance. Imagine a math proof written in a mathematician's papers, lost on their death, rediscovered in an antique shop 40 years later, and immediately heralded as a major advance - like if we'd found a proof by Fermat of Fermat's Last Theorem in an attic in 1950.
  • Money was the bottleneck. There are many places a billion dollars can be put into research. If somebody launches a billion-dollar research institute in an underfunded subject that's been languishing for decades and the institute they founded starts coming up with major technical advances, that's evidence it was a game-changer. Of course it's possible that billionaire put their money into the field because they had information that the research was coming to fruition and they wanted to get in on something hot, but I probably have more trouble believing they could make such a prediction so accurately than that their money made a counterfactual impact.

A discovery can also be "counterfactually important" even if it only speeds up science a bit and is only slightly a singleton. Let's say that every year, there's one important scientific discovery and a million unimportant ones, and the important ones must be discovered in sequence. If you discover 2025's important discovery in 2024, all the future important discoveries in the sequence also arrive a year earlier. If each discovery is worth $1 billion/year, then you've now created $1 billion counterfactual dollars per year every year as long as this model holds.
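The arithmetic in this toy model can be checked with a short sketch. The function name, the 0-indexed years, and the assumption that every discovery keeps paying out the same amount each year once it has been made are illustrative choices, not something the answer specifies.

```python
def cumulative_value(horizon_years, value_per_discovery_per_year,
                     head_start_years=0):
    """Total value accrued over the horizon in the toy model.

    Assumption: one important discovery arrives per year, in sequence, and
    each discovery keeps paying out the same amount every year after it is
    made. A head start shifts the whole sequence earlier by that many years.
    """
    total = 0
    for year in range(horizon_years):
        # Discoveries already made by this year, including any head start.
        discoveries_so_far = year + 1 + head_start_years
        total += discoveries_so_far * value_per_discovery_per_year
    return total


# Values are in billions of dollars; the 10-year horizon is arbitrary.
baseline = cumulative_value(horizon_years=10, value_per_discovery_per_year=1)
one_year_early = cumulative_value(horizon_years=10,
                                  value_per_discovery_per_year=1,
                                  head_start_years=1)
gain = one_year_early - baseline
```

Under these assumptions the gain is one discovery's yearly value for every year in the horizon, which is the "$1 billion per year, every year" claim in the paragraph above.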

answer by Garrett Baker · 2024-04-23T23:33:51.981Z · LW(p) · GW(p)

Possibly Watanabe's singular learning theory. The math is recent for math, but I think only like '70s recent, which is long given you're impressed by a 20-year math gap for Einstein. The first book was published in 2010, and the second in 2019, so possibly attributable to the deep learning revolution, but I don't know of anyone making the same math--except empirical stuff like the "neuron theory" of neural network learning which I was told about by you, empirical results like those here, and high-dimensional probability (which I haven't read, but whose cover alone indicates similar content).

comment by Leon Lang (leon-lang) · 2024-04-25T13:49:44.535Z · LW(p) · GW(p)

I guess (but don't know) that most people who downvote Garrett's comment overupdated on intuitive explanations of singular learning theory, not realizing that entire books with novel and nontrivial mathematical theory have been written on it. 

comment by tailcalled · 2024-04-25T06:20:40.133Z · LW(p) · GW(p)

Isn't singular learning theory basically just another way of talking about the breadth of optima?

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T10:50:11.740Z · LW(p) · GW(p)

Singular Learning Theory is another way of "talking about the breadth of optima" in the same sense that Newton's Universal Law of Gravitation is another way of "talking about Things Falling Down". 

Replies from: tailcalled
comment by tailcalled · 2024-04-25T13:39:55.474Z · LW(p) · GW(p)

Newton's Universal Law of Gravitation was the first highly accurate model of things falling down that generalized beyond the earth, and it is also the second-most computationally applicable model of things falling down that we have today.

Are you saying that singular learning theory was the first highly accurate model of breadth of optima, and that it's one of the most computationally applicable ones we have?

Replies from: alexander-gietelink-oldenziel, Lblack, Algon
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T15:42:56.687Z · LW(p) · GW(p)

Did I just say SLT is the Newtonian gravity of deep learning? Hubris of the highest order!

But also yes... I think I am saying that

  • Singular Learning Theory is the first highly accurate model of breadth of optima.
    •  SLT tells us to look at a quantity Watanabe calls $\lambda$, which has the highly technical name 'real log canonical threshold' (RLCT). He proves several equivalent ways to describe it, one of which is as the (fractal) volume scaling dimension around the optima.
    • By computing simple examples (see Shaowei's guide in the links below) you can check for yourself how the RLCT picks up on basin broadness.
    • The RLCT = first-order term for in-distribution generalization error and also Bayesian learning (technically the 'Bayesian free energy').  This justifies the name of 'learning coefficient' for lambda. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions. 
    • Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won't be going into it but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks or any kind of hierarchical structure is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about Laplace approximation.
  • It's one of the most computationally applicable ones we have? Yes. SLT quantities like the RLCT can be analytically computed for many statistical models of interest, correctly predicts phase transitions in toy neural networks and it can be estimated at scale.

EDIT: no hype about future work. Wait and see ! :)

Replies from: Lblack, tailcalled
comment by Lucius Bushnaq (Lblack) · 2024-04-25T23:12:32.383Z · LW(p) · GW(p)

The RLCT = first-order term for in-distribution generalization error
 

Clarification: The 'derivation' for how the RLCT predicts generalization error IIRC goes through the same flavour of argument as the one the derivation of the vanilla Bayesian Information Criterion uses. I don't like this derivation very much. See e.g. this one on Wikipedia. 

So what it's actually showing is just that:

  1. If you've got a class of different hypotheses $M$, containing many individual hypotheses $h_1, \dots, h_N$.
  2. And you've got a prior ahead of time that says the chance any one of the hypotheses in $M$ is true is some number $P(M)$, let's say it's  as an example.
  3. And you distribute this total probability $P(M)$ around the different hypotheses in an even-ish way, so $P(h_i) \approx P(M)/N$, roughly.
  4. And then you encounter a bunch of data $D$ (the training data) and find that only one or a tiny handful of hypotheses in $M$ fit that data, so $P(D \mid h_i) \neq 0$ for basically only one hypothesis $h_i$...
  5. Then your posterior probability $P(h_i \mid D)$ that the hypothesis $h_i$ is correct will probably be tiny, scaling with $1/N$. If we spread your prior $P(M)$ over lots of hypotheses, there isn't a whole lot of prior to go around for any single hypothesis. So if you then encounter data that discredits all hypotheses in $M$ except one, that tiny bit of spread-out prior for that one hypothesis will make up a tiny fraction of the posterior, unless $P(D \mid \neg M)$ is really small, i.e. no hypothesis outside the set $M$ can explain the data either.

So if our hypotheses correspond to different function fits (one for each parameter configuration, meaning we'd have $2^{kn}$ hypotheses if our function fits used $n$ $k$-bit floating point numbers), the chance we put on any one of the function fits being correct will be tiny. So having more parameters is bad, because the way we picked our prior means our belief in any one hypothesis goes to zero as the number of hypotheses $N$ goes to infinity.
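The argument above can be checked numerically with a minimal sketch. The function name, the argument names, and all the numbers used (a class prior of 0.5, a likelihood of 0.01 for hypotheses outside the class) are illustrative assumptions, not values from the comment. It splits the class prior evenly over the hypotheses, assumes exactly one of them fits the data, and applies Bayes' rule.

```python
def posterior_of_survivor(n_hypotheses, prior_on_class, likelihood_survivor,
                          likelihood_outside_class):
    """Posterior of the single hypothesis in the class that fits the data."""
    # Even-ish prior: the class prior is split equally among its hypotheses.
    prior_per_hypothesis = prior_on_class / n_hypotheses
    # Only one hypothesis in the class fits the data, so the evidence has two
    # terms: that survivor, and everything outside the class.
    evidence = (likelihood_survivor * prior_per_hypothesis
                + likelihood_outside_class * (1.0 - prior_on_class))
    return likelihood_survivor * prior_per_hypothesis / evidence


# Hypothetical numbers, chosen only to show the trend.
small_class = posterior_of_survivor(10, 0.5, 1.0, 0.01)
large_class = posterior_of_survivor(10_000, 0.5, 1.0, 0.01)
```

With these made-up inputs the surviving hypothesis keeps most of the posterior when the class has 10 members and almost none when it has 10,000, unless the likelihood outside the class is set to zero, in which case the survivor takes all of it regardless of the class size.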

So the Wikipedia derivation for the original vanilla posterior of model selection is telling us that having lots of parameters is bad, because it means we're spreading our prior around exponentially many hypotheses.... if we have the sort of prior that says all the hypotheses are about equally likely. 

But that's an insane prior to have! We only have $1$ worth of probability to go around, and there's an infinite number of different hypotheses. Which is why you're supposed to assign prior based on K-complexity, or at least something that doesn't go to zero as the number of hypotheses goes to infinity. The derivation is just showing us how things go bad if we don’t do that.

In summary: badly normalised priors behave badly

SLT mostly just generalises this derivation to the case where parameter configurations in our function fits don't line up one-to-one with hypotheses.

It tells us that if we are spreading our prior around evenly over lots of parameter configurations, but exponentially many of these parameter configurations are secretly just re-expressing the same hypothesis, then that hypothesis can actually get a decent amount of prior, even if the total number of parameter configurations is exponentially large.
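A toy version of this counting argument, as a sketch only: the two-parameter model x -> a*b*x, the small integer grid, and the function name are all assumptions made for illustration, and the degeneracy here is modest rather than exponential. A uniform prior over parameter pairs is pushed forward to a prior over the functions those pairs implement.

```python
from collections import Counter
from fractions import Fraction
from itertools import product


def hypothesis_prior(parameter_values):
    """Push a uniform prior over (a, b) forward to the functions x -> a*b*x."""
    configs = list(product(parameter_values, repeat=2))
    # Two configurations implement the same function exactly when a*b agrees.
    counts = Counter(a * b for a, b in configs)
    return {slope: Fraction(n, len(configs)) for slope, n in counts.items()}


# Parameters a and b each range over -2, -1, 0, 1, 2 (25 configurations).
prior = hypothesis_prior(range(-2, 3))
```

In this toy grid the zero function is implemented by 9 of the 25 configurations while the function with slope 1 is implemented by only 2, so an even prior over parameters is far from even over hypotheses.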

So our prior over hypotheses in that case is actually somewhat well-behaved in that it can end up normalised properly when we take the number of parameter configurations to infinity. That is a basic requirement a sane prior needs to have [LW · GW], so we're at least not completely shooting ourselves in the foot anymore. But that still doesn't show why this prior, that neural networks sort of[1] implicitly have, is actually good. Just that it's no longer obviously wrong in this specific way.

Why does this prior apparently make decent-ish predictions in practice? That is, why do neural networks generalise well? 

I dunno. SLT doesn't say. It just tells us how the parameter prior to hypothesis prior conversion ratio works, and in the process shows us that neural network priors can be at least somewhat sanely normalised for large numbers of parameters. More than we might have initially thought at least. 

That's all though. It doesn't tell us anything else about what makes a Gaussian over transformer parameter configurations a good starting guess for how the universe works.

How to make this story tighter?

If people aim to make further headway on the question of why some function fits generalise somewhat and others don't, beyond: 'Well, standard Bayesianism suggests you should at least normalise your prior so that having more hypotheses isn't actively bad', then I'd suggest a starting point might be to make a different derivation for the posterior on the fits that isn't trying to reason about $P(M)$ defined as the probability that one of the function fits is 'true' in the sense of exactly predicting the data. Of course none of them are. We know that. When we fit a  billion parameter transformer to internet data, we don't expect going in that any of these  parameter configurations will give zero loss up to quantum noise on any and all text prediction tasks in the universe until the end of time. Under that definition of $P(M)$, which the SLT derivation of the posterior and most other derivations of this sort I've seen seem to implicitly make, we basically have $P(M) \approx 0$ going in! Maybe look at the Bayesian posterior for a set of hypotheses we actually believe in at all before we even see any data, like  .

SLT in three sentences

'You thought your choice of prior was broken because it's not normalised right, and so goes to zero if you hand it too many hypotheses. But you missed that the way you count your hypotheses is also broken, and the two mistakes sort of cancel out. Also here's a bunch of algebraic geometry that sort of helps you figure out what probabilities your weirdo prior actually assigns to hypotheses, though that part's not really finished'.

SLT in one sentence

'Loss basins with bigger volume will have more posterior probability if you start with a uniform-ish prior over parameters, because then bigger volumes get more prior, duh.'

 

 

  1. ^

    Sorta, kind of, arguably. There's some stuff left to work out here. For example vanilla SLT doesn't even actually tell you which parts of your posterior over parameters are part of the same hypothesis. It just sort of assumes that everything left with support in the posterior after training is part of the same hypothesis, even though some of these parameter settings might generalise totally differently outside the training data. My guess is that you can avoid having to match this up by comparing equivalence over all possible inputs, by instead checking which parameter settings give the same hidden representations over the training data, not just the same outputs.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-26T19:53:49.122Z · LW(p) · GW(p)

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.

EDIT: I have now changed my mind about this, not least because of Lucius's influence. I currently think Bushnaq's padding argument suggests that the essence of SLT is that the uniform prior on codes is equivalent to the Solomonoff prior through overparameterized and degenerate codes; SLT is a way to quantitatively study this phenomenon, especially for continuous models.

The story that symmetries mean that the parameter-to-function map is not injective is true but already well-understood outside of SLT. It is a common misconception that this is what SLT amounts to.

To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.

The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.

I don't have the time to recap this story here.

Replies from: mattmacdermott, Lblack
comment by mattmacdermott · 2024-04-26T21:16:47.551Z · LW(p) · GW(p)

Lucius-Alexander SLT dialogue?

comment by Lucius Bushnaq (Lblack) · 2024-04-27T07:07:55.122Z · LW(p) · GW(p)

I would not say that the central insight of SLT is about priors. Under weak conditions the prior is almost irrelevant. Indeed, the RLCT is independent of the prior under very weak nonvanishing conditions.

I don't think these conditions are particularly weak at all. Any prior that fulfils them is a prior that would not be normalised right if the parameter-function map were one-to-one. 

It's a kind of prior people like to use a lot, but that doesn't make it a sane choice. 

A well-normalised prior for a regular model probably doesn't look very continuous or differentiable in this setting, I'd guess.

To be sure - generic symmetries are seen by the RLCT. But these are, in some sense, the uninteresting ones. The interesting thing is the local singular structure and its unfolding in phase transitions during training.

The generic symmetries are not what I'm talking about. There are symmetries in neural networks that are neither generic, nor only present at finite sample size. These symmetries correspond to different parametrisations that implement the same input-output map. Different regions in parameter space can differ in how many of those equivalent parametrisations they have, depending on the internal structure of the networks at that point.

The issue of the true distribution not being contained in the model is called 'unrealizability' in Bayesian statistics. It is dealt with in Watanabe's second 'green' book. Nonrealizability is key to the most important insight of SLT contained in the last sections of the second to last chapter of the green book: algorithmic development during training through phase transitions in the free energy.

I know it 'deals with' unrealizability in this sense, that's not what I meant. 

I'm not talking about the problem of characterising the posterior right when the true model is unrealizable. I'm talking about the problem where the actual logical statement we defined our prior and thus our free energy relative to is an insane statement to make and so the posterior you put on it ends up negligibly tiny compared to the probability mass that lies outside the model class. 

But looking at the green book, I see it's actually making very different, stat-mech style arguments that reason about the KL divergence between the true distribution and the guess made by averaging the predictions of all models in the parameter space according to their support in the posterior. I'm going to have to translate more of this into Bayes to know what I think of it.
 

comment by tailcalled · 2024-04-25T19:11:28.969Z · LW(p) · GW(p)
  • The RLCT = first-order term for in-distribution generalization error and also Bayesian learning (technically the 'Bayesian free energy').  This justifies the name of 'learning coefficient' for lambda. I emphasize that these are mathematically precise statements that have complete proofs, not conjectures or intuitions. 

Link(s) to your favorite proof(s)?

Also, do these match up with empirical results?

  • Knowing a little SLT will inoculate you against many wrong theories of deep learning that abound in the literature. I won't be going into it but suffice to say that any paper assuming that the Fisher information metric is regular for deep neural networks or any kind of hierarchical structure is fundamentally flawed. And you can be sure this assumption is sneaked in all over the place. For instance, this is almost always the case when people talk about Laplace approximation.

I have a cached belief that the Laplace approximation is also disproven by ensemble studies, so I don't really need SLT to inoculate me against that. I'd mainly be interested if SLT shows something beyond that.

it can be estimated at scale.

As I read the empirical formulas in this paper, they're roughly saying that a network has a high empirical learning coefficient if an ensemble of models that are slightly less trained on average have a worse loss than the network.

But then so they don't have to retrain the models from scratch, they basically take a trained model, and wiggle it around using Gaussian noise while retraining it.

This seems like a reasonable way to estimate how locally flat the loss landscape is. I guess there's a question of how much the devil is in the details; like whether you need SLT to derive an exact formula that works.
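As a rough illustration of this kind of estimate, here is a sketch on one-dimensional toy losses. Everything in it is an assumption rather than the linked paper's method: the estimator form n_beta * (E[L(w)] - L(w*)), the name `n_beta` for the product of sample size and inverse temperature, and the toy losses. It also swaps the noisy retraining described above for an exact grid sum over a density proportional to exp(-n_beta * L(w)), which is only feasible because the toy has a single parameter.

```python
import math


def estimate_learning_coefficient(loss, n_beta=1000.0, radius=1.0,
                                  n_grid=20001):
    """Estimate n_beta * (E[L(w)] - L(w*)) around the minimum w* = 0.

    The expectation is over a density proportional to exp(-n_beta * L(w)),
    evaluated by a Riemann sum on a uniform grid over [-radius, radius].
    """
    step = 2.0 * radius / (n_grid - 1)
    weighted_loss = 0.0
    normaliser = 0.0
    for i in range(n_grid):
        w = -radius + i * step
        weight = math.exp(-n_beta * loss(w))
        weighted_loss += loss(w) * weight
        normaliser += weight
    expected_loss = weighted_loss / normaliser
    return n_beta * (expected_loss - loss(0.0))


# A regular minimum (w^2) and a flatter, degenerate one (w^4).
regular = estimate_learning_coefficient(lambda w: w ** 2)
degenerate = estimate_learning_coefficient(lambda w: w ** 4)
```

For a loss of the form w^k the quantity works out analytically to 1/k, so the quadratic gives about 1/2 and the quartic about 1/4: the flatter basin gets the smaller coefficient because perturbed models lose less there.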


I guess I'm still not super sold on it, but on reflection that's probably partly because I don't have any immediate need for computing basin broadness. Like I find the basin broadness theory nice to have as a model, but now that I know about it, I'm not sure why I'd want/need to study it further.

There was a period where I spent a lot of time thinking about basin broadness. I guess I eventually abandoned it because I realized the basin was built out of a bunch of sigmoid functions layered on top of each other, but the generalization was really driven by the neural tangent kernel, which in turn is mostly driven by the Jacobian of the network outputs for the dataset as a function of the weights, which in turn is mostly driven by the network activations. I guess it's plausible that SLT has the best quantities if you stay within the basin broadness paradigm. 🤔

Replies from: alexander-gietelink-oldenziel
comment by Lucius Bushnaq (Lblack) · 2024-04-25T20:30:17.110Z · LW(p) · GW(p)

It's measuring the volume of points in parameter space with loss < ε when ε is infinitesimal.

This is slightly tricky because it doesn't restrict itself to bounded parameter spaces,[1] but you can fix it with a technicality by considering how the volume scales with ε instead.

In real networks trained with finite amounts of data, you care about the case where ε is small but finite, so this is ultimately inferior to just measuring how many configurations of floating point numbers get loss < ε, if you can manage that.

I still think SLT has some neat insights that helped me deconfuse myself about networks.

For example, like lots of people, I used to think you could maybe estimate the volume of basins with loss < ε using just the eigenvalues of the Hessian. You can't. At least not in general.
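
A small sketch of why the Hessian is not enough, under assumptions chosen purely for illustration: take two toy losses, w1² + w2⁴ and w1² + w2⁶. Both have the same Hessian at the origin (eigenvalues 2 and 0), yet the area of the region with loss below a small threshold ε scales with a different power of ε in each case. The function names and the slicing scheme are hypothetical helpers, not anything from the SLT literature.

```python
import math

def sublevel_area(power, eps, slices=20_000):
    """Area of {(w1, w2): w1**2 + |w2|**power < eps}, by midpoint slices over w2."""
    w2_max = eps ** (1.0 / power)
    h = w2_max / slices
    half = 0.0
    for i in range(slices):
        w2 = (i + 0.5) * h
        # For fixed w2, w1 ranges over an interval of this total width.
        half += 2.0 * math.sqrt(eps - w2 ** power) * h
    return 2.0 * half   # the w2 < 0 half is a mirror image

def scaling_exponent(power, eps_hi=1e-2, eps_lo=1e-4):
    """Slope of log(area) against log(eps) between two thresholds."""
    ratio = sublevel_area(power, eps_hi) / sublevel_area(power, eps_lo)
    return math.log(ratio) / math.log(eps_hi / eps_lo)

for power in (4, 6):
    print(power, round(scaling_exponent(power), 3))
```

The fitted exponents are 1/2 + 1/4 and 1/2 + 1/6 respectively, so the two basins shrink at different rates as ε goes to zero even though their Hessians at the minimum are identical.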

 

  1. ^

    Like the floating point numbers in a real network, which can only get so large. A prior of finite width over the parameters also effectively bounds the space

comment by Algon · 2024-04-25T18:48:33.312Z · LW(p) · GW(p)

Second most? What's the first? Linearization of a Newtonian V(r) about the earth's surface?

Replies from: tailcalled
comment by tailcalled · 2024-04-25T19:12:27.023Z · LW(p) · GW(p)

Yes.

answer by Alexander Gietelink Oldenziel · 2024-04-25T11:58:16.462Z · LW(p) · GW(p)
  • Scott Garrabrant's discovery of Logical Inductors. 

I remember hearing about the paper from a friend and thinking it couldn't possibly be true in a non-trivial sense. To someone with even a modicum of experience in logic, a computable procedure assigning probabilities to arbitrary logical statements in a natural way is sure to hit a no-go diagonalization barrier.

Logical Inductors get around the diagonalization barrier in a very clever way. I won't spoil how they do it here. I recommend that the interested reader watch Andrew Critch's talk on Logical Induction.

It was the main thing that convinced me that MIRI != clowns but were doing substantial research.

The Logical Induction paper has a fairly thorough discussion of previous work. Relevant previous work to mention is de Finetti's on betting and probability, previous work by MIRI & associates (Herreshoff, Taylor, Christiano, Yudkowsky...), the work of Shafer-Vovk on financial interpretations of probability & Shafer's work on aggregation of experts. There is also a field which doesn't have a clear name that studies various forms of expert aggregation. Overall, my best judgement is that nobody else was close before Garrabrant.

  • The Antikythera artifact: a Hellenistic Computer.  
    • You probably learned heliocentrism = good, geocentrism = bad, Copernicus-Kepler-Newton = good, epicycles = bad. But geocentric models and heliocentric models are equivalent; it's just that Kepler's & Newton's laws are best expressed in a heliocentric frame. However, the raw observational data is actually collected in a geocentric frame. Geocentric models stay closer to the data in some sense.
    • Epicyclic theory is now considered bad, an example of people refusing to see the light of scientific revolution. But actually, it was an enormous innovation. Using high-precision gearing, epicycles could actually be implemented on a (Hellenistic) computer, implicitly doing Fourier analysis to predict the motion of the planets. Astounding.
    • A Roman author (Pliny the Elder?) describes a similar device in the possession of Archimedes. It seems likely that Archimedes or one or more close contemporaries designed the artifact and that several were made in Rhodes.
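
The equivalence claim above can be checked numerically in a simplified setting. The sketch below assumes circular, coplanar orbits (real orbits are elliptical) and uses rounded values for Earth and Mars. It compares "Mars minus Earth" in heliocentric coordinates with a geocentric deferent-plus-epicycle model written as a sum of uniformly rotating vectors, which has the same form as a two-term Fourier-type series. All names here are illustrative.

```python
import cmath
import math

# Illustrative values: circular, coplanar orbits, in AU and years.
R_EARTH, P_EARTH = 1.0, 1.0
R_MARS, P_MARS = 1.52, 1.88

def heliocentric(radius, period, t):
    """Position on a circular orbit around the Sun, as a complex number."""
    return radius * cmath.exp(2j * math.pi * t / period)

def mars_seen_from_earth(t):
    """Heliocentric description: subtract Earth's position from Mars's."""
    return heliocentric(R_MARS, P_MARS, t) - heliocentric(R_EARTH, P_EARTH, t)

def epicycle_model(t, circles):
    """Geocentric description: a sum of uniformly rotating vectors.

    circles is a list of (radius, period, phase) tuples.
    """
    return sum(r * cmath.exp(1j * (2 * math.pi * t / p + phase))
               for r, p, phase in circles)

# The deferent carries Mars's period; the epicycle carries Earth's,
# half a turn out of phase (which supplies the minus sign).
MARS_CIRCLES = [(R_MARS, P_MARS, 0.0), (R_EARTH, P_EARTH, math.pi)]

worst_gap = max(
    abs(mars_seen_from_earth(t / 100) - epicycle_model(t / 100, MARS_CIRCLES))
    for t in range(-500, 500)
)
# At t = 0 the planets are lined up (opposition); check the apparent
# longitude is decreasing there, i.e. Mars appears to move backwards.
retrograde_at_opposition = (
    cmath.phase(mars_seen_from_earth(0.01)) < cmath.phase(mars_seen_from_earth(-0.01))
)
print(worst_gap, retrograde_at_opposition)
```

Under these assumptions the two descriptions agree to floating-point precision, and the two-circle model reproduces retrograde motion at opposition without any reference to the Sun.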

Actually, since we're on the subject of scientific discoveries 

  • Discovery & description of the complete Antikythera mechanism. The actual artifact that was found is just a rusty piece of bronze. Nobody knew how it worked. There were several sequential discoveries over multiple decades that eventually led to the complete solution of the mechanism. The final pieces were found just a few years ago. An astounding scientific achievement. Here is an amazing documentary on the subject: 
comment by cousin_it · 2024-05-06T08:44:16.743Z · LW(p) · GW(p)

I think Diffractor's post [AF · GW] shows that logical induction does hit a certain barrier, which isn't quite diagonalization, but seems to me about as troublesome:

As the trader goes through all sentences, its best-case value will be unbounded, as it buys up larger and larger piles of sentences with lower and lower prices. This behavior is forbidden by the logical induction criterion... This doesn't seem like much, but it gets extremely weird when you consider that the limit of a logical inductor, P_inf, is a constant distribution, and by this result, isn't a logical inductor! If you skip to the end and use the final, perfected probabilities of the limit, there's a trader that could rack up unboundedly high value!

answer by CronoDAS · 2024-04-24T23:24:04.913Z · LW(p) · GW(p)

Antonie van Leeuwenhoek, known as the Father of Microbiology, made the first microscopes capable of seeing microorganisms and is credited as the person who discovered them. He kept his lensmaking techniques secret, however, and microscopes capable of the same magnification didn't become generally available until many, many years later.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T10:29:47.003Z · LW(p) · GW(p)

Yes, beautiful example! Van Leeuwenhoek was the one-man ASML of the 17th century. In this case, we actually have evidence of the counterfactual impact, as other lensmakers trailed van Leeuwenhoek by many decades.

It's plausible that high-precision measurement and fabrication is the key bottleneck in most technological and scientific progress; it's difficult to oversell the importance of van Leeuwenhoek.

Antonie van Leeuwenhoek made more than 500 optical lenses. He also created at least 25 single-lens microscopes, of differing types, of which only nine have survived. These microscopes were made of silver or copper frames, holding hand-made lenses. Those that have survived are capable of magnification up to 275 times. It is suspected that Van Leeuwenhoek possessed some microscopes that could magnify up to 500 times. Although he has been widely regarded as a dilettante or amateur, his scientific research was of remarkably high quality.[39]

The single-lens microscopes of Van Leeuwenhoek were relatively small devices, the largest being about 5 cm long.[40][41] They are used by placing the lens very close in front of the eye. The other side of the microscope had a pin, where the sample was attached in order to stay close to the lens. There were also three screws to move the pin and the sample along three axes: one axis to change the focus, and the two other axes to navigate through the sample.

Van Leeuwenhoek maintained throughout his life that there are aspects of microscope construction "which I only keep for myself", in particular his most critical secret of how he made the lenses.[42] For many years no one was able to reconstruct Van Leeuwenhoek's design techniques, but in 1957, C. L. Stong used thin glass thread fusing instead of polishing, and successfully created some working samples of a Van Leeuwenhoek design microscope.[43] Such a method was also discovered independently by A. Mosolov and A. Belkin at the Russian Novosibirsk State Medical Institute.[44] In May 2021 researchers in the Netherlands published a non-destructive neutron tomography study of a Leeuwenhoek microscope.[22] One image in particular shows a Stong/Mosolov-type spherical lens with a single short glass stem attached (Fig. 4). Such lenses are created by pulling an extremely thin glass filament, breaking the filament, and briefly fusing the filament end. The nuclear tomography article notes this lens creation method was first devised by Robert Hooke rather than Leeuwenhoek, which is ironic given Hooke's subsequent surprise at Leeuwenhoek's findings.

answer by Jesse Hoogland · 2024-04-24T04:17:02.290Z · LW(p) · GW(p)

If you'll allow linguistics, Pāṇini was two and a half thousand years ahead of modern descriptive linguists.

answer by Thomas Kwa · 2024-04-24T02:04:00.729Z · LW(p) · GW(p)

Maybe Galois with group theory? He died in 1832, but his work was only published in 1846, upon which it kicked off the development of group theory, e.g. with Cayley's 1854 paper defining a group. Claude writes that there was not much progress in the intervening years:

The period between Galois' death in 1832 and the publication of his manuscripts in 1846 did see some developments in the theory of permutations and algebraic equations, which were important precursors to group theory. However, there wasn't much direct progress on what we would now recognize as group theory.

Some notable developments in this period:

1. Cauchy's work on permutations in the 1840s further developed the idea of permutation groups, which he had first explored in the 1820s. However, Cauchy did not develop the abstract group concept.

2. Plücker's 1835 work on geometric transformations and his introduction of homogeneous coordinates laid some groundwork for the later application of group theory to geometry.

3. Eisenstein's work on cyclotomy and cubic reciprocity in the 1840s involved ideas related to permutations and roots of unity, which would later be interpreted in terms of group theory.

4. Abel's work on elliptic functions and the insolubility of the quintic equation, while published earlier, continued to be influential in this period and provided important context for Galois' ideas.

However, none of these developments directly anticipated Galois' fundamental insights about the structure of solutions to polynomial equations and the corresponding groups of permutations. The abstract concept of a group and the idea of studying groups in their own right, independent of their application to equations, did not really emerge until after Galois' work became known.

So while the 1832-1846 period saw some important algebraic developments, it seems fair to say that Galois' ideas on group theory were not significantly advanced or paralleled during this time. The relative lack of progress in these 14 years supports the view of Galois' work as a singular and ahead-of-its-time discovery.

answer by Carl Feynman · 2024-04-25T01:30:37.396Z · LW(p) · GW(p)

Wegener’s theory of continental drift was decades ahead of its time. He published in the 1920s, but plate tectonics didn’t take over until the 1960s.  His theory was wrong in important ways, but still.

answer by cousin_it · 2024-04-24T08:31:33.776Z · LW(p) · GW(p)

I sometimes had this feeling from Conway's work. In particular, combinatorial game theory and surreal numbers feel to me closer to mathematical invention than mathematical discovery. These kinds of things are also often "leaf nodes" on the tree of knowledge, not leading to many followup discoveries, so you could say their counterfactual impact is low for that reason.

In engineering, the best example I know is vulcanization of rubber. It has had a huge impact on today's world, but Goodyear developed it by working alone for decades, when nobody else was looking in that direction.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-24T10:10:33.824Z · LW(p) · GW(p)

It's not inconceivable, I would even say plausible, that the impact of surreal numbers & combinatorial game theory is still in the future.

answer by lukehmiles · 2024-04-24T07:51:57.925Z · LW(p) · GW(p)

Pasteur had (also highly "counterfactual") help I think! Ignaz Semmelweis worked in this maternity ward where the women & babies kept dying.  The hospital had opened up some investigations over the years as to the cause of death but kept closing them with garbage explanations. He went somewhere else for a while and when he got back he noticed that the death numbers were down in his absence. Then he noticed his hands smelled like death after one of his routine autopsies and he was about to go plunge them in some poor mother! He had washed them but just with regular soap. If he put some bleach in the washwater then his hands didn't stink. He connected the dots. He had killed hundreds of mothers & babies but wrote a book about it anyway and thereby popularized disinfection (and strongly suggested the root cause of disease).

Probably the main reason that germ theory took so long to work out is that the people with the right evidence were too guilty and ashamed to share it. 

answer by cubefox · 2024-04-24T19:26:19.056Z · LW(p) · GW(p)

That the earth is a sphere:

Today, we have lost sight of how counter-intuitive it is to believe the earth is not flat. Its spherical shape has been discovered just once, in Athens in the fourth century BC. The earliest extant reference to it being a globe is found in Plato’s Phaedo, while Aristotle’s On the Heavens contains the first examination of the evidence. Everyone who has ever known the earth is round learnt it indirectly from Aristotle.

Thus begins "The Clash Between the Jesuits and Traditional Chinese Square-Earth Cosmology". The article tells the dramatic story of how some Jesuits tried to establish the spherical-Earth theory in 16th century China, where it was still unknown, partly by creating an elaborate world map to gain the trust of the emperor.

They were ultimately not successful, and the spherical-Earth theory only gained influence in China when Western texts were increasingly translated into Chinese more than two thousand years after the theory was originally invented.

Which makes it a good candidate for one of the most non-obvious / counterfactual theories in history.

comment by Garrett Baker (D0TheMath) · 2024-04-24T19:49:50.898Z · LW(p) · GW(p)

I find this very hard to believe. Shouldn't Chinese merchants have figured out eventually, traveling long distances using maps, that the Earth was a sphere? I wonder whether the "scholars" of ancient China actually represented the state-of-the-art practical knowledge that the Chinese had.

Nevertheless, I don't think this is all that counterfactual. If you're obsessed with measuring everything, and like to travel (like the Greeks), I think eventually you'll have to discover this fact.

Replies from: ChristianKl, cubefox
comment by ChristianKl · 2024-04-25T17:52:12.467Z · LW(p) · GW(p)

Merchants were a lot weaker in China than in Europe. Chinese merchants also made far fewer sea voyages due to geography.

If a bunch of low-status merchants believed that the Earth is a sphere, it might not have influenced Chinese high-class beliefs in the same way as the beliefs of politically powerful merchants in Europe.

comment by cubefox · 2024-04-24T20:29:49.722Z · LW(p) · GW(p)

I see no reason to doubt that the article is accurate. Why would Chinese scholars completely miss the theory if it was obvious among merchants? There should in any case exist some records of it, some maps. Yet none exist. And why would it even be obvious that the Earth is a sphere from long distance travel alone?

Nevertheless, I don't think this is all that counterfactual. If you're obsessed with measuring everything, and like to travel (like the Greeks), I think eventually you'll have to discover this fact.

I don't think this makes sense. If the Chinese didn't reinvent the theory in more than two thousand years, this makes it highly "counterfactual". The longer a theory isn't reinvented, the less obvious it must be.

Replies from: dr_s
comment by dr_s · 2024-04-25T07:56:46.980Z · LW(p) · GW(p)

Maybe it's the other way around, and it's the Chinese elite who was unusually and stubbornly conservative on this, trusting the wisdom of their ancestors over foreign devilry (would be a pretty Confucian thing to do). The Greeks realised the Earth was round from things like seeing sails appear over the horizon. Any sailing peoples thinking about this would have noticed sooner or later.

Kind of a long shot, but did Polynesian people have ideas on this, for example?

Replies from: cubefox
comment by cubefox · 2024-04-25T11:59:16.917Z · LW(p) · GW(p)

There is a large difference between sooner and later. Highly non-obvious ideas will be discovered later, not sooner. The fact that China didn't rediscover the theory in more than two thousand years means that the ability to sail the ocean didn't make it obvious.

Kind of a long shot, but did Polynesian people have ideas on this, for example?

As far as we know, nobody did, except for early Greece. There is some uncertainty about India, but these sources are dated later and from a time when there was already some contact with Greece, so they may have learned it from them.

Replies from: dr_s
comment by dr_s · 2024-04-26T11:38:22.162Z · LW(p) · GW(p)

Well, it's hard to tell because most other civilizations at the required level of wealth to discover this (by which I mean both sailing and surplus enough to have people who worry about the shape of the Earth at all) could one way or another have learned it via osmosis from Greece. If you only have essentially two examples, how do you tell whether it was the one who discovered it who was unusually observant rather than the one who didn't who was unusually blind? But it's an interesting question, it might indeed be a relatively accidental thing which for some reason was accepted sooner than you would have expected (after all, sails disappearing could be explained by an Earth that's merely dome-shaped; the strongest evidence for a completely spherical shape was probably the fact that lunar eclipses always feature a perfectly disc-shaped shadow, and even that requires interpreting eclipses correctly, and having enough of them in the first place).

comment by johnlawrenceaspden · 2024-04-25T10:57:39.828Z · LW(p) · GW(p)

I don't buy this, the curvedness of the sea is obvious to sailors, e.g. you see the tops of islands long before you see the beach, and indeed to anyone who has ever swum across a bay! Inland peoples might be able to believe the world is flat, but not anyone with boats.

Replies from: cubefox
comment by cubefox · 2024-04-25T21:05:09.251Z · LW(p) · GW(p)

What's more likely: You being wrong about the obviousness of the sphere Earth theory to sailors, or the entire written record (which included information from people who had extensive access to the sea) of two thousand years of Chinese history and astronomy somehow omitting the spherical Earth theory? Not to speak of other pre-Hellenistic seafaring cultures which also lack records of having discovered the sphere Earth theory.

answer by junk heap homotopy · 2024-04-24T14:53:28.319Z · LW(p) · GW(p)

Set theory is the prototypical example I usually hear about. From Wikipedia:

Mathematical topics typically emerge and evolve through interactions among many researchers. Set theory, however, was founded by a single paper in 1874 by Georg Cantor: "On a Property of the Collection of All Real Algebraic Numbers".

answer by Alexander Gietelink Oldenziel · 2024-04-24T08:31:58.445Z · LW(p) · GW(p)

An example that's probably *not* a highly counterfactual discovery is the discovery of DNA as the inheritance particle by Watson & Crick [? Wilkins, Franklin, Gosling, Pauling...].

I had great fun reading Watson's scientific-literary fiction The Double Helix. Watson and Crick are very clear that competitors were hot on their heels, a matter of months, a year perhaps.

EDIT: thank you nitpickers. I should have said structure of DNA, not its role as the carrier of inheritance.

comment by johnswentworth · 2024-04-24T15:38:56.964Z · LW(p) · GW(p)

Nitpick: you're talking about the discovery of the structure of DNA; it was already known at that time to be the particle which mediates inheritance IIRC.

comment by tailcalled · 2024-04-25T06:32:35.528Z · LW(p) · GW(p)

I would say "the thing that contains the inheritance particles" rather than "the inheritance particle". "Particulate inheritance" is a technical term within genetics and it refers to how children don't end up precisely with the mean of their parents' traits (blending inheritance), but rather with some noise around that mean, which particulate inheritance asserts is due to the genetic influence being separated into discrete particles with the children receiving random subsets of their parent's genes. The significance of this is that under blending inheritance, the genetic variation between organisms within a species would be averaged away in a small number of generations, which would make evolution by natural selection ~impossible (as natural selection doesn't work without genetic variation).
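
The claim about blending inheritance can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions: random mating, no selection or mutation, and (for the particulate case) a trait that is the sum of alleles at a handful of unlinked two-allele loci. The population size, locus count, and function names are all arbitrary illustrative choices.

```python
import random
import statistics

def blending_generation(traits, rng):
    """Each child's trait is exactly the mean of two random parents' traits."""
    return [(rng.choice(traits) + rng.choice(traits)) / 2 for _ in traits]

def particulate_generation(genomes, rng):
    """Each child gets, per locus, one random allele from each parent."""
    children = []
    for _ in genomes:
        mother, father = rng.choice(genomes), rng.choice(genomes)
        children.append([(rng.choice(m), rng.choice(f))
                         for m, f in zip(mother, father)])
    return children

def trait(genome):
    """Additive trait: count of '1' alleles across all loci."""
    return sum(a + b for a, b in genome)

def simulate(n=1000, loci=20, generations=10, seed=0):
    rng = random.Random(seed)
    genomes = [[(rng.randint(0, 1), rng.randint(0, 1)) for _ in range(loci)]
               for _ in range(n)]
    blended = [float(trait(g)) for g in genomes]   # same starting traits
    var_blend = [statistics.pvariance(blended)]
    var_part = [statistics.pvariance([trait(g) for g in genomes])]
    for _ in range(generations):
        blended = blending_generation(blended, rng)
        genomes = particulate_generation(genomes, rng)
        var_blend.append(statistics.pvariance(blended))
        var_part.append(statistics.pvariance([trait(g) for g in genomes]))
    return var_blend, var_part

var_blend, var_part = simulate()
print("blending:   ", [round(v, 3) for v in var_blend])
print("particulate:", [round(v, 3) for v in var_part])
```

Under blending the trait variance roughly halves every generation, while under the particulate model it stays about where it started, leaving natural selection something to act on.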

answer by francis kafka · 2024-04-24T02:41:14.317Z · LW(p) · GW(p)

Peter J. Bowler suggests that evolution by natural selection is this in his book "Darwin Deleted". Given that in real life there was an "eclipse of Darwinism", he suggests that without Darwin, various non-Darwinian theories of evolution would have been developed further, and evolution by natural selection would have come rather late.

comment by Jesse Hoogland (jhoogland) · 2024-05-09T16:59:01.277Z · LW(p) · GW(p)

Anecdotally (I couldn't find confirmation after a few minutes of searching), I remember hearing a claim about Darwin being particularly ahead of the curve with sexual selection & mate choice. That without Darwin it might have taken decades for biologists to come to the same realizations. 

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T10:32:26.705Z · LW(p) · GW(p)

Don't forget Wallace!

Replies from: kingofthenerdz3
comment by francis kafka (kingofthenerdz3) · 2024-04-26T10:34:49.795Z · LW(p) · GW(p)

Bowler's comment on Wallace is that his theory was not worked out to the extent that Darwin's was, and besides I recall that he was a theistic evolutionist. Even with Wallace, there was still a plethora of non-Darwinian evolutionary theories before and after Darwin, and without the force of Darwin's version, it's not likely or necessary that Darwinism wins out. 

But Wallace’s version of the theory was not the same as Darwin’s, and he had very different ideas about its implications. And since Wallace conceived his theory in 1858, any equivalent to Darwin’s 1859 Origin of Species would have appeared years later.

Also 

Natural selection, however, was by no means an inevitable expression of mid-nineteenth-century thought, and Darwin was unique in having just the right combination of interests to appreciate all of its key components. No one else, certainly not Wallace, could have articulated the idea in the same way and promoted it to the world so effectively.

And he points out that minus Darwin, nobody would have paid as much attention to Wallace. 

The powerful case for transmutation mounted in the Origin of Species prompted everyone to take the subject seriously and begin to think more constructively about how the process might work. Without the Origin, few would have paid much attention to Wallace’s ideas (which were in many respects much less radical than Darwin’s anyway). Evolutionism would have developed more gradually in the course of the 1860s and ’70s, with Lamarckism being explored as the best available explanation of adaptive evolution. Theories in which adaptation was not seen as central to the evolutionary process would have sustained an evolutionary program that did not enquire so deeply into the actual mechanism of change, concentrating instead on reconstructing the overall history of life on earth from fossil and other evidence. Only toward the end of the century, when interest began to focus on the topic of heredity (largely as a result of social concerns), would the fragility of the non-Darwinian ideas be exposed, paving the way for the selection theory to emerge at last.

Bowler also points out that Wallace didn't really form the connection between both natural and artificial selection. 

Replies from: Lukas_Gloor
comment by Lukas_Gloor · 2024-04-26T12:38:30.015Z · LW(p) · GW(p)

In some of his books on evolution, Dawkins also said very similar things when commenting on Darwin vs Wallace, basically saying that there's no comparison, Darwin had a better grasp of things, justified it better and more extensively, didn't have muddled thinking about mechanisms, etc.

Replies from: kingofthenerdz3
comment by francis kafka (kingofthenerdz3) · 2024-04-26T14:27:33.538Z · LW(p) · GW(p)

I mean to some extent, Dawkins isn't a historian of science, presentism, yadda yadda but from what I've seen he's right here. Not that Wallace is somehow worse, given that of all the people out there he was certainly closer than the rest. That's about it

answer by johnswentworth · 2024-04-23T22:30:53.196Z · LW(p) · GW(p)

Here are some candidates from Claude and Gemini (Claude Opus seemed considerably better than Gemini Pro for this task). Unfortunately they are quite unreliable: I've already removed many examples from this list which I already knew to have multiple independent discoverers (like e.g. CRISPR and general relativity). If you're familiar with the history of any of these enough to say that they clearly were/weren't very counterfactual, please leave a comment.

  • Noether's Theorem
  • Mendel's Laws of Inheritance
  • Gödel's First Incompleteness Theorem (Claude mentions Von Neumann as an independent discoverer for the Second Incompleteness Theorem)
  • Feynman's path integral formulation of quantum mechanics
  • Onnes' discovery of superconductivity
  • Pauling's discovery of the alpha helix structure in proteins
  • McClintock's work on transposons
  • Observation of the cosmic microwave background
  • Lorenz's work on deterministic chaos
  • Prusiner's discovery of prions
  • Yamanaka factors for inducing pluripotency
  • Langmuir's adsorption isotherm (I have no idea what this is)
comment by Jan_Kulveit · 2024-04-24T14:04:42.741Z · LW(p) · GW(p)

Mendel's Laws seem counterfactual by ~30 years, based on partial re-discovery taking that much time. His experiments are technically something which someone could have done basically any time in the last few thousand years, having basic maths.

Replies from: johnswentworth
comment by johnswentworth · 2024-04-24T15:30:12.054Z · LW(p) · GW(p)

I buy this argument.

comment by Ben (ben-lang) · 2024-04-24T10:03:39.005Z · LW(p) · GW(p)

I would guess that Lorenz's work on deterministic chaos does not get many counterfactual discovery points. He noticed the chaos in his research because of his interactions with a computer doing simulations. This happened in 1961. Now, the question is, how many people were doing numerical calculations on computers in 1961? It could plausibly have been ten times as many by 1970. A hundred times as many by 1980? Those numbers are obviously made up but the direction they gesture in is my point. Chaos was a field that was made ripe for discovery by the computer. That doesn't take anything away from Lorenz's hard work and intelligence, but it does mean that if he had not taken the leap we can be fairly confident someone else would have. Put another way: If Lorenz is assumed to have had a high counterfactual impact, then it becomes a strange coincidence that chaos was discovered early in the history of computers.

Replies from: johnswentworth
comment by johnswentworth · 2024-04-24T15:29:55.205Z · LW(p) · GW(p)

I buy this argument.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-24T08:28:54.780Z · LW(p) · GW(p)

Feynman's path integral formulation can't be that counterfactually large. It's mathematically equivalent to Schwinger's formulation and done several years earlier by Tomonaga.

Replies from: johnswentworth
comment by johnswentworth · 2024-04-24T15:28:47.424Z · LW(p) · GW(p)

I don't buy mathematical equivalence as an argument against, in this case, since the whole point of the path integral formulation is that it's mathematically equivalent but far simpler conceptually and computationally.

Replies from: alexander-gietelink-oldenziel
comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-24T16:40:26.536Z · LW(p) · GW(p)

Idk the Nobel prize committee thought it wasn't significant enough to give out a separate prize 🤷

I am not familiar enough with the particulars to have an informed opinion. My best guess is that in general statements to the effect of "yes X also made scientific contribution A but Y phrased it better" overestimate the actual scientific counterfactual impact of Y. It generically weighs how well outsiders can understand the work too much vis-à-vis specialists/insiders who have enough hands-on experience that the value-add of a simpler/neater formalism is not that high (or even a distraction).

The reason Dick Feynman is so much more well-known than Schwinger and Tomonaga surely must not be entirely unrelated to the magnetic charisma of Dick Feynman.

comment by Garrett Baker (D0TheMath) · 2024-04-24T18:10:03.627Z · LW(p) · GW(p)

I've heard an argument that Mendel was actually counter-productive to the development of genetics. That if you go and actually study peas like he did, you'll find they don't make perfect Punnett squares, and from the deviations you can derive recombination effects. The claim is he fudged his data a little in order to make it nicer, then this held back others from figuring out the topological structure of genotypes.

Replies from: Jiro
comment by Jiro · 2024-04-24T20:17:02.403Z · LW(p) · GW(p)

I've heard, in this context, the partial counterargument that he was using traits which are a little fuzzy around the edges (where is the boundary between round and wrinkled?) and that he didn't have to intentionally fudge his data in order to get results that were too good, just be not completely objective in how he was determining them.

Of course, this sort of thing is why we have double-blind tests in modern times.

comment by transhumanist_atom_understander · 2024-05-11T00:40:05.679Z · LW(p) · GW(p)

Observation of the cosmic microwave background was a simultaneous discovery, according to James Peebles' Nobel lecture. If I'm understanding this right, Bob Dicke's group at Princeton was already looking for the CMB based on a theoretical prediction of it, and were doing experiments to detect it, with relatively primitive equipment, when the Bell Labs publication came out.

answer by FreakTakes · 2024-05-02T20:23:20.476Z · LW(p) · GW(p)

Fun question!

IMO Edison and Shannon are both strong candidates for quite different reasons.

Edison solved a bunch of necessary problems in one go when building a working, commercializable lighting system. He did this in an area where many others had only chipped away at corners of the problem. He was not the first to the area...but I don't think there are any strong claims that the area would have come along nearly as quickly if not for him/his team. I talk about this in-depth in a Works in Progress piece on Edison as an exceptional technical entrepreneur.

As far as Shannon goes, I'm not saying he initially published on his two major discoveries much earlier than others would have initially published...but Shannon had a sort of uncanny ability to open and largely close a sub-field all in one go. This is rare in scientific branch creation. Usually a process like this takes something like 5-10 people something like 5-20 years to do. My FreakTakes piece on the early years of molecular biology gives a sort of blow-by-blow of what this often looks like. Shannon's excellence helped circumvent a lot of that. So IMO the thoroughness of his thinking was a huge time-saver.

answer by Mateusz Bagiński · 2024-04-24T12:00:33.340Z · LW(p) · GW(p)

Maybe Hanson et al.'s Grabby aliens model? @Anders_Sandberg [LW · GW]  said that some N years before that (I think more or less at the time of working on Dissolving the Fermi Paradox), he "had all of the components [of the model] on the table" and it just didn't occur to him that they can be composed in this way. (personal communication, so I may be misremembering some details). Although it's less than 10 years, so...

Speaking of Hanson, prediction markets seem like a more central example. I don't think the idea was [inconceivable in principle] 100 years ago.

ETA: I think Dissolving the Fermi Paradox may actually be a good example. Nothing in principle prohibited people puzzling about "the great silence" from using probability distributions instead of point estimates in the Drake equation. Maybe it was infeasible to compute this back in the 1950s/60s, but I guess it should have been doable in the 2000s, and still, the paper was published only in 2017.
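To make the "distributions instead of point estimates" idea concrete, here is a minimal Monte Carlo sketch. Everything in it is an assumption for illustration: the factor ranges are made up (they are not the priors used in the paper), each factor is drawn log-uniformly, and the function names are hypothetical.

```python
import math
import random

# Hypothetical (low, high) ranges for each Drake-equation factor.
# These are illustrative placeholders, NOT the values used in the paper.
FACTOR_RANGES = {
    "R_star": (1.0, 100.0),  # star formation rate per year
    "f_p": (0.1, 1.0),       # fraction of stars with planets
    "n_e": (0.1, 1.0),       # habitable planets per such system
    "f_l": (1e-6, 1.0),      # fraction of those where life arises
    "f_i": (1e-3, 1.0),      # fraction of those where intelligence arises
    "f_c": (1e-2, 1.0),      # fraction of those that become detectable
    "L": (1e2, 1e10),        # years a civilization stays detectable
}


def log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


def point_estimate(ranges):
    """Multiply the geometric midpoint of every range (one number per factor)."""
    n = 1.0
    for low, high in ranges.values():
        n *= math.sqrt(low * high)
    return n


def sample_n(ranges, rng):
    """Draw every factor independently and multiply them into one sample of N."""
    n = 1.0
    for low, high in ranges.values():
        n *= log_uniform(low, high, rng)
    return n


def fraction_empty(ranges, draws=100_000, seed=0):
    """Fraction of Monte Carlo draws in which N < 1 (no one else in the galaxy)."""
    rng = random.Random(seed)
    empty = sum(1 for _ in range(draws) if sample_n(ranges, rng) < 1.0)
    return empty / draws


if __name__ == "__main__":
    print("point estimate of N:", point_estimate(FACTOR_RANGES))
    print("share of draws with N < 1:", fraction_empty(FACTOR_RANGES))
```

With these made-up ranges the midpoint estimate is about 3 civilizations, yet a substantial share of the sampled worlds come out with fewer than one. The point estimate hides that spread; the sampled distribution shows it. Nothing here needs more than a loop and a random number generator, which fits the guess that it was computationally feasible long before 2017.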

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T09:56:08.301Z · LW(p) · GW(p)

Here's a document called "Upper and lower bounds for Alien Civilizations and Expansion Rate" I wrote in 2016. Hanson et al.'s Grabby Aliens paper was submitted in 2021.

The draft is very rough. Claude summarizes it thusly:

The document presents a probabilistic model to estimate upper and lower bounds for the number of alien civilizations and their expansion rates in the universe. It shares some similarities with Robin Hanson's "Grabby Aliens" model, as both attempt to estimate the prevalence and expansion of alien civilizations, considering the idea of expansive civilizations that colonize resources in their vicinity.

However, there are notable differences. Hanson's model focuses on civilizations expanding at the highest possible speed and the implications of not observing their visible "bubbles," while this document's model allows for varying expansion rates and provides estimates without making strong claims about their observable absence. Hanson's model also considers the idea of a "Great Filter," which this document does not explicitly discuss.

Despite these differences, the document implicitly contains the central insight of Hanson's model – that the expansive nature of spacefaring civilizations and the lack of observable evidence for their existence imply that intelligent life is sparse and far away. The document's conclusions suggest relatively low numbers of spacefaring civilizations in the Milky Way (fewer than 20) and the Local Group (up to one million), consistent with the idea that intelligent life is rare and distant.

The document's model assumes that alien civilizations will become spacefaring and expansive, occupying increasing volumes of space over time and preventing new civilizations from forming in those regions. This aligns with the "grabby" nature of aliens in Hanson's model. Although the document does not explicitly discuss the implications of not observing "grabby" aliens, its low estimates for the number of civilizations implicitly support the idea that intelligent life is sparse and far away.

The draft was never finished as I felt the result wasn't significant enough. To be clear, the Hanson-Martin-McCarter-Paulson paper contains more detailed models and much more refined statistical analysis.  I didn't pursue these ideas further. 

I wasn't part of the rationality/EA/LW community. Nobody I talked to was interested in these questions. 

Let this be a lesson for young people: Don't assume. Publish! Publish in journals. Publish on LessWrong. Make something public even if it's not in a journal!

comment by ChrisHibbert · 2024-04-25T03:22:43.756Z · LW(p) · GW(p)

The Iowa Electronic Markets were roughly contemporaneous with Hanson's work. They are often co-credited.

answer by transhumanist_atom_understander · 2024-04-29T01:43:22.161Z · LW(p) · GW(p)

Green fluorescent protein (GFP). A curiosity-driven marine biology project (how do jellyfish produce light?), that was later adapted into an important and widely used tool in cell biology. You splice the GFP gene onto another gene, and you've effectively got a fluorescent tag so you can see where the protein product is in the cell.

Jellyfish luminescence wasn't exactly a hot field, and I don't know of any near-independent discoveries of GFP. However, when people were looking for protein markers visible under a microscope, multiple labs tried GFP simultaneously, so it was overdetermined by that point. If GFP hadn't been discovered, would they have done marine biology as a subtask, or just used their next best option?

Fun fact: The guy who discovered GFP was living near Nagasaki when it was bombed. So we can consider the hypothetical where he was visiting the city that day.

answer by Jonas Hallgren · 2024-04-24T14:06:53.460Z · LW(p) · GW(p)

The Buddha with dependent origination. I think it says somewhere that most of the stuff in Buddhism was from before the Buddha's time. These are things such as breath-based practices and loving kindness, among others. He had one revelation that basically made the entire enlightenment thing, which is called dependent origination.*

*At least according to my meditation teacher. I believe him, though, since he was a neuroscientist and an astrophysics master's student at Berkeley before he left for India, so he's got some pretty good epistemics.

It basically states that any system is only true based on another system being true. It has some really cool parallels to Gödel's Incompleteness Theorem but on a metaphysical level. Emptiness of emptiness and stuff. (On a side note I can recommend TMI + Seeing That Frees if you want to experience some radical shit there.)

comment by Valdes (Cossontvaldes) · 2024-06-05T07:46:24.380Z · LW(p) · GW(p)

For anyone wondering, TMI almost certainly stands for "The Mind Illuminated", a book by John Yates, Matthew Immergut, and Jeremy Graves. Full title: The Mind Illuminated: A Complete Meditation Guide Integrating Buddhist Wisdom and Brain Science for Greater Mindfulness

comment by yanni kyriacos (yanni) · 2024-05-19T11:28:44.672Z · LW(p) · GW(p)

Hi Jonas! Would you mind saying more about TMI + Seeing That Frees? Thanks!

Replies from: Jonas Hallgren
comment by Jonas Hallgren · 2024-05-19T12:14:08.539Z · LW(p) · GW(p)

Sure! Anything more specific that you want to know about? Practice advice or more theory?

Replies from: yanni
comment by yanni kyriacos (yanni) · 2024-05-19T21:35:04.879Z · LW(p) · GW(p)

Thanks :) Uh, good question. Making some good links? Have you done much nondual practice? I highly recommend Loch Kelly :)

answer by Niclas Kupper · 2024-04-24T11:18:02.527Z · LW(p) · GW(p)

Grothendieck seems to have been an extremely singular researcher; various of his discoveries would likely have been significantly delayed without him. His work on sheaves is mind-bending the first time you see it and was seemingly ahead of its time.

comment by Alexander Gietelink Oldenziel (alexander-gietelink-oldenziel) · 2024-04-25T10:06:15.583Z · LW(p) · GW(p)

Here are some reflections [LW(p) · GW(p)] I wrote on the work of Grothendieck and relations with his contemporaries & predecessors. 

Take it with a grain of salt - it is probably too deflationary of Grothendieck's work, pushing back on mythical narratives common in certain mathematical circles where Grothendieck is held to be a Christ-like figure. I pushed back on that a little. Nevertheless, it would probably not be an exaggeration to say that Grothendieck's purely scientific contributions [as opposed to real-life consequences] were comparable to those of Einstein.

answer by martinkunev · 2024-05-09T04:52:52.867Z · LW(p) · GW(p)

I have previously used special relativity as an example of the opposite. It seems to me that the Michelson-Morley experiment laid the groundwork and all alternatives were more or less rejected by the time special relativity was formulated. This could be hindsight bias though.

If Nobel Prizes are any indicator, then the photoelectric effect is probably more counterfactually impactful than special relativity.

answer by AnthonyC · 2024-04-29T14:35:06.169Z · LW(p) · GW(p)

I think it's worth noting that small delays in discovering new things would, in aggregate, be very impactful. On average, how far apart are the duplicate discoveries? If we pushed all the important discoveries back a couple of years by eliminating whoever was in fact historically first, then the result is a world that is perpetually several years behind our own in everything. This world is plausibly 5-10% poorer for centuries, maybe more if a few key hard steps have longer delays, or if the most critical delays happened a long time ago and were measured in decades or centuries instead.
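A rough way to sanity-check the 5-10% figure is to assume that wealth tracks a frontier growing at a steady annual rate, so that a world running d years behind sits at 1/(1+g)^d of our level. The 2% growth rate and the function name below are illustrative assumptions, not something stated in the comment.

```python
def relative_shortfall(delay_years, annual_growth):
    """Fraction by which a world `delay_years` behind is poorer than ours,
    assuming steady compound growth at `annual_growth` per year."""
    return 1.0 - (1.0 + annual_growth) ** (-delay_years)


if __name__ == "__main__":
    # Assumed 2% annual growth; delays of a few years.
    for delay in (3, 4, 5):
        print(delay, "years behind:", round(relative_shortfall(delay, 0.02), 4))
```

Under that assumed 2% rate, a delay of three to five years works out to a shortfall of roughly 6% to 9%, which lands inside the range given above. A faster assumed growth rate or a longer delay pushes the number up accordingly.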

answer by Jordan Taylor · 2024-04-28T08:02:34.350Z · LW(p) · GW(p)

Special relativity is not such a good example here when compared to general relativity, which was much further ahead of its time. See, for example, this article: https://bigthink.com/starts-with-a-bang/science-einstein-never-existed/

Regarding special relativity, Einstein himself said:[1]

There is no doubt, that the special theory of relativity, if we regard its development in retrospect, was ripe for discovery in 1905. Lorentz had already recognized that the transformations named after him are essential for the analysis of Maxwell's equations, and Poincaré deepened this insight still further. Concerning myself, I knew only Lorentz's important work of 1895 [...] but not Lorentz's later work, nor the consecutive investigations by Poincaré. In this sense my work of 1905 was independent. [..] The new feature of it was the realization of the fact that the bearing of the Lorentz transformation transcended its connection with Maxwell's equations and was concerned with the nature of space and time in general. A further new result was that the "Lorentz invariance" is a general condition for any physical theory. 

As for general relativity, the ideas and the mathematics required (Riemannian geometry) were much more obscure and further afield. The only people who came close, Nordström and Hilbert, arguably did so because they were directly influenced by Einstein's ongoing work on general relativity (not just special relativity).

https://www.quora.com/Without-Einstein-would-general-relativity-be-discovered-by-now 

  1. ^
answer by Shmi (shminux) · 2024-04-26T16:48:19.441Z · LW(p) · GW(p)

First, your non-standard use of the term "counterfactual" is jarring, though, as I understand, it is somewhat normalized in your circles. "Counterfactual", unlike "factual", means something that could have happened, given your limited knowledge of the world, but did not. What you probably mean is "completely unexpected", "surprising" or something similar. I suspect you have gotten this feedback before.

Sticking with physics: Galilean relativity was completely against the Aristotelian grain. More recently, the singularity theorems of Penrose and Hawking unexpectedly showed that black holes are not just a mathematical artifact, but a generic feature of the world. A whole slew of discoveries, experimental and theoretical, in quantum mechanics were almost all against the grain. Probably the simplest and yet the hardest to conceptualize was Bell's theorem.

Not my field, but in economics, Adam Smith's discovery of what Scott Alexander later named Moloch was a complete surprise, as I understand it. 

comment by kave · 2024-04-26T18:11:39.847Z · LW(p) · GW(p)

What you probably mean is "completely unexpected", "surprising" or something similar

I think it means the more specific "a discovery that if it counterfactually hadn't happened, wouldn't have happened another way for a long time". I think this is roughly the "counterfactual" in "counterfactual impact", but I agree not the more widespread one.

It would be great to have a single word for this that was clearer.

Replies from: kave
comment by kave · 2024-04-26T23:45:32.228Z · LW(p) · GW(p)

Maybe "counterfactually robust" is an OK phrase?

16 comments

Comments sorted by top scores.

comment by Templarrr (templarrr) · 2024-04-24T08:18:38.016Z · LW(p) · GW(p)

Penicillin. Gemini tells me that the antibiotic effects of mold had been noted 30 years earlier, but nobody investigated it as a medicine in all that time.

Gemini is telling you a popular urban legend-level understanding of what happened. The creation of penicillin as a random event, "by mistake", has at most a tangential connection to reality. But it is a great story, so it spread like wildfire.

In most cases when we read "nobody investigated" it actually means "nobody succeeded yet, so they weren't in a hurry to make it known", which isn't a very informative data point. No one ever succeeds, until they do. And in this case it's not even that - antibiotic properties of some molds were known and applied for centuries before that (well, obviously, before the germ theory they weren't known as "antibiotic", just that they helped...). The great work of Fleming and later scientists was about finding the particularly effective type of mold and extracting the exact effective chemical, as well as finding a way to produce that at scale.

comment by Wei Dai (Wei_Dai) · 2024-04-25T03:17:38.640Z · LW(p) · GW(p)

Even if someone made a discovery decades earlier than it otherwise would have been, the long term consequences of that may be small or unpredictable. If your goal is to "achieve high counterfactual impact in your own research" (presumably predictably positive ones) you could potentially do that in certain fields (e.g., AI safety) even if you only counterfactually advance the science by a few months or years. I'm a bit confused why you're asking people to think in the direction outlined in the OP.

comment by niplav · 2024-04-24T09:28:36.175Z · LW(p) · GW(p)

I think the Diesel engine would've taken 10 years or 20 years longer to be invented: From the Wikipedia article it sounds like it was fairly unintuitive to the people at the time.

comment by Niclas Kupper (niclas-kupper) · 2024-04-24T11:19:19.241Z · LW(p) · GW(p)

It would be interesting for people to post current research that they think has some small chance of outputting highly singular results!

comment by MiguelDev (whitehatStoic) · 2024-04-24T06:52:42.099Z · LW(p) · GW(p)

But if your goal is to achieve high counterfactual impact in your own research, then you should probably draw inspiration from the opposite: "singular" discoveries, i.e. discoveries which nobody else was anywhere close to figuring out.


This idea reminds me of the concepts in this post: Focus on the places where you feel shocked everyone's dropping the ball [LW · GW].

comment by [deleted] · 2024-04-24T02:21:47.055Z · LW(p) · GW(p)

Gemini may just be wrong about the mold claim. According to Wikipedia, Ernest Duchesne was curing guinea pigs of typhoid in 1897.

comment by ClareChiaraVincent · 2024-05-14T17:47:56.284Z · LW(p) · GW(p)

I don't know for sure about Pasteur (not my specialty) but from reading some primary sources from around the end of the spontaneous generation debate (Tyndall I think, can't quite remember!) I was struck by how much effort it took. I think it was just a lot harder to get from "first idea" to "compelling empirical results" than might immediately be clear! 

comment by Review Bot · 2024-04-28T08:06:22.526Z · LW(p) · GW(p)

The LessWrong Review [? · GW] runs every year to select the posts that have most stood the test of time. This post is not yet eligible for review, but will be at the end of 2025. The top fifty or so posts are featured prominently on the site throughout the year.

Hopefully, the review is better than karma at judging enduring value. If we have accurate prediction markets on the review results, maybe we can have better incentives on LessWrong today. Will this post make the top fifty?

comment by Johannes C. Mayer (johannes-c-mayer) · 2024-04-26T20:52:11.572Z · LW(p) · GW(p)

A few adjacent thoughts:

  • Haskell is powerful in the sense that when your program compiles, you get the program that you actually want with a much higher probability compared to most other languages. Many stupid mistakes that are runtime errors in other languages are now compile-time errors. Why is almost nobody using Haskell?
  • Why is there basically no widely used homoiconic language, i.e. a language in which you can use the language itself directly to manipulate programs written in the language?

Here we have some technologies that are basically ready to use (Haskell or Clojure), but people decide to mostly not use them. And by people, I mean professional programmers and companies that make software.

  • Why did nobody invent Rust earlier, by which I mean a system-level programming language that prevents you from making really dumb mistakes by having the computer check whether you made them?
  • Why did it take like 40 years to get a LaTeX replacement, even though LaTeX is terrible in very obvious ways?

These things have in common that there is a big engineering challenge. It feels like maybe this explains it, together with the fact that the people who would benefit from these technologies were in a position where the cost of creating them would have exceeded the benefit they expected from them.

For Haskell and Clojure we can also consider this point. Certainly, these two technologies have their flaws and could be improved. But then again we would have a massive engineering challenge.

Replies from: Radford Neal
comment by Radford Neal · 2024-04-27T17:00:49.451Z · LW(p) · GW(p)

"Why is there basically no widely used homoiconic language"

Well, there's Lisp, in its many variants.  And there's R.  Probably several others.

The thing is, while homoiconicity can be useful, it's not close to being a determinant of how useful the language is in practice.  As evidence, I'd point out that probably 90% of R users don't realize that it's homoiconic.

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2024-04-28T17:02:48.418Z · LW(p) · GW(p)

I am also not sure how useful it is, but I would be very careful about saying that R programmers not using it is strong evidence that it is not that useful. That was more or less the point I wanted to make with the original comment. Homoiconicity might be hard to learn and use compared to learning a for loop in Python. That might be the reason people don't learn it: they don't understand how it could be useful. Most R users have probably never even heard of homoiconicity. And if they had, they would say, "Well, I don't know how this is useful." But again, that does not mean that it is not useful.

Probably many people at least vaguely know the concept of a pure function. But probably most don't actually use it in situations where it would be advantageous to use pure functions because they can't identify these situations.

Probably they don't even understand basic arguments, because they've never heard them, of why one would care about making functions pure. With your line of argument, we would now be able to conclude that pure functions are clearly not very useful in practice. Which I think is, at minimum, an overstatement. Clearly, they can be useful. My current model says that they are actually very useful.
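For readers who haven't seen the basic argument, here is a small sketch of the difference. The pricing example, the `settings` dictionary, and both function names are made up for illustration.

```python
from functools import lru_cache

# Mutable shared state that the impure version silently depends on.
settings = {"discount": 0.1}


def price_impure(amount):
    """Impure: the result depends on `settings`, not only on the argument,
    so the same call can return different values at different times."""
    return amount * (1 - settings["discount"])


@lru_cache(maxsize=None)
def price_pure(amount, discount):
    """Pure: the result depends only on the arguments, so it can be cached,
    tested in isolation, and called in any order."""
    return amount * (1 - discount)
```

Calling `price_impure(100)` before and after someone changes `settings["discount"]` gives two different answers for the same call, whereas `price_pure(100, 0.1)` always returns the same value, which is what makes the `lru_cache` decorator safe to apply.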

[Edit:] Also, R is not homoiconic, lol. At least not in a strong sense like Lisp. At least that's what this guy on GitHub says. Also, I would guess this is correct from remembering how R looks, and from looking at a few code samples now. In Lisp your program is a bunch of lists. In R it is not. What is the data structure instance that is equivalent to this expression: %sumx2y2% <- function(e1, e2) {e1 ^ 2 + e2 ^ 2}?

Replies from: Radford Neal
comment by Radford Neal · 2024-04-28T19:08:48.294Z · LW(p) · GW(p)

R is definitely homoiconic.  For your example (putting the %sumx2y2% in backquotes to make it syntactically valid), we can examine it like this:

> x <- quote (`%sumx2y2%` <- function(e1, e2) {e1 ^ 2 + e2 ^ 2})
> x
`%sumx2y2%` <- function(e1, e2) {
   e1^2 + e2^2
}
> typeof(x)
[1] "language"
> x[[1]]
`<-`
> x[[2]]
`%sumx2y2%`
> x[[3]]
function(e1, e2) {
   e1^2 + e2^2
}
> typeof(x[[3]])
[1] "language"
> x[[3]][[1]]
`function`
> x[[3]][[2]]
$e1


$e2


> x[[3]][[3]]
{
   e1^2 + e2^2
}

And so forth.  And of course you can construct that expression bit by bit if you like as well.  And if you like, you can construct such expressions and use them just as data structures, never evaluating them, though this would be a bit of a strange thing to do. The only difference from Lisp is that R has a variety of composite data types, including "language", whereas Lisp just has S-expressions and atoms.

Replies from: johannes-c-mayer
comment by Johannes C. Mayer (johannes-c-mayer) · 2024-04-29T20:05:37.489Z · LW(p) · GW(p)

Ok, I was confused before. I think homoiconicity is sort of several things. Here are some examples:

  • In basically any programming language L, you can have a program A that writes a file containing valid L source code, which is then run by A.
  • In some sense, Python is homoiconic, because you can have a string and then exec it. Before you exec (or in between execs) you can manipulate the string with normal string manipulation.
  • In R you have the quote operator, which allows you to take in code and return an object that represents this code and can be manipulated.
  • In Lisp when you write an S-expression, the same S-expression can be interpreted as a program or a list. It is actually always a (possibly nested) list. If we interpret the list as a program, we say that the first element in the list is the symbol of the function, and the remaining entries in the list are the arguments to the function.

Although I can't put my finger on it exactly, to me it feels like the homoiconicity increases as we go further down the list.

The basic idea though seems to always be that we have a program that can manipulate the representation of another program. This is actually more general than homoiconicity, as we could have a Python program manipulating Haskell code for example. It seems that the further we go down the list, the easier it gets to do this kind of program manipulation.
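The list above can be sketched in Python, reusing the `e1 ^ 2 + e2 ^ 2` expression from earlier in the thread. This is only a toy under stated assumptions: the `ast` module is Python's real parser interface, but `OPS`, `evaluate`, and the nested-list program format are invented here to imitate the Lisp case, and the string and `ast` versions stand in for the Python and R bullets respectively.

```python
import ast
import operator

SOURCE = "e1 ** 2 + e2 ** 2"
ENV = {"e1": 3, "e2": 2}


def edit_as_string(source):
    """Code as a string: edit it with plain string operations, then eval it."""
    return source.replace("+", "-")


class AddToSub(ast.NodeTransformer):
    """Code as a structured object: rewrite every `a + b` node into `a - b`."""

    def visit_BinOp(self, node):
        self.generic_visit(node)  # rewrite nested expressions first
        if isinstance(node.op, ast.Add):
            node.op = ast.Sub()
        return node


def edit_as_tree(source):
    """Parse to a syntax tree, transform it, and compile the result."""
    tree = ast.parse(source, mode="eval")
    tree = ast.fix_missing_locations(AddToSub().visit(tree))
    return compile(tree, "<ast>", "eval")


# Lisp-style: the program itself is a nested list, with the operator first.
OPS = {"+": operator.add, "-": operator.sub, "^": operator.pow}


def evaluate(expr, env):
    """Evaluate a nested-list program: [op, arg1, arg2, ...]."""
    if isinstance(expr, list):
        op, *args = expr
        return OPS[op](*(evaluate(arg, env) for arg in args))
    if isinstance(expr, str):
        return env[expr]  # a bare string is a variable name
    return expr           # anything else is a literal


if __name__ == "__main__":
    print(eval(edit_as_string(SOURCE), dict(ENV)))
    print(eval(edit_as_tree(SOURCE), dict(ENV)))
    program = ["+", ["^", "e1", 2], ["^", "e2", 2]]
    program[0] = "-"  # editing the program is ordinary list mutation
    print(evaluate(program, ENV))
```

All three routes turn the sum of squares into a difference of squares. The difference is how much machinery sits between the programmer and the program: string surgery in the first case, a separate tree type and transformer class in the second, and nothing beyond ordinary list indexing in the third.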

comment by segfault (caleb-ditchfield) · 2024-04-25T04:16:20.176Z · LW(p) · GW(p)

Could you define what you mean here by counterfactual impact?

My knowledge of the word counterfactual comes mainly from the blockchain world, where we use it in the form of "a person could do x at any time, and we wouldn't be able to stop them, therefore x is counterfactually already true or has counterfactually already occurred".

Replies from: ChristianKl, caleb-ditchfield
comment by ChristianKl · 2024-04-26T13:40:36.770Z · LW(p) · GW(p)

Counterfactual means that if something had not happened, something else would have happened. It's a key concept in Judea Pearl's work on causality.