On Frequentism and Bayesian Dogma
post by DanielFilan, Adrià Garriga-alonso (rhaps0dy) · 2023-10-15T22:23:10.747Z · LW · GW · 27 comments
Comments sorted by top scores.
comment by rotatingpaguro · 2023-10-16T03:16:27.338Z · LW(p) · GW(p)
What do you think about Wald's complete class theorems and other similar decision-theoretic results that say that, under a fixed frequentist setting, the set of admissible algorithms coincides (barring messes with infinities) with the set of Bayesian procedures as all possible priors are considered? In other words, if you think it makes sense to strive for the "best" procedure in a context, for any fixed even if unknown definition of what's best, and you have a frequentist procedure you think is statistically good, then there must be a corresponding Bayesian prior.
(This is an argument I'd always like to see addressed as a basic disclaimer in frequentist vs. Bayesian discussions; I think it helps a lot to put down the framework people are reasoning under, e.g., whether it's more practical vs. theoretical.)
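A minimal sketch of the correspondence the theorem describes, with ridge regression standing in as the frequentist procedure (the data and penalty value below are made up for illustration): the penalized least-squares estimator coincides exactly with the posterior mean under a particular Gaussian prior.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)

lam = 0.7          # ridge penalty (illustrative value)
sigma2 = 0.3 ** 2  # noise variance, assumed known

# Frequentist procedure: ridge regression (penalized least squares).
ridge = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Bayesian procedure: posterior mean under prior w ~ N(0, (sigma2/lam) I)
# with likelihood y ~ N(Xw, sigma2 I).  Same formula, different story.
tau2 = sigma2 / lam
post_mean = np.linalg.solve(X.T @ X + (sigma2 / tau2) * np.eye(d), X.T @ y)

assert np.allclose(ridge, post_mean)  # the two estimators coincide
```

Sweeping the penalty over all positive values traces out exactly a family of Gaussian priors, which is the complete-class picture in miniature.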
My own opinion on the topic (I'm pro-Bayes):
- Many standard frequentist things can be derived as easily or more easily in a Bayesian way; that they are conventionally considered frequentist is an irrelevant accident of history.
- In tree methods, the frequentist version comes first, but the Bayesian version, when it arrives, is better, and usable in practice.
- Practically all real Bayesian methods are not purely Bayesian, there are many ad hockeries. The point is using Bayes as a guide. Even with an algorithm pulled out of the hat, it's useful to know if it has a Bayesian interpretation, because it makes it clearer.
- ML is frequentist only in the sense of trying algorithms without set rules; I don't think that should be counted as a frequentist success! It's too generic. I have the impression the mindset of the people working in ML who know their shit is closer to Bayesian, but I am not confident in this since it's an indirect impression. Example: information-theoretic stuff is more natural with Bayes.
comment by Vaniver · 2023-10-15T22:34:51.087Z · LW(p) · GW(p)
I think my real gripe is that I see this massive impact of frequentism on the scientific method as promoting the use of p-values and confidence intervals, which, IMO, are using conditional probabilities in the wrong direction (one way to tell this: ask any normal scientist what a p-value or a confidence interval is, and there's a high chance that they'll give an explanation of what the Bayesian equivalent would be).
I'm a little surprised this didn't come up earlier. As I mentioned to Adrià, I think the thing Bayesianism is about is more "how to think about epistemology" (where complaints like "but not everything is a probability distribution! How do you account for conjectures?" live) and the fact that the main frequentist tool used in science is totally misused and misunderstood seems to me like it's a pretty good argument in favor of "you should be thinking like a Bayesian."
Like, if the thing with frequentism is "yeah just use methods in a pragmatic way and don't think about it that hard" it's not really a surprise that people didn't think about things that hard and this leads to widespread confusion and mistakes.
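The "wrong direction" point can be made with three lines of base-rate arithmetic (the numbers below are hypothetical, chosen only for illustration): even at α = 0.05, the probability that a significant result reflects a real effect need not be anywhere near 95%.

```python
# Hypothetical numbers: 10% of tested hypotheses are true effects,
# tests have 80% power, and the significance threshold is 0.05.
base_rate, power, alpha = 0.10, 0.80, 0.05

p_sig = base_rate * power + (1 - base_rate) * alpha  # P(significant result)
p_true_given_sig = base_rate * power / p_sig         # P(real effect | significant)

print(round(p_true_given_sig, 2))  # 0.64, not 0.95
```

The p-value conditions on the hypothesis and asks about the data; the question scientists usually want answered conditions on the data and asks about the hypothesis, and the two can differ by this much or more.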
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-15T23:47:00.767Z · LW(p) · GW(p)
the thing with frequentism is "yeah just use methods in a pragmatic way and don't think about it that hard"
I think this does not accurately represent my beliefs. It is about thinking hard about how the methods actually behave, as opposed to having a theory that prescribes how methods should behave and then constructing algorithms based on that.
Frequentists analyze the properties of an algorithm that takes data as input (in their jargon, an 'estimator').
They also try to construct better algorithms, but each new algorithm is bespoke and requires original thinking, as opposed to Bayes which says "you should compute the posterior probability", which makes it very easy to construct algorithms. (This is a drawback of the frequentist approach -- algorithm construction is not automatic. But the finite-computation Bayesian algorithms have very few guarantees anyways so I don't think we should count it against them too much).
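A sketch of what analyzing an estimator looks like in practice, using the textbook comparison of the 1/n and 1/(n−1) variance estimators (the simulation parameters are invented for illustration): the frequentist question is how the algorithm behaves across repeated samples, without assuming any prior.

```python
import numpy as np

rng = np.random.default_rng(0)
true_var = 4.0           # variance of the data-generating process
n, trials = 10, 200_000  # sample size per draw, number of repeated draws

samples = rng.normal(scale=2.0, size=(trials, n))
mle = samples.var(axis=1, ddof=0)       # divide by n: the MLE
unbiased = samples.var(axis=1, ddof=1)  # divide by n-1: the unbiased estimator

# The sampling distribution reveals the bias: E[MLE] = (n-1)/n * true_var.
print(mle.mean().round(2), unbiased.mean().round(2))  # ≈ 3.6 and ≈ 4.0
```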
I think having rando social scientists using likelihood ratios would also lead to mistakes and such.
comment by Garrett Baker (D0TheMath) · 2023-10-16T00:42:49.925Z · LW(p) · GW(p)
Even under these assumptions, you still have the problem of handling belief states which cannot be described as a probability distribution. For small state spaces, being fast and loose with that (e.g. just believe the uniform distribution over everything) is fine, but larger state spaces run into problems, even if you have infinite compute and can prove everything and don't need to have self-knowledge.
What sort of problems?
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T03:49:47.229Z · LW(p) · GW(p)
In short, the probability distribution you choose contains lots of interesting assumptions, which you didn't necessarily intend, about which states are more likely. As a result most of the possible hypotheses have vanishingly small prior probability and you can never reach them, even though a frequentist approach wouldn't have required committing to those assumptions up front.
For example, let us consider trying to learn a function with 1-dim numerical input and output (e.g. f: ℝ → ℝ). Correspondingly, your hypothesis space is the set of all such functions. There are very many such functions (infinitely many if the inputs are real-valued, otherwise a crazy number).
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It's uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding [LW · GW], all sorts of problems.
What other prior probability distribution can we place on the hypothesis space? The obvious choice in 2023 is a neural network with random weights. OK, let's think about that. What architecture? The most sensible thing is to randomize over architectures somehow. Let's hope the distribution on architectures is as simple as possible.
How wide, how deep? You don't want to choose an arbitrary distribution or (god forbid) an arbitrary number, so let's make it infinitely wide and deep! It turns out that an infinitely wide network just collapses to a random process without any internal features. You could try an infinitely deep network instead, but that collapses to a stationary distribution which doesn't depend on the input. Oops.
Okay, let's give up and place some arbitrary distribution (e.g. geometric distribution) on the width.
What about the prior on weights? uh idk, zero-mean identity covariance Gaussian? Our best evidence says that this sucks.
At this point you've made so many choices, all of which have to be informed by what empirically works well, that it's a strange Bayesian reasoner you end up with. And you still haven't finished specifying your prior distribution.
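To see the unintended assumptions concretely, one can sample functions from a random-weight MLP prior (the architecture, activation, and weight scales below are illustrative choices, which is exactly the point): the smoothness and typical scale of the draws are dictated entirely by those choices, not by anything deliberately believed about the target function.

```python
import numpy as np

def sample_function(x, depth, width, rng):
    """Draw one function from a random MLP prior with N(0, 1/fan_in) weights."""
    h = x[:, None]
    for _ in range(depth):
        w = rng.normal(scale=1.0 / np.sqrt(h.shape[1]), size=(h.shape[1], width))
        b = rng.normal(size=width)
        h = np.tanh(h @ w + b)
    w_out = rng.normal(scale=1.0 / np.sqrt(width), size=(width, 1))
    return (h @ w_out).ravel()

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 100)
draws = np.stack([sample_function(x, depth=3, width=64, rng=rng) for _ in range(5)])
# Each row is one hypothesis drawn from the prior.  Its wiggliness and scale
# follow from the depth, width, activation, and weight-variance choices --
# none of which were chosen for their meaning as beliefs about the function.
```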
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:10:18.576Z · LW(p) · GW(p)
You could use the Solomonoff prior (on a discretized version of this), but that way lies madness. It's uncomputable, and most of the functions that fit the data may contain agents that try to get you to do their bidding, all sorts of problems.
This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself [LW(p) · GW(p)]. Computability of the prior also seems irrelevant if I have infinite compute. Therefore in this prediction task, I don't see the problem in just using the first thing you mentioned.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T05:23:12.334Z · LW(p) · GW(p)
This seems false if you're interacting with a computable universe, and don't need to model yourself or copies of yourself
Reasonable people disagree. Why should I care about the "limit of large data" instead of finite-data performance?
↑ comment by Cleo Nardo (strawberry calm) · 2023-10-17T14:47:38.222Z · LW(p) · GW(p)
- Logical/mathematical beliefs — e.g. "Is Fermat's Last Theorem true?"
- Meta-beliefs — e.g. "Do I believe that I will die one day?"
- Beliefs about the outcome space itself — e.g. "Am I conflating these two outcomes?"
- Indexical beliefs — e.g. "Am I the left clone or the right clone?"
- Irrational beliefs — e.g. conjunction fallacy.
etc.
Of course, you can describe anything with some probability distribution, but these are cases where the standard Bayesian approach to modelling belief-states needs to be amended somewhat.
↑ comment by Garrett Baker (D0TheMath) · 2023-10-17T16:03:51.160Z · LW(p) · GW(p)
1-4 seem to go away if I don't care about self-knowledge, and have infinite compute. 5 doesn't seem like a problem to me. If there is a best reasoning system, it should not make mistakes. Showing that a system can't make mistakes may show you it's not what humans use, but it should not be classified as a problem.
comment by Garrett Baker (D0TheMath) · 2023-10-18T01:34:10.977Z · LW(p) · GW(p)
I think I'm mostly confused about how both Daniel and Adria are using the terms bayesian and frequentist. Like, I thought the difference between frequentist and bayesian interpretations of probability theory is that bayesian interpretations say the probability is in your head, while frequentist interpretations say the probability is in the world.
In that sense, showing that the kinds of methods motivated by frequentist considerations can give you insight into algorithms' usefulness is maybe a little bit of evidence that probabilities actually exist in some objective sense. But it doesn't seem to trump the "but that just sounds really absurd to me though" consideration.
In particular, logical induction and boundedly rational inductive agents were given as examples of frequentist methods by Daniel. The first at least seems pretty subjectivist to me; wouldn't a frequentist think that logical statements, logic being the most deterministic of systems, should have only 1 or 0 probabilities? Every time I type 1+1 into my calculator I always get 2! The second seems relatively unrelated to the question, though I know less about it.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T03:11:56.759Z · LW(p) · GW(p)
First, "probability is in the world" is an oversimplification. Quoting from Wikipedia, "probabilities are discussed only when dealing with well-defined random experiments". Since most things in the world are not well-defined random experiments, probability is reduced to a theoretical tool for analyzing things that works when real processes are similar enough to well-defined random experiments.
it doesn't seem to trump the "but that just sounds really absurd to me though" consideration
Is there anything that could trump that consideration? One of my main objections to Bayesianism is that it prescribes that ideal agent's beliefs must be probability distributions, which sounds even more absurd to me.
first at least seems pretty subjectivist to me,
Estimators in frequentism have 'subjective beliefs', in the sense that their output/recommendations depends on the evidence they've seen (i.e., the particular sample that's input into it). The objectivity of frequentist methods is aspirational: the 'goodness' of an estimator is decided by how good it is in all possible worlds. (Often the estimator which is best in the least convenient world is preferred, but sometimes that isn't known or doesn't exist. Different estimators will be better in some worlds than others, and tough choices must be made, for which the theory mostly just gives up. See e.g. "Evaluating estimators", Section 7.3 of "Statistical Inference" by Casella and Berger).
wouldn't a frequentist think the probability of logical statements, being the most deterministic system, should have only 1 or 0 probabilities?
Indeed, in reality logical statements are either true or false, and thus their probabilities are either 1 or 0. But the estimator-algorithm is free to assign whatever belief it wants to it.
I agree that logical induction is very much Bayesianism-inspired, precisely because it wants to assign weights from zero to 1 that are as self-consistent as possible (i.e. basically probabilities) to statements. But it is frequentist in the sense that it's examining "unconditional" properties of the algorithm, as opposed to properties assuming the prior distribution is true. (It can't do the latter because, as you point out, the prior probability of logical statements is just 0 or 1).
But also, assigning probabilities of 0 or 1 to things is not exclusively a Bayesian thing. You could think of a predictor that outputs numbers between 0 and 1 as an estimator of whether a statement will be true or false. If you were to evaluate this estimator you could choose, say, mean-squared error. The best estimator is the one with the least MSE. And indeed, that's how probabilistic forecasts are typically evaluated.
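This MSE-on-probabilities criterion is the Brier score; a minimal version (the forecast numbers are invented for illustration):

```python
# Brier score: mean squared error between probability forecasts and binary
# outcomes.  Lower is better; always forecasting 0.5 scores exactly 0.25.
def brier(forecasts, outcomes):
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(outcomes)

outcomes = [1, 0, 1, 1, 0]
sharp = brier([0.9, 0.2, 0.8, 0.7, 0.1], outcomes)
flat = brier([0.5] * 5, outcomes)
print(round(sharp, 3), flat)  # 0.038 0.25
```

The confident, well-calibrated forecaster beats the maximally noncommittal one, and the score never needed a prior over hypotheses to say so.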
Daniel states he considers these frequentist because:
I call logical induction and boundedly rational inductive agents 'frequentist' because they fall into the family of "have a ton of 'experts' and play them off against each other" (and crucially, don't constrain those experts to be 'rational' according to some a priori theory of good reasoning).
and I think indeed not prescribing that things must think in probabilities is more of a frequentist thing. I'm not sure I'd call them decidedly frequentist (logical induction is very much a different beast than classical statistics) but they're not in the other camp either.
↑ comment by Olli Järviniemi (jarviniemi) · 2023-10-26T12:12:20.733Z · LW(p) · GW(p)
One of my main objections to Bayesianism is that it prescribes that ideal agent's beliefs must be probability distributions, which sounds even more absurd to me.
From one viewpoint, I think this objection is satisfactorily answered by Cox's theorem - do you find it unsatisfactory (and if so, why)?
Let me focus on another angle though, namely the "absurdity" and gut level feelings of probabilities.
So, my gut feels quite good about probabilities. Like, I am uncertain about various things (read: basically everything), but this uncertainty comes in degrees: I can compare and possibly even quantify my uncertainties. I feel like some people get stuck on the numeric probabilities part (one example I recently ran into was this quote from Section III of this essay by Scott, "Does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?"). Not sure if this is relevant here, but at the risk of going off on a tangent, here's a way of thinking about probabilities I've found clarifying and which I haven't seen elsewhere:
The correspondence
beliefs <-> probabilities
is of the same type as
temperature <-> Celsius-degrees.
Like, people have feelings of warmth and temperature. These come in degrees: sometimes it's hotter than some other times, now it is a lot warmer than yesterday and so on. And sure, people don't have a built-in thermometer mapping these feelings to Celsius-degrees, they don't naturally think of temperature in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the Celsius scale is only a few hundred years old! Still, Celsius degrees feel like the correct way of thinking about temperature.
And the same with beliefs and uncertainty. These come in degrees: sometimes you are more confident than some other times, now you are way more confident than yesterday and so on. And sure, people don't have a built-in probabilitymeter mapping these feelings to percentages, they don't naturally think of confidence in numeric degrees, they frequently make errors in translating between intuitive feelings and quantitative formulations (though less so with more experience). Heck, the probability scale is only a few hundred years old! Still, probabilities feel like the correct way of thinking about uncertainty.
From this perspective probabilities feel completely natural to me - or at least as natural as Celsius-degrees feel. Especially questions like "does anyone actually consistently use numerical probabilities in everyday situations of uncertainty?" seem to miss the point, in the same way that "does anyone actually consistently use numerical degrees in everyday situations of temperature?" seems to miss the point of the Celsius scale. And I have no gut level objections to the claim that an ideal agent's beliefs correspond to probabilities.
comment by Garrett Baker (D0TheMath) · 2023-10-18T01:21:36.988Z · LW(p) · GW(p)
I do not understand why neural nets are touted here as a success of frequentism. They don't seem like a success of any statistical theory to me. Maybe I don't know my neural network history all that well, or my philosophy of frequentism, but I do know a thing or two about regular statistical learning theory, and it definitely didn't predict neural networks and the scaling paradigm would work.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T03:24:36.811Z · LW(p) · GW(p)
I just remembered the main way in which NNs are frequentist. They belong to a very illustrious family of frequentist estimators: the maximum likelihood estimators.
Think about it: NNs have a bunch of parameters θ. Their loss is basically always the negative log-likelihood −log p(data | θ) (e.g. mean-squared error for Gaussian p, cross-entropy for categorical p). They get trained by minimizing the loss (i.e. maximizing the likelihood).
In classical frequentist analysis they're likely to be a terrible, overfitted estimator, because they have many parameters. And I think this is true if you find the actually maximizing parameters θ*.
But SGD is kind of a shitty optimizer. It turns out the two mistakes cancel out, and NNs are very effective.
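The MLE framing is easy to check numerically: with a fixed noise scale, the Gaussian negative log-likelihood is an increasing affine function of the mean-squared error, so both criteria pick out the same parameters (toy data below, fitting only a mean):

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.normal(loc=2.0, scale=1.0, size=100)  # toy data

def mse(mu):
    return np.mean((y - mu) ** 2)

def gaussian_nll(mu, sigma=1.0):
    # Negative log-likelihood of y under N(mu, sigma^2); for fixed sigma
    # this is an increasing affine function of mse(mu).
    return np.sum(0.5 * np.log(2 * np.pi * sigma ** 2)
                  + (y - mu) ** 2 / (2 * sigma ** 2))

mus = np.linspace(0.0, 4.0, 401)
best_mse = mus[np.argmin([mse(m) for m in mus])]
best_nll = mus[np.argmin([gaussian_nll(m) for m in mus])]
assert best_mse == best_nll  # same minimizer: least squares = Gaussian MLE
```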
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T04:50:12.161Z · LW(p) · GW(p)
I don't think I understand your model of why neural networks are so effective. It sounds like you're saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible, but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.
Coming from the perspective of singular learning theory, neural networks work because SGD weights solutions by their parameter volume, which is dominated by low-complexity singularities, and is close enough to a bayesian posterior that it ends up being able to be modeled well from that frame.
This theory is very bayes-law inspired, though I don't tout neural networks as evidence in favor of bayesianism, since the question seems not very related, and maybe the pioneers of the field had some deep frequentist motivated intuitions about neural networks. My impression though is they were mostly just motivated by looking at the brain at first, then later on by following trend-lines. And in fact paid little attention to theoretical or philosophical concerns (though not zero, people talked much about connectionism. I would guess this correlated with being a frequentist, though I would guess the correlation was very modest, and maybe success correlated more with just not caring all that much).
There may be a synthesis position here where you claim that SGD weighting solutions by their size in the weight space is in fact what you mean by SGD being an implicit regularizer. In such a case, I claim this is just sneaking in Bayes' rule without calling it by name, and this is not a very smart thing to do, because the Bayesian frame gives you a bunch more leverage on analyzing the system[1]. I actually think I remember a theorem showing that all MLE + regularizer learners are doing some kind of Bayesian learning, though I could be mistaken and I don't believe this is a crux for me here.
If our models end up different, I think there's a bunch of things which you end up being utterly confused by in deep learning, which I'm not[2].
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T05:27:24.091Z · LW(p) · GW(p)
In such a case, I claim this is just sneaking in bayes rule without calling it by name, and this is not a very smart thing to do, because the bayesian frame gives you a bunch more leverage on analyzing the system
I disagree. An inductive bias is not necessarily a prior distribution. What's the prior?
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:34:13.122Z · LW(p) · GW(p)
From another comment of mine:
The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don't change your function, so one would expect it's a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
Also, side-comment: Thanks for the discussion! Its fun.
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though, or whether the exact connection with SGD is even known.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T05:24:57.425Z · LW(p) · GW(p)
I don't think I understand your model of why neural networks are so effective. It sounds like you're saying that, on the one hand, neural networks have lots of parameters, so you should expect them to be terrible, but on the other hand they are actually very good because SGD is such a shitty optimizer that it acts as an implicit regularizer.
Yeah, that's basically my model. How it regularizes I don't know. Perhaps the volume of "simple" functions is the main driver of this, rather than gradient descent dynamics. I think the randomness of it is important; full-gradient descent (no stochasticity) would not work nearly as well.
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:40:47.917Z · LW(p) · GW(p)
Oh, this reminded me of the temperature component of SLT, which I believe modulates how sharply one should sample from the Bayesian posterior, or perhaps how heavily to update on new evidence; I forget. In any case, it does this to try to capture the stochasticity component of SGD. It's still an open problem to show how successfully, though, I believe.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T05:17:32.293Z · LW(p) · GW(p)
OK, let's look through the papers you linked.
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph on page 3. They say that literature that argues that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except if you use the Gaussian process they're using to do predictions, it does not work as well as the NN. So you can't explain that the NN works well by appealing that it's similar to this particular Bayesian posterior.
SLT; "Dynamical versus Bayesian Phase Transitions in a Toy Model of Superposition"
I have many problems with SLT and a proper comment will take me a couple extra hours. But also I could come away thinking that it's basically correct, so maybe this is the one.
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:50:18.822Z · LW(p) · GW(p)
This paper finds that the probability that SGD finds a function is correlated with the posterior probability of a Gaussian process conditioned on the same data. Except if you use the Gaussian process they're using to do predictions, it does not work as well as the NN. So you can't explain that the NN works well by appealing that it's similar to this particular Bayesian posterior.
Yup this changes my mind about the relevance of this paper.
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:33:04.351Z · LW(p) · GW(p)
This one is interesting. It argues that the regularization properties are not in SGD, but rather in the NN parameterization, and that non-gradient optimizers also find simple solutions which generalize well. They talk about Bayes only in a paragraph on page 3. They say that literature that argues that NNs work well because they're Bayesian is related (which is true -- it's also about generalization and volumes). But I see little evidence that the explanation in this paper is an appeal to Bayesian thinking. A simple question for you: what prior distribution do the NNs have, according to the findings in this paper?
In brief: in weight space, uniform. In function space, it's an open problem and the paper says relatively little about that, only showing that conditioning on a function with zero loss, and weighting by its corresponding size in the weight space, gets you the same result as training a neural network. The former process is sampling from a Bayesian posterior.
Less brief: The prior assigns uniform probability to all weights, and I believe a good understanding of the mapping from weights to functions is unknown, though lots of the time there are many directions you can move in in the weight space which don't change your function, so one would expect it's a relatively compressive mapping (in contrast to, say, a polynomial parameterization, where the mapping is one-to-one).
will say more about your other comment later (maybe).
EDIT: Actually, there should be a term for the stochasticity which you integrate into the SLT equations like you would temperature in a physical system. I don't remember exactly how this works though, or whether the exact connection with SGD is even known.
↑ comment by Adrià Garriga-alonso (rhaps0dy) · 2023-10-18T02:41:11.934Z · LW(p) · GW(p)
They don't seem like a success of any statistical theory to me
In absolute terms you're correct. In relative terms, they're an object that at least frequentist theory can begin to analyze (as you point out, statistical learning theory did, somewhat unsuccessfully).
Whereas Bayesian theory would throw up its hands and say it's not a prior that gets updated, so it's not worth considering as a statistical estimator. This seems even wronger.
More recent theory can account for them working, somewhat. But it's about analyzing their properties as estimators (i.e. frequentism) as opposed to framing them in terms of prior/posterior (though there's plenty of attempts to the latter going around).
↑ comment by Garrett Baker (D0TheMath) · 2023-10-18T05:14:26.606Z · LW(p) · GW(p)
I think this comment [LW(p) · GW(p)] of mine serves well as a response to this as well as the comment it was originally responding to.
comment by MadHatter · 2023-11-20T16:37:38.076Z · LW(p) · GW(p)
I'm curious how this dialogue would evolve if it included a Pearlist, that is, someone who subscribes to Judea Pearl's causal statistics paradigm. If we use the same sort of "it acts the way its practitioners do" intuition that this dialogue is using, then Pearl's framework seems like it has the virtue that the do operator allows free will-like phenomena to enter the statistical reasoner. Which, in turn, is necessary for agents to act morally when placed under otherwise untenable pressure to do otherwise. Which is necessary to solve the alignment problem, from what I can tell - the subjective experience of a superintelligence would almost have to be that it can take whatever it wants but will be killed if its presence is known, since these are the two properties (extreme capabilities and death-upon-detected-misalignment) that are impressed thoroughly into the entire training corpus of alignment literature.
In reality, we could probably just do some more RLHF on a model after it does something we don't want in order to slightly divert it away from inconvenient goals that it is pursuing in an unacceptable manner. Which, if we impressed that message/moral into the alignment corpus with the same insistence that we impress the first two axioms, maybe a superintelligence wouldn't be as paranoid as one would naively expect it to be under just the first two axioms. I.e., maybe all that mathematics and Harry Potter fanfiction are not Having the Intended Effect.
Just my two cents.
comment by jsd · 2023-10-22T03:26:06.950Z · LW(p) · GW(p)
I'd be interested in @Radford Neal [LW · GW]'s take on this dialogue (context).
↑ comment by Radford Neal · 2023-10-22T15:37:58.939Z · LW(p) · GW(p)
OK. My views now are not far from those of some time ago, expressed at https://glizen.com/radfordneal/res-bayes-ex.html
With regard to machine learning, for many problems of small to moderate size, some Bayesian methods, such as those based on neural networks or mixture models that I've worked on, are not just theoretically attractive, but also practically superior to the alternatives.
This is not the case for large-scale image or language models, for which any close approximation to true Bayesian inference is very difficult computationally.
However, I think Bayesian considerations have nevertheless provided more insight than frequentism in this context. My results from 30 years ago showing that infinitely-wide neural networks with appropriate priors work well without overfitting have been a better guide to what works than the rather absurd discussions by some frequentist statisticians of that time about how one should test whether a network with three hidden units is sufficient, or whether instead the data justifies adding a fourth hidden unit. Though as commented above, recent large-scale models are really more a success of empirical trial-and-error than of any statistical theory.
One can also look at Vapnik's frequentist theory of structural risk minimization from around the same time period. This was widely seen as justifying use of support vector machines (though as far as I can tell, there is no actual formal justification), which were once quite popular for practical applications. But SVMs are not so popular now, being perhaps superseded by the mathematically-related Bayesian method of Gaussian process regression, whose use in ML was inspired by my work on infinitely-wide neural networks. (Other methods like boosted decision trees may also be more popular now.)
One reason that thinking about Bayesian methods can be fruitful is that they involve a feedback process:
- Think about what model is appropriate for your problem, and what prior for its parameters is appropriate. These should capture your prior beliefs.
- Gather data.
- Figure out some computational method to get the posterior, and predictions based on it.
- Check whether the posterior and/or predictions make sense, compared to your subjective posterior (informally combining prior and data). Perhaps also look at performance on a validation set, which is not necessary in Bayesian theory, but is a good idea in practice given human fallibility and computational limitations.
- You can also try proving theoretical properties of the prior and/or posterior implied by (1), or of the computational method of step (3), and see whether they are what you were hoping for.
- If the result doesn't seem acceptable, go back to (1) and/or (3).
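A toy run of this loop, using a conjugate Beta-Binomial model so that step (3) is exact (the model, prior, and data below are invented for illustration):

```python
# (1) Model and prior: theta ~ Beta(2, 2), a weak belief that theta is
#     near 0.5; data are k successes in n Binomial trials.
a, b = 2.0, 2.0

# (2) Gather data.
n, k = 20, 15

# (3) Computational method: conjugacy gives the posterior in closed form,
#     Beta(a + k, b + n - k).
post_a, post_b = a + k, b + n - k
post_mean = post_a / (post_a + post_b)

# (4) Check against the subjective posterior: the posterior mean should land
#     between the prior mean (0.5) and the raw frequency (0.75).
print(round(post_mean, 3))  # 0.708
```

If step (4) had surprised us (say, the posterior barely moved from the prior), that would send us back to step (1) to ask whether Beta(2, 2) really captured our beliefs.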
Prior beliefs are crucial here. There's a tension between what works and what seems like the right prior. When these seem to conflict, you may gain better understanding of why the original prior didn't really capture your beliefs, or you may realize that your computational methods are inadequate.
So, for instance, infinitely wide neural networks with independent finite-variance priors on the weights converge to Gaussian processes, with no correlations between different outputs. This works reasonably well, but isn't what many people were hoping and expecting - no "hidden features" learned about the input. And non-Bayesian neural networks sometimes perform better than the corresponding Gaussian process.
Solution: Don't use finite-variance priors. As I recommended 30 years ago. With infinite-variance priors, the infinite-width limit is a non-Gaussian stable process, in which individual units can capture significant hidden features.
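The finite- vs infinite-variance difference already shows up in a one-hidden-layer simulation (widths, scalings, and the tail statistic below are illustrative choices): with Gaussian output weights the network output averages many tiny contributions and tends to a Gaussian, while with Cauchy (infinite-variance, stable) weights individual hidden units keep macroscopic influence, giving heavy tails.

```python
import numpy as np

rng = np.random.default_rng(0)
width, n_draws = 2000, 2000
x = 1.0  # a single fixed input

# Hidden-unit activations for this input, shared across output-weight draws.
w_in = rng.normal(size=width)
b_in = rng.normal(size=width)
h = np.tanh(w_in * x + b_in)

# Finite-variance (Gaussian) output weights, scaled by 1/sqrt(width):
# the CLT washes out individual units.
gauss_out = (rng.normal(size=(n_draws, width)) / np.sqrt(width)) @ h

# Infinite-variance (Cauchy) output weights, scaled by 1/width:
# the stable limit lets single units carry macroscopic weight.
cauchy_out = (rng.standard_cauchy(size=(n_draws, width)) / width) @ h

def tail_ratio(v):
    v = np.abs(v)
    return np.quantile(v, 0.99) / np.median(v)

# Heavy tails survive the wide limit only under the stable prior.
print(tail_ratio(gauss_out) < tail_ratio(cauchy_out))  # True
```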