Confounded No Longer: Insights from 'All of Statistics'

post by TurnTrout · 2018-05-03T22:56:27.057Z · score: 56 (13 votes) · LW · GW · 6 comments
Foreword
Using fancy tools like neural nets, boosting and support vector machines without understanding basic statistics is like doing brain surgery before knowing how to use a bandaid.
For some reason, statistics always seemed somewhat disjoint from the rest of math, more akin to a bunch of tools than a rigorous, carefully-constructed framework. I am here to atone for my foolishness.
This academic term started with a jolt - I quickly realized that I was missing quite a few prerequisites for the Bayesian Statistics course in which I had enrolled, and that good ol' AP Stats wasn't gonna cut it. I threw myself at All of Statistics, doing a good number of exercises, dissolving confusion wherever I could find it, and making sure I could turn each concept around and make sense of it from multiple perspectives.
I then went even further, challenging myself during the bits of downtime throughout my day to do things like explain variance from first principles, starting from the sample space, walking through random variables and expectation - without help.
All of Statistics
2: Probability

In which sample spaces are formalized.
3: Random Variables
In which random variables are detailed and a multitude of distributions are introduced.
Conjugate Variables

Consider that a random variable is a function $X: \Omega \to \mathbb{R}$. For random variables $X, Y$, we can then produce conjugate random variables $X + Y$ and $XY$, with $(X+Y)(\omega) = X(\omega) + Y(\omega)$ and $(XY)(\omega) = X(\omega)\,Y(\omega)$.
4: Expectation

Evidence Preservation

$$\mathbb{E}[\,\mathbb{E}[X \mid Y]\,] = \mathbb{E}[X]$$

is conservation of expected evidence (thanks to Alex Mennen for making this connection explicit).
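This identity is easy to sanity-check numerically. A minimal sketch with NumPy; the particular joint distribution (a fair coin $Y$ with $X \mid Y \sim N(2Y, 1)$) is an arbitrary choice of mine for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Y is a fair coin flip; X's distribution depends on Y.
y = rng.integers(0, 2, size=n)
x = rng.normal(loc=2.0 * y, scale=1.0)

# Estimate E[X | Y = v] for each value of Y, then average over Y's distribution.
e_x_given_y = np.array([x[y == v].mean() for v in (0, 1)])
lhs = e_x_given_y[y].mean()  # empirical E[E[X | Y]]
rhs = x.mean()               # empirical E[X]; true value is 2 * 0.5 = 1

print(lhs, rhs)  # the two estimates coincide
```

In standard terminology this is the law of total expectation (iterated expectations).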
Marginal Variance

Why does marginal variance

$$\text{Var}(Y) = \mathbb{E}[\text{Var}(Y \mid X)] + \text{Var}(\mathbb{E}[Y \mid X])$$

have two terms? Shouldn't the expected conditional variance be sufficient?
This literally plagued my dreams.
Proof (of the variance; I cannot prove it plagued my dreams):

$$\begin{aligned}
\text{Var}(Y) &= \mathbb{E}[(Y - \mathbb{E}[Y])^2] \\
&= \mathbb{E}[(Y - \mathbb{E}[Y \mid X] + \mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2] \\
&= \mathbb{E}[(Y - \mathbb{E}[Y \mid X])^2] + 2\,\mathbb{E}[(Y - \mathbb{E}[Y \mid X])(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])] + \mathbb{E}[(\mathbb{E}[Y \mid X] - \mathbb{E}[Y])^2] \\
&= \mathbb{E}[\text{Var}(Y \mid X)] + \text{Var}(\mathbb{E}[Y \mid X]).
\end{aligned}$$

The middle term is eliminated as the expectations cancel out after repeated applications of conservation of expected evidence. Another way to look at the two remaining terms: the sum of the expected conditional variance and the variance of the conditional expectation.
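The two-term decomposition can also be verified empirically. A sketch with NumPy; the three-group mixture below is a made-up example of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 300_000

# X selects one of three groups; Y is normal with group-dependent mean and spread.
x = rng.integers(0, 3, size=n)
means = np.array([0.0, 2.0, 5.0])
stds = np.array([1.0, 0.5, 2.0])
y = rng.normal(means[x], stds[x])

weights = np.bincount(x) / n
cond_means = np.array([y[x == g].mean() for g in range(3)])
cond_vars = np.array([y[x == g].var() for g in range(3)])

expected_cond_var = weights @ cond_vars                        # E[Var(Y | X)]
overall_mean = weights @ cond_means
var_of_cond_mean = weights @ (cond_means - overall_mean) ** 2  # Var(E[Y | X])

# Within-group spread plus spread of the group centers recovers the marginal variance.
print(y.var(), expected_cond_var + var_of_cond_mean)
```

Dropping either term breaks the identity: the first captures spread within each group, the second the spread of the group centers themselves.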
Bessel's Correction

When calculating variance from observations $X_1, \dots, X_n$, you might think to write

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^n (X_i - \bar{X})^2,$$

where $\bar{X}$ is the sample mean. However, this systematically underestimates the true variance, as the deviations are measured from the sample mean, which was itself fit to the same observations. The corrected sample variance is thus

$$S^2 = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar{X})^2.$$
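The bias is easy to see in simulation; a sketch (the sample size and distribution below are arbitrary choices of mine):

```python
import numpy as np

rng = np.random.default_rng(2)
n, trials = 5, 100_000
true_var = 4.0  # variance of N(0, 2^2)

samples = rng.normal(0.0, 2.0, size=(trials, n))
naive = samples.var(axis=1, ddof=0).mean()      # divide by n
corrected = samples.var(axis=1, ddof=1).mean()  # divide by n - 1 (Bessel)

# The naive estimator lands near (n-1)/n * true_var = 3.2; the corrected one near 4.0.
print(naive, corrected)
```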
6: Convergence

Equality of Continuous Variables
For continuous random variables $X, Y$, we have $P(X = Y) = 0$, which is surprising. In fact, $P(X = x) = 0$ for any fixed point $x$, as well!

The continuity is the culprit. Since the cumulative distribution functions are continuous, the limit of the density allotted to any given point is 0. Read more here.
Types of Convergence
Let $X_1, X_2, \ldots$ be a sequence of random variables, and let $X$ be another random variable. Let $F_n$ denote the CDF of $X_n$, and let $F$ denote the CDF of $X$.
In Probability

$X_n$ converges to $X$ in probability, written $X_n \xrightarrow{P} X$, if, for every $\epsilon > 0$, $P(|X_n - X| > \epsilon) \to 0$ as $n \to \infty$.
Random variables are functions $X: \Omega \to \mathbb{R}$, assigning a number to each possible outcome in the sample space $\Omega$. Considering this fact, two random variables converge in probability when, in the limit, the probability that their assigned values are "far apart" (differ by more than $\epsilon$) is 0.
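A quick simulation of the definition; here I take $X_n = X + $ Gaussian noise with standard deviation $1/\sqrt{n}$, an arbitrary sequence of my own choosing that converges in probability:

```python
import numpy as np

rng = np.random.default_rng(3)
eps = 0.1
x = rng.normal(size=100_000)  # samples of the limiting variable X

# X_n = X + noise whose scale shrinks with n.
for n in (1, 10, 100, 10_000):
    x_n = x + rng.normal(scale=1.0 / np.sqrt(n), size=x.shape)
    frac_far = (np.abs(x_n - x) > eps).mean()  # estimates P(|X_n - X| > eps)
    print(n, frac_far)
```

The printed fraction shrinks toward 0 as $n$ grows, matching the definition.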
In Distribution

$X_n$ converges to $X$ in distribution, written $X_n \rightsquigarrow X$, if $\lim_{n \to \infty} F_n(t) = F(t)$ at all $t$ for which $F$ is continuous.
A similar geometric intuition applies.
Note: the continuity requirement is important. Imagine we distribute points uniformly on $(0, \frac{1}{n})$; we see that $X_n \rightsquigarrow X$, where $X$ is a point mass at 0. However, $F_n(0)$ is 0 for every $n$, but $F(0) = 1$. Thus CDF convergence does not occur at $t = 0$.
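The same example in code; a sketch with $X_n \sim \text{Uniform}(0, 1/n)$:

```python
import numpy as np

rng = np.random.default_rng(4)

# X_n ~ Uniform(0, 1/n): the mass squeezes toward a point mass at 0.
for n in (1, 10, 100, 1000):
    x_n = rng.uniform(0.0, 1.0 / n, size=100_000)
    f_at_zero = (x_n <= 0).mean()       # empirical F_n(0): stuck at 0
    f_near_zero = (x_n <= 0.01).mean()  # empirical F_n(0.01): heads to 1
    print(n, f_at_zero, f_near_zero)
```

$F_n(0)$ never moves even though the limiting CDF has $F(0) = 1$, which is exactly why the definition excludes points where $F$ is discontinuous.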
In Quadratic Mean
$X_n$ converges to $X$ in quadratic mean, written $X_n \xrightarrow{qm} X$, if $\mathbb{E}[(X_n - X)^2] \to 0$ as $n \to \infty$.
The expected squared distance approaches 0; in contrast to convergence in probability, dealing with expectation means that values of $X_n$ highly deviant with respect to $X$ come into play. For example, if but the extremal values of