**othercriteria** on The mystery of Brahms · 2015-10-28T22:39:43.925Z · score: 1 (1 votes) · LW · GW

Seconding all of gjm's criticisms, and adding another point.

The *sostenuto* (middle) pedal was invented in 1844. The *sustain* (right) pedal has been around roughly as long as the piano itself, since piano technique is pretty much unthinkable without it.

**othercriteria** on LW survey: Effective Altruists and donations · 2015-05-15T17:23:22.840Z · score: 0 (0 votes) · LW · GW

The explanation by owencb is what I was trying to address. To be explicit about when the offset is being added, I'm suggesting replacing your `log1p(x) ≡ log(1 + x)` transformation with `log(c + x)` for `c = 10` or `c = 100`.

If the choice of log-dollars is just for presentation, it doesn't matter too much. But in a lesswrong-ish context, log-dollars also have connotations of things like the Kelly criterion, where it is taken completely seriously that there's more of a difference between $0 and $1 than between $1 and $3^^^3.

**othercriteria** on LW survey: Effective Altruists and donations · 2015-05-14T04:38:11.166Z · score: 4 (6 votes) · LW · GW

Given that at least 25% of respondents listed $0 in charity, the offset you add to the charity ($1 if I understand `log1p` correctly) seems like it could have a large effect on your conclusions. You may want to do some sensitivity checks by raising the offset to, say, $10 or $100 or something else where a respondent might round their giving down to $0 and see if anything changes.
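For concreteness, this is the kind of sensitivity check I mean, run on made-up donation data (a point mass at $0 plus a rough log-normal tail; not the actual survey responses):

```python
import numpy as np

# Hypothetical donation data: a 25% point mass at $0 plus a rough
# log-normal tail. Illustrative only, not the actual survey responses.
rng = np.random.default_rng(0)
donations = np.concatenate([
    np.zeros(250),
    np.round(rng.lognormal(5, 1.5, 750)),
])

# Compare log(c + x) transforms across offsets c: a larger offset pulls
# the $0 group closer to the rest and compresses the overall spread.
for c in (1, 10, 100):
    z = np.log(c + donations)
    print(f"c = {c:>3}: mean = {z.mean():.2f}, sd = {z.std():.2f}")
```

If conclusions (correlations, group differences) move a lot between c = 1 and c = 100, they're being driven by how far below everyone else the transform places the $0 donors.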

**othercriteria** on May 2015 Media Thread · 2015-05-06T16:33:51.205Z · score: 1 (1 votes) · LW · GW

Curtis Yarvin, who looked to Mars for tips and tricks on writing a "tiny, diamond-perfect kernel" for a programming environment.

**othercriteria** on How urgent is it to intuitively understand Bayesianism? · 2015-04-10T13:42:16.232Z · score: 1 (1 votes) · LW · GW

The Rasch model does not hate truth, nor does it love truth, but the truth is made out of items which it can use for something else.

**othercriteria** on Harry Potter and the Methods of Rationality discussion thread, March 2015, chapter 119 · 2015-03-11T00:06:08.929Z · score: 1 (1 votes) · LW · GW

No, it's much more akin to Dennett's "Where Am I?" or to becoming meguca.

**othercriteria** on Harry Potter and the Methods of Rationality discussion thread, February 2015, chapter 108 · 2015-02-24T20:46:01.942Z · score: 1 (1 votes) · LW · GW

This seems like a good occasion to quote the twist reveal in Orson Scott Card's Dogwalker:

> We stood there in his empty place, his shabby empty hovel that was ten times better than anywhere we ever lived, and Doggy says to me, real quiet, he says, "What was it? What did I do wrong? I thought I was like Hunt, I thought I never made a single mistake in this job. In this one job."
>
> And that was it, right then I knew. Not a week before, not when it would do any good. Right then I finally knew it all, knew what Hunt had done. Jesse Hunt never made mistakes. But he was also so paranoid that he haired his bureau to see if the babysitter stole from him. So even though he would never accidentally enter the wrong P-word, he was just the kind who would do it on purpose. "He doublefingered every time," I says to Dog. "He's so damn careful he does his password wrong the first time every time, and then comes in on his second finger."
>
> "So one time he comes in on the first try, so what?" He says this because he doesn't know computers like I do, being half-glass myself.
>
> "The system knew the pattern, that's what. Jesse H. is so precise he never changed a bit, so when we came in on the first try, that set off alarms. It's my fault, Dog. I knew how crazy paranoidical he is, I knew that something was wrong, but not till this minute I didn't know what it was. I should have known it when I got his password, I should have known. I'm sorry, you never should have gotten me into this, I'm sorry, you should have listened to me when I told you something was wrong. I should have known, I'm sorry."

**othercriteria** on Bayesian Utility: Representing Preference by Probability Measures · 2015-01-14T23:42:22.490Z · score: 0 (0 votes) · LW · GW

This seems cool, but I have a nagging suspicion that it reduces to a handful of sentences, in greater generality, if you use the conditional expectation of the utility function and the Radon-Nikodym theorem?

**othercriteria** on December 2014 Media Thread · 2014-12-04T15:47:40.256Z · score: 1 (1 votes) · LW · GW

Noun phrases that are insufficiently abstract.

**othercriteria** on Neo-reactionaries, why are you neo-reactionary? · 2014-11-19T04:13:01.248Z · score: 5 (9 votes) · LW · GW

> echo chambers [...] where meaningless duckspeak is endlessly repeated

Imagine how intolerable NRx would be if it were to acquire one of these. Fortunately, their ideas are too extreme for 4chan, even, so I have no idea where such a forum would be hosted.

**othercriteria** on Open thread, Nov. 17 - Nov. 23, 2014 · 2014-11-18T14:31:56.940Z · score: 6 (6 votes) · LW · GW

How meaningful is the "independent" criterion given the heavy overlaps in works cited and what I imagine must be a fairly recent academic MRCA among all the researchers involved?

**othercriteria** on Open thread, Nov. 17 - Nov. 23, 2014 · 2014-11-17T18:56:24.954Z · score: 9 (9 votes) · LW · GW

> stupid problem

> embarrassingly simple math since forever

> I should have been years ahead of my peers

> momentary lack of algebraic insight ("I could solve this in an instant if only I could get rid of that radical")

> for which I've had the intuitions since before 11th grade when they began teaching it to us

Sorry to jump from object-level to meta-level here, but it seems pretty clear that the problem here is not just about math. Your subjective assessments of how difficult these topics are don't square with how well you report doing at them. And you're attaching emotions of shame and panic ("problem has snowballed") to observations that should just be objective descriptions of where you are now. Get these issues figured out first (unless you're in some educational setting with its own deadlines). Math isn't going anywhere; it will still be there when you're in a place where doing it won't cause you distress.

**othercriteria** on The "best" mathematically-informed topics? · 2014-11-14T17:21:38.309Z · score: 2 (2 votes) · LW · GW

It's been a while since I've thought about how to learn ecology, but maybe check out Ben Bolker's Ecological Models and Data in R? It would also be a decent way to start to learn how to do statistics with R.

**othercriteria** on The "best" mathematically-informed topics? · 2014-11-14T17:04:20.413Z · score: 2 (2 votes) · LW · GW

That is an important destination but maybe too subtle a starting point.

Start with ecological models for inter-species interactions (predation, competition, mutualism, etc.) where there are more examples and the patterns are simpler, starker, and more intuitive. Roughly, death processes may depend on all involved populations but birth processes depend on each species separately. Then move to natural selection and evolution, intra-species interactions, where the birth processes for each genotype may depend on populations of all the different genotypes, and death processes depend on the phenotypes of all the different populations.
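A minimal sketch of that birth/death structure, using the classic Lotka-Volterra predation model with made-up parameters (forward-Euler integration, so only qualitative):

```python
import numpy as np

# Lotka-Volterra predation: prey births depend on prey alone, prey
# deaths and predator births depend on both populations, predator
# deaths depend on predators alone. Parameters are illustrative.
a, b, c, d = 1.0, 0.1, 0.075, 1.5
prey, pred = 10.0, 5.0
dt, steps = 0.001, 20000

traj = []
for _ in range(steps):
    dprey = (a * prey - b * prey * pred) * dt   # birth: prey only; death: both
    dpred = (c * prey * pred - d * pred) * dt   # birth: both; death: predators only
    prey, pred = prey + dprey, pred + dpred
    traj.append((prey, pred))

traj = np.array(traj)
print("prey range:", traj[:, 0].min().round(1), "to", traj[:, 0].max().round(1))
print("pred range:", traj[:, 1].min().round(1), "to", traj[:, 1].max().round(1))
```

The two populations cycle around each other: exactly the "death depends on everyone, birth depends on your own kind" pattern, before any genotype structure is layered on.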

**othercriteria** on 2014 Less Wrong Census/Survey · 2014-11-14T15:05:41.488Z · score: 0 (0 votes) · LW · GW

The conscientiousness/akrasia interactions are also fascinating, but even harder to measure. There's a serious missing-not-at-random censoring effect going on for people too conscientious to leave off digit ratio but too akrasic to do the measurement. I nearly fell into this bucket.

**othercriteria** on The "best" mathematically-informed topics? · 2014-11-14T14:44:22.482Z · score: 5 (5 votes) · LW · GW

> do what gwern does

Or do the complete opposite.

The impression I get of gwern is that he reads widely, thinks creatively, and experiments frequently, so he is constantly confronted with hypotheses that he has encountered or has generated. His use of statistics is generally *confirmatory*, in that he's using data to filter out unjustified hypotheses so he can further research or explore or theorize about the remaining ones.

Another thing you can do with data is *exploratory* data analysis, using statistics to pull out interesting patterns for further consideration. The workflow for this might look more like:

- Acquire (often multivariate) data from another researcher, source, or experiment.
- Look at its marginal distributions to check your understanding of the system and catch really obvious outliers.
- Maybe use tools like mixture modeling or Box-Cox transformation to clarify marginal distributions.
- Use statistical tools like (linear, logistic, support vector, etc.) regression, PCA, etc., to find patterns in the data.
- Do stuff with the resulting patterns: think up mechanisms, do confirmatory analysis, check literature, show them to other people, etc.

A lot of what you get out of this process will be spurious, but seeing hypotheses that the data seemed to support go down in flames is a good way to convince yourself of the value of *confirmatory* analysis, and of tools for dealing with this multiple testing problem.
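On synthetic data standing in for something handed over by another researcher, a minimal version of this workflow might look like (numpy only, with PCA done by hand via SVD):

```python
import numpy as np

# Synthetic multivariate data: two correlated features, one skewed one.
rng = np.random.default_rng(1)
n = 500
x1 = rng.normal(0, 1, n)
x2 = 0.8 * x1 + rng.normal(0, 0.6, n)   # shares variation with x1
x3 = rng.exponential(1, n)              # skewed marginal
y = 2 * x1 - x3 + rng.normal(0, 0.5, n)
X = np.column_stack([x1, x2, x3])

# Marginal summaries: catch skew and obvious outliers.
for j, name in enumerate(["x1", "x2", "x3"]):
    col = X[:, j]
    print(f"{name}: mean {col.mean():.2f}, sd {col.std():.2f}, max {col.max():.2f}")

# A log transform (one simple Box-Cox case) to tame the skewed marginal.
X[:, 2] = np.log1p(X[:, 2])

# PCA via SVD of the centered matrix: directions of shared variation.
Xc = X - X.mean(axis=0)
_, s, _ = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)
print("variance explained by PCs:", np.round(explained, 2))

# Linear regression: pull out candidate patterns relating X to y.
beta, *_ = np.linalg.lstsq(np.column_stack([np.ones(n), X]), y, rcond=None)
print("regression coefficients:", np.round(beta, 2))
```

Everything here (the variable names, the particular transforms) is invented for illustration; the point is just how mechanical the exploratory loop is, and how easily it spits out patterns worth double-checking.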

I remember Gelman saying useful stuff like this, but it's been a while since I read that post so I might be mischaracterizing it.

(Ilya, you know all of this, surely at a deeper level than I do. I'm just rhetorically talking to you as a means to dialogue with Capla. Gwern, hopefully my model of you is not too terrible.)

**othercriteria** on Open thread, Nov. 3 - Nov. 9, 2014 · 2014-11-03T17:57:41.031Z · score: 1 (1 votes) · LW · GW

No idea. Factor analysis is the standard tool to see that some instrument (fancy word for ability) is *not* unitary. It's worth learning about anyway, if it's not in your toolbox.

**othercriteria** on Open thread, Nov. 3 - Nov. 9, 2014 · 2014-11-03T17:53:53.129Z · score: 5 (5 votes) · LW · GW

> Some people like to layer trousers

A simple way to do this is flannel-lined jeans. The version made by L.L. Bean has worked well for me. They trade off a bit of extra bulkiness for substantially greater warmth and mildly improved wind protection. Random forum searches suggest that the fleece-lined ones are even warmer, but you lose the cool plaid patterning on the rolled-up cuffs.

**othercriteria** on Open thread, Nov. 3 - Nov. 9, 2014 · 2014-11-03T15:31:19.758Z · score: 2 (2 votes) · LW · GW

A not quite nit-picking critique of this phenomenon is that it's treating a complex cluster of abilities as a unitary one.

In some of the (non-Olympic!) distance races I've run, it's seemed to me that I just couldn't move my legs any faster than they were going. In others, I've felt great except for a side stitch that made me feel like I'd vomit if I pushed myself harder. And in still others, I couldn't pull in enough air to make my muscles do what I wanted. In the latter case, I'd definitely notice the lower oxygen levels but in the former cases, maybe I wouldn't.

So dial down my oxygen and ask me to do a road race? Maybe I'll notice, maybe I won't. But ask me to do a decathlon, and some medley swimming, and a biathlon? I bet I'll notice the low oxygen on at least some of those subtasks, whichever of them require just the wrong mix of athletic abilities.

For the reading one, I can believe this if I'm doing some light pleasure reading and just trying to push plot into my brain as fast as possible. But if I'm reading math research papers, getting the words and symbols into my head is not the rate-limiting step. If there are some typos in the prose, or even in the results or proofs, it doesn't make much of a difference. There might be some second-order effects--when I try to fill in details and an equation doesn't balance, I can be less certain that the error is mine--but these are minor.

So maybe sharpen your claim down to unitary(-ish) abilities?

**othercriteria** on Is this paper formally modeling human (ir)rational decision making worth understanding? · 2014-10-24T00:36:42.064Z · score: 4 (4 votes) · LW · GW

Seconding a lot of calef's observations.

If the new topic you want to learn is "extended behavior networks", then maybe this is your best bet. But if you really want to learn about something like AI or ML or the design of agents that behave reasonably by the standards of some utility-like theory, then this is probably a bad choice. A quick search in Google Scholar (if you're not using this, or some equivalent, making this a step before going to the hivemind is a good idea) suggests that extended behavior networks are backwater-y. If the idea of a network of things interacting to make a decision appeals to you, maybe look into Petri nets or POMDPs. Or better yet, start with something like Russell and Norvig's AIMA to get a better view of the landscape. If the irrationality part is interesting, start with Kahneman, Slovic, and Tversky's Judgment under uncertainty: Heuristics and biases, which gives you a curated collection of jargoney papers.

**othercriteria** on Anthropic signature: strange anti-correlations · 2014-10-23T05:47:59.043Z · score: 0 (0 votes) · LW · GW

While maybe not essential, the "anti-" aspect of the correlations induced by anthropic selection bias at least seems important. Obviously, the appropriate changes of variables can make any particular correlation go either positive or negative. But when the events all measure the same sort of thing (e.g., flooding in 2014, flooding in 2015, etc.), the selection bias seems like it would manifest as anti-correlation. Stretching an analogy beyond its breaking point, I can imagine these strange anti-correlations inducing something like anti-ferromagnetism.
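A toy version of the effect (my construction, not from the post): two independent flood severities, with observers existing only in worlds where their sum stayed below a threshold. Conditioning on survival makes the independent causes anti-correlated:

```python
import numpy as np

# Toy anthropic selection: two independent "flood" severities, but
# observers only exist in worlds where the combined severity stayed
# below a threshold. Selection on the sum induces anti-correlation.
rng = np.random.default_rng(2)
n = 200_000
flood_a = rng.normal(0, 1, n)
flood_b = rng.normal(0, 1, n)

survived = flood_a + flood_b < 1.0     # selection acts on the sum
r_all = np.corrcoef(flood_a, flood_b)[0, 1]
r_surv = np.corrcoef(flood_a[survived], flood_b[survived])[0, 1]
print(f"correlation, all worlds:       {r_all:+.3f}")
print(f"correlation, surviving worlds: {r_surv:+.3f}")
```

Because both events measure the same sort of thing and selection acts on their total, the induced correlation comes out negative, as in the post.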

**othercriteria** on What math is essential to the art of rationality? · 2014-10-21T18:21:41.723Z · score: 0 (0 votes) · LW · GW

> To pick a frequentist algorithm is to pick a prior with a set of hypotheses, i.e. to make Bayes' Theorem computable and provide the unknowns on the r.h.s. above (as mentioned earlier you can in theory extract the prior and set of hypotheses from an algorithm by considering which outcome your algorithm would give when it saw a certain set of data, and then inverting Bayes' Theorem to find the unknowns).

Okay, this is the last thing I'll say here until/unless you engage with the Robins and Wasserman post that IlyaShpitser and I have been suggesting you look at. You can indeed pick a prior and hypotheses (and I guess a way to go from posterior to point estimation, e.g., MAP, posterior mean, etc.) so that your Bayesian procedure does the same thing as your non-Bayesian procedure for any realization of the data. The problem is that in the Robins-Ritov example, your prior may need to depend on the data to do this! Mechanically, this is no problem; philosophically, you're updating on the data twice and it's hard to argue that doing this is unproblematic. In other situations, you may need to do other unsavory things with your prior. If the non-Bayesian procedure that works well looks like a Bayesian procedure that makes insane assumptions, why should we look to Bayesian as a foundation for statistics?

(I may be willing to bite the bullet of poor frequentist performance in some cases for philosophical purity, but I damn well want to make sure I understand what I'm giving up. It is supremely dishonest to pretend there's no trade-off present in this situation. And a Bayes-first education doesn't even give you the concepts to see what you gain and what you lose by being a Bayesian.)
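To make the trade-off concrete, here is a stripped-down toy with the Robins-Ritov flavor rather than their actual construction: units are sampled with a known, design-chosen probability π(x); the Horvitz-Thompson estimator weights by π(x) and works, while a naive estimator that (like the likelihood for θ) ignores the design is badly biased:

```python
import numpy as np

# A toy with the Robins-Ritov flavor (my own simplification, not their
# construction). Units are observed with known, design-chosen
# probability pi(x); the estimand is E[Y] = 0.5. Horvitz-Thompson
# weights by the known pi(x), which the likelihood says nothing about.
rng = np.random.default_rng(3)
n = 100_000
x = rng.uniform(0, 1, n)
pi = 0.05 + 0.9 * x                   # known selection probability
theta = 0.2 + 0.6 * x                 # P(Y = 1 | x); E[Y] = 0.5
observed = rng.uniform(0, 1, n) < pi
y = rng.uniform(0, 1, n) < theta

ht = np.mean((observed & y) / pi)     # inverse-probability weighting
naive = y[observed].mean()            # ignores the design entirely
print(f"truth 0.500, Horvitz-Thompson {ht:.3f}, naive {naive:.3f}")
```

In the real Robins-Ritov example x is high-dimensional and θ(x) is unlearnable, so "just model θ(x) well" is not an escape route the way it might be here.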

**othercriteria** on What math is essential to the art of rationality? · 2014-10-16T17:58:06.985Z · score: 2 (2 votes) · LW · GW

You're welcome for the link, and it's more than repaid by your causal inference restatement of the Robins-Ritov problem.

> Of course arguably this entire setting is one Bayesians don't worry about (but maybe they should? These settings do come up).

Yeah, I think this is the heart of the confusion. When you encounter a problem, you can turn the Bayesian crank and it will always do the Right thing, but it won't always do the *right* thing. What I find disconcerting (as a Bayesian drifting towards frequentism) is that it's not obvious how to assess the adequacy of a Bayesian analysis from within the Bayesian framework. In principle, you can do this mindlessly by marginalizing over all the model classes that might apply, maybe? But in practice, a single model class usually gets picked by non-Bayesian criteria like "does the posterior depend on the data in the right way?" or "does the posterior capture the 'true model' from simulated data?". Or a Bayesian may (rightly or wrongly) decide that a Bayesian analysis is not appropriate in that setting.

**othercriteria** on What math is essential to the art of rationality? · 2014-10-16T01:32:48.885Z · score: 2 (2 votes) · LW · GW

Have you seen the series of blog posts by Robins and Wasserman that starts here? In problems like the one discussed there (such as the high-dimensional ones that are commonly seen these days), Bayesian procedures, and more broadly any procedures that satisfy the likelihood principle, just don't work. The procedures that do work, according to frequentist criteria, do not arise from the likelihood so it's hard to see how they could be approximations to a Bayesian solution.

You can also see this situation in the (frequentist) classic Theory of Point Estimation written by Lehmann and Casella. The text has four central chapters: "Unbiasedness", "Equivariance", "Average Risk Optimality", and "Minimaxity and Admissibility". Each of these introduces a principle for the design of estimators and then shows where this principle leads. "Average Risk Optimality" leads to Bayesian inference, but also Bayes-Lite methods like empirical Bayes. But each of the other three chapters leads to its own theory, with its own collection of methods that are optimal under that theory. Bayesian statistics is an important and substantial part of the story told in that book, but it's not the whole story. Said differently, Bayesian statistics may be a framework for Bayesian procedures and a useful way of analyzing non-Bayesian statistics, but it is not the framework for all of statistics.

**othercriteria** on What math is essential to the art of rationality? · 2014-10-15T23:30:43.013Z · score: 2 (2 votes) · LW · GW

> (Theoretical) Bayesian statistics is the study of probability flows under minimal assumptions - any quantity that behaves like we want a probability to behave can be described by Bayesian statistics.

But nobody, least of all Bayesian statistical practitioners, does this. They encounter data, get familiar with it, pick/invent a model, pick/invent a prior, run (possibly approximate) inference of the model against the data, verify if inference is doing something reasonable, and jump back to an earlier step and change something if it doesn't. After however long this takes (if they don't give up), they might make some decision based on the (possibly approximate) posterior distribution they end up with. This decision might involve taking some actions in the wider world and/or writing a paper.

This is essentially the same workflow a frequentist statistician would use, and it's only reasonable that a lot of the ideas that work in one of these settings would be useful, if not obvious or well-motivated, in the other.

I know that philosophical underpinnings and underlying frameworks matter, but to quote from a recent review article by Reid and Cox (2014):

> A healthy interplay between theory and application is crucial for statistics, as no doubt for other fields. This is particularly the case when by theory we mean foundations of statistical analysis, rather than the theoretical analysis of specific statistical methods. The very word foundations may, however, be a little misleading in that it suggests a solid base on which a large structure rests for its entire security. But foundations in the present context equally depend on and must be tested and revised in the light of experience and assessed by relevance to the very wide variety of contexts in which statistical considerations arise. It would be misleading to draw too close a parallel with the notion of a structure that would collapse if its foundations were destroyed.

**othercriteria** on What math is essential to the art of rationality? · 2014-10-15T19:04:46.883Z · score: 0 (2 votes) · LW · GW

Thanks for pointing out the Gelman and Shalizi paper. Just skimmed it so far, but it looks like it really captures the zeitgeist of what reasonably thoughtful statisticians think of the framework they're in the business of developing and using.

Plus, their final footnote, describing their misgivings about elevating Bayesianism beyond a tool in the hypothetico-deductive toolbox, is great:

> Ghosh and Ramamoorthi (2003, p. 112) see a similar attitude as discouraging inquiries into consistency: 'the prior and the posterior given by Bayes theorem [sic] are imperatives arising out of axioms of rational behavior – and since we are already rational why worry about one more' criterion, namely convergence to the truth?

**othercriteria** on What math is essential to the art of rationality? · 2014-10-15T18:18:50.721Z · score: 4 (6 votes) · LW · GW

I would advise looking into frequentist statistics **before** studying Bayesian statistics. Inference done under Bayesian statistics is curiously silent about anything besides the posterior probability, including whether the model makes sense for the data, whether the knowledge gained about the model is likely to match reality, etc. Frequentist concepts like consistency, coverage probability, ancillarity, model checking, etc., don't just apply to frequentist estimation; they can be used to assess and justify Bayesian procedures.
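As one concrete example of a frequentist criterion applied to a Bayesian procedure, here's a coverage check for a conjugate-normal credible interval (all numbers illustrative):

```python
import numpy as np

# Coverage check of a Bayesian 95% credible interval, frequentist-style:
# fix a "true" mean, simulate many datasets, and count how often the
# interval contains it. Normal likelihood with known variance 1 and a
# diffuse N(0, 100) prior; all numbers illustrative.
rng = np.random.default_rng(4)
true_mu, n, trials, prior_var = 3.0, 20, 5000, 100.0

hits = 0
for _ in range(trials):
    x = rng.normal(true_mu, 1.0, n)
    post_var = 1.0 / (1.0 / prior_var + n)   # conjugate normal update
    post_mean = post_var * x.sum()           # prior mean 0
    half = 1.96 * np.sqrt(post_var)
    hits += post_mean - half <= true_mu <= post_mean + half
coverage = hits / trials
print(f"frequentist coverage of the 95% credible interval: {coverage:.3f}")
```

With a diffuse prior the coverage lands close to the nominal 95%; with a strongly misspecified prior it wouldn't, and nothing inside the Bayesian calculation itself would flag that.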

If anything, Bayesian statistics should just be treated as a factory that churns out estimation procedures. By a corollary of the complete class theorem, this is also the only way you can get good estimation procedures.

**ETA**: Can I get comments in addition to (or instead of) down votes here? This is a topic I don't want to be mistaken about, so please tell me if I'm getting something wrong. Or rather if my comment is coming across as "boo Bayes", which calls out for punishment.

**othercriteria** on What math is essential to the art of rationality? · 2014-10-15T17:28:14.025Z · score: 2 (2 votes) · LW · GW

This is really good and impressive. Do you have such a list for statistics?

**othercriteria** on Rationality Quotes October 2014 · 2014-10-15T13:42:06.526Z · score: 0 (0 votes) · LW · GW

The example I'm thinking about is a non-random graph on the square grid where west/east neighbors are connected and north/south neighbors aren't. Its density is asymptotically right at the critical threshold and could be pushed over by adding additional west/east non-neighbor edges. The connected components are neither finite nor giant.

**othercriteria** on Open thread, Oct. 13 - Oct. 19, 2014 · 2014-10-14T05:11:21.452Z · score: 4 (4 votes) · LW · GW

If you want a solid year-long project, find a statistical model you like and figure out how to do inference in it with variational Bayes. If this has been done, change finite parts of the model into infinite ones until you reach novelty or the model is no longer recognizable/tractable. At that point, either try a new model or instead try to make the VB inference online or parallelizable. Maybe target a NIPS-style paper and a ~30-page technical report in addition to whatever your thesis will look like.

And attend a machine learning class, if offered. There's a lot of lore in that field and you'll miss out if you do the read-the-book-work-each-problem thing that is alleged to work in math.

**othercriteria** on 2014 Less Wrong Census/Survey - Call For Critiques/Questions · 2014-10-13T02:58:06.835Z · score: 2 (2 votes) · LW · GW

But to all of us perched on the back of Cthulhu, who is forever swimming left, is it the survey that will seem fixed and unchanging from our moving point of view?

**othercriteria** on LessWrong Help Desk - free paper downloads and more (2014) · 2014-10-12T18:40:41.492Z · score: 1 (1 votes) · LW · GW

Buehler, Denis. "Incomplete understanding of complex numbers Girolamo Cardano: a case study in the acquisition of mathematical concepts." Synthese 191.17 (2014): 4231-4252.

Vélez, Ricardo, and Tomás Prieto-Rumeau. "Random assignment processes: strong law of large numbers and De Finetti theorem." TEST (2014): 1-30.

**othercriteria** on 2014 Less Wrong Census/Survey - Call For Critiques/Questions · 2014-10-12T03:17:29.535Z · score: 6 (6 votes) · LW · GW

In the mental health category, I'd love to see (adult) ADHD there as well. I'm less directly interested in substance abuse disorder and learning disabilities (in the US sense) / non-autism developmental disabilities, but those would be interesting additions too.

**othercriteria** on Open thread, Oct. 6 - Oct. 12, 2014 · 2014-10-11T23:52:46.234Z · score: 0 (0 votes) · LW · GW

I'd believe that; my knowledge of music history isn't that great and seeing teleology where there isn't any is an easy mistake.

I guess what I'm saying, speaking very vaguely, is that melodies existing within their own tonal contexts are as old as bone flutes, and their theory goes back at least as far as Pythagoras. And most folk music traditions cooked up their own favorite scale system, which you can just stay in and make music as long as you want to. For that matter, notes in these scale systems can be played as chords and a lot of the combinations make musical sense (often with nicer consonance than is possible with notes that have to respect even temperament).

What western art music and its audience co-evolved into (not necessarily uniquely among music traditions?) was a state where something like the first few bars of the Schubert String Quintet can function. The first violin plays a note twice, with the harmonic context changing under it, driving the melody forward, driving the harmony forward, etc. I should probably have said a non-static harmonic context to be more clear.

**othercriteria** on Rationality Quotes October 2014 · 2014-10-07T22:28:19.153Z · score: 0 (0 votes) · LW · GW

In something like the Erdős-Rényi random graph, I agree that there is an asymptotic equivalence between the existence of a giant component and paths from a randomly selected point being able to reach the "edge".

On something like an n x n grid with edges just to left/right neighbors, the "edge" is reachable from any starting point, but each connected component occupies just a 1/n fraction of the vertices. As n gets large, this fraction goes to 0.

Since, at least as a reductio, the details of graph structure (and not just its edge fraction) matter, and because percolation theory doesn't capture the time dynamics that are important in understanding epidemics, it's probably better to start from a more appropriate model.

Maybe look at Limit theorems for a random graph epidemic model (Andersson, 1998)?
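For a sense of what the time dynamics add, here's a minimal discrete-time SIR epidemic on an Erdős-Rényi graph (my own toy, not the model in the Andersson paper):

```python
import numpy as np

# Minimal discrete-time SIR epidemic on an Erdos-Renyi graph, to show
# the time dynamics a static percolation picture leaves out.
# (Illustrative only; not the model from the Andersson paper.)
rng = np.random.default_rng(5)
n, mean_degree, p_transmit = 2000, 8.0, 0.3
adj = rng.uniform(0, 1, (n, n)) < mean_degree / n
adj = np.triu(adj, 1)
adj = adj | adj.T                               # symmetric, no self-loops

state = np.zeros(n, dtype=int)                  # 0=S, 1=I, 2=R
state[rng.choice(n, 5, replace=False)] = 1      # seed 5 infections

curve = []
while (state == 1).any():
    infectious = state == 1
    # each susceptible is infected independently by each infectious neighbor
    pressure = adj[:, infectious].sum(axis=1)
    newly = (state == 0) & (rng.uniform(0, 1, n) < 1 - (1 - p_transmit) ** pressure)
    state[infectious] = 2                       # infectious for one step, then removed
    state[newly] = 1
    curve.append(int(newly.sum()))
print("final size:", int((state == 2).sum()), "peak incidence:", max(curve))
```

The final size matches what a bond-percolation calculation would give, but the incidence curve, its peak, and its timing are exactly the things percolation has no vocabulary for.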

**othercriteria** on Open thread, Oct. 6 - Oct. 12, 2014 · 2014-10-07T21:40:36.942Z · score: 5 (5 votes) · LW · GW

The idea that melodies, or at least an approximation accurate to within a few cents, can be embedded into a harmonic context. Yet in western art music, it took centuries for this to go from technically achievable but unthinkable to experimental to routine.

**othercriteria** on Rationality Quotes October 2014 · 2014-10-06T05:13:45.028Z · score: 2 (2 votes) · LW · GW

I think percolation theory concerns itself with a different question: is there a path from a starting point to the "edge" of the graph, as the size of the graph is taken to infinity? It is easy to see that it is possible to hit infinity while infecting an arbitrarily small fraction of the population.

But there are crazy universality and duality results for random graphs, so there's probably some way to map an epidemic model to a percolation model without losing anything important?

**othercriteria** on Open thread, July 28 - August 3, 2014 · 2014-07-31T15:37:48.689Z · score: 2 (2 votes) · LW · GW

This comment rubbed me the wrong way and I couldn't figure out why at first, which is why I went for a pithy response.

I think what's going on is I was reacting to the pragmatics of your exchange with Coscott. Coscott informally specified a model and then asked what we could conclude about a parameter of interest, which coin was chosen, given a *sufficient statistic* of all the coin toss data, the number of heads observed.

This is implicitly a statement that model checking isn't important in solving the problem, because everything that could be used for model checking, e.g., statistics on runs to verify independence, the number of tails observed to check against a type of miscounting where the number of tosses doesn't add up to 1,000,000, mental status inventories to detect hallucination, etc., is left out of the statistic communicated.

Maybe Coscott (the fictional version who flipped all those coins) did model checking or maybe not, but if it was done and the data suggested miscounting or hallucination, then Coscott wouldn't have stated the problem like this.

So, yeah, the points you raise are valid object-level ones, but bringing them up this way in a problem poser / problem solver context was really unexpected and seemed to violate the norms for this sort of exchange.

**othercriteria** on Why the tails come apart · 2014-07-31T14:40:07.182Z · score: 0 (0 votes) · LW · GW

I'm quite confident in predicting that generic models are much more likely to be overfitted than to have too few degrees of freedom.

It's easy to regularize estimation in a model class that's too rich for your data. You can't "unregularize" a model class that's restrictive enough not to contain an adequate approximation to the truth of what you're modeling.
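A quick illustration of the asymmetry on made-up data: a rich model (40 candidate features, only 3 of which matter) is rescued by a ridge penalty, while a model class that simply lacked the true features would have no analogous knob to turn:

```python
import numpy as np

# Rich model, sparse truth: 40 candidate features but only 3 matter.
# Plain least squares (lambda = 0) overfits; a ridge penalty shrinks
# the excess degrees of freedom. All numbers are made up.
rng = np.random.default_rng(6)
n, p = 60, 40
X = rng.normal(0, 1, (n, p))
beta_true = np.zeros(p)
beta_true[:3] = (2.0, -1.5, 1.0)
y = X @ beta_true + rng.normal(0, 1.0, n)

def ridge_fit(X, y, lam):
    # Closed-form ridge estimate: (X'X + lam I)^{-1} X'y
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

X_new = rng.normal(0, 1, (5000, p))     # fresh data for out-of-sample error
y_new = X_new @ beta_true
errs = {}
for lam in (0.0, 10.0):
    errs[lam] = np.mean((X_new @ ridge_fit(X, y, lam) - y_new) ** 2)
    print(f"lambda = {lam:>4}: test MSE = {errs[lam]:.2f}")
```

Turning the penalty up repairs the over-rich fit; there is no penalty you can turn *down* to conjure the missing features into a too-restrictive model.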

**othercriteria** on Open thread, July 28 - August 3, 2014 · 2014-07-29T23:22:02.613Z · score: 2 (4 votes) · LW · GW

When I know I'm to be visited by one of my parents and I see someone who looks like my mother, should my first thought be "that person looks so unlike my father that maybe it is him and I'm having a stroke"? Should I damage my eyes to the point where this phenomenon doesn't occur to spare myself the confusion?

**othercriteria** on [QUESTION]: Academic social science and machine learning · 2014-07-29T17:02:05.081Z · score: 1 (1 votes) · LW · GW

What I was saying was sort of vague, so I'm going to formalize here.

Data is coming from some random process X(θ,ω), where θ parameterizes the process and ω captures all the randomness. Let's suppose that for any particular θ, living in the set Θ of parameters where the model is well-defined, it's easy to sample from X(θ,ω). We don't put any particular structure (in particular, cardinality assumptions) on Θ. Since we're being frequentists here, nature's parameter θ' is fixed and unknown. We only get to work with the realization of the random process that actually happens, X' = X(θ',ω').

We have some sort of analysis t(⋅) that returns a scalar; applying it to the random data gives us the random variables t(X(θ,ω)), which is still parameterized by θ and still easy to sample from. We pick some null hypothesis Θ0 ⊂ Θ, usually for scientific or convenience reasons.

We want some measure of how weird/surprising the value t(X') is if θ' were actually in Θ0. One way to do this, if we have a simple null hypothesis Θ0 = { θ0 }, is to calculate the p-value p(X') = P(t(X(θ0,ω)) ≥ t(X')). This can clearly be approximated using samples from t(X(θ0,ω)).

For composite null hypotheses, I guessed that using p(X') = sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ t(X')) would work. Paraphrasing jsteinhardt, if Θ0 = { θ01, ..., θ0n }, you could approximate p(X') using samples from t(X(θ01,ω)), ..., t(X(θ0n,ω)), but it's not clear what to do when Θ0 has infinite cardinality. I see two ways forward. One is approximating p(X') by doing the above computation over a finite subset of points in Θ0, chosen by gridding or at random. This should give an approximate lower bound on the p-value, since it might miss θ where the observed data look unexceptional. If the approximate p-value leads you to fail to reject the null, you can believe it; if it leads you to reject the null, you might be less sure and might want to continue trying more points in Θ0. Maybe this is what jsteinhardt means by saying it "doesn't terminate"? The other way forward might be to use features of t and Θ0, which we do have some control over, to simplify the expression sup{θ0 ∈ Θ0} P(t(X(θ0,ω)) ≥ c). Say, if t(X(θ,ω)) is convex in θ for any ω and Θ0 is a convex bounded polytope living in some Euclidean space, then the supremum only depends on how P(t(X(θ0,ω)) ≥ c) behaves at a finite number of points.
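A sketch of the gridding approach in a concrete case I made up (binomial data, t = number of heads, composite null Θ0 = [0.3, 0.5]):

```python
import numpy as np

# A concrete (made-up) instance of the gridding approach: X is 100 coin
# flips with heads probability theta, t(X) is the number of heads, and
# the composite null is Theta0 = [0.3, 0.5].
rng = np.random.default_rng(7)

def p_value_at(theta0, t_obs, n=100, sims=2000):
    """Monte Carlo estimate of P(t(X(theta0, w)) >= t_obs)."""
    t_sim = rng.binomial(n, theta0, size=sims)
    return np.mean(t_sim >= t_obs)

t_obs = 65                               # observed heads out of 100
grid = np.linspace(0.3, 0.5, 21)         # finite subset of Theta0
p_approx = max(p_value_at(th, t_obs) for th in grid)
print(f"approximate p-value: {p_approx:.4f}")
```

In this case t is stochastically increasing in θ, so the supremum sits at the boundary θ0 = 0.5; that's exactly the kind of structure on t and Θ0 that collapses the sup to finitely many points.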

So yeah, things are far more complicated than I claimed, as I realize now that I've worked through it. But you can still do sensible things even with a composite null.

**othercriteria**on Why the tails come apart · 2014-07-28T17:32:30.975Z · score: 2 (2 votes) · LW · GW

Good point. When I introduced that toy example with Cauchy factors, it was the easiest way to get factors that, informally, don't fill in their observed support. Letting the distribution of the factors drift would be a more realistic way to achieve this.

> the whole underlying distribution switched and all your old estimates just went out of the window...

I like to hope (and should probably endeavor to ensure) that I don't find myself in situations like that. A system that evolves generatively over time (in what the joint distribution of factor X and outcome Y looks like) might still be discriminatively stationary (in what the conditional distribution of Y given X looks like). Even if we have to throw out our information about what new X's will look like, we may be able to keep saying useful things about Y once we see the corresponding new X.

**othercriteria**on [QUESTION]: Academic social science and machine learning · 2014-07-28T01:32:39.274Z · score: 1 (1 votes) · LW · GW

If you're working with composite hypotheses, replace "your statistic" with "the supremum of your statistic over the relevant set of hypotheses".

**othercriteria**on Why the tails come apart · 2014-07-28T01:16:59.916Z · score: 8 (8 votes) · LW · GW

This looks cool. My biggest caution would be that this effect may be tied to the specific class of data generating processes you're looking at.

Your framing seems to be that you look at the world as filled with entities whose features, under any conceivable measurement, are distributed as independent multivariate normals. The predictive factor is a feature and so is the outcome. Then using extreme order statistics of the predictive factor to make inferences about the extreme order statistics of the outcome is informative but unreliable, as you illustrated. Playing around in R, reliability seems better for thin-tailed distributions (e.g., uniform) and worse for heavy-tailed distributions (e.g., Cauchy). Fixing the distributions and letting the number of observations vary, I agree with you that the probability of picking exactly the greatest outcome goes to zero. But I'd conjecture that the probability that the observation with the greatest factor is in some fixed top percentile of the outcomes will go to one, at least in the thin-tailed case and maybe in the normal case.
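That conjecture is easy to probe by simulation. Here is a minimal Python sketch (my own construction: standard normal factors, outcome = factor + standard normal noise) that estimates how often the largest-factor observation lands in the top 5% of outcomes:

```python
import random

def top_pct_hit_rate(n_obs, n_trials, pct, rng):
    """Fraction of trials in which the observation with the largest factor
    has an outcome in the top `pct` fraction of outcomes.  Factors are
    N(0, 1); outcome = factor + independent N(0, 1) noise."""
    hits = 0
    for _ in range(n_trials):
        fac = [rng.gauss(0, 1) for _ in range(n_obs)]
        out = [f + rng.gauss(0, 1) for f in fac]
        best_by_factor = out[max(range(n_obs), key=fac.__getitem__)]
        cutoff = sorted(out)[int((1 - pct) * n_obs)]
        hits += best_by_factor >= cutoff
    return hits / n_trials

rng = random.Random(7)
rate = top_pct_hit_rate(n_obs=1000, n_trials=200, pct=0.05, rng=rng)
```

At these settings the hit rate comes out well above chance (which would be 5%), consistent with hitting the top percentile being much easier than hitting the exact maximum.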

But consider another data generating process. If you carry out the following little experiment in R

```
fac <- rcauchy(1000)        # heavy-tailed predictive factor
out <- fac + rnorm(1000)    # outcome = factor + Gaussian noise
plot(rank(fac), rank(out))  # rank-rank plot of factor vs. outcome
rank(out)[which.max(fac)]   # outcome rank of the largest-factor observation
```

it looks like extreme factors are great predictors of extreme outcomes, even though the factors are only unreliable predictors of outcomes overall. I wouldn't be surprised if the probability of the greatest factor picking the greatest outcome goes to one as the number of observations grows.

Informally (and too evocatively) stated, what seems to be happening is that as long as new observations are expanding the space of factors seen, extreme factors pick out extreme outcomes. When new observations mostly duplicate already observed factors, all of the duplicates would predict the most extreme outcome and only one of them can be right.

**othercriteria**on [QUESTION]: Academic social science and machine learning · 2014-07-22T13:03:06.834Z · score: 1 (1 votes) · LW · GW

The grandchild comment suggests that he does, at least to the level of a typical user (though not a researcher or developer) of these methods.

**othercriteria**on [QUESTION]: Academic social science and machine learning · 2014-07-22T12:58:38.468Z · score: 1 (1 votes) · LW · GW

You really should have mentioned here one of your Facebook responses: that maybe the data generating processes seen in social science problems don't look like (the output of generative versions of) ML algorithms. What's the point of using an ML method that scales well computationally if looking at more data doesn't bring you to the truth (consistency guarantees can go away if the truth is outside the support of your model class) or has terrible bang for the buck (even if you keep consistency, you may take an efficiency hit)?

Also, think about how well these methods work over the entire research process. Looking at probit modeling, the first thing that pops out is how its light normal tails suggest sensitivity to outliers. If you gave a statistician a big, messy-looking data set on an unfamiliar subject, this would probably push them toward something like logistic regression, with its reliance on a heavier-tailed distribution and better expected robustness. But if you're the social scientist who assembled the data set, you may be sure that you've dealt with any data collection, data entry, measurement, etc. errors, and you may be deeply familiar with each observation. At this stage, outliers are not unknowable random noise that gets in the way of the signal but may themselves be the signal, as they have a disproportionate effect on the learned model. At the least, they are where additional scrutiny should be focused, as long as the entity doing the analysis can provide that scrutiny.
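The tail-weight contrast behind that probit-vs-logistic intuition is easy to make concrete. A small Python check (my own illustration; the logistic scale is set so both densities have variance 1, making the comparison fair):

```python
import math

def normal_pdf(x):
    return math.exp(-x * x / 2) / math.sqrt(2 * math.pi)

def logistic_pdf(x, s=math.sqrt(3) / math.pi):
    # Scale s chosen so the logistic also has variance 1
    # (its variance is s^2 * pi^2 / 3).
    z = math.exp(-abs(x) / s)
    return z / (s * (1 + z) ** 2)

# Density ratio logistic/normal at increasingly extreme x: the logistic
# puts far more mass in the tails, which is why a stray outlier pulls a
# probit fit around more than a logistic fit.
ratios = {x: logistic_pdf(x) / normal_pdf(x) for x in (0, 2, 4, 6)}
```

By x = 6 the variance-matched logistic density is thousands of times larger than the normal density, so the probit model treats such points as vastly more "impossible" and bends the fit to accommodate them.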

**othercriteria**on OpenWorm and differential technological development · 2014-05-20T12:46:34.176Z · score: 1 (1 votes) · LW · GW

> The faster cell simulation technologies advance, the weaker is the hardware they'll run on.

If hardware growth strictly followed Moore's Law and CPUs (or GPUs, etc.) were completely general-purpose, this would be true. But, if cell simulation became a dominant application for computing hardware, one could imagine instruction set extensions or even entire architecture changes designed around it. Obviously, it would also take some time for software to take advantage of hardware change.

**othercriteria**on Open Thread April 16 - April 22, 2014 · 2014-04-16T15:37:06.976Z · score: 0 (0 votes) · LW · GW

I was just contesting your statement as a universal one. For this poll, I agree you can't really pursue the covariate strategy. However, I think you're overstating the challenge of getting more data and figuring out what to do with it.

For example, measuring BPD status is difficult. You can do it by conducting a psychological examination of your subjects (costly but accurate), you can do it by asking subjects to self-report on a four-level Likert-ish scale (cheap but inaccurate), or you could do countless other things along this tradeoff surface. On the other hand, measuring things like sex, age, level of education, etc. is easy. And even better, we have baseline levels of these covariates for communities like LessWrong, the United States, etc. with respect to which we might want to see if our sample is biased.

**othercriteria**on Open Thread April 16 - April 22, 2014 · 2014-04-16T15:06:55.523Z · score: 0 (0 votes) · LW · GW

Sure you can, in principle. When you have measured covariates, you can compare their sampled distribution to that of the population of interest. Find enough of a difference (modulo multiple comparisons, significance, researcher degrees of freedom, etc.) and you've detected bias. Ruling out systematic bias using your observations alone is much more difficult.
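The comparison can be as simple as a goodness-of-fit test on one covariate. A Python sketch with made-up numbers of my own (the population shares and sample counts are purely illustrative):

```python
# Assumed population shares for a covariate versus observed sample counts
# (both are illustrative numbers, not real survey data).
population = {"male": 0.49, "female": 0.51}
sample = {"male": 812, "female": 188}
n = sum(sample.values())

# Pearson chi-square goodness-of-fit statistic.
chi2 = sum(
    (sample[k] - n * population[k]) ** 2 / (n * population[k])
    for k in population
)

# The 5% critical value for chi-square with 1 degree of freedom is 3.841;
# a larger statistic is evidence the sampling was biased on this covariate.
biased = chi2 > 3.841
```

With a sample that lopsided relative to the assumed population, the statistic comes out far above the critical value, so bias is detected (modulo the multiple-comparisons and researcher-degrees-of-freedom caveats above).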

Even in this case, where we don't have covariates, there are some patterns in the ordinal data (the concept of ancillary statistics might be helpful in coming up with some of these) that would be extremely unlikely under unbiased sampling.

**othercriteria**on Open Thread April 8 - April 14 2014 · 2014-04-08T17:07:22.272Z · score: 1 (1 votes) · LW · GW

There's actually some really cool math developed about situations like this one. Large deviation theory describes how occurrences like the 1,000,004 red / 1,000,000 blue one become unlikely at an exponential rate and how, conditioning *on* them occurring, information about *the manner in which* they occurred can be deduced. It's a sort of trivial conclusion in this case, but if we accept a principle of maximum entropy, we can be dead certain that each of the 2,000,004 red or blue draws marginally looks like a Bernoulli with 1,000,004:1,000,000 odds. That's just the likeliest way (outside of our setup being mistaken) of observing our extremely unlikely outcome.
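The "each draw marginally matches the observed odds" claim can be verified exactly by counting, at least in a miniature version. A short Python check (the 10-draws / 6-reds numbers are mine, shrunk from the post's 2,000,004 for tractability):

```python
from math import comb

# Miniature version of the conditioning argument: 10 exchangeable (i.i.d.)
# draws, conditioned on the totals being 6 red and 4 blue.
n, r = 10, 6

# P(a given draw is red | exactly r reds in n draws): count sequences with
# that position red among all sequences with r reds.  The i.i.d. success
# probability cancels out of the ratio, so the answer doesn't depend on it.
p_red_given_totals = comb(n - 1, r - 1) / comb(n, r)
# ... which equals r / n: marginally, each conditioned draw looks Bernoulli
# with exactly the observed odds.
```

The same cancellation goes through with n = 2,000,004 and r = 1,000,004, giving the 1,000,004:1,000,000 marginal odds claimed above.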