Mean of quantiles

post by Stuart_Armstrong · 2015-09-09T18:55:46.492Z · LW · GW · Legacy · 21 comments

In a previous post, I looked at some of the properties of using the median rather than the mean.

Inspired by Houshalter's comment, it seems we might be able to strike a compromise between median and mean. Taking the mean of the lower quartile, median, and upper quartile would also have the nice features I described, and would likely be closer to the mean.

Furthermore, there's no reason to stop there. We can take the mean of the n-1 n-quantiles, i.e. the quantiles at i/n for i = 1, ..., n-1.
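
Concretely, here's a minimal numerical sketch of this quantity (the quantile_mean helper is mine, assuming numpy and scipy are available):

```python
import numpy as np
from scipy import stats

def quantile_mean(dist, n):
    """Mean of the n-1 n-quantiles of `dist`, a frozen
    scipy.stats distribution with a .ppf (inverse CDF) method."""
    ps = np.arange(1, n) / n       # 1/n, 2/n, ..., (n-1)/n
    return dist.ppf(ps).mean()     # average those quantiles

# A skewed example: Gamma(2) has mean 2
gamma = stats.gamma(a=2)
for n in (4, 10, 100, 10000):
    print(n, quantile_mean(gamma, n))
# The printed values creep towards the true mean, 2, as n grows.
```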

Two questions:

  1. As n increases, does this quantity tend to the mean, if the mean exists? (I suspect yes.)
  2. For some distributions (e.g. the Cauchy distribution) this quantity will tend to a limit as n increases, even though there is no mean. Is this an effective way of extending means to distributions that don't possess them?

Note that, unlike the median approach, for large enough n this maximiser will pay Pascal's mugger.

21 comments

Comments sorted by top scores.

comment by IlyaShpitser · 2015-09-09T22:55:42.560Z · LW(p) · GW(p)

Isn't there a big literature on summary statistics? Why reinvent the wheel?

Replies from: Gunnar_Zarncke
comment by Gunnar_Zarncke · 2015-09-10T19:57:23.716Z · LW(p) · GW(p)

Just because it is obvious to you doesn't mean that everybody immediately jumps to it. Me included, and I liked the statistics classes back then. Could you please point to the wheels?

Replies from: twanvl
comment by twanvl · 2015-09-11T11:53:21.045Z · LW(p) · GW(p)

The wheels in this case come from robust statistics.

One example of a good robust estimator for the center is the [truncated mean](https://en.wikipedia.org/wiki/Truncated_mean). To put it simply: throw away the lowest x% and the highest x% of the samples, and take the mean of the rest. If x=0 you get the regular mean; if x=50% you get the median.
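
A two-line illustration (scipy ships this as scipy.stats.trim_mean; the data here are made up):

```python
import numpy as np
from scipy import stats

data = np.append(np.random.normal(0, 1, 1000), 1e6)  # one wild outlier

print(np.mean(data))               # wrecked by the outlier
print(stats.trim_mean(data, 0.1))  # drop lowest/highest 10%, mean the rest
print(np.median(data))             # the x -> 50% limit
```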

Replies from: IlyaShpitser
comment by IlyaShpitser · 2015-09-11T15:18:33.893Z · LW(p) · GW(p)

Descriptive statistics, also, probably.


"Please do my literature search for me" is not a reasonable request, though.

Replies from: Vaniver
comment by Vaniver · 2015-09-11T20:36:20.969Z · LW(p) · GW(p)

"Please do my literature search for me" is not a reasonable request, though.

Knowing the term of art is very helpful for doing these sorts of searches, though. If one goes to the Wikipedia page on summary statistics, one can easily get to the page for descriptive statistics, but neither of them is very good at leading one on to robust statistics, or to anything besides the mean or median.

Replies from: IlyaShpitser
comment by IlyaShpitser · 2015-09-13T15:38:14.742Z · LW(p) · GW(p)

Google for "summary statistics" is sufficient. But more importantly, lit. search isn't a google/wikipedia type of activity.

comment by Lumifer · 2015-09-09T20:21:34.940Z · LW(p) · GW(p)

> we might be able to strike a compromise between median and mean.

I'll repeat my comment on Houshalter's post: in making your compromise, what are you optimizing for? On the basis of which criteria will you distinguish a "good" compromise from a "bad" compromise?

Replies from: Houshalter
comment by Houshalter · 2015-09-10T00:19:22.720Z · LW(p) · GW(p)

It's somewhat subjective. There is no law of the universe that says one approach is right and the other is wrong (so long as you avoid inconsistencies, which the policy selection idea does). There is no objective way to compress an entire distribution down to a single value.

Do you want to select a distribution that contains the best average future? Do you want to select from a distribution that contains the best median future? Or do you want something in between? What do we, humans, actually want?

Replies from: AlexMennen, Lumifer
comment by AlexMennen · 2015-09-10T04:34:34.664Z · LW(p) · GW(p)

> Do you want to select a distribution that contains the best average future? Do you want to select from a distribution that contains the best median future?

That question doesn't quite make sense as stated. Utilities are ways of modeling our preferences, not features of the universe for us to have preferences about. So it makes sense to ask what ways of choosing between probability distributions over utilities are compatible with what we actually want, but not which way is what we actually want.

Replies from: Houshalter
comment by Houshalter · 2015-09-10T23:19:02.161Z · LW(p) · GW(p)

Well that's just the thing. With median utility, you don't actually need to put a number on it at all. You just need a preference ordering of outcomes.

I am actually somewhat confused about how to assign utility to outcomes under expected utility. Just because you think an outcome is a thousand times more desirable doesn't necessarily mean you would accept a 1,000:1 bet for it. Or does it? I do not know.

Like, 1,000 people dying seems like it should objectively be 1,000 times worse than 1 person dying. Does that mean I should pay the mugger? Or should I come up with a scheme to discount the disutility of each person dying over some threshold? Or just do the intuitive thing and ignore super-low-probability risks, no matter how much utility they promise, but otherwise keep the intuition that all lives are equally valuable and not discountable?

Replies from: AlexMennen
comment by AlexMennen · 2015-09-11T00:50:46.845Z · LW(p) · GW(p)

> Just because you think an outcome is a thousand times more desirable doesn't necessarily mean you would accept a 1,000:1 bet for it.

Correct. VNM utility is not necessarily linear with respect to the intuitive strength of the preference. Your utility function is defined based on what bets you would accept, rather than being a way of telling you what bets you should accept.

> Like, 1,000 people dying seems like it should objectively be 1,000 times worse than 1 person dying. Does that mean I should pay the mugger?

Nope; see above. You can define a notion of utility that corresponds to this kind of notion of importance, but this will not necessarily be the decision-theoretic notion of utility.

For example, suppose an agent wants there to be many happy people, and thinks that the goodness of an outcome is proportional to the number of happy people, so it gives its utility function as U(there are n happy people) = n. And suppose it has the following way of assigning utilities to uncertain outcomes: it picks some strictly increasing continuous function f (which could be arctan, for instance), calculates the expected value of f(n), and applies f^-1 to that to get the utility.

Assuming f is nonlinear, this agent does not use the mean utility as the utility of a gamble, but it is still VNM-rational. Thus, by the VNM theorem, there exists a utility function V (not the same as U) such that the agent acts as if it were maximizing the expected value of V; this utility function is given by V(there are n happy people) = f(n).
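
A small sketch of that agent, with f = arctan (the numbers are illustrative only):

```python
import numpy as np

f, f_inv = np.arctan, np.tan

def agent_utility(outcomes, probs):
    """The agent's utility for a gamble over n (numbers of happy people):
    f^-1 of the expected value of f(n). Maximizing this is equivalent to
    maximizing the expectation of V(n) = f(n), per the VNM theorem."""
    return f_inv(np.dot(probs, f(np.asarray(outcomes, dtype=float))))

# A 50/50 gamble between 0 and 1000 happy people:
print(agent_utility([0, 1000], [0.5, 0.5]))  # ~1.0, far below 500
# arctan saturates, so enormous payoffs barely move E[f(n)].
```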

comment by Lumifer · 2015-09-10T00:44:58.709Z · LW(p) · GW(p)

> What do we, humans, actually want?

Ah. We finally get to interesting questions :-)

comment by gjm · 2015-09-09T20:02:32.681Z · LW(p) · GW(p)

Suppose the cdf of your distribution is F, whose inverse I will call G to avoid cumbersome notation. Then your quantity is the limit of 1/(n-1) [G(1/n) + G(2/n) + ... + G((n-1)/n)], which is one Riemann sum on the way to the integral of G. An easy change of variable shows that this integral equals the mean (or doesn't exist if the mean doesn't). So that resolves question 1.
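
Spelling out the change of variable, with u = G(p) (so p = F(u)):

```latex
\int_0^1 G(p)\,dp \;=\; \int_{-\infty}^{\infty} u \, dF(u) \;=\; \mathbb{E}[X].
```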

Replies from: Stuart_Armstrong
comment by Stuart_Armstrong · 2015-09-10T09:06:05.229Z · LW(p) · GW(p)

Cheers! That's just what I needed.

comment by gjm · 2015-09-09T19:36:52.225Z · LW(p) · GW(p)

The general term you want to look up here is "L-estimators".

[EDITED to add:] ... Well, kinda. L-estimators are traditionally things you apply to a sample from the distribution, but you're proposing to compute analogous quantities for the distribution itself. But I think this is (except for really pathological cases?) equivalent to taking a really big sample -- e.g., >> n in your example -- and computing the corresponding L-estimator for that sample.

comment by gjm · 2015-09-09T20:12:40.804Z · LW(p) · GW(p)

If we apply this to the Cauchy distribution, your sum is one of the Riemann sums on the way to (aside from a constant factor) the integral from -pi/2 to +pi/2 of tan x dx. This integral diverges because at each endpoint it's like the integral of 1/x, but your procedure is a bit like a Cauchy principal value -- it's like taking the limit of the integral from (-pi/2+epsilon) to (pi/2-epsilon).

So it seems like it might misbehave interestingly for distributions with oddly asymmetrical tails, or with singular behaviour "inside", though "misbehave" is a rather unfair term (you can't really expect it to do well when the mean doesn't exist).
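
A quick sketch of that cancellation (scipy's cauchy.ppf is exactly the G here, up to location and scale):

```python
import numpy as np
from scipy import stats

def quantile_mean(dist, n):
    ps = np.arange(1, n) / n
    return dist.ppf(ps).mean()

cauchy = stats.cauchy(loc=3)  # no mean, but symmetric about 3
for n in (10, 100, 10000):
    print(n, quantile_mean(cauchy, n))
# The symmetric grid cancels the two tails pairwise, so the answer is
# the centre of symmetry, 3 -- the principal-value-like behaviour.
```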

I'm not sure how we could answer question 2; what counts as "effective"? Perhaps an extension of the notion of mean is "effective" if it has nice algebraic properties; e.g., pseudomean(X)+pseudomean(Y) = pseudomean(X+Y) whenever any two of the pseudomeans exist, etc. I suspect that that isn't the case, but I'm not sure why :-).

Replies from: Stuart_Armstrong, Stuart_Armstrong
comment by Stuart_Armstrong · 2015-09-10T15:39:20.025Z · LW(p) · GW(p)

> but your procedure is a bit like a Cauchy principal value

Interestingly, we can imagine doing the integral of G (the inverse of the CDF) that you define. The Cauchy principal value is like integrating G between x- and x+ such that G(x-)=-y and G(x+)=y, and letting y go to infinity. The averaging I described is like integrating G between x and 1-x and letting x tend to zero.
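
In symbols, with F the CDF and G = F^{-1}:

```latex
\text{PV:}\ \lim_{y\to\infty}\int_{F(-y)}^{F(y)} G(p)\,dp
\qquad\text{vs.}\qquad
\text{quantile averaging:}\ \lim_{x\to 0^{+}}\int_{x}^{1-x} G(p)\,dp.
```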

comment by Stuart_Armstrong · 2015-09-10T08:41:25.740Z · LW(p) · GW(p)

> So it seems like it might misbehave interestingly for distributions with oddly asymmetrical tails

Yep; it's not too hard to construct things where the limit doesn't exist. However, all the counterexamples I've found share an interesting property: they're not bounded above by any multiple of a power of (1/x). This might be the key requirement...

> pseudomean(X)+pseudomean(Y) = pseudomean(X+Y)

Yes, that's exactly the property I'm looking for.

comment by gjm · 2015-09-09T20:59:38.625Z · LW(p) · GW(p)

Your mean of quartiles is very much like Tukey's trimean, which is (LQ+2M+UQ)/4 rather than (LQ+M+UQ)/3. I expect it has broadly similar statistical properties. Tukey was a smart chap and I would guess the trimean is "better" for most purposes. But, as Lumifer says, what counts as better will depend on what you're trying to do and why.

(E.g., the trimean will have a better breakdown point but be less efficient than the mean; a worse breakdown point but more efficient than the median.)
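
For concreteness, a side-by-side sketch (neither function is in numpy/scipy under these names, so they're written out):

```python
import numpy as np

def trimean(x):
    """Tukey's trimean: (LQ + 2*median + UQ) / 4."""
    lq, m, uq = np.percentile(x, [25, 50, 75])
    return (lq + 2 * m + uq) / 4

def mean_of_quartiles(x):
    """The OP's variant: (LQ + median + UQ) / 3."""
    lq, m, uq = np.percentile(x, [25, 50, 75])
    return (lq + m + uq) / 3

x = np.random.standard_t(df=3, size=100000)  # fat-tailed sample
print(np.mean(x), np.median(x), trimean(x), mean_of_quartiles(x))
```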

Replies from: Baughn
comment by Baughn · 2015-09-11T13:07:44.259Z · LW(p) · GW(p)

> (E.g., the trimean will have a better breakdown point but be less efficient than the mean; a worse breakdown point but more efficient than the median.)

What does "efficient" mean, in this context? Time to calculate would be my first guess, but the median should be faster to calculate than the trimean.

Replies from: gjm
comment by gjm · 2015-09-11T14:13:20.436Z · LW(p) · GW(p)

See here.

[EDITED to add:] Sorry, that's a bit rude; I should also give a brief explanation here.

Any estimator will be noisy. All else being equal, you would prefer one with less noise. There is a thing called the Cramér-Rao inequality that gives a lower bound on how noisy an estimator can be, as measured by its variance. (But see the note below.)

The efficiency of an estimator is the ratio between the Cramér-Rao bound and the estimator's variance. An estimator whose efficiency is 1 has as little variance as any estimator can have. (Such estimators need not exist.) Noisier estimators have lower efficiency.

Efficiency depends on the underlying distribution. I shouldn't really have said that the mean will be more efficient than the median; if the underlying distribution is thin-tailed enough then it will be; e.g., for large samples of normally distributed data the mean has efficiency 1 while the median has efficiency about 0.64. But if the actual distribution is fat-tailed, the median may actually be a more efficient estimator than the mean.
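
That 0.64 (asymptotically 2/pi) is easy to check by simulation; a sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1001, 20000
samples = rng.normal(size=(trials, n))  # true location is 0

# Both estimators target 0, so their variance is just E[estimate^2].
var_mean = np.mean(samples.mean(axis=1) ** 2)
var_median = np.mean(np.median(samples, axis=1) ** 2)
print(var_mean / var_median)  # relative efficiency of the median: ~0.64
```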

(Note: You might notice that I said something obviously false above. It's trivial to make a completely un-noisy, zero-variance estimator of any parameter: just always estimate zero. But this will be a biased estimator; i.e., its expectation will not equal the underlying value it's meant to be estimating. The Cramér-Rao inequality only applies to unbiased estimators. In some cases, for some applications, the "best" estimator may actually be a biased but less noisy one. For instance, suppose you have some samples of a normally-distributed random variable and you want to estimate its variance. The "obvious" thing to do is to compute 1/n sum (x-xbar)^2. That gives you a biased estimator but, famously, you can get rid of the bias by computing 1/(n-1) sum (x-xbar)^2 instead. And if you want to minimize your mean squared error -- i.e., minimize the expectation of (est. variance - actual variance)^2 -- then what you want is neither of those, but instead 1/(n+1) sum (x-xbar)^2.)
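
A quick simulation of those three divisors (true variance 1):

```python
import numpy as np

rng = np.random.default_rng(1)
n, trials = 10, 200000
x = rng.normal(size=(trials, n))
ss = ((x - x.mean(axis=1, keepdims=True)) ** 2).sum(axis=1)  # sum (x - xbar)^2

for d in (n, n - 1, n + 1):
    est = ss / d
    print(d, "bias:", est.mean() - 1.0, "MSE:", ((est - 1.0) ** 2).mean())
# n-1 is (essentially) unbiased; n+1 gives the smallest mean squared error.
```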

[EDITED again to add:] Er, I should add that everything above is addressing the situation where you have some samples of a random variable and want to estimate its location parameter. Stuart is actually considering something a bit different, where you actually know the random variable itself and want a representative location parameter that may or may not be the mean, in order to make comparisons between different random variables by computing that parameter for each and comparing. So this is of doubtful relevance to the OP.