How not to sort by a complicated frequentist formula

post by Meni_Rosenfeld · 2013-01-01T21:58:53.054Z · LW · GW · Legacy · 26 comments

In How Not To Sort By Average Rating, Evan Miller gives two wrong ways to generate an aggregate rating from a collection of positive and negative votes, and one method he thinks is correct. But the "correct" method is complicated, poorly motivated, insufficiently parameterized, and founded on frequentist statistics. A much simpler model based on a prior beta distribution has a more solid theoretical foundation and would give more accurate results.

Evan mentions the sad reality that big organizations are using obviously naive methods. In contrast, more dynamic sites such as Reddit have adopted the model he suggested. But I fear that it would cause irreparable damage if the world settles on this solution.

Should anything be done about it? What can be done?

This is also somewhat meta in that LW also aggregates ratings, and I believe changing the model was once discussed (and maybe the beta model was suggested).

 

In the Bayesian model, as in Evan's model, we assume that for every item there is some true probability p of upvoting, representing its quality and the rating we wish to give. Every vote is a Bernoulli trial which gives information about p. The prior for p is the beta distribution with some parameters a and b. After observing n actual votes, of which k are positive, the parameters of the posterior distribution are a+k and b+(n-k), so the posterior mean of p is (a+k)/(a+b+n). This gives the best estimate of the true quality, and reproduces all the desired effects - convergence to the proportion of positive ratings as votes accumulate, while items with insufficient data are pulled towards the prior mean.
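
A minimal sketch of this scoring rule in Python (the function name and the example numbers are mine, for illustration only):

  def bayesian_score(k, n, a, b):
      """Posterior mean of the upvote probability p under a Beta(a, b) prior,
      after observing k positive votes out of n total votes."""
      return (a + k) / (a + b + n)

  # With a prior mean of 0.6 (a = 3, b = 2), an item with no votes scores 0.6,
  # and the score converges to k/n as votes accumulate.
  print(bayesian_score(0, 0, 3, 2))     # 0.6   (pulled entirely to the prior mean)
  print(bayesian_score(80, 100, 3, 2))  # ~0.79 (close to the empirical 0.8)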

The specific parameters a and b depend on the quality distribution in the specific system. a/(a+b) is the average quality and can be taken as simply the empirical proportion of positive votes among all votes in the system. a+b is an inverse measure of variance - a high value means most items are average quality, and a low value means items are either extremely good or extremely bad. It is harder to calibrate, but can still be done using the overall data (e.g., MLE from the entire voting data).
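
One way to do that MLE calibration, sketched under the assumption that per-item vote counts are available and that each item's votes are beta-binomial under the model (the function name and the use of scipy are my own illustration, not from the post):

  import numpy as np
  from scipy.special import betaln
  from scipy.optimize import minimize

  def fit_beta_prior(ks, ns):
      """Maximum-likelihood estimate of the Beta(a, b) prior from per-item
      counts: ks[i] positive votes out of ns[i] total votes for item i."""
      ks, ns = np.asarray(ks, dtype=float), np.asarray(ns, dtype=float)

      def neg_log_lik(log_params):
          a, b = np.exp(log_params)  # optimize in log space to keep a, b > 0
          # Beta-binomial log-likelihood, dropping the k-independent binomial term.
          return -np.sum(betaln(ks + a, ns - ks + b) - betaln(a, b))

      result = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
      a, b = np.exp(result.x)
      return a, b

The fitted a/(a+b) can then be sanity-checked against the simple empirical proportion of positive votes mentioned above.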

 

For the specific problem of sorting, there are other considerations than mere quality. A comment can be in universal agreement, but not otherwise interesting or notable. These may not deserve as prominent a mention as controversial comments which provoke stronger reactions. For this purpose, the "sorting rating" can be multiplied by some function of the total number of votes, such as the square root. If the identity function is used, this becomes similar to a simple difference between the number of positive and negative votes.
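
A possible reading of this adjustment, again only a sketch (the square root is the example from the paragraph above; the function name and everything else is illustrative):

  import math

  def sorting_score(k, n, a, b, notability=math.sqrt):
      """Posterior-mean quality multiplied by a notability factor
      in the total number of votes."""
      quality = (a + k) / (a + b + n)
      return quality * notability(n)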

26 comments

Comments sorted by top scores.

comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-02T00:15:57.756Z · LW(p) · GW(p)

The main thing you want to calculate here is expected-value-of-information. Otherwise new posts drop into the void. Trying to maximize upvotes in the long run means showing new posts that might have a high-upvoting parameter.

Replies from: Manfred, Meni_Rosenfeld
comment by Manfred · 2013-01-02T00:58:23.174Z · LW(p) · GW(p)

Well, you have to make somewhat of a secret sauce, because what you show on top should depend on what you want the top things and the people who see them to do. If you're a site like reddit, putting highly-rated stuff on the front page is good, but what you really want is highly rated new stuff. But if you're urbandictionary, you don't really care when something was submitted, you want to put the best answer at the top - except that if a word has multiple definitions, you want to have variety in your top definitions. Or maybe you're amazon, and you want people to see stuff they've already bought so it's easy for them to rate it. Etc.

comment by Meni_Rosenfeld · 2013-01-02T10:26:45.496Z · LW(p) · GW(p)

This is interesting, especially considering that it favors low-data items, as opposed to both the confidence-interval-lower-bound and the notability adjustment factor, which penalize low-data items.

You can try to optimize it in an explore-vs-exploit framework, but there would be a lot of modeling parameters, and additional kinds of data would need to be considered - specifically, a measure of how many of those who viewed the item bothered to vote at all. Some comments will not get any votes simply because they are not that interesting; so if you keep placing them on top hoping to learn more about them, you'll end up with very few total votes, because you show people things they don't care about.

Replies from: Eliezer_Yudkowsky
comment by Eliezer Yudkowsky (Eliezer_Yudkowsky) · 2013-01-02T10:29:20.603Z · LW(p) · GW(p)

Yep. You'd want to check or guess the size of the user's monitor and where they were scrolling to, and calculate upvotes-per-actual-user-read. As things are read and not upvoted, your confidence that they're not super-high-value items increases and the value of information from showing them again diminishes.

comment by dbaupp · 2013-01-01T22:21:33.955Z · LW(p) · GW(p)

(Link to How Not To Sort By Average Rating.)

Something of interest: the Jeffreys interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)
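
For reference, a sketch of that lower bound (the Jeffreys interval is the equal-tailed credible interval under a Beta(1/2, 1/2) prior; the 95% level here is an arbitrary illustrative choice):

  from scipy.stats import beta

  def jeffreys_lower_bound(k, n, confidence=0.95):
      """Lower bound of the equal-tailed credible interval for the upvote
      probability under a Beta(1/2, 1/2) prior, after k of n positive votes."""
      return beta.ppf((1 - confidence) / 2, k + 0.5, n - k + 0.5)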

But I fear that it would cause irreparable damage if the world settles on this solution.

This is probably vastly exaggerating the possible consequences; it's just a method of sorting, and both the Wilson interval method and a Bayesian method are definitely far better than the naive methods.

Replies from: gwern, Meni_Rosenfeld, Meni_Rosenfeld
comment by gwern · 2013-01-01T22:46:27.781Z · LW(p) · GW(p)

Something of interest: the Jeffreys interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)

I recently did a similar thing for ranking vendors by feedback, using both a Jeffreys interval and a Wilson interval; even on the vendors with little feedback, they were overall pretty similar. IIRC, I don't think they differed by more than 10% anywhere.

comment by Meni_Rosenfeld · 2013-01-01T23:28:35.138Z · LW(p) · GW(p)

(Link to How Not To Sort By Average Rating.)

I forgot to link in the OP. Then remembered, and forgot again.

Something of interest: the Jeffreys interval. Using the lower bound of a credible interval based on that distribution (which is the same as yours) will probably give better results than just using the mean: it handles small sample sizes more gracefully. (I think, but I'm certainly willing to be corrected.)

This seems to use specific parameters for the beta distribution. In the model I describe, the parameters are tailored per domain. This is actually an important distinction.

I think using the lower bound of an interval makes every item "guilty until proven innocent" - with no data, we assume the item is of low quality. In my method, an item with no data is assigned the mean quality of all items (and it is important that we calibrate the parameters for the domain). Which is better is debatable.

comment by Meni_Rosenfeld · 2013-01-02T17:03:02.910Z · LW(p) · GW(p)

But I fear that it would cause irreparable damage if the world settles on this solution.

This is probably vastly exaggerating the possible consequences; it's just a method of sorting, and both the Wilson interval method and a Bayesian method are definitely far better than the naive methods.

I just feel that it will place this low-hanging fruit out of reach. e.g.,

Me: Hey Reddit, I have this cool new sorting method for you to try!

Reddit: What do you mean? We've already moved beyond the naive methods into the correct method. Here, see Miller's paper. No further changes are needed.

Maybe I'm exaggerating - I mean, things can be improved again after being improved once - but I just feel that if the world had a "naive rating method" itch to scratch, and something like Miller's method became the go-to method, something would be wrong.

comment by IlyaShpitser · 2013-01-02T10:23:10.491Z · LW(p) · GW(p)

But the "correct" method is complicated, poorly motivated, insufficiently parameterized, and founded on frequentist statistics.

A quick question. It seems from this sentence that you view "insufficiently parameterized" as a flaw in a method. Can you explain what that phrase means and why it is a flaw? Is this a statement of preference for parametric stat or something else?

Replies from: Meni_Rosenfeld
comment by Meni_Rosenfeld · 2013-01-02T11:35:14.474Z · LW(p) · GW(p)

It means that the model used per item doesn't have enough parameters to encode what we know about the specific domain (where domain is "Reddit comments", "Urban dictionary definitions", etc.)

The formulas discussed define a certain mapping from pairs (positive votes, negative votes) to a quality score. In Miller's model, the same mapping is used everywhere, without consideration of the characteristics of the specific domain. In my model, there are parameters a and b (or alternatively, a/(a+b) and a+b) that we first train per domain, and then apply per item.

For example, let's say you want to decide the order of a (5, 0) item and a (40, 10) item. Miller's model just gives one answer. My model gives different answers depending on:

  • The average quality - if the overall item quality is high (say, most items have 100% positive votes), the (5,0) item should be higher because it's likely one of those 100% items, while (40,10) has proven itself to be of lower quality. If, however, most items have low quality, (40,10) will be higher because it has proven itself to be one of the rare high-quality items, while (5,0) is more likely to be a low-quality item which lucked out.

  • The variance in quality - say the average quality is 50%. If the variance in quality is low, (5,0) will be lower because it is likely to be an average item which lucked out, while (40, 10) has proven to be of high quality. If the variance is high (with most items being either 100% or 0%), (5,0) will be higher because in all likelihood it is one of the 100% items, while (40, 10) has proven to be only 80%.

In short, using a cookie-cutter model without any domain-specific parameters doesn't make the most efficient use of the data possible.
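
A quick numeric check of these scenarios with the posterior-mean formula (the specific a and b values are my own illustrative choices, not from the comment):

  def score(k, n, a, b):
      return (a + k) / (a + b + n)

  # High average quality: prior mean 0.95 (a = 19, b = 1) -> the (5, 0) item wins.
  print(score(5, 5, 19, 1), score(40, 50, 19, 1))        # 0.96 vs ~0.84
  # Low average quality: prior mean 0.3 (a = 3, b = 7) -> the (40, 10) item wins.
  print(score(5, 5, 3, 7), score(40, 50, 3, 7))          # ~0.53 vs ~0.72
  # Same mean 0.5 but high variance (a = b = 0.5) -> the (5, 0) item wins again.
  print(score(5, 5, 0.5, 0.5), score(40, 50, 0.5, 0.5))  # ~0.92 vs ~0.79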

comment by VincentYu · 2013-01-02T04:37:03.822Z · LW(p) · GW(p)

This is also somewhat meta in that LW also aggregates ratings, and I believe changing the model was once discussed (and maybe the beta model was suggested).

LW actually uses the same comment sorting that Reddit uses by default (i.e., it uses the solution suggested by Miller).

Replies from: TrE
comment by TrE · 2013-01-02T10:06:44.184Z · LW(p) · GW(p)

Specifically, the option can be found above the "Show help" button. "Sort by: Best" is what you want.

comment by EHeller · 2013-01-03T03:22:59.735Z · LW(p) · GW(p)

It's not obvious to me that your method improves upon the Wilson score. Certainly, the traditional Bayesian approach (the Jeffreys interval) is rarely that different from the Wilson score - have you played with values to see what the largest differences would look like?

Replies from: Meni_Rosenfeld
comment by Meni_Rosenfeld · 2013-01-03T06:02:26.967Z · LW(p) · GW(p)

Given that a and b are arbitrary, I think the differences can be large. Whether they actually are large for typical datasets I can't readily answer.

In any case the advantages are:

  1. Simplicity. Tuning the parameters is a bit involved, but once you do, the formula to apply for each item is very simple. In many (not all) cases, a complicated formula reflects insufficient understanding of the problem.

  2. Motivation. Taking the lower bound of a confidence/credible interval makes some sense, but it's not that obvious. The need for it arises because we don't model the prior mean, so we don't want to take a risk on unproven items. A posterior mean of the quality is more natural, and won't cause many problems, because items default to the true population mean.

  3. Parametrization. The interval methods have a parameter for the confidence level that sets the size of the interval, but it's not at all clear how to choose it. My method has parameters for the mean and variance, which are based on the data.

  4. Generalization. This framework makes it easier to clearly think about what we want, and replace the posterior mean of p with a posterior mean of some other quantity of interest. e.g., the suggested "explore vs. exploit" tends to give something closer to an interval upper bound than lower bound, and other methods have been suggested.

Replies from: EHeller
comment by EHeller · 2013-01-03T08:18:05.642Z · LW(p) · GW(p)

Yes, a and b are arbitrary - but if they aren't chosen well, your model could be hugely inferior. I'd suggest making a few toy data sets and actually comparing the standard methods (Wilson score, Jeffreys interval) to yours before suggesting everyone embrace it.

Edit for clarity: Just to be clear, the Jeffreys interval (which is usually very close to the Wilson interval) is essentially the same as your model, but with the initial parameters a = b = 1/2.

comment by A1987dM (army1987) · 2013-01-02T23:23:31.565Z · LW(p) · GW(p)

This is the way I would do it, also taking into account EY's point of not hiding away new comments:

Assume each comment has an ‘upvote rate’ U, such that the probability that a comment at the age of t has u upvotes is a Poisson distribution with parameter Ut,

  • P(u|U, t) = (Ut)^u exp(−Ut)/u!

and similarly for downvotes,

  • P(d|D, t) = (Dt)^d exp(−Dt)/d!

If the prior probability distribution for U and D is P(U, D), their posterior probability distribution will be

  • P(U, D|u, d, t) = D^d U^u exp(−(D + U)t) P(U, D)/(a normalization constant).

Then, you sort comments according to a functional of the posterior pdf of U and D; in analogy with expected utility maximization you could use the posterior expectation value of some function f(U, D), but other choices would be possible. (This reduces to your proposal when you take f(U, D) = U/(U + D).)
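
A minimal sketch of that posterior-expectation step, assuming independent Gamma priors on U and D (the comment leaves P(U, D) unspecified, so the Gamma choice and all parameter values here are assumptions) and estimating E[U/(U + D)] by Monte Carlo:

  import numpy as np

  rng = np.random.default_rng(0)

  def expected_fraction(u, d, t, alpha=1.0, beta=1.0, samples=100_000):
      """Monte Carlo estimate of E[U/(U + D)] after observing u upvotes and
      d downvotes in time t, with independent Gamma(alpha, beta) priors on
      the Poisson rates U and D (so the posteriors are again Gamma)."""
      U = rng.gamma(alpha + u, 1.0 / (beta + t), samples)  # numpy's gamma takes scale = 1/rate
      D = rng.gamma(alpha + d, 1.0 / (beta + t), samples)
      return np.mean(U / (U + D))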

Of course this model isn't entirely realistic because U and D ought to vary with time (according to timezone, how old the thread is and whether it's currently linked to from the main page, etc.), but the main effect of disregarding this (pretending that a comment has the same probability of getting upvoted in the 10,000th hour after its publication as in the 1st hour) would be to cause very recent comments to be sorted higher, which IMO is a Good Thing anyway.

Replies from: EHeller, Meni_Rosenfeld
comment by EHeller · 2013-01-03T04:45:28.067Z · LW(p) · GW(p)

Why a Poisson distribution? It seems fairly clear we are looking at Bernoulli trials (people who look either upvote, or not). I doubt it's a rare enough event (though it depends on the site, I suppose) that a Poisson is a better approximation than a normal.

Replies from: Meni_Rosenfeld, army1987
comment by Meni_Rosenfeld · 2013-01-03T09:58:23.312Z · LW(p) · GW(p)

I think it's reasonable to model this as a Poisson process. There are many people who could in theory vote; only a few of them do, at random times.

comment by A1987dM (army1987) · 2013-01-03T11:42:36.236Z · LW(p) · GW(p)

You'd need to know how many people read each comment, though.

comment by Meni_Rosenfeld · 2013-01-03T10:02:34.973Z · LW(p) · GW(p)

I think some factor for decreasing votes over time should be included. Exponentially decaying rates seem reasonable, and the decay time constant can be calibrated with the overall data in the domain (assuming we have data on voting times available).

Replies from: army1987
comment by A1987dM (army1987) · 2013-01-03T11:52:12.289Z · LW(p) · GW(p)

Exponentially decaying rates seem reasonable

That's likely way too fast. It's not that rare for people to comment on posts several years old (especially on Main), and I'd guess such people also vote comments. (Well, I most certainly do.) You can use an exponential decay with a very large time constant, but that would mean that comments from yesterday are voted nearly as often as comments from three months ago. So, the increase in realism compared to a constant rate isn't large enough to justify the increase in complexity. (OTOH, hyperbolic decay is likely much more realistic, but it also has more parameters.)

comment by GuySrinivasan · 2013-01-02T18:20:27.235Z · LW(p) · GW(p)

Some users' votes are not independent of the current net vote count. http://lesswrong.com/lw/7x/voting_etiquette/5ic

Replies from: army1987, Meni_Rosenfeld
comment by A1987dM (army1987) · 2013-01-02T23:28:00.742Z · LW(p) · GW(p)

Guilty as charged.

Showing upvotes and downvotes separately would stop me (and, probably, other people) from doing that, but I'm kind of getting tired of beating this dead horse.

comment by Meni_Rosenfeld · 2013-01-02T19:30:56.951Z · LW(p) · GW(p)

True. This is a problem since the current net vote count is mutable, while an individual vote, once cast, is not. You could try fitting a much more complicated model that can reproduce this behavior, calibrate it with A/B testing, etc. Or maybe try to prevent it by sorting according to quality, but not actually displaying the metrics.

comment by TrE · 2013-01-02T09:53:30.365Z · LW(p) · GW(p)

Specifically, on a continuous quality scale from 0 to 1, with a uniform prior density on this interval, k upvotes and n downvotes give the (unnormalized) posterior measure p^k (1 − p)^n.

A Gaussian-like prior might be more suited here, though.

Knowing the actual probability distribution and not just the average can be useful if, for some reason, you're not interested in the comments with the best average, but in those which are least or most controversial.

Replies from: Meni_Rosenfeld
comment by Meni_Rosenfeld · 2013-01-02T10:21:41.609Z · LW(p) · GW(p)

The beta distribution is a conjugate prior for Bernoulli trials, so if you start with such a prior, the posterior is also a beta distribution, which greatly simplifies the calculations. It also converges to a normal distribution for large alpha and beta, and in any case can be fit to any mean and variance, so it's a good choice.

Whatever your target function is, you'll want the item with the greatest posterior mean of that target. To do this in general you'll need the posterior distribution of p rather than just the mean of p. But the distribution only describes what you know about p; it doesn't itself encode properties such as "controversial".
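
For instance, a hypothetical "most controversial" target could be scored from the full beta posterior rather than from its mean; a sketch, with an arbitrary band (0.4, 0.6) standing in for "controversial":

  from scipy.stats import beta

  def controversy_score(k, n, a, b, lo=0.4, hi=0.6):
      """Posterior probability that the true upvote probability p lies in a
      middle band, under a Beta(a, b) prior with k of n positive votes."""
      post = beta(a + k, b + (n - k))
      return post.cdf(hi) - post.cdf(lo)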