Boo votes, Yay NPS

post by Gordon Seidoh Worley (gworley) · 2019-05-14T19:07:52.432Z · LW · GW · 19 comments

Contents

  TL;DR
  Motivation
  Boo Votes
  Yay NPS
19 comments

TL;DR

Many votes on LW are "boos" and "yays", and consequently they aren't very useful for determining what is worth reading. A modified version of a Net Promoter Score (NPS) on each post may provide a better metric for determining read worthiness.

Motivation

It's come up a couple of times in my recent [LW · GW] comments [EA(p) · GW(p)] that I've expressed a theory that votes on LW, AF, and EAF are "boos" and "yays". I have an idea about how we could do better, assuming the purpose of votes is not to jeer and cheer but to provide information about the post, specifically how much the post is worth reading. So I'm finally writing it up so others can, yes, boo or applaud my effort, but more importantly so we might discuss ways to improve the system. If you don't like my proposal but agree we could do better than votes, I encourage you to write up your ideas and share them.

So, there are many things votes could be for, but I view votes as a solution to a problem, so what's the problem votes are trying to solve? The number one question I want answered about every post is some version of "should I read this?". There are subtly different ways to phrase this question: "is this worth engaging with?", "should I read this carefully or just skim it?", "is this worth my time and energy?", etc.

I want a solution to this problem because when I come to LW/AF/EAF every day I want a reliable signal about what is worth spending my energy engaging with (I generally don't want to just read, but also comment, discuss, understand, grow). Right now votes don't provide this to me, as I'll explain below, but they do provide other things. So keep in mind that my goal in this proposal is primarily to solve the particular problem of "should I read this?" and not the many other problems votes might be solutions to, like "how to deliver simple positive/negative feedback?", "how can I express my pleasure or displeasure with a post?", "how do we determine status within the forum?", or "how do we increase platform engagement?". I don't ignore these other purposes, but I take them as secondary, and maybe there are other purposes I forgot to list and so forgot to take into account! The point is that I'm making a proposal that tries to solve a particular problem, and if you complain "but wait, it doesn't solve this other problem" my response will be "yep, sure doesn't", so any discussion of this sort should be sure to explain why we should care about this other thing.

Okay, all that out of the way, let's talk about votes, and then NPS.

Boo Votes

Up/down voting is very simple and has a long history on LW, thanks to its presence on Reddit (from which, if I recall correctly, the original forum's codebase was forked). It has a number of nice features, and LW has made them nicer:

And of course lots of popular forums of all sorts use votes: Facebook, Twitter, Reddit, Tumblr. Even when votes aren't present, something like voting usually is, in the form of "reacts", where a person can choose from a list of named images/sounds/etc. to express something, and that something generally can include a simple vote (usually using a universally recognized vote react, like thumbs up/down); cf. Slack, Discord, most massively multiplayer games, Twitch. So it would seem that people like votes a lot and they are used to some effect in lots of places.

Unfortunately for our purposes of trying to figure out "should I read this?", most of what votes are doing is only indirectly engaged with this question. Votes, especially if we think of them as a degenerate case of reacts, are more used to express an opinion on the content than to determine whether or not the content is worth reading, and when there are two voting options they tend to be rounded off to down = boo and up = yay. If you have any doubts about this, just spend more time on social media and let me know if you still disagree in general, i.e. you disagree that most people do this, not that you don't do this or your small group of friends don't do this.

On that point of using votes for something else, it's tempting to think "hey, this is LW; we're rational AF; we know better than to use votes as boos and yays". To which I say "please, tell us more about how you've managed to create a community of perfectly rational agents".

Joking aside, my point is that I've been on the receiving end of all kinds of voting patterns, so I've gotten a chance to see how people use votes on LW. Further, I've talked to people about my posts (either in comments or elsewhere) and in some cases explicitly learned how they voted on my posts and why, and it's led me to a few conclusions about how people use votes here.

The result is what I consider a lot of voting anomalies from the perspective of trying to answer the question "should I read this?". Some claims of things I've seen (I won't link specific posts because I don't want to risk applying shame to anyone for what happened to their post in the votes, and also it's a lot of work to dig up all the examples that caused me to form these beliefs):

My personal experience is mainly with writing heretical [LW · GW] posts [LW · GW] of good quality such that I get more up votes than down but also a lot of down votes (maybe 1/3 down and 2/3 up), and it caused me to pay more attention to voting patterns, engage more with low score posts, and try to figure out just what was going on when posts I upvoted got low scores. What I learned led me to surmise what I've presented above.

So votes seem to be largely used to signal approval and disapproval of posts, which I suggest is only weakly correlated with telling me whether or not I should read a post. As a result I basically ignore votes and have to skim everything to figure out where the good stuff is. But what if we could do something better...?

Yay NPS

Net Promoter Score (NPS) is a simple metric many companies use to evaluate customer satisfaction. To calculate it, people are asked "how likely are you to recommend our product or service to a friend or colleague?" and asked for a number from 0 to 10, 0 meaning "not likely at all" and 10 meaning "already have". I really like NPS because it asks people to imagine recommending something and then asks them for something like a probability of how likely they are to do it, although I've never seen a version that did this explicitly.

Responses are then converted into a score by first segmenting respondents into detractors, passives, and promoters, and then taking percent promoters minus percent detractors. I find this metric to be of limited value and prefer to engage directly with the full distribution of responses, but if you really need a single scalar this is one way to get it.
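For concreteness, here is a minimal sketch of that calculation in Python. It assumes the conventional NPS cutoffs (0 to 6 is a detractor, 7 to 8 a passive, 9 to 10 a promoter), which I haven't spelled out above, and the function names are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def nps_segment(score: int) -> str:
    """Classify one 0-10 response using the conventional NPS cutoffs."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 9:
        return "promoter"
    if score >= 7:
        return "passive"
    return "detractor"


def net_promoter_score(responses: Sequence[int]) -> float:
    """Percent promoters minus percent detractors, from -100 to 100."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(nps_segment(r) for r in responses)
    return 100.0 * (counts["promoter"] - counts["detractor"]) / len(responses)


# Illustrative data: 5 promoters, 2 passives, 3 detractors out of 10.
example = [10, 9, 9, 8, 7, 6, 3, 10, 0, 9]
print(net_promoter_score(example))  # 50% promoters - 30% detractors = 20.0
```

Note how much gets thrown away: the passives don't count at all, and a 0 and a 6 are treated the same, which is part of why I'd rather look at the whole distribution.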

What I imagine doing is asking people to score posts like this:

How likely are you to recommend a friend or colleague read this post?

|--0%-------------50%-------------100%--|

So they are asked the question and given a slider to mark their likelihood, which includes 100% because they may have already shared it (but there's probably some UI work here to make it clear that 100% and 99% are drastically different responses).

Does this answer our question "should I read this?"? I think it may do a better job than votes, to be sure. Rather than an ambiguous vote, people are now at least being asked to respond directly to a question and give their response to it. Also, we could better use the distribution of responses to make reading decisions. For example, heretical posts might get bimodal distributions of scores, with clusters of strong detractors and strong promoters, and maybe you choose to read a post when it has at least n promoters, regardless of detractors. Maybe you choose to filter out posts with more than n detractors because you don't like controversy or low quality content. Maybe you filter on NPS or mean or median or something else, or sort based on it. And for every post, rather than showing a simple number for its score like we do now, you could show a box plot or some other suitable visualization of the distribution of responses.
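To make those reading rules a bit more concrete, here is a rough sketch in Python. Everything specific in it is an assumption for illustration: the cutoffs (90% and up counts as a promoter, 60% and down as a detractor, a loose analog of the usual NPS segments), the names `PostSummary`, `summarize`, and `should_read`, and the example numbers.

```python
from dataclasses import dataclass
from statistics import mean, median
from typing import Optional, Sequence

# Assumed cutoffs on the 0-100% slider; not part of the proposal itself.
PROMOTER_MIN = 90.0
DETRACTOR_MAX = 60.0


@dataclass(frozen=True)
class PostSummary:
    promoters: int
    detractors: int
    mean_score: float
    median_score: float
    nps: float


def summarize(responses: Sequence[float]) -> PostSummary:
    """Reduce a post's slider responses to a few numbers a reader could filter on."""
    if not responses:
        raise ValueError("need at least one response")
    if any(not 0 <= r <= 100 for r in responses):
        raise ValueError("responses must be percentages between 0 and 100")
    promoters = sum(1 for r in responses if r >= PROMOTER_MIN)
    detractors = sum(1 for r in responses if r <= DETRACTOR_MAX)
    nps = 100.0 * (promoters - detractors) / len(responses)
    return PostSummary(promoters, detractors, mean(responses), median(responses), nps)


def should_read(
    summary: PostSummary,
    min_promoters: int,
    max_detractors: Optional[int] = None,
) -> bool:
    """Require at least min_promoters; optionally reject posts with too many detractors."""
    if summary.promoters < min_promoters:
        return False
    if max_detractors is not None and summary.detractors > max_detractors:
        return False
    return True


# A made-up "heretical" post: four strong detractors, four strong promoters.
heretical = summarize([5, 10, 0, 95, 100, 90, 100, 15])
print(heretical.nps)                                                  # 0.0
print(should_read(heretical, min_promoters=3))                        # True
print(should_read(heretical, min_promoters=3, max_detractors=2))      # False
```

The bimodal example is the interesting case: its NPS is 0 and its mean and median sit near the middle of the slider, so any single scalar makes it look like a "meh" post, while a reader who only cares about the number of promoters would still pick it up.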

Now unfortunately NPS is more complicated than votes, so it may work against other problems people are trying to solve with votes. How does NPS help us deal with the problems addressed by karma? How do we prevent NPS from devolving into a binary where people always vote 100% to upvote and everything else is a downvote (the eBay/Uber/Lyft voting problem, where anything less than 5-stars is considered a downvote)? And do we measure comment quality with NPS, or keep votes there, or do something else?

I also don't really expect the LW team to drop everything and implement NPS. Heck, if I were working on LW I probably wouldn't jump all over this. My goal in writing this, maybe more than anything, is to get us thinking about how to better answer the question "should I read this?" and I wanted to provide at least one solution I've thought of and think could be better in some ways. I mostly think we could do more to give better signals of quality on LW and make them less distorted by and engaged with other signals people try to send with votes.

So, what do you think of the current state of votes? What problems do you want to solve on LW that votes or something else may be solutions to? And how would you improve votes or something else to solve those problems?

19 comments

Comments sorted by top scores.

comment by jimrandomh · 2019-05-14T21:24:14.139Z · LW(p) · GW(p)

(I'm a member of the LW team, but this is an area where we still have a lot of uncertainty, so we don't necessarily agree internally and our thinking is likely to change.)

There are three proposed changes being bundled together here: (1) The guidance given about how to vote; (2) the granularity of the votes elicited; and (3) how votes are aggregated and presented to readers.

As you correctly observe, votes are serving multiple purposes: it gives information to other readers about what's worth their time to read, it gives readers information about what other people are reading, and it gives authors feedback about whether they did a good job. Sometimes these come apart; for example, if someone helpfully clears up a confusion that only one person had, then their comment should receive positive feedback, but isn't worth reading for most people.

These things are, in practice, pretty tightly correlated, especially when judged by voters who are only spending a little bit of time on each vote. And that seems like the root issue: disentangling "how I feel about this post" from "is this post worth reading" requires more time and distance than is currently going into voting. One idea I'm considering is retrospective voting: periodically show people a list of things they've read in the past (say, the past week), and ask them to rate those things then. This would be less noisy, because it elicits comparisons rather than ups/downs in isolation, and it might also change people's votes in a good way by giving them some distance.

As for switching from the current up/down/super-up/super-down to 0-100% range voting, it seems like the main effect is that it creates a distinction between implicit and explicit neutral votes. That is, currently if people feel something is meh, they don't vote, but in the proposed system they would instead give it a middling score. The advantage of this is that you can aggregate scores in a way that measures quality, without being as conflated with attention; right now if a post/comment has been read more times, it gets more votes, and we don't have a good way of distinguishing this from a post/comment with fewer reads but more votes per reader.

But I'm skeptical of whether people will actually cast explicit neutral votes, in most cases; that would require them to break out of skimming, slow down, and make a lot more explicit decisions than they currently do. A more promising direction might be to collect more granular data on scroll positions and timings, so that we can estimate the number of people who read a comment and skimmed a comment without voting, and use that as an input into scoring.

The third thing is aggregation--how we convert a set of votes into a sort-order to guide readers to the best stuff--which is an aspect of the current system I'm currently least satisfied with. That includes things like karma-weighting of votes, and also the handling of polarizing posts. In the long term, I'm hoping to generate a dataset of pairwise comparisons by trusted users, which we can use as a ground truth to test algorithms against. But polarizing posts will always be difficult to score, because the votes reflect an underlying disagreement between humans and the answer to whether a post should be shown may depend on things the voters haven't evaluated, like the truth of the post's claims.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-14T23:35:58.105Z · LW(p) · GW(p)
But I'm skeptical of whether people will actually cast explicit neutral votes, in most cases; that would require them to break out of skimming, slow down, and make a lot more explicit decisions than they currently do. A more promising direction might be to collect more granular data on scroll positions and timings, so that we can estimate the number of people who read a comment and skimmed a comment without voting, and use that as an input into scoring.

This is very much a problem in collecting NPS data in its original context, too: you get lots of data from upset customers and happy customers, while meh customers stay silent. You can do some interpolation about what missing votes mean, and coupled with scrolling behavior you could get some sense of read count that you could use to make adjustments, but that obviously makes things a bit more complicated.

comment by Said Achmiz (SaidAchmiz) · 2019-05-15T16:19:19.193Z · LW(p) · GW(p)

My personal experience is mainly with writing heretical posts of good quality such that I get more up votes than down but also a lot of down votes (maybe 1⁄3 down and 2⁄3 up)

How the heck can you tell what the upvote/downvote breakdown is…? I’ve never seen anything about this in the UI.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-15T18:36:04.358Z · LW(p) · GW(p)

Well, I do say "maybe"; this is a guess based on how the score evolved over time and how the total score compares to the number of votes.

Replies from: SaidAchmiz
comment by Said Achmiz (SaidAchmiz) · 2019-05-15T18:41:29.274Z · LW(p) · GW(p)

It seems to me that variability in vote weights adds so much noise here that you can’t make anything resembling the kind of inference that you’d like to make…

Replies from: Raemon
comment by Raemon · 2019-05-15T18:48:09.945Z · LW(p) · GW(p)

I think this is particularly hard with posts, after the first few votes (esp. once someone has strong upvoted it). But I think it's easier with comments if you're paying attention, esp. if you see that your karma is fairly low but has a largish number of votes.

Replies from: Dagon
comment by Dagon · 2019-05-17T08:08:32.882Z · LW(p) · GW(p)

Yeah, I comment far more than I post, and I have a general idea that if I get no downvotes, it means I'm optimizing too much for approval rather than exploration of non-obvious ideas. The vote-delta feature makes it a little easier to see the downvotes (I've disabled the hiding of negatives), but mostly I have to look and figure out if the average votes-per-voter is particularly low.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-17T17:42:58.852Z · LW(p) · GW(p)

An interesting feature would be to show the ratio, rather than only the vote count, in addition to the score.

comment by Gordon Seidoh Worley (gworley) · 2019-05-29T18:00:18.976Z · LW(p) · GW(p)

Additional evidence of boo/yay voting culture: Ben Hoffman's "Drowning children are rare" post (on EAF [EA · GW], on LW [LW · GW], my comment about votes [EA(p) · GW(p)])

Replies from: Wei_Dai
comment by Wei Dai (Wei_Dai) · 2019-05-29T23:53:54.065Z · LW(p) · GW(p)

It could also be (partly) caused by inferential distance, i.e. the LW audience can understand Ben's ideas better because we've read more of his recent posts. ETA: or we're not as familiar with existing counterarguments, which the EAF audience might expect the post to better address.

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-30T01:46:20.635Z · LW(p) · GW(p)

I agree there are other possible interpretations; mainly wanted to document for myself in case I wanted to reference it later, and it seems potentially relevant, especially if we wanted to go back and interview the voters or analyze the comments.

comment by Dagon · 2019-05-15T21:43:30.071Z · LW(p) · GW(p)

I'm not sure it makes sense to put much effort into this kind of gamification. Karma, voting, reactions, etc. are all cost-free, and therefore very weak signals.

It means different things to different people, and it's so cheap that it's hard to imagine that a change in text or mechanism will radically change its use (though tweaks may be able to moderately reframe things).

As long as voting has the general effect of increasing a sense of involvement, even for those who don't post/comment, it's probably worth keeping.

comment by habryka (habryka4) · 2019-05-14T22:59:30.377Z · LW(p) · GW(p)

A related concept to all of this is the idea of adding more react-like functionality to comments. See my thoughts here: https://www.lesswrong.com/posts/EQJfdqSaMcJyR5k73/habryka-s-shortform-feed#HefgGrp3nMwf5gKkd [LW(p) · GW(p)]

Do you have a sense of how much this would solve the problems you are worried about?

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-14T23:18:04.818Z · LW(p) · GW(p)

I'm not sure how much it would help, but that's mainly because I am both not troubled by voting up things I disagree with or voting down things I disagree with and because I have a history of leaving constructive feedback comments, especially in cases where I feel like a comment/post is being treated overly harshly or where the author seems unfamiliar with community norms. I can imagine that others who are less willing to do that might be more willing to leave reacts that at least convey some of that information.

Replies from: Raemon
comment by Raemon · 2019-05-14T23:27:12.185Z · LW(p) · GW(p)

I think whatever plan addresses the situation definitely needs to address that most people are much more lazy than you in this regard.

comment by mako yass (MakoYass) · 2019-05-15T03:52:21.961Z · LW(p) · GW(p)
it lets you express yourself in two ways (unlike on Twitter where the only option is to vote up something, and a "downvote" requires writing your own tweet expressing dislike)

Weird oversight in not observing that twitter's retweets not only represent the question "would you recommend this to a friend", but are also guaranteed to yield a truthful answer, because retweeting is an act of recommendation to friends, for which the user is then held accountable.

One of the other things I like about resharing is that the resultant salience is completely subjective. Users can share space even if they're looking for different things. There can be a curator for any sense of quality.

Replies from: mr-hire
comment by Matt Goldenberg (mr-hire) · 2019-05-17T08:42:42.564Z · LW(p) · GW(p)
would you recommend this to a friend", but are also guaranteed to yield a truthful answer, because retweeting is an act of recommendation to friends, for which the user is then held accountable.

Note that even in this relatively straightforward case, the meaning of retweets can become conflated, from "I would recommend this to a friend" to "I agree with this". I sometimes have to be careful about what I retweet because of this confusion, and things that I would otherwise recommend people read I hesitate to retweet because I don't want people thinking I agree.

comment by knite · 2019-05-14T22:56:36.272Z · LW(p) · GW(p)

What are your thoughts on *re-labeling* the voting UI so that it's more clear to voters what the site norms are for the meaning of up- and downvotes?

Replies from: gworley
comment by Gordon Seidoh Worley (gworley) · 2019-05-14T23:22:13.207Z · LW(p) · GW(p)

Maybe that's useful, but we'd have to figure out what votes are supposed to mean in the first place, i.e. I'm not sure there is a well defined notion of what votes are for now such that we could change the UI to encourage using them in the expected manner.

I have an idea of what problems I'd like solved, and voting is one way to solve some of those problems. In that context we might have some sense of how we would like to ask users to use voting, but that only makes sense in that context. On its own voting is just incrementing/decrementing counters in the database, and that counter is used to inform some algorithms about the order in which content is displayed on the site; we have to decide what that means to us and what we would like it to do beyond what it naturally does on its own, such that providing instruction and shaping the UI to encourage particular behaviors is meaningful.

So that's a long way to say yes, but conditional on having explicit norms.