Should you write longer comments? (Statistical analysis of the relationship between comment length and ratings)

post by cleonid · 2015-07-20T14:09:17.233Z · LW · GW · Legacy · 47 comments

A few months ago we launched an experimental website. In brief, our goal is to create a platform where unrestricted freedom of speech is combined with a high quality of discussion. The problem can be approached from two directions. One is to help users navigate through content and quickly locate the higher-quality posts. The other, which is the topic of this article, is to help users improve the quality of their own posts by providing them with meaningful feedback.

One important consideration for those who want to write better comments is how much detail to leave out. Our statistical analysis shows that for many users there is a strong connection between the ratings and the size of their comments. For example, for Yvain (Scott Alexander) and Eliezer_Yudkowsky, the average number of upvotes grows almost linearly with increasing comment length.

 

 

This trend, however, does not apply to all posters. For example, for the group of top ten contributors (in the last 30 days) to LessWrong, the average number of upvotes increases only slightly with the length of the comment (see the graph below). For quite a few people the change even goes in the opposite direction – longer comments receive lower ratings.

 

 

Naturally, even if your longer comments are rated higher than the short ones, this does not mean that inflating comments would always produce positive results. For most users (including popular writers, such as Yvain and Eliezer), the average number of downvotes increases with increasing comment length. The data also show that the long comments that get the most upvotes are generally distinct from the long comments that get the most downvotes. In other words, long comments are fine as long as they are interesting, but they are penalized more when they are not.

 

 

The rating patterns vary significantly from person to person. For some posters, the average number of upvotes remains flat until the comment length reaches some threshold and then starts declining with increasing comment length. For others, the optimal comment length may be somewhere in the middle. (Users who have accounts on both LessWrong and Omnilibrium can check the optimal length for their own comments on both websites by using this link.)

Obviously, length is just one of many factors that affect comment quality, and for most users it explains no more than 20% of the variation in their ratings. We have a few other ideas on how to provide people with meaningful feedback on both the style and the content of their posts. But before implementing them, we would like to get your opinions. Would such feedback actually be useful to you?
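For readers who want to check a "share of variation explained" figure on their own comments, here is a minimal sketch of the usual coefficient of determination (R²) for a straight-line fit of rating on length. It is not the code used for this analysis; the function name and the sample numbers are made up for illustration.

```python
from statistics import mean

def r_squared(lengths, scores):
    """Share of the variance in scores explained by a straight-line fit on length."""
    mx, my = mean(lengths), mean(scores)
    sxx = sum((x - mx) ** 2 for x in lengths)
    syy = sum((y - my) ** 2 for y in scores)
    sxy = sum((x - mx) * (y - my) for x, y in zip(lengths, scores))
    if sxx == 0 or syy == 0:
        return 0.0  # no spread in one variable, so nothing to explain
    # For simple linear regression, R^2 equals the squared correlation.
    return (sxy * sxy) / (sxx * syy)

# Hypothetical comments: lengths in characters and net ratings.
example_lengths = [1, 2, 3, 4]
example_scores = [1, 3, 2, 4]
example_r2 = r_squared(example_lengths, example_scores)
```

A value below 0.2 on real data would mean that at least 80% of the rating variance is left to other factors.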

47 comments

Comments sorted by top scores.

comment by gjm · 2015-07-20T17:04:58.278Z · LW(p) · GW(p)

In the "top 10" aggregate, you are at risk of the following Simpsonian problem: you have two posters A and B; one writes longer comments than the other and also happens to be cleverer / more interesting / funnier / better at appealing to the prejudices of the LW crowd. So in the whole group there is a positive correlation between length and quality, but actually everyone likes A's shorter comments better and everyone likes B's shorter comments better. (Or, of course, likewise but with "longer" and "shorter" switched.)
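A toy illustration of this scenario, with made-up numbers: two hypothetical posters whose own shorter comments score higher, while the pooled data show length and score rising together. The `slope` helper and the data are assumptions for the sketch, not taken from the LW dataset.

```python
from statistics import mean

def slope(points):
    """Least-squares slope of score on length for a list of (length, score) pairs."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in points)
    return sxy / sxx

# Poster A writes short comments, poster B writes long ones and is more popular overall.
poster_a = [(100, 3), (200, 2), (300, 1)]
poster_b = [(800, 10), (900, 9), (1000, 8)]

slope_a = slope(poster_a)                  # negative: A's shorter comments do better
slope_b = slope(poster_b)                  # negative: B's shorter comments do better
slope_pooled = slope(poster_a + poster_b)  # positive: the aggregate trend flips
```

Checking each poster separately is the simplest guard against this reversal.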

Replies from: cleonid
comment by cleonid · 2015-07-20T22:15:41.704Z · LW(p) · GW(p)

It’s an interesting possibility. But I have looked at the data and for all ten users the comments above 1000 characters get higher average ratings than shorter comments.

Replies from: gjm
comment by gjm · 2015-07-20T22:17:14.680Z · LW(p) · GW(p)

Aha, excellent.

comment by John_Maxwell (John_Maxwell_IV) · 2015-07-21T04:34:43.469Z · LW(p) · GW(p)

One story is that people are more willing to read long comments from Yvain or Eliezer because they have reputations for being insightful.

Replies from: David_Bolin
comment by David_Bolin · 2015-07-21T07:46:26.629Z · LW(p) · GW(p)

I think this is probably true, and I have seen cases where e.g. Eliezer is highly upvoted for a certain comment, and some other person little or not at all for basically the same insight in a different case.

However, it also seems to me that their long comments do tend to be especially insightful in fact.

comment by Lumifer · 2015-07-20T16:05:46.243Z · LW(p) · GW(p)

I think these plots would be much improved by adding error bars. In particular, I suspect that the number of short posts is greater than the number of long posts and so the average-karma estimates for long posts are more uncertain.

Also, did you bucketize the word counts? What do specific points on your plots correspond to?
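To make the point about uncertainty concrete, here is a minimal sketch of how an error bar for one bucket could be computed as the standard error of the mean. It assumes the ratings within a bucket are independent, and the function name and sample scores are invented for illustration.

```python
from math import sqrt
from statistics import mean, stdev

def mean_with_stderr(scores):
    """Return (mean, standard error of the mean) for one bucket of scores."""
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores to estimate the spread")
    return mean(scores), stdev(scores) / sqrt(n)

# Made-up karma scores: a crowded short-comment bucket and a sparse long-comment one.
short_bucket = [1, 2, 3, 4, 5] * 20   # 100 comments
long_bucket = [1, 2, 3, 4, 5]         # 5 comments with the same mean

short_mean, short_err = mean_with_stderr(short_bucket)
long_mean, long_err = mean_with_stderr(long_bucket)
```

Both buckets have the same mean, but the sparse one carries a much wider error bar.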

Replies from: cleonid
comment by cleonid · 2015-07-20T22:32:17.468Z · LW(p) · GW(p)

Each point on the graph corresponds to an average of several hundred (about two thousand for the middle graph) data points. The number of short posts is indeed greater than the number of long posts, so the horizontal distance between the points on the graph increases with increasing number of characters.
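One way to get points like these is equal-count bucketing: sort the comments by length and average each consecutive chunk. The sketch below is an assumption about the general approach, not the code behind the graphs, and the names and data are hypothetical.

```python
from statistics import mean

def equal_count_buckets(comments, n_buckets):
    """Turn (length, score) pairs into one (mean_length, mean_score) point per bucket.

    Buckets hold (nearly) the same number of comments, so sparse regions
    of the length axis produce points that sit further apart.
    """
    if not 1 <= n_buckets <= len(comments):
        raise ValueError("n_buckets must be between 1 and the number of comments")
    ordered = sorted(comments)
    n = len(ordered)
    points = []
    for i in range(n_buckets):
        chunk = ordered[i * n // n_buckets:(i + 1) * n // n_buckets]
        points.append((mean([x for x, _ in chunk]), mean([y for _, y in chunk])))
    return points

# Made-up data: many short comments, few long ones.
sample = [(10, 1), (20, 1), (30, 2), (40, 2), (500, 5), (3000, 9)]
bucket_points = equal_count_buckets(sample, 3)
```

With this sample the first two points are 20 characters apart and the last two are 1715 apart, which matches the widening spacing described above.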

Replies from: Lumifer
comment by Lumifer · 2015-07-21T01:30:12.169Z · LW(p) · GW(p)

Any particular reason you did a plot this way instead of having a cloud of points and drawing some kind of regression line or curve through? You are unnecessarily losing information by aggregating into buckets.

Replies from: cleonid
comment by cleonid · 2015-07-21T11:43:00.411Z · LW(p) · GW(p)

True, but it is virtually impossible to see a meaningful pattern when you have thousands of data points on the graph and R² < 0.2.

Replies from: Douglas_Knight, Lumifer
comment by Douglas_Knight · 2015-07-22T05:28:45.244Z · LW(p) · GW(p)

I disagree. I find point clouds useful, as long as they are not pure black. Kernel density plots are better, though.

But Lumifer gave you a concrete suggestion: plot a regression curve, not a bunch of buckets. Bucketing and drawing lines between points are kinds of smoothing, so you should instead use a good smoother. Say, loess. Just use ggplot and trust its defaults (which are not loess with this many points).
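As a rough idea of what a loess-style smoother does at a single point, here is a simplified sketch in plain Python: a weighted straight-line fit over the nearest share of the data, with tricube weights. It omits the robustness iterations of full loess and is not ggplot's implementation; the function name and the `frac` parameter are assumptions.

```python
def local_linear_fit(xs, ys, x0, frac=0.5):
    """Estimate y at x0 from a weighted straight-line fit to nearby points."""
    n = len(xs)
    k = min(n, max(2, round(frac * n)))
    # The distance to the k-th nearest point sets the local bandwidth.
    h = sorted(abs(x - x0) for x in xs)[k - 1]
    # Tricube weights: close to 1 near x0, falling to 0 at distance h.
    w = [max(0.0, 1 - (abs(x - x0) / h) ** 3) ** 3 if h > 0 else 0.0 for x in xs]
    if sum(w) == 0:
        # Degenerate neighbourhood: weight everything within h equally.
        w = [1.0 if abs(x - x0) <= h else 0.0 for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:
        return my  # all weight sits on a single x value
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    return my + (sxy / sxx) * (x0 - mx)

# Sanity check on exactly linear made-up data: the fit should recover the line.
line_xs = list(range(10))
line_ys = [2 * x + 1 for x in line_xs]
fitted = local_linear_fit(line_xs, line_ys, 4.5)
```

Evaluating this on a grid of lengths traces out the smooth curve, using every comment instead of bucket averages.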

comment by Lumifer · 2015-07-21T16:30:10.527Z · LW(p) · GW(p)

Well, one question is if it's "impossible to see a meaningful pattern", should you melt-and-recast the data so that the pattern appears X-/

Another observation is that you are constrained by Excel. R can deal with such problems easily -- do you have the raw dataset available somewhere?

comment by IlyaShpitser · 2015-07-20T19:20:00.657Z · LW(p) · GW(p)

You cannot use observed dependences in the data to suggest decision changes because p(y | x) is not in general equal to p(y | do(x)).

Replies from: Andy_McKenzie
comment by Andy_McKenzie · 2015-07-20T19:41:39.721Z · LW(p) · GW(p)

What should cleonid do instead (if anything)? And even if something is not true in general, could it still be used as an approximation?

Replies from: IlyaShpitser
comment by IlyaShpitser · 2015-07-20T20:28:40.256Z · LW(p) · GW(p)

Run a trial, or barring that, correct for obvious confounders that affect both post length and post quality (I am sure we can both think of a few right now).
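For reference, the gap between the two quantities and the standard correction can be written out as follows. Here z stands for a set of measured confounders (a hypothetical choice would be author identity or topic), and the second identity holds only if z satisfies the back-door criterion for the effect of length x on rating y.

```latex
p(y \mid x) = \sum_z p(y \mid x, z)\, p(z \mid x),
\qquad
p(y \mid \mathrm{do}(x)) = \sum_z p(y \mid x, z)\, p(z)
```

The two agree when z is independent of x, which is exactly what randomizing comment length in a trial would guarantee.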

Replies from: ChristianKl
comment by ChristianKl · 2015-07-20T20:42:49.236Z · LW(p) · GW(p)

Run a trial, or barring that, correct for obvious confounders that affect both post length and post quality (I am sure we can both think of a few right now).

Why not be more specific about what confounders should be included or what a trial should look like?

Replies from: IlyaShpitser
comment by IlyaShpitser · 2015-07-20T21:09:12.246Z · LW(p) · GW(p)

Are you going to pay me to seriously look into this?

comment by eternal_neophyte · 2015-07-20T18:19:35.813Z · LW(p) · GW(p)

Upvotes are in my opinion a poor metric to measure the quality of a post. You're confusing information on how insightful, thoughtful or useful your writing is with information on how pleasing it is to the upvoter, whether by providing social confirmation of their beliefs or by entertaining them for other reasons.

A much more useful way to measure the quality of your own writing is to look at how interesting or thoughtful the replies you get are: this shows that people find your ideas worth engaging with. This is a subjective assessment however that can't be captured by the real line.

Replies from: jsteinhardt
comment by jsteinhardt · 2015-07-21T05:19:15.434Z · LW(p) · GW(p)

I think some of my most-researched comments / posts have gotten relatively few replies. The more thorough you are, the less room there is for people to disagree without putting a decent amount of thought in. On the other hand, if you dash out a post without much fact-checking, you'll probably get lots of replies :).

Replies from: eternal_neophyte
comment by eternal_neophyte · 2015-07-21T12:41:22.750Z · LW(p) · GW(p)

If your comments are that watertight perhaps you should spin them into articles?

Replies from: jsteinhardt
comment by jsteinhardt · 2015-07-21T14:51:42.881Z · LW(p) · GW(p)

Yeah I guess I'm really talking about posts (i.e. articles) more than comments.

comment by Gurkenglas · 2015-07-20T14:42:36.464Z · LW(p) · GW(p)

Naturally, even if your longer comments are rated higher than the short ones, this does not mean that inflating comments would always produce positive results.

Here's where I got really hopeful that you would address the part about correlation and causation, Goodhart's law, etc.

comment by cleonid · 2015-07-20T14:12:33.538Z · LW(p) · GW(p)

Would statistical feedback on the style and content of your posts be useful to you?

[pollid:1010]

Replies from: Gunnar_Zarncke, None, NancyLebovitz
comment by Gunnar_Zarncke · 2015-07-20T15:17:42.338Z · LW(p) · GW(p)

I would have preferred a weaker option like "It might be interesting, and it might conceivably help me to improve my posts".

comment by [deleted] · 2015-07-22T11:42:05.123Z · LW(p) · GW(p)

I liked it for the sheer level of awesomeness of investing work to analyse comments, and I like the reminder that really chaotic stuff can be quantified as well. But I find comments not really that important; important stuff tends to be rewritten as a post, so I treat them more as just a discussion, similar to chatting about a post in person.

However, if you want to know the usefulness, I think it depends on whether you care about upvotes. I care about them only in a negative way: I tend to recycle usernames on Reddit when they get too much karma, although I have not done it here yet.

I don't think that optimizing my upvotes would also result in optimizing the usefulness of my comments for others. If anything, the number of replies I get is a better measure: blatantly stupid stuff usually gets ignored, and if something gets answered a lot then at the very least it is wrong in an interesting way.

I mean, for example here, I think the fact that I reply to your post and survey hopefully means more to you than if I just hit the upvote button. Same story.

comment by NancyLebovitz · 2015-07-20T15:14:34.751Z · LW(p) · GW(p)

I don't know whether feedback would affect how I post. It would depend on whether the feedback made sense to me and whether it pointed in the direction of something I thought I could do and was worth doing.

comment by Lumifer · 2015-07-23T05:26:09.956Z · LW(p) · GW(p)

Since I critiqued the graphs in the OP and offered to make them better, cleonid was gracious enough to provide me with the dataset for Eliezer (EY) and Yvain (Scott Alexander) (SA) posts and invite me to play with it. For convenience, I'll split my comment into two parts -- the preliminaries and the analysis itself.

First, two caveats.

I'm looking at data for posts by EY and SA. They, being superstars, are not representative of LW members. This analysis will not tell you, gentle reader, whether you would get more karma by making your posts longer or shorter (unless, of course, you are EY or SA :-D).

Next, the data that I have lacks timestamps. Therefore I'm forced to make the assumption (almost certainly false) that nothing changed with time and treat the data as a blended uniform mass. The temporal dimension is, sadly, lacking.

A brief description of the data: we have two tables (EY and SA) which list the number of characters in a post, as well as the number of upvotes and downvotes that the post received. There are about 3100 data points for EY and about 1400 for SA.

The distribution of post lengths for EY and for SA is rather different: EY posts tend to be much shorter. This is visible in the following graph which plots the empirical distribution of post lengths (cut off at 2000 chars, but there is a long thin tail beyond it).

In numbers, the median length of an EY post is only about 200 characters -- half of his posts are less than that. In fact, a quarter of his posts are less than 87 characters. On the other side, 10% of his posts are longer than 950 characters, 5% are longer than about 1500 characters. The mode of the distribution -- the most frequent length of the post -- is around 70 characters.

SA writes more: his median post length is 450 characters, more than double EY's. The shortest quarter of the posts is below 165 characters, the longest 10% are over 1900 characters and the longest 5% are over 2700 characters. The mode for SA's posts is 180 characters, two and a half times EY's mode.
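Summary numbers of this kind can be reproduced from a list of post lengths with Python's standard `statistics` module. This is a sketch only: the raw dataset is not reproduced here, so the function name and the sample lengths are illustrative.

```python
from statistics import median, quantiles

def length_summary(lengths):
    """Quartile, median and upper-tail cut points for a list of post lengths."""
    # n=20 gives 19 cut points at 5% steps: index 4 is 25%, 17 is 90%, 18 is 95%.
    cuts = quantiles(lengths, n=20, method="inclusive")
    return {
        "p25": cuts[4],
        "median": median(lengths),
        "p90": cuts[17],
        "p95": cuts[18],
    }

# Illustrative data only: lengths 0, 1, ..., 100 characters.
summary = length_summary(list(range(101)))
```

On this evenly spaced sample each percentile equals its own label (25, 50, 90, 95), which makes the indexing easy to verify.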

Continued in post 2.

comment by Irgy · 2015-07-21T04:24:02.892Z · LW(p) · GW(p)

My prior expectation would be: a long comment from a specific user has more potential to be interesting than a short one because it has more content. But a concise commenter has more potential to write interesting comments of a given length than a verbose commenter.

So while long comments might on average be rated higher, shorter versions of the same comment may well rate higher than longer versions of the same comment would have. It seems like this result does nothing to contradict that view but in the process seems to suggest people should write longer comments. The problem is that verbosity is per-person while information content is per-comment. Also verbosity in general can't be separated from other personal traits that lead to better comments.

You could test this by having people write both long and short versions of comments that appear to different pools of readers and comparing the ratings.
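If such a trial were run, the ratings from the two reader pools could be compared with a permutation test on the difference in mean rating. The following is a minimal sketch under that assumption; the function name, the default number of shuffles and the fixed seed are all choices made for illustration.

```python
import random

def permutation_p_value(short_scores, long_scores, n_shuffles=10_000, seed=0):
    """Two-sided permutation test for the difference in mean rating."""
    def avg(values):
        return sum(values) / len(values)

    rng = random.Random(seed)  # fixed seed so the result is repeatable
    observed = abs(avg(long_scores) - avg(short_scores))
    pooled = list(short_scores) + list(long_scores)
    n_short = len(short_scores)
    hits = 0
    for _ in range(n_shuffles):
        # Reassign ratings to the two versions at random.
        rng.shuffle(pooled)
        diff = abs(avg(pooled[n_short:]) - avg(pooled[:n_short]))
        if diff >= observed:
            hits += 1
    return hits / n_shuffles
```

A small p-value would suggest that the version length itself, and not the author or topic, moved the ratings, since both pools saw the same underlying comment.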

comment by Lumifer · 2015-07-23T06:02:56.055Z · LW(p) · GW(p)

Continued from part 1.

The gist of part 2 is four graphs.

The graphs plot most of the data (except for outliers) in the following form. Each post is represented by two points with the same X coordinate: the number of characters. The Y coordinate for one point is the number of upvotes the post received, the Y coordinate of the other point is the number of downvotes for the same post. Upvotes are light green and downvotes are pink.

The upvotes and the downvotes are modeled separately by two loess (local regression) curves. The difference between two graphs for each of the posters is in the details of the fit. Specifically, one fit assumes gaussian errors and so the loess curve tends to approximate the local mean. The other fit assumes heavy-tailed errors and its loess curve tends to approximate the local median. Since the distribution of votes is skewed, the mean and the median are noticeably different.
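The gap between the two kinds of fit comes from the skew in vote counts, which a tiny made-up sample can show. The numbers below are invented and are not from the EY or SA data.

```python
from statistics import mean, median

# Hypothetical upvote counts for ten posts of similar length:
# most get a handful of votes, one gets a lot.
upvotes = [0, 0, 1, 1, 1, 2, 2, 3, 5, 25]

local_mean = mean(upvotes)      # pulled upward by the single popular post
local_median = median(upvotes)  # reflects what a typical post receives
```

Here the mean is 4 while the median is 1.5, so a mean-tracking curve would sit well above a median-tracking one over the same posts.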

Each plot has four vertical lines at four quantiles: 25%, 50%, 75%, and 95%. The lower numbers represent the loess estimate of the number of downvotes for this particular post length. The upper numbers represent the loess estimate of the number of upvotes.

We will start with the robust fit which approximates the median. Here is the plot for EY

and here is the plot for SA

As you can see, longer posts pay off though not in a particularly spectacular manner for EY -- long posts work better for SA. The downvotes also increase, but insignificantly. If we treat the loess estimate as the median, in all cases half of the posts have zero downvotes.

Since the votes are positively skewed, the means should be higher than the medians and we can see it in the second set of graphs with non-robust loess fits. EY

and SA

The overall pattern is very much the same, but the numbers are higher. Again, longer posts bring much more karma for SA, not so much but still some for EY.

comment by [deleted] · 2015-07-20T15:58:21.948Z · LW(p) · GW(p)

I would love to see mine, considering I have two very different styles of post which have different average lengths.

Replies from: cleonid
comment by cleonid · 2015-07-20T22:21:13.163Z · LW(p) · GW(p)

You can get the rating statistics of your LW comments by registering on Omnilibrium and then clicking on this link.

Replies from: Stingray
comment by Stingray · 2015-07-20T23:06:25.962Z · LW(p) · GW(p)

Admit it, this whole post is a secret ploy to get more people to register to Omnilibrium :)

Replies from: John_Maxwell_IV, ChristianKl, David_Bolin
comment by John_Maxwell (John_Maxwell_IV) · 2015-07-21T04:39:12.737Z · LW(p) · GW(p)

If cleonid is willing to do this level of analysis for us in their spare time, I say they deserve all the registrations they get.

Replies from: Stingray
comment by Stingray · 2015-07-21T15:22:10.733Z · LW(p) · GW(p)

I don't disagree :)

comment by ChristianKl · 2015-07-20T23:13:05.436Z · LW(p) · GW(p)

But he does provide the data for LW comments on that page :) The page also told me how many comments I have written on LW which LW doesn't even tell me, so it's cool ;)

comment by David_Bolin · 2015-07-21T08:04:30.590Z · LW(p) · GW(p)

I tried to register there just now but the email which is supposed to contain the link to verify my email is empty (no link). What can I do about it?

comment by ZeitPolizei · 2015-07-22T15:28:46.457Z · LW(p) · GW(p)

I would be interested to see the results of some clustering algorithm on the comment data. It may be that long comments can be classified into high karma and low karma, and we can then analyze what the differences between them are. If it is possible to extract features of high-quality posts, then those features can be the goal, instead of just the length.

I also think it's dangerous to focus too strongly on karma, because karma score is only a rough approximation of actual quality. For example, I believe many short comments that only ask for some clarification are generally more important than is reflected by their karma.

comment by Gunslinger (LessWrong1) · 2015-07-20T16:48:39.887Z · LW(p) · GW(p)

Some comments are very insightful and are not particularly long. I recently have been reading Roissy's site, and I searched a few terms at LW. I discovered the user Vladimir_M who had some very interesting comments, not only on that front. HughRistik was also someone who stood out. Both of them wrote very insightful comments. I am disappointed those people were not on the list.

My other problem with this is whether karma is a good indicator. That pretty much depends on the userbase; I guess we have another thing to test here.

comment by [deleted] · 2015-07-20T16:01:56.299Z · LW(p) · GW(p)

I think there may be a collective action problem here. Optimizing length to maximize upvotes seems like a way to encourage others' biases. Then there are prisoner's dilemma possibilities. If everybody increases their comment length, won't we have to increase our comment length even more to stand out?

comment by Elo · 2015-07-20T22:47:47.200Z · LW(p) · GW(p)

I wonder if the value of a post is not correlated with upvotes, i.e. a post that is 1.3 times more valuable than another might have to be 100 words longer but only get 10% more upvotes.

I feel that we could encourage only those posts under 500 characters that seem to attract downvotes to consider increasing their length, on the rough assumption that sharing more words = being clearer = providing more value. Even with the worst outcome of such a strategy, we would probably see the garden get a little nicer.

And continuing the metaphor, it's not like we are chopping down trees, just clearing out a few weeds around the roses.

I propose a character count on the comment boxes so that people know how many characters they are writing, then possibly a popup (similar to the one about comments on downvoted posts), that says,

"we noticed that comments are more meaningful, helpful and thought out when they are at a minimum 500 characters, you can still post a shorter comment but you have N characters to go to cross the arbitrary threshold we decided on. You can still post but it will cost -1 karma. If you think that you will get at least one upvote more than usual then its certainly worthwhile posting as is; otherwise can you add more useful characters to your post?"

[pollid:1011]

Replies from: Lumifer
comment by Lumifer · 2015-07-21T01:24:32.039Z · LW(p) · GW(p)

"we noticed that comments are more meaningful, helpful and thought out when they are at a minimum 500 characters, you can still post a shorter comment but you have N characters to go to cross the arbitrary threshold we decided on. You can still post but it will cost -1 karma. If you think that you will get at least one upvote more than usual then its certainly worthwhile posting as is; otherwise can you add more useful characters to your post?"

I can tell you exactly what the outcome of that will be. I am sure you'll figure it out, too, if you think about your suggestion for a minute or two.

Replies from: Elo
comment by Elo · 2015-07-21T04:47:21.044Z · LW(p) · GW(p)

No, I couldn't think of it before I suggested the idea; please be explicit about it. Exactly what will go wrong, and is there a way to solve that without breaking the entire idea?

Replies from: Lumifer
comment by Lumifer · 2015-07-21T05:12:12.311Z · LW(p) · GW(p)

People will still write short replies.

Andthenfilltheremainderof500characterswithtrashjustsothatthestupidmachinebesatisfiedandtheywouldnothavetopaythe-1karmapricesinceit'seasytojustfillupspacewatchme:ooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahooohaaahphew.

Replies from: gjm, Elo
comment by gjm · 2015-07-21T14:01:09.896Z · LW(p) · GW(p)

And then get a storm of downvotes that cancels out the benefit they hoped to gain by padding their comment. And then probably not do it again.

What I'd be more worried about is that short comments may be more valuable than you would think from their average karma -- e.g., perhaps in some cases short not-exceptionally-insightful comments form (as it were) the skeleton of a discussion, within which insights might emerge. Or perhaps if everyone felt they mustn't post short comments unless they were exceptionally insightful, the barrier to participation would feel high enough that scarcely anyone would ever post anything, and LW would just wither.

comment by Elo · 2015-07-21T05:52:58.390Z · LW(p) · GW(p)

sure. some people will. and some people will re-think their choices and write with more effort. and some people will accept -1 karma. I don't see these 3 choices as a problem, if we can marginally increase the quality of interactions...

Replies from: Lumifer, David_Bolin
comment by Lumifer · 2015-07-21T16:22:47.301Z · LW(p) · GW(p)

I don't expect that an incentive to add some unnecessary volume will improve the quality of comments.

Recall Blaise Pascal's "I would have written a shorter letter, but I did not have the time" :-)

Replies from: Elo
comment by Elo · 2015-07-22T01:47:03.861Z · LW(p) · GW(p)

a lovely quote, I would not say that volume is correlated with quality, but I would say the potential benefits outweigh the disadvantages. Obviously enough people disagree with me.

comment by David_Bolin · 2015-07-21T07:41:54.312Z · LW(p) · GW(p)

I don't think this would be helpful, basically for the reason Lumifer said. In terms of how I vote personally, if I consider a comment unproductive, being longer increases the probability that I will downvote, since it wastes more of my time.