
If you've been reading the community bulletins lately, you've likely seen The Hunting of the Snark, a post on the official StackExchange blog by Joel Spolsky, the CEO of the StackExchange network. He discusses a statistical analysis conducted on a sample of SE comments to evaluate their "friendliness" from an outside user's perspective. The comments were randomly sampled from StackOverflow and the content analysts were members of Amazon's Mechanical Turk community, a market for work that connects companies to workers who do small, short tasks for affordable fees.

Not so long ago, I was a graduate student in political science and one of the classes I took was Statistical Content Analysis. The class's final project, in fact its entire purpose, was to conduct a detailed analysis of the New York Times' war reporting, to test whether the many assumptions Americans make about news coverage during wars are accurate (spoiler: evidence suggests they're not). The project was huge and quite fun, but by far its most painful section was the 'training and reliability testing phase', which occurred before we could conduct the full analysis. It had two purposes (see page 9 of the linked paper for a detailed description, as well as references to intercoder reliability standards in the content analysis statistical literature):

  1. Confirm all coders, i.e., readers of the content, were trained on the same qualitative definitions. In Joel's analysis, this meant everyone would know exactly how the project defined "friendly" and "unfriendly."

  2. Confirm all coders interpreted these rules reliably, i.e., we drew a random subsample of our sample, analyzed that subset, and then statistically demonstrated that our pairwise correlations on the qualitative evaluations were quite similar.

Reliability testing hurt because we had to do it three or four times. Until (1) was locked down and (2) showed high enough pairwise correlations, our results for the full analysis were suspect: they could not be demonstrated valid or invalid. Most importantly, we had to do pilot tests of reliability before the final sample set.

My question is this: Joel's statistical analysis lacked a pilot reliability test and didn't establish any operational definitions of "friendliness". Was the final data reliable enough to say anything about the statistical validity of his results?

For one perspective, consider this primer on the value of intercoder reliability and consistent operational definitions. From deeper in the same source, you can read about pilot reliability tests (item 5 in the list).

Per Andy W.'s suggestion in his answer, I'm attempting to calculate a variety of reliability statistics on the dataset, which is available here, using this command series in R (updated as I calculate new statistics).
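
For readers who do not want to open the linked script, here is a minimal sketch of how such agreement statistics can be computed with the `irr` package. The file name, the object name `ratings`, and the numeric coding scheme are assumptions made for illustration; the actual commands in the linked script may differ.

```r
library(irr)  # provides agree() and kripp.alpha()

# Assumed layout: one row per comment, one column per rater, numeric codes
# (e.g. -1 = unfriendly, 0 = neutral, 1 = friendly). File name is hypothetical.
ratings <- as.matrix(read.csv("coded_comments_wide.csv"))

agree(ratings, tolerance = 0)                # exact percentage agreement
agree(ratings, tolerance = 1)                # agreement allowing a one-category difference
kripp.alpha(t(ratings), method = "ordinal")  # raters in rows; the method choice depends
                                             # on how you treat the three categories
```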

Descriptive statistics are here

Percentage agreement (with tolerance = 0): 0.0143

Percentage agreement (with tolerance = 1): 11.8

Krippendorff's alpha: 0.1529467

I also attempted an item-response model for this data in another question.

Christopher
  • They [did publicly release the coding data](http://blog.stackoverflow.com/?attachment_id=12040), so one could go and assess the reliability of the coders themselves if one wanted to. – Andy W Aug 02 '12 at 17:53
  • I can get the final data set, yes, but my understanding of content analysis methodology was that reliability testing must occur before the full content analysis, and that no statistical manipulation could extract reliability from the final dataset. If that's not the case, then I suppose Mechanical Turk is a content analyst's dream. – Christopher Aug 02 '12 at 18:05
  • Re: #1 - It should be noted that this wasn't so much an exercise on if the comments *were* friendly or not, but more of an exercise on if the comments were *perceived* as friendly or not to an outside user. – Rachel Aug 02 '12 at 19:26
  • @Rachel: Clarified the purpose of the study in the first paragraph – Christopher Aug 02 '12 at 19:28
  • It seems that [Seth Rogers](http://meta.stackoverflow.com/users/168422/seth-rogers) was the statistician on this exercise -- it might be worth pinging him to get his feedback here. – jscs Aug 02 '12 at 19:28
  • @Rachel: I lack the stats knowledge to express this in the correct terminology, but no matter what was being tested, it still needs to be demonstrated to some degree of certainty that the respondents were not just mashing the "friendly" and "unfriendly" buttons at random. – jscs Aug 02 '12 at 19:32
  • @JoshCaswell I pinged Seth on twitter... – Christopher Aug 02 '12 at 19:33
  • @JoshCaswell I think that was part of the experiment though. People are unique, and everyone views things differently. I think you will actually get a more accurate result on if users view comments as friendly/unfriendly by asking 20 random people, instead of only asking 20 people that share the same friendly/unfriendly view as you. It was noted in one of the blog posts that the users that performed the evaluation were rated highly on MTurk, so I can assume they are not users who would simply hit a random button. – Rachel Aug 02 '12 at 19:36
  • @Rachel: Random button presses are just the most extreme form of unreliability. The tester needs to ensure -- to some reasonable extent -- that the measurements are actually measuring the same thing. If you ask a child, an average adult, and a basketball star "Is this fence tall or short?", and then compile the answers as a statistical exercise, I don't believe that you can have much confidence in the analysis. As I said, though, I'm no statistician, so I am open to being wrong about this, and will be interested to read any answers. – jscs Aug 02 '12 at 19:46
  • @JoshCaswell I'm quite interested in the answers as well :) I think the test done by Joel was more like asking someone "Do you think this question is tall or short", as opposed to asking "Is this fence tall or short". What matters is the user's opinion and point of view, not if the fence is actually tall or short. – Rachel Aug 02 '12 at 19:50
  • @Rachel I don't think that's right. If they were measuring how outsiders perceive comments on SO, they'd have needed quite a larger sample set than 20 people. – Christopher Aug 02 '12 at 19:52
  • It's the difference between concluding something about how outsiders perceive the comments, and concluding something about the comments themselves. In the first case, you'd need a much larger sample of people, and the conclusion would be "Outsiders think 2.3% of SO comments are unfriendly." In the second, it's "2.3% of SO comments are unfriendly." They're different conclusions, and I think the second one might not be possible to make, because we can't demonstrate the coders evaluate the comments similarly without a reliability test. – Christopher Aug 02 '12 at 19:56
  • @Christopher I don't think it was the same 20 people rating all 7000 comments. They simply asked for 20 ratings on each of the comments, and those ratings could come from any high-rated user. – Rachel Aug 02 '12 at 20:02
  • @Rachel: You still have to find out what they think "friendly" means. One way or the other, there needs to be a standard set for the measurement. We can say "We define this comment, _c_, as 'friendly'. Now we will find out if this agrees with the opinion of the person in the street." (this tests respondents' own definitions) or, "We wish to find out if this comment is 'friendly'. We will ask testers to rate it according to these criteria." (this tests the comment itself). (cont'd) – jscs Aug 02 '12 at 20:03
  • @Christopher Friendliness is very subjective, though. Depending on who you ask, the same comment can be viewed as both friendly and unfriendly. That is why I think it's more important to get the point of view of a large number of random users instead of someone who has the exact same viewpoint as yourself. – Rachel Aug 02 '12 at 20:04
  • To have both sides of the question undefined creates meaningless answers -- does the object being measured have some quality which may be measured differently by different measurers? (Also, if the former was the intent of the exercise, then there need to be _way_ more data points.) – jscs Aug 02 '12 at 20:04
  • @Rachel: "Friendliness is subjective" -- this is exactly the problem; everyone has his or her own definition. Imagine that you asked 20 people to rate 7000 comments as "glurkly", "frabnous", or "unglurkly". What do the results mean? – jscs Aug 02 '12 at 20:23

2 Answers


Reliability of scores is frequently interpreted in terms of Classical Test Theory. Here one has a true score, X, but what one observes for any particular outcome is not only the true score, but the true score with some error (i.e., Observed = X + error). In theory, by taking multiple observed measures of the same underlying test (and making some assumptions about the distribution of the errors of those tests), one can then estimate the unobserved true score.

Note that within this framework you have to assume that your multiple observed measures are measuring the same underlying test. Poor reliability of test items is then frequently taken as evidence that the observed measures are not measuring the same underlying test. This is just a convention of the field, though; poor reliability, in and of itself, does not prove (in any statistical sense) that the items are not measuring the same construct. So it could be argued that by taking many observed measures, even with very unreliable tests, one could arrive at a reliable measure of the true score.
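
To make that last point concrete, here is a small simulation sketch in plain R; the true scores and error variance are invented purely for illustration. Each single rater correlates only modestly with the true score, but the average across many raters tracks it closely.

```r
# Sketch of Observed = X + error: unreliable single measures vs. their average.
set.seed(1)
n_comments <- 1000
n_raters   <- 20
true_score <- rnorm(n_comments)                    # latent "true" friendliness
errors     <- matrix(rnorm(n_comments * n_raters, sd = 2),
                     nrow = n_comments, ncol = n_raters)
observed   <- true_score + errors                  # each rater = truth + noise

cor(observed[, 1], true_score)       # one noisy rater: modest correlation with X
cor(rowMeans(observed), true_score)  # average of 20 raters: much closer to X
```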

It is also worth mentioning that classical test theory isn't necessarily the only way to interpret such tests, and many scholars would argue that the concept of latent variables and item-response theory is always more appropriate than classical test theory.


A similar implicit assumption in classical test theory comes up when people say reliabilities are too high. High reliability doesn't say anything about the validity of whether particular item(s) measure some underlying test, but when reliabilities are too high researchers take it as evidence that the errors between the tests are not independent.

I'm not quite sure why you are so vehement about not going in and calculating the reliabilities yourself. Why could one not do this and subsequently interpret the analysis in light of this extra information?

Andy W
  • So first let me point out that I'm not a grad student doing stats anymore for a good reason: it wasn't quite my forte. I might be misremembering the methodology. All the same, I think you and I might be talking about different measures of reliability, or at least there is research suggesting that measuring intercoder reliability before the final analysis is conducted matters for validity. I've edited the question to include one source I found on the web, which cites considerably more research on the subject. – Christopher Aug 02 '12 at 19:15
  • It is a different context (reliability of dichotomous test items instead of some continuous outcome), but the logic is functionally the same, which is why I did not mention any specific measure of reliability (there are many). Your quote does not insinuate anything about `before the final analysis`, so I'm not quite sure where that notion comes from. – Andy W Aug 02 '12 at 19:54
  • Ah ha. You are correct, it is not quite a requirement. Reading further into that link I posted, it looks like pilot tests are considered a methodological best practice (search for pilot test in it). – Christopher Aug 02 '12 at 19:59
  • I've changed my question to accommodate the new information. Thank you for the help correcting my error. – Christopher Aug 02 '12 at 20:07
  • I dug my knowledge of R out of the deep recesses of my memory and am attempting to calculate [Krippendorff's alpha](http://en.wikipedia.org/wiki/Krippendorff%27s_alpha) on the dataset right now. If my machine is actually able to return a result, I'll pass it along. – Christopher Aug 02 '12 at 20:51
  • Added a few (classical test theory I believe) reliability measures. Attempted an item-response model, but I'm not quite sure how to approach the data. All the ordinal data examples I can find map hundreds of individuals to a few questions with ordinal responses. This data is thousands of comments mapped to 20 individuals who gave ordinal responses. As such, I can run the analysis without data manipulation, but I'm not quite sure what I'm testing. – Christopher Aug 03 '12 at 14:31
  • @Christopher, that is probably best as a second question for our site! – Andy W Aug 03 '12 at 14:33
  • [Other question](http://stats.stackexchange.com/questions/33639/what-am-i-measuring-when-i-apply-a-graded-response-model-to-the-hunting-of-the) is up. – Christopher Aug 03 '12 at 22:08

> Percentage agreement (with tolerance = 0): 0.0143
>
> Percentage agreement (with tolerance = 1): 11.8
>
> Krippendorff's alpha: 0.1529467

These agreement measures indicate that there is virtually no categorical agreement - each coder has his or her own internal cutoff point for judging comments as "friendly" or "unfriendly".

If we assume that the three categories are ordered, i.e. Unfriendly < Neutral < Friendly, we can also calculate the intraclass correlation as another measure of agreement. On a random sample of 1000 comments, there is an ICC(2,1) of .28 and an ICC(2,k) of .88. That means that if you took only a single one of the 20 raters, the results would be very unreliable (.28), whereas if you take the average of the 20 raters, the results are reliable (.88). Taking different combinations of three random raters, the averaged reliability is between .50 and .60, which would still be judged too low.
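
For reference, these ICCs can be obtained with the `icc()` function in the `irr` package; the matrix layout, file name, and sampling step below are assumptions about how the data were arranged, not necessarily the code used for the numbers above.

```r
library(irr)

# Assumed comments x raters matrix of ordered numeric codes; file name hypothetical.
ratings <- as.matrix(read.csv("coded_comments_wide.csv"))
set.seed(1)
sub <- ratings[sample(nrow(ratings), 1000), ]  # random sample of 1000 comments

icc(sub, model = "twoway", type = "agreement", unit = "single")   # ICC(2,1)
icc(sub, model = "twoway", type = "agreement", unit = "average")  # ICC(2,k)
```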

The average bivariate correlation between two coders is .34, which is also rather low.
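
A sketch of that calculation, using the same assumed `ratings` matrix as in the snippet above: take the matrix of rater-rater correlations and average its off-diagonal entries.

```r
# 'ratings' is the same assumed comments x raters matrix as in the sketch above.
r_mat <- cor(ratings, use = "pairwise.complete.obs")  # rater x rater correlations
mean(r_mat[lower.tri(r_mat)])                         # average pairwise correlation
```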

If these agreement measures are seen as a quality measure of the coders (who actually should show good agreement), the answer is: they are not good coders and should be better trained. If they are seen as a measure of "how good is spontaneous agreement amongst random persons", the answer is also: not very high. As a benchmark, the average correlation for physical attractiveness ratings is around .47 to .71 [1].

[1] Langlois, J. H., Kalakanis, L., Rubenstein, A. J., Larson, A., Hallam, M., & Smoot, M. (2000). Maxims or myths of beauty? A meta-analytic and theoretical review. Psychological Bulletin, 126, 390–423. doi:10.1037/0033-2909.126.3.390

Felix S