Questions tagged [agreement-statistics]

Agreement is the degree to which two raters, instruments, etc., give the same value (rating / measurement) when applied to the same object. Agreement can be assessed to determine whether one measurement can be substituted for another, to establish the reliability of a measurement, and so on. Trying to assess agreement with a correlation coefficient (or perhaps a chi-squared test for categorical variables) is a very common and intuitive mistake. Special statistical methods have been designed for this task.
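
As a quick illustration of why a correlation coefficient is not an agreement measure, here is a minimal Python sketch with made-up ratings: two raters whose scores differ by a constant offset correlate perfectly, yet they never give the same value; summarizing the pairwise differences (Bland-Altman style) makes the disagreement visible.

```python
import numpy as np

# Made-up ratings: rater B is systematically 5 points higher than rater A
rater_a = np.array([10, 12, 15, 18, 20, 25], dtype=float)
rater_b = rater_a + 5

# Correlation is perfect even though the raters never give the same value
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Pearson r = {r:.2f}")  # 1.00

# Agreement summary: mean difference (bias) and 95% limits of agreement
diff = rater_b - rater_a
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"bias = {bias:.1f}, limits of agreement = "
      f"[{bias - half_width:.1f}, {bias + half_width:.1f}]")
```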


384 questions
29 votes, 2 answers

Inter-rater reliability for ordinal or interval data

Which inter-rater reliability methods are most appropriate for ordinal or interval data? I believe that "Joint probability of agreement" and "Kappa" are designed for nominal data. Whilst "Pearson" and "Spearman" can be used, they are mainly used for…
shadi
25 votes, 2 answers

Is Joel Spolsky's "Hunting of the Snark" post valid statistical content analysis?

If you've been reading the community bulletins lately, you've likely seen The Hunting of the Snark, a post on the official StackExchange blog by Joel Spolsky, the CEO of the StackExchange network. He discusses a statistical analysis conducted on a…
Christopher
13 votes, 2 answers

Interrater reliability for events in a time series with uncertainty about event time

I have multiple independent coders who are trying to identify events in a time series -- in this case, watching video of face-to-face conversation and looking for particular nonverbal behaviors (e.g., head nods) and coding the time and category of…
dschulman
13 votes, 2 answers

Quadratic weighted kappa

I have done a little Googling about quadratic weighted kappa, but I couldn't find an explanation that I could understand. Can somebody give a resource or a brief explanation?
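
For orientation, a sketch of the usual textbook definition (not taken from any answer here): with $k$ ordered categories, $O_{ij}$ the observed proportion of items rated $i$ by one rater and $j$ by the other, and $E_{ij}$ the proportion expected by chance,

$$\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,O_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(k-1)^2} \;\text{ (quadratic weights)}.$$
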
13 votes, 4 answers

How can I best deal with the effects of markers with differing levels of generosity in grading student papers?

Around 600 students have a score on an extensive piece of assessment, which can be assumed to have good reliability/validity. The assessment is scored out of 100, and it's a multiple-choice test marked by computer. Those 600 students also have…
12 votes, 4 answers

Matthews correlation coefficient with multi-class

Matthews correlation coefficient ($\textrm{MCC}$) is a measure of the quality of a binary classification ([Wikipedia][1]). The $\textrm{MCC}$ formulation is given for binary classification in terms of true positives ($TP$), false positives…
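
For reference, the binary-case formula the excerpt alludes to is

$$\textrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}},$$

and the multi-class generalization is computed directly from the full confusion matrix (for example, scikit-learn's `matthews_corrcoef` accepts multi-class labels).
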
9 votes, 1 answer

Computing inter-rater reliability in R with variable number of ratings?

Wikipedia suggests that one way to look at inter-rater reliability is to use a random effects model to compute intraclass correlation. The example of intraclass correlation talks about looking…
dfrankow
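
The question asks about R, but the computation is easy to write down directly; as an aside, here is a minimal Python sketch of the one-way random-effects ICC that this kind of design calls for (the column names `target` and `rating` are made up for illustration; the one-way model allows each target to have a different number of ratings).

```python
import pandas as pd

def icc1(df, target_col="target", rating_col="rating"):
    """One-way random-effects ICC(1,1), tolerating unbalanced group sizes."""
    groups = df.groupby(target_col)[rating_col]
    n = groups.size()                                  # ratings per target
    a, N = len(n), n.sum()
    k0 = (N - (n ** 2).sum() / N) / (a - 1)            # average ratings per target
    grand_mean = df[rating_col].mean()
    ss_between = (n * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = ((df[rating_col] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (N - a)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)

# Made-up example: three targets rated by varying numbers of raters
ratings = pd.DataFrame({
    "target": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "rating": [4, 5, 4, 2, 3, 5, 4, 5, 5],
})
print(icc1(ratings))
```
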
9 votes, 2 answers

Inter-rater reliability with many non-overlapping raters

I have a data set of 11,000+ distinct items, each of which was classified on a nominal scale by at least 3 different raters on Amazon's Mechanical Turk. 88 different raters provided judgments for the task, and no one rater completed more than about 800…
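
One standard tool for exactly this setting (many raters, sparse overlap, nominal codes, missing cells) is Krippendorff's alpha. A minimal sketch, assuming the third-party `krippendorff` Python package and its `alpha(reliability_data=..., level_of_measurement=...)` interface, with a made-up raters-by-items matrix and `np.nan` marking items a rater did not code:

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Rows are raters, columns are items; np.nan marks items a rater skipped
ratings = np.array([
    [1.0,    2.0, np.nan, 1.0],
    [1.0,    2.0, 3.0,    np.nan],
    [np.nan, 2.0, 3.0,    1.0],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```
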
9 votes, 2 answers

How can I use this data to calibrate markers with different levels of generosity in grading student papers?

12 teachers are teaching 600 students. The 12 cohorts taught by these teachers range in size from 40 to 90 students, and we expect systematic differences between the cohorts, as graduate students were disproportionately allocated to particular…
8 votes, 2 answers

Creating and interpreting Bland-Altman plot

Yesterday I heard of the Bland-Altman plot for the first time. I have to compare two methods of measuring blood pressure, and I need to produce a Bland-Altman plot. I am not sure if I get everything about it right, so here's what I think I know: I…
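
For concreteness, a minimal matplotlib sketch of the plot described in the excerpt (`method_a` and `method_b` are placeholders for the paired blood-pressure measurements): plot the pairwise mean against the pairwise difference, with horizontal lines at the mean difference (bias) and at bias ± 1.96 SD, the 95% limits of agreement.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: paired measurements from the two methods
rng = np.random.default_rng(0)
method_a = rng.normal(120, 15, size=50)
method_b = method_a + rng.normal(2, 5, size=50)   # method B reads a bit higher

mean = (method_a + method_b) / 2
diff = method_a - method_b
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)

plt.scatter(mean, diff, s=15)
plt.axhline(bias, color="black", label=f"bias = {bias:.1f}")
plt.axhline(bias + half_width, color="gray", linestyle="--",
            label="bias ± 1.96 SD")
plt.axhline(bias - half_width, color="gray", linestyle="--")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (A − B)")
plt.legend()
plt.show()
```
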
8 votes, 1 answer

Quadratic weighted kappa versus linear weighted kappa

When should I use quadratic weighted kappa or linear weighted kappa? I have two observers evaluating the classes of a number of objects. The classes are fail, pass1, pass2, and excellent (ordinal scale). The errors in classification between "fail"…
andreSmol
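
A minimal scikit-learn sketch comparing the two weightings on an ordinal scale like the one in the excerpt (the ratings below are made up): linear weights penalize a disagreement of $d$ categories in proportion to $d$, quadratic weights in proportion to $d^2$, so quadratic kappa punishes large errors relatively more and small near-misses relatively less.

```python
from sklearn.metrics import cohen_kappa_score

# Ordinal classes encoded 0..3: fail, pass1, pass2, excellent (made-up ratings)
observer_1 = [0, 1, 2, 3, 3, 2, 1, 0, 2, 3]
observer_2 = [0, 1, 3, 3, 2, 2, 0, 0, 2, 3]

print("linear   :", cohen_kappa_score(observer_1, observer_2, weights="linear"))
print("quadratic:", cohen_kappa_score(observer_1, observer_2, weights="quadratic"))
```
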
8 votes, 1 answer

What am I measuring when I apply a graded response model to the "Hunting of the Snark" dataset?

In another question, I asked about the statistical validity of StackExchange's "Hunting of the Snark" dataset, and whether or not we could draw any conclusions from its results. I measured a few reliability coefficients to better understand them;…
Christopher
8 votes, 1 answer

How to perform inter-rater reliability with multiple raters, different raters per participant, and possible changes over time?

Participants were rated twice, with the 2 ratings separated by 3 years. For most participants the ratings were done by different raters, but for some (< 10%) the same rater performed both ratings. There were 8 raters altogether, with 2 doing ratings…
8 votes, 3 answers

Bland-Altman (Tukey Mean-Difference) plot for differing scales

I find that Bland-Altman plots for comparing two methods are extremely useful in assessing agreement. However, I'm curious if there is a similar method or transformation that can be used when the scales of the two methods are not identical, but…
8 votes, 0 answers

Inter-rater agreement of a gold standard dataset - a ceiling for reliable evaluation of algorithms?

In my field, a dated gold standard dataset is used to track progress in algorithm development. Now that state-of-the-art algorithms obtain a higher correlation than the inter-rater agreement of the dataset, there is concern about whether the dataset…
tomas