Questions tagged [agreement-statistics]

Agreement is the degree to which two raters, instruments, etc., give the same value (rating / measurement) when applied to the same object. Agreement can be assessed to determine whether one measurement can be substituted for another, to establish the reliability of a measurement, and so on. Trying to assess agreement with a correlation coefficient (or perhaps a chi-squared test for categorical variables) is a very common and intuitive mistake. Special statistical methods have been designed for this task.
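
As a quick illustration of why a correlation coefficient is not an agreement measure, here is a minimal Python sketch with made-up ratings: two raters whose scores differ by a constant offset correlate perfectly, yet they never give the same value; summarizing the pairwise differences (Bland-Altman style) makes the disagreement visible.

```python
import numpy as np

# Made-up ratings: rater B is systematically 5 points higher than rater A
rater_a = np.array([10, 12, 15, 18, 20, 25], dtype=float)
rater_b = rater_a + 5

# Correlation is perfect even though the raters never give the same value
r = np.corrcoef(rater_a, rater_b)[0, 1]
print(f"Pearson r = {r:.2f}")  # 1.00

# Agreement summary: mean difference (bias) and 95% limits of agreement
diff = rater_b - rater_a
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)
print(f"bias = {bias:.1f}, limits of agreement = "
      f"[{bias - half_width:.1f}, {bias + half_width:.1f}]")
```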


384 questions
29 votes, 2 answers

Inter-rater reliability for ordinal or interval data

Which inter-rater reliability methods are most appropriate for ordinal or interval data? I believe that "Joint probability of agreement" and "Kappa" are designed for nominal data. Whilst "Pearson" and "Spearman" can be used, they are mainly used for…
shadi
25 votes, 2 answers

Is Joel Spolsky's "Hunting of the Snark" post valid statistical content analysis?

If you've been reading the community bulletins lately, you've likely seen The Hunting of the Snark, a post on the official StackExchange blog by Joel Spolsky, the CEO of the StackExchange network. He discusses a statistical analysis conducted on a…
Christopher
13 votes, 2 answers

Interrater reliability for events in a time series with uncertainty about event time

I have multiple independent coders who are trying to identify events in a time series -- in this case, watching video of face-to-face conversation and looking for particular nonverbal behaviors (e.g., head nods) and coding the time and category of…
dschulman
13 votes, 2 answers

Quadratic weighted kappa

I have done a little Googling about quadratic weighted kappa, but I couldn't find an explanation that I could understand. Can somebody give a resource or a brief explanation?
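
For orientation, a sketch of the usual textbook definition (not taken from any answer here): with $k$ ordered categories, $O_{ij}$ the observed proportion of items rated $i$ by one rater and $j$ by the other, and $E_{ij}$ the proportion expected by chance,

$$\kappa_w = 1 - \frac{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,O_{ij}}{\sum_{i=1}^{k}\sum_{j=1}^{k} w_{ij}\,E_{ij}}, \qquad w_{ij} = \frac{(i-j)^2}{(k-1)^2} \;\text{ (quadratic weights)}.$$
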
13 votes, 4 answers

How can I best deal with the effects of markers with differing levels of generosity in grading student papers?

Around 600 students have a score on an extensive piece of assessment, which can be assumed to have good reliability/validity. The assessment is scored out of 100, and it's a multiple-choice test marked by computer. Those 600 students also have…
12 votes, 4 answers

Matthews correlation coefficient with multi-class

Matthews correlation coefficient ($\textrm{MCC}$) is a measure of the quality of a binary classification ([Wikipedia][1]). The $\textrm{MCC}$ formulation is given for binary classification in terms of true positives ($TP$), false positives…
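
For reference, the binary-case formula the excerpt alludes to is

$$\textrm{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}},$$

and the multi-class generalization is computed directly from the full confusion matrix (for example, scikit-learn's `matthews_corrcoef` accepts multi-class labels).
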
9 votes, 1 answer

Computing inter-rater reliability in R with variable number of ratings?

Wikipedia suggests that one way to look at inter-rater reliability is to use a random effects model to compute intraclass correlation. The example of intraclass correlation talks about looking…
dfrankow
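
The question asks about R, but the computation is easy to write down directly; as an aside, here is a minimal Python sketch of the one-way random-effects ICC that this kind of design calls for (the column names `target` and `rating` are made up for illustration; the one-way model allows each target to have a different number of ratings).

```python
import pandas as pd

def icc1(df, target_col="target", rating_col="rating"):
    """One-way random-effects ICC(1,1), tolerating unbalanced group sizes."""
    groups = df.groupby(target_col)[rating_col]
    n = groups.size()                                  # ratings per target
    a, N = len(n), n.sum()
    k0 = (N - (n ** 2).sum() / N) / (a - 1)            # average ratings per target
    grand_mean = df[rating_col].mean()
    ss_between = (n * (groups.mean() - grand_mean) ** 2).sum()
    ss_within = ((df[rating_col] - groups.transform("mean")) ** 2).sum()
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (N - a)
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)

# Made-up example: three targets rated by varying numbers of raters
ratings = pd.DataFrame({
    "target": ["a", "a", "a", "b", "b", "c", "c", "c", "c"],
    "rating": [4, 5, 4, 2, 3, 5, 4, 5, 5],
})
print(icc1(ratings))
```
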
9 votes, 2 answers

Inter-rater reliability with many non-overlapping raters

I have a data set of 11,000+ distinct items, each of which was classified on a nominal scale by at least 3 different raters on Amazon's Mechanical Turk. 88 different raters provided judgments for the task, and no one rater completed more than about 800…
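
One standard tool for exactly this setting (many raters, sparse overlap, nominal codes, missing cells) is Krippendorff's alpha. A minimal sketch, assuming the third-party `krippendorff` Python package and its `alpha(reliability_data=..., level_of_measurement=...)` interface, with a made-up raters-by-items matrix and `np.nan` marking items a rater did not code:

```python
import numpy as np
import krippendorff  # third-party: pip install krippendorff

# Rows are raters, columns are items; np.nan marks items a rater skipped
ratings = np.array([
    [1.0,    2.0, np.nan, 1.0],
    [1.0,    2.0, 3.0,    np.nan],
    [np.nan, 2.0, 3.0,    1.0],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
print(f"Krippendorff's alpha = {alpha:.3f}")
```
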
9 votes, 2 answers

How can I use this data to calibrate markers with different levels of generosity in grading student papers?

12 teachers are teaching 600 students. The 12 cohorts taught by these teachers range in size from 40 to 90 students, and we expect systematic differences between the cohorts, as graduate students were disproportionately allocated to particular…
8 votes, 2 answers

Creating and interpreting Bland-Altman plot

Yesterday I heard of the Bland-Altman plot for the first time. I have to compare two methods of measuring blood pressure, and I need to produce a Bland-Altman plot. I am not sure if I get everything about it right, so here's what I think I know: I…
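
For concreteness, a minimal matplotlib sketch of the plot described in the excerpt (`method_a` and `method_b` are placeholders for the paired blood-pressure measurements): plot the pairwise mean against the pairwise difference, with horizontal lines at the mean difference (bias) and at bias ± 1.96 SD, the 95% limits of agreement.

```python
import numpy as np
import matplotlib.pyplot as plt

# Placeholder data: paired measurements from the two methods
rng = np.random.default_rng(0)
method_a = rng.normal(120, 15, size=50)
method_b = method_a + rng.normal(2, 5, size=50)   # method B reads a bit higher

mean = (method_a + method_b) / 2
diff = method_a - method_b
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)

plt.scatter(mean, diff, s=15)
plt.axhline(bias, color="black", label=f"bias = {bias:.1f}")
plt.axhline(bias + half_width, color="gray", linestyle="--",
            label="bias ± 1.96 SD")
plt.axhline(bias - half_width, color="gray", linestyle="--")
plt.xlabel("Mean of the two methods")
plt.ylabel("Difference (A − B)")
plt.legend()
plt.show()
```
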
8 votes, 1 answer

Quadratic weighted kappa versus linear weighted kappa

When should I use quadratic weighted kappa or linear weighted kappa? I have two observers evaluating the classes of a number of objects. The classes are fail, pass1, pass2, and excellent (ordinal scale). The errors in classification between "fail"…
andreSmol
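
A minimal scikit-learn sketch comparing the two weightings on an ordinal scale like the one in the excerpt (the ratings below are made up): linear weights penalize a disagreement of $d$ categories in proportion to $d$, quadratic weights in proportion to $d^2$, so quadratic kappa punishes large errors relatively more and small near-misses relatively less.

```python
from sklearn.metrics import cohen_kappa_score

# Ordinal classes encoded 0..3: fail, pass1, pass2, excellent (made-up ratings)
observer_1 = [0, 1, 2, 3, 3, 2, 1, 0, 2, 3]
observer_2 = [0, 1, 3, 3, 2, 2, 0, 0, 2, 3]

print("linear   :", cohen_kappa_score(observer_1, observer_2, weights="linear"))
print("quadratic:", cohen_kappa_score(observer_1, observer_2, weights="quadratic"))
```
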
8 votes, 1 answer

What am I measuring when I apply a graded response model to the "Hunting of the Snark" dataset?

In another question, I asked about the statistical validity of StackExchange's "Hunting of the Snark" dataset, and whether or not we could draw any conclusions from its results. I measured a few reliability coefficients to better understand them;…
Christopher
8 votes, 1 answer

How to perform inter-rater reliability with multiple raters, different raters per participant, and possible changes over time?

Participants were rated twice, with the 2 ratings separated by 3 years. For most participants the ratings were done by different raters, but for some (< 10%) the same rater performed both ratings. There were 8 raters altogether, with 2 doing ratings…
8 votes, 3 answers

Bland-Altman (Tukey Mean-Difference) plot for differing scales

I find that Bland-Altman plots for comparing two methods are extremely useful in assessing agreement. However, I'm curious if there is a similar method or transformation that can be used when the scales of the two methods are not identical, but…
8 votes, 0 answers

Inter-rater agreement of a gold standard dataset - a ceiling for reliable evaluation of algorithms?

In my field, a dated gold standard dataset is used to track progress in algorithm development. Now that state-of-the-art algorithms obtain a higher correlation than the inter-rater agreement of the dataset, there is concern about whether the dataset…
tomas