
I am writing with questions about a study I am helping with:

Two “raters” will score athletes subjectively on each of 10 skill performances, using an ordinal 0-to-3 scale (3 > 2 > 1 > 0). The ten scores will be summed to a maximum of 30, and the total will be bracketed into approximately three categories for comparison with respect to knee-injury incidence. Reliability/agreement between the two raters’ scores must be determined.

How should the inter-rater reliability testing be set up? Should we use kappa and/or a Kendall statistic? Should the same data be used for the individual skills and for the total score? How much data is needed for this IRR testing?

The lead investigator anticipates highly consistent scores between the raters and, for the main study, plans to divide the testing of subjects between them accordingly.
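For concreteness, here is a rough sketch in Python of the sort of analysis I have in mind. The scores are invented, the sample size of 30 athletes is a placeholder, and the bracket cut-points (12 and 18) are placeholders too, since the actual categories have not been fixed.

```python
# Sketch only: invented scores for two raters on 10 ordinal skill items (0-3) per athlete.
import numpy as np
from scipy.stats import kendalltau
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(0)

n_athletes, n_skills = 30, 10          # placeholder sample size
rater_a = rng.integers(0, 4, size=(n_athletes, n_skills))
# Rater B mostly agrees with rater A, just to make the example non-trivial.
rater_b = np.clip(rater_a + rng.integers(-1, 2, size=(n_athletes, n_skills)), 0, 3)

# Per-skill agreement: weighted kappa treats the 0-3 scale as ordered.
for skill in range(n_skills):
    kappa = cohen_kappa_score(rater_a[:, skill], rater_b[:, skill],
                              weights="quadratic")
    print(f"skill {skill + 1}: quadratic-weighted kappa = {kappa:.2f}")

# Total scores (0-30): rank agreement via Kendall's tau-b.
total_a, total_b = rater_a.sum(axis=1), rater_b.sum(axis=1)
tau, p = kendalltau(total_a, total_b)
print(f"totals: Kendall tau-b = {tau:.2f} (p = {p:.3f})")

# Agreement on the ~3 brackets of the total (cut-points 12 and 18 are placeholders).
bracket_a = np.digitize(total_a, [12, 18])
bracket_b = np.digitize(total_b, [12, 18])
print("bracket kappa =", cohen_kappa_score(bracket_a, bracket_b, weights="linear"))
```

Quadratic weights treat a 0-versus-3 disagreement as worse than a 0-versus-1 disagreement, which seems to fit the ordered scale.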

Jeff
  • Kindly narrow down your questions to those not already answered by existing threads on this site. – rolando2 Sep 07 '11 at 00:55
  • @rolando2 Could you please provide a link to (at least) one such thread? – whuber Sep 07 '11 at 16:08
  • Further to whuber's comment: I outlined the nitty-gritty of one part of a study. It is very specific, and there are just a few questions. I looked through other postings; if they do answer these questions, links or references would help. Can someone address these few questions? Thank you. – Jeff Sep 07 '11 at 22:07
  • Some existing threads with relevant info: http://stats.stackexchange.com/questions/7208/can-one-use-cohens-kappa-for-two-judgements-only/7354#7354 and http://stats.stackexchange.com/questions/3539/inter-rater-reliability-for-ordinal-or-interval-data and http://stats.stackexchange.com/questions/12415/assessing-and-testing-inter-rater-agreement-with-kappa-statistic-on-a-set-of-bina – rolando2 Sep 07 '11 at 22:28
  • I read Stemler, Steven E. (2004). A comparison of consensus, consistency, and measurement approaches to estimating interrater reliability. Practical Assessment, Research & Evaluation, 9(4). http://PAREonline.net/getvn.asp?v=9&n=4 . – Jeff Sep 13 '11 at 06:40
  • That helped, but I would still like to know: how many subjects should be tested by both raters before splitting the data collection between them? – Jeff Sep 13 '11 at 06:50
  • If consensus is high, consistency must be low and is not so relevant, right? (See the toy sketch after these comments.) – Jeff Sep 15 '11 at 20:51
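
A toy sketch (invented scores, not study data) of the consensus-versus-consistency distinction from the Stemler paper, as I understand it: consensus here means exact agreement and consistency means rank correlation, and the two can give very different pictures.

```python
# Toy illustration of consensus vs. consistency (invented scores).
import numpy as np
from scipy.stats import kendalltau

rater_a = np.array([0, 1, 1, 2, 2, 3, 3, 3, 2, 1])
rater_b = rater_a + 1  # rater B is systematically one point more lenient

consensus = np.mean(rater_a == rater_b)        # exact agreement: 0.0
consistency, _ = kendalltau(rater_a, rater_b)  # rank correlation: 1.0
print(f"consensus = {consensus:.2f}, consistency (tau-b) = {consistency:.2f}")
```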

0 Answers