
I have read that Kendall's W should be avoided when dealing with non-rankings, especially with rating scales, which tend to have a lot of ties. Yet posts here seem to suggest it for ratings. As stated in this post, I have a small study of 21 respondents who rated some items from 0 to 5, with 0 being unimportant and 5 being very important, and I'm looking for measures of agreement for specific respondents. I am not looking for absolute agreement.

Whilst the ICC was suggested as a possible solution, there is an issue with the use of the F-test in this case, given the small number of respondents.

What are your views on Kendall's W in this case?

Cesare Camestre
    Where have you read that? A succinct description of the argument or at least a specific reference would probably be useful. – Gala Jul 22 '13 at 09:36
  • It's an issue to do with a lot of ties. [See the discussion here](https://www.ibm.com/developerworks/community/forums/html/topic?id=77777777-0000-0000-0000-000014903859) @GaëlLaurans – Cesare Camestre Jul 22 '13 at 09:43
  • Added links to relevant posts, @GaëlLaurans; I hope this helps in understanding the problem further. I was tempted to use Kendall's W because it was previously used in a similar case, but there are too many ties. – Cesare Camestre Jul 22 '13 at 09:56
  • I agree with Gaël that the null is absurd, but if you nevertheless need a p-value then you can get one for the ICC (which is definitely preferable to W) by doing a permutation test: randomly permute each subject's ratings, then recompute the ICC; do this a few thousand times, and see where your actual ICC comes in the distribution of random ICCs. – Ray Koopman Oct 20 '13 at 22:20
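
For anyone who wants to try Ray Koopman's permutation test, here is a minimal sketch in Python. The thread names no software, so everything below is an assumption for illustration: the function names are mine, and I use the consistency form of the ICC (ICC(3,1), computed from the two-way ANOVA mean squares), since the question says absolute agreement is not of interest.

```python
import numpy as np

rng = np.random.default_rng(0)

def icc_consistency(x):
    # ICC(3,1): two-way, single-rating, consistency form, computed from
    # the ANOVA mean squares. x has shape (n_items, k_raters).
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()    # between items
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()    # between raters
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

def icc_permutation_p(x, n_perm=5000):
    # Ray Koopman's procedure: permute each rater's ratings independently
    # across items, recompute the ICC each time, and see where the observed
    # ICC falls in that null distribution.
    observed = icc_consistency(x)
    exceed = sum(
        icc_consistency(np.column_stack([rng.permutation(col) for col in x.T]))
        >= observed
        for _ in range(n_perm)
    )
    return observed, (exceed + 1) / (n_perm + 1)

# Fake data for illustration only: 21 respondents rating 10 items on 0-5.
x = rng.integers(0, 6, size=(10, 21)).astype(float)
icc, p = icc_permutation_p(x)
print(f"ICC(3,1) = {icc:.3f}, permutation p = {p:.4f}")
```

With made-up data like this the observed ICC should land in the middle of the permuted distribution; with real agreement it should sit in the upper tail, which is where a small p-value would come from.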

1 Answer

I don't have anything specific to say about Kendall's W, but I don't get this concern about the ICC, the F-test, and the sample size.

Your sample is not so small that testing would necessarily be impossible, but why would you want to do such a test? To see if agreement is different from 0? This is quite a low bar and should be evident from the data. If you have doubts about that, these ratings certainly don't form a good measure of anything the raters agree on, so worrying about which specific measure of inter-rater agreement you are using, and the niceties of the relevant tests, would not really be your main concern.

On the other hand, anything you compute on a sample this small will obviously be subject to a lot of sampling variability and uncertainty. That's a rather basic fact that has nothing to do with the ICC or the F-test specifically, and there is no miracle inter-rater agreement index that would allow you to get around it.

At the end of the day, I think the underlying issue is that you seem to be asking many rather abstract questions in search of the “true” inter-rater agreement and some sort of pass/fail test that would tell you if it is “good enough”. Such a thing simply does not exist in my opinion, and published thresholds are really quite arbitrary. Instead of trying to interpret every bit of advice recommending one index or another, I think it could be more fruitful to read broadly about inter-rater agreement measures (see the references provided in other questions on this topic) and think about what each of them reveals about your data, rather than focus solely on whether agreement is “good” or not.

Gala
  • What I am after is: "Did most respondents answer the questions in the same way?" Did most respondents rate the answers in a similar way, i.e. did they all rate them high or low depending on the question? That is the question. – Cesare Camestre Jul 22 '13 at 11:11
  • My point is that this is not a well-defined question that could be addressed with a single technique, at least outside of some obvious cases (e.g. everybody giving the exact same rating). It admits several answers, depending on whether you are interested in absolute ratings, whether you want to know if participants tend to rate particular objects higher or lower, whether you are afraid of certain patterns of random responses, etc. The answer will also be quantitative and not binary. – Gala Jul 22 '13 at 11:44
  • I cannot guarantee that I can or will provide more specific feedback in any case, but giving us more information on what you want to do with these data and this inter-rater agreement measure would probably be better than just asking “What's a good measure?” or “I want to know if people respond in the same way.” It would also help you look at the literature in another way. – Gala Jul 22 '13 at 11:47
  • I am not interested in ABSOLUTE ratings, but in whether participants tend to respond in the same way, i.e. did they all rate items in more or less the same way? Am I clear enough? I'm after a statement like "Participants expressed agreement that the issue is important to them", but I need to substantiate that agreement with something. – Cesare Camestre Jul 22 '13 at 12:12
  • If all you want to do is show that there is *some* agreement between raters so that you can justify your methods, then why not just present the ICC value (which is standardized, and therefore the most informative to the reader) and the F-test, which shows that there is significant agreement (p < 0.0001), and move on with your life? Almost any option will detect that the level of agreement is > 0 because, as Gaël says, it's a very low bar. – atrichornis Jul 22 '13 at 12:28
  • @atrichornis The problem with the ICC is the F-test: the sample is too small to assume normality, if you get my point. – Cesare Camestre Jul 23 '13 at 10:51
  • Sample size doesn't affect the true distribution (but it might limit your ability to detect the true distribution). What it will do is reduce statistical power, but the F-test is showing significant agreement, and every other test you try will probably also show significant agreement, because that's what the data show. The problem with the F-test is not the sample size; it's the non-continuous, closed-ended (ordinal) rating system (or that's how it seems to me). But every test has potential problems: you have to decide. Personally, I'd just present both sets of results. – atrichornis Jul 23 '13 at 13:57
  • By the way, there is [a variant of Kendall's W with a correction for ties](http://en.wikipedia.org/wiki/Kendall's_W#Correction_for_ties); a minimal sketch of it appears after these comments. You'll need to find a software package that provides the option to correct for ties (there are R packages that do, but I gather you're not an R user). – atrichornis Jul 23 '13 at 14:14
  • @CesareCamestre Forget the F-test, then. – Gala Jul 23 '13 at 14:23
  • @GaëlLaurans can it simply be forgotten? – Cesare Camestre Jul 23 '13 at 14:56
  • @atrichornis I'm not sure that's OK given the number of ties in this case. I had seen that previously in Siegel's book on nonparametric statistics. – Cesare Camestre Jul 23 '13 at 14:56
  • @CesareCamestre Why not? Why do you think it's interesting in the first place? – Gala Jul 23 '13 at 19:53
  • It tests whether it's statistically different from zero... which could be relevant when quoting it. @GaëlLaurans – Cesare Camestre Jul 23 '13 at 20:32
  • Like I said, it's an extremely low bar, something that is usually obviously true because the null hypothesis is absurd. I am still not sure I get precisely what it is you are doing, but is it plausible that your participants would have absolutely no agreement on what's important or not? If not, why test it? (Routinely putting a *p*-value next to any number, no matter what, is *not* relevant.) – Gala Jul 23 '13 at 20:42
  • So you are suggesting presenting the ICC to highlight that there is some agreement amongst the respondents, but going without presenting the F-test. I am of the understanding that, given the non-normal distribution, we have to use the mixed version of the ICC. – Cesare Camestre Jul 23 '13 at 20:46
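
Following up on atrichornis's pointer to the tie-corrected variant of Kendall's W (linked above): here is a minimal sketch in Python rather than R. The function name is mine, and the implementation follows the standard correction for ties, with mid-ranks assigned within each rater so that tied ratings share an average rank.

```python
import numpy as np
from scipy.stats import rankdata

def kendalls_w_ties(ratings):
    # Kendall's W with the correction for ties. `ratings` has shape
    # (m_raters, n_items); each rater's scores are converted to
    # mid-ranks across items, so tied scores share an average rank.
    ratings = np.asarray(ratings, dtype=float)
    m, n = ratings.shape
    ranks = np.apply_along_axis(rankdata, 1, ratings)
    # S: sum of squared deviations of the item rank totals from their mean.
    R = ranks.sum(axis=0)
    S = ((R - R.mean()) ** 2).sum()
    # Tie correction T: for each rater, sum t^3 - t over groups of t ties.
    T = sum(
        (counts ** 3 - counts).sum()
        for counts in (np.unique(row, return_counts=True)[1] for row in ranks)
    )
    return 12.0 * S / (m ** 2 * (n ** 3 - n) - m * T)

# Toy example (made up): 3 raters scoring 4 items on the 0-5 scale.
print(kendalls_w_ties([[3, 5, 1, 1],
                       [4, 5, 2, 1],
                       [3, 4, 1, 2]]))  # ~0.93: strong agreement
```

One caveat worth noting: if every rater gives all items exactly the same score, the denominator is zero and W is undefined; with a 0-5 scale and 21 respondents that is unlikely, but real code should guard against it.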