
I have 6 sets of interval data, each of which lies between 0 and 1. Each set, calculated by a computer program, gives the degree of similarity between pairs of sounds. What do you think is the best inter-rater reliability measure I can use to see how close the 6 judges are? To illustrate the data in each set, it could be: 0.98, 0.01, 0.5, ..., which shows that 'sound1' and 'sound2' are very similar (0.98), 'sound1' and 'sound3' are very different (0.01), and so on. Thank you so much.

shadi
  • @Shadi How many discrete values do you have? Are they really to be considered ordinal, if they lie between 0 and 1? – chl Oct 19 '10 at 15:42
  • They are continuous, every float value between 0 and 1. Is there any incompatibility between ordinal data and the [0, 1] interval? Thanks. – shadi Oct 19 '10 at 15:49
  • For me this would rather be considered an interval scale: http://en.wikipedia.org/wiki/Interval_scale#Interval_scale – Henrik Oct 19 '10 at 15:54
  • Yeah, I believe you are right. So you mean I cannot use ordinal measures for interval data, right? I corrected my question. I am looking forward to your guidance. Thanks. – shadi Oct 19 '10 at 15:59
  • When you stay between 0 and 1, does that include the endpoints 0 and 1 themselves or exclude them? – onestop Oct 19 '10 at 16:10
  • the interval includes 0 and 1. – shadi Oct 19 '10 at 16:14
  • @Shadi So, I suggest you change your title accordingly. – chl Oct 19 '10 at 16:26
  • You can use any ordinal measures on interval data, but you should not do so. You should use measures for interval data, as they use more information and are therefore more powerful than measures for ordinal data. Levels of measurement are themselves ordinally scaled (from low to high): nominal, ordinal, interval (and ratio). It is generally a good idea to use measures at the level of measurement that one actually has, as these are the most powerful/informative ones. – Henrik Oct 19 '10 at 16:29
  • @Shadi Following my latest comment, I asked for this question to be closed so that you can reformulate a new one, adding details about your design, especially the fact that you actually have 6 similarity matrices instead of 6 series of measurements. This way, others may provide useful insights into this question. You can still link to this question, but I really feel it calls for a new thread with your added clarifications so that everyone can contribute. – chl Oct 20 '10 at 17:34

2 Answers


Referring to your comments to @Henrik, I'm inclined to think that you actually have continuous measurements on a set of objects (here, your similarity measures) for 6 raters. You can compute an intraclass correlation coefficient, as described in Reliability in Elicitation Exercise. It will provide you with a measure of agreement (or concordance) between all 6 judges with respect to the assessments they made, or more precisely the proportion of the total variance that is accounted for by differences between the rated objects rather than between the raters. There's a working R script in the appendix.

Note that this assumes your measures are treated as real-valued measurements (I refer to @onestop's comment), not really proportions of similarity or the like between your paired sounds. I don't know of a specific version of the ICC for percentages or values bounded on an interval, only versions for binary or ranked data.
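To make this concrete, here is a minimal sketch in R, assuming the irr package (this is not the script from the linked thread, and the similarity matrices are simulated purely for illustration). Each program's symmetric similarity matrix has its upper triangle stacked into one column, so the rows are sound pairs and the columns are the 6 raters:

```r
## Minimal sketch, assuming the 'irr' package; the data are simulated
## for illustration only and are not the asker's actual similarities.
library(irr)

set.seed(42)
n.sounds   <- 100
n.programs <- 6

## A shared "true" similarity structure, perturbed differently by each program.
base <- matrix(runif(n.sounds^2), n.sounds, n.sounds)
base <- (base + t(base)) / 2                      # symmetric, values in [0, 1]

sims <- lapply(seq_len(n.programs), function(i) {
  noisy <- base + matrix(rnorm(n.sounds^2, sd = 0.05), n.sounds, n.sounds)
  noisy <- (noisy + t(noisy)) / 2                 # keep it symmetric
  pmin(pmax(noisy, 0), 1)                         # keep it in [0, 1]
})

## Stack the upper triangle of each similarity matrix into one column:
## rows = sound pairs, columns = raters (programs).
ratings <- sapply(sims, function(m) m[upper.tri(m)])

## Agreement ICC, treating the 6 programs as a random sample of raters.
icc(ratings, model = "twoway", type = "agreement", unit = "single")
```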

Update:

Following your comments about the parameters of interest and the language issue:

  • There are many other online resources on the ICC; I think David Howell provides a gentle and well-illustrated introduction to it. His discussion generalizes to k samples (judges/raters) without any difficulty, I think, or see this chapter from Sea and Fortna on Psychometric Methods. What you mainly have to think about is whether you want to consider your raters as a unique set of observers, not necessarily representative of all the raters that could have assessed your objects of measurement (this is called a fixed effect), or as a random sample of raters drawn from a larger (hypothetical) population of potential raters: in the former case this corresponds to a consistency ICC, in the latter case we talk about an agreement ICC (a short R sketch of both variants follows this list).

  • A colleague of mine successfully used Kevin Brownhill's script (from the MATLAB Central File Exchange). The ICC you are interested in is then cse=3 (if you consider that your raters are not representative of a more general population of raters).
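As a rough illustration of that fixed vs. random distinction in R (not the Matlab script above), here is a minimal sketch assuming the irr package and reusing the pairs-by-programs ratings matrix from the earlier sketch; the consistency ICC corresponds to the fixed-raters view and the agreement ICC to the random-sample view:

```r
## Sketch only: both variants on the pairs-by-programs 'ratings' matrix.
library(irr)

## Raters treated as the only raters of interest (fixed): consistency ICC.
icc(ratings, model = "twoway", type = "consistency", unit = "single")

## Raters treated as a random sample from a larger population of raters:
## agreement ICC, which also penalizes systematic offsets between raters.
icc(ratings, model = "twoway", type = "agreement", unit = "single")
```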

chl
  • If I want to explain the data in each set, it could be: 0.98, 0.01, 0.5, ..., which shows that 'sound1' and 'sound2' are very similar (0.98) and 'sound1' and 'sound3' are very different (0.01), and so on. In this case, do you believe I can use the ICC, or would Pearson be better? – shadi Oct 19 '10 at 16:59
  • @Shadi Did you read the thread I pointed to? You cannot use correlation-based criteria. So I would rather advise you to rely on the ICC, unless someone has a better idea for coping with bounded values. For me it's not a problem, as you're likely to end up with results similar to those of any other more complicated method. Still, I agree with others that in certain cases it makes sense not to use ANOVA or mixed-effects models with an inappropriate link function. – chl Oct 19 '10 at 17:00
  • Yes, I have studied them, thank you. Actually, I am completely new to this concept and did not understand almost 80% of the thread. I wanted to use the ICC in Matlab, but first I need to know the meaning of 'type', 'alpha' and 'r0' to decide which values are best for my purpose. Do you know any quick way to get some information? Thanks for your guidance. – shadi Oct 19 '10 at 18:17
  • @Shadi Oops, sorry, I didn't think of the language issue. I've updated my response. HTH – chl Oct 19 '10 at 19:42
  • Dear chl, thanks for your great guidance. I'm not sure if I explained my data well or not. When you asked "how many discrete values do you have", did you mean something like "partner1" and "partner2" in (uvm.edu/~dhowell/StatPages/More_Stuff/icc/icc.html)? If so, I should say just 1. Let me try to explain my data better. I have, for example, 100 different sounds. I want to know how similar each pair of sounds is. I used a program to do that for me. It generated some (4851) numbers. I also used 5 other programs to do the same thing. Now I want to know how close the results of the 6 programs are. – shadi Oct 19 '10 at 22:43
  • @Shadi So you have 6 "raters" or "methods" assessing the similarity of 100 objects. How do you explain that there are 4851 measurements: are there replicate measurements for each object (sound)? The coding (ordered or discrete, continuous) applies to the measurements. In your case, I consider it as reflecting a continuous scale of similarity on [0, 1]. – chl Oct 20 '10 at 06:30
  • Yes, as you said, I have 100 objects, and if I want to know the similarity between each pair of sounds, that gives (99*98)/2 pairs: (s1,s2)(s1,s3)(s1,s4)...(s1,s100)(s2,s3)(s2,s4)...(s2,s100)...(s99,s100). I know that (s2,s1)=(s1,s2), so I do not calculate that. So I will have 4851 measurements for each method (each method calculates the similarity of all the sounds pairwise), and in total I have 6*4851 data points. It is a continuous scale of similarity on [0, 1]. Thanks. – shadi Oct 20 '10 at 14:46
  • Could you please tell me if you still believe that the best one is Intra-class correlation? Thanks a lot. – shadi Oct 20 '10 at 16:54
  • @Shadi Huh, that makes a difference, because here you might also be interested in assessing whether your 6 similarity matrices (and not six series of measurements) present some form of variance attributable to the raters. Provided you consider a stacked version of your pairwise similarities (a long vector), the ICC remains applicable; otherwise, there exist methods to assess the comparability of (dis)similarity matrices, but I feel this should be clarified either in your question or, better, in a new question (and the mods could close this one) so that others may contribute. Let me ask the mods first. – chl Oct 20 '10 at 17:02

If you want to compare just two measures, simply take the correlation coefficient (Pearson's r).

Henrik
  • No, you can't use correlation to assess the reliability of the measurements or the inter-rater agreement, even for two series of measurements. The correlation computed from two raters will remain the same even if you add some arbitrary value to the 2nd rater's assessments, while the agreement ICC will decrease and correctly reflect that there is a rater effect (a small sketch illustrating this is appended after these comments). – chl Oct 19 '10 at 16:39
  • No, it is for more than two measures. – shadi Oct 19 '10 at 16:40
  • @Shadi Yes, I understand your design; I just took @Henrik's point of view with 2 series as an illustration; the same line of reasoning applies with $k$ series of measurements. – chl Oct 19 '10 at 16:44
  • @chl You are right! I will keep this post to remind everyone that this is wrong. – Henrik Oct 20 '10 at 08:21
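A small sketch of the point made in these comments, assuming the irr package; the data and the 0.3 offset are arbitrary, and the shifted values are deliberately not clipped to [0, 1], since the aim is only to show that Pearson's r ignores a systematic rater offset while the agreement ICC does not:

```r
## Illustrative sketch: correlation ignores a systematic rater offset,
## the agreement ICC does not.
library(irr)

set.seed(1)
rater1 <- runif(50)                              # 50 similarity values
rater2 <- rater1 + rnorm(50, sd = 0.02) + 0.3    # same ordering, shifted up

cor(rater1, rater2)                                               # ~ 0.99
icc(cbind(rater1, rater2), model = "twoway", type = "agreement")  # noticeably lower
```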