In my system I have users rating objects (e.g. films) from 1 to 5 stars. The mean rating clearly tells us something about the overall opinion on an object, but I'd like a more precise measure of variability. For example, if half of the users give 1 star and the other half give 5, the average is still 3, so it tells me nothing about the agreement among users. I'd like a measure of inter-subject agreement, ranging from 1 (all users agree) to 0 (random ratings).

What would be a good and standard measure for this aspect of a set of ratings?

chl
Mulone
    Look into Cohen's kappa. – Peter Flom Feb 21 '12 at 11:22
  • _Disagreement_ and _randomness of votes_ are two different things. For example when there are only $1$s and $5$s the pattern is very far from random (and means that it is either loved or hated). So do you want '0' for random rating and negative for 'controversial' stuff? – Piotr Migdal Feb 21 '12 at 11:27
  • I guess I need two different measures, one for randomness and one for polarization. so if everybody agrees, I get rand=0, and pol=0, if it's random, I get rand=1, pol=0, if it's perfectly polarized I get rand=0, pol=1. Does it make sense? – Mulone Feb 21 '12 at 11:34
  • @Mulone Two variables would make it even more complicated (but yes, it can be done with e.g. entropy and st.dev.). Rather you can measure e.g. standard deviation of grades. For the maximal agreement $SD=0$, for the maximal disagreement $SD=2$, while for random case (here by random I mean that every grade has the same number of responders, which is likely _not_ to be the best model of randomness for your system) $SD=\sqrt{2}$. – Piotr Migdal Feb 21 '12 at 11:42
  • @PiotrMigdal What would you recommend as a model of randomness in this context? What particular measure of entropy would be appropriate? Thanks again. – Mulone Feb 21 '12 at 12:33
  • @Mulone Different states (i.e. ratings) need not be equally occupied. It is not a die roll. Here a 'random' distribution should rather be the distribution of all ratings across the service. But is it that for your service you just care how much people agree? – Piotr Migdal Feb 21 '12 at 13:32
  • 1
    This website has a lot of good information about inter-rater agreement: http://www.john-uebersax.com/stat/agree.htm – gung - Reinstate Monica Feb 21 '12 at 20:56

2 Answers

As your ratings are on a numeric scale (i.e. not merely categorical), one standard tool is simply the standard deviation (SD). For your data (ratings $1$–$5$ in steps of $1$) it is

  • $0$ for complete agreement (i.e. everyone casts the same vote),
  • $\sqrt{2}$ when votes are 'random' (a uniform $\frac{1}{5}$ of the votes on each option),
  • $2$ for the maximal disagreement.

As you see, disagreement and random votes are not the same thing. In short: when people answer only '1' or '5', the disagreement is clearly stronger than random.

You can rescale it so that $1$ means complete agreement and $0$ means random votes: $$\text{agreement} = 1-\frac{\text{SD}}{\sqrt{2}}.$$ Note that maximally polarized votes then give a negative value, down to $1-\frac{2}{\sqrt{2}} = 1-\sqrt{2}\approx -0.41$.
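A minimal Python sketch of this rescaling (the function name `agreement` is mine, not standard). It uses the population standard deviation, for which a uniform spread of votes over $1$–$5$ gives $\text{SD}=\sqrt{2}$:

```python
import statistics

def agreement(ratings):
    """Rescaled agreement for 1-5 star ratings:
    1 = unanimity, 0 = uniform ('random') votes,
    negative = polarization stronger than random.
    """
    # pstdev is the population SD; a uniform 1..5 spread has SD = sqrt(2)
    return 1 - statistics.pstdev(ratings) / 2 ** 0.5

print(agreement([3, 3, 3, 3, 3]))  # unanimity -> 1.0
print(agreement([1, 2, 3, 4, 5]))  # uniform   -> 0.0
print(agreement([1, 1, 5, 5]))     # polarized -> about -0.41
```

Note the choice of `pstdev` (divide by $n$) rather than `stdev` (divide by $n-1$): with the sample SD the bounds $0$ and $\sqrt{2}$ above would only hold asymptotically.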

Piotr Migdal
Try clustering (subspace or spectral methods are quite appropriate), building association rules, or any other method whose objective is to detect and present bias or regions of low entropy in the data.

jcb
  • Actually what I meant is inter-subject within the same object. I'm not trying to compare the users. – Mulone Feb 21 '12 at 13:03