
What is an appropriate test for a relationship between two quantities when one of them is itself obtained by averaging human ratings?

As an example, there are good and bad bottles of wine. The company scientist has found a possible objective measure of the quality of the wine. To validate this measure, we gather 100 different bottles and ask 10 people to rate each bottle on a 1-5 scale. We then average the ratings, so each bottle has a single average score A. Each bottle also has an objective score B given by the scientist's measure, which is on a continuous scale.

One could compute the correlation between A and B across the 100 bottles. Alternatively, one could put some of the highly human-rated wines in group A1 and some of the low-rated wines in group A2, and then run a t-test for a difference in the mean of the scientist's measure between groups A1 and A2.
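For concreteness, here is a minimal sketch of these two naive analyses. Everything in it is an assumption made for illustration: the data are simulated, not real wine ratings, the groups A1 and A2 are taken as the top and bottom 30 bottles by A, and the helper names `pearson_r` and `welch_t` are hypothetical. Both analyses treat each averaged score A as an exact number.

```python
import math
import random
import statistics

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def welch_t(g1, g2):
    """Welch's t statistic for a difference of means (unequal variances)."""
    v1, v2 = statistics.variance(g1), statistics.variance(g2)
    se = math.sqrt(v1 / len(g1) + v2 / len(g2))
    return (statistics.fmean(g1) - statistics.fmean(g2)) / se

# Simulated (hypothetical) data: 100 bottles, 10 raters each, 1-5 scale.
rng = random.Random(0)
n_bottles, n_raters = 100, 10
true_quality = [rng.uniform(1.5, 4.5) for _ in range(n_bottles)]
ratings = [
    [min(5, max(1, round(rng.gauss(q, 1.0)))) for _ in range(n_raters)]
    for q in true_quality
]
A = [statistics.fmean(r) for r in ratings]          # averaged human score
B = [q + rng.gauss(0, 0.5) for q in true_quality]   # simulated objective score

# Analysis 1: correlation between A and B across bottles.
r = pearson_r(A, B)

# Analysis 2: compare B between high-rated (A1) and low-rated (A2) bottles.
order = sorted(range(n_bottles), key=lambda i: A[i])
A2_idx, A1_idx = order[:30], order[-30:]
t = welch_t([B[i] for i in A1_idx], [B[i] for i in A2_idx])

print(f"Pearson r(A, B) = {r:.3f}")
print(f"Welch t for B, group A1 vs A2 = {t:.2f}")
```

Note that the per-bottle spread of the 10 ratings never enters either statistic; only the averages in `A` are used.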

But neither of these takes into account the fact that the ratings A were themselves obtained by averaging, so each average has its own variance.

(To explain the question further, suppose the wine bottles were rated on a 1-1000 scale rather than a 1-5 scale. Consider two bottles: the first has ratings between 498 and 502 with an average of 500, and the second has an average rating of 520 with similarly small variance. The objective measure also gives the second bottle a higher score, so this example is weak support for a relationship. But now suppose that the ratings of the first bottle ranged from 1 to 1000, with an average of 500, and the ratings of the second also had huge variance. In this case the difference in means seems accidental, and this pair of (A,B) should provide less support for the proposed relationship.)
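The intuition in this example can be put into numbers with a small sketch. The ratings below are made up to match the description (10 raters per bottle, means of 500 and 520), and the helper names are hypothetical. It compares the difference between the two bottle means with the standard error of that difference, assuming the raters are independent.

```python
import math
import statistics

def mean_and_se(ratings):
    """Mean of one bottle's ratings and the standard error of that mean."""
    m = statistics.fmean(ratings)
    se = statistics.stdev(ratings) / math.sqrt(len(ratings))
    return m, se

def standardized_difference(r1, r2):
    """Difference in bottle means divided by its standard error.

    Assumes the two sets of ratings are independent.
    """
    m1, se1 = mean_and_se(r1)
    m2, se2 = mean_and_se(r2)
    return (m2 - m1) / math.sqrt(se1 ** 2 + se2 ** 2)

# Made-up ratings on a 1-1000 scale, 10 raters per bottle.
tight_1 = [498, 499, 499, 500, 500, 500, 500, 501, 501, 502]   # mean 500
tight_2 = [x + 20 for x in tight_1]                            # mean 520
wide_1 = [1, 100, 250, 400, 480, 520, 600, 750, 899, 1000]     # mean 500
wide_2 = [21, 120, 270, 420, 500, 540, 620, 770, 939, 1000]    # mean 520

z_tight = standardized_difference(tight_1, tight_2)
z_wide = standardized_difference(wide_1, wide_2)

print(f"small rater variance: standardized difference = {z_tight:.1f}")
print(f"large rater variance: standardized difference = {z_wide:.2f}")
```

In both cases the averages are 500 and 520, so a correlation or t-test computed on the averages alone sees identical data; only the standardized difference separates the two situations.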

How to account for this?

kjetil b halvorsen
  • Is not using the average an option? Is it permissible to use the raw data? – Sal Mangiafico Jan 13 '19 at 17:02
  • This is an FAQ that has received no good answers. It's the same question as how to rank ratings given to products on Web sites, for instance. The main reason it has no good answers is that any answer ultimately will reflect a trade-off between the sizes of the differences and the uncertainties in them. That trade-off reflects the degree to which you might be risk-averse or risk-seeking and therefore introduces a personal, subjective element into the problem. – whuber Jan 13 '19 at 19:33
  • Some solutions based on multi-criterion decision analysis are discussed at https://stats.stackexchange.com/questions/9137 and https://stats.stackexchange.com/questions/3201 – whuber Jan 13 '19 at 19:36
  • @Sal Mangiafico Yes, using the individual raw ratings rather than the average is possible. However, we don't know how to do this. Simply correlating the individual ratings with the scientist's measure would tell us how each person correlates with the objective measure. That would be useful if the objective measure were proven and we wanted to identify a couple of human raters. – largewords Jan 13 '19 at 23:54
  • But in our case, we're not interested in the individual people, only in using the overall human-rated quality of a bottle of wine to validate the objective measure, where each bottle is rated by a small group of people (we do not care about their individual opinions). – largewords Jan 13 '19 at 23:56
  • => Another approach would be to have only 1 person rate each bottle, and increase the number of bottles as needed. That would solve the question, but that is too expensive in our actual case. I.e. we need to keep the number of bottles of wine small. – largewords Jan 13 '19 at 23:57
  • @whuber I looked at those examples and see that there is an intrinsically subjective trade-off between the importance of several attributes, a "Pareto" situation. However, I do not see how it applies to our example: we only have **ONE** subjective attribute, the "quality" of the wine, and one objective attribute, the scientist's score. There is no other attribute to trade off against. (Each individual rater has to internally decide what that one quality score is.) – largewords Jan 13 '19 at 23:58
  • We are not breaking out separate things like dryness, full body, etc. Since the quality of a bottle is subjective, we want to average across a few people to get a better measure. – largewords Jan 13 '19 at 23:59
  • But you *do* have two characteristics that you explicitly mention: the mean and the dispersion of the ratings. If you don't care about the dispersion, then the answer is simple and obvious: use the means alone. – whuber Jan 14 '19 at 00:40
  • You are the expert, but I can't imagine treating the mean and dispersion differently. For example, suppose there was no multiple rating of each bottle, so the situation would be a standard correlation or one-way ANOVA. In that case the ratings still have a mean and variance, but one would not weight them differently. The difference here is that the ratings are grouped by bottle into a mean (per bottle). – largewords Jan 14 '19 at 07:00
  • Is this a situation for a grouped or hierarchical model? – largewords Jan 14 '19 at 07:01

0 Answers