
Suppose the outcome I want to predict is a vector of mean values, each calculated from a sample of a different size. More specifically, I am predicting mean ratings for products, and each rating was calculated from a different number of people who voted for the product in question. So the data might look like this:

    MeanRating Votes
           4.5     1
           2.0    10
           5.0  1000

Obviously, the quality of information differs across these ratings: I can be very confident about MeanRating = 5, since it is based on 1000 votes, but a MeanRating of 4.5 based on a single vote does not convince me at all.

To predict MeanRating, I want to test a number of machine learning methods that differ in their underlying math (e.g., neural nets vs. gradient boosting machines vs. random forests). I know that in some cases (e.g., tree-based methods) one can pass Votes as weights. However, as far as I know, such an approach is not universally applicable to all the methods I'd potentially like to test. Hence my question: how can I take into account the quality of information about my outcome variable (MeanRating in the example above), regardless of the method my model is based on?

Syarzhuk
  • Can you say more about the variables that 'explain' mean rating? Just an example of a variable? – Sep 06 '16 at 15:18

1 Answer


Try to put the quality measure into the data, not in the model.

You do have to make an assumption, though: you need to guess the distribution of your data points. I'll try to explain below.

I believe the following pragmatic approach will deliver workable results: assuming a Gaussian distribution for your ratings, the recipe would be as follows:

  1. For each of your measurements, you determine a parameter $\sigma = \mu\sqrt{\nu}/\nu = \mu/\sqrt{\nu}$, where $\mu$ is the mean rating and $\nu$ is the accompanying vote count.
  2. For each measurement, you generate a large number $N$ of random numbers from the normal distribution that has mean $\mu$ and standard deviation $\sigma$.

The effect is that your rating of 4.5 with only 1 vote is now represented in your data with a rather large spread ($\sigma = 4.5$), whereas the rating of 2 with 10 votes has a much smaller spread of about 0.63.

Feed the augmented data to your algorithms where all data points have the same weight.
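
For concreteness, here is a minimal Python sketch of this recipe (the column names, $N = 100$ and the random seed are illustrative assumptions, not part of the original question):

    import numpy as np
    import pandas as pd

    # The three example measurements from the question.
    df = pd.DataFrame({"MeanRating": [4.5, 2.0, 5.0],
                       "Votes":      [1,   10,  1000]})

    N = 100                            # synthetic points per measurement
    rng = np.random.default_rng(0)     # fixed seed for reproducibility

    augmented_rows = []
    for _, row in df.iterrows():
        mu = row["MeanRating"]
        sigma = mu / np.sqrt(row["Votes"])       # sigma = mu / sqrt(nu)
        samples = rng.normal(loc=mu, scale=sigma, size=N)
        augmented_rows.append(pd.DataFrame({"MeanRating": samples}))

    augmented = pd.concat(augmented_rows, ignore_index=True)
    # Feed `augmented` (together with the matching copies of your predictor
    # columns) to any learner, with every row carrying equal weight.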

Edit: On the value of $\sigma$

The $\sigma$ value has its roots in Poisson statistics, where the variance of a count is equal to the count itself, so a count of $\nu$ votes has a standard deviation of $\sqrt{\nu}$.
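
Spelled out, this is just the formula from step 1 of the recipe, rearranged:

$$\sigma \;=\; \mu\,\frac{\sqrt{\nu}}{\nu} \;=\; \frac{\mu}{\sqrt{\nu}},$$

so the spread shrinks with the square root of the number of votes: a single vote leaves $\sigma = \mu$, while 100 votes already reduce it to $\mu/10$.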

Edit: On the "boundedness" of your distribution

If you do as the recipe above suggests, your augmented data will include mean ratings that are not valid values, i.e. less than 1 or greater than 5 (or 10). You could keep those values and merely truncate the predictions of your models.
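
For example (a minimal sketch, assuming NumPy, a 1-to-5 rating scale and some hypothetical raw model predictions):

    import numpy as np

    # Hypothetical raw predictions from a model trained on the augmented data.
    raw_predictions = np.array([0.7, 3.2, 5.4])

    # Truncate anything outside the valid rating scale (assumed to be 1-5).
    clipped = np.clip(raw_predictions, 1.0, 5.0)
    print(clipped)  # [1.  3.2 5. ]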

There is an alternative described here, which would improve the mathematical foundation of your analysis. However, I doubt it would matter significantly in practice.

Ytsen de Boer
  • You should read this about the beta distribution (scroll down to the part on user ratings): http://stats.stackexchange.com/questions/47771/what-is-the-intuition-behind-beta-distribution – Ytsen de Boer Sep 08 '16 at 10:12