Suppose the outcome I want to predict is a vector of mean values, each of which was calculated from a sample of a different size. More specifically, I am predicting mean ratings for products, and each rating was computed from a different number of people who voted for that particular product. So the data might look like this:
MeanRating  Votes
4.5         1
2.0         10
5.0         1000
Obviously, the quality of information differs across these ratings: I can be very confident that MeanRating = 5.0 is accurate, since it is based on 1000 votes, but I am not convinced at all by the single vote that MeanRating = 4.5 is based on.
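To make this concrete: assuming the per-vote variance $\sigma^2$ is roughly similar across products, the standard error of each mean is $\sigma/\sqrt{n}$, where $n$ is the number of votes, so the 1000-vote rating is about $\sqrt{1000} \approx 32$ times more precise than the single-vote one.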
To predict MeanRating, I want to test a number of machine learning methods that differ in their underlying math (e.g., neural nets vs. gradient boosting machines vs. random forests). I know that in some cases (e.g., tree-based methods) one could pass Votes as observation weights (see the sketch below). However, as far as I know, such an approach is not universally applicable to every method I might want to test. Hence my question: how can I take the quality of information about my outcome variable (e.g., MeanRating in the above example) into account, regardless of the method my model is based on?
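For reference, this is what I mean by passing Votes as weights in the tree-based case. It is only a minimal sketch using scikit-learn's `sample_weight` argument; the feature matrix `X` here is made up purely for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))              # hypothetical product features
mean_rating = np.array([4.5, 2.0, 5.0])  # outcome: mean ratings
votes = np.array([1, 10, 1000])          # number of votes behind each mean

model = RandomForestRegressor(random_state=0)
# sample_weight makes each product count in proportion to its vote count,
# so the 1000-vote rating dominates the fit
model.fit(X, mean_rating, sample_weight=votes)
```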