I have a large number of products and many reviews for them, each including a rating of the product.

My problem is that not every product has the same number of reviews. For example, one product can have 125 reviews with an average rating of 4.2/5, whereas another may have a single review of 5/5.

Is there any model or algorithm that can sort my products from best to worst while taking the number of reviews into account?

It may be a simple question, but I've never seen it posed this way. I suppose it is a common problem, but I could not find the right terms to search for solutions.

Example (asked by RayVelcoro)

Let's say I have 2 movies:

  • A: rated 4.3/5, 152 reviews
  • B: rated 5/5, 2 reviews

Since B has been rated only twice, I cannot be sure that B is better than A: once B has more than 50 reviews, it may well be rated only 3.8/5.

How can I take that into account in a search that sorts results by best movie?

haverchuck

2 Answers

I'll start with a very simple suggestion: add five $3$-star reviews to each product. In your first example (125 reviews averaging $4.2$ versus a single $5/5$ review) we would then be comparing an average of about $4.15$ to an average of about $3.33$. This starts every product at $3$ and requires some data to move it away from $3$.
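To make the arithmetic concrete, here is a minimal Python sketch of the pseudo-count average, applied to the two products from the question (125 reviews averaging 4.2, and a single 5/5 review). The function name `smoothed_average` and its parameters are made up for illustration.

```python
def smoothed_average(mean_rating, n_reviews, pseudo_count=5, pseudo_rating=3.0):
    """Average rating after adding `pseudo_count` fictional reviews of `pseudo_rating`."""
    total = mean_rating * n_reviews + pseudo_rating * pseudo_count
    return total / (n_reviews + pseudo_count)

# (observed mean rating, number of reviews) for the two products in the question
products = {
    "125 reviews at 4.2": (4.2, 125),
    "1 review at 5.0": (5.0, 1),
}

scores = {name: smoothed_average(m, n) for name, (m, n) in products.items()}

# Sort best first using the smoothed score instead of the raw mean
ranking = sorted(scores, key=scores.get, reverse=True)
```

With these inputs the well-reviewed product ranks first, even though its raw mean is lower.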

While this may seem like a silly idea with no mathematical justification, it actually has one. The idea of pseudo-counts is derived from Bayesian models whose posterior means can be computed simply by adding a few fictional data points. I have not formulated a specific prior here because I know very little about distributions on ordinal data such as yours, but the idea still works.
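As one illustration (an assumed model, not one specified in this answer): treat a product's $k$ ratings $x_1, \dots, x_k$ as normal with unknown mean $\theta$ and known variance $\sigma^2$, and put the prior $\theta \sim N(\mu, \sigma^2 / n)$ on the mean. The posterior mean is then exactly the average you get after adding $n$ fictional reviews with mean $\mu$:

$$
E[\theta \mid x_1, \dots, x_k] = \frac{n\mu + \sum_{i=1}^{k} x_i}{n + k}
$$

With $n = 5$, $\mu = 3$ and a single 5-star review ($k = 1$), this gives $(15 + 5)/6 \approx 3.33$.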

This is a form of regularization, or shrinkage. The same effect can be achieved in many ways and in models of various levels of complexity.

If you like the simplicity of pseudo-counts but want to be a bit more rigorous, you can try adding $n$ reviews with an average of $\mu$ to each product, and tune $n$ and $\mu$ using cross-validation to maximize some measure of prediction accuracy. I would start with $\mu$ as the average rating in your whole data set and play with values of $n$.
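A rough sketch of that tuning step in Python, under some simplifying assumptions: it uses a single train/held-out split rather than full cross-validation, fixes $\mu$ at the global mean of the training reviews, and scores each candidate $n$ by squared error on the held-out ratings. The function names and the toy ratings are invented for illustration.

```python
def global_mean(ratings_by_product):
    """Sum of all ratings divided by the number of all reviews."""
    all_ratings = [r for rs in ratings_by_product.values() for r in rs]
    return sum(all_ratings) / len(all_ratings)

def shrunk_mean(ratings, n, mu):
    """Mean of `ratings` after adding n fictional reviews with average mu."""
    return (n * mu + sum(ratings)) / (n + len(ratings))

def holdout_error(train, test, n, mu):
    """Mean squared error of shrunk training means against held-out ratings."""
    errors = [
        (shrunk_mean(train[product], n, mu) - rating) ** 2
        for product, held_out in test.items()
        for rating in held_out
    ]
    return sum(errors) / len(errors)

def tune_n(train, test, candidates):
    """Pick the pseudo-count n with the lowest held-out error."""
    mu = global_mean(train)
    return min(candidates, key=lambda n: holdout_error(train, test, n, mu))

# Toy data, made up purely to show the call pattern
train = {"A": [4, 5, 4, 4, 5, 4], "B": [5], "C": [1, 2]}
test = {"A": [4, 5], "B": [3], "C": [3]}
candidates = [0, 1, 2, 5, 10]
best_n = tune_n(train, test, candidates)
```

In practice you would repeat this over several splits and could search over $\mu$ as well.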

jlimahaverford
  • Thanks! That seems to be a good solution. I just have one more question: do you take μ as the average rating over the data set of reviews, or over the data set of products? I mean, do you weight each product's rating by its number of reviews, or does every product rating have the same weight in μ? – haverchuck Sep 22 '15 at 21:19
  • (sum of all review ratings / number of all reviews) was what I was suggesting. – jlimahaverford Sep 22 '15 at 21:20

This looks like a Bayesian Hierarchical Model waiting to happen.

Give each movie/product an unknown parameter that links to the data through a multinomial or ordered logit, or possibly even just a simple normal. Then put a prior on the parameters and fit the model. Movies/products with a large number of ratings will have a parameter that is mainly determined by the data (pulled a little way towards the mean by the prior). Movies/products with very few ratings will be highly influenced by the overall mean and only a little by their own ratings (how much depends on the hyperprior). So a product/movie with 2 ratings of 5 out of 5 will be pulled towards the mean (but will still rank above one with 2 ratings of 1 out of 5 or 3 out of 5). Sort on the mean (or median, mode, etc.) of the posteriors for the movie/product parameters and that should give you what you are looking for.
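To show the shrinkage behaviour without fitting a full model, here is a simplified stand-in in Python: the "simple normal" case with the overall mean `mu`, the between-product variance `tau2`, and the within-product variance `sigma2` all fixed by hand. A real hierarchical fit would estimate these from the data (for example by MCMC); the values and names below are assumptions chosen only for illustration.

```python
def posterior_mean(mean_rating, n_reviews, mu, tau2, sigma2):
    """Posterior mean of a product's parameter under a normal-normal model.

    Likelihood: each rating ~ Normal(theta, sigma2)
    Prior:      theta ~ Normal(mu, tau2)
    """
    # Share of the estimate that comes from the product's own data
    weight = tau2 / (tau2 + sigma2 / n_reviews)
    return weight * mean_rating + (1 - weight) * mu

# Assumed hyperparameters, not estimated from any real data set
MU, TAU2, SIGMA2 = 3.8, 0.25, 1.0

movie_a = posterior_mean(4.3, 152, MU, TAU2, SIGMA2)   # many ratings: stays near 4.3
movie_b = posterior_mean(5.0, 2, MU, TAU2, SIGMA2)     # two 5s: pulled towards MU
two_threes = posterior_mean(3.0, 2, MU, TAU2, SIGMA2)
two_ones = posterior_mean(1.0, 2, MU, TAU2, SIGMA2)

# Sort on the posterior means, best first
ranking = sorted(
    {"A": movie_a, "B": movie_b}.items(), key=lambda kv: kv[1], reverse=True
)
```

Under these assumed values, the movie with two 5/5 ratings is pulled well below 5 but stays above products with two 3s or two 1s, while the movie with 152 ratings barely moves.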

Greg Snow