I am trying to predict users' ratings of movies. The ratings are continuous, ranging from 1 to 5. I have been using xgboost with the objective function `reg:squarederror`, i.e. regression with squared loss.

As you can see, most of the ratings are concentrated around 4, and many predictions are greater than 5! What are the options for informing xgboost about this bound? What kind of cost function could I use in this scenario? Alternatively, I could bin my values into 10 bins and do a multi-class prediction, but I still wonder whether there is a more statistically sound solution.

enter image description here
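
To make the "what kind of cost function" part concrete, here is a minimal sketch of one possible direction, not a tested solution: squash the raw score through a scaled sigmoid so that every prediction lies between 1 and 5, and take the squared error on that bounded prediction. The names `to_rating` and `bounded_squared_error` are hypothetical helpers, and the Hessian is a Gauss-Newton approximation (an assumption made here to keep it positive), not the exact second derivative.

```python
import math

LOW, HIGH = 1.0, 5.0  # rating bounds taken from the question


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def to_rating(raw_score):
    # Map an unbounded raw score into the interval [LOW, HIGH].
    return LOW + (HIGH - LOW) * sigmoid(raw_score)


def bounded_squared_error(raw_scores, labels, min_hess=1e-6):
    """Gradient and Hessian of 0.5 * (to_rating(z) - y)**2 w.r.t. the raw score z.

    The Hessian is the Gauss-Newton approximation (squared slope of the
    link), floored at min_hess so that it stays strictly positive.
    """
    grads, hess = [], []
    for z, y in zip(raw_scores, labels):
        s = sigmoid(z)
        slope = (HIGH - LOW) * s * (1.0 - s)  # d to_rating / d z
        residual = to_rating(z) - y
        grads.append(residual * slope)
        hess.append(max(slope * slope, min_hess))
    return grads, hess
```

If I understand the xgboost API correctly, a function of the form `(preds, dtrain) -> (grad, hess)` can be passed to `xgb.train` through its `obj` argument, so the gradient and Hessian above could be wrapped that way, with `to_rating` applied to the raw predictions afterwards. A simpler variant of the same idea might be to rescale the ratings to [0, 1] with `(y - 1) / 4`, train with the built-in `reg:logistic` objective, and map the predictions back with `1 + 4 * p`.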

Areza
  • You could try ordinal regression. Search this site, or start with https://stats.stackexchange.com/questions/281619/linear-regression-or-ordinal-logistic-regression-to-predict-wine-rating-from-0 – kjetil b halvorsen May 12 '20 at 16:01
  • @kjetil thanks for the comment, it is very relevant. But in this case the ratings are continuous values, not categorical, and ordinal regression falls under categorical methods, so I still think this is a different case :) – Areza May 12 '20 at 17:28
  • You can also use ordinal regression with a continuous response. It's covered in chapter 15 of Frank Harrell's *Regression Modeling Strategies*, and his function `orm` in the R package `rms` fits such models. – kjetil b halvorsen May 12 '20 at 19:11
