
For a multiple linear regression model, I have done two things to preprocess my data:

  1. I have scaled continuous variables with StandardScaler
  2. I have encoded categorical variables with OneHotEncoder

My dependent variable is the rating (a float number) that varies from 1.0 to 10.0. Do I need to perform any encoding on that variable? How can it influence my model?

I use scikit-learn for everything listed above.

Mihai Chelaru
Daria

1 Answer


You don't need to scale or encode the target variable. Scaling is applied to the covariates to make them unit-less, so that they contribute to the MSE on comparable scales and are not penalized differently under regularization. One-hot encoding, on the other hand, introduces collinearity: the dummy columns of a categorical variable always sum to 1, so they are linearly dependent on each other (and, if the model has an intercept, on the constant column). This post might be a good reference for dealing with it.
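A minimal sketch of that setup in scikit-learn, under some assumptions: the column names (`runtime`, `budget`, `genre`) and the toy data are hypothetical, and dropping one dummy level with `drop="first"` is just one common way to avoid the collinearity described above, not necessarily the approach in the referenced post. The target is passed to the model unscaled.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data standing in for the real movie features.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "runtime": rng.normal(110, 20, n),
    "budget": rng.normal(50, 15, n),
    "genre": rng.choice(["drama", "comedy", "action"], n),
})
genre_effect = df["genre"].map({"drama": 0.5, "comedy": 0.0, "action": -0.5})
rating = 6.0 + 0.01 * (df["runtime"] - 110) + genre_effect + rng.normal(0, 0.5, n)
rating = rating.clip(1.0, 10.0)  # target stays in its original 1-10 units

preprocess = ColumnTransformer([
    # Scale only the continuous covariates.
    ("num", StandardScaler(), ["runtime", "budget"]),
    # Drop one level per categorical so the dummies no longer sum to 1.
    ("cat", OneHotEncoder(drop="first"), ["genre"]),
])

model = Pipeline([("prep", preprocess), ("reg", LinearRegression())])
model.fit(df, rating)  # no scaling or encoding applied to the target

# 2 scaled columns + (3 genres - 1 dropped level) = 4 design columns
n_features = model.named_steps["prep"].transform(df).shape[1]
predictions = model.predict(df)
```

Because the target is left untouched, the predictions and any error metric computed from them are directly in rating points.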

gunes
  • I have read it before and set the intercept to false. I am wondering how the RMSE can be less than the MSE? – Daria Jun 07 '19 at 14:24
  • RMSE is the square root of MSE, so it is smaller than MSE whenever MSE is larger than $1$. So, I didn't get your point. – gunes Jun 07 '19 at 14:27
  • OMG, now I got it. My target variable is an IMDb rating and it varies from 1 to 10. How should I interpret an RMSE of 1.09 in this case? – Daria Jun 07 '19 at 14:31
  • One possible interpretation would be that you estimate the IMDb rating of a movie with an error of $\approx \pm 1$. This interpretation would be better suited to MAE, but it makes some sense for RMSE, too. – gunes Jun 07 '19 at 14:43
  • Thank you. What's the more appropriate measure for a multiple regression model: MSE, MAE or RMSE? – Daria Jun 07 '19 at 14:45
  • When training an LR model, the packages I'm familiar with always use MSE, which is equivalent to using RMSE, since their argmin is the same. However, when it comes to evaluating your model, it's generally up to you and how you make sense of the resulting metrics. MAE and RMSE make more sense than MSE since they're in the target's units. But, depending on the case, you might also want to look at the standard deviation of the errors or their histogram. – gunes Jun 07 '19 at 14:56
  • Thank you, got your point. Do you know how I can check the assumptions of linear regression in Python? – Daria Jun 07 '19 at 17:54
  • Not specific to python, http://people.duke.edu/~rnau/testing.htm is useful. – gunes Jun 08 '19 at 02:57
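
The metric discussion in the comments above can be illustrated with a small sketch. The ratings and predictions below are made-up numbers, not results from the question's model; the point is only how MSE, RMSE and MAE relate to each other.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up true ratings and predictions (illustrative only).
y_true = np.array([7.0, 5.0, 8.0, 6.0])
y_pred = np.array([6.0, 7.0, 8.0, 3.0])

mse = mean_squared_error(y_true, y_pred)   # in squared rating points
rmse = np.sqrt(mse)                        # back in rating points
mae = mean_absolute_error(y_true, y_pred)  # in rating points

# With errors mostly larger than one point, MSE > 1 and so RMSE < MSE.
print(f"MSE={mse:.3f}  RMSE={rmse:.3f}  MAE={mae:.3f}")

# With small errors (MSE < 1) the order flips: RMSE > MSE.
small_pred = y_true + 0.5
small_mse = mean_squared_error(y_true, small_pred)
small_rmse = np.sqrt(small_mse)
print(f"small errors: MSE={small_mse:.3f}  RMSE={small_rmse:.3f}")
```

RMSE is never smaller than MAE on the same errors, and the gap widens when a few errors are much larger than the rest, which is one reason to look at the error histogram as well.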