Very low Rsquared of Lasso on Test sample. But very low MSE too?

Question

I am not sure what is going wrong here. I did the following:

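# Running Lasso:
import numpy as np
from sklearn import linear_model
# normalize=False was the default; the argument has since been removed from scikit-learn
lasso=linear_model.LassoCV(max_iter=2000,cv=10)
lasso.fit(tourism_train_X,tourism_train_Y)
lasso.alpha_
# mse_path_ has shape (n_alphas, n_folds): average over the folds, then sort the mean CV errors
scores=np.sort(np.mean(lasso.mse_path_,axis=1))
lasso.coef_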

Before this, here is how I split the dataset and the preprocessing involved.

import numpy as np
# train_test_split now lives in sklearn.model_selection (sklearn.cross_validation was removed)
from sklearn.model_selection import train_test_split
tourism_train_X,tourism_test_X,tourism_train_Y,tourism_test_Y=train_test_split(tourism_train, tourism_Y, test_size=0.20, random_state=42)

Encoding the categorical variable (only one):

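# Imputing missing values and encoding the categorical variable
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
# SimpleImputer replaces the removed preprocessing.Imputer; it imputes column-wise and treats NaN as missing by default
tourism_train_X=SimpleImputer(strategy='mean').fit_transform(tourism_train_X)
tourism_test_X=SimpleImputer(strategy='mean').fit_transform(tourism_test_X)

# ColumnTransformer replaces the removed categorical_features=[1] argument; sparse_threshold=0 keeps the output dense
encoder=ColumnTransformer([('onehot',preprocessing.OneHotEncoder(),[1])],remainder='passthrough',sparse_threshold=0)
tourism_train_X=encoder.fit_transform(tourism_train_X)
tourism_test_X=encoder.fit_transform(tourism_test_X)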

Standardising the variables both in train and test set:

# Standardising the variables
tourism_train_X=preprocessing.scale(tourism_train_X)
tourism_test_X=preprocessing.scale(tourism_test_X)

As you can see, I am doing 10-fold cross-validation to choose the best Lasso penalty (alpha).

Now, when I check the model on my test set, I get this:

# Test error of Lasso: 
from sklearn.metrics import mean_squared_error
mse_test_tourism=mean_squared_error(tourism_test_Y,lasso.predict(tourism_test_X))
# R^2 of Test sample
rsquared_test_tourism=lasso.score(tourism_test_X,tourism_test_Y)
print("The MSE on Test data is :", mse_test_tourism)
print("The R^2 on Test data is:", rsquared_test_tourism)

It gives this:

The MSE is very low, but the R squared is very low too.

('The MSE on Test data is :', 0.0046515559443549301)
('The R^2 on Test data is:', 0.03861779182108882)

What does this mean? Judging by the R squared, the model explains almost nothing, yet its MSE on the test dataset is very low.

Any answers? As a note, my target variable (Y) is log-transformed.

score 4 · Answer 1 · edited Apr 13 '17 at 12:44

Essentially, you're comparing the wrong things.

The Mean Squared Error tells you the average squared error of your predictions. It's sensitive to the units you're predicting (it is expressed in squared units of Y), so since you're predicting a log-transformed variable, it's not surprising that the mean squared error is small. To compare it to R squared, you need to look at the Mean Squared Error as a proportion of the variance in Y (which is essentially the formula for R squared, turned around a bit).
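For reference, the relationship can be written out as follows. This is the standard definition of the coefficient of determination, with the variance of Y taken over the same sample and divided by n (not n - 1); the symbols are notation introduced here, not taken from the question.

```latex
R^2 \;=\; 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
    \;=\; 1 - \frac{\mathrm{MSE}}{\mathrm{Var}(y)},
\qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2,
\quad
\mathrm{Var}(y) = \frac{1}{n}\sum_{i=1}^{n} (y_i - \bar{y})^2 .
```

So a small MSE only yields a high R squared when it is small relative to Var(y).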

R squared is the proportion of variance explained - so it takes into account the variance of your dependent variable, as well as the average error, to get the percentage of variance that isn't error (essentially).

What is probably happening in your data is that Y has very low variance, so your mean squared error, although small in absolute terms, is still large in comparison to that variance, and your model doesn't explain a lot (as shown by the low R squared).
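A minimal sketch of that point, assuming the usual identity R^2 = 1 - MSE / Var(Y). It first backs out the test-set variance implied by the two figures reported in the question, then checks the identity on made-up data (the synthetic arrays and their parameters are illustrative assumptions, not the asker's data).

```python
import numpy as np

# Figures reported in the question (test set)
mse_reported = 0.0046515559443549301
r2_reported = 0.03861779182108882

# R^2 = 1 - MSE / Var(y)  =>  Var(y) = MSE / (1 - R^2)
implied_var_y = mse_reported / (1.0 - r2_reported)
print("Implied variance of test Y:", implied_var_y)
print("Implied std. dev. of test Y:", np.sqrt(implied_var_y))

# Synthetic illustration (made-up data): a low-variance target and
# predictions that barely move away from the mean of the target.
rng = np.random.default_rng(0)
y_true = 5.0 + 0.07 * rng.standard_normal(28)
y_pred = y_true.mean() + 0.01 * rng.standard_normal(28)

mse = float(np.mean((y_true - y_pred) ** 2))

# R^2 from its definition: 1 - SS_res / SS_tot
ss_res = float(np.sum((y_true - y_pred) ** 2))
ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
r2 = 1.0 - ss_res / ss_tot

# The same quantity via MSE and the (population, ddof=0) variance of y
r2_from_mse = 1.0 - mse / float(np.var(y_true))

print("MSE:", mse)
print("R^2:", r2)
print("R^2 via 1 - MSE/Var(y):", r2_from_mse)
```

With the reported numbers, the implied variance of the test Y is about 0.0048, which is barely larger than the MSE itself. That is why the R squared ends up close to zero.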

This post may be helpful to give a more mathematically complete answer: What is the difference between "coefficient of determination" and "mean squared error"?

answered May 05 '15 at 00:27 by Sean Murphy · last edited by Community

But here is the thing: shouldn't LassoCV in scikit-learn give the best model for regression? I mean, it should give me the model with the least error, right? And what would you say about my MSE? It's like 0.04%? That's pretty low – Baktaawar May 05 '15 at 00:38
No, MSE is reported (unless numpy does something very strange) in absolute terms - it isn't a percentage. That's what I mean - R squared is a percentage, MSE is just the average size of the squared error in your prediction (not a percentage). I can't speak to LassoCV, but even if it gives the best model, it will only predict to the extent that the input variables have explanatory power. What is the R squared if you predict your Y variable in the test dataset? – Sean Murphy May 05 '15 at 01:43
It is for the test dataset only. Both the MSE and the R squared are for the test set. The test set doesn't have more than 30 obs. My total obs are 140. I take out 20% for test, then I run 10-fold CV on the remainder. The model obtained from that 10-fold CV is tested on the test dataset (0.20*140 samples). – Baktaawar May 05 '15 at 01:49
To add one more question: I did the train/test split before doing separate preprocessing on the train and test datasets, as you can see above. Is this the right order, or should I first preprocess the whole dataset and then do the train/test split? – Baktaawar May 05 '15 at 01:56
Keep in mind that if your response is measured in inches, MSE is measured in square inches. R squared is a proportion and therefore unitless. Dimensional analysis is sadly neglected in stats classes – shadowtalker May 05 '15 at 19:27
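The point about units in the comments above can be checked numerically. The following is a small sketch on made-up data (the inch/centimetre values and the helper names `mse` and `r_squared` are illustrative assumptions): rescaling Y and the predictions by a constant multiplies the MSE by the square of that constant and leaves R squared unchanged.

```python
import numpy as np

def mse(y_true, y_pred):
    """Mean squared error, in squared units of y."""
    return float(np.mean((y_true - y_pred) ** 2))

def r_squared(y_true, y_pred):
    """Coefficient of determination, 1 - MSE / Var(y); unitless."""
    return 1.0 - mse(y_true, y_pred) / float(np.var(y_true))

# Made-up response measured in inches, with noisy predictions
rng = np.random.default_rng(1)
y_inches = 60.0 + 3.0 * rng.standard_normal(50)
pred_inches = y_inches + rng.standard_normal(50)

# The same response and predictions expressed in centimetres
y_cm = 2.54 * y_inches
pred_cm = 2.54 * pred_inches

# MSE changes by the square of the conversion factor
mse_ratio = mse(y_cm, pred_cm) / mse(y_inches, pred_inches)

print("MSE (square inches):", mse(y_inches, pred_inches))
print("MSE (square cm):", mse(y_cm, pred_cm))
print("Ratio of MSEs:", mse_ratio)
print("R^2 (inches):", r_squared(y_inches, pred_inches))
print("R^2 (cm):", r_squared(y_cm, pred_cm))
```

This is why an MSE cannot be judged as "low" on its own: its size depends on the scale of Y, and a log-transformed Y typically sits on a small scale.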
