I am not sure what is going wrong here. I did the following:
# Running Lasso:
from sklearn import linear_model
lasso=linear_model.LassoCV(max_iter=2000,cv=10,normalize=False)
lasso.fit(tourism_train_X,tourism_train_Y)
lasso.alpha_
scores=np.zeros((100,1))
scores[:,0]=np.mean(lasso.mse_path_,axis=1)
# Sort along axis 0; the default axis=-1 leaves a (100,1) array unchanged
scores=np.sort(scores,axis=0)
lasso.coef_
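To make the cross-validation step concrete, here is a minimal sketch of how the mean CV error per alpha lines up with the alpha that LassoCV picks. It uses a synthetic dataset from `make_regression` as a stand-in (an assumption, not the tourism data), and the names `lasso_demo`, `mean_cv_mse` and `best_idx` are only illustrative.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic stand-in for the tourism data (assumption, not the real dataset).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

lasso_demo = LassoCV(cv=10, max_iter=2000).fit(X, y)

# mse_path_ has shape (n_alphas, n_folds); average over the folds per alpha.
mean_cv_mse = lasso_demo.mse_path_.mean(axis=1)

# alphas_ holds the alphas that were tried, in the same order as mse_path_.
best_idx = int(np.argmin(mean_cv_mse))
print("alpha chosen by LassoCV:", lasso_demo.alpha_)
print("alpha with the lowest mean CV MSE:", lasso_demo.alphas_[best_idx])
print("mean CV MSE at that alpha:", mean_cv_mse[best_idx])
```

Indexing `alphas_` with the argmin keeps each error paired with its alpha, which sorting the errors on their own would lose.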
Before running the Lasso, this is how I split the dataset and pre-processed it.
import numpy as np
from sklearn.cross_validation import train_test_split
tourism_train_X,tourism_test_X,tourism_train_Y,tourism_test_Y=train_test_split(tourism_train, tourism_Y, test_size=0.20, random_state=42)
Imputing missing values and encoding the categorical variable (only one):
# Imputing missing values, then encoding the categorical variable
from sklearn import preprocessing
tourism_train_X=preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0).fit_transform(tourism_train_X)
tourism_test_X=preprocessing.Imputer(missing_values='NaN', strategy='mean', axis=0).fit_transform(tourism_test_X)
tourism_train_X=preprocessing.OneHotEncoder(categorical_features=[1],sparse=False).fit_transform(tourism_train_X)
tourism_test_X=preprocessing.OneHotEncoder(categorical_features=[1],sparse=False).fit_transform(tourism_test_X)
Standardising the variables in both the train and test sets:
# Standardising the variables
tourism_train_X=preprocessing.scale(tourism_train_X)
tourism_test_X=preprocessing.scale(tourism_test_X)
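For comparison, here is a minimal sketch of a variant of this step in which the scaler is fitted on the training split only and then reused on the test split, so both splits share the same mean and standard deviation. This is a swapped-in technique, not what the code above does. The arrays are random placeholders (an assumption, not the tourism data) and the `demo_*` names are hypothetical.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(42)
# Random placeholder features (assumption, not the real dataset).
demo_train_X = rng.normal(loc=5.0, scale=2.0, size=(80, 3))
demo_test_X = rng.normal(loc=5.0, scale=2.0, size=(20, 3))

# Learn the column means and standard deviations from the training split only.
scaler = StandardScaler().fit(demo_train_X)

# Apply the same training statistics to both splits.
demo_train_scaled = scaler.transform(demo_train_X)
demo_test_scaled = scaler.transform(demo_test_X)

print("train column means after scaling:", demo_train_scaled.mean(axis=0))
print("test column means after scaling:", demo_test_scaled.mean(axis=0))
```

With this variant the test columns are generally not exactly zero-mean, because they are shifted and scaled by the training statistics rather than their own.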
As you can see, I am using 10-fold cross-validation to choose the best Lasso penalty (alpha).
Now, when I check the model on my test set, I get this.
# Test error of Lasso:
from sklearn.metrics import mean_squared_error
mse_test_tourism=mean_squared_error(tourism_test_Y,lasso.predict(tourism_test_X))
# R^2 of Test sample
rsquared_test_tourism=lasso.score(tourism_test_X,tourism_test_Y)
print("The MSE on Test data is :", mse_test_tourism)
print("The R^2 on Test data is:", rsquared_test_tourism)
The MSE is very low, but the R^2 is very low too.
It gives this:
('The MSE on Test data is :', 0.0046515559443549301)
('The R^2 on Test data is:', 0.03861779182108882)
What does this mean? Judging by the R^2, the model explains almost nothing, yet its MSE on the test set is very low.
Any answers? As a note, my target variable (Y) is log-transformed.
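For reference, here is a minimal sketch of how the two metrics relate. The R^2 returned by `score` equals 1 - MSE / Var(y), so MSE is only "low" or "high" relative to the variance of the target. The numbers below are synthetic (an assumption, not the tourism data): a hypothetical log-scale target with a very small spread and predictions that barely track it.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

rng = np.random.RandomState(0)

# Hypothetical log-scale target with a very small spread (assumption).
y_true = 2.0 + 0.07 * rng.randn(500)

# Hypothetical predictions that stay close to 2.0 and barely follow y_true.
y_pred = 2.0 + 0.01 * (y_true - 2.0)

mse = mean_squared_error(y_true, y_pred)
var_y = np.var(y_true)  # population variance, matching the R^2 definition
r2 = r2_score(y_true, y_pred)

print("MSE:", mse)
print("Var(y):", var_y)
print("R^2:", r2)
print("1 - MSE / Var(y):", 1.0 - mse / var_y)
```

Applying the same identity to the figures above, an MSE of about 0.00465 with an R^2 of about 0.0386 would imply a test-target variance of roughly 0.0048, so the MSE is small in absolute terms but nearly as large as the variance of Y itself.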