I have created a regression model and was trying to check whether it underfits or overfits by running cross-validation on it in Python (scikit-learn). I fit a higher-degree relation between $x$ and $y$ on this data set.
Below is my code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import math
from sklearn import linear_model, metrics
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

data = pd.read_csv("location")
x = data.temp.values
y = data.cnt.values
x = x.reshape(len(x), 1)
y = y.reshape(len(y), 1)

# Manual train/test split
train_x = x[:482]
test_x = x[482:]
train_y = y[:482]
test_y = y[482:]

regr = linear_model.LinearRegression()
clf = regr.fit(train_x**4, train_y)  # fit y against x^4
scores = cross_val_score(clf, x, y, cv=10)  # 10-fold CV; default scoring for a regressor is R^2

print('Coefficients: \n', regr.coef_)
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(test_x) - test_y) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(test_x, test_y))
print(scores)
Below is the output:
Residual sum of squares: 1906498.12
Variance score: 0.26
Scores (output of CV): [-14.82381293 -0.29423447 -13.56067979 -1.6288903 -0.31632439
0.53459687 -1.34069996 -1.61042692 -4.03220519 -0.24332097]
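For reference, here is a minimal, self-contained sketch of how `cross_val_score` behaves with the default scorer. It uses synthetic data (hypothetical, standing in for the original CSV), fits a linear model against $x^4$, and summarizes the ten per-fold scores with their mean and standard deviation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the original data set (hypothetical).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200).reshape(-1, 1)
y = 3 * x**4 + rng.normal(scale=0.1, size=(200, 1))

regr = LinearRegression()
# 10-fold CV. The default scoring for a regressor is R^2, which is
# unbounded below, so individual fold scores can be negative.
scores = cross_val_score(regr, x**4, y.ravel(), cv=10)

print(scores)                        # one R^2 value per fold
print(scores.mean(), scores.std())   # summary across folds
```

Note that each entry in `scores` is the score computed on one held-out fold, which is why a single run yields ten values rather than one.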
Questions:
- What do these different values mean?
- In all the code examples online, everyone takes the mean of these scores. Why?
- Is this the right approach to validating whether the model is appropriate or not?