I have created a regression model and was trying to check whether it underfits or overfits by running cross-validation on it in Python (scikit-learn). I fit a higher-degree relation between $x$ and $y$ on this data set.
Below is my code:
import matplotlib.pyplot as plt
import seaborn as sns
import pandas as pd
import numpy as np
import math
from sklearn import linear_model, metrics
from sklearn.model_selection import cross_val_score  # replaces the removed sklearn.cross_validation module

data = pd.read_csv("location")
x = data.temp.values
y = data.cnt.values
x = x.reshape(len(x), 1)
y = y.reshape(len(y), 1)

# Manual train/test split
train_x = x[:482]
test_x = x[482:]
train_y = y[:482]
test_y = y[482:]

regr = linear_model.LinearRegression()
clf = regr.fit(train_x**4, train_y)  # fit y against x^4
scores = cross_val_score(clf, x, y, cv=10)  # 10-fold CV; default scoring for a regressor is R^2

print('Coefficients: \n', regr.coef_)
print("Residual sum of squares: %.2f"
      % np.mean((regr.predict(test_x) - test_y) ** 2))
# Explained variance score: 1 is perfect prediction
print('Variance score: %.2f' % regr.score(test_x, test_y))
print(scores)
Below is the output:
Residual sum of squares: 1906498.12
Variance score: 0.26
Scores (output of CV): [-14.82381293 -0.29423447 -13.56067979 -1.6288903 -0.31632439
0.53459687 -1.34069996 -1.61042692 -4.03220519 -0.24332097]
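For reference, here is a minimal, self-contained sketch of how `cross_val_score` behaves with the default scorer. It uses synthetic data (hypothetical, standing in for the original CSV), fits a linear model against $x^4$, and summarizes the ten per-fold scores with their mean and standard deviation:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the original data set (hypothetical).
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200).reshape(-1, 1)
y = 3 * x**4 + rng.normal(scale=0.1, size=(200, 1))

regr = LinearRegression()
# 10-fold CV. The default scoring for a regressor is R^2, which is
# unbounded below, so individual fold scores can be negative.
scores = cross_val_score(regr, x**4, y.ravel(), cv=10)

print(scores)                        # one R^2 value per fold
print(scores.mean(), scores.std())   # summary across folds
```

Note that each entry in `scores` is the score computed on one held-out fold, which is why a single run yields ten values rather than one.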
Questions:
- What do these different values mean?
- In all the code examples online, everyone takes the mean of these scores. Why?
- Is this the right approach to validating whether the model is appropriate or not?