
I am trying to do a regression with SVR and I ran into a problem: the regression works fine with random data, but with my data the prediction comes out constant for all three kernels I tried (see the plot below). Here is a piece of my data; maybe the problem is there, but I can't see why.

data.csv

2006,46,97,97,0.04124
2006,47,97,97,0.06957
2006,48,115,97,0.06569
2006,49,137,115,0.05357
2006,50,112,137,0.04132
2006,51,121,112,0.06154
2006,52,130,121,0.02586

And here is the code I'm using.

import pandas as pd
from sklearn.svm import SVR
import matplotlib.pyplot as plt
import numpy as np

# Importing data (the csv file has no header row)
data = pd.read_csv('data.csv', header=None)
data = data.to_numpy()  # as_matrix() is deprecated in newer pandas

#Random data generator
#datar = np.random.random_sample((7,21))
#inputdatar = datar[:,0:4]

inputdata = data[:,0:4]
output1 = data[:,4]

svr_rbf = SVR(kernel='rbf',gamma=1)
svr_rbf.fit(inputdata,output1)
pre = svr_rbf.predict(inputdata)
axis = range(0,data.shape[0])

plt.scatter(axis, output1, color='black', label='Data')
plt.plot(axis, pre, color='red', label='Regression')
plt.show()

[Plot: the data points with the SVR prediction shown as a flat, constant line]

I think it may be a hyperparameter tuning problem, but I'm not sure whether the data itself could be causing it as well. Any ideas?


2 Answers


I believe the problem is that your data (in particular, the target variable) isn't scaled.

The SVR implementation in scikit-learn has a parameter, epsilon, that controls the loss function. Quoting from the docs, "It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value."

The default value of epsilon is 0.1. All of your target values lie within 0.1 of a single constant value, so a constant prediction already sits inside the epsilon-tube for every point and the loss is zero.
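As a quick sanity check (a minimal sketch using only the rows shown in the question), you can verify that every target lies well inside the default epsilon-tube around a single constant prediction:

import numpy as np

# Target values from the data sample in the question
y = np.array([0.04124, 0.06957, 0.06569, 0.05357, 0.04132, 0.06154, 0.02586])

# Constant prediction at the mean of the targets
const_pred = y.mean()

# Largest deviation of any target from that constant prediction
print(np.abs(y - const_pred).max())  # ~0.025, well below the default epsilon=0.1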

I'd fix this by scaling (normalizing) your data, or by using a different (smaller) value of epsilon. For instance, with epsilon equal to 0.001, I get a very non-linear curve that fits the data perfectly (probably not what you want either, to be fair).
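Here is a minimal sketch of both fixes (assuming the same inputdata and output1 arrays as in the question):

from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler

# Option 1: shrink epsilon so the tube is narrower than the spread of the targets
svr_small_eps = SVR(kernel='rbf', gamma=1, epsilon=0.001)
svr_small_eps.fit(inputdata, output1)

# Option 2: standardize the target before fitting (StandardScaler needs 2D input)
scaler_y = StandardScaler()
output1_scaled = scaler_y.fit_transform(output1.reshape(-1, 1)).ravel()
svr_scaled = SVR(kernel='rbf', gamma=1)
svr_scaled.fit(inputdata, output1_scaled)

# Predictions from the scaled model have to be mapped back to the original units
pre_scaled = scaler_y.inverse_transform(svr_scaled.predict(inputdata).reshape(-1, 1)).ravel()

Scaling keeps the default epsilon meaningful relative to the data, whereas shrinking epsilon directly controls how closely the fit is forced to follow individual points.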

vbox
  • Thanks, @vbox! Do you think normalizing is better than smaller epsilon? – Adelson Araújo Dec 31 '16 at 16:43
  • In general, I'd say that normalizing is a better approach. For instance, SVR uses regularization, and it is well-known that regularization only makes sense when the features have been normalized first. See for instance, [this post](http://stats.stackexchange.com/questions/111017/question-about-standardizing-in-ridge-regression). – vbox Dec 31 '16 at 17:14

That's a matter of scaling your data. I believe you can solve it simply by using a StandardScaler from sklearn. Here is some sample code that I used:

# For SVR we need feature scaling
from sklearn.preprocessing import StandardScaler
sc_x = StandardScaler()
sc_y = StandardScaler()
# Scale x and y with two separate scaler objects
# (StandardScaler expects 2D input, so the 1D target is reshaped)
x = sc_x.fit_transform(x)
y = sc_y.fit_transform(y.reshape(-1, 1)).ravel()
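
To complete the example (a sketch, assuming x holds the four feature columns and y the target from the question), you would then fit the SVR on the scaled arrays and map the predictions back to the original units with the target scaler:

from sklearn.svm import SVR

# Fit on the scaled data
regressor = SVR(kernel='rbf')
regressor.fit(x, y)

# Predict in scaled space, then undo the target scaling
y_pred = sc_y.inverse_transform(regressor.predict(x).reshape(-1, 1)).ravel()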