Do we have to scale new unseen feature data for prediction?

Question

Many machine learning algorithms need the features to be scaled in order to reduce error. This is my code:

# ensure results are repeatable
set.seed(7)
# load the library
library(caret)
# load the dataset
data(iris)
head(iris)
X=scale(iris[,-5])
X=data.frame(X)
head(X)
y=iris[,5]
y=data.frame(y)
head(y)
X=cbind(X,y)
# prepare training scheme
control <- trainControl(method="repeatedcv", number=5, repeats=1)
# train the model
model <- train(y~., data=X, method="svmLinear2", trControl=control, tuneLength=5)
# summarize the model
print(model)
#saving model
save(model, file="model.Rdata")

#loading model
supmod<-load("model.Rdata")

#new data
# Sepal.Length Sepal.Width Petal.Length Petal.Width 
# 4.2             3.2          1.7         0.23  
new<-c(4.2,3.2,1.7,0.23)
pre<-predict(supmod,new)
# don't know how to predict unseen data with this model

I have two questions about the code above: one about scaling the new data, and one about a coding error when passing the new data to the loaded model.

The real iris feature data looks like this

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1          5.1         3.5          1.4         0.2
2          4.9         3.0          1.4         0.2
3          4.7         3.2          1.3         0.2
4          4.6         3.1          1.5         0.2
5          5.0         3.6          1.4         0.2
6          5.4         3.9          1.7         0.4

But before passing it to the SVM algorithm we have to scale the data. I use scale() for this, and the result looks like this:

  Sepal.Length Sepal.Width Petal.Length Petal.Width
1   -0.8976739  1.01560199    -1.335752   -1.311052
2   -1.1392005 -0.13153881    -1.335752   -1.311052
3   -1.3807271  0.32731751    -1.392399   -1.311052
4   -1.5014904  0.09788935    -1.279104   -1.311052
5   -1.0184372  1.24503015    -1.335752   -1.311052
6   -0.5353840  1.93331463    -1.165809   -1.048667

It is this scaled data that we use for training and testing our model. Let's say I have successfully trained the model and want to use it to predict new unseen data (e.g. this one row):

Sepal.Length Sepal.Width Petal.Length Petal.Width 
 4.2             3.2          1.7         0.23

Do I need to scale this new data, or can I pass it directly to my model?
The next question is related to a coding error.

predict(supmod,new) returns this error

Error in UseMethod("predict") : no applicable method for 'predict' applied to an object of class "character"

Found this link http://stats.stackexchange.com/questions/89172/how-to-scale-new-observations-for-making-predictions-when-the-model-was-fitted-w which solves scaling problem — Eka, Apr 13 '16 at 03:51

bask0 · Answer 1 · 2016-04-04T05:14:11.893

1) You should scale the new data as well. You can scale all the data, training and new data together, if possible. Or you can store the scaling parameters and apply them later to the new data. If you have data d with, let's say, mean m and standard deviation s, you scale the data by (d-m)/s. Just apply this function to the new data as well, using the same mean and sd.
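As a minimal sketch of the "store the mean and sd, then reuse them" idea in base R: scale() records the parameters it used in the "scaled:center" and "scaled:scale" attributes of its result, so they can be read back and applied to a new row. The variable names X_train, train_mean, train_sd and new_scaled are only illustrative.

```r
# Scale the training features; scale() keeps the mean and sd it used as attributes
X_train    <- scale(iris[, -5])
train_mean <- attr(X_train, "scaled:center")
train_sd   <- attr(X_train, "scaled:scale")

# New, unseen observation in the original (unscaled) units
new <- data.frame(Sepal.Length = 4.2, Sepal.Width = 3.2,
                  Petal.Length = 1.7, Petal.Width = 0.23)

# Apply (d - m) / s column by column, using the training mean and sd
new_scaled <- as.data.frame(scale(new, center = train_mean, scale = train_sd))
print(new_scaled)
```

The resulting new_scaled has the same column names as the training features, so it can be passed to predict() for a model that was trained on the scaled data.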

2) You can't assign the result of load() to a variable and get the model that way.

#loading model
supmod<-load("model.Rdata")

load() returns the names of the objects it restored, so the resulting variable only contains the string "model".

Try this:

load("model.Rdata")

This restores the model under the name it was saved with, so the variable is called "model".
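A small sketch of the difference, using a throwaway lm() fit as a stand-in for the caret model and temporary files (both are assumptions made for illustration). It also shows saveRDS()/readRDS(), the base R pair that stores a single object and lets you choose the variable name when reading it back.

```r
# Stand-in object; a caret model would be saved and restored the same way
model <- lm(Sepal.Length ~ Sepal.Width, data = iris)

rdata_file <- tempfile(fileext = ".Rdata")
save(model, file = rdata_file)
rm(model)

# load() restores `model` as a side effect and returns the restored names
loaded_names <- load(rdata_file)
print(loaded_names)   # a character vector of object names
print(class(model))   # the restored object itself

# saveRDS()/readRDS() handle one object and allow assignment on reading
rds_file <- tempfile(fileext = ".rds")
saveRDS(model, rds_file)
supmod <- readRDS(rds_file)
print(class(supmod))
```

With readRDS() the pattern supmod <- readRDS(...) from the question works as intended, because the function returns the object rather than its name.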

3) Further, you have to pass a data.frame (with the same column names as the training dataset) to predict:

new <- data.frame(Sepal.Length=4.2, Sepal.Width=3.2, Petal.Length=1.7, Petal.Width=0.23)

pre<-predict(model,new)

You should not scale training and testing data together. This is what is known as a data leakage. You should scale your training data and use the mean and sd to scale all future unseen data. This is mentioned but thought I should clarify for any future readers, such as myself. — Sam, Feb 20 '18 at 14:41

score 0 · Answer 2 · answered Apr 24 '16 at 11:32

You need to scale the new data as well. Make sure it is done with the sd and mean from the training data.

But since you are using caret, you can specify the preprocessing in the call to train() through its preProcess argument, like the code below. If you use this option, train on the unscaled data and you do not need to scale your new data yourself. The predict function in caret will take care of this. The train object you created will contain the necessary information to scale when predicting with new data.

control <- trainControl(method="repeatedcv",
                        number=5,
                        repeats=1)
model <- train(Species~., data=iris,
               method="svmLinear2",
               preProcess=c("center", "scale"),
               trControl=control, tuneLength=5)
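An end-to-end sketch of this approach, assuming caret and the e1071 package (which caret uses for "svmLinear2") are installed. To my knowledge the preprocessing is requested through the preProcess argument of train() rather than through trainControl(), so the sketch passes it there. The model is trained on the raw iris data and the new row is passed to predict() in its original units.

```r
library(caret)
set.seed(7)

control <- trainControl(method = "repeatedcv", number = 5, repeats = 1)

# Train on the unscaled data; caret centers and scales inside train()
model <- train(Species ~ ., data = iris,
               method = "svmLinear2",
               preProcess = c("center", "scale"),
               trControl = control,
               tuneLength = 5)

# New observation in the original units, with no manual scaling
new <- data.frame(Sepal.Length = 4.2, Sepal.Width = 3.2,
                  Petal.Length = 1.7, Petal.Width = 0.23)

# predict() applies the stored centering and scaling before classifying
pre <- predict(model, newdata = new)
print(pre)
```

Because the centering and scaling parameters are estimated inside each resampling fold, this also avoids computing them on the held-out data during cross-validation.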
