
In this example, pre-processing is used while training a neural network:

nnetTune <- train(x = solTrainXtrans, y = solTrainY,
              method = "avNNet",
              tuneGrid = nnetGrid,
              trControl = ctrl,
              preProc = c("center", "scale"),
              linout = TRUE,
              trace = FALSE,
              MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
              maxit = 1000,
              allowParallel = FALSE)
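
For reference, nnetGrid and ctrl are created earlier and are not shown; a plausible, purely illustrative setup for the solubility data (the exact grid values and resampling scheme here are assumptions) would be:

library(caret)
library(AppliedPredictiveModeling)

data(solubility)  # provides solTrainXtrans, solTrainY and the matching test sets

# illustrative tuning grid: size, decay and bag are avNNet's tuning parameters
nnetGrid <- expand.grid(decay = c(0, 0.01, 0.1),
                        size  = 1:13,   # up to 13 hidden units, consistent with MaxNWts above
                        bag   = FALSE)

# illustrative resampling scheme
ctrl <- trainControl(method = "cv", number = 10)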

If I make predictions on new data, do I have to pre-process this new data first, or can I pass the new data directly to the model?

Would that be different if I use the model below, where the data X are pre-processed (centered and scaled) before being passed to nnet()?

fit <- nnet(Y ~ ., data = X, size = 12, maxit = 500, linout = TRUE, decay = 0.01)

Thank you!

V. Aslanyan
Marcel

2 Answers


Yes, the new data have to be pre-processed as well.

EDIT (based on your last comment):

For your first code block, I am not sure whether the new data are automatically pre-processed just because you used the preProc argument.

For your second code block, yes: you have to pre-process the new data yourself, because nnet() does not provide any functionality to pre-process the data.

I would recommend using the preProcess() function of caret. In fact, when you supply the preProc argument to train(), the preProcess() function is called for you. You define the kind of pre-processing you need in the preProcess() call, and then you apply it to data with the predict() function, passing the data to be transformed via its newdata argument. The advantage of using preProcess() is that you can use the same predict() call to pre-process any new data with exactly the same transformation. Refer to the documentation for more details.

Of course you can pre-process just a single observation. In your example, you center and scale the training set: you compute the mean and standard deviation of each predictor in the training set, then subtract the mean and divide by the standard deviation, so that the transformed training set has mean 0 and standard deviation 1 for each predictor. To pre-process a single observation, you simply subtract the same training-set means from it and divide by the same training-set standard deviations.
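
As a rough sketch of that in R (new_obs below is a hypothetical one-row data frame with the same predictor columns as the training set):

# means and standard deviations learned from the training predictors only
train_means <- colMeans(solTrainXtrans)
train_sds   <- apply(solTrainXtrans, 2, sd)

# manually center and scale the single new observation
new_obs_scaled <- scale(as.matrix(new_obs), center = train_means, scale = train_sds)

This is exactly what predict() on a preProcess object does for you, as in the example from the documentation below.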

As a simple example of how to use the preProcess() function (taken from the documentation):

library(caret)

data(BloodBrain)

# estimate the transformation (centering and scaling by default) on the first 100 rows only
preProc  <- preProcess(bbbDescr[1:100, -3])

# apply that same transformation to both portions of the data
training <- predict(preProc, bbbDescr[1:100, -3])
test     <- predict(preProc, bbbDescr[101:208, -3])

One last thing: you mention, "If I do preprocessing by myself with the testdata and use that preprocessed testdata as input for fitting the nnet...". Just to make this clear: you fit the model on the (pre-processed) training data, and then use the predict() function to generate predictions for new data.
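
To make that workflow explicit, here is a minimal sketch using hypothetical trainX/trainY/newX objects and the settings from your nnet() call:

library(caret)
library(nnet)

# learn the centering/scaling parameters from the training predictors only
pp <- preProcess(trainX, method = c("center", "scale"))

# fit the model on the transformed training data
trainX_pp <- predict(pp, trainX)
fit <- nnet(x = trainX_pp, y = trainY, size = 12, maxit = 500,
            linout = TRUE, decay = 0.01)

# apply the same transformation to new data, then predict
newX_pp <- predict(pp, newX)
preds   <- predict(fit, newX_pp)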

Hope it helps!

micts
  • `"You can just pre-process the whole data set, and then split into training and test set."` - That depends. Basing parameters on validation/test data may potentially lead to biased models, as you allow leakage to happen. – Firebug Aug 28 '18 at 23:06
  • You are right, this could be the case, since the calculation of the mean and standard deviation would include the training set. But I think this could be the other way around too. For example, if you want to standardize and the max and min values are different in the test set. The main point still stands though, that you should probably pre-process the new data too. – micts Aug 29 '18 at 00:07
  • @micts Thanks. I mean after I have validated the avNNet model with the test set. If I want to make a single prediction with the model, I can't center or scale it, since it is just a single observation. Or does the model convert it to a scaled value by itself before it makes a prediction? – Marcel Aug 29 '18 at 09:58
  • In the second model (nnet) it is not possible (as far as I know) to do pre-processing within the training process (I don't see a preProc argument as in the train()/avNNet call). If I do the pre-processing myself with the test data and use that pre-processed test data as input for fitting the nnet, I suppose I can't use the nnet model after validation with non-pre-processed data. But then how can I pre-process new data if I only have a single observation to predict? – Marcel Aug 29 '18 at 10:03
  • I edited my answer, based on your last comment. – micts Aug 29 '18 at 13:39
  • I think you have misunderstood Firebug's comment: the usual workflow is to calculate all pre-processing parameters on the training set only and then just apply them to all other data sets (validation, test, single observations). Otherwise the answer is very helpful. – Michael M Aug 29 '18 at 13:47
  • Yes, I indeed misunderstood the comment. I will edit my answer. Thank you for pointing that out. – micts Aug 29 '18 at 13:56
  • Thank you. I will use caret in that case, with the preProc argument, for training, test and new data. I suppose I don't need to do any "reverse preprocessing" on the outcomes. There is some discussion about this at the end of this post: https://machinelearningmastery.com/pre-process-your-dataset-in-r/ . Although I suppose it is not relevant, since the predictions are not centered and scaled. – Marcel Aug 29 '18 at 15:30

No pre-processing needs to be done on the test data if you are using the first block of code to train your model: predict.train automatically applies the stored pre-processing to the test data, so you can simply pass the test data as-is, e.g. predict(nnetTune, newdata = testdata).
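
For instance, with the solubility data from the question (assuming solTestXtrans and solTestY are the corresponding held-out predictors and outcomes):

# predict.train re-applies the stored center/scale transformation internally
preds <- predict(nnetTune, newdata = solTestXtrans)

# e.g. test-set RMSE
sqrt(mean((preds - solTestY)^2))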

You can also refer to the documentation of the caret package and this answer.

In the second case, when you are using the nnet() function directly, you have to pre-process the data yourself. You can do that in the following way:

# estimate the centering/scaling parameters on the training data only
transformed <- preProcess(train_data, method = c("center", "scale"))

# apply the same transformation to the test data
preproc_testdata <- predict(transformed, testdata)
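
You can then fit nnet() on the pre-processed training data and predict on the pre-processed test data, for example (train_y is a hypothetical vector holding the outcome; the model settings mirror the question):

library(nnet)

# fit the model on the pre-processed training predictors
preproc_traindata <- predict(transformed, train_data)
fit <- nnet(x = preproc_traindata, y = train_y, size = 12, maxit = 500,
            linout = TRUE, decay = 0.01)

# predict on the pre-processed test data from above
preds <- predict(fit, preproc_testdata)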

Hope that helps!

Nutan