Scaling data with a time feature

Question

I'm going through a solution to the bike sharing demand problem, and one point about scaling the data is unclear to me. Concretely, why do we fit the scaler only on our training data instead of on the whole dataset?

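from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit on the training set only (StandardScaler ignores the labels argument)
scaler.fit(train_data, train_labels)
# transform both sets with the scaler fitted above
scaled_train_data = scaler.transform(train_data)
scaled_test_data = scaler.transform(test_data)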

I think it has something to do with the fact that we have a time feature. Because of that feature, we divide our dataset into a training set and a test set so that the examples in the training set happened earlier than the examples in the test set. I thought that we scale the data differently for the same reason, but I don't have a good intuition about the matter.

So, why do we do it this way?

score 0 · Accepted Answer · answered Mar 17 '17 at 11:59

Think of it this way: what is the test data supposed to represent? It represents data that your model has not yet seen, the unknown that you want it to predict. StandardScaler uses properties of the data, namely the mean and variance of each feature, to scale it. Now ask yourself whether it would make sense for your model to use statistics computed from the complete dataset for scaling. That would mean you already know something about the unknown test data, which you should not, and you risk introducing a bias because of this.

In your example, the mean and variance used for scaling are estimated from the training data, and the same mean and variance are then used to scale the test data. Even if you did not have any time-based features, you would still do the scaling this way, because you want to avoid using any information from the test data. Remember, the whole purpose of the test data is to simulate a dataset that you don't have.
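To make the point concrete, here is a minimal sketch in plain Python of what "fit on train, transform both" amounts to. The helper names (`fit_scaler`, `transform`) and the temperature values are made up for illustration; this is not StandardScaler's actual code, although it uses the same kind of statistics (mean and population standard deviation).

```python
from statistics import fmean, pstdev

def fit_scaler(values):
    """Estimate the mean and population std from the given values only."""
    return fmean(values), pstdev(values)

def transform(values, mean, std):
    """Standardize values using previously fitted statistics."""
    return [(v - mean) / std for v in values]

# Hypothetical temperature readings, ordered in time (illustrative data).
train_temps = [10.0, 12.0, 14.0, 16.0]  # earlier observations
test_temps = [18.0, 20.0]               # later, "unseen" observations

# The statistics come from the training set alone ...
mean, std = fit_scaler(train_temps)

# ... and are reused unchanged for both sets.
scaled_train = transform(train_temps, mean, std)
scaled_test = transform(test_temps, mean, std)
```

Here the scaled training values have mean 0 and standard deviation 1, while the scaled test values do not need to, and that is fine: the test set is simply expressed in the units the model learned during training.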

Vaibhav Arora

Hm. I guess I have been looking at the whole scaling process incorrectly. So, should we always use the means and stds from the training set to scale all our data, or are there cases where that's not true? – Ivan Panshin Mar 17 '17 at 13:09
In certain cases you know your max and min values in advance (and you know they are never really going to change). For example, in grayscale images we encounter pixel values from 0 to 255, so simply dividing each pixel value by (255 - 0) would suffice. Have a look at https://en.wikipedia.org/wiki/Feature_scaling#Standardization for details. Normalization will help certain algorithms such as neural networks and logistic regression, but for others, e.g. random forests, it may not be needed. – Vaibhav Arora Mar 17 '17 at 14:21
You might find http://stats.stackexchange.com/questions/189652/is-it-a-good-practice-to-always-scale-normalize-data-for-machine-learning useful. – Vaibhav Arora Mar 17 '17 at 14:21
As for your question, I don't recall reading much about which normalization to use when. Just stick with standardization (using means and stds) for most cases and fixed-range scaling for some (images), and you'll be fine :) – Vaibhav Arora Mar 17 '17 at 14:28
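The fixed-range case from the comments (grayscale pixels between 0 and 255) can be sketched as follows. The function name `scale_pixels` and the sample pixel values are hypothetical. The point is only that the bounds are known constants, so nothing is estimated from either the training or the test data.

```python
# Known, fixed bounds for 8-bit grayscale pixels (not estimated from data).
PIXEL_MIN, PIXEL_MAX = 0, 255

def scale_pixels(pixels, lo=PIXEL_MIN, hi=PIXEL_MAX):
    """Map values from the known range [lo, hi] onto [0, 1]."""
    return [(p - lo) / (hi - lo) for p in pixels]

# Illustrative pixel values; the same constants apply to any split.
train_pixels = [0, 51, 128]
test_pixels = [255, 102]

scaled_train_pixels = scale_pixels(train_pixels)
scaled_test_pixels = scale_pixels(test_pixels)
```

Because the bounds do not depend on the data, fitting on the training set versus the whole dataset makes no difference here, unlike with the mean and standard deviation.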
