Effectiveness of Standardization and Normalization in Machine Learning

Question

I am just studying the basics of machine learning and have a question about the standardisation and normalisation of features and how effective they are.

I have read this CrossValidated question and this blog, among many other articles, but I am still not clear on when I should apply which technique.

To make things worse, sklearn offers a StandardScaler, a MinMaxScaler, a MaxAbsScaler, a RobustScaler, and a Normalizer.
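As a rough sketch of what these five transformers compute with their default settings, the formulas can be written in plain NumPy. The toy data and the function names below are made up for illustration and are not part of sklearn; the sketch also assumes no column is constant and no row is all zeros, so it skips the zero-division handling the library classes provide.

```python
import numpy as np

# Toy data (made up): rows are samples, columns are features on different scales
X = np.array([[1.0, 1000.0],
              [2.0, 3000.0],
              [3.0, 2000.0],
              [4.0, 8000.0]])

def standard_scale(X):
    # StandardScaler idea: zero mean and unit (population) variance per column
    return (X - X.mean(axis=0)) / X.std(axis=0)

def min_max_scale(X):
    # MinMaxScaler idea: map each column linearly onto [0, 1]
    mn, mx = X.min(axis=0), X.max(axis=0)
    return (X - mn) / (mx - mn)

def max_abs_scale(X):
    # MaxAbsScaler idea: divide each column by its largest absolute value
    return X / np.abs(X).max(axis=0)

def robust_scale(X):
    # RobustScaler idea: subtract the median, divide by the interquartile range
    q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - med) / (q3 - q1)

def normalize_rows(X):
    # Normalizer idea: rescale each ROW (sample) to unit L2 norm
    return X / np.linalg.norm(X, axis=1, keepdims=True)

for fn in (standard_scale, min_max_scale, max_abs_scale, robust_scale, normalize_rows):
    print(fn.__name__)
    print(fn(X))
```

Note that the first four work column by column (per feature), while the Normalizer works row by row (per sample), so it is a different kind of operation from the others.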

So I took the SECOM Manufacturing dataset and ran almost all the classification algorithms I could find on the data, both without scaling and with each of these scaling methods, and calculated the test accuracy.

What I got was not in line with most of the things I had read online.

For example:

Methods using Euclidean distance were supposed to be the most affected by scaling, but in fact the accuracy of the KNN algorithm decreased when I scaled the features.
Logistic Regression and SVM are said to rely on gradient descent, so they were also supposed to show an improvement, but I could not really see that either, especially with SVM (or SVC, the classification variant).
Most of the algorithms show lower accuracy with RobustScaler and higher accuracy with Normalizer.
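To make the Euclidean-distance point concrete, here is a minimal sketch with four hypothetical points (not taken from SECOM) where one feature is around 0.1 to 0.9 and the other is around 3000. It only shows that min-max scaling can change which point counts as the nearest neighbour; whether that change helps or hurts accuracy depends on the data.

```python
import numpy as np

# Hypothetical points: feature 0 lies in 0.1-0.9, feature 1 is around 3000
points = np.array([[0.10, 3000.0],   # row 0 is the query point
                   [0.90, 3010.0],
                   [0.15, 3040.0],
                   [0.50, 3100.0]])

def nearest_to_first(X):
    """Index of the row closest (Euclidean) to row 0, excluding row 0 itself."""
    dists = np.linalg.norm(X[1:] - X[0], axis=1)
    return int(np.argmin(dists)) + 1

# Min-max scale each column to [0, 1]
mn, mx = points.min(axis=0), points.max(axis=0)
scaled = (points - mn) / (mx - mn)

raw_nn = nearest_to_first(points)      # distance dominated by feature 1
scaled_nn = nearest_to_first(scaled)   # both features now contribute equally
print("nearest neighbour on raw data:   ", raw_nn)
print("nearest neighbour on scaled data:", scaled_nn)
```

On the raw data the large-valued feature decides the neighbour almost alone; after scaling, the small-valued feature gets an equal say, so the neighbour changes from row 1 to row 2.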

Please see below the results of the experiment I did. Can you tell me whether I am doing something wrong or interpreting the results incorrectly?

Code I used

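import pandas as pd
import xgboost as xgb
import lightgbm as lgb
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier, ExtraTreesClassifier,
                              GradientBoostingClassifier, RandomForestClassifier)
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.linear_model import (LogisticRegressionCV, PassiveAggressiveClassifier,
                                  RidgeClassifierCV, SGDClassifier, Perceptron)
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import MinMaxScaler, MaxAbsScaler, RobustScaler, Normalizer
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, ExtraTreeClassifier

def benchmark_ml_algo_v2(df_features, df_target):
    #Machine Learning Algorithm (MLA) Selection and initialization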
    MLA = [
        #Ensemble Methods
        AdaBoostClassifier(),
        BaggingClassifier(),
        ExtraTreesClassifier(),
        GradientBoostingClassifier(),
        RandomForestClassifier(n_estimators = 100),
        xgb.XGBClassifier(),
        lgb.LGBMClassifier(),
        # CatBoostClassifier(),
        #Gaussian Processes
        GaussianProcessClassifier(),

        #GLM
        LogisticRegressionCV(),
        PassiveAggressiveClassifier(max_iter=1000),
        RidgeClassifierCV(),
        SGDClassifier(max_iter= 1000),
        Perceptron(max_iter=1000),

        #Naive Bayes
        GaussianNB(),

        #Nearest Neighbor
        KNeighborsClassifier(n_neighbors = 3),

        #SVM
        SVC(probability=True),
        LinearSVC(),

        #Trees    
        DecisionTreeClassifier(),
        ExtraTreeClassifier(),

        ]

    scalers = [
        None,
        MinMaxScaler(),
        MaxAbsScaler(),
        RobustScaler(),
        Normalizer()
    ]

    import time

    #create table to compare MLA
    MLA_columns = ['MLA Name','Scaler', 'MLA Test Accuracy','MLA Time']
    MLA_compare = pd.DataFrame(columns = MLA_columns)

    train_x, test_x, train_y, test_y = train_test_split(df_features, df_target,random_state  = 40)

    #index through MLA and save performance to table
    row_index = 0
    for alg in MLA:
        print ("Running {}...".format(alg.__class__.__name__))
        for scaler in scalers:

            #set name and parameters
            MLA_compare.loc[row_index, 'MLA Name'] = alg.__class__.__name__
            MLA_compare.loc[row_index, 'Scaler'] = scaler.__class__.__name__

            # time.clock() was removed in Python 3.8; perf_counter() is the replacement
            start = time.perf_counter()

            #fit the scaler (if any) on the training data only, then fit the model
            if scaler is not None:
                train_x_fit = scaler.fit_transform(train_x)
            else:
                train_x_fit = train_x
            alg.fit(X=train_x_fit, y=train_y.values.ravel())
            MLA_compare.loc[row_index, 'MLA Time'] = time.perf_counter() - start

            #reuse the scaler fitted on the training data; refitting it on the test
            #set would apply a different transformation to the test features
            test_x_eval = scaler.transform(test_x) if scaler is not None else test_x
            MLA_compare.loc[row_index, 'MLA Test Accuracy'] = alg.score(X=test_x_eval, y=test_y)

            row_index+=1


    #print and sort table: https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html
    return MLA_compare.sort_values(by = ['MLA Test Accuracy'], ascending = False)
    #print(MLA_compare)

Edit: 15 Jan 2018

On request, I re-did the experiment with 10 tries and 10 random splits and took the mean accuracy for the comparison. Below is the result.
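A procedure like the one described in this edit could be sketched as follows. This is not the code used for the results in this post: the synthetic data from make_classification stands in for SECOM, the helper name repeated_accuracy is hypothetical, and only KNN with and without StandardScaler is compared. Wrapping the scaler and the model in a pipeline means the scaler is fitted on each training split only, and reporting the standard deviation next to the mean gives a sense of how much of any difference is just split-to-split noise.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def repeated_accuracy(make_model, X, y, n_repeats=10):
    """Test accuracy of a freshly built model on n_repeats random splits."""
    scores = []
    for seed in range(n_repeats):
        train_x, test_x, train_y, test_y = train_test_split(X, y, random_state=seed)
        model = make_model()          # new, unfitted model for every split
        model.fit(train_x, train_y)   # a pipeline fits its scaler on train_x only
        scores.append(model.score(test_x, test_y))
    return np.array(scores)

# Synthetic stand-in data (assumption: NOT the SECOM dataset)
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X[:, 0] *= 1000.0  # put one feature on a much larger scale than the rest

candidates = {
    "no scaling": lambda: KNeighborsClassifier(n_neighbors=3),
    "StandardScaler": lambda: make_pipeline(StandardScaler(),
                                            KNeighborsClassifier(n_neighbors=3)),
}

results = {name: repeated_accuracy(factory, X, y) for name, factory in candidates.items()}
for name, scores in results.items():
    print(f"{name}: mean={scores.mean():.3f}  std={scores.std(ddof=1):.3f}")
```

Increasing n_repeats narrows the uncertainty on the mean at the cost of run time.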

+1 for the thorough results you quote. When you look at the unscaled distributions of SECOM Manufacturing data, are there orders of magnitude differences between features without scaling? — Zhubarb, Jan 14 '18 at 12:03
There is no single answer to this question: scaling improves gradient descent, but it may affect regularisation (positively or negatively), and arguably the same holds for KNN: is the relative scale of the different variables before normalisation important or not? Scaling will also affect your regularisation coefficient (you will need to optimise it separately). — seanv507, Jan 14 '18 at 12:15
@Berkmeister Yes.. there are Features with values ranging in 3000 to features that are in the range of 0.1 and even some negatives. — Manu Joseph, Jan 14 '18 at 13:44
@seanv507 So, in an application stand point, you just gotta try with different combinations until you get one which behaves best for your data? — Manu Joseph, Jan 14 '18 at 13:45
Yes Manu, in practice you try different combinations...and rationalize it post hoc — seanv507, Jan 14 '18 at 19:12
I applaud your experiments, but there is an important component missing. To really get a clear picture, you need to run each experiment many times, on different random data sets. Many of these algorithms are non-deterministic, and will perform better or worse on one run to another due to chance. You should be looking for and comparing the *average* behaviour across many different runs and data sets. (I also hate the accuracy metric as a base of comparison for anything like this, but that's another story). — Matthew Drury, Jan 15 '18 at 06:29
@MatthewDrury Hmmm... you make a valid point. Initially I did the experiment with random test train split multiple times, but then made the split constant to measure the performance across scaling methods and not be influenced by the randomness of the train test split. I'll do the experiment with multiple tries and post them also... — Manu Joseph, Jan 15 '18 at 06:39
You should plot more than the mean accuracy, since what you really want to know is how differences in the mean accuracy compare to the variance introduced due to the randomness in your data and algorithms. Plot a point estimate for the mean accuracy, and also some way of showing the variance, like an interval. Also, 10 is a pretty small number, I would do at least 100. — Matthew Drury, Jan 15 '18 at 17:25
