12

I'm just getting into machine learning, and I have seen two conflicting practices for normalization. To be concrete, let's suppose that we have an $n \times d$ matrix containing our training data, where $n$ is the number of samples and $d$ is the number of features.

When people say that they normalize their data before running whatever algorithm, I have seen that they do one of the following things:

  • normalize the columns of the data matrix so that $A_{1,i}^2 + A_{2, i}^2 + \cdots + A_{n, i}^2 = 1$ for each feature $i$
  • normalize the rows of the matrix so that each sample vector has the same norm

In general, when someone refers to normalization of data, which of these two are they referring to?

I was under the impression that it was the first one (seems to make the most sense to me), but looking at the documentation for sklearn's preprocessing library, it appears that the default behavior is the second one. This doesn't make sense to me.
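
To make the difference concrete, here is a minimal sketch of the two behaviors as I understand them (this assumes NumPy and sklearn's `preprocessing.normalize`, whose `axis` argument switches between them):

```python
import numpy as np
from sklearn.preprocessing import normalize

A = np.array([[1., 2.],
              [3., 4.],
              [5., 6.]])               # n = 3 samples (rows), d = 2 features (columns)

# Second bullet (sklearn's default): each ROW (sample) is scaled to unit L2 norm.
rows_unit = normalize(A)               # same as normalize(A, norm='l2', axis=1)
print((rows_unit ** 2).sum(axis=1))    # -> [1. 1. 1.]

# First bullet: each COLUMN (feature) is scaled so its sum of squares equals 1.
cols_unit = normalize(A, axis=0)
print((cols_unit ** 2).sum(axis=0))    # -> [1. 1.]
```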

Jim Koff
  • Welcome to CV. You raise a good point, one that is glossed over by years of imprecise statistical habit. It's like naive analysts who talk about "correlation" without specifying what type of correlation is used. From the context, one can generally assume Pearson correlation as that is the modal approach. Just so with your question -- one can generally assume column normalization as that is the modal approach. The ipsative scaling (or row norming) that you mention is much less frequently practiced. – Mike Hunter Mar 05 '16 at 11:29

3 Answers

7

Normalization is much trickier than most people think. Consider categorical and nonlinear predictors. Categorical (multinomial; polytomous) predictors are represented by indicator variables and should not be normalized. For continuous predictors, most relationships are nonlinear, and we fit them by expanding the predictor with nonlinear basis functions. The simplest case is perhaps a quadratic relationship $\beta_{1}x + \beta_{2}x^2$. Do we normalize $x$ by its standard deviation then square the normalized value for the second term? Do we normalize the second term by the standard deviation of $x^2$?
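
To see the ambiguity numerically, here is a small sketch with made-up data; both choices look like "normalization", yet they produce different quadratic-term columns:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(size=1000)      # a hypothetical, skewed continuous predictor

# Choice 1: standardize x, then square the standardized value.
z = (x - x.mean()) / x.std()
term1 = z ** 2

# Choice 2: form x^2 first, then standardize it by its own mean and SD.
x2 = x ** 2
term2 = (x2 - x2.mean()) / x2.std()

# The two "normalized" quadratic terms are different variables.
print(np.corrcoef(term1, term2)[0, 1])   # correlated, but not identical
print(term1.mean(), term2.mean())        # roughly 1 vs roughly 0
```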

The mere act of normalizing so that the sum of squares for a column equals one, or of normalizing by the standard deviation, assumes that the predictor is one for which squaring is the right thing to do. In general this only works correctly when the predictor has a symmetric distribution. For asymmetric distributions, the standard deviation is not an appropriate summary statistic for dispersion; one might just as easily entertain Gini's mean difference or the interquartile range. It's all arbitrary.
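
For example, here is a rough sketch (synthetic, right-skewed data; the Gini helper is written just for illustration) of how these dispersion summaries can disagree:

```python
import numpy as np
from scipy.stats import iqr

rng = np.random.default_rng(1)
x = rng.lognormal(size=1000)        # strongly right-skewed predictor

def gini_mean_difference(v):
    """Average absolute difference over all pairs of distinct observations."""
    v = np.asarray(v, dtype=float)
    n = len(v)
    return np.abs(v[:, None] - v[None, :]).sum() / (n * (n - 1))

# Three defensible "scale" estimates that lead to quite different rescalings.
print(x.std())                  # standard deviation
print(iqr(x))                   # interquartile range
print(gini_mean_difference(x))  # Gini's mean difference
```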

Frank Harrell
  • +1. You're dealing with more advanced issues, but for completeness I'd also mention that normalization is done separately for training and testing data -- whether holdout or cross-validation -- otherwise we're leaking information from the future. – Wayne Jun 03 '17 at 12:26
  • I do not believe that to be the case. Normalization, when called for, needs to be done in such a way that the predictors have the same definition for training and test sets. Plus, split-sample validation is a pretty terrible way to validate predictive models. – Frank Harrell Jun 03 '17 at 12:40
  • Yes, I misspoke: calculations involved in normalization are done on the training set and then applied to the test set, but the test set itself is not included in the original calculations. Otherwise, when doing something like standardizing, the test data -- representing future data -- will be reflected in the mean and standard deviation used to standardize. Does that sound right? – Wayne Jun 03 '17 at 12:47
  • No; normalization does not use $Y$ so it will not create overfitting. It may create incorrect fits if effects are nonlinear though (see above). If you believe in normalization, use maximal $N$ when doing it. – Frank Harrell Jun 03 '17 at 13:13
  • Ah, I'm mistaken again. So you're saying that using X values "from the future" isn't bad because I could -- hopefully in a principled manner -- have said that $X_1$ cannot physically exceed the range [a, b] and therefore I can scale it as such even though I have never seen a or b in my data. OK, I'd like to withdraw my comments, but that would leave your comments hanging. – Wayne Jun 03 '17 at 15:06
  • It's OK to do anything in the $X$-only space without including that aspect in the validation process, but that doesn't necessarily mean that it's the correct thing to do. It can hurt or help, but it will not be tilted (biased) towards systematically predicting better, which would cause overfitting. – Frank Harrell Jun 03 '17 at 15:29
  • For normalization of feature values in the range [0,1], would you recommend normalizing feature values within each fold of CV after the objects are randomly assigned (permuted) to a fold, or normalize a feature's values over the entire set of data before partitioning into folds? –  Nov 14 '17 at 15:11
  • Normalization is almost always done on the whole sample before any analysis, and done in such a way that it is not informed by $Y$. – Frank Harrell Nov 17 '17 at 12:06
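
To make the workflow discussed in these comments concrete, here is a minimal scikit-learn sketch of computing the normalization on the training rows and applying it unchanged to the test rows (column-wise standardization is used purely as an example):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = np.random.default_rng(2).normal(size=(100, 3))   # made-up feature matrix
X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)   # means and SDs come from the training rows only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # test rows are scaled with the training statistics
```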
-1

In general, normalizing the features of one sample. I would not really talk much about rows and columns here, since the feature matrix can obviously be transposed. I almost always span the features over the rows, as this makes it easier to perform calculations on the matrix in, e.g., C++.

Normalizing along the samples (I think this is your first bullet point) indeed does not make much sense. I think it is sometimes done in Auto-Encoder/Decoder methods (edit: actually only on the weight matrix) when the weights are shared in a particular way.

Think about it like this: if you normalize along the samples, how do you normalize a new sample that should be classified? Do you use the normalization term obtained during training, or do you re-calculate the norm over the training examples plus the new example? The second option will eventually make the classifier fail, and the first one no longer guarantees that your normalization sums to one.
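
A small numerical sketch of the two options, using the sum-of-squares-equal-one column normalization from the question (made-up data):

```python
import numpy as np

X_train = np.array([[1., 2.],
                    [3., 4.],
                    [5., 6.]])
x_new = np.array([10., 10.])            # a new sample arriving at prediction time

# Option 1: reuse the per-feature norms computed on the training data.
train_norms = np.linalg.norm(X_train, axis=0)
X_train_n = X_train / train_norms
x_new_n = x_new / train_norms
# The training columns still sum to one in squares, but the extended columns do not:
print((X_train_n ** 2).sum(axis=0))                          # [1. 1.]
print((np.vstack([X_train_n, x_new_n]) ** 2).sum(axis=0))    # > 1

# Option 2: recompute the norms over the training data plus the new sample.
all_norms = np.linalg.norm(np.vstack([X_train, x_new]), axis=0)
# Now the training rows themselves change, so they no longer match what the classifier saw.
print(X_train / train_norms)
print(X_train / all_norms)
```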

pAt84
-1

That depends on the analysis steps that follow the normalization.

If nothing else is said, it commonly refers to normalizing the features under consideration across all samples (e.g. in order to afterwards classify samples, to predict their value w.r.t. some quantitative attribute, or to apply dimensionality reduction techniques without the bias introduced by the heterogeneous ranges of the attributes).
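
For instance, the effect of heterogeneous attribute ranges is easy to see with PCA; a minimal sketch with synthetic data:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(scale=1.0, size=200),      # feature on a small scale
                     rng.normal(scale=1000.0, size=200)])  # feature on a much larger scale

# Without normalization, the first principal component is dominated by the large-scale feature.
print(PCA(n_components=1).fit(X).components_)

# After standardizing each feature across all samples, both features can contribute.
print(PCA(n_components=1).fit(StandardScaler().fit_transform(X)).components_)
```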

In specific fields, however, in particular in the analysis of microarray data, normalization along the samples is a widely used preprocessing step to remove unwanted variation during quality control (hopefully mostly technical noise, though of course it also affects real biological differences). You may e.g. want to have a look at https://en.wikipedia.org/wiki/Quantile_normalization.

This normalization technique in fact affects both directions at the same time (samples and features):

  1. Look for the smallest value within each sample (it may occur at a different attribute for each of the samples)
  2. Collect all these smallest values and calculate their average
  3. Assign this average back to the original places you took the values from, so that every sample now has the same value at the attribute that originally held its smallest value
  4. Do the same with the 2nd-smallest values, the 3rd-smallest, and so on, until all data have been processed this way

Finally, every sample contains exactly the same set of values, so the range of the data is the same for each sample. This data set is then the basis for further processing.
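
A minimal NumPy sketch of the procedure just described, with rows as samples and columns as features as in the question (ties are not handled specially):

```python
import numpy as np

def quantile_normalize(X):
    """Quantile-normalize X so that every row (sample) ends up with the same set of values."""
    ranks = np.argsort(np.argsort(X, axis=1), axis=1)  # rank of each value within its row
    sorted_rows = np.sort(X, axis=1)                   # each sample's values, ascending
    rank_means = sorted_rows.mean(axis=0)              # step 2: average the k-th smallest values
    return rank_means[ranks]                           # steps 3-4: assign the averages back by rank

X = np.array([[5., 2., 3.],
              [4., 1., 6.],
              [3., 4., 8.]])
print(quantile_normalize(X))
```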

jf1