67

My question is: do we need to standardize the data set so that all variables are on the same scale, between $[0,1]$, before fitting a logistic regression? The formula is:

$$\frac{x_i-\min(x_i)}{\max(x_i)-\min(x_i)}$$
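For concreteness, here is a minimal sketch of this min-max rescaling in Python with NumPy (the function name and sample data are my own, purely illustrative):

```python
import numpy as np

def min_max_scale(x):
    """Rescale a 1-D array to [0, 1] via (x - min) / (max - min)."""
    x = np.asarray(x, dtype=float)
    rng = x.max() - x.min()
    if rng == 0:
        raise ValueError("cannot rescale a constant variable")
    return (x - x.min()) / rng

visits = np.array([10.0, 40.0, 25.0, 100.0])
scaled = min_max_scale(visits)  # smallest value maps to 0, largest to 1
```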

My data set has 2 variables that describe the same thing for two channels, but the volumes are different. Say it's the number of customer visits to two stores, and $y$ is whether the customer purchases. A customer can visit both stores, or visit the first store twice and the second store once, before making a purchase; but the total number of customer visits to the first store is 10 times larger than for the second store. When I fit this logistic regression without standardization, I get coef(store1) = 37 and coef(store2) = 13; if I standardize the data first, I get coef(store1) = 133 and coef(store2) = 11. Something like this. Which approach makes more sense?

What if I am fitting a decision tree model? I know tree-based models don't need standardization, since splits only depend on the ordering of each feature, but I wanted to check with all of you.

user1946504
  • 13
    You don't need to standardize unless your regression is regularized. However, it sometimes helps interpretability, and rarely hurts. – alex Jan 23 '13 at 17:02
  • @alex, what do you mean by "regularized regression"? – Tomas Jan 23 '13 at 17:10
  • 4
    Isn't the usual way to standardize $\frac{x_i-\bar{x}}{sd(x)}$? – Peter Flom Jan 23 '13 at 17:28
  • 2
    @Peter, that's what I thought before, but I found an article (http://www.benetzkorn.com/2011/11/data-normalization-and-standardization/), and it seems that normalization and standardization are different things: one makes the mean 0 and the variance 1, the other rescales each variable. That's where I got confused. Thanks for your reply. – user1946504 Jan 23 '13 at 17:56
  • @alex, you're right that it won't hurt, because I calculated the AUC on the test group and found similar results. The thing is, I am calculating the weight of each channel using the coefficients, and there standardization makes a lot of difference. – user1946504 Jan 23 '13 at 17:59
  • 11
    To me standardization makes interpretation much more difficult. – Frank Harrell Jan 23 '13 at 20:49
  • An old post on standardization/normalization: http://stats.stackexchange.com/questions/10289/whats-the-difference-between-normalization-and-standardization/10291#10291 – bill_080 Jan 24 '13 at 00:20
  • 2
    To clarify on what @alex said, scaling your data means the optimal regularisation factor `C` changes. So you need to choose `C` after standardising the data. – akxlr Aug 21 '15 at 14:07
  • 1
    Standardization of the variable shouldn't be confused with [standardized coefficients](https://thinklab.com/discussion/computing-standardized-logistic-regression-coefficients/205#5). – Antoine Lizée May 03 '16 at 08:07
  • 2
    I feel sometimes it's just simple and intuitive to do trivial feature scaling, like if house prices are actually stored in thousands (e.g. 60,000) it wouldn't hurt dividing it by 1000. This kind of common sense scaling can't hurt. – Dhiraj Apr 05 '17 at 10:36

3 Answers

52

Standardization isn't required for logistic regression. The main goal of standardizing features is to help convergence of the technique used for optimization. For example, if you use Newton-Raphson to maximize the likelihood, standardizing the features makes convergence faster. Otherwise, you can run your logistic regression without any standardization treatment of the features.
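To illustrate why the answer doesn't change (my own sketch, not part of the original answer): with a hand-rolled Newton-Raphson fit on simulated data, min-max scaling a predictor simply multiplies its fitted slope by the predictor's original range, leaving the fitted model equivalent.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Unregularized logistic regression via Newton-Raphson.
    X is (n, p) without an intercept column; one is prepended here."""
    Xb = np.hstack([np.ones((len(X), 1)), X])
    beta = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Xb @ beta))     # fitted probabilities
        grad = Xb.T @ (y - p)                    # score vector
        hess = Xb.T @ (Xb * (p * (1 - p))[:, None])  # observed information
        beta += np.linalg.solve(hess, grad)
    return beta

rng = np.random.default_rng(0)
x = rng.uniform(0, 50, size=200)
y = (rng.uniform(size=200) < 1.0 / (1.0 + np.exp(-(0.1 * x - 2)))).astype(float)

b_raw = fit_logistic(x[:, None], y)
x01 = (x - x.min()) / (x.max() - x.min())        # min-max scaled copy
b_scaled = fit_logistic(x01[:, None], y)

# The slope is rescaled by exactly range(x); predictions are unchanged.
slope_ratio = b_scaled[1] / b_raw[1]
```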

Aymen
  • Thanks for your reply. Does that mean standardization is preferred? Since we definitely want the model to converge, and when we have millions of variables it's easier to implement standardization in the modeling pipeline than to tune the variables one by one as needed. Am I understanding right? – user1946504 Sep 18 '14 at 21:33
  • 5
    that depends on the purpose of the analysis. Modern software can handle pretty extreme data without standardizing. If there is a natural unit for each variable (years, euros, kg, etc.) then I would be hesitant to standardize, though I feel free to change the unit from kg to, for example, tons or grams whenever that makes more sense. – Maarten Buis Nov 18 '14 at 08:43
31

If you use logistic regression with a LASSO or ridge penalty (as Weka's Logistic class does), you should. As Hastie, Tibshirani and Friedman point out (page 82 of the pdf, or page 63 of the book):

The ridge solutions are not equivariant under scaling of the inputs, and so one normally standardizes the inputs before solving.

This thread makes the same point.
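The quoted non-equivariance is easy to see numerically. The following is my own sketch (simulated data, not from the book), using the closed-form ridge solution for a linear model: under OLS, rescaling a column just rescales its coefficient inversely, but under a ridge penalty the back-converted coefficient is genuinely different.

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution (no intercept; penalty on all coefficients)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0]) + rng.normal(size=100)

X2 = X.copy()
X2[:, 0] *= 10.0                     # change the units of the first column

# OLS (lam = 0) is equivariant: the rescaled coefficient back-converts exactly.
b_ols, b_ols2 = ridge(X, y, 0.0), ridge(X2, y, 0.0)

# Ridge (lam = 10) is not: converting units after fitting gives a different answer.
b_ridge, b_ridge2 = ridge(X, y, 10.0), ridge(X2, y, 10.0)
```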

25

@Aymen is right, you don't need to normalize your data for logistic regression. (For more general information, it may help to read through this CV thread: When should you center your data & when should you standardize?; you might also note that your transformation is more commonly called 'normalizing', see: How to verify a distribution is normalized?) Let me address some other points in the question.

It is worth noting here that in logistic regression your coefficients indicate the effect of a one-unit change in your predictor variable on the log odds of 'success'. The effect of transforming a variable (such as by standardizing or normalizing) is to change what we are calling a 'unit' in the context of our model. Your raw $x$ data varied across some number of units in the original metric. After you normalized, your data ranged from $0$ to $1$. That is, a change of one unit now means going from the lowest valued observation to the highest valued observation. The amount of increase in the log odds of success has not changed. From these facts, I suspect that your first variable (store1) spanned $133/37\approx 3.6$ original units, and your second variable (store2) spanned only $11/13\approx 0.85$ original units.
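The back-of-the-envelope step above can be written out explicitly. Normalizing $x$ to $[0,1]$ multiplies its coefficient by the original range, so the implied ranges can be recovered from the two fits (using the question's approximate numbers):

```python
# beta_normalized = beta_raw * (max(x) - min(x)), so the implied ranges are:
range_store1 = 133 / 37   # roughly 3.6 original units
range_store2 = 11 / 13    # roughly 0.85 original units
```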

gung - Reinstate Monica