44

I have a data set with a set of features. Some of them are binary ($1$ = active or fired, $0$ = inactive or dormant), and the rest are real valued, e.g. $4564.342$.

I want to feed this data to a machine learning algorithm, so I $z$-score all the real-valued features; their values then fall roughly between $-2$ and $3$. The binary features get $z$-scored as well, so the zeros become $-0.222$ and the ones become $0.5555$.

Does standardising binary variables like this make sense?
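For concreteness, here is roughly the transformation in question, as an illustrative numpy sketch (the data, sizes, and column layout here are made up):

```python
import numpy as np

# Illustrative only: a fake data set with one binary and one real-valued feature.
rng = np.random.default_rng(0)
binary = rng.integers(0, 2, size=1000).astype(float)   # 0 = dormant, 1 = fired
real = rng.normal(4500.0, 300.0, size=1000)            # e.g. values like 4564.342
X = np.column_stack([binary, real])

# Column-wise z-scoring: (value - mean) / SD.
Xz = (X - X.mean(axis=0)) / X.std(axis=0)

# The z-scored binary column still takes exactly two values:
# -p/sd for the zeros and (1-p)/sd for the ones, where p is the share of ones.
p = binary.mean()
sd = np.sqrt(p * (1.0 - p))
print(np.unique(np.round(Xz[:, 0], 6)))
```

So standardizing does not destroy the binary structure; it only relabels the two values.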

Nick Stauner
siamii

5 Answers

21

A binary variable with values 0, 1 can (usually) be scaled to (value - mean) / SD, which is presumably your z-score.

The most obvious constraint on that is that if you happen to get all zeros or all ones then plugging in SD blindly would mean that the z-score is undefined. There is a case for assigning zero too in so far as value - mean is identically zero. But many statistical things won't make much sense if a variable is really a constant. More generally, however, if the SD is small, there is more risk that scores are unstable and/or not well determined.
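As an illustration, one way to guard against the zero-SD case (a Python sketch; returning all zeros for a constant column is just one possible convention, as noted above):

```python
import numpy as np

def safe_zscore(x):
    """Z-score one feature column, leaving a constant column at zero
    (one possible convention when the SD is zero)."""
    x = np.asarray(x, dtype=float)
    sd = x.std()
    if sd == 0.0:
        return np.zeros_like(x)   # value - mean is identically zero anyway
    return (x - x.mean()) / sd
```

For example, `safe_zscore([0, 1, 0, 1])` gives `[-1, 1, -1, 1]`, while an all-ones column comes back as zeros instead of NaNs.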

A difficulty in giving a better answer to your question is knowing precisely which "machine learning algorithm" you are considering. It sounds as if it is an algorithm that combines data for several variables, and so it will usually make sense to supply them on similar scales.

(LATER) As the original poster adds comments one by one, their question is morphing. I still consider that (value - mean) / SD makes sense (i.e. is not nonsensical) for binary variables so long as the SD is positive. However, logistic regression was later named as the application and for this there is no theoretical or practical gain (and indeed some loss of simplicity) to anything other than feeding in binary variables as 0, 1. Your software should be able to cope well with that; if not, abandon that software in favour of a program that can. In terms of the title question: can, yes; should, no.
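To see the point concretely, the following sketch fits unpenalized logistic regression both ways on simulated data, using a plain Newton-Raphson solver (illustrative code, not any particular package). Since z-scoring is an affine reparametrization, the fitted probabilities are identical and only the slope rescales:

```python
import numpy as np

def fit_logistic(x, y, iters=25):
    """Unpenalized logistic regression (intercept + one predictor) via Newton-Raphson."""
    X = np.column_stack([np.ones_like(x), x])
    beta = np.zeros(2)
    for _ in range(iters):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        grad = X.T @ (y - p)
        H = (X * (p * (1.0 - p))[:, None]).T @ X
        beta += np.linalg.solve(H, grad)
    return X, beta

rng = np.random.default_rng(1)
x = rng.integers(0, 2, 200).astype(float)          # binary predictor coded 0/1
true_p = 1.0 / (1.0 + np.exp(-(0.3 + 1.2 * x)))    # simulated "true" model
y = (rng.random(200) < true_p).astype(float)

X_raw, b_raw = fit_logistic(x, y)                  # 0/1 coding
z = (x - x.mean()) / x.std()
X_z, b_z = fit_logistic(z, y)                      # z-scored coding

p_raw = 1.0 / (1.0 + np.exp(-X_raw @ b_raw))
p_z = 1.0 / (1.0 + np.exp(-X_z @ b_z))
print(np.max(np.abs(p_raw - p_z)))                 # fitted probabilities agree
```

The slope on the z-scored predictor is simply the 0/1 slope multiplied by the SD of `x`; nothing substantive changes.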

Nick Cox
  • I'm using a logistic regression, which is a linear model. When the binary is 0, it means "no face recognized" or "I have no useful information to add to the model". Wouldn't that be equivalent to a 0 instead of -0.222? Because -0.2222 will contribute some information to the model, since it's not 0. – siamii May 18 '13 at 17:19
  • For example, if the binary was "white" or "black" then I understand that it makes sense to z-score it, since both "white" and "black" have useful information in them. – siamii May 18 '13 at 17:21
  • Culture clash here: I regard logistic regression as a now classical statistical model, and not an algorithm! That aside, if 0 and 1 are the values of your response (outcome, dependent) variable, they should be left as is for all logistic regression software I've ever heard of. If they are values of a predictor, the software shouldn't demand that predictors be in similar units. If it does, then you'll need to say _much_ more about the software (than zero) to get a better reply. – Nick Cox May 18 '13 at 17:23
  • It's a predictor and it doesn't demand to be in similar units. I just z-score, because I assume it helps the prediction. So what I'm confused about is that "0" means "I have no idea what's going on, therefore I just stay 0 and not do anything" whereas "1" means "Hey, I recognized that there is a face in this picture. I have useful information for you". So if I z-score this, and shift the 0 to -0.2222, then doesn't that mean that "I have some information for you, it is -0.2222"? – siamii May 18 '13 at 17:28
  • Short answer is that it means nothing different and I see no reason why changing 0, 1 to z-scores will help anything in this situation. To convince yourself, try it both ways and see that nothing important changes. – Nick Cox May 18 '13 at 17:40
  • On the contrary, I think most people would use 0, 1 here. – Nick Cox May 18 '13 at 18:06
  • Thanks. Lastly, what if the values range between 0 and 100 instead of the binary 0 and 1, and the value denotes the confidence level? For example, 0 still means "not recognized face" and 100 means "I'm very sure I recognized a face", while 10 could mean "I might have recognized a face". Would you z-score this? – siamii May 18 '13 at 18:38
  • When you're doing logistic regression, the software will almost surely perform the standardization under the hood anyway (to achieve better numerical properties). Thus it's a good idea to keep the binary indicator expressed in a meaningful way. Standardizing it doesn't sound either good or useful. – whuber May 18 '13 at 19:17
  • Any machine learning method that requires you to "standardize" binary predictors is suspect. – Frank Harrell May 18 '13 at 19:57
  • @whuber I use my own implementation, because it's a modified logistic regression I have. – siamii May 18 '13 at 22:24
  • Since it's your own implementation, nobody else has any basis to give you an objective answer! You need to examine how your software treats the data in order to decide whether prior standardization makes sense. – whuber May 19 '13 at 14:59
  • Is there not a good reason to standardize binary predictors in penalized regression (lasso, ridge, elastic net)? – The Laconic Mar 13 '18 at 03:22
  • @The Laconic If there is, you can explain that in an answer. – Nick Cox Mar 13 '18 at 07:05
  • Nice answer. Is there any example of research texts that do standardize binary predictors? – yoyostein Dec 13 '19 at 02:32
  • @yoyostein An example is given in another answer in this thread. – Nick Cox Dec 13 '19 at 09:02
16

Standardizing binary variables does not make any sense. The values are arbitrary; they don't mean anything in and of themselves. There may be a rationale for choosing some values like 0 & 1, with respect to numerical stability issues, but that's it.

gung - Reinstate Monica
  • what if they were between 0-100. As I said, they mean stuff like "recognized a face" and "not recognized face", and 0-100 means the confidence level. Does it make sense to z-score that? – siamii May 18 '13 at 18:36
  • Your 0-100 example sounds like an ordinal rating. There's a bit of detail regarding how best to deal w/ that situation & it's been discussed on CV quite a bit. Search on the [tag:ordinal] tag to learn more. – gung - Reinstate Monica May 18 '13 at 18:45
  • Well, the problem is that only some of the variables are 0 to 100. Others range, for example, from -400 to +400. – siamii May 18 '13 at 18:49
  • What is the problem w/ that? Is this a numerical stability issue? – gung - Reinstate Monica May 18 '13 at 18:52
  • perhaps, do you suggest I don't z-score? – siamii May 18 '13 at 22:24
  • I don't have a strong opinion either way. Agresti has argued that you can re-represent ordinal predictors w/ continuous values that you think are reasonable. If you aren't perfectly correct, there will be some measurement error, but if you have any knowledge of the topic to draw on / aren't totally off, the effect will be minor. If it subsequently seemed normal enough, I might turn them into z-scores. You should read the material that's on the site, these issues have been discussed. – gung - Reinstate Monica May 19 '13 at 00:11
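For what it's worth, the compromise implicit in the comments above (standardize the continuous predictors on their various scales, but leave the binary indicator coded 0/1) might look like this numpy sketch, with made-up data and scales:

```python
import numpy as np

rng = np.random.default_rng(2)
conf = rng.uniform(0, 100, 500)                # e.g. a 0-100 confidence score
other = rng.uniform(-400, 400, 500)            # a feature on a very different scale
face = rng.integers(0, 2, 500).astype(float)   # binary indicator, kept as 0/1

# z-score only the continuous columns; keep the binary one as-is.
Xc = np.column_stack([conf, other])
Xc = (Xc - Xc.mean(axis=0)) / Xc.std(axis=0)
X = np.column_stack([Xc, face])
```

The continuous columns end up with mean 0 and SD 1, while the indicator stays interpretable.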
4

One nice example where it can be useful to standardize in a slightly different way is given in section 4.2 of Gelman and Hill (http://www.stat.columbia.edu/~gelman/arm/). This is mostly when the interpretation of the coefficients is of interest, and perhaps when there are not many predictors.

There, they standardize a binary variable (with equal proportions of 0 and 1) by $$ \frac{x-\mu_x}{2\sigma_x}, $$ instead of dividing by the usual single $\sigma_x$. The standardized variable then takes on the values $\pm 0.5$, so its coefficient directly reflects the comparison between $x=0$ and $x=1$. If scaled by $\sigma_x$ instead, the coefficient would correspond to half the difference between the possible values of $x$.
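A quick numerical check of the equal-proportions case (numpy sketch):

```python
import numpy as np

# Binary variable with equal proportions of 0 and 1, as in the equal-split case.
x = np.array([0] * 50 + [1] * 50, dtype=float)
mu, sigma = x.mean(), x.std()        # mu = 0.5, sigma = 0.5
x2sd = (x - mu) / (2 * sigma)        # divide by two SDs instead of one
print(np.unique(x2sd))               # the two values are -0.5 and +0.5
```

With unequal proportions the SD shrinks below 0.5 and the two values are no longer symmetric about zero, as a comment below notes.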

Gosset's Student
  • Please explain "with equal proportion of 0 and 1" as the binary variables I see are rarely like that. – Nick Cox Mar 13 '18 at 07:27
  • I don't think the proportion will actually make a difference, they just use it to make the example cleaner. – Gosset's Student Mar 13 '18 at 13:27
  • Very skewed proportions will make a difference and a different approach should be considered. E.g., if the binary variable has probabilities `.1` and `.9`, $\sigma$ is then `sqrt(.1*.9)` = `.3`, and the scaled values are no longer the proposed $\pm0.5$. – jorijnsmit Dec 17 '20 at 16:19
1

What do you want to standardize, a binary random variable, or a proportion?

It makes no sense to standardize a binary random variable. A random variable is a function that assigns a real value to each outcome, $Y:S\rightarrow \mathbb{R}$; here it assigns $0$ to failure and $1$ to success, i.e. $Y\in \lbrace 0,1\rbrace$.

A proportion, by contrast, is not a binary random variable but a continuous one, $X\in[0,1]$.

QAChip
0

In logistic regression, binary variables may be standardized so they can be combined with continuous variables when you want to give all of them a non-informative prior such as $N(0,5)$ or Cauchy$(0,5)$. The advised standardization is as follows: take the overall proportion of 1's and recode

1 = proportion of 1's

0 = 1 - proportion of 1's.

-----

Edit: Actually I was not right at all: it is not a standardization but a shift, so that the variable is centered at 0 and its two values differ by 1. Say a population is 30% company A and 70% other; then we can define a centered "Company A" variable taking on the values 0.7 (for company A) and -0.3 (for the others).
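Numerically, with the made-up 30/70 split (numpy sketch):

```python
import numpy as np

# 30% "company A" (coded 1), 70% other (coded 0); an illustrative population.
x = np.array([1] * 30 + [0] * 70, dtype=float)
centered = x - x.mean()   # shift only; no division by the SD
print(np.unique(centered))
```

The centered variable has mean 0 and its two values, -0.3 and 0.7, still differ by exactly 1, so the coefficient keeps its 0-versus-1 interpretation.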