Suppose we have several features (e.g. $\geq20$) that do not follow a Gaussian distribution. Do we have to worry about the features not following a Gaussian distribution if we apply standardization to the data?

Namely, even if the features do not follow a normal distribution initially, aren't they made to follow a Gaussian distribution with mean $0$ and variance $1$ by standardization?

Frans Rodenburg
Akash Dubey
  • Your last statement is incorrect: standardization does not transform a dataset's distribution from non-normal to normal. – Emil Sep 03 '18 at 09:51
  • @Emil After standardization, the mean and variance become 0 and 1 respectively, and I also know that a random variable with mean 0 and variance 1 follows a standard normal distribution. Correct me if I am wrong. – Akash Dubey Sep 03 '18 at 09:57
  • Akash, think about what happens to the distribution: Subtracting the mean sets the location of the mean to $0$. Dividing by the standard deviation either compresses or stretches the distribution such that it becomes as narrow or wide as necessary for it to have a standard deviation of $1$. Where in this process did we change the shape? Why would a non-normal distribution suddenly become normal? See here for example for non-normal distributions that meet the criteria: https://stats.stackexchange.com/a/314003/176202 – Frans Rodenburg Sep 03 '18 at 10:34
  • Okay, yes, I get it. But if that is true, then a standard normal variate would not always follow a normal distribution, which I believed it did. Does it? – Akash Dubey Sep 03 '18 at 10:39
  • The standard normal is a *normal distribution* with $\mu=0$ and $\sigma=1$, so to say that it is not normal makes no sense. Note that an arbitrary distribution with mean $0$ and standard deviation $1$ is not called a standard normal distribution. – Frans Rodenburg Sep 03 '18 at 10:41
  • You do not have to thank people on CV, but you can show your appreciation by upvoting and accepting @Emil's answer. On a different note, if you comment on a thread, only the OP is notified. You can ping others by using @ followed by their username. – Frans Rodenburg Sep 03 '18 at 10:53
  • Whether you have to worry about the features having a non-Gaussian distribution depends on what you're doing with them: what classifier are you using? If your classifier is one that requires its input features to be Gaussian, then look into e.g. [Box-Cox transforms](https://stats.stackexchange.com/search?q=box-cox+transform). Also, show us a plot of the distribution (a rough histogram is fine). Standardization doesn't change the shape of the distribution; it only slides it around and compresses or expands it. – smci Sep 03 '18 at 21:31

1 Answer

The short answer: yes, you do need to worry about your data's distribution not being normal, because standardization does not change the shape of the underlying distribution. If $X\sim\mathcal{N}(\mu, \sigma^2)$, then standardizing gives a standard normal variable: $Y:=(X-\mu)/\sigma \sim\mathcal{N}(0,1)$. However, this works only because $X$ already follows a normal distribution in the first place. If $X$ has any other distribution, standardizing it in the same way only shifts and rescales it, so the result is still not normally distributed.
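
To see why the shape cannot change, it may help to write out the density of the standardized variable. This is only a sketch: it assumes $X$ has a density $f_X$, mean $\mu$, standard deviation $\sigma>0$ and a finite third moment, and the symbols $f_Y$ and $\gamma_1$ (skewness) are notation introduced here for illustration.

$$
f_Y(y) = \sigma\, f_X(\mu + \sigma y),
\qquad
\gamma_1(Y) = \mathrm{E}\!\left[Y^3\right] = \mathrm{E}\!\left[\left(\frac{X-\mu}{\sigma}\right)^{3}\right] = \gamma_1(X).
$$

The density of $Y$ is the density of $X$ shifted and rescaled, and shape measures such as the skewness carry over unchanged. For instance, an exponential distribution has skewness $2$ before and after standardization, whereas every normal distribution has skewness $0$.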

A simple example of exponentially distributed data and its standardized version:

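# Draw 5000 values from an exponential distribution (mean = 1/rate = 2)
x <- rexp(5000, rate = 0.5)
# Standardize: subtract the sample mean, divide by the sample standard deviation
y <- (x - mean(x)) / sd(x)
# Stack the two histograms vertically; the x-axis is cut at the 99.5% quantile
par(mfrow = c(2, 1))
hist(x, freq = FALSE, col = "blue", breaks = 100, xlim = c(min(x), quantile(x, 0.995)),
     main = "Histogram of exponentially distributed data X with rate = 0.5")
hist(y, freq = FALSE, col = "yellow", breaks = 100, xlim = c(min(y), quantile(y, 0.995)),
     main = "Histogram of standardized data Y = ( X-E(X) ) / StDev(X)")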

Now if we check the mean and standard deviation of the original data $x$, we get

c(mean(x), sd(x))
[1] 2.044074 2.051816

whereas for the standardized data $y$, the corresponding results are

c(mean(y), sd(y))
[1] 7.136221e-17 1.000000

As you can see, the distribution of the data after standardization is decidedly not normal, even though the mean is (practically) 0 and the variance 1. In other words, if the features do not follow a normal distribution before standardization, they will not follow it after the standardization either.
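
The same point can be checked numerically rather than by eye. The following is a minimal sketch, not part of the original analysis: `sample_skewness` is a hypothetical helper defined here, the seed is arbitrary, and the Shapiro-Wilk test is just one of several possible normality checks.

```r
# Hypothetical helper: sample skewness (third standardized moment).
sample_skewness <- function(v) {
  mean((v - mean(v))^3) / sd(v)^3
}

set.seed(42)                   # arbitrary seed, only for reproducibility
x <- rexp(5000, rate = 0.5)    # same setup as the histogram example
y <- (x - mean(x)) / sd(x)     # standardized copy

# Standardizing is a shift and a rescale, so the skewness should agree
# up to floating-point error. A normal distribution has skewness 0,
# an exponential distribution has skewness 2.
skew_x <- sample_skewness(x)
skew_y <- sample_skewness(y)
print(c(skew_x, skew_y))

# Shapiro-Wilk normality test (shapiro.test accepts 3 to 5000 values).
# A very small p-value is evidence against normality.
sw_x <- shapiro.test(x)
sw_y <- shapiro.test(y)
print(c(sw_x$p.value, sw_y$p.value))
```

Both skewness values should be close to 2 and practically identical to each other, and the test should reject normality for the standardized data just as it does for the raw data.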

Emil
  • I am a little confused here. Suppose our data initially follow some distribution with any mean and variance. After standardization, the mean and variance of the data become 0 and 1 respectively, and I also know that a random variable with mean 0 and variance 1 follows a standard normal distribution. Does a standard normal distribution not follow a normal distribution? – Akash Dubey Sep 03 '18 at 10:24
  • "I also know that a random variable with mean 0 and var 1 follows standard normal distribution". This sentence is wrong. There are many different examples of a random variable that has mean 0 and var 1 but with a distribution that is **not** normal. – Emil Sep 03 '18 at 10:39