
If I standardize a standard normal random variate, will it still be standard normal? That is, if $X\sim N(0,1)$, can I compute $$X^*=\frac{X-\bar X}{sd(X)}$$ and will $X^*\sim N(0,1)$?

In R code :

x <- rnorm(5)
scale(x)

It seems to me that I am standardizing a standard normal, which sounds like double standardization. I also don't know whether the result retains the standard normal distribution.

user81411
  • "Valid" in what sense? For what purpose? – whuber Aug 14 '15 at 02:12
  • On reflection, I think I've seen a similar question before which had one or more good answers. Can't seem to locate it right now though. – Glen_b Aug 14 '15 at 02:31
  • @whuber though "valid" is a technical term, I have actually used it in plain language. You can think of it as "logical". – user81411 Aug 14 '15 at 02:41
  • That doesn't help us understand what you mean by it, unfortunately. **What is the purpose**? For some purposes this standardization is helpful and mathematically correct--"valid," if you like. For others--including some of those pointed out in existing answers--it is not valid or could be misinterpreted. Unless you can edit this question to specify your meaning, it will have to be closed as being objectively unanswerable. – whuber Aug 14 '15 at 03:50

5 Answers

10

If $X_i$ are iid Normal(0,1), a sample from them won't have sample mean exactly 0 or sample standard deviation exactly 1, just due to random variation.

Now consider what happens when we compute $Z=\frac{X-\overline{X}}{s_X}$.

While we do now have sample mean 0 and sample standard deviation 1, what we don't have is $Z$ being normally distributed.

In small to moderate sample sizes, it has short tails and substantially smaller kurtosis than a standard normal. Indeed, simulation for samples of size $n=10$ shows it looks pretty similar to a Beta(4,4) that has been scaled to lie in $(-3,3)$:

[Plot comparing the simulated values in res with a random sample from a Beta(4,4) scaled to (-3,3)]

(The x-axis shows a random sample from a Beta(4,4) scaled to $(-3,3)$. Of course, this alone doesn't prove that the distribution is a Beta(4,4). -- Edit: as Henry points out in the comments, it is in fact a scaled Beta(4,4).)

The values in res were generated as follows:

res = replicate(100000, scale(rnorm(10)))

For samples of size 5, the result looks rather like a scaled beta(3/2,3/2).

Further, the values within each sample are no longer independent, since they sum to 0 and their squares sum to $n-1$.
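To see the short tails directly, here is a quick Python translation of the simulation above (a sketch with my own variable names; `ddof=1` mimics the $n-1$ divisor of R's `sd()` used by `scale()`). Every standardized value is bounded by Samuelson's inequality at $(n-1)/\sqrt n \approx 2.85$ for $n=10$, and the pooled excess kurtosis comes out negative:

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 10, 100_000

# 100,000 samples of size 10, each standardized by its own sample
# mean and sample sd (ddof=1), mimicking R's scale().
x = rng.standard_normal((reps, n))
z = (x - x.mean(axis=1, keepdims=True)) / x.std(axis=1, ddof=1, keepdims=True)

# Samuelson's inequality: |z_i| <= (n-1)/sqrt(n) ~ 2.846 for n = 10,
# so no standardized value can even reach 3.
print(np.abs(z).max())

# Pooled excess kurtosis: negative (about -0.5), i.e. shorter tails
# than the standard normal's 0.
pooled = z.ravel()
kurtosis = np.mean(pooled**4) / np.mean(pooled**2) ** 2 - 3
print(kurtosis)
```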

Glen_b
  • The distribution is exactly a beta distribution with both shape parameters equal to $\frac n2-1$, stretched onto the interval $[-\frac{n-1}{\sqrt n}, \frac{n-1}{\sqrt n}]$ or onto the interval $[-\sqrt {n-1}, \sqrt {n-1}]$ (depending on whether you use $\frac1{n-1}$ or $\frac1n$ to find the sample variance), as shown in [an answer to the linked question](https://stats.stackexchange.com/a/182087/2958) – Henry Mar 29 '21 at 16:23
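Henry's exact-Beta result can be checked numerically. The Python sketch below (my variable names; note that with R's $1/(n-1)$ divisor the support works out to $\pm(n-1)/\sqrt n$, e.g. $\pm 1/\sqrt 2$ for $n=2$) keeps one standardized value per sample so the pooled values are iid, rescales them onto $[0,1]$, and compares them to a Beta$(n/2-1,\,n/2-1)$ with a Kolmogorov-Smirnov statistic:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, reps = 10, 100_000

# Keep only the first standardized value of each sample so the
# pooled values are iid draws from the marginal distribution.
x = rng.standard_normal((reps, n))
z = (x[:, 0] - x.mean(axis=1)) / x.std(axis=1, ddof=1)

# With the 1/(n-1) divisor the support is [-(n-1)/sqrt(n), (n-1)/sqrt(n)];
# rescale onto [0, 1] and compare with Beta(n/2 - 1, n/2 - 1) = Beta(4, 4).
u = (z * np.sqrt(n) / (n - 1) + 1) / 2
stat, pvalue = stats.kstest(u, stats.beta(n / 2 - 1, n / 2 - 1).cdf)
print(stat)  # a tiny KS distance is consistent with the exact Beta claim
```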
7

We have that

$$X_i^* = \frac{X_i}{s} - \frac{\bar X}{s}$$

Since $\sigma^2=1$ here, the sample variance of a normal sample has an exact distribution:

$$(n-1)s^2/\sigma^2\sim\chi^2_{n-1} \implies s^2 \sim \frac{1}{n-1}\chi^2_{n-1} \implies s \sim \frac{1}{\sqrt{n-1}}\chi_{n-1}$$

i.e. $s$ follows the square root of a chi-square divided by its degrees of freedom.

But even though this means that $\frac{X_i}{s}$ is the ratio of a standard normal to the square root of a chi-square divided by its degrees of freedom, the numerator is not independent of the denominator, so we cannot say that the ratio follows a Student's $t$-distribution (and personally I do not know its distribution).

As for the second term, it is known that the sample mean and the sample variance are independent random variables if and only if the sample consists of independent normals, which is the case here.

Furthermore, the sample mean follows a zero-mean normal distribution with variance here $1/n$, so $\sqrt{n}\bar X$ follows a standard normal.

So we have that $$\frac{\sqrt {n} \bar X}{s} \sim t_{n-1} \implies \frac{\bar X}{s} \sim \frac{1}{\sqrt {n}}t_{n-1} $$

i.e. the second term of $X_i^*$ follows a scaled Student's $t$-distribution with $n-1$ degrees of freedom.

So in all

$$X^*_i = \frac{Z_i}{\sqrt{\chi^2_{n-1}/(n-1)}} - \frac{1}{\sqrt {n}}t_{n-1}$$

where I have used the symbol $Z$ to denote a random variable following a standard normal. The first term is not a Student's $t$, and moreover it is not independent of the second term. Put together, it doesn't look much like a normal or a Student's distribution either.
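The scaled-$t$ claim for the second term is easy to verify by simulation. Here is a Python sketch (my variable names, using $n=10$) checking that $\sqrt n\,\bar X/s$ matches a $t_{n-1}$ by Kolmogorov-Smirnov distance, while the full standardized values do not:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, reps = 10, 100_000

x = rng.standard_normal((reps, n))
xbar = x.mean(axis=1)
s = x.std(axis=1, ddof=1)

# sqrt(n) * xbar / s should follow Student's t with n - 1 df.
stat, _ = stats.kstest(np.sqrt(n) * xbar / s, stats.t(df=n - 1).cdf)
print(stat)  # tiny KS distance: consistent with t_{n-1}

# The standardized values X* themselves are NOT t-distributed:
# their support is bounded, unlike any t distribution.
z = (x[:, 0] - xbar) / s
stat_z, _ = stats.kstest(z, stats.t(df=n - 1).cdf)
print(stat_z)  # noticeably larger KS distance
```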

Alecos Papadopoulos
  • https://en.wikipedia.org/wiki/Standard_score. For a sample, it says standardization should be done this way: $Z = \frac{\bar{X}-\operatorname{E}[X]}{\sigma(X)/\sqrt{n}}.$ Your explanation seems easier for me to understand. – Deep North Aug 14 '15 at 16:17
  • @DeepNorth Note that the Wikipedia expression refers to the _true_ mean (and standard deviation), which is a constant, not the _sample_ mean, which is a random variable that estimates the true mean. Likewise for the standard deviation. – Alecos Papadopoulos Aug 14 '15 at 18:25
  • Thank you very much, but I think when people say "standardize a **standard normal random** variate", doesn't that mean the true mean and true variance are already known? Anyway, I like your explanation. – Deep North Aug 15 '15 at 00:37
  • And I think when we already know $E(X)=0$ and replace $\sigma$ by $s$, $Z$ has a $t$ distribution by Wiki's method. I think we may then need the bootstrap to get different sample means. I checked the source code of the scale function, but it seems the function does not use the bootstrap. – Deep North Aug 15 '15 at 01:07
  • @DeepNorth The notation of the OP pointed towards using the sample moments, even though the true moments may be known; this is why most of the answers here explored this case, which admittedly may be no more than an entertaining curiosity. As for Wiki's method, note that it standardizes _the sample mean_, not each individual realization from the sample (which is what the OP was asking about). – Alecos Papadopoulos Aug 15 '15 at 01:34
5

The original standard normal variables have TRUE mean 0 (E(X) = 0) and are independent. By taking a set of them and dividing them by their standard deviation, you DO standardize them, but the result, ironically, isn't standard normal. They are dependent (because they share the denominator) and actually have t-distributions. So if you want standard normal, just stick with rnorm(5).

AlaskaRon
  • But when you standardize the original, they also share the same denominator, right? – Deep North Aug 14 '15 at 02:32
  • Can you explain why you say the values have t-distributions? I really don't think they do. – Glen_b Aug 14 '15 at 11:55
  • That is an interesting comment, @Glen_b. Evidently the question is referring to *samples* from a standard normal distribution. If we consider a sample of size $2$, then standardizing turns it into the dataset $(-1,1)$. That's certainly not a $t$ distribution! (Nor is it remotely normal for that matter... .) One might describe it as a "scaled Beta$(0,0)$" distribution. – whuber Aug 14 '15 at 13:56
  • @whuber To clarify the meaning behind my question -- to get a t-distribution, you'd have a 0-mean normally distributed numerator divided by (a constant times) the square root of {an (independent from the numerator) chi-square divided by its d.f}. But we don't actually have that here. $X_i-\overline{X}$ and $s_X$ are dependent since if $s$ is small, $X_i-\overline{X}$ must be small. e.g. see `plot(c(0,2.8),c(0,4),type="n"); jk=replicate(10000, {x=rnorm(5);num=x-mean(x);points(sd(x),num[1])})` – Glen_b Aug 14 '15 at 15:53
0

I just did some experiments. It seems that after applying scale again, you get data closer to having $\mu=0$ and $\sigma=1$.

set.seed(123)
x <- rnorm(1000,0,1)
mean(x)
sd(x)
y<-scale(x)
mean(y)
sd(y)

Results:

> mean(x)
[1] 0.01612787

> sd(x)
[1] 0.991695


> y<-scale(x)

> mean(y)
[1] -8.235085e-18

> sd(y)
[1] 1
Deep North
  • You seem to have discovered that standardizing (which is designed to create a zero mean and unit variance) makes the data have zero mean and unit variance. – whuber Aug 14 '15 at 02:13
  • Hehe, thanks, you seem to say I am coloring the red color with red color, anyway, English is not my native language. – Deep North Aug 14 '15 at 04:09
  • Having the sample mean be 0 and sample sd be 1 doesn't necessarily mean the distribution is closer to N(0,1). – Glen_b Aug 14 '15 at 11:53
  • Ok, I will change the text – Deep North Aug 14 '15 at 12:39
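Glen_b's caveat in the comments can be made concrete: scaling forces the sample mean to 0 and the sample sd to 1, but leaves the shape of the distribution untouched. A short Python illustration (my own example, using right-skewed exponential data instead of rnorm):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.exponential(size=100_000)  # strongly right-skewed data

# The scale() recipe: subtract the sample mean, divide by the sample sd.
y = (x - x.mean()) / x.std(ddof=1)

def skew(v):
    # Sample skewness: 0 for a normal, 2 for an exponential.
    return np.mean((v - v.mean()) ** 3) / np.std(v) ** 3

print(y.mean(), y.std(ddof=1))  # 0 and 1 up to rounding
print(skew(x), skew(y))         # both near 2: nothing like N(0, 1)
```

Scaling is an affine transformation, so it cannot fix skewness or kurtosis; only the first two moments change.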
0

Intuitive proof by counterexample

There are already some general answers that cover the question, but personally I find the following reasoning most easy to follow.

Suppose your sample size is 1.

Your definition of $X^*$ is as follows

$$X^*=\frac{x-\bar x}{sd(x)}$$

Because the sample size is 1, we have $\bar x = x$, so for any $x$ the expression reduces to

$$X^*=\frac{\bar x-\bar x}{sd(x)} =\frac{0}{0}$$

As $X^*$ is undefined, and hence certainly not normally distributed, for sample size 1, it can definitely not have a standard normal distribution in general.
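The degenerate case can also be seen mechanically: with one observation, the sample sd is $0/0$. A tiny Python check (numpy's `ddof=1` mirrors R's `sd()`, and the value 0.42 is arbitrary):

```python
import warnings
import numpy as np

x = np.array([0.42])  # a "sample" of size 1 (the value is arbitrary)

with warnings.catch_warnings():
    # numpy warns about "Degrees of freedom <= 0" here, much as
    # R's sd() returns NA for a length-1 vector.
    warnings.simplefilter("ignore", RuntimeWarning)
    sd = x.std(ddof=1)  # 0/0 -> nan

z = (x - x.mean()) / sd
print(sd, z)  # nan [nan]: the standardized value is undefined
```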

  • You have noted that normalization is *undefined* for a sample of size 1. That does not seem to have any implications for larger samples. @Glen_b has dealt with those cases in his answer. – whuber Aug 14 '15 at 14:51
  • @whuber I thought that showing that it is undefined would be sufficient for a counter example. ---- Sidenote: though the possibility is (infinitely) small, normalization can actually be undefined for a sample of any size. Not sure if that would improve my answer enough to satisfy you? – Dennis Jaheruddin Aug 14 '15 at 15:39
  • The problem with a situation where something is *always* undefined is that it leaves everybody wondering whether your conclusions are special to that situation or if they generalize. That's why this answer does not suffice. Your argument would be far more convincing when applied to samples of size two (or greater)--and that's exactly what @Glen_b's answer does. The fact that standardization can be undefined is not a theoretical problem when the underlying distribution is continuous, for then the chance of encountering such a situation is zero and therefore can be neglected. – whuber Aug 14 '15 at 17:07