1

I have a large set of variables measured on different scales/units, and want to standardize them to the same scale, with a mean of 0 and standard deviation of 1, so that I can run a PCA on them. I have heard that the NORM.S.DIST function in Excel is appropriate - can anyone verify this?

amoeba
  • 93,463
  • 28
  • 275
  • 317
meepmeep
  • 21
  • 1
  • 1
  • 3
  • 1
    PCA software worthy of use will standardise for you on request. Doing it yourself is pointless. – Nick Cox Aug 18 '13 at 14:39
  • @ Nick Cox, do you know if SPSS will do it? – meepmeep Aug 18 '13 at 14:46
  • If that's your question, please reverse accordingly. I imagine that SPSS is more than capable of doing with a PCA with a correlation matrix, but I have checked since the 1970s. – Nick Cox Aug 18 '13 at 14:48
  • @Nick Cox: Another point of confusion for me is the difference between doing PCA on a correlation vs covariance matrix - what I am reading suggests that doing it on a correlation matrix means you don't need to scale the variables prior to running the PCA (i.e. they're already scaled), whereas you do need to scale with the covariance matrix. Do you know if this is this correct? – meepmeep Aug 18 '13 at 15:15
  • First is right; second is wrong. Standardising before producing a covariance matrix would just produce the correlation matrix. But please don't new questions in comments. – Nick Cox Aug 18 '13 at 15:49

1 Answers1

5

Like this:

$$\frac{X - \bar{X}}{SD(X)}$$

This is often referred to as a Z-Score.

I believe NORM.S.DIST gives you access to the CDF and PDF for $Z \sim N(0,1)$. This is not the same as Z-Standardizing your variable (i.e. transforming your variable such that it is distributed according to $Z$ given that the variable was originally normally distributed).

Here's a demo in google docs.

David Marx
  • 6,647
  • 1
  • 25
  • 43
  • Yeah, so in Excel just use the formula: =(A1-AVERAGE(A\$1:A\$14))/STDEV(A\$1:A\$14) or use the standardize function =STANDARDIZE(A1,AVERAGE(A\$1:A\$14),STDEV(A\$1:A\$14)). There should be a dollar sign between each A and the row number. This lets you apply the formula over each row without Excel auto increasing it each time. – Twitch_City Aug 18 '13 at 14:48
  • Thanks @David Marx. Trying to determine which way to scale the data, and seems there are tons of different ways. Do you know if z-scores are any more or less valid than other ways of scaling? – meepmeep Aug 18 '13 at 14:50
  • @Twitch_City I think those excel formulas will be inefficient for a large number of observations since they have excel recalculate the mean and variance for each standardization operation. My updated post provides a demo where the mean and standardization need only be calculated once, which should result in a faster spreadsheet. Obviously it doesn't make a huge difference if the number of observations is small. – David Marx Aug 18 '13 at 14:53
  • @meepmeep z-standardization is pretty common with PCA and has some nice properties, but there are certainly other options. You may find this article interesting: http://link.springer.com/article/10.1007%2FBF01034742 – David Marx Aug 18 '13 at 14:56
  • @Nick Cox: Another point of confusion for me is the difference between doing PCA on a correlation vs covariance matrix - what I am reading suggests that doing it on a correlation matrix means you don't need to scale the variables prior to running the PCA (i.e. they're already scaled), whereas you do need to scale with the covariance matrix. Do you know if this is this correct? – meepmeep Aug 18 '13 at 15:01
  • meep, Please see the discussions on correlation vs. covariance at http://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance and – whuber Aug 21 '13 at 15:42