Say I have 300 samples from a population containing two groups, A and B, with data on several variables: 150 samples from Group A and 150 from Group B. However, I know that Group A makes up roughly 20% of the population and Group B roughly 80%, and that the two groups differ on the variables in question.

Is there a way to weight the PCA by cases to make it more representative of the population?

Would it be enough just to do a weighted standardization?

amoeba
J. Sweet

1 Answer

(Converting my comment into an answer so that this doesn't stay officially unanswered.)

I don't know of a dedicated procedure for this, although one may exist. However, it seems to me that this isn't really much of a problem, because PCA is more of a descriptive technique than an inferential one. We can contrast it with computing a simple product-moment correlation. If you have two variables, $X$ and $Y$, and you duplicate your data (so that you have two copies of every observation), the computed $r_{XY\ (2N)}$ is the same as the $r_{XY}$ computed on only the original $N$ rows. What changes is that the computed confidence interval around $r_{XY\ (2N)}$ would be too narrow, and the $p$-value would be too low. These effects occur because Pearson's $r$ can be seen as both a descriptive statistic and an inferential statistic. PCA doesn't really have that latter inferential attribute.

As a result, there is no harm in duplicating your data and running PCA: you should get the same eigenvectors, and the same eigenvalues up to the small change in the $n-1$ divisor of the covariance matrix. The implication, therefore, is that you can get a weighted PCA manually by replicating rows in proportion to the population shares. Include each of the $n_A$ rows twice (the original plus one copy) and each of the $n_B$ rows eight times (the original plus seven copies), so that your final dataset has $2\times n_A + 8\times n_B$ rows. Because $n_A = n_B$ here, Group A then makes up 20% of the rows and Group B 80%. Then run PCA on the enlarged dataset.
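
The replication argument can be checked numerically. What follows is only a minimal sketch, assuming NumPy and synthetic data: the group means, spreads, seed, and the helper names `pca_eig` and `same_pca` are all hypothetical and chosen for illustration. The covariance uses a divisor of $n$ rather than $n-1$ so that replication leaves the eigenvalues exactly unchanged.

```python
import numpy as np

def pca_eig(X):
    """Eigenvalues (descending) and eigenvectors of the covariance of X."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                 # center on the (unweighted) mean
    cov = Xc.T @ Xc / Xc.shape[0]           # assumption: divisor n, not n - 1
    vals, vecs = np.linalg.eigh(cov)        # eigh returns ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order]

def same_pca(p, q):
    """True if two PCA results agree, ignoring the sign of each eigenvector."""
    (v1, U1), (v2, U2) = p, q
    return bool(np.allclose(v1, v2)
                and np.allclose(np.abs(U1.T @ U2), np.eye(len(v1)), atol=1e-6))

# Synthetic stand-in for the question's data: 150 rows per group, 3 variables.
rng = np.random.default_rng(0)
A = rng.normal(loc=[0.0, 0.0, 0.0], scale=[1.0, 2.0, 0.5], size=(150, 3))
B = rng.normal(loc=[3.0, 1.0, -1.0], scale=[2.0, 1.0, 1.0], size=(150, 3))
X = np.vstack([A, B])

plain = pca_eig(X)
doubled = pca_eig(np.vstack([X, X]))        # every row appears twice
rep_2_8 = pca_eig(np.vstack([np.repeat(A, 2, axis=0), np.repeat(B, 8, axis=0)]))
rep_1_4 = pca_eig(np.vstack([A, np.repeat(B, 4, axis=0)]))

print(same_pca(plain, doubled))    # duplicating everything vs. the original
print(same_pca(rep_2_8, rep_1_4))  # 2:8 replication vs. 1:4 replication
print(same_pca(plain, rep_2_8))    # unweighted vs. population-weighted
```

Only the ratio of the replication counts matters, so replicating 1:4 is interchangeable with 2:8 and keeps the enlarged dataset smaller.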

gung - Reinstate Monica
  • +1 but note that this is equivalent to multiplying all A rows by $\sqrt{2}$ and all B rows by $\sqrt{8}$ and then proceeding as usual. In general, each sample point can have its own weight $w_i$ and to get weighted PCA one can multiply each row of $X$ by the corresponding $\sqrt{w_i}$. Then the covariance matrix $X^\top X/n$ will become $\sum_i w_i x_i x_i^\top/n$, i.e. will be weighted. One should also note that this will screw up the centering of $X$, so it probably makes sense to center after applying the weights (i.e. to subtract the weighted mean). – amoeba Apr 28 '16 at 21:00
  • Actually, I realized that it is covered in my answer to http://stats.stackexchange.com/questions/113485/, so I leave this link here to connect the two threads. – amoeba Apr 29 '16 at 13:02
  • Hmmm, we could make this a duplicate, @amoeba. – gung - Reinstate Monica Apr 29 '16 at 13:12
  • @amoeba, Thank you both for your helpful comments and connecting to previous thread. There is some useful information there as well! – J. Sweet May 02 '16 at 16:30
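
The row-scaling recipe from the comments (subtract the weighted mean, then multiply each row by $\sqrt{w_i}$) avoids building an enlarged dataset. A minimal sketch, assuming NumPy and synthetic data: the function name `weighted_pca`, the group parameters, and the seed are hypothetical, and normalizing by $\sum_i w_i$ instead of $n$ is an assumption that only rescales the eigenvalues. The result is compared with `np.cov` using `aweights` as an independent check.

```python
import numpy as np

def weighted_pca(X, w):
    """PCA with per-row weights w_i, via scaling centered rows by sqrt(w_i)."""
    X = np.asarray(X, dtype=float)
    w = np.asarray(w, dtype=float)
    mu_w = np.average(X, axis=0, weights=w)    # weighted mean of each variable
    Xs = (X - mu_w) * np.sqrt(w)[:, None]      # center first, then scale rows
    cov_w = Xs.T @ Xs / w.sum()                # assumption: divide by sum of weights
    vals, vecs = np.linalg.eigh(cov_w)         # ascending eigenvalues
    order = np.argsort(vals)[::-1]
    return vals[order], vecs[:, order], cov_w

# Synthetic stand-in for the question's data: 150 rows per group, 2 variables.
rng = np.random.default_rng(1)
A = rng.normal(loc=[0.0, 0.0], scale=[1.0, 2.0], size=(150, 2))
B = rng.normal(loc=[3.0, 1.0], scale=[2.0, 1.0], size=(150, 2))
X = np.vstack([A, B])

# Weights 2 and 8 mirror the replication counts; any 1:4 ratio would do.
w = np.concatenate([np.full(150, 2.0), np.full(150, 8.0)])

vals, vecs, cov_w = weighted_pca(X, w)

# Independent check against NumPy's weighted covariance (ddof=0 divides by sum(w)).
cov_ref = np.cov(X, rowvar=False, aweights=w, ddof=0)
print(np.allclose(cov_w, cov_ref))
```

Unlike replication, this form also accepts non-integer weights, for example when the population shares are not round numbers.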