Is there such thing as a case-weighted PCA?

Question

Say I have 300 samples from a population containing two groups, A and B, with data on several variables: 150 samples from Group A and 150 from Group B. However, I know that Group A makes up roughly 20% of the population and Group B roughly 80%, and the two groups differ on the variables in question.

Is there a way to weight the PCA by cases to make it more representative of the population?

Would it be enough just to do a weighted standardization?

Why not just duplicate your A data & duplicate your B data 7 times, & run PCA on the enlarged dataset? — gung - Reinstate Monica, Apr 14 '16 at 18:31

score 2 · Accepted Answer · answered Apr 28 '16 at 17:28

(Converting my comment into an answer so that this doesn't stay officially unanswered.)

I don't know of such a thing, but it may exist. However, it seems to me that this isn't really much of a problem. PCA is more of a descriptive technique than an inferential technique. We can contrast it with running a simple product moment correlation. If you have two variables, $X$ & $Y$, and you duplicated your data (such that you had two copies of every observation), the computed $r_{XY\ (2N)}$ wouldn't change relative to computing $r_{XY}$ on only the original $N$ rows. What would happen is that the computed confidence interval around $r_{XY\ (2N)}$ would be too narrow, and the $p$-value would be too low. These effects occur because Pearson's $r$ can be seen as both a descriptive statistic and an inferential statistic. PCA doesn't really have that latter inferential attribute. As a result, there is no harm in duplicating your data and running PCA: you should get the same eigenvectors and eigenvalues (the eigenvalues can shift by a factor very close to 1 if your software divides by $n-1$ rather than $n$). The implication, therefore, is that you can get a weighted PCA manually by stacking two copies of the $n_A$ rows and eight copies of the $n_B$ rows (that is, adding one extra copy of A and seven extra copies of B), so that your final dataset has $2\times n_A +8\times n_B$ rows and Group A accounts for 20% of them. Then run PCA on the enlarged dataset.
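To make the duplication idea concrete, here is a minimal NumPy sketch. The data are simulated and purely hypothetical (three variables, made-up group means), and the helper `pca_eig` is an illustrative name rather than a library function. It divides by $n$ so that duplicating the whole dataset leaves the eigenvalues exactly unchanged.

```python
import numpy as np

def pca_eig(X):
    """PCA via eigendecomposition of the covariance matrix (divisor n, not n - 1)."""
    Xc = X - X.mean(axis=0)                  # center each column
    cov = Xc.T @ Xc / len(Xc)                # covariance with divisor n
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Hypothetical samples: 150 rows per group, 3 variables, groups differ.
X_a = rng.normal(loc=0.0, scale=1.0, size=(150, 3))
X_b = rng.normal(loc=[2.0, -1.0, 0.5], scale=[1.0, 2.0, 0.5], size=(150, 3))
X = np.vstack([X_a, X_b])

# Sanity check: two copies of everything gives the same PCA as the original.
vals, vecs = pca_eig(X)
vals2, vecs2 = pca_eig(np.vstack([X, X]))
print(np.allclose(vals, vals2), np.allclose(np.abs(vecs), np.abs(vecs2)))

# Reweighting by duplication: 2 copies of A and 8 copies of B (20% / 80%).
X_big = np.vstack([np.tile(X_a, (2, 1)), np.tile(X_b, (8, 1))])
vals_big, vecs_big = pca_eig(X_big)
print(X_big.shape, vals_big)
```

The eigenvectors are compared through their absolute values because the sign of an eigenvector is arbitrary.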

+1 but note that this is equivalent to multiplying all A rows by $\sqrt{2}$ and all B rows by $\sqrt{8}$ and then proceeding as usual. In general, each sample point can have its own weight $w_i$ and to get weighted PCA one can multiply each row of $X$ by the corresponding $\sqrt{w_i}$. Then the covariance matrix $X^\top X/n$ will become $\sum_i w_i x_i x_i^\top/n$, i.e. will be weighted. One should also note that this will screw up the centering of $X$, so it probably makes sense to center after applying the weights (i.e. to subtract the weighted mean). — amoeba, Apr 28 '16 at 21:00
Actually, I realized that it is covered in my answer to http://stats.stackexchange.com/questions/113485/, so I leave this link here to connect the two threads. — amoeba, Apr 29 '16 at 13:02
@amoeba, Thank you both for your helpful comments and connecting to previous thread. There is some useful information there as well! — J. Sweet, May 02 '16 at 16:30
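The row-scaling approach from the comments can be sketched as follows, again with simulated, hypothetical data. The function name `weighted_pca` is illustrative. One assumption that differs slightly from the comment: the weights are normalized to sum to 1 (instead of dividing by $n$), and the weighted mean is subtracted before the rows are scaled by $\sqrt{w_i}$, so the result is directly comparable to PCA on the duplicated dataset.

```python
import numpy as np

def weighted_pca(X, w):
    """Case-weighted PCA: eigendecomposition of the weighted covariance matrix."""
    w = np.asarray(w, dtype=float)
    w = w / w.sum()                            # assumption: weights normalized to sum to 1
    mean_w = w @ X                             # weighted mean of each column
    Xs = (X - mean_w) * np.sqrt(w)[:, None]    # center, then scale row i by sqrt(w_i)
    cov_w = Xs.T @ Xs                          # sum_i w_i (x_i - mean)(x_i - mean)^T
    eigvals, eigvecs = np.linalg.eigh(cov_w)
    order = np.argsort(eigvals)[::-1]          # descending eigenvalues
    return eigvals[order], eigvecs[:, order]

rng = np.random.default_rng(0)
# Hypothetical samples: 150 rows per group, 3 variables, groups differ.
X_a = rng.normal(loc=0.0, scale=1.0, size=(150, 3))
X_b = rng.normal(loc=[2.0, -1.0, 0.5], scale=[1.0, 2.0, 0.5], size=(150, 3))
X = np.vstack([X_a, X_b])

# Weight 2 for each A row and 8 for each B row (a 20% / 80% split).
w = np.concatenate([np.full(150, 2.0), np.full(150, 8.0)])
vals_w, vecs_w = weighted_pca(X, w)

# Cross-check against explicit duplication (covariance with divisor n).
X_big = np.vstack([np.tile(X_a, (2, 1)), np.tile(X_b, (8, 1))])
Xc = X_big - X_big.mean(axis=0)
vals_big = np.sort(np.linalg.eigvalsh(Xc.T @ Xc / len(Xc)))[::-1]
print(np.allclose(vals_w, vals_big))
```

Unlike duplication, this version also accepts non-integer weights, for example 0.2/150 per A row and 0.8/150 per B row, without enlarging the dataset.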
