Minimum sample size for PCA or FA when the main goal is to estimate only a few components?

Question

I have a dataset with $n$ observations and $p$ variables (dimensions). Generally $n$ is small ($n = 12$ to $16$), while $p$ may range from small ($p = 4$ to $10$) to perhaps much larger ($p = 30$ to $50$).

I remember learning that $n$ should be much larger than $p$ in order to run principal component analysis (PCA) or factor analysis (FA), but it seems like this may not be so in my data. Note that for my purposes I am rarely interested in any principal components past PC2.

Questions:

What are the rules of thumb for minimum sample size when PCA is OK to use, and when it is not?
Is it ever OK to use the first few PCs even if $n=p$ or $n<p$?
Are there any references on this?
Does it matter if your main goal is to use PC1 and possibly PC2 either:
- simply graphically, or
- as a synthetic variable then used in regression?

I remember reading about this sort of guideline with respect to factor analysis. Are you also interested in that or only in PCA? Also, the answer might depend on the type of data you are dealing with; do you have a specific field of application in mind? — Gala, Dec 13 '12 at 08:16
Thanks Gael for the comments and references below. Now I am left needing to know the differences between FA and PCA. :) — Patrick, Dec 13 '12 at 16:04
This question has been treated extensively on this site, see e.g. http://stats.stackexchange.com/questions/1576/what-are-the-differences-between-factor-analysis-and-principal-component-analysi and http://stats.stackexchange.com/questions/612/is-psychprincipal-function-still-pca-when-using-rotation — Gala, Dec 13 '12 at 16:08

score 25 · Answer 1 · edited Dec 22 '14 at 15:15

For factor analysis (not principal component analysis), there is quite a literature calling into question some of the old rules of thumb on the number of observations. Traditional recommendations – at least within psychometrics – would be to have at least $x$ observations per variable (with $x$ typically anywhere from $5$ to $20$) so in any case $n \gg p$.

A rather thorough overview with many references can be found at http://www.encorewiki.org/display/~nzhao/The+Minimum+Sample+Size+in+Factor+Analysis

However, the main take-away message from recent simulation studies would probably be that the quality of the results varies so much (depending on the communalities, on the number of factors or the factors-to-variables ratio, etc.) that considering the observations-to-variables ratio is not a good way to decide on the required number of observations. If the conditions are auspicious, you might be able to get away with a lot fewer observations than old guidelines would suggest, but even the most conservative guidelines are too optimistic in some cases. For example, Preacher & MacCallum (2002) obtained good results with extremely small sample sizes and $p > n$, but Mundfrom, Shaw & Ke (2005) found some cases where a sample size of $n > 100 p$ was necessary. They also found that if the number of underlying factors stays the same, more variables (and not fewer, as implied by guidelines based on the observations-to-variables ratio) could lead to better results with small samples of observations.

Relevant references:

Mundfrom, D.J., Shaw, D.G., & Ke, T.L. (2005). Minimum sample size recommendations for conducting factor analyses. International Journal of Testing, 5 (2), 159-168.
Preacher, K.J., & MacCallum, R.C. (2002). Exploratory factor analysis in behavior genetics research: Factor recovery with small sample sizes. Behavior Genetics, 32 (2), 153-161.
de Winter, J.C.F., Dodou, D., & Wieringa, P.A. (2009). Exploratory factor analysis with small sample sizes. Multivariate Behavioral Research, 44 (2), 147-181.

(+1) Here is another paper, using simulation and real datasets, that suggests that the N/p rule-of-thumb does not perform very well in practice, and that provides sample sizes required to obtain a stable and accurate solution in EFA--controlling for various quality criteria--as a function of the number of factors and the number of items (and optionally the half-width of Cronbach's alpha 95% CI, based on Feldt's formula) in a psychiatric scale: [Sample size requirements for the internal validation of psychiatric scales](http://1.usa.gov/14w0ezL) Int J Methods Psychiatr Res. 2011 Dec;20(4):235-49. — chl, May 17 '13 at 10:05

score 24 · Accepted Answer · edited Mar 27 '13 at 20:09

You can actually measure whether your sample size is "large enough". One symptom of a sample size being too small is instability.

Bootstrap or cross-validate your PCA: these techniques disturb your data set by deleting/exchanging a small fraction of your sample and then build "surrogate models" for each of the disturbed data sets. If the surrogate models are similar enough (= stable), you are fine. You'll probably need to take into account that the solution of the PCA is not unique: PCs can flip (multiplying both a score and the respective principal component by $-1$ gives an equivalent solution). You may also want to use Procrustes rotation to obtain PC models that are as similar as possible.
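As a rough illustration of this kind of check, here is a minimal leave-one-out sketch in Python with NumPy. The data are simulated and the helper names (`first_pcs`, `loo_pc_similarity`) are made up for this example. It handles sign flips by taking the absolute cosine between each surrogate PC and the full-data PC, and it does not attempt a Procrustes alignment.

```python
import numpy as np

def first_pcs(X, k=2):
    """Return the first k principal axes (as rows) via SVD of the centered data."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Vt[:k]

def loo_pc_similarity(X, k=2):
    """Leave-one-out check: |cosine| between each surrogate PC and the full-data PC."""
    ref = first_pcs(X, k)
    n = X.shape[0]
    sims = np.empty((n, k))
    for i in range(n):
        surrogate = first_pcs(np.delete(X, i, axis=0), k)
        # Absolute value makes the comparison insensitive to sign flips.
        sims[i] = np.abs(np.sum(surrogate * ref, axis=1))
    return sims

# Hypothetical data: n = 16 observations, p = 10 variables,
# one strong latent direction plus unit-variance noise.
rng = np.random.default_rng(0)
n, p = 16, 10
latent = rng.normal(size=(n, 1))
loadings = rng.normal(size=(1, p))
X = 3.0 * latent @ loadings + rng.normal(size=(n, p))

sims = loo_pc_similarity(X, k=2)
print("worst-case similarity per PC:", sims.min(axis=0))
```

Values near 1 for a component suggest it is stable under deletion of single observations; values well below 1 suggest it is not. Comparing PCs one by one can be misleading when two eigenvalues are close, since the components may swap or rotate within their shared subspace; that is where a Procrustes-type alignment (for example `scipy.linalg.orthogonal_procrustes`) would come in.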

edited Mar 27 '13 at 20:09 · Jeremy Miles

answered Dec 13 '12 at 18:34 · cbeleites unhappy with SX

Thanks cbeleites. Do you think bootstrapping will be overly informative with n as low as, say, 16? To understand, I'd just be looking for relative stability by running many PCAs, leaving one site out each run. – Patrick Dec 14 '12 at 16:44
In that case it is certainly feasible to look at all 16 models that are disturbed by deleting one sample (or even at all 120 models that leave out 2 samples). I think with small $n$ I'd probably go for such a systematic cv-like approach. – cbeleites unhappy with SX Dec 14 '12 at 17:17

doctorate · Answer 3 · 2013-03-28T08:59:48.370

I hope this might be helpful:

for both FA and PCA

“The methods described in this chapter require large samples to derive stable solutions. What constitutes an adequate sample size is somewhat complicated. Until recently, analysts used rules of thumb like ‘factor analysis requires 5–10 times as many subjects as variables.’ Recent studies suggest that the required sample size depends on the number of factors, the number of variables associated with each factor, and how well the set of factors explains the variance in the variables (Bandalos and Boehm-Kaufman, 2009). I’ll go out on a limb and say that if you have several hundred observations, you’re probably safe.”

Reference:

Bandalos, D. L., and M. R. Boehm-Kaufman. 2009. “Four Common Misconceptions in Exploratory Factor Analysis.” In Statistical and Methodological Myths and Urban Legends, edited by C. E. Lance and R. J. Vandenberg, 61–87. New York: Routledge.

From "R in Action" by Robert I. Kabacoff, a very informative book with good advice covering almost all statistical tests.

It seems you are just plugging a book and rehashing some points made before based on a secondary or tertiary source. This does not seem very useful. Could you at least provide the full reference for Bandalos and Boehm-Kaufman, 2009? — Gala, Mar 27 '13 at 21:30

lcrmorin · Answer 4 · 2013-03-27T22:16:06.593

The idea behind the MVA inequalities is simple: PCA is equivalent to estimating the correlation matrix of the variables. You are trying to guess $\frac{p(p-1)}{2}$ coefficients (the matrix is symmetric) from $np$ data values. (That is why you should have $n \gg p$.)

The equivalence can be seen this way: each PCA step is an optimization problem. We are trying to find which direction expresses the most variance, i.e.:

$$ \max_{a_i} \; a_i^{T} \Sigma \, a_i $$

where $\Sigma$ is the covariance matrix,

under the constraints:

$$ a_i^{T} a_i = 1 $$ (normalization)

$$ a_i^{T} a_j = 0 $$ (for $j<i$, orthogonality with previous components)

The solutions of these problems are clearly eigenvectors of $\Sigma$ associated with their eigenvalues. I have to admit that I don't remember the exact formulation, but the eigenvectors depend on the coefficients of $\Sigma$. Modulo normalisation of the variables, the covariance matrix and the correlation matrix are the same thing.
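To make the link between the optimization problem and the eigenvectors concrete, here is a small numerical sketch in Python with NumPy. The data are simulated and all variable names are only illustrative. It checks that the variance along the leading eigenvector of $\Sigma$ equals the largest eigenvalue, that random unit vectors never do better, and that the correlation matrix is the covariance matrix of the standardized variables.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 5
# Hypothetical correlated data: independent normals mixed by a random matrix.
X = rng.normal(size=(n, p)) @ rng.normal(size=(p, p))

Sigma = np.cov(X, rowvar=False)            # sample covariance matrix (p x p)
eigvals, eigvecs = np.linalg.eigh(Sigma)   # eigenvalues in ascending order
a1 = eigvecs[:, -1]                        # unit eigenvector of the largest eigenvalue

# Variance along a1: a1^T Sigma a1, which equals the largest eigenvalue.
var_a1 = a1 @ Sigma @ a1

# Variance along 1000 random unit vectors, for comparison.
trials = rng.normal(size=(1000, p))
trials /= np.linalg.norm(trials, axis=1, keepdims=True)
var_trials = np.einsum("ij,jk,ik->i", trials, Sigma, trials)

# Correlation matrix = covariance matrix of the standardized variables.
Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
corr_from_cov = np.cov(Z, rowvar=False)

print("variance along a1:", var_a1, "largest eigenvalue:", eigvals[-1])
print("best random direction:", var_trials.max())
print("matches np.corrcoef:", np.allclose(corr_from_cov, np.corrcoef(X, rowvar=False)))
```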

Taking $n = p$ is more or less equivalent to guessing a value from only two data points... it's not reliable.

There are no rules of thumb; just keep in mind that PCA is more or less the same thing as guessing a value from $2\frac{n}{p}$ values.
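Taking this counting argument at face value, the ratio of data values to correlation coefficients can be worked out for the sample sizes in the question. The specific $n$ and $p$ below are picked from the ranges given there, purely as an illustration:

$$ \frac{np}{p(p-1)/2} = \frac{2n}{p-1} \approx \frac{2n}{p} $$

$$ n = 16,\; p = 10: \quad \frac{16 \cdot 10}{45} = \frac{160}{45} \approx 3.6 $$

$$ n = 16,\; p = 50: \quad \frac{16 \cdot 50}{1225} = \frac{800}{1225} \approx 0.65 $$

So under this heuristic, with $p = 50$ there is less than one data value per coefficient to estimate.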

edited Mar 27 '13 at 22:16 · answered Mar 27 '13 at 20:58 · lcrmorin

Could you be more specific about the sense in which PCA is "equivalent" to estimating a correlation matrix? Suppose I stop my PCA after $k$ principal components. That requires estimating $k$ eigenvalues and $(p-1)+(p-2)+\cdots+(p-k)$ independent eigenvector coefficients, all totaling less than $pk$ parameters, which could be quite a bit less than $p(p-1)/2$. – whuber Mar 27 '13 at 21:28
The point is you are calculating (p-k) coefficients of eigenvectors from p(p-1)/2 coefficients of the matrix. For a random matrix, I don't think there is a way to "skip" some coefficients calculating eigenvectors/eigenvalues. – lcrmorin Mar 27 '13 at 22:23
Sure there is: the usual algorithms find the eigenvalues and the eigenvectors one at a time, from the largest eigenvalue on down. Besides, this is not a computational issue, but one of counting the number of estimated values--unless I misread your answer? – whuber Mar 27 '13 at 22:30
