What is the minimum sample size for using exploratory factor analysis to reduce a pool of questionnaire items?

Question

Context: In my real experiment I am planning to include a questionnaire. I aim to measure 4 different constructs with multiple questions per construct. The questionnaire now consists of 24 items. I created more questions than I need with the idea of doing a pre-test and then reducing the number of questions to 12 based on the results of factor analysis.

I was thinking of having a sample of 30 people participate in my pre-test, but that is just a number I came up with.

Question: What is the minimum sample size required for exploratory factor analysis in order to reliably refine a multi-factor questionnaire?

This question [has already been answered](http://stats.stackexchange.com/questions/45820/pca-and-small-number-of-observations/45827#45827). In short: 30 is very small. Traditional guidelines would certainly say that you need more, at least around 100-200, but it is really difficult to know exactly, as the stability of the results will also depend on the number of factors and the communalities. — Gala, Jul 09 '13 at 05:56
Well yes, I understand that for a real questionnaire a high N is desirable. Is there no way to test the questions in a pre-test properly before using them for real? That is, to see/indicate if the questions are actually measuring the constructs you intend? — user1747079, Jul 09 '13 at 06:26
The problem is that with a small sample the sample correlations are highly variable and the results are not stable: Items might appear to relate to another factor, etc. Either you have a stable solution and you can indeed conclude that different items are or aren't related or you just have an uninterpretable mess. In any case, my answer reflects the literature on this. — Gala, Jul 09 '13 at 06:44
Of course, you can always do whatever you want (I have done it myself), but the fact that it's a pretest doesn't mean you're going to be OK or that you won't face criticism for it later on if you try to publish your results. Sorry to be harsh, but talk of “not a real questionnaire” (what is it then? fictional?) or “I don't pretend that to be scientific” (I have heard that one often elsewhere) does not change anything about the problem. — Gala, Jul 09 '13 at 06:46
In fact, you could even argue that it's the other way around: You need a bigger sample size for an exploratory study because you have more items, probably some nuisance factors, you don't know yet if the communalities are high, etc. Validation studies for personality scales typically involve thousands of people but once you have a good scale, it's supposed to be good for very small groups (e.g. experiments) or even single measurements (e.g. personnel selection). — Gala, Jul 09 '13 at 06:50
@Gael It seems to me that your answer on the other question is more directly relevant to this question than to the one where it was originally posted. I.e., I don't think this is a duplicate. I think the previous question focused more on extracting the major principal components for data simplification, whereas this question focuses more on the traditional question of assessing scale structure, which you tackle in your answer on the other question. — Jeromy Anglim, Jul 09 '13 at 08:57
True but, as I said, the question has already been answered. Could/should the answer be moved? — Gala, Jul 09 '13 at 09:02
@Gael I'm not sure. It's a bit of a strange situation. I just didn't want the question to be closed. Just a thought: if you wanted, you could add a paraphrase here with a link to your other answer. +1 to your answer, by the way. — Jeromy Anglim, Jul 09 '13 at 11:48
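
To illustrate the point made in the comments above, that sample correlations are highly variable in small samples, here is a minimal simulation sketch. It is not part of the original discussion: the true correlation of 0.4, the bivariate normal model, the number of simulations, and the function name `correlation_spread` are all assumptions chosen for illustration.

```python
import numpy as np

def correlation_spread(n, rho=0.4, n_sims=2000, seed=0):
    """Draw n_sims samples of size n from a bivariate normal with true
    correlation rho; return the 2.5th and 97.5th percentiles of the
    sample correlation r. (rho and n_sims are illustrative assumptions.)"""
    rng = np.random.default_rng(seed)
    cov = np.array([[1.0, rho], [rho, 1.0]])
    rs = np.empty(n_sims)
    for i in range(n_sims):
        x = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        rs[i] = np.corrcoef(x, rowvar=False)[0, 1]
    return np.percentile(rs, [2.5, 97.5])

for n in (30, 200):
    lo, hi = correlation_spread(n)
    print(f"n={n}: 95% of sample correlations fall in [{lo:.2f}, {hi:.2f}]")
```

Under these assumptions the interval at n = 30 is much wider than at n = 200. Since factor analysis works from the whole matrix of such correlations, this noise carries over into the loadings.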

score 1 · Answer 1 · answered Jul 09 '13 at 11:24


I don't think factor analysis is the way to go here at all, regardless of the number in your sample. The goal of factor analysis is to find latent variables that are linear combinations of the scores on observed variables (your questions).

The somewhat similar technique of principal component analysis is data reduction, but not through elimination of questions: Each component (like each factor) will be a linear combination of all the variables.

The idea of a pre-test to reduce the number of questions is a good one, but the way to get rid of questions is to look at correlations, item analysis, reliability, expert review and so on, not factor analysis.
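
As a rough sketch of the item analysis and reliability checks mentioned above, the following computes Cronbach's alpha and corrected item-total correlations for the items of a single scale. The data are made up (six items for one construct, the last one deliberately pure noise), and the function names are hypothetical. It shows one common way to do this, not a prescribed procedure.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an (n_respondents, n_items) array of
    scores belonging to ONE scale."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the sum score
    return k / (k - 1) * (1 - item_vars.sum() / total_var)

def corrected_item_total(items):
    """Correlation of each item with the sum of the REMAINING items."""
    items = np.asarray(items, dtype=float)
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

# Made-up data (assumption): 200 respondents, 6 items driven by one trait,
# except the last item, which is replaced by pure noise.
rng = np.random.default_rng(1)
trait = rng.normal(size=200)
scores = trait[:, None] + rng.normal(size=(200, 6))
scores[:, 5] = rng.normal(size=200)

print("alpha, all items:       ", round(cronbach_alpha(scores), 3))
print("alpha, noise item dropped:", round(cronbach_alpha(scores[:, :5]), 3))
print("corrected item-total r: ", corrected_item_total(scores).round(2))
```

An item with a low corrected item-total correlation, whose removal raises alpha, is a candidate for deletion. Note that this check already assumes which items belong to which scale.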


Peter Flom


I disagree. Factor analysis is a pretty standard tool for that in psychology, both directly, using the loadings/factor matrix, and indirectly, to define multi-item scales on which you can conduct other types of analyses (reliability, item-whole correlation). In practice, I don't see how your suggestions would make it possible to avoid examining or assuming a factor structure first. – Gala Jul 09 '13 at 11:50
`to find latent variables that are linear combinations of the scores on [of?] observed variables` I don't quite get what you mean, @Peter, but allow me to remind you that theoretically a factor is _not_ a linear combination of the variables (while a component is), because the uniqueness is separated out at extraction. Factor scores computed by regression from the variables are linear combinations of them, but they are distorted "good guess" factor scores. – ttnphns Jul 09 '13 at 14:04
Hi @ttnphns I think the factor scores still have to be linear combinations of the variables; after all, the regressions are. But the linear combination for a factor isn't as easy to get from the output as it is for a component (where it is right in the loadings matrix). After all, what else is there to put into the factor except the variables, multiplied and added? – Peter Flom Jul 09 '13 at 21:20

@Peter, I'll expand on my point a bit (it is a well-known fact). One facet of the difference between PCA and FA is that a component is an _error-free_ linear combination of all the variables, whereas a true factor is not. The factor computed by regression (the factor scores) is, but it is not the true factor, only an approximation. Proof: the correlation between factor scores (in contrast with component scores) and a variable isn't exactly the loading, which is the true correlation. We can never compute _true_ factor scores, because we don't know the uniqueness values at the case level. – ttnphns Jul 09 '13 at 21:47
Hi @ttnphns OK, so, what you're saying is that the factor scores that we get out of a program aren't the real factor scores, only approximations, but those approximations are linear combinations? Also, isn't correlation of factors a matter of rotation? Or are you saying that orthogonal factors will have some non-zero correlation since they are approximations? – Peter Flom Jul 09 '13 at 21:53
1) Yes, that is what I said: factor scores are only approximations, and they are exact linear combinations of the variables (simply because we estimated them by a linear regression approach). 2) By "correlation" I meant the loading, the correlation between a factor and a variable (rotation has nothing to do with the topic we are discussing). `orthogonal factors [the scores] will have some non-zero correlation [between them] since they are approximations` Yes, by the way: they will correlate to some extent, unless we compute the scores by the Anderson-Rubin modification of the regression method (is A-R the default in R?) – ttnphns Jul 09 '13 at 22:07
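
The points in this comment exchange can be checked numerically at the population level. The sketch below assumes a hypothetical two-factor orthogonal model with made-up loadings and uses the regression method for factor scores, with weights W = R⁻¹Λ. It then derives what the estimated scores imply, without simulating any data.

```python
import numpy as np

# Hypothetical population model (assumption): 6 standardized items,
# 2 orthogonal factors, loadings chosen only for illustration.
loadings = np.array([
    [0.8, 0.1],
    [0.7, 0.2],
    [0.6, 0.0],
    [0.1, 0.7],
    [0.2, 0.6],
    [0.0, 0.5],
])
uniqueness = 1.0 - (loadings ** 2).sum(axis=1)
R = loadings @ loadings.T + np.diag(uniqueness)  # implied item correlation matrix

# Regression-method factor scores are linear combinations: scores = Z @ W,
# where Z holds the standardized items and W = R^{-1} @ loadings.
W = np.linalg.solve(R, loadings)

cov_scores = W.T @ R @ W        # covariance matrix of the estimated scores
cov_items_scores = R @ W        # covariance between items and estimated scores
sd_scores = np.sqrt(np.diag(cov_scores))

corr_items_scores = cov_items_scores / sd_scores
corr_between_scores = cov_scores[0, 1] / (sd_scores[0] * sd_scores[1])

print("SD of estimated scores (true factors have SD 1):", sd_scores.round(3))
print("item-score correlations:\n", corr_items_scores.round(3))
print("loadings:\n", loadings)
print("correlation between the two estimated scores:", round(corr_between_scores, 3))
```

In this setup the covariance between items and estimated scores reproduces the loadings. The estimated scores have variance below 1, however, so their correlations with the items differ from the loadings. The two sets of scores are also slightly correlated with each other, even though the underlying factors are orthogonal.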
