
I ran a Multiple Factor Analysis (MFA) on a data set with 3,924 rows and 96 columns, of which six are (unordered) categorical with 12–14 categories each; the rest are numeric, mean-centered and scaled to unit standard deviation. My goal is dimension reduction, in order to visualize the results of PAM clustering by plotting the first two or three dimensions, coloring the points by assigned partition, and highlighting each medoid.
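
For reference, here is a simplified sketch of the setup (not my exact code; it assumes the 90 numeric columns come first and the six categorical columns last, and `k = 4` for PAM is just a placeholder):

    ## Sketch only: `df` is the full data frame, numeric columns first.
    library(FactoMineR)  # MFA()
    library(cluster)     # pam()

    ## One group for the 90 scaled numeric columns ("s"), one for the six
    ## nominal columns ("n"), which MFA expands into dummy variables internally.
    res.mfa <- MFA(df,
                   group      = c(90, 6),
                   type       = c("s", "n"),
                   name.group = c("numeric", "categorical"),
                   graph      = FALSE)

    res.mfa$eig  # eigenvalues and percentages of variance, as shown below

    ## PAM on the retained MFA coordinates, then plot the first two
    ## dimensions colored by partition, with the medoids highlighted.
    coords <- res.mfa$ind$coord
    fit    <- pam(coords, k = 4)
    plot(coords[, 1:2], col = fit$clustering, pch = 16)
    points(coords[fit$id.med, 1:2], pch = 8, cex = 2)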

I found that no single dimension of the PCA space explains more than a small fraction of the variance in the data:

       eigenvalue percentage of variance cumulative percentage of variance
comp 1  1.0350075               2.466873                          2.466873
comp 2  0.8243004               1.964666                          4.431539
comp 3  0.8093599               1.929057                          6.360596
comp 4  0.7587070               1.808329                          8.168924
comp 5  0.6495978               1.548274                          9.717198
comp 6  0.6328384               1.508329                         11.225527

What should I make of this situation? Can I still use the first two PCA dimensions as a quick 2D approximation of the data set, or will they just fail to represent the data accurately?

Is there an alternative dimension reduction technique I could or should use? All of the reviews of nonlinear dimension reduction I've read were somewhat equivocal about their usefulness compared to PCA, except on fabricated data like the Swiss roll data set, so I've been hesitant to use them.

Edit: here are the PCA results from just the numerical variables:

        eigenvalue percentage of variance cumulative percentage of variance
comp 1   5.1704992              5.7449991                          5.744999
comp 2   4.0469449              4.4966055                         10.241605
comp 3   3.8800122              4.3111247                         14.552729
comp 4   3.0606430              3.4007144                         17.953444
comp 5   2.7176048              3.0195609                         20.973005
comp 6   2.4725503              2.7472781                         23.720283
  • It appears that the variables (96 columns) are not related to each other. What does the correlation matrix suggest? Are there any significant correlations, at least among the numeric variables? If so, what are the correlation coefficients (r values) like? It will be a big matrix and would take some time to check. – rnso Apr 09 '15 at 04:45
  • @rnso they're very weakly correlated overall. I hadn't even thought to put that together. If the variables are very weakly correlated, then the variance-maximizing basis is not much different from the original basis, right? Maybe a nonlinear technique would be better after all (a quick check of the correlation structure is sketched after these comments). – shadowtalker Apr 09 '15 at 04:57
  • Principal component analysis is generally for numeric variables only. How are the categorical variables being handled here? It may be worth investigating whether any of these categorical variables relate to any of the numeric variables (using unpaired t-tests), or using multivariate techniques like multidimensional scaling. – rnso Apr 09 '15 at 05:41
  • @rnso I think `FactoMineR` is converting them to dummy variables, and I'm fine with that. I admit that I don't fully understand how MFA works, but it seems to be designed explicitly for the purpose of grouping variables so that the dummy "batches" are treated as a coherent unit – shadowtalker Apr 09 '15 at 05:49
  • It may be useful to convert all categorical variables to numeric (e.g. with the R command `var1 = as.numeric(var1)`) and try a simple principal component analysis using `res = prcomp(mydf, scale = TRUE); res; biplot(res)`. It may be helpful if you post the output of `res` and this plot here. – rnso Apr 09 '15 at 06:21
  • @rnso that imposes the assumption that the categories are ordered. But they're decidedly _not_ ordered, so changing the ordering here would arbitrarily change the results. What I can do instead is break each categorical variable into a batch of dummy variables, as in the sketch below; again, I think that is what `MFA` does internally. – shadowtalker Apr 09 '15 at 13:42
  • What are the results of a simple PCA on the numeric data only (excluding categorical variables)? I want to know if that also shows the first and second components explaining very little variance. – rnso Apr 09 '15 at 13:52
  • @rnso I added that to my question. Dim 1 explains about 5.7% – shadowtalker Apr 09 '15 at 14:57
  • It is remarkable that all 96 variables in your data are more or less independent. – rnso Apr 09 '15 at 15:27
  • @rnso I agree; it's going to make for a very bizarre lit review at the end of this paper. – shadowtalker Apr 09 '15 at 17:49
  • @rnso You could also pretty easily try using Spearman correlation or mutual information to generate your correlation (or, more generally, similarity) matrix and get a sense for whether nonlinear approaches would be more successful. If you decide to try some nonlinear methods for dimension reduction, [NMF](http://www.nature.com/nature/journal/v401/n6755/abs/401788a0.html) may also be worth considering. – Keith Hughitt Aug 31 '16 at 11:26
  • @KeithHughitt that's actually a very nice idea. I just wish you'd posted that a year and a half ago :) – shadowtalker Aug 31 '16 at 11:34
  • Ha! I was just scrolling down the list of posts on CrossValidated and this was one of the ones on the first page -- I didn't even think to check the date! Hope that your analysis went well in the end :) – Keith Hughitt Aug 31 '16 at 11:44
  • Were you able to solve this problem? If so, how? Thanks. – Daniela Jan 27 '21 at 15:52
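
To make the checks suggested in these comments concrete, a rough sketch might look as follows (assuming `df` as in the question, with the six unordered factors in the last six columns): it dummy-codes the categoricals rather than applying `as.numeric()`, then summarizes the off-diagonal correlation strengths, including the rank-based Spearman variant mentioned above.

    ## Sketch only: `df` has 90 numeric columns followed by six factors.
    ## Dummy-code the categorical variables instead of as.numeric(), which
    ## would impose an arbitrary ordering on the categories. (Note that
    ## model.matrix() keeps a reference level for factors after the first.)
    dummies <- model.matrix(~ . - 1, data = df[, 91:96])
    X <- cbind(scale(df[, 1:90]), dummies)

    ## Summarize the off-diagonal (Pearson) correlation strengths.
    r <- cor(X)
    summary(abs(r[upper.tri(r)]))  # values near 0 => weakly correlated

    ## Rank-based alternative, as suggested above.
    r.s <- cor(X, method = "spearman")
    summary(abs(r.s[upper.tri(r.s)]))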

1 Answer


Despite the term multiple factor analysis (MFA) being used to describe the analysis you've performed, it looks to me like a standard PCA approach (or, at best, FA via PCA), which focuses on principal components. Instead, I suggest you use exploratory factor analysis (EFA) and then confirmatory factor analysis (CFA), both of which take a latent-variable approach. EFA serves as an alternative dimensionality reduction technique with the added benefit of discovering a latent factor structure, which has more explanatory power. Let me know if you need further help.
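
For instance, a minimal EFA sketch in R with the `psych` package (numeric variables only here; `num_df` and `nfactors = 5` are placeholders, with the number of factors to be chosen via `fa.parallel()`):

    ## Minimal EFA sketch; `num_df` stands for the 90 numeric columns.
    library(psych)
    library(GPArotation)  # provides the oblimin rotation used below

    fa.parallel(num_df, fm = "ml")  # suggests how many factors to retain
    efa <- fa(num_df, nfactors = 5, rotate = "oblimin", fm = "ml")
    print(efa$loadings, cutoff = 0.3)  # inspect the latent factor structure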

  • Yes, it's just PCA that is generalized to "groups" of variables. I'm not sure why the name "multiple factor analysis" was chosen; it seems ill-suited. As for EFA instead of PCA, I can try it but [it doesn't seem like it would help much](http://stats.stackexchange.com/a/126584/36229). – shadowtalker Apr 09 '15 at 04:22
  • @ssdecontrol: I'm aware of the answer that you referenced. However, based on both that answer and [this answer](http://stats.stackexchange.com/a/123136/31372), PCA $\to$ FA as $n \to \infty$, or under certain other conditions. Thus, I would give EFA a try and share the findings. – Aleksandr Blekh Apr 09 '15 at 04:58
  • I just tried it with the `FAMD` function in R's `FactoMineR` package, and it ran for a long time without finishing. Is there another, faster implementation that can handle mixed numerical and unordered categorical data? – shadowtalker Apr 09 '15 at 05:07
  • @ssdecontrol: Sure, there are some options. For EFA, which you're currently interested in, you could start with the standard function `stats::factanal()`. However, I recommend the `psych` package as a better alternative (you might want to load `GPArotation` if you want to try some rotations not included in `psych`). Since your data set contains a mixture of continuous and categorical variables, you need to calculate polychoric correlations; for that, use the `hetcor()` function from the `polycor` package, as sketched after these comments. – Aleksandr Blekh Apr 09 '15 at 05:24
  • Polychoric correlation still assumes that the categorical variables are ordered. The issue remains that there is no way to calculate a correlation when unordered categorical variables are involved; `FactoMineR` provides a workaround by allowing the user to specify "groups" of dummy variables that will be analyzed together. – shadowtalker Apr 09 '15 at 13:49
  • @ssdecontrol: True. You didn't mention in your question that the categorical variables are unordered, and I missed that in your comment above. You may want to check whether `Mplus` can handle this scenario, or try to figure out why `FactoMineR` hasn't worked as expected. – Aleksandr Blekh Apr 09 '15 at 14:54
  • It might be the case that I'm either not using `FactoMineR` correctly, or that it's just an expensive algorithm. I think I need to do some reading up on what exactly it does – shadowtalker Apr 09 '15 at 14:59
  • @ssdecontrol: Sounds like a good idea. You can even take a look at the [source code](http://github.com/cran/FactoMineR). – Aleksandr Blekh Apr 09 '15 at 15:04
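
For completeness, the `polycor` + `psych` workflow suggested in these comments might look like the sketch below (`nfactors = 5` is again a placeholder), with the caveat raised above that `hetcor()` treats factors as ordered (polychoric/polyserial), which does not fit unordered categories:

    ## Sketch of the suggested workflow; see the caveat above about
    ## unordered factors, which this approach does not handle correctly.
    library(polycor)  # hetcor()
    library(psych)    # fa()

    het <- hetcor(df)  # mixed Pearson / polyserial / polychoric correlations
    efa <- fa(het$correlations, nfactors = 5, n.obs = nrow(df), fm = "ml")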