Factor analysis problem -- singular covariance matrix?

Question

I'm having trouble performing factor analysis on my dataset.

When I perform the factor analysis in SPSS (default settings), it works fine. Problem is, I need to do it programmatically (in Python). When I try using Python (MDP library) to do factor analysis on the same dataset, I get this error:

"The covariance matrix of the data is singular. Redundant dimensions need to be removed"

Upon looking into the MDP documentation, it says "...returns the Maximum A Posteriori estimate of the latent variables." Being a factor analysis newbie, I wasn't too clear on what this meant, but I tried changing the default extraction method in SPSS from "principal components" to "maximum likelihood". Then, in SPSS, I get the error:

"This matrix is not positive definite."

Are these two errors the same thing? Regardless, what can I do to fix my dataset so that the covariance matrix is not singular?

Thanks!

edit: OK, so I was trying to keep things simplified, but perhaps it's better to just explain everything from the start.

I have a series of documents. Yes, I'm only using 9 documents as a simple test case, but my final objective will be to use it on a much larger corpus.

I've built a term-document matrix, performed tf-idf, and did SVD-- mostly with the help of blog.josephwilk.net/.../latent-semantic-analysis-in-python.html

Now I have a reconstructed matrix, and I want to sort the documents into categories. So, I tried using factor analysis. In fact, it seems to work-- when I put it in SPSS, the factor loadings indicate that the documents are grouped the way I thought they should be, and the loadings are higher than if I hadn't performed SVD. (Although I think technically, SPSS is doing PCA even though it's under the 'Factor Analysis' heading).

I tried using MDP's PCANode, but that doesn't seem to give me anything close to what I want. Strangely, if I transpose my matrix, the factor analysis does work (it will group the terms, instead of the documents).

Hopefully this all makes a little more sense now...

What do you mean by the "reconstructed matrix"? If you do the SVD, keep a few components, and then do PCA, the results should be identical. — JMS, May 23 '11 at 01:14
The LSA code I was using (from Joseph Wilk's blog, in my post) "reconstructs" the matrix after SVD. To be honest, I don't really know much about SVD or what it's doing. I'm still having some problems, but I think we've moved beyond my original question so I'll mark this solved. Thanks! — Jeff, May 24 '11 at 21:25
I imagine what it's doing is "reconstructing" the matrix from the first few singular vectors. The link to the blog post is broken btw. But that would make absolutely no difference if you go on and do PCA to the reconstructed matrix! Perhaps it's centering/scaling for you, then you would get different results. I'd look into it further; it seems like you want to be working with the scores from LSA to do clustering/categorization/etc — JMS, May 25 '11 at 03:44
Link works for me, but I think I'm going to explore gensim further and see if that pans out. Thanks again! — Jeff, May 25 '11 at 03:48
Weird - link isn't a link at all for me. Gensim is a pretty mature piece of kit, you should do well with it. If you have to process any corpora, the Natural Language Toolkit (aka NLTK) has some useful bits & is super well documented, although it's more NLP oriented. Good luck :) — JMS, May 25 '11 at 03:55
http://blog.josephwilk.net/projects/latent-semantic-analysis-in-python.html — Jeff, May 25 '11 at 04:01
Thanks. My suspicions were correct; PCA on that reconstructed matrix should give you back what you got from LSA - that's not what you want. There's a little snippet about the relationship between PCA/SVD here: http://www2.imm.dtu.dk/pubdb/views/edoc_download.php/4000/pdf/imm4000.pdf. Googling will undoubtedly turn up more. Using $V$ to categorize documents would be more interesting - then you're using scores on the components from LSA instead of raw word counts. — JMS, May 25 '11 at 04:29
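
The point in the comments above, that the "reconstructed" matrix carries no more information than the LSA scores themselves, can be checked numerically. The following is a minimal sketch with NumPy, not the blog's code: the random matrix `A`, its orientation (terms in rows, documents in columns), and the choice `k = 3` are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical term-document matrix: 30 terms (rows) x 9 documents (columns).
A = rng.random((30, 9))
k = 3  # number of singular components kept (illustrative choice)

# Thin SVD: A = U @ diag(s) @ Vt
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k "reconstructed" matrix, as LSA tutorials typically build it.
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Document scores on the k components: one row per document, k columns.
doc_scores = (s[:k, None] * Vt[:k, :]).T

# Pairwise Euclidean distances between documents, computed two ways:
# from the columns of the reconstructed matrix and from the k-dim scores.
d_recon = np.linalg.norm(A_k[:, :, None] - A_k[:, None, :], axis=0)
d_scores = np.linalg.norm(doc_scores[:, None, :] - doc_scores[None, :, :], axis=2)

print("rank of reconstruction:", np.linalg.matrix_rank(A_k, tol=1e-8))
print("distances agree:", np.allclose(d_recon, d_scores))
```

Because the columns of `U[:, :k]` are orthonormal, distances between documents are the same in both representations, so clustering or categorizing on the 9 x k score matrix is equivalent to working with the full reconstruction and is much smaller.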

score 6 · Accepted Answer · answered May 22 '11 at 04:08

Yes, the two errors amount to the same thing. They're telling you (roughly) that two or more of your manifest variables are linearly dependent (like $y_1 = ay_2 + b$ for scalars $a, b$). Such variables (dimensions) are "redundant": the sample covariance matrix is not invertible (i.e., it is singular) and is therefore only positive semidefinite, not positive definite.

As for what you ought to do about it, that depends. First I would try to find out which variables are giving you the trouble; a scatterplot matrix might be enough to tell you that. Then you can decide what to do from there - most likely dropping some redundant variables.
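
For more than a handful of variables a scatterplot matrix gets unwieldy, so a numerical check may help. Here is a rough sketch with NumPy; the synthetic data, the tolerance `1e-8`, and the greedy column scan are illustrative assumptions, not a standard routine from MDP or SPSS.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 50 observations of 4 variables, where the last
# column is an exact linear function of the first (y4 = 2*y1 + 3).
X = rng.normal(size=(50, 3))
X = np.column_stack([X, 2.0 * X[:, 0] + 3.0])

cov = np.cov(X, rowvar=False)                # 4 x 4 sample covariance matrix
rank = np.linalg.matrix_rank(cov, tol=1e-8)  # numerical rank
eigvals = np.linalg.eigvalsh(cov)            # eigenvalues in ascending order

print("rank:", rank, "of", cov.shape[0])
print("smallest eigenvalue:", eigvals[0])    # numerically zero => singular

# Greedy scan: keep a column only if it raises the rank of the centered data.
Xc = X - X.mean(axis=0)
keep = []
for j in range(Xc.shape[1]):
    if np.linalg.matrix_rank(Xc[:, keep + [j]], tol=1e-8) > len(keep):
        keep.append(j)
redundant = [j for j in range(Xc.shape[1]) if j not in keep]
print("redundant columns:", redundant)
```

Which column gets flagged depends on the scan order: here the fourth column is reported because the first was seen earlier, but dropping either one would remove the dependence.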

JMS

Thanks, this is helpful, but I'm not exactly sure dropping variables is the right thing to do? I'm trying to perform LSA on a set of documents, so my matrix is quite big (9 x ~6700). I'm sure there are some columns that are redundant (e.g. several words that appear in only one document). Does it really make sense to delete duplicate columns here? – Jeff May 22 '11 at 04:36
Hmm, I tried deleting all duplicate columns (several thousand of them) and I still get the same error... – Jeff May 22 '11 at 05:40
@Jeff Do you mean that you performed ML-based FA with 9 statistical units (rows) and 6k+ variables (columns)? Because, in this case, it won't work. – chl May 22 '11 at 10:29
Performing LSA (I assume you mean latent semantic analysis) uses the SVD; it's basically principal components. You don't want to be doing factor analysis. @chl is correct in that it isn't even possible here. If you want to use Python have a look at Gensim. But why only 9 documents? You're unlikely to get much. – JMS May 22 '11 at 15:12
Gensim is here: http://nlp.fi.muni.cz/projekty/gensim/ – JMS May 22 '11 at 15:14
@Jeff Also, the reason why the error persists (I assume) is that you still have many more terms than documents, so the sample covariance matrix is still singular. – JMS May 22 '11 at 15:16
OK, so I was trying not to include extraneous details in my question, but I think it's just added to the confusion, so I've updated my question. I am doing SVD-- I'm trying to perform FA after the fact. I will give Gensim a try and let you know. Thanks all! – Jeff May 23 '11 at 00:15
FA on what matrix? – JMS May 23 '11 at 01:05
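
The rank argument made in the comments above (many more terms than documents, so the covariance matrix must be singular) can be illustrated with a small sketch. The random data and the reduced size of 200 terms (standing in for the ~6700 in the question) are assumptions chosen to keep the example fast.

```python
import numpy as np

rng = np.random.default_rng(1)

n_docs, n_terms = 9, 200   # 200 terms stand in for ~6700
X = rng.random((n_docs, n_terms))   # rows = documents, columns = terms

# Documents as observations, terms as variables: a 200 x 200 covariance matrix.
cov_terms = np.cov(X, rowvar=False)
rank_terms = np.linalg.matrix_rank(cov_terms, tol=1e-8)
print("covariance shape:", cov_terms.shape, "rank:", rank_terms)
# After centering, 9 rows can span at most 8 dimensions, so the rank
# can never exceed n_docs - 1, however many duplicate columns are dropped.

# Transposed: terms as observations, documents as variables (9 x 9 covariance).
cov_docs = np.cov(X.T, rowvar=False)
rank_docs = np.linalg.matrix_rank(cov_docs, tol=1e-8)
print("covariance shape:", cov_docs.shape, "rank:", rank_docs)
```

With the matrix transposed there are far more observations than variables, so the 9 x 9 covariance matrix can be full rank, which is consistent with the factor analysis running on the transposed data.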
