I'm analyzing people based on their Twitter streams. We are using a 'word bag' model of users, which basically amounts to counting how often each word appears in a person's Twitter stream (and then using that count as a proxy for a more normalized 'probability they will use a given word' in a particular length of text).
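To make that concrete, here is a minimal sketch of what I mean by the 'word bag' model; the tokenization and the `user_tweets` input are just hypothetical placeholders, not our real pipeline:

```python
from collections import Counter

def word_bag(tweets):
    """Count how often each word appears across a user's tweets (naive whitespace tokenization)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def word_probabilities(counts):
    """Normalize raw counts into per-word usage probabilities for this user."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical input: the list of tweet texts for one user.
user_tweets = ["the cat sat on the mat", "the dog chased the cat"]
probs = word_probabilities(word_bag(user_tweets))
```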
Due to constraints further down the pipeline, we cannot retain full usage data on all words for all users, so we are trying to find the most 'symbolically efficient' words to retain in our analysis. That is, we're trying to retain the subset of dimensions whose values would allow a hypothetical seer to most accurately model the probabilities of all words (including any we left out of the analysis).
So a principal component analysis (PCA)-type approach seems an appropriate first step (happily ignoring, for now, the fact that PCA would also 'rotate' us into dimensions that don't correspond to any particular word).
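For reference, the sort of thing I have in mind is roughly this (a sketch using scikit-learn; the user-by-word count matrix is synthetic here, since the real one would come from the counting step above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in user-by-word count matrix: rows are users, columns are word counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(1000, 5000)).astype(float)

# Keep the top k components. Each component is a linear combination of many
# word dimensions, which is the 'rotation' issue mentioned above.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_[:10])
```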
But I am reading that "Zipf distributions ... characterize the use of words in a natural language (like English)", and as far as I know, PCA makes various assumptions about the data being normally distributed. So I'm wondering whether the fundamental assumptions of PCA will be sufficiently far 'off' from reality to be a real problem. That is, does PCA rely on the data being 'close to' Gaussian/normal for it to work at all well?
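In case it helps clarify the concern, a quick way to see the Zipf-like shape would be a log-log rank-frequency plot, something like this (sketch only; the `counts` dict is a hypothetical stand-in for aggregated word counts):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical aggregated word counts across all users.
counts = {"the": 50000, "cat": 1200, "sat": 800, "mat": 150}

# Zipf's law predicts an approximately straight line on a log-log
# rank-vs-frequency plot, which is very far from a Gaussian shape.
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker=".")
plt.xlabel("word rank")
plt.ylabel("frequency")
plt.show()
```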
If this is a problem, as I suspect, are there any other recommendations? That is, is there some other approach worth investigating that is 'equivalent' to PCA in some way but more appropriate for Zipf- or power-law-distributed data?
Note that I am a programmer, not a statistician, so apologies if I messed up my terminology above. (Corrections are of course welcome!)