I'm analyzing people based on their Twitter streams. We are using a 'word bag' model of users, which basically amounts to counting how often each word appears in a person's Twitter stream (and then using that count as a proxy for a more normalized 'probability they will use a given word' in a particular length of text).
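To make that concrete, here is a minimal sketch of what I mean by the 'word bag' model; the tokenization and the `user_tweets` input are just hypothetical placeholders, not our real pipeline:

```python
from collections import Counter

def word_bag(tweets):
    """Count how often each word appears across a user's tweets (naive whitespace tokenization)."""
    counts = Counter()
    for tweet in tweets:
        counts.update(tweet.lower().split())
    return counts

def word_probabilities(counts):
    """Normalize raw counts into per-word usage probabilities for this user."""
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Hypothetical input: the list of tweet texts for one user.
user_tweets = ["the cat sat on the mat", "the dog chased the cat"]
probs = word_probabilities(word_bag(user_tweets))
```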
Due to constraints further down the pipeline, we cannot retain full usage data on all words for all users, so we are trying to find the most 'symbolically efficient' words to retain in our analysis. That is, we're trying to retain the subset of dimensions whose values would allow a hypothetical seer to most accurately model the probabilities of all words (including any we left out of the analysis).
So a principal component analysis (PCA)-type approach seems an appropriate first step (happily ignoring, for now, the fact that PCA would also 'rotate' us into dimensions that don't correspond to any particular word).
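For reference, the sort of thing I have in mind is roughly this (a sketch using scikit-learn; the user-by-word count matrix is synthetic here, since the real one would come from the counting step above):

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in user-by-word count matrix: rows are users, columns are word counts.
rng = np.random.default_rng(0)
X = rng.poisson(lam=1.0, size=(1000, 5000)).astype(float)

# Keep the top k components. Each component is a linear combination of many
# word dimensions, which is the 'rotation' issue mentioned above.
pca = PCA(n_components=50)
X_reduced = pca.fit_transform(X)
print(pca.explained_variance_ratio_[:10])
```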
But I am reading that "Zipf distributions ... characterize the use of words in a natural language (like English)", and as far as I know, PCA makes various assumptions about the data being normally distributed. So I'm wondering whether the fundamental assumptions of PCA will be sufficiently far 'off' from reality to be a real problem. That is, does PCA rely on the data being 'close to' Gaussian/normal for it to work at all well?
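In case it helps clarify the concern, a quick way to see the Zipf-like shape would be a log-log rank-frequency plot, something like this (sketch only; the `counts` dict is a hypothetical stand-in for aggregated word counts):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical aggregated word counts across all users.
counts = {"the": 50000, "cat": 1200, "sat": 800, "mat": 150}

# Zipf's law predicts an approximately straight line on a log-log
# rank-vs-frequency plot, which is very far from a Gaussian shape.
freqs = np.array(sorted(counts.values(), reverse=True), dtype=float)
ranks = np.arange(1, len(freqs) + 1)

plt.loglog(ranks, freqs, marker=".")
plt.xlabel("word rank")
plt.ylabel("frequency")
plt.show()
```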
If this is a problem, as I suspect, are there any other recommendations? That is, is there some other approach worth investigating that is 'equivalent' to PCA in some way but more appropriate for Zipf- or power-law-distributed data?
Note that I am a programmer, not a statistician, so apologies if I messed up my terminology above. (Corrections are of course welcome!)