To recapitulate what we have already discussed in the comments above:
First, I see absolutely no problem in applying PCA to time series or panel data. @NickCox said the same in the comments to the question you linked to. The paper that the OP of that other question brought up, and that you linked in your question, does not contradict that (and I agree with you, it does not look very relevant).
Second, PCA will reduce your $4\cdot182=728$ features to any number you want (e.g. to two, so that you can plot all the data as a scatter plot and look at it), but it does not take class information into account at all (by classes I mean that subjects exhibiting and not exhibiting something of interest form two classes). So you can end up with largely overlapping classes in 2D, even though it is entirely possible that the classes are actually well separated, just not in the first two PCs. The only thing PCA cares about is the overall amount of variance of the projection.
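For concreteness, here is a minimal scikit-learn sketch of that first step; `X` and `y` are placeholders (random data standing in for your 728 features and the two class labels), not your actual data.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 728))   # placeholder: subjects x (4*182 = 728) features
y = np.repeat([0, 1], 50)         # placeholder: exhibiting vs. not exhibiting

# Reduce 728 features to 2 principal components and look at the scatter plot.
# The labels y are only used for colouring the points; PCA itself ignores them.
X_pca = PCA(n_components=2).fit_transform(X)
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y)
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```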
Therefore, if class separation is what you are interested in, you might want to apply e.g. LDA (linear discriminant analysis) instead. It is also a dimensionality reduction technique (with two classes it yields a single discriminant axis; in general at most one fewer dimension than the number of classes), but it looks for projections that achieve maximal class separation. Both PCA and LDA are linear methods, so if you know how to interpret PCA results, you know how to interpret LDA results as well.
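Continuing the sketch above (same placeholder `X` and `y`), LDA is one extra call; note that with 728 features and only ~100 subjects plain LDA will overfit badly, which is exactly the caveat discussed next.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# With two classes LDA yields at most one discriminant axis (n_classes - 1),
# chosen to maximize between-class variance relative to within-class variance.
X_lda = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y)

plt.hist([X_lda[y == 0, 0], X_lda[y == 1, 0]], bins=20, label=["class 0", "class 1"])
plt.legend()
plt.show()
```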
One caveat is that LDA can overfit if the number of features is high but the number of subjects is not high enough (see e.g. my answer here). One sign of overfitting would be that LDA results in projections having tiny variance.

If you are familiar with the PCA math, you know that PCA takes the leading eigenvectors of the covariance matrix $\boldsymbol \Sigma$ as projection axes. LDA takes the leading eigenvectors of $\boldsymbol \Sigma_W^{-1} \boldsymbol \Sigma_B$ as projection axes, where $\boldsymbol \Sigma_W$ and $\boldsymbol \Sigma_B$ are the within- and between-class covariance matrices. In case of overfitting, you can regularize LDA and force it to look for directions with large variance (somewhat like PCA!) by taking $(1-\lambda)\boldsymbol \Sigma_W + \lambda \mathbf{I}$ instead of $\boldsymbol \Sigma_W$. When $\lambda=0$, you get LDA; when $\lambda=1$, you get PCA on the class means. Everything in between is regularized LDA.
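If you want to experiment with this regularization, here is a small numpy/scipy sketch of the eigenproblem described above (the function and variable names are mine; scikit-learn's `LinearDiscriminantAnalysis(solver='eigen', shrinkage=...)` implements a closely related, though not identical, shrinkage of $\boldsymbol \Sigma_W$ toward a scaled identity):

```python
import numpy as np
from scipy.linalg import eigh

def regularized_lda_axes(X, y, lam=0.5, n_components=1):
    """Leading eigenvectors of ((1-lam)*Sigma_W + lam*I)^{-1} Sigma_B.

    lam=0 is ordinary LDA (requires Sigma_W to be invertible),
    lam=1 is PCA on the class means.
    """
    n, p = X.shape
    overall_mean = X.mean(axis=0)
    Sigma_W = np.zeros((p, p))   # within-class covariance
    Sigma_B = np.zeros((p, p))   # between-class covariance
    for c in np.unique(y):
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sigma_W += (Xc - mc).T @ (Xc - mc)
        d = (mc - overall_mean)[:, None]
        Sigma_B += len(Xc) * (d @ d.T)
    Sigma_W /= n
    Sigma_B /= n

    # Equivalent generalized eigenproblem:
    #   Sigma_B v = mu * ((1-lam)*Sigma_W + lam*I) v
    Sigma_W_reg = (1 - lam) * Sigma_W + lam * np.eye(p)
    eigvals, eigvecs = eigh(Sigma_B, Sigma_W_reg)
    order = np.argsort(eigvals)[::-1]          # leading eigenvectors first
    return eigvecs[:, order[:n_components]]

# Usage with the placeholder data from above:
# axes = regularized_lda_axes(X, y, lam=0.5, n_components=1)
# X_proj = X @ axes
```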