
We usually use PCA as a dimensionality reduction technique for data where cases are assumed to be i.i.d.

Question: What are the typical nuances in applying PCA for dependent, non-i.i.d. data? What nice/useful properties of PCA that hold for i.i.d. data are compromised (or lost entirely)?

For example, the data could be a multivariate time series in which case autocorrelation or autoregressive conditional heteroskedasticity (ARCH) could be expected.

Several related questions on applying PCA to time series data have been asked before, e.g. 1, 2, 3, 4, but I am looking for a more general and comprehensive answer (without the need to expand much on each individual point).

Edit: As noted by @ttnphns, PCA itself is not an inferential analysis. However, one could be interested in the generalization performance of PCA, i.e. focusing on the population counterpart of the sample PCA. For example, as written in Nadler (2008):

Assuming the given data is a finite and random sample from a (generally unknown) distribution, an interesting theoretical and practical question is the relation between the sample PCA results computed from finite data and those of the underlying population model.
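
As a concrete toy illustration of this sample-versus-population question (a simulation sketch of my own, not taken from Nadler's paper), one can generate rows that are i.i.d. versus strongly autocorrelated while holding the population covariance fixed, and compare how well the sample leading eigenvector recovers the population one:

```python
# Sketch: accuracy of sample PCA loadings under i.i.d. vs AR(1) rows.
# The AR(1) innovations are scaled so the stationary covariance equals
# the same population Sigma in both cases; all numbers are illustrative.
import numpy as np

rng = np.random.default_rng(0)
T, n_rep, phi = 200, 500, 0.95

Sigma = np.array([[3.0, 1.0],
                  [1.0, 2.0]])                # population covariance
w_pop = np.linalg.eigh(Sigma)[1][:, -1]       # population leading eigenvector
L = np.linalg.cholesky((1 - phi**2) * Sigma)  # innovation scale for the AR(1)

def leading_eigvec(X):
    return np.linalg.eigh(np.cov(X.T))[1][:, -1]

def angle_deg(u, v):
    return np.degrees(np.arccos(np.clip(abs(u @ v), 0.0, 1.0)))

err_iid, err_ar = [], []
for _ in range(n_rep):
    # i.i.d. rows with covariance Sigma
    X_iid = rng.multivariate_normal(np.zeros(2), Sigma, size=T)
    # AR(1) rows with the same stationary covariance Sigma
    X_ar = np.empty((T, 2))
    X_ar[0] = rng.multivariate_normal(np.zeros(2), Sigma)
    for t in range(1, T):
        X_ar[t] = phi * X_ar[t - 1] + L @ rng.standard_normal(2)
    err_iid.append(angle_deg(leading_eigvec(X_iid), w_pop))
    err_ar.append(angle_deg(leading_eigvec(X_ar), w_pop))

print(f"mean angular error of PC1, i.i.d. rows: {np.mean(err_iid):.1f} deg")
print(f"mean angular error of PC1, AR(1) rows:  {np.mean(err_ar):.1f} deg")
```

Under the autocorrelated rows the effective sample size is much smaller, so the sample loadings are typically considerably noisier, even though both designs share the same population covariance.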

References:

Nadler, B. (2008). Finite sample approximation results for principal component analysis: A matrix perturbation approach. The Annals of Statistics, 36(6), 2791–2817.

Richard Hardy
  • Just for note. PCA _itself_ is not an inferential analysis. It is a transformation of a multivariate dataset of numbers; its core is just SVD or eigendecomposition. Therefore it does not make an observation-independence assumption. Assumptions arise when we use PCA _as_ a statistical tool to analyze samples from populations. But they are not PCA's assumptions. For example, testing for sphericity to decide whether PCA is justified to reduce the data does require independence, and the test may look like a "within-PCA" assumption test, but actually it is an "outside" test. – ttnphns Aug 02 '16 at 11:15
  • @ttnphns, very good points, thank you. If you see a neat way to edit my post, feel free to. I will think about it myself as well. – Richard Hardy Aug 02 '16 at 11:41
  • Richard, your question is fine and important (+1). Just maybe I'd rather re-word it a bit in a manner like "We usually use PCA as a dimensionality reduction for data where cases are i.i.d. assumed... What are typical nuances in applying PCA for time series data where cases (time points) are lag-interdependent...?" – ttnphns Aug 02 '16 at 12:21
  • @ttnphns, thanks for the input! I edited it partly borrowing from you, but maintaining a lot of the original. – Richard Hardy Aug 02 '16 at 12:31
  • +1 for the good question but, similarly to @ttnphns, I don't think that *any* "nice/useful" properties of PCA are lost, because nothing in PCA depends on the iid assumption. – amoeba Aug 02 '16 at 12:59
  • @amoeba, right. But we hardly ever stop at just obtaining the loadings of the PCs. In the steps that commonly follow PCA, what should we be aware of under non-i.i.d.'ness? I hope an answer could be better than the question (in its current formulation). If you look at it loosely/creatively, perhaps you could come up with some good points. – Richard Hardy Aug 02 '16 at 13:08
  • An example: if we think about the random variables that have generated the current sample and we are interested in their orthogonal linear combinations (the first of which has maximum variance, ..., and the last of which has minimum variance), we can expect the estimated loadings from the sample PCA to generalize better (be more accurate estimates of the population loadings) under i.i.d.'ness than under strong autocorrelation. This is an example of the "failure/compromised performance" that I am looking for, but there must be many more of them that I am completely ignorant of. – Richard Hardy Aug 02 '16 at 13:24
  • Plain PCA respects only "horizontal" associations (i.e. between columns) and ignores "vertical" ones (between cases): the covariance matrix of the columns is the same if you shuffle the order of the cases (a short sketch demonstrating this follows the comments). Whether this can be called "no assumptions about case serial relations" or "an assumption of independent cases" is a matter of taste. The i.i.d. assumption is the _default_ in data analysis, and so methods which simply do _not_ pay special attention to case order, like PCA, could be imputed "silent support" for the i.i.d. assumption. – ttnphns Aug 02 '16 at 13:38
  • PCA with original as well as lagged variables (time series) will take the vertical correlations into account. There is also an old, simple and nice Visual Recurrence Analysis which plots lagged relations matrices. It could be used in conjunction with PCA or MDS, for dimensionality reduction. – ttnphns Aug 02 '16 at 13:44
  • Great paper, thanks for sharing! I'm afraid I don't know enough on the topic to help you, but might [this thesis](http://www.stat.berkeley.edu/~tran/pub/honours_thesis.pdf) be related to what you're looking for? – DeltaIV Aug 20 '16 at 08:24
  • @DeltaIV, thank you! It looks like it is about functional PCA (FPCA), which is an unnecessary complication from the perspective of my original question. – Richard Hardy Aug 20 '16 at 08:40
  • @RichardHardy, yes, it's about FPCA. You mentioned time series and estimation in the context of PCA. FPCA aims at estimating the eigenpairs of the kernel operator from a finite sample of correlated, possibly heteroskedastic data, thus it seemed to me related to what you were talking about. See also par. 3 [here](https://statistics.uni-bonn.de/fileadmin/Fachbereich_Wirtschaft/Einrichtungen/Statistik/WS1112/Topics/PropertiesFPCA.pdf) and [here](http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3962763/). Anyway, as I said I probably don't know as much about the topic as you do, thus I won't insist. – DeltaIV Aug 21 '16 at 08:05
  • Why not just take better care in coming up with a covariance matrix? Assume a different model. Doesn't/shouldn't model estimation always come before PCA? I can think of cases where it doesn't, but I'm being rhetorical. – Taylor Jan 28 '17 at 01:43
  • @Taylor, That is a good suggestion, but it is tangential to the question. – Richard Hardy Jan 28 '17 at 09:55
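
To make ttnphns's shuffle-invariance point above concrete, here is a minimal sketch (assuming only NumPy; the data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3)).cumsum(axis=0)  # random walks: strongly autocorrelated cases

# the covariance matrix (the input to plain PCA) ignores case order entirely
C_original = np.cov(X.T)
C_shuffled = np.cov(X[rng.permutation(len(X))].T)
print(np.allclose(C_original, C_shuffled))  # True

# stacking lagged copies of the variables, as suggested above, lets PCA
# see the serial ("vertical") dependence between cases
X_lagged = np.hstack([X[1:], X[:-1]])  # columns: x_t alongside x_{t-1}
print(np.cov(X_lagged.T).round(1))     # cross-lag covariances appear off the diagonal
```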

1 Answer


Presumably, you could add the time component as an additional feature to your sampled points, and then they are i.i.d.? Basically, the original data points are conditional on time:

$$ p(\mathbf{x}_i \mid t_i) \ne p(\mathbf{x}_i) $$

But, if we define $\mathbf{x}_i' = \{\mathbf{x}_i, t_i\}$, then we have:

$$ p(\mathbf{x}'_i \mid t_i) = p(\mathbf{x}'_i) $$

... and the data samples are now mutually independent.

In practice, including time as a feature in each data point may result in one principal component that simply points along the time-feature axis. But if any features are correlated with time, a component might load on one or more of those features as well as on the time feature.
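
As a minimal sketch of this (the data-generating choices below are my own illustrative assumptions, not from the answer): append the time index $t_i$ as an extra column, standardize, and run PCA; the leading component then loads on time together with any time-correlated feature.

```python
import numpy as np

rng = np.random.default_rng(2)
T = 300
t = np.arange(T, dtype=float)
# two features: one drifts with time, one is pure noise
X = np.column_stack([0.05 * t + rng.standard_normal(T),
                     rng.standard_normal(T)])
X_aug = np.column_stack([X, t])  # x_i' = {x_i, t_i}

# standardize so the scale of the time column does not dominate the PCA
Z = (X_aug - X_aug.mean(axis=0)) / X_aug.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(Z.T))
print(eigvecs[:, -1].round(2))  # leading PC: large loadings on the drifting
                                # feature and on time, ~0 on the noise feature
```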

Hugh Perkins
  • Thanks for the answer. That would be a very special case where time enters linearly. A more widespread phenomenon is, for example, autocorrelation, where time itself does not play a role as a feature. – Richard Hardy Mar 16 '17 at 10:06
  • Ok, I see. So you mean that, e.g., $x_t$ is not just a function of some parameters $\theta$, but depends also on $x_{t-1}$? Therefore $x_t$ is Markov, given $x_{t-1}$ and $\theta$? So can we then add $x_{t-1}$ as a feature into the PCA? (I'm not saying we can or can't, just thinking through the problem really...) – Hugh Perkins Sep 04 '17 at 15:45
  • Something like that, yes, but without adding $x_{t-1}$ as a feature, because I am interested in PCA that is defined on the original variables. – Richard Hardy Sep 04 '17 at 16:48