
Briefly: I want to run a PCA on 2-dimensional data, both to reduce dimensionality and to find a single axis that captures more variance than either original dimension does. I know it can be done (e.g., in R, running prcomp on 2-dimensional data works without errors or warnings), but just because something can be done doesn't mean it should be done (e.g., Doing principal component analysis or factor analysis on binary data). Is there anything wrong, theoretically or practically, with running a PCA on 2-dimensional data?

In context: I'm looking at distributions of vowels, which are typically described in terms of two acoustic measures: F1 and F2. These measures are typically modeled separately, but I'm seeking to model them together because they covary. Below is an example of one speaker's distribution of KIT vowels (the vowel in the words hit, sit, etc.) in F1~F2 space. It's clear that there's considerable covariance in F1 and F2 here, so modeling F1 by itself would miss what's going on in F2 (and vice versa).
[Figure: One speaker's distribution of KIT tokens, showing covariance in F1 & F2]

The reason I want to turn to PCA, then, is that the axis that captures the most variance in a given speaker's distribution of a given vowel is unlikely to be the F1 or F2 axis, but rather a tilted axis. Running a PCA on standardized (F1, F2) values is a way to find that tilted axis, as well as where individual tokens lie along it. But there doesn't seem to be much out there on running PCA on 2-dimensional data (aside from demos with toy data, like Making sense of principal component analysis, eigenvectors & eigenvalues), so it's not clear to me whether there's anything wrong with it.
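
To make this concrete, here is a minimal sketch in R of what I have in mind. The data are simulated stand-ins (the real input would be a speaker's measured F1 and F2 values in Hz):

    set.seed(1)

    ## Simulated stand-in for one speaker's KIT tokens (the real data
    ## would be measured F1 and F2 values in Hz).
    n  <- 100
    F1 <- rnorm(n, mean = 450, sd = 40)
    F2 <- 2000 - 2 * (F1 - 450) + rnorm(n, sd = 100)
    kit <- data.frame(F1, F2)

    ## PCA on standardized values: scale. = TRUE makes prcomp use the
    ## correlation matrix, so F2's larger numerical scale can't dominate.
    pca <- prcomp(kit, center = TRUE, scale. = TRUE)

    pca$rotation[, 1]   # loadings of PC1: the direction of the tilted axis
    head(pca$x[, 1])    # PC1 scores: where individual tokens lie on that axis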

  • You said "scaled" (F1,F2) values. If you standardize both F1 and F2, then running PCA is pointless: PCA on a 2D correlation matrix always yields the diagonals as PCs. – amoeba Oct 25 '17 at 06:51
  • For these measures, standardizing is necessary because F2 is always on a greater numerical scale than F1. As a result, if values are unstandardized, F2 will have greater variance, so it'll dominate PC1, no? Re: the idea that PCA on standardized values is pointless...it is predictable, but does that make it pointless? – Dan Villarreal Oct 25 '17 at 21:57
  • Well, "pointless" in the sense that after standardization you can simply take F1-F2 or F1+F2 (depending on whether the correlation is negative or positive) as your PC1 value; you don't need to actually perform PCA (see the sketch after these comments). See https://stats.stackexchange.com/questions/140434/ – amoeba Oct 25 '17 at 22:01
  • Pretty much every elementary introduction to PCA uses a 2-dimensional cloud to illustrate what is going on with PCA. From that I think one can safely infer that doing PCA on a 2-dimensional data set isn't useless. – meh Oct 25 '17 at 23:33
  • @amoeba, in the answer you linked, you mention "if the variables are not standardized, then you should be doing PCA on their covariance matrix (not on their correlation matrix)". Could that be an applicable solution here? – Dan Villarreal Oct 29 '17 at 22:05
  • Perhaps. See https://stats.stackexchange.com/questions/53. – amoeba Oct 29 '17 at 22:07
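
To see the point in the comments concretely: with exactly two standardized variables, the sample correlation matrix is [[1, r], [r, 1]], and its eigenvectors are always (1, 1)/√2 and (1, −1)/√2 whatever r is, so PC1 is proportional to F1+F2 (r > 0) or F1−F2 (r < 0). A quick check in R with simulated values:

    set.seed(1)

    ## Two correlated standardized variables; any pair will do.
    x <- rnorm(200)
    y <- -0.7 * x + rnorm(200, sd = 0.5)

    pca <- prcomp(cbind(x, y), center = TRUE, scale. = TRUE)

    ## Every loading is exactly +/- 1/sqrt(2) ~= 0.707, whatever the
    ## strength of the correlation; here (r < 0) PC1 is proportional to x - y.
    pca$rotation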

1 Answer


No, there is nothing whatsoever wrong with running PCA on just two variables (or with any other kind of dimension reduction method). It's completely valid to reduce your model complexity like this.

It might be best to run some kind of multinomial model with cross-validation for your vowels. See whether a model on the first principal component alone predicts vowels better than a model with both F1 and F2 values. If so, this would be good evidence that reducing your model complexity in this way is a Good Thing.
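
A minimal sketch of such a cross-validated comparison in R, using simulated stand-in data (nnet::multinom for the multinomial model, with a hand-rolled 10-fold loop; the vowel classes, means, and SDs below are made up for illustration):

    library(nnet)   # multinom(): multinomial logistic regression

    set.seed(1)

    ## Simulated stand-in for one speaker's tokens of three vowel classes.
    n     <- 150
    vowel <- factor(rep(c("KIT", "DRESS", "TRAP"), each = n / 3),
                    levels = c("KIT", "DRESS", "TRAP"))
    F1    <- rnorm(n, mean = c(420, 550, 700)[as.integer(vowel)], sd = 50)
    F2    <- rnorm(n, mean = c(2100, 1900, 1700)[as.integer(vowel)], sd = 120)
    d     <- data.frame(vowel, F1 = scale(F1)[, 1], F2 = scale(F2)[, 1])

    ## PC1 scores from a PCA on the standardized measures.
    d$PC1 <- prcomp(d[, c("F1", "F2")])$x[, 1]

    ## 10-fold cross-validated classification accuracy for a given formula.
    folds <- sample(rep(1:10, length.out = n))
    cv_acc <- function(formula) {
      hits <- vapply(1:10, function(k) {
        fit  <- multinom(formula, data = d[folds != k, ], trace = FALSE)
        pred <- predict(fit, newdata = d[folds == k, ])
        sum(pred == d$vowel[folds == k])
      }, numeric(1))
      sum(hits) / n
    }

    cv_acc(vowel ~ PC1)       # first principal component alone
    cv_acc(vowel ~ F1 + F2)   # both standardized measures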

– Stephan Kolassa