
I have been studying principal component analysis (PCA) and have now moved on to factor analysis (FA). I understand that PCA seeks an orthonormal basis, but I am not sure whether this is also the case for factor analysis. If my code is correct, the basis it finds is not orthonormal.

Is it true that the basis is not orthonormal in factor analysis? What is the theory behind it?

Marcel
  • What software do you use? Maybe your function call performs some kind of factor rotation. – Andrej Dec 02 '15 at 18:00
  • Hi Andrej, I was trying to implement it from scratch in Matlab. So I just want to make sure that I don't have a bug in the code, and to understand why the basis should or should not be orthogonal. – Marcel Dec 02 '15 at 18:15
  • What exactly are you referring to when you say "basis"? Factor analysis loadings? Factor analysis scores? Something else? Can you briefly describe the procedure you use to find factor analysis solution? I guess that in case of PCA you are referring to the eigenvectors of the covariance matrix, right? – amoeba Dec 02 '15 at 19:50
  • `PCA seeks orthonormal basis` In a sense, it is so. Eigenvectors are a special case of an orthonormal basis. But there are infinitely many orthonormal bases possible in the space spanned by the data cloud. Factor analysis is not a transformation of a data cloud (PCA is), and factors do not lie in the same space as the data cloud. No, factors cannot be seen as a basis of that space. – ttnphns Dec 02 '15 at 20:18
  • @ttnphns: I think $k$ factor analysis loadings (or perhaps loadings normalized to have unit length) can be thought of as a basis in some $k$-dimensional "FA subspace", similar to how $k$ PCA eigenvectors (which are PCA loadings normalized to have unit length) can be thought of as a basis in a $k$-dimensional "PCA subspace". – amoeba Dec 02 '15 at 20:32
  • @amoeba, sure, unit-normalized loadings are a "basis" in _some_ space (the factor one). But here we enter a scholastic discourse that is unlikely to help. When people say "find me an orthonormal basis" (by, say, Gram-Schmidt or QR or Eigen) they mean, most of the time, a basis of _that_ space where the data lie. I took Marcel's question to be in that same stream (I might be mistaken at that). – ttnphns Dec 02 '15 at 20:56
  • @ttnphns, but this "factor space" is a *subspace* of the data space. At least in my field people like talking about FA subspaces quite similar to how they would talk about PCA subspaces. E.g. the whole data space is $100$-dimensional, we do PCA or FA to extract $10$ components and this yields a $10$-dimensional subspace. If so, one can meaningfully ask questions about a possible basis of this subspace. – amoeba Dec 02 '15 at 21:34
  • So basically, you are saying that PCA is not a particular case of FA, right? Because of this orthogonality thing. However, it seems interesting to me that any basis can be expressed in terms of orthogonal components. I am not sure, but I had an intuition that if I run PCA and seek the principal components, and then FA, it should give me similar results. If we think about probabilistic PCA: that is exactly the same as factor analysis, with the exception that in FA the noise covariance matrix is no longer a multiple of the identity, but it is still diagonal. This means that PPCA is a particular case of FA. Right? – Marcel Dec 02 '15 at 21:55
  • Yes, ttnphns, by the orthonormal basis I had in mind a set of vectors in the space of the data. Thank you for the discussion, very interesting! – Marcel Dec 02 '15 at 21:58
  • @amoeba, I believe it depends on how "data space" is being defined. For `n cases by p variables` multivariate data (let `n>p`) one might say that it is the p-dim variable space. Then clearly a few `m` PCs are a subspace of it, but `m` factors are not. If one chooses to call "data space" the full n-dim subject space (with only `p` dimensions substantive) - then factors, of course, belong to that n-space. As you might [remember](http://stats.stackexchange.com/q/127483/3277), factor values are somewhere there and not in _those_ `p` dimensions-variables; but factor scores are in those `p` dimensions – ttnphns Dec 02 '15 at 22:06
  • @amoeba, typically, when people say the "data (cloud) spans" a space, they mean that minimal, nonredundant, substantive set of dimensions. To find all these dimensions, starting from some objective milestone (such as, for example, the main principal axis), is what they call "finding the (orthogonal) basis". It is not of major importance which of the two - columns or rows of the data - are appointed to serve as "points" and which as "axes" before doing that task. – ttnphns Dec 02 '15 at 22:20
  • @ttnphns: Here I am talking about the $p$-dimensional variable space. Even if $m$ "real" factors are not part of this $p$-dimensional space, this is a purely theoretical construct; any estimated factor loadings will necessarily define a subspace of this $p$-dimensional space (similarly to how you say that factor scores lie within $p$ dimension-variables). So I guess I am talking here about "estimated factors" and not about "real factors" (that are inaccessible to us anyway). – amoeba Dec 02 '15 at 22:20
  • @amoeba, you are not correct at that. Factors [do not](http://stats.stackexchange.com/a/95106/3277) lie in the p-dim space of the variables; they are not a subspace of it. Factors are, rather, additional new variables. Factor loadings are the coordinates of the old variables (as vector endpoints) onto the factors, which constitute their own space. A factor loading plot is therefore the projection of the variables' space onto a space alien to it. – ttnphns Dec 02 '15 at 22:33
  • @ttnphns: I know that we discussed this a lot already, but I must confess that I still don't *fully* understand this... I do understand it on some level, but not completely. However, I am not disagreeing with you here; even if this is correct and factors lie outside, we can meaningfully talk about the subspace spanned by the loadings -- this will then be a projection of the real factor subspace onto the variable space, right? This projection is the subspace I am talking about then. Loadings are basis vectors of this subspace. I understand OP as asking about orthogonality of loadings. – amoeba Dec 02 '15 at 22:38
  • @amoeba, let us not be hasty. Loadings are coordinates, notches; they themselves cannot span anything. But they are the coordinates of variables onto factors (and not vice versa, as your last comment goes). So, no old space. Just imagine that factors are additional new variables: they don't belong to the space of the old ones. Whenever you want to project a factor into the variables' space you get the idea of factor _scores_ (a reasonable variable, but only an approximation to the true factor; loadings pertain to the true, transcendent factor). – ttnphns Dec 02 '15 at 22:48
  • @ttnphns: Hmm. I am afraid I got a bit lost. I think what I am saying is a lot simpler than what you are saying. Let me take $p$ variables and do FA with 2 factors; then I take the matrix of loadings, it consists of 2 vectors of length $p$. Surely I can say that these "loading vectors" span some 2-dimensional subspace, i.e. a plane. This is a triviality! Now what I think you are saying is that this plane has no real meaning and we should not think about it... Well, I don't know. Perhaps it does have some meaning even in your view of FA? I am not sure. But I do think that OP asked about *that*. – amoeba Dec 02 '15 at 22:54
  • @amoeba, in FA each variable X (of X, Y, Z,...) is decomposed into two: its unique factor and its communality. The latter, as a "variable", belongs to the space of the common factors, and its relation to the factors is exactly _like_ the relation of the original, undecomposed X to the PCs. Factors are the "PCs" of the collection of communality "variables". Clear analogy. But the communality "variables" are unobserved, latent (they can be seen as real but are never measured by us; constructs), hence the factors are latent. (to cont.) – ttnphns Dec 02 '15 at 23:22
  • (cont.) It is not "my" view, it is the standard view of FA. Each variable X belongs to the _plane_ defined by two orthogonal "variables": the communality (lying in the factor space) and the unique factor. Neither of these two parts belongs to the variables' X, Y, Z... space. – ttnphns Dec 02 '15 at 23:25
  • @ttnphns: Yes, I think I understand all that. What I still don't understand is what is your opinion on the plane in the X,Y,Z space that is spanned by the two loading vectors in the situation when I extracted 2 factors. Are you saying that there is no such plane? I don't think you are saying *that* because it is obvious that a matrix of $p\times 2$ size defines a plane in $p$-dimensional space. Are you then maybe saying that this plane has no meaning? But it is such a natural object (at least natural to me), that it must have *some* meaning. – amoeba Dec 02 '15 at 23:41
  • @amoeba, I've added a picture and text to [here](http://stats.stackexchange.com/a/95106/3277). Can it help? – ttnphns Dec 03 '15 at 01:45
  • Regarding your last comment: the loading matrix defines the 2-dim factor space where the p communality "variables" lie. You might "invert" the space and see it as 2 points in p-dim axes, if you wish. But those p axes will still be communalities, not the original variables. Actually, the _loading matrix_ in FA would more precisely, or puristically, be called the matrix between communalities and factors, not between variables and factors. – ttnphns Dec 03 '15 at 01:51
  • @ttnphns: Thanks for the discussion and for updating your old post. I need some time to think about that. However, I can tell you straight away my very simplistic view: for me factor analysis is a generative model: take $\mathbf z$ distributed with mean zero and identity covariance, transform it with the loadings matrix into $\mathbf x$ and add some noise with different variance for each coordinate (see the equation in my answer). If we for a moment imagine that the noise is "turned off", then $\mathbf z$ will be mapped to a subspace of X. It is precisely the subspace spanned by the loadings. – amoeba Dec 03 '15 at 23:52
  • @amoeba, the noises (I suppose these are the unique factors) cannot be "turned off". Factor analysis without uniquenesses degenerates into PCA (then, of course, we stay with just PCs, not factors, and PCs are a subspace of the variables). When the "noise" is there, however, the loading matrix shows the coordinates in a space which isn't a subspace of the variables anymore. The noise is not added to $\bf x$ as the last step; it is added earlier, in the form of random unique-factor variables which you should generate as well (as orthogonal to each other _and_ to the generated common factors). – ttnphns Dec 04 '15 at 12:27

1 Answer


You seem to be familiar with probabilistic PCA, so I will use it in my explanation.

In probabilistic PCA, the model of the data is $$\mathbf x|\mathbf z \sim \mathcal N(\mathbf W \mathbf z + \boldsymbol \mu, \sigma^2 \mathbf I),\hspace{5em}\mathbf z\sim \mathcal N(\mathbf 0, \mathbf I),$$ where $\mathbf z$ is lower-dimensional than $\mathbf x$.
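
To make the model concrete, here is a minimal sketch (in Python/NumPy; the dimensions, seed, and variable names are purely illustrative) of drawing data from this generative model:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 1000, 5, 2            # samples, observed dimension, latent dimension

W = rng.normal(size=(p, k))     # some arbitrary "true" loading matrix
mu = rng.normal(size=p)
sigma2 = 0.1                    # isotropic noise variance

Z = rng.normal(size=(n, k))                           # z ~ N(0, I)
E = rng.normal(scale=np.sqrt(sigma2), size=(n, p))    # isotropic Gaussian noise
X = Z @ W.T + mu + E                                  # x | z ~ N(Wz + mu, sigma^2 I)
```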

The maximum likelihood solution is not unique, but one of these solutions is "special" and has an analytical expression in terms of standard PCA: the columns of this $\mathbf W_\mathrm{PPCA}$ are proportional to the principal directions of the data matrix $\mathbf X$. In fact, they are principal directions scaled by the square roots of the corresponding eigenvalues (these are PCA loadings), and then shrunk a bit further, because the noise variance $\sigma^2$ gets subtracted from each eigenvalue. We can call them PPCA loadings.
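
As a rough illustration (a sketch assuming the closed-form ML expression $\mathbf W = \mathbf U_k(\boldsymbol\Lambda_k - \sigma^2\mathbf I)^{1/2}$, with $\sigma^2$ estimated as the mean of the discarded eigenvalues; the function name and data are mine):

```python
import numpy as np

def ppca_loadings(X, k):
    """Closed-form ML PPCA loadings: principal directions scaled by
    sqrt(eigenvalue - noise variance)."""
    S = np.cov(X, rowvar=False)                 # sample covariance
    evals, evecs = np.linalg.eigh(S)            # eigenvalues in ascending order
    evals, evecs = evals[::-1], evecs[:, ::-1]  # make the order descending
    sigma2 = evals[k:].mean()                   # ML estimate of the noise variance
    W = evecs[:, :k] * np.sqrt(evals[:k] - sigma2)
    return W, sigma2

X = np.random.default_rng(0).normal(size=(500, 5))   # any data set would do here
W, sigma2 = ppca_loadings(X, 2)
print(np.round(W.T @ W, 3))   # diagonal: columns are orthogonal, but not of unit length
```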

If you compute them like that, then they are of course orthogonal to each other (but they do not have unit length, so they are not orthonormal). But note that $\mathbf W$ can be multiplied by any rotation matrix and will remain an equally good (equally likely) solution, and if we do that then its columns will stop being orthogonal. So if, instead of taking PCA loadings and converting them into PPCA loadings by the analytical formula, you were to use the expectation-maximization (EM) algorithm to find the optimal $\mathbf W$, then the algorithm would converge to some arbitrary solution that will not necessarily have orthogonal columns.
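
A quick numerical check of this rotation invariance (a hypothetical example; any orthogonal matrix will do): the model covariance $\mathbf W\mathbf W^\top + \sigma^2\mathbf I$, and hence the likelihood, is unchanged by the rotation, while the columns of the rotated $\mathbf W$ are no longer orthogonal:

```python
import numpy as np

rng = np.random.default_rng(1)
p, k, sigma2 = 5, 2, 0.1

# W with orthogonal columns of different lengths (like analytic PPCA loadings)
W = np.linalg.qr(rng.normal(size=(p, k)))[0] * np.array([3.0, 1.5])
Rot = np.linalg.qr(rng.normal(size=(k, k)))[0]   # a random orthogonal matrix
W_rot = W @ Rot

same_cov = np.allclose(W @ W.T + sigma2 * np.eye(p),
                       W_rot @ W_rot.T + sigma2 * np.eye(p))
print(same_cov)                        # True: both W's imply the same distribution of x
print(np.round(W_rot.T @ W_rot, 3))    # generally not diagonal: columns not orthogonal
```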

Factor analysis has a closely related model $$\mathbf x|\mathbf z \sim \mathcal N(\mathbf W \mathbf z + \boldsymbol \mu, \boldsymbol \Psi),\hspace{5em}\mathbf z\sim \mathcal N(\mathbf 0, \mathbf I),$$ where $\boldsymbol \Psi$ is a diagonal matrix. In the comments you ask if PPCA is a special case of FA; no, I would rather say that it's a "restricted" FA. In any case, there is no analytical solution for FA. Using the EM algorithm, you can find a maximum likelihood $\mathbf W_\mathrm{FA}$ (which is again not unique). Will it have orthogonal columns? No, it will not; see above. It will also not have columns ordered by variance.
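
As a rough check (a sketch using scikit-learn's `FactorAnalysis`, which fits this model by an iterative maximum-likelihood procedure; the simulated data and the choice of two factors are just for illustration), the fitted loading matrix typically does not have orthogonal columns:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(2)
n, p, k = 2000, 6, 2
W_true = rng.normal(size=(p, k))
psi_true = rng.uniform(0.1, 1.0, size=p)          # diagonal of Psi (uniquenesses)

Z = rng.normal(size=(n, k))
X = Z @ W_true.T + rng.normal(size=(n, p)) * np.sqrt(psi_true)

fa = FactorAnalysis(n_components=k).fit(X)
W_fa = fa.components_.T                           # p x k estimated loading matrix
print(np.round(W_fa.T @ W_fa, 3))                 # in general NOT diagonal
```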

However, we can rotate this solution by doing SVD on $\mathbf W_\mathrm{FA} = \mathbf {USV}^\top$ and then taking $\mathbf {US}$ as our new $\mathbf W_\mathrm{FA}$. This new matrix will have columns ordered by factor variance and can be called the matrix of FA loadings. It is easy to see that its columns will also be orthogonal.
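
Sketched on an arbitrary stand-in loading matrix (nothing here is specific to FA output; the rotation only uses the SVD and leaves $\mathbf W \mathbf W^\top$ unchanged):

```python
import numpy as np

rng = np.random.default_rng(3)
W_fa = rng.normal(size=(6, 2))         # stand-in for an ML FA loading matrix (p x k)

U, S, Vt = np.linalg.svd(W_fa, full_matrices=False)
W_rotated = U * S                      # = U @ diag(S): the "rotated" FA loadings

print(np.allclose(W_rotated @ W_rotated.T, W_fa @ W_fa.T))  # True: same W W^T
print(np.round(W_rotated.T @ W_rotated, 3))  # diagonal, entries sorted in decreasing order
```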

So there is no difference between FA and PPCA in this respect.

amoeba
  • I'm sorry, I could not get your factor analysis model formula. Can you explain it and compare with the model presented by me [here](http://stats.stackexchange.com/a/94104/3277)? I suspect that your formula could be wrong. – ttnphns Dec 04 '15 at 10:12
  • @ttnphns, this is the same formula as shown in the scan of Bishop's *Machine Learning* book [reproduced in this question](http://stats.stackexchange.com/questions/95038); I really don't think it's wrong. When maximum likelihood is used to perform FA, then it's maximum likelihood with respect to this generative model. Perhaps this notation is confusing you? It's simply saying that each variable $x_i$ is a linear combination of factors $z_j$ (that are all assumed to have zero mean and unit variance) plus some noise; [cont.] – amoeba Dec 04 '15 at 10:33
  • ... the noise is uncorrelated for different variables but can have unequal variance. We can write explicitly: $x_i = \mu_i + \sum W_{ij} z_j + \epsilon_i$, where $\epsilon_i$ is Gaussian noise with zero mean and variance $\sigma^2_i$. Here $W$ are loadings and $\sigma^2_i$ are "uniquenesses". I think this directly corresponds to your equations in the linked thread. Matrix $\Psi$ is diagonal matrix with $\sigma^2_i$ on the diagonal. – amoeba Dec 04 '15 at 10:33
  • But the uniquenesses are the variances of the unique factors (a separate set of latent variables). While in your formula $\mathbf x \sim \mathcal N(\mathbf W \mathbf z + \boldsymbol \mu, \boldsymbol \Psi)$ it looks as if the variables $X$ themselves are generated having such variances. I haven't understood it. Maybe it is correct (?), I'm just saying that I didn't understand. – ttnphns Dec 04 '15 at 11:28
  • (cont.) Also, your formula is not quite like the cited Bishop's formula: that formula speaks of the _conditional_ distribution of $X$. – ttnphns Dec 04 '15 at 11:34
  • @ttnphns: The conditional formula is the correct one, I omitted it only for brevity (it should be clear that the formula describes conditional distribution of $x$ given $z$, because it has $z$ on the right side). I now edited to write it explicitly in the conditional form, like Bishop does. I am not sure, does it make it clear for you or does the confusion lie elsewhere? It's true: if common factors (Z's) are fixed then X are generated randomly with uniquenesses variances around WZ. It's equivalent to saying that "unique factors" with these variances are additionally added. – amoeba Dec 04 '15 at 11:46
  • Yes, true. It should be added that the conditional covariance is also zero: $\operatorname{cov}(x_i,x_t \mid z)=0$. – ttnphns Dec 04 '15 at 11:55
  • @ttnphns: This is true but does not need to be added because it follows from the fact that $\Psi$ is diagonal: noise ("unique factors") is uncorrelated between variables, hence conditional covariance is zero. – amoeba Dec 04 '15 at 11:58