
I'm trying to understand Factor Analysis. After reading the sklearn manual, I have learned that FA boils down to modeling the data as a noisy multivariate Gaussian distribution

$$P[x|\mu, W, \Psi] = \mathcal{N}(\mu, WW^T + \Psi)$$

where $WW^T$ is the part of the covariance contributed by the latent factors, and $\Psi$ is the covariance matrix of the noise, which is further assumed to be diagonal.
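For concreteness, here is a minimal sketch of the generative model behind this density (the sizes and seed below are illustrative): latent factors $z \sim \mathcal{N}(0, I)$ are mixed by $W$, and independent noise with diagonal covariance $\Psi$ is added, so the covariance of $x$ is $WW^T + \Psi$.

import numpy as np

rng = np.random.default_rng(0)
p, m, n = 5, 2, 200000

W = rng.normal(size=(p, m))      # factor loadings
psi = rng.uniform(0.5, 1.5, p)   # diagonal of the noise covariance Psi
z = rng.normal(size=(n, m))      # latent factors ~ N(0, I)
x = z @ W.T + rng.normal(size=(n, p)) * np.sqrt(psi)  # observed data

# The empirical covariance approaches W W^T + Psi as n grows
print(np.abs(np.cov(x.T) - (W @ W.T + np.diag(psi))).max())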

The estimation procedure looks fundamentally ambiguous to me: it does not seem possible to uniquely estimate $WW^T$ and $\Psi$. The problem resembles solving an equation like $A + B = 5$ for both $A$ and $B$.
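For instance, in the scalar case $p = m = 1$ the analogy is exact: any split $c = w^2 + \psi$ with $\psi \ge 0$ reproduces the observed variance perfectly.

# Scalar illustration of the apparent ambiguity: many (w, psi) pairs
# reproduce the same total variance c
c = 5.0
for w in [0.0, 1.0, 2.0]:
    psi = c - w**2
    print(w, psi, w**2 + psi)  # the sum is always 5.0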

How do standard FA algorithms break this ambiguity? What is the convention?

EDIT: Ok, let me present numerical evidence in support of the problem. Let's start by producing a random covariance matrix using random positive eigenvalues and a random orthogonal matrix

import numpy as np
from scipy.stats import ortho_group

def make_random_covariance_matrix(nDim):
    # A random rotation applied to random positive eigenvalues in [1, 5]
    # yields a well-conditioned symmetric positive-definite matrix
    R = ortho_group.rvs(nDim)
    e = np.diag(np.random.uniform(1, 5, nDim))
    return R.dot(e.dot(R.T))

p = 10  # number of observed variables
m = 7   # number of latent factors
Cov = make_random_covariance_matrix(p)
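As a quick sanity check (reusing Cov from the block above), the result should be symmetric with eigenvalues in the sampled range:

# Sanity check: symmetric, with eigenvalues inside the sampled interval [1, 5]
assert np.allclose(Cov, Cov.T)
print(np.linalg.eigvalsh(Cov))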

Now, let's write a numerical procedure that finds $W$ and $\Psi$ minimizing the loss

$$L^2 = \|C - WW^T - \Psi\|_F^2$$

where $\|\cdot\|_F$ is the Frobenius norm. (The code below minimizes the unsquared norm, which has the same minimizers.) As noted above, $W$ has dimensions $(p, m)$ and $\Psi$ is diagonal; its off-diagonal entries are identically zero and are not among the optimized parameters.

from scipy.optimize import minimize

def pack_param(W, psi):
    # Flatten W and the diagonal of Psi into a single parameter vector
    return np.hstack([W.flatten(), psi])

def unpack_param(x, p, m):
    pm = p * m
    return x[:pm].reshape((p, m)), x[pm:]

def loss_function_FA(C, W, psi):
    # Frobenius norm of the residual C - W W^T - Psi
    return np.linalg.norm(C - W.dot(W.T) - np.diag(psi))

loss_wrapper = lambda x: loss_function_FA(Cov, *unpack_param(x, p, m))

W0 = np.random.uniform(1, 2, (p, m))
psi0 = np.random.uniform(0, 1, p)
x0 = pack_param(W0, psi0)

# Entries of W are unconstrained; the noise variances must be non-negative
boundsX = [(None, None)] * (p * m) + [(0, None)] * p

rez = minimize(loss_wrapper, x0, method='SLSQP', bounds=boundsX,
               options={'maxiter': 1000})
print(rez.success, ';', rez.fun, ';', rez.message)
W, psi = unpack_param(rez.x, p, m)
print(np.sort(psi)[::-1])

We run this numerical minimization procedure several times for the same covariance matrix $C$, but with different initial conditions for $W$ and $\Psi$. In every run the solver reports convergence.
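For reference, here is a sketch of this repeated-run experiment, reusing the definitions from the block above:

# Re-run the optimization from several random starting points
for trial in range(5):
    x0 = pack_param(np.random.uniform(1, 2, (p, m)), np.random.uniform(0, 1, p))
    rez = minimize(loss_wrapper, x0, method='SLSQP', bounds=boundsX,
                   options={'maxiter': 1000})
    _, psi = unpack_param(rez.x, p, m)
    print(trial, rez.success, round(rez.fun, 6), np.round(np.sort(psi)[::-1], 3))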

For every solver result, I print the sorted diagonal of $\Psi$. I was expecting the values of $\Psi$ and $WW^T$ to be unique, with only the specific $W$ varying. However, this is not the case: I get different values of $\Psi$ for different initial conditions. They always lie in roughly the same range (between approximately 1 and 2 in my case), but they do not match exactly.

EDIT2: I have tested more extensively. With $p=10$ and $m=1$, the result for $\Psi$ is consistent across different initialization points. However, with $p=10$ and $m=7$ the result varies significantly in every element of $\Psi$. This suggests that FA is robust when $p \gg m$, and much less so when $p \approx m$. Is this correct?
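A rough degrees-of-freedom count seems to support this (the classical Ledermann condition; a necessary heuristic, not a full identifiability proof). The covariance matrix supplies $p(p+1)/2$ numbers, while $(W, \Psi)$ contributes $pm + p$ parameters, of which $m(m-1)/2$ are absorbed by the rotational freedom of $W$:

def fa_dof_excess(p, m):
    # Constraints supplied by the covariance matrix minus effective free
    # parameters; a negative value means the model cannot be pinned down
    constraints = p * (p + 1) // 2
    params = p * m + p - m * (m - 1) // 2
    return constraints - params

print(fa_dof_excess(10, 1))  # 35 -> overdetermined: Psi is pinned down
print(fa_dof_excess(10, 7))  # -4 -> underdetermined: matches the ambiguity above

For $p = 10$ the excess first turns negative at $m = 7$, which is consistent with the behaviour observed above.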

  • W is of size `p x m` (p = number of variables, m = number of factors) and `p > m`, often `>>`, so W will not be diagonal. – ttnphns Jun 14 '20 at 18:46
  • The factor solution is still not unique: different [rotations](https://stats.stackexchange.com/q/151653/3277) of W, orthogonal or oblique, yield the same reconstruction of the covariance matrix. – ttnphns Jun 14 '20 at 18:49
  • @ttnphns There was a typo; I meant to say that $WW^T$ was diagonal. But I don't think you have understood the question. The question is not about how to get the covariance matrix from W; it is about how to discriminate between the data covariance and the noise covariance. – Aleksejs Fomins Jun 15 '20 at 07:55
  • FA is not meant for, and is not applied to, data with a diagonal covariance matrix. – ttnphns Jun 15 '20 at 11:26
  • @ttnphns That was an example; I can delete it. My point stands regardless. There is one covariance matrix and one noise matrix, and they are added together. I claim that it is not possible to discriminate between them when performing inference. It is the same as trying to guess A and B from the equation A + B = 5: there are infinitely many solutions. By this logic FA should not work at all. I do not understand what information FA can use to discriminate between the covariance of the data and the covariance of the noise. Yet actual implementations do it somehow. That is my question: how does it work? – Aleksejs Fomins Jun 15 '20 at 21:07
  • The observed cov. matrix of the data is not WW', but is S = WW' + noise (+ small error). Only the noise is diagonal. S is pxp and has rank p. Now, WW' is (i) almost equal to S in its off-diagonal values (but for that small error), _and_ (ii) WW' is constrained to have rank m (m < p). – ttnphns Jun 15 '20 at 21:48
  • @ttnphns, Ok, I think I'm starting to understand. If possible, could you comment on why the low-rank constraint is sufficient to uniquely determine the noise matrix. Let's assume that $m=4$, $p=5$. From naive dimensional analysis I would guess that it may be sufficient to only fix one of the 5 noise parameters to make the optimal solution for WW' having rank 4. Then the other 4 noise parameters would remain unconstrained. – Aleksejs Fomins Jun 16 '20 at 08:33
  • You forgot that the off-diagonal entries of WW' must be (almost) equal to those of S. And the diagonal of WW' + the noise = the diagonal of S. – ttnphns Jun 16 '20 at 19:53
  • @ttnphns Sorry, I don't think it's enough for me to figure it out. Can you perhaps provide a link to a good chapter in some lecture notes that derives the complete procedure of FA from scratch - all the way from making assumptions about probability distributions to choosing a numerical procedure to solve for resulting parameters? – Aleksejs Fomins Jun 17 '20 at 09:20
  • There are many nice books and chapters on FA. But please [glance](https://stats.stackexchange.com/a/95106/3277) at it geometrically in the pic with the sun. The correlating (p=2) vars X1 and X2 are decomposed into factor F (m=1) and p=2 unique ("noise") variates U1 and U2. All three of the latter must be _mutually orthogonal_ (put algebraically: the off-diagonals of WW' equal those of S, and the noise is a diagonal matrix); therefore the lengths of the three vectors, sqrt(a1^2+a2^2), u1, u2, are tied together, being the decomposition of the X1, X2 cov.; you cannot make the u1 and u2 lengths arbitrary. – ttnphns Jun 17 '20 at 11:26
  • @ttnphns This is also very helpful. I now see that FA wants to use the extra free dimensions to decompose factors and noise into orthogonal components. But it is not obvious how many free dimensions one needs for how many factors in a higher-dimensional space. More precisely, given $p$, what is the highest value of $m$ that can be robustly estimated? – Aleksejs Fomins Jun 17 '20 at 12:15
  • You are correct about the "extra dimensions". As for the maximum or optimal number m - that is specially estimated by various methods within the FA area. Very formally, most of the time m=p-1 is possible mathematically as the highest m. But the optimal m is usually considerably less than p; m is selected by the user, it is not determined by FA automatically. – ttnphns Jun 17 '20 at 12:27
  • @ttnphns ok, that more or less resolves my problem. Thank you for your time – Aleksejs Fomins Jun 17 '20 at 14:14
