I'm trying to understand Factor Analysis. After reading the sklearn manual, I have learned that FA boils down to modeling the data as a noisy multivariate Gaussian distribution
$$P[x|\mu, W, \Psi] = \mathcal{N}(\mu, WW^T + \Psi)$$
where $W$ is the $p \times m$ matrix of factor loadings, so $WW^T$ is the part of the data covariance explained by the latent factors, and $\Psi$ is the covariance matrix of the noise, which is further assumed to be diagonal.
It seems to me that this estimation problem is fundamentally ambiguous: since only the sum $WW^T + \Psi$ enters the likelihood, it does not appear possible to uniquely estimate both $WW^T$ and $\Psi$. The problem looks to me like trying to solve the single equation $A + B = 5$ for both $A$ and $B$.
How do standard FA algorithms break this ambiguity? What is the convention?
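(To be clear, I understand that $W$ itself can never be unique: right-multiplying it by any orthogonal matrix $R$ leaves $WW^T$ unchanged, since $WRR^TW^T = WW^T$. A quick sketch of this rotational freedom, with arbitrary dimensions of my own choosing:)

```python
import numpy as np
from scipy.stats import ortho_group

p, m = 10, 7
W = np.random.uniform(1, 2, (p, m))
R = ortho_group.rvs(m)  # random orthogonal m x m matrix

# W and W R yield exactly the same low-rank covariance term
print(np.allclose(W.dot(W.T), W.dot(R).dot(W.dot(R).T)))  # True
```

My question is about the additional freedom between $WW^T$ and $\Psi$, which this rotation does not explain.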
EDIT: Ok, let me present numerical evidence in support of the problem. Let's start by producing a random covariance matrix using random positive eigenvalues and random orthogonal matrices
```python
import numpy as np
from scipy.stats import ortho_group

def make_random_covariance_matrix(nDim):
    # rotate random positive eigenvalues by a random orthogonal matrix
    R = ortho_group.rvs(nDim)
    e = np.diag(np.random.uniform(1, 5, nDim))
    return R.dot(e.dot(R.T))

p = 10  # observed dimensions
m = 7   # latent dimensions
Cov = make_random_covariance_matrix(p)
```
Now, let's write a numerical procedure that finds the $W$ and $\Psi$ minimizing the squared loss
$$L^2 = ||C - WW^T - \Psi||^2$$
Here $W$ has dimensions $(p, m)$ and $\Psi$ is diagonal; the off-diagonal entries of $\Psi$ are identically zero and are not part of the minimized parameters.
```python
from scipy.optimize import minimize

def pack_param(W, psi):
    return np.hstack([W.flatten(), psi])

def unpack_param(x, p, m):
    pm = p * m
    return x[:pm].reshape((p, m)), x[pm:]

def loss_function_FA(C, W, psi):
    # Frobenius norm of the residual C - W W^T - Psi
    return np.linalg.norm(C - W.dot(W.T) - np.diag(psi))

loss_wrapper = lambda x: loss_function_FA(Cov, *unpack_param(x, p, m))

W0 = np.random.uniform(1, 2, (p, m))
psi0 = np.random.uniform(0, 1, p)
x0 = pack_param(W0, psi0)

# entries of W are unconstrained; the diagonal of Psi must be non-negative
boundsX = [(None, None)] * (p * m) + [(0, None)] * p
rez = minimize(loss_wrapper, x0, method='SLSQP', bounds=boundsX,
               options={'maxiter': 1000})
print(rez.success, ';', rez.fun, ';', rez.message)

W, psi = unpack_param(rez.x, p, m)
print(np.sort(psi)[::-1])
```
We run this numerical minimization procedure several times on the same covariance matrix $C$, but with different initial conditions for $W$ and $\Psi$. In every run the solver reports convergence.
For each solver result, I print the sorted diagonal of $\Psi$. I was expecting the values of $\Psi$ and $WW^T$ to be unique, with only the specific values of $W$ varying. However, this is not the case: I get different values of $\Psi$ for different initial conditions. They always fall in roughly the same range (between approx. 1 and 2 in my case), but they do not match exactly.
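For completeness, here is a self-contained version of the restart experiment (seeds and the number of restarts are my own arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import ortho_group

def make_random_covariance_matrix(nDim, seed):
    # rotate random positive eigenvalues by a random orthogonal matrix
    R = ortho_group.rvs(nDim, random_state=seed)
    e = np.random.default_rng(seed).uniform(1, 5, nDim)
    return R.dot(np.diag(e)).dot(R.T)

def fit_psi(C, p, m, rng):
    # minimize ||C - W W^T - diag(psi)|| from a random starting point
    def loss(x):
        W = x[:p * m].reshape((p, m))
        return np.linalg.norm(C - W.dot(W.T) - np.diag(x[p * m:]))
    x0 = np.hstack([rng.uniform(1, 2, p * m), rng.uniform(0, 1, p)])
    bounds = [(None, None)] * (p * m) + [(0, None)] * p
    rez = minimize(loss, x0, method='SLSQP', bounds=bounds,
                   options={'maxiter': 1000})
    return np.sort(rez.x[p * m:])[::-1]  # sorted diagonal of Psi

p, m = 10, 7
C = make_random_covariance_matrix(p, seed=0)
rng = np.random.default_rng(1)
for _ in range(3):
    print(fit_psi(C, p, m, rng))  # sorted diag(Psi) per restart
```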
EDIT2: I have tested more extensively. With $p=10$ and $m=1$, the result for $\Psi$ is consistent across different initialization points. However, with $p=10$ and $m=7$ the result varies significantly in every element of $\Psi$. This suggests that FA is robust when $p \gg m$, and not so good when $p \approx m$. Is this correct?
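Following up on my own hypothesis, a back-of-the-envelope parameter count seems to support it (this is my reasoning, not something from the sklearn manual): matching a symmetric $C = WW^T + \Psi$ gives $p(p+1)/2$ equations, while the free parameters number $pm$ for $W$ plus $p$ for the diagonal of $\Psi$, minus $m(m-1)/2$ for the rotational freedom in $W$.

```python
def fa_degrees_of_freedom(p, m):
    # equations minus free parameters when matching C = W W^T + diag(psi);
    # a negative value means the fit is underdetermined
    equations = p * (p + 1) // 2            # distinct entries of symmetric C
    params = p * m + p - m * (m - 1) // 2   # W and diag(Psi), minus rotations
    return equations - params

print(fa_degrees_of_freedom(10, 1))  # 35: overdetermined, Psi pinned down
print(fa_degrees_of_freedom(10, 7))  # -4: underdetermined, Psi can drift
```

This count matches what I observe: $m=1$ gives a consistent $\Psi$, while $m=7$ does not.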