
I always assumed that PCA is exactly the eigendecomposition of the covariance matrix. That is, the explained variances are the eigenvalues of the covariance matrix, and the components are its eigenvectors. To test my understanding, I have implemented PCA both using the above logic and using an existing library. The results look different. Can somebody explain why? Is there a gap in my understanding, or is it a bug in the code?

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

nDim = 20
nData = 1000

def naive_pca(d):
    # eigenvalues of the covariance matrix, sorted in descending order
    return np.sort(np.linalg.eig(np.cov(d))[0])[::-1]

data = np.random.uniform(0, 1, (nDim, nData))

pca = PCA(n_components=nDim)
pca.fit(data)

fig, ax = plt.subplots(ncols=3, figsize=(12, 4))
ax[0].plot(pca.explained_variance_)
ax[1].plot(pca.singular_values_)
ax[2].plot(naive_pca(data))
ax[0].set_title("PCA: explained variance")
ax[1].set_title("PCA: singular values")
ax[2].set_title("eigenvalues of covariance")
plt.show()

[Figure: the three plots produced by the code above: PCA explained variance, PCA singular values, and eigenvalues of the covariance]

Aleksejs Fomins
  • Maybe not exactly what you are looking for, but [here](https://stats.stackexchange.com/questions/2691/making-sense-of-principal-component-analysis-eigenvectors-eigenvalues) is a great explanation of PCA. – Stochastic Jun 01 '20 at 09:34

1 Answer


You mixed up the rows and columns: scikit-learn assumes that samples are in rows and features in columns, while numpy's np.cov uses rowvar=True by default, i.e. it treats rows as variables:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

np.random.seed(42)
nDim = 20
nData = 1000

def naive_pca(d):
    # eigenvalues of the covariance matrix, sorted in descending order
    return np.sort(np.linalg.eig(np.cov(d))[0])[::-1]

data = np.random.uniform(0, 1, (nDim, nData))

pca = PCA(n_components=nDim)
pca.fit(data.T)  # <----- HERE: samples in rows, features in columns

fig, ax = plt.subplots(ncols=3, figsize=(12, 4))
ax[0].plot(pca.explained_variance_)
ax[1].plot(pca.singular_values_)
ax[2].plot(naive_pca(data))
ax[0].set_title("PCA: explained variance")
ax[1].set_title("PCA: singular values")
ax[2].set_title("eigenvalues of covariance")
plt.show()

[Figure: the three plots produced by the corrected code]
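
As a quick sanity check, here is a sketch that is not part of the original answer (it assumes numpy and scikit-learn imported as above): the covariance eigenvalues should coincide with pca.explained_variance_, and the singular values relate to them by explained_variance_ = singular_values_**2 / (n_samples - 1), which is also the scaling factor between the panels:

import numpy as np
from sklearn.decomposition import PCA

np.random.seed(42)
nDim, nData = 20, 1000
data = np.random.uniform(0, 1, (nDim, nData))  # variables in rows, samples in columns

# np.cov(data) with the default rowvar=True treats rows as variables,
# giving an (nDim, nDim) covariance matrix; eigvalsh is used here because
# the covariance matrix is symmetric.
cov_eigvals = np.sort(np.linalg.eigvalsh(np.cov(data)))[::-1]

pca = PCA(n_components=nDim)
pca.fit(data.T)  # samples in rows for scikit-learn

print(np.allclose(pca.explained_variance_, cov_eigvals))
print(np.allclose(pca.explained_variance_, pca.singular_values_**2 / (nData - 1)))

Both checks should print True, since np.cov and PCA both center the data and normalize by n_samples - 1.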

Tim
  • Wait, you are calculating the mean over axis=0? That would be the mean over dimensions, which is quite surprising. I expected that for each dimension the mean over data points has to be subtracted. – Aleksejs Fomins Jun 01 '20 at 09:34
  • Also, if possible, can you provide formulas for the scaling factors between the three plots – Aleksejs Fomins Jun 01 '20 at 09:35
  • @AleksejsFomins assuming that you are using scikit learn, axis 0 is samples, axis 1 is dimensions. – Tim Jun 01 '20 at 09:36
  • Ok, I get it, the problem was that I was plugging the transpose into scikit-learn... I most definitely do not need to subtract the mean however, because np.cov already does that – Aleksejs Fomins Jun 01 '20 at 09:41
  • @AleksejsFomins yeap. Sorry for the answer that was initially misleading. Also, completely random data is not the best example for trying PCA; better to try it on meaningful data that can actually be reduced to a lower dimensionality. – Tim Jun 01 '20 at 09:43
  • Thanks for your help :). I actually want to compare the eigenvalue spectrum of random data with that of my dynamical system, and see if the system behaves differently, e.g. more like a logistic curve where some components are very important and others are just noise. I wonder if a test like this already exists somewhere. – Aleksejs Fomins Jun 01 '20 at 09:47