
I am performing PCA on a dataset with 28 features plus 1 class label and 11M rows (samples), using the following simple code:

from sklearn.decomposition import PCA
import pandas as pd

# HIGGS.csv has no header row; column 0 is the class label,
# columns 1-28 are the features
df = pd.read_csv('HIGGS.csv', sep=',', header=None)

df_labels = df[df.columns[0]]
df_features = df.drop(df.columns[0], axis=1)

pca = PCA()
pca.fit(df_features.values)
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.shape)
transformed_data = pca.transform(df_features.values)

The values of pca.explained_variance_ratio_ (the eigenvalues, normalized to sum to 1) are the following:

[0.11581302 0.09659324 0.08451179 0.07000956 0.0641502  0.05651781
 0.055588   0.05446682 0.05291956 0.04468113 0.04248516 0.04108151
 0.03885671 0.03775394 0.0255504  0.02181292 0.01979832 0.0185323
 0.0164828  0.01047363 0.00779365 0.00702242 0.00586635 0.00531234
 0.00300572 0.00135565 0.00109707 0.00046801]

Based on the explained_variance_ratio_, I don't know if there is something wrong here. The highest component explains only about 11% of the variance, whereas I expected the first components to explain something like 99%. Does this imply that the dataset needs some preprocessing, such as transforming the features toward a normal distribution?

steve

2 Answers


There is no rule that the first principal component must explain a large proportion of the variance.

PCA finds orthogonal linear combinations of your original variables such that the first principal component has the highest variance, the second the second highest, and so on. However, 'highest' does not mean large in absolute terms, only that no other linear combination has greater variance.
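
For concreteness, here is a minimal sketch on synthetic data (not your HIGGS file) showing that explained_variance_ratio_ is just the eigenvalues of the sample covariance matrix divided by their sum, so nothing forces the first entry to be large:

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))   # five nearly independent features
X[:, 1] += 0.3 * X[:, 0]           # inject some mild correlation

pca = PCA().fit(X)

# Eigenvalues of the sample covariance matrix, largest first
eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]

print(pca.explained_variance_ratio_)  # from sklearn
print(eigvals / eigvals.sum())        # same numbers, computed by hand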

But PCA just maximizes variance; it is not some magical tool for finding the most interesting combination of variables. If your data come from a strongly skewed probability distribution, then variance isn't a very informative measure of variability.


As for how to proceed:

  1. Consider the purpose of your analysis and the data generating process. Is PCA truly the best way to do whatever you are trying to accomplish? (Dimension reduction, orthogonalization, ...)
  2. If the answer to (1) is yes, then you may consider transforming the variables such that their variances are more representative of their variability. You may also conclude that your variables already are approximately normal, in which case 11% is apparently the largest amount of the total variance that can be explained in a single linear combination of your original variables.
  3. Are all variables measured on the same scale? Are they vastly different things on different scales? PCA is not invariant to scaling (variance depends on scale!), and whether you scale or not has important implications for how you can interpret the results (see the sketch after this list).
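
To illustrate point 3, here is a minimal sketch on synthetic data (the factor of 1000 is a hypothetical unit mismatch) of how standardizing the features changes the ratios PCA reports:

import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
X[:, 0] *= 1000.0                  # e.g. one feature in different units

print(PCA().fit(X).explained_variance_ratio_)
# ~[1.0, 0.0, 0.0] -- the large-scale feature dominates

X_std = StandardScaler().fit_transform(X)
print(PCA().fit(X_std).explained_variance_ratio_)
# ~[0.33, 0.33, 0.33] -- on a common scale, no direction dominates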

Have a look at the other PCA-related questions here (e.g. this one). If you better understand what PCA does, you will also better understand the results you get and, more importantly, whether it is the right tool for the job to begin with.

Frans Rodenburg

If the first few eigenvalues don't explain the bulk of the variance in your data, it means one of two things:

  • The data is just random noise. Try running PCA on a matrix of independent points sampled from a standard Gaussian. You'd see that the eigenvalues are much more evenly dispersed (see the sketch after this list).

  • Basic PCA isn't sufficient for dimensionality reduction of your data. If this is the case, you may want to try more advanced methods, like kernel PCA.
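
Here is a minimal sketch of both points on synthetic data; the rbf kernel and the subsample size are illustrative choices, not recommendations for your dataset:

import numpy as np
from sklearn.decomposition import PCA, KernelPCA

rng = np.random.default_rng(0)
noise = rng.standard_normal(size=(10_000, 28))  # same width as the HIGGS features

# Pure noise: every ratio hovers around 1/28 ~ 0.036
print(PCA().fit(noise).explained_variance_ratio_)

# Kernel PCA can capture nonlinear structure, but it is expensive,
# so at 11M rows you would fit it on a subsample:
kpca = KernelPCA(n_components=10, kernel='rbf')
transformed = kpca.fit_transform(noise[:2_000])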

user3294195