
Forgive me if this is too much of a beginner question, but I am having a hard time understanding the implementation of PCA on a dataset with 18 features. These are the questions I have:

  1. PCA normally reduces the dimensionality from n to k. In my case, the number of features is n = 18. How do I determine what value of k to reduce my dimensions to?

  2. I used the code below to do PCA (taken from one of the tutorials):

    from sklearn.decomposition import PCA
    pca = PCA(n_components=3)
    print(norm_features.shape)
    pca.fit(norm_features)
    print(pca.components_)
    
  3. It returns a total of three vectors. Should I consider those my reduced dimensions?

Every example I see on the internet uses its own dataset, and one example is completely different from another, so I am unable to generalize the concept. Any pointers to some resources?

2 Answers


PCA itself is not a procedure for choosing the number of principal components. It is a procedure for calculating the values of the principal components. Choosing how many of these principal components to keep is up to you. Your example code produces 3 of them because you asked for n_components=3 in the call to PCA.

There are a variety of philosophies on how to choose the number of principal components to keep. This topic should be covered in any introductory text that also discusses PCA, such as:

Hastie, T., Tibshirani, R., & Friedman, J. H. (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). New York, NY: Springer. Retrieved from http://www-stat.stanford.edu/~tibs/ElemStatLearn

See also: Choosing number of principal components to retain
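To make this concrete, here is a minimal sketch (using random data in place of your norm_features, which I don't have access to) showing that each of the three "vectors" you got is one principal component, with one weight per original feature:

```python
import numpy as np
from sklearn.decomposition import PCA

# Stand-in for norm_features: 100 samples, 18 features
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 18))

pca = PCA(n_components=3)
pca.fit(X)

# One row per requested component, one column per original feature
print(pca.components_.shape)  # (3, 18)

# Projecting the data onto these components gives the reduced representation
X_reduced = pca.transform(X)
print(X_reduced.shape)  # (100, 3)
```

The reduced dimensions you asked about in question 3 are the columns of `X_reduced` (the projections), not `pca.components_` itself (the directions projected onto).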

Kodiologist
  • Ok, I understand that part, and also the fact that the output is a vector that minimizes the projection error onto that vector. How do I go about using this vector in building my model? – karun_r Jun 18 '17 at 00:40
  • @karun_r You don't get just one vector; each component is a vector. As for what to do with the extracted components: I don't know; why did you do PCA in the first place? – Kodiologist Jun 18 '17 at 00:45
  • Yes, I got a total of three vectors since I provided n_components = 3 when I initialized the pca variable. I am using PCA to reduce the dimensions of my feature data, as there are a total of 18 features. As per Andrew Ng's course and other online resources, I was under the assumption that I should be performing PCA after mean normalization to get my data in order before building a model. Is that not a correct approach? I am actually using this in my attempt at my first Kaggle problem (King County Housing). – karun_r Jun 18 '17 at 00:48
  • @karun_r It's not incorrect to use PCA, but it's not mandatory either. There are lots of different ways in which PCA is incorporated into larger analyses, and I couldn't tell you whether PCA makes sense for your situation or how you should use it without knowing a lot more about the problem, the analytic strategy you're trying to pursue, what you hope to accomplish with PCA in particular, etc. You may have been led to believe that data analysis is a lot simpler and more mechanical than it really is. – Kodiologist Jun 18 '17 at 00:53
  • Thanks for the explanation. I have just started learning about data analysis, and I think this Kaggle dataset was the first dataset on which I got my hands dirty. I hope I will learn as I move forward using the resources online. Thanks again for providing another viewpoint. – karun_r Jun 18 '17 at 01:41
  • Do you recommend any books for proper learning of statistical analysis? – karun_r Jun 18 '17 at 05:44

Setting n_components to a fixed number might not be the best way to reduce the dimensionality of your data if you don't already have a good guess what this number should be. If you set n_components to a value between 0.0 and 1.0 (see the documentation), PCA will choose the smallest number of principal components that explains at least that fraction of the variance in your data. For example, setting 0.8 keeps enough components to represent 80% of the variance; the remaining 20%, hopefully mostly noise, gets dropped. You can also look at the explained_variance_ratio_ attribute to see what fraction of the total variance each component represents.
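As a sketch (again on random data, so the number of components chosen here is just illustrative):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 18))

# A float in (0, 1) asks PCA to keep the smallest number of components
# whose cumulative explained variance ratio is at least that fraction
pca = PCA(n_components=0.8)
X_reduced = pca.fit_transform(X)

print(pca.n_components_)                    # number of components PCA chose
print(pca.explained_variance_ratio_)        # per-component fraction of variance
print(pca.explained_variance_ratio_.sum())  # cumulative fraction, >= 0.8
```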