I'm a Software Engineer trying to learn how to do a Principal Components Analysis (PCA) in Python or R.
I've found a few links which do a good job of explaining the concept at a high level. However, I haven't seen any examples that walk you through all of the steps from start to finish.
For example, let's say you have a 50-dimensional dataset with 50 columns of varying data types (boolean, float, integer, varchar, etc.). Do those values need to be scaled or normalized to something like 0.0..1.0, or can the PCA algorithm handle those disparate data types?
Ideally, I want to see something that walks through each step and explains it along the way, especially one that starts with disparate data that needs to be scaled or normalized. All of the examples I've seen online, including ones that use well-known example datasets (such as the Iris dataset), start with pristine data where all of the columns are the same data type. I'm starting with a large dataset with many columns of varying data types. What do I do?
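To make the question concrete, here is a rough Python/scikit-learn sketch of what I'm guessing the preprocessing might look like. The file name, the column name, and the choice to one-hot encode the varchar column are all just my assumptions, which is exactly the part I'd like a walkthrough to confirm or correct:

    # My guess at the preprocessing, not a known-good recipe.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    df = pd.read_csv("my_50_column_dataset.csv")  # hypothetical file name

    # Guess: convert varchar/boolean columns into numeric dummy columns,
    # since PCA presumably needs every column to be numeric.
    df_numeric = pd.get_dummies(df, columns=["some_varchar_col"])  # made-up column name
    df_numeric = df_numeric.astype(float)

    # Guess: standardize each column to zero mean and unit variance
    # so columns on wildly different scales don't dominate the components.
    X = StandardScaler().fit_transform(df_numeric.values)

    # Reduce to, say, 10 components (I don't yet know how to pick this number).
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)  # variance retained by each component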
Incidentally, after applying PCA to my dataset, I plan on running it through clustering (probably k-means).
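In my head that step looks something like the following, where the number of clusters is an arbitrary placeholder I'd also need help choosing:

    from sklearn.cluster import KMeans

    # Cluster the PCA-reduced rows; 5 clusters is just a placeholder value.
    kmeans = KMeans(n_clusters=5, random_state=0)
    labels = kmeans.fit_predict(X_reduced)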
Update 9/10/2015
Since this question has been marked as off-topic, I'm not able to submit or select an answer. In any case, I found two links from Sebastian Raschka to be very helpful: