I'm a Software Engineer trying to learn how to do a Principal Components Analysis (PCA) in Python or R.
I've found a few links which do a good job of explaining the concept at a high level. However, I haven't seen any examples that walk you through all of the steps from start to finish.
For example, let's say you have a 50-dimensional dataset with 50 columns of varying data types (boolean, float, integer, varchar, etc.). Do those values need to be scaled or normalized to something like 0.0..1.0, or can the PCA algorithm handle those disparate data types?
Ideally, I want to see something that walks through each step and explains it along the way, especially one that starts with disparate data that needs to be scaled or normalized. All of the examples I've seen online, including ones that use well-known example datasets (such as the Iris dataset), start with pristine data where all of the columns are the same data type. I'm starting with a large dataset with many columns of varying data types. What do I do?
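To make the question concrete, here is a rough Python/scikit-learn sketch of what I'm guessing the preprocessing might look like. The file name, the column name, and the choice to one-hot encode the varchar column are all just my assumptions, which is exactly the part I'd like a walkthrough to confirm or correct:

    # My guess at the preprocessing, not a known-good recipe.
    import pandas as pd
    from sklearn.preprocessing import StandardScaler
    from sklearn.decomposition import PCA

    df = pd.read_csv("my_50_column_dataset.csv")  # hypothetical file name

    # Guess: convert varchar/boolean columns into numeric dummy columns,
    # since PCA presumably needs every column to be numeric.
    df_numeric = pd.get_dummies(df, columns=["some_varchar_col"])  # made-up column name
    df_numeric = df_numeric.astype(float)

    # Guess: standardize each column to zero mean and unit variance
    # so columns on wildly different scales don't dominate the components.
    X = StandardScaler().fit_transform(df_numeric.values)

    # Reduce to, say, 10 components (I don't yet know how to pick this number).
    pca = PCA(n_components=10)
    X_reduced = pca.fit_transform(X)
    print(pca.explained_variance_ratio_)  # variance retained by each component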
Incidentally, after applying PCA to my dataset, I plan on running it through clustering (probably k-means).
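In my head that step looks something like the following, where the number of clusters is an arbitrary placeholder I'd also need help choosing:

    from sklearn.cluster import KMeans

    # Cluster the PCA-reduced rows; 5 clusters is just a placeholder value.
    kmeans = KMeans(n_clusters=5, random_state=0)
    labels = kmeans.fit_predict(X_reduced)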
Update 9/10/2015
Since this question has been marked as off-topic, I'm not able to submit or select an answer. In any case, I found two links from Sebastian Raschka to be very helpful: