I'm currently working on a cars data set that was downloaded from Craigslist in the United States. First of all, there are a ton of outliers, and in general the data are pretty subjective, since people self-report the car they want to sell. Some just put in a price of $123,456,789, and some cars have an odometer reading of 1,000,000,000, etc. These are obvious outliers, of course, and have been removed.
However, what I'm unsure about is this: how would you guys perform any data analysis (mainly PCA and FA) on the following variables?
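As a rough illustration of that clean-up step, a filter along these lines would drop such rows. The data frame name `cars_raw`, the toy values, and the cut-offs are placeholders for the example, not the thresholds I actually used.

```r
# Toy stand-in for the raw listings (values are made up for illustration)
cars_raw <- data.frame(
  price    = c(35990, 7500, 123456789, 2000),
  odometer = c(32742, 93600, 116000, 1e9)
)

# Hypothetical cut-offs; real ones should come from inspecting the data
max_price    <- 150000
max_odometer <- 500000

# Keep only rows whose price and odometer fall inside the plausible range
cars <- subset(cars_raw,
               price > 0 & price <= max_price &
               odometer >= 0 & odometer <= max_odometer)
nrow(cars)
```

Fixed cut-offs like these are easy to explain, but they are a judgment call and worth reporting alongside any results.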
'data.frame': 137643 obs. of 17 variables:
$ price : num 35990 7500 2000 19500 39990 ...
$ year : num 2010 2014 1974 2005 2012 ...
$ manufacturer: Factor w/ 41 levels "acura","alfa-romeo",..: 8 18 8 14 14 8 21 5 14 8 ...
$ model : chr "corvette grand sport" "sonata" "c-10" "f350 lariat" ...
$ condition : Ord.factor w/ 6 levels "salvage"<"fair"<..: 3 4 3 4 3 3 3 3 3 3 ...
$ cylinders : num 8 4 4 8 8 8 6 8 8 8 ...
$ fuel : Factor w/ 5 levels "gas","diesel",..: 1 1 1 2 1 1 1 1 1 1 ...
$ odometer : num 32742 93600 190000 116000 9692 ...
$ title_status: Factor w/ 6 levels "clean","lien",..: 1 1 1 2 1 1 1 1 1 1 ...
$ transmission: chr "other" "automatic" "automatic" "automatic" ...
$ drive : Factor w/ 3 levels "4wd","fwd","rwd": 3 2 3 1 3 3 1 3 3 3 ...
$ type : Factor w/ 13 levels "bus","convertible",..: 7 9 8 8 3 3 7 3 10 7 ...
$ state : num 2 2 2 2 2 2 2 2 2 2 ...
$ lat : num 32.6 32.5 32.9 32.5 32.6 ...
$ long : num -85.5 -85.5 -85.2 -85.5 -85.5 ...
$ posting_date: chr "2020-12-02" "2020-12-02" "2020-12-01" "2020-12-01" ...
$ age : num 10 6 46 15 8 8 3 7 17 8 ...
I haven't been idle and have tried several methods, but I'm interested in hearing how you guys would approach this data set. Since PCA should be done on continuous variables, I ran it on price, cylinders, odometer and age (derived from year). This gives me a result I can at least show, but I don't feel it actually tells me anything. As you can see in the data set above, there are a lot of categorical variables, which are stored as factors. I've tried using some of them in a Multiple Factor Analysis from the FactoMineR package, but I don't feel these give good results either; it seems this sort of factor analysis works better when the majority of the variables are continuous rather than categorical.
Next, I'm supposed to conduct a discriminant analysis on either the PCs or the factors, if they turn out to be any good; however, I doubt they will. Do any of you who have worked with this kind of data before have tips, tricks or comments on things I might have missed?
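For reference, a PCA on those four numeric variables can be sketched roughly as below with base R's `prcomp`. The data frame `cars` here is simulated toy data with the same column names (an assumption for the example, not my actual data), and standardising via `scale. = TRUE` is a choice I'm assuming because price and odometer are on very different scales from cylinders and age.

```r
set.seed(1)
n <- 200

# Simulated stand-in for the cleaned data; only the column names match
age  <- sample(1:30, n, replace = TRUE)
cars <- data.frame(
  age       = age,
  odometer  = pmax(0, age * 12000 + rnorm(n, sd = 20000)),
  cylinders = sample(c(4, 6, 8), n, replace = TRUE),
  price     = pmax(500, 40000 - 1000 * age + rnorm(n, sd = 5000))
)

num_vars <- c("price", "cylinders", "odometer", "age")

# Centre and scale so that no variable dominates purely through its units
pca <- prcomp(cars[, num_vars], center = TRUE, scale. = TRUE)

summary(pca)            # proportion of variance explained per component
round(pca$rotation, 2)  # loadings of each variable on each component
```

With only four inputs, two of which (age and odometer) are likely to be strongly correlated, the first component will mostly reflect that shared "wear" dimension, which may be part of why the result feels uninformative.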