I'm currently working on a cars.data set that was downloaded from Craigslist in the United States. First of all, there are a ton of outliers, and the data set is generally pretty subjective, as people self-report the car they want to sell. Some just put in a price of $123,456,789, and some cars have an odometer reading of 1,000,000,000, etc. These obvious outliers have of course been removed.
However, what I'm unsure about is how you would perform a data analysis (mainly PCA and FA) on the following variables.

'data.frame':   137643 obs. of  17 variables:
$ price       : num  35990 7500 2000 19500 39990 ...
$ year        : num  2010 2014 1974 2005 2012 ...
$ manufacturer: Factor w/ 41 levels "acura","alfa-romeo",..: 8 18 8 14 14 8 21 5 14 8 ...
$ model       : chr  "corvette grand sport" "sonata" "c-10" "f350 lariat" ...
$ condition   : Ord.factor w/ 6 levels "salvage"<"fair"<..: 3 4 3 4 3 3 3 3 3 3 ...
$ cylinders   : num  8 4 4 8 8 8 6 8 8 8 ...
$ fuel        : Factor w/ 5 levels "gas","diesel",..: 1 1 1 2 1 1 1 1 1 1 ...
$ odometer    : num  32742 93600 190000 116000 9692 ...
$ title_status: Factor w/ 6 levels "clean","lien",..: 1 1 1 2 1 1 1 1 1 1 ...
$ transmission: chr  "other" "automatic" "automatic" "automatic" ...
$ drive       : Factor w/ 3 levels "4wd","fwd","rwd": 3 2 3 1 3 3 1 3 3 3 ...
$ type        : Factor w/ 13 levels "bus","convertible",..: 7 9 8 8 3 3 7 3 10 7 ...
$ state       : num  2 2 2 2 2 2 2 2 2 2 ...
$ lat         : num  32.6 32.5 32.9 32.5 32.6 ...
$ long        : num  -85.5 -85.5 -85.2 -85.5 -85.5 ...
$ posting_date: chr  "2020-12-02" "2020-12-02" "2020-12-01" "2020-12-01" ...
$ age         : num  10 6 46 15 8 8 3 7 17 8 ...

I haven't been sitting idle and have tried several methods, but I'm interested in hearing how you would approach this data set. Since PCA should be done on continuous variables, I've used price, cylinders, odometer and age (derived from year) for my PCA. This does give me a result of sorts, at least one I can show, but I don't feel it actually tells me anything. As you can see in the data set above, there are a lot of categorical variables, which are coded as factors. I've tried using some of those in a Multiple Factor Analysis from the FactoMineR package, but I don't feel these give good results either, as it seems this sort of factor analysis works better when the majority of the variables are continuous rather than categorical.
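For reference, a minimal sketch of the PCA step described above. It runs on simulated stand-in data, not the real Craigslist data, and the log transform of price and odometer is an assumption added here to tame the skew; it is not something I have verified is the right choice for this data set.

```r
# Simulated stand-in for the four numeric columns (NOT the real data).
set.seed(1)
n <- 500
age       <- pmax(round(rexp(n, rate = 1 / 10)), 1)               # years
odometer  <- pmax(age * 12000 + rnorm(n, sd = 20000), 1000)       # miles
cylinders <- sample(c(4, 6, 8), n, replace = TRUE)
price     <- exp(10.5 - 0.08 * age + 0.1 * cylinders + rnorm(n, sd = 0.3))
cars_num  <- data.frame(price, cylinders, odometer, age)

# Assumption: log-transform the right-skewed variables before PCA.
cars_pca_input <- transform(cars_num,
                            price    = log(price),
                            odometer = log(odometer))

# Standardise, because the variables are on very different scales.
pca <- prcomp(cars_pca_input, center = TRUE, scale. = TRUE)

summary(pca)            # proportion of variance per component
round(pca$rotation, 2)  # loadings of each variable on each component
```

With only four variables, of which `cylinders` takes just a few distinct values, the components can hardly reveal more than the pairwise correlations already do, which may be why the result feels uninformative.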

Next, I'm supposed to conduct a discriminant analysis on either the PCs or the factors, if they turn out to be any good; however, I doubt that they will be. Do any of you who have worked with this kind of data before have tips, tricks or comments on things I might have missed?
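To make the planned step concrete, here is a rough sketch of a linear discriminant analysis on the first two PC scores. The data are again simulated, and `drive` as the grouping variable is only an illustrative assumption; the grouping in the actual assignment may differ. `lda()` comes from the MASS package.

```r
library(MASS)  # provides lda()

set.seed(2)
n <- 600
drive <- factor(sample(c("4wd", "fwd", "rwd"), n, replace = TRUE))

# Simulated numerics whose distribution shifts with drive type (illustrative only).
cylinders    <- ifelse(drive == "fwd", 4, 6) + sample(c(0, 2), n, replace = TRUE)
age          <- pmax(round(rexp(n, rate = 1 / 10)), 1) + ifelse(drive == "rwd", 5, 0)
log_odometer <- log(pmax(age * 12000 + rnorm(n, sd = 20000), 1000))
log_price    <- 10.5 - 0.08 * age + 0.1 * cylinders + rnorm(n, sd = 0.3)

pca <- prcomp(cbind(log_price, cylinders, log_odometer, age), scale. = TRUE)

scores <- as.data.frame(pca$x[, 1:2])  # keep the first two PCs
scores$drive <- drive

# Discriminant analysis with the PC scores as predictors.
fit  <- lda(drive ~ PC1 + PC2, data = scores)
pred <- predict(fit)$class

table(observed = scores$drive, predicted = pred)
mean(pred == scores$drive)  # resubstitution accuracy, which is optimistic
```

The accuracy above is computed on the training data; a hold-out split or `lda(..., CV = TRUE)` would give a more honest estimate.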

  • This answer contains a lot of further local links, so check it. https://stats.stackexchange.com/a/215483/3277l – ttnphns Jan 06 '21 at 18:01
  • Why would you feel obliged to include (as active variables, i.e. defining the latent dimensions) categorical variables along with quantitative ones, in PCA? Why not do PCA on a subset of conceptually meaningful correlated quantitative variables and then investigate how the latent dimensions associate with categorical factors? – ttnphns Jan 06 '21 at 18:08
  • Thanks for those answers. I've done a PCA on the conceptually meaningful correlated variables and am interpreting it. The reason I'm so fixated on PCA is that I'm supposed to argue specifically why this method would or would not be suitable. Therefore, before I argue my part, I just wanted to make sure that I didn't miss anything important. – vinbaronen Jan 08 '21 at 19:36
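The suggestion in the comments (run the PCA on the quantitative variables only, then check how the components relate to the categorical factors) could be sketched roughly as follows. The data are simulated, and `fuel` with two levels is a simplified stand-in for the real factor; eta-squared from a one-way ANOVA is just one possible measure of association.

```r
set.seed(3)
n <- 600
fuel <- factor(sample(c("gas", "diesel"), n, replace = TRUE, prob = c(0.8, 0.2)))

# Simulated numerics; diesel cars are given a price premium (illustrative only).
cylinders    <- sample(c(4, 6, 8), n, replace = TRUE)
age          <- pmax(round(rexp(n, rate = 1 / 10)), 1)
log_odometer <- log(pmax(age * 12000 + rnorm(n, sd = 20000), 1000))
log_price    <- 10.5 - 0.08 * age + 0.1 * cylinders +
                0.3 * (fuel == "diesel") + rnorm(n, sd = 0.3)

# PCA on the quantitative variables only; fuel is NOT an active variable.
pca <- prcomp(cbind(log_price, cylinders, log_odometer, age), scale. = TRUE)

# Mean PC1/PC2 score per fuel level.
level_means <- aggregate(as.data.frame(pca$x[, 1:2]),
                         by = list(fuel = fuel), FUN = mean)
level_means

# Eta-squared: share of PC1 variance accounted for by fuel.
ss     <- summary(aov(pca$x[, 1] ~ fuel))[[1]][["Sum Sq"]]
eta_sq <- ss[1] / sum(ss)
eta_sq
```

Repeating the last step for each factor and each retained component gives a small table showing which categorical variables line up with which latent dimension.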
