
I am having some problems understanding the variable importance and feature selection graphs from caret. Here is some data:

require(mlbench)
require(caret)
require(pglm)
require(e1071)
require(pROC)
require(randomForest)

data(Unions) # from the pglm package
Unions <- Unions[c("id", "year", "union", "age", "exper", "married", "ethn",
                   "disability", "rural", "region", "wage", "sector", "occ")]

Unions$union <- as.factor(Unions$union) # for classification as factor
trainIndex   <- createDataPartition(Unions$union, p=.4, list=FALSE, times=1)

UnionsTrain <- Unions[ trainIndex,]
UnionsTest  <- Unions[-trainIndex,]

#### Plotting  Variable Importance ######
control  <- trainControl(method="repeatedcv", number=10, repeats=3)
LVQmodel <- train(union~., data=UnionsTrain, method="lvq", preProcess="scale", 
                  trControl=control)
LVQimportance <- varImp(LVQmodel, scale=FALSE)
plot(LVQimportance)

[Figure: variable importance plot produced by plot(LVQimportance)]

How can I interpret this graph? With this setup, can I say that the feature wage was important in about 6 out of 10 cases for classifying a union member? Perhaps someone could provide a more substantial interpretation.
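A hedged note on reading this plot: as far as I know, caret has no LVQ-specific importance measure, so `varImp` falls back to a model-independent filter that scores each predictor by the area under its own ROC curve. If that assumption holds, an unscaled value near 0.6 for wage would be an AUC (wage alone ranks a randomly chosen member of one class above a randomly chosen member of the other about 60% of the time), not a share of cases. The sketch below computes such a per-predictor AUC in base R on made-up data; `auc_one_predictor`, `x`, and `cls` are hypothetical names, and folding values below 0.5 is my assumption about how the filter behaves.

```r
# Toy data (synthetic, for illustration only): one numeric predictor, two classes.
x   <- c(1, 2, 3, 4, 5, 6, 7, 8)
cls <- factor(c("no", "no", "yes", "no", "yes", "no", "yes", "yes"))

# Rank-based (Mann-Whitney) AUC of a single predictor against a two-class outcome.
auc_one_predictor <- function(x, cls, positive = levels(cls)[2]) {
  r   <- rank(x)                       # mid-ranks, so ties are handled
  n1  <- sum(cls == positive)          # number of positive cases
  n0  <- sum(cls != positive)          # number of negative cases
  auc <- (sum(r[cls == positive]) - n1 * (n1 + 1) / 2) / (n1 * n0)
  max(auc, 1 - auc)                    # assumption: direction of the predictor is ignored
}

toy_auc <- auc_one_predictor(x, cls)
print(toy_auc)
```

A value of 0.5 would mean the predictor alone separates the classes no better than chance, and 1 would mean perfect separation, so the bars are best compared with each other rather than read as percentages of cases.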

Additionally, I have used graphical feature selection:

#### Plotting  Feature Selection ######
# Note: repeats is only used with method="repeatedcv"; with method="cv" it is ignored
rfecontrol <- rfeControl(functions=rfFuncs, method="cv", repeats=5, verbose=FALSE,
                         number=5)
results <- rfe(UnionsTrain[,4:13], as.factor(UnionsTrain[,3]), sizes=c(4:13), 
               rfeControl=rfecontrol, metric="Accuracy")

ggplot(results, type=c("g","o"), metric="Accuracy") + 
  scale_x_continuous(breaks=4:13, labels=names(UnionsTrain)[4:13])

[Figure: accuracy plot produced by the ggplot(results, ...) call]

Do I have to interpret this graph in subsets? Why does the graph depend on the order of my variables rather than on the accuracy of the model?
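A hedged sketch of how the x-axis of an `rfe` plot can be read: to my understanding, the axis is the number of predictors kept in each candidate subset, not a list of individual variables, so relabelling the breaks with column names would make the curve look tied to column order. The example below uses the built-in iris data as a stand-in (an assumption made for brevity, not the Unions data) and the object name `fit` is hypothetical. It inspects the results table and asks for the selected variables by name.

```r
library(caret)         # attaches ggplot2 as a dependency
library(randomForest)  # needed by rfFuncs

set.seed(1)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
fit  <- rfe(x = iris[, 1:4], y = iris$Species, sizes = 1:4, rfeControl = ctrl)

# One row per subset size: "Variables" is a count, not a variable name.
print(fit$results[, c("Variables", "Accuracy")])

# Names of the predictors in the subset that rfe picked as best.
print(predictors(fit))

# Plot accuracy against subset size, keeping the numeric axis labels.
p <- ggplot(fit) + scale_x_continuous(breaks = fit$results$Variables)
print(p)
```

Read this way, each point on the curve is the resampled accuracy of the best subset of that size, and which variables make up a subset has to be looked up separately.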

gung - Reinstate Monica
Mamba
  • Not about programming; suggesting a move to Cross Validated. – Gregor Thomas Jul 08 '15 at 16:14
  • Hmm, it's actually about the functionality of the caret library. – Mamba Jul 08 '15 at 16:15
  • Yes, it's about interpretation of model results, as presented by the caret package. – Gregor Thomas Jul 08 '15 at 16:24
  • Yes, that's true! – Mamba Jul 08 '15 at 16:25
  • The title is misleading since this information is not specific to the `caret` package. It is a standard output of any random forest model, e.g. from the packages `randomForest` or `party`. The details are described in the literature. – RHertel Jul 08 '15 at 16:29
  • A good description of the standard ways to measure the importance of variables in random forest models can be found, e.g., in section 12.4 of [this book](https://books.google.com/books?id=PdbikQEACAAJ). Again, this is a question that concerns random forest models in general and not the `caret` package, which is just one of many ways to obtain such results. – RHertel Jul 08 '15 at 16:41
  • The first question isn't based on random forests. – Mamba Jul 08 '15 at 16:44
  • True, but LVQ (Learning Vector Quantization) is equally a standard method, which many consider outdated since kNN is usually superior. The measure of the importance of the variables is related to the size of the Voronoi cell ascribed to each variable. This is not a question that can be answered in a few lines. I think there is no other way than spending some time reading books. I have some doubts whether SO is a suitable platform to discuss these questions. – RHertel Jul 08 '15 at 17:02
  • This question isn't really about how to code / use R. It is about how to understand the output. IMO, it is a statistical / ML question; it is on topic here, not on SO. – gung - Reinstate Monica Jul 09 '15 at 21:48
