I am new to machine learning and R.
I know that there is an R package called caretEnsemble, which could conveniently stack the models in R. However, this package looks has some problems when deals with multi-classes classification tasks.
Temporarily, I wrote some codes to try to stack the models manually and here is the example I worked on:
library(caret)
set.seed(123)
library(AppliedPredictiveModeling)
data(AlzheimerDisease)
adData = data.frame(diagnosis, predictors)
inTrain = createDataPartition(adData$diagnosis, p = 3 / 4)[[1]]
training = adData[inTrain,]
testing = adData[-inTrain,]
set.seed(62433)
modelFitRF <- train(diagnosis ~ ., data = training, method = "rf")
modelFitGBM <- train(diagnosis ~ ., data = training, method = "gbm",verbose=F)
modelFitLDA <- train(diagnosis ~ ., data = training, method = "lda")
predRF <- predict(modelFitRF,newdata=testing)
predGBM <- predict(modelFitGBM, newdata = testing)
prefLDA <- predict(modelFitLDA, newdata = testing)
confusionMatrix(predRF, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
confusionMatrix(predGBM, testing$diagnosis)$overall[1]
#Accuracy
#0.7926829
confusionMatrix(prefLDA, testing$diagnosis)$overall[1]
#Accuracy
#0.7682927
Now I've got three models: modelFitRF
, modelFitGBM
and modelFitLDA
, and three predicted vectors corresponding to such three models based on the test set
.
Then I will create a data frame to contain these predicted vectors and the original dependent variable in the test set
:
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
And then, I just used such data frame as a new train set
to create a stacked model:
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]
#Accuracy
#0.804878
Considering that stacking models usually should improve the accuracy of the predictions, I'de like to believe this might be a right to stack the models. However, I also doubt that here I used the predDF
which is created by the predictions from three models with the test set
.
I am not sure whether I should use the results from the test set
and then apply them back to the test set
to get final predictions?
(I am referring to this block below:)
predDF <- data.frame(predRF, predGBM, prefLDA, diagnosis = testing$diagnosis, stringsAsFactors = F)
modelStack <- train(diagnosis ~ ., data = predDF, method = "rf")
combPred <- predict(modelStack, predDF)
confusionMatrix(combPred, testing$diagnosis)$overall[1]