I'm doing a test run of the gradient boosting machine (GBM) algorithm on the iris data with the caret package.
library(caret)
library(gbm)
data(iris)
set.seed(123)
# Stratified 75/25 train/test split on Species
inTraining <- createDataPartition(iris$Species, p = .75, list = FALSE)
training <- iris[inTraining, ]
testing  <- iris[-inTraining, ]
gbmGrid <- expand.grid(interaction.depth = c(1, 2, 3),
                       n.trees = (1:10) * 1000,
                       shrinkage = c(0.001, 0.005, 0.01, 0.05, 0.1),
                       n.minobsinnode = c(1, 2, 5, 10, 15, 20))
fitControl <- trainControl(method = "repeatedcv",
                           number = 10,
                           repeats = 10,
                           classProbs = TRUE,
                           allowParallel = TRUE)
set.seed(234)
gbmFit2 <- train(Species ~ .,
                 data = training,
                 method = "gbm",
                 trControl = fitControl,
                 verbose = FALSE,
                 tuneGrid = gbmGrid)
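As a sanity check on the fit, the winning hyperparameter combination and its resampled accuracy can be inspected; `bestTune` and `results` are standard fields on a caret `train` object (a sketch assuming the fit above ran to completion; I did not include this in my original run):

gbmFit2$bestTune                 # selected interaction.depth / n.trees / shrinkage / n.minobsinnode
max(gbmFit2$results$Accuracy)    # best cross-validated accuracy across the grid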
I'm achieving excellent Accuracy metrics; however, the predicted probabilities for the Species values in the test data are fairly evenly split. I expected GBM to return predicted probabilities of 90%+ for the correctly predicted Species value, rather than probabilities in the 35%-40% range.
predict(gbmFit2, newdata=testing, type="prob")
setosa versicolor virginica
1 0.3826163 0.3086751 0.3087086
2 0.3826643 0.3086374 0.3086983
3 0.3826681 0.3086355 0.3086964
4 0.3811067 0.3114695 0.3074237
5 0.3811067 0.3114695 0.3074237
...
32 0.3077245 0.3568080 0.3354674
33 0.3153934 0.3275473 0.3570593
34 0.3097463 0.3525782 0.3376756
35 0.3065883 0.3151160 0.3782957
36 0.3078244 0.3122151 0.3799605
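For comparison, the hard class labels can be checked against the test labels with caret's confusionMatrix (a sketch using standard caret calls, not part of my original run); this is how the accuracy above was evaluated, even though the underlying probabilities sit near 1/3:

classPreds <- predict(gbmFit2, newdata = testing, type = "raw")
confusionMatrix(classPreds, testing$Species)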
Did I misspecify my model?