9

The documentation states that R's gbm with distribution = "adaboost" can be used for a 0-1 classification problem. Consider the following code fragment:

gbm_algorithm <- gbm(y ~ ., data = train_dataset, distribution = "adaboost", n.trees = 5000)
gbm_predicted <- predict(gbm_algorithm, test_dataset, n.trees = 5000)

The documentation for predict.gbm states that it

Returns a vector of predictions. By default the predictions are on the scale of f(x).

However, the particular scale is not clear for the case of distribution = "adaboost".

Could anyone help with the interpretation of predict.gbm return values and provide an idea of conversion to the 0-1 output?

Alexey Lakhno
  • This question appears to be *only* about how to interpret R output, & not about the related statistical issues (although that doesn't make it a bad Q). As such it is better asked, & probably answered, on [Stack Overflow](http://stackoverflow.com/), rather than here. *Please don't cross-post* (SE strongly discourages this), if you want your Q migrated faster, please flag it for moderator attention. – gung - Reinstate Monica Sep 18 '12 at 15:57
  • @gung seems like a legitimate statistical question to me. The GBM package supplies the Deviance used for adaboost but it is not clear to me either what f(x) is and how to back transform to a probability scale (perhaps one has to use Platt scaling). http://cran.r-project.org/web/packages/gbm/vignettes/gbm.pdf – B_Miner Sep 18 '12 at 16:26

3 Answers

12

You can also obtain the probabilities directly from the predict.gbm function:

predict(gbm_algorithm, test_dataset, n.trees = 5000, type = 'response')
Edwin
11

With distribution = "adaboost", the predictions f(x) are on half the logit (log-odds) scale. You can convert them to the 0-1 probability scale:

gbm_predicted<-plogis(2*gbm_predicted)

Note the 2* inside plogis: for adaboost, f(x) is half the log-odds, so the log-odds are 2*f(x).
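As a sanity check of the conversion above, here is a minimal base-R sketch (no gbm fit needed; the probabilities are made-up example values): if f(x) is half the log-odds of p, then plogis(2*f) recovers p exactly.

```r
# Example probabilities (hypothetical values, for illustration only)
p <- c(0.1, 0.25, 0.5, 0.9)

# Half the log-odds -- the scale adaboost's f(x) is claimed to be on
f <- 0.5 * log(p / (1 - p))

# Convert back: plogis(2*f) = 1 / (1 + exp(-2*f))
p.back <- plogis(2 * f)

stopifnot(all.equal(p.back, p))
```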

razgon
3

The adaboost link function is described here. This example provides a detailed description of the computation:

library(gbm);
set.seed(123);
n          <- 1000;
sim.df     <- data.frame(x.1 = sample(0:1, n, replace = TRUE), 
                         x.2 = sample(0:1, n, replace = TRUE));
prob.array <- c(0.9, 0.7, 0.2, 0.8);
sim.df$y   <- rbinom(n, size = 1, prob = prob.array[1 + sim.df$x.1 + 2*sim.df$x.2]);
n.trees    <- 10;
shrinkage  <- 0.01;

gbmFit <- gbm(
  formula           = y~.,
  distribution      = "bernoulli",
  data              = sim.df,
  n.trees           = n.trees,
  interaction.depth = 2,
  n.minobsinnode    = 2,
  shrinkage         = shrinkage,
  bag.fraction      = 0.5,
  cv.folds          = 0,
  # verbose         = FALSE
  n.cores           = 1
);

sim.df$logods  <- predict(gbmFit, sim.df, n.trees = n.trees);
sim.df$prob    <- predict(gbmFit, sim.df, n.trees = n.trees, type = 'response');
sim.df$prob.2  <- plogis(predict(gbmFit, sim.df, n.trees = n.trees));
sim.df$logloss <- sim.df$y*log(sim.df$prob) + (1-sim.df$y)*log(1-sim.df$prob);


gbmFit <- gbm(
  formula           = y~.,
  distribution      = "adaboost",
  data              = sim.df,
  n.trees           = n.trees,
  interaction.depth = 2,
  n.minobsinnode    = 2,
  shrinkage         = shrinkage,
  bag.fraction      = 0.5,
  cv.folds          = 0,
  # verbose         = FALSE
  n.cores           = 1
);

sim.df$exp.scale  <- predict(gbmFit, sim.df, n.trees = n.trees);
sim.df$ada.resp   <- predict(gbmFit, sim.df, n.trees = n.trees, type = 'response');
sim.df$ada.resp.2 <- plogis(2*predict(gbmFit, sim.df, n.trees = n.trees));
sim.df$ada.error  <- -exp(-(2*sim.df$y - 1) * sim.df$exp.scale);  # adaboost loss uses y recoded to {-1, 1}

sim.df[1:20,]
gung - Reinstate Monica