
I have a set of 380 observations with 30 predictor variables and one response variable. I have determined using other methods (bagging and random forest) that there is a set of 7 variables that are most important for prediction. I'm using the boosting function from the adabag package, and have set a constant random number seed. However, when I run the same command again, I get a different prediction accuracy.

Here is some sample code that reproduces the behaviour:

library(MASS)
library(adabag)
set.seed(42)

data("birthwt")                          # example data
birthwt$race  <- as.factor(birthwt$race)
birthwt$smoke <- as.factor(birthwt$smoke)
birthwt$low   <- as.factor(birthwt$low)

# 50/25/25 train/validation/test split
N <- nrow(birthwt)
train <- sample(1:N, size = 0.5*N, replace = FALSE)
train <- sort(train)
valid <- sample(setdiff(1:N, train), 0.25*N, replace = FALSE)
valid <- sort(valid)
test  <- setdiff(1:N, union(train, valid))

# First fit and its validation accuracy
fit.b1 <- boosting(low ~ . - bwt, data = birthwt[train, ])
pred.b1 <- predict(fit.b1, newdata = birthwt[valid, ])
tab.b1 <- pred.b1$confusion
acc.b1 <- sum(diag(tab.b1)) / sum(tab.b1)
acc.b1
importanceplot(fit.b1)

# Identical call, yet the accuracy differs
fit.b2 <- boosting(low ~ . - bwt, data = birthwt[train, ])
pred.b2 <- predict(fit.b2, newdata = birthwt[valid, ])
tab.b2 <- pred.b2$confusion
acc.b2 <- sum(diag(tab.b2)) / sum(tab.b2)
acc.b2
importanceplot(fit.b2)

Notably, the accuracy varies by several percentage points on each run of these commands, and does so seemingly at random. I'm uncertain whether this is a "problem" as such or intended behaviour. Why does this happen?

Daire

1 Answer

There is nothing wrong with your code or with the code in adabag, but I think there is a slight misunderstanding of how the random seed works.

(Disclaimer: I will offer a very simplified description of how a PRNG works for the purposes of this answer; random number generation is a very serious business.)

The random seed effectively sets the pseudo-random number generator (PRNG) to a particular state, and from that point onwards the PRNG "progresses" in a deterministic way. For example, if we set the random seed to 12 and then sample four random numbers from $U(0,1)$ using runif, we get 0.06936092, 0.81777520, 0.94262173, 0.26938188. All in all, the state of our PRNG has progressed four times.

set.seed(12)
runif(4)
# [1] 0.06936092 0.81777520 0.94262173 0.26938188

Now, suppose that we set the random seed to 12, sample two random numbers, do some operations that do not involve the PRNG, and then sample another two random numbers. In this case, we first progressed the state twice, did some unrelated work, and then progressed the state another two times from where we left it.

set.seed(12)
runif(2)
# [1] 0.06936092 0.81777520
3 + 4 * 12 / 55   # unrelated computation; does not touch the PRNG state
# [1] 3.872727
runif(2)
# [1] 0.94262173 0.26938188

Notice that the second call to runif generated the same two numbers we got in the last two entries of our original call to runif in the snippet above. That is because we progressed the state from where the first two progressions left it.
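
To check this directly (base R only; nothing here is specific to adabag), we can compare the split sampling against the original one:

set.seed(12)
u_all <- runif(4)                 # the four numbers drawn in one go

set.seed(12)
u_first  <- runif(2)              # first two progressions of the state
3 + 4 * 12 / 55                   # unrelated computation; PRNG state untouched
u_second <- runif(2)              # next two progressions

identical(u_all, c(u_first, u_second))
# [1] TRUE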

Coming now to your actual question: adabag::boosting indeed randomly selects a bootstrap sample from the training set. The point is that the second time the call to adabag::boosting is made, the state of the PRNG has progressed and has not been reset to the state used by the first call of the function. Indeed, if we set the seed right before each call to the algorithm, we get exactly the same results.

set.seed(34)
fit.b3 <- boosting(low~.-bwt,data=birthwt[train,])
pred.b3 <- predict(fit.b3,newdata=birthwt[valid,]) 

set.seed(34)
fit.b4 <- boosting(low~.-bwt,data=birthwt[train,])
pred.b4 <- predict(fit.b4,newdata=birthwt[valid,])

all.equal(fit.b3, fit.b4)
# [1] TRUE
all.equal(pred.b3, pred.b4)
# [1] TRUE

So to recap, the observed phenomenon happens because we are using different states of the PRNG. In that sense, what is observed is completely normal. How can we stop it? We do not stop it; rather, we live with it, i.e. we quantify it. Variation due to sampling is a reality of life. ☺ I would suggest looking into bootstrapping (or repeated cross-validation). These resampling techniques allow us to get a sense of the variability in a classifier's performance. CV.SE has some excellent threads on the matter that I would urge you to read carefully: "Cross-validation or bootstrapping to evaluate classification performance?" and "Variance estimates in k-fold cross-validation".
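
As a minimal sketch of that idea (reusing the train and valid indices from your question; the number of refits is arbitrary), we can refit the booster several times and summarise the spread of the validation accuracy instead of reporting a single number:

n_rep <- 25                                   # arbitrary number of refits
acc   <- numeric(n_rep)
for (i in seq_len(n_rep)) {
  fit    <- boosting(low ~ . - bwt, data = birthwt[train, ])
  pred   <- predict(fit, newdata = birthwt[valid, ])
  tab    <- pred$confusion
  acc[i] <- sum(diag(tab)) / sum(tab)         # validation accuracy of refit i
}
summary(acc)                                  # typical value and range
sd(acc)                                       # rough measure of the variability

A fuller assessment would also resample the train/validation split itself, as discussed in the threads linked above.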

A final note: I see that accuracy is used to measure the classifier's performance. It is often not a great choice; see the thread "Why is accuracy not the best measure for assessing classification models?" for more details.
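
For illustration, a proper scoring rule such as the Brier score can be computed from the class probabilities in predict.boosting's prob component (assuming here that the columns of prob follow the factor levels of low, so that the second column holds the predicted probability of class "1"):

obs01 <- as.numeric(birthwt$low[valid] == "1")   # observed class coded as 0/1
p1    <- pred.b1$prob[, 2]                       # assumed: predicted P(low = "1")
mean((p1 - obs01)^2)                             # Brier score; lower is better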

usεr11852

  • Interesting! Do you have any thoughts on why changing the `boos` parameter value in my answer, while keeping only one call to `set.seed()` in the code, still produced identical results? – AlexK Apr 27 '19 at 20:32
  • Because `boos = FALSE` does not do resampling any more, i.e. there are no additional calls to the PRNG. It should be noted that `boos = FALSE` changes the default behaviour of `adabag::boosting`. In general, resampling is *good* when learning; all newer boosting algorithms strongly benefit from it. – usεr11852 Apr 27 '19 at 20:44
  • Thanks, I'll just delete my answer. It was not well-informed. – AlexK Apr 27 '19 at 20:47
  • I see! Thank you very much, this was a useful answer. – Daire Apr 28 '19 at 10:40
  • Cool, I am glad I could help! – usεr11852 Apr 28 '19 at 10:45