7

I can't figure out what learning_rate stands for in the sklearn implementation of AdaBoost. When I look at the original algorithm I don't see any "learning_rate"...

Meanwhile, I can see from https://fr.wikipedia.org/wiki/AdaBoost that the training error is computed with instance weights $D_t(i)$ (where $i$ indexes the $i$th training instance in the training matrix $X$). Is there any relation between the sklearn "learning_rate" and this $D_t$?

Alexis
curious
    Possible duplicate of [Shrinkage parameter in Adaboost?](https://stats.stackexchange.com/questions/82323/shrinkage-parameter-in-adaboost) – Xavier Bourret Sicotte Jul 11 '18 at 15:29
  • Have a look at my answer here: https://stats.stackexchange.com/questions/82323/shrinkage-parameter-in-adaboost/355632#355632 – Xavier Bourret Sicotte Jul 11 '18 at 15:29
  • @XavierBourretSicotte It is worth noting that although the learning-rate parameter is a natural extension, it is not mentioned in the papers referenced by [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.AdaBoostClassifier.html), which present the classic version and SAMME. In [Experiments with a New Boosting Algorithm by Freund and Schapire](http://www.cis.upenn.edu/~mkearns/teaching/COLT/boostingexperiments.pdf), M1 and M2 are presented (again), both without a learning rate. I wonder what the official reference for the learning rate / shrinkage parameter is. – mlwida Jul 18 '18 at 15:10
  • Fair point - I don't know where it comes from - It doesn't seem to be here either https://web.stanford.edu/~hastie/Papers/samme.pdf – Xavier Bourret Sicotte Jul 18 '18 at 15:20

1 Answer

4

The official documentation states that "The learning rate shrinks the contribution of each regressor by learning_rate." Thus, we basically need to understand three concepts:

1. Weak Classifier

A model that performs only slightly better than random guessing, i.e., its accuracy is only slightly above 50% on a binary problem.
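For illustration, a decision stump (a depth-1 tree) is the classic weak learner used with AdaBoost. A minimal sketch (the dataset and split are arbitrary choices for the example):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# A toy binary classification problem (purely illustrative).
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A decision stump: a tree of depth 1, the classic weak learner.
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
print(stump.score(X_test, y_test))  # typically only a bit above 0.5
```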

2. Boosting

This technique applies a model $K$ times (sequentially) to modified versions of the data. So, suppose that at each iteration $i \in \{1, 2, \dots, K\}$ you extend the current ensemble $T_{i}$ with a new model $M$:

\begin{align} T_{i+1}(x) = T_{i}(x) + \alpha M(x), \end{align}

where $$M(x) = \sum_{j=1}^{J} t(x, \theta_{j})$$ is a sum of trees with different parameters $\theta_{j}$, and $\alpha$ is the learning rate, a value between 0 and 1.
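As a hand-rolled sketch of that additive update, here each new tree is fit to the current residuals, in the spirit of gradient boosting rather than sklearn's exact SAMME update; the data, tree depth, and values of $\alpha$ and $K$ are arbitrary choices for the example:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Illustrative only: a simple additive ensemble built with shrinkage.
rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=200)

alpha, K = 0.1, 50             # learning rate and number of boosting rounds
prediction = np.zeros_like(y)  # T_0(x) = 0

for _ in range(K):
    residual = y - prediction                       # what the ensemble still gets wrong
    M = DecisionTreeRegressor(max_depth=2).fit(X, residual)
    prediction += alpha * M.predict(X)              # T_{i+1}(x) = T_i(x) + alpha * M(x)
```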

3. Learning Rate

This parameter controls how much the new model contributes to the existing ensemble. There is normally a trade-off between the number of iterations $K$ and the value of $\alpha$: with smaller values of $\alpha$ ($\alpha \approx 0$) you should allow more iterations $K$, so that the ensemble of weak classifiers keeps improving. According to Jerome Friedman, it is advisable to set $\alpha$ to small values ($\alpha < 0.1$).
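To see this trade-off in sklearn itself, a minimal sketch comparing a large learning rate with few estimators against a small learning rate with many estimators (the dataset and the exact values of learning_rate and n_estimators are arbitrary choices for the example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Few iterations with a large learning rate ...
fast = AdaBoostClassifier(n_estimators=50, learning_rate=1.0, random_state=0)
# ... versus many iterations with a small learning rate.
slow = AdaBoostClassifier(n_estimators=500, learning_rate=0.1, random_state=0)

for name, model in [("lr=1.0, K=50", fast), ("lr=0.1, K=500", slow)]:
    model.fit(X_train, y_train)
    print(name, model.score(X_test, y_test))
```

With the smaller learning rate each stump's contribution is shrunk, so more iterations are needed before the test score plateaus, which is exactly the trade-off described above.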

Miguel Trejo