I will justify the whole algorithm from the statistical point of view, i.e., risk minimization; I will get to the weights later.
First, just for the record, let me outline the steps of the AdaBoost algorithm.
Let $f_1, f_2,..., f_C$ be classifiers and $Y\in \{-1,1\}$.
- Initialize observations with weights $w_i = \frac{1}{N}, \ i=1,2,...,N$.
- For $c=1$ to $C$:
i) Fit a classifier $f_c$ to the training data using the weights $w_i$.
ii) Compute $$err_c = \sum_{i=1}^{N} w_i\mathbb{1}(Y_i \neq f_c(X_i))$$ and
$$\gamma_c = \log\left(\frac{1-err_c}{err_c}\right). $$
iii) Set $w_i := w_i \exp(\gamma_c \mathbb{1}(Y_i \neq f_c(X_i)))$ and normalize weights
$w_i := \frac{w_i}{\sum_{j=1}^{N} w_j}.$
- Output $f(x) = \mathrm{sign}\left(\sum_{c=1}^{C} \gamma_c f_c(x)\right).$
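The steps above can be sketched in Python. Here I use one-level decision stumps as the base classifiers $f_c$; the stump learner, the toy data, and the number of rounds are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively find the threshold stump minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    err, j, thr, sgn = best
    return err, lambda Z: sgn * np.where(Z[:, j] <= thr, 1, -1)

def adaboost(X, y, C=5):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # w_i = 1/N
    stumps, gammas = [], []
    for _ in range(C):                           # for c = 1 to C
        err, stump = fit_stump(X, y, w)          # i) fit f_c using weights w_i
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard the log below
        gamma = np.log((1 - err) / err)          # ii) gamma_c
        w = w * np.exp(gamma * (stump(X) != y))  # iii) reweight ...
        w = w / w.sum()                          # ... and normalize
        stumps.append(stump)
        gammas.append(gamma)
    return lambda Z: np.sign(sum(g * s(Z) for g, s in zip(gammas, stumps)))

# toy one-dimensional data, separable by a single threshold
X = np.array([[-3.0], [-2.0], [-1.0], [0.5], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])
f = adaboost(X, y, C=3)
print((f(X) == y).all())
```

The clamp on $err$ is only there to keep the logarithm finite when a stump happens to classify the weighted sample perfectly.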
Why does AdaBoost work?
In general, we would like to find a risk-minimizing model $f$, i.e., a model that minimizes the expected loss over the joint distribution $\mathbb{P}(X,Y)$:
$$R(f) = \mathbb{E}_{X,Y}L(Y,f(X)),$$
where $L$ is a loss function.
AdaBoost minimizes the empirical risk for the exponential loss $L(y, f(x)) = e^{-yf(x)}$ over the following class of functions:
$$f^{(m)}(x) = \sum_{c=1}^m \gamma_cf_c(x),$$
where $f_1,f_2,...,f_m$ are classifiers.
One important observation is that $f^{(m)}(x) = f^{(m-1)}(x) + \gamma_m f_m(x)$, so we can optimize the parameters $\gamma_c$ sequentially rather than globally; global optimization would require computationally intensive numerical techniques.
The theoretical risk equals
$$\mathbb{E}_{X,Y}L(Y,f(X)) = \mathbb{E}_X\mathbb{E}_{Y|X} L(Y,f(X)),$$
so it is sufficient to minimize the conditional expected value with respect to $f$. Writing $p(X) = \mathbb{P}(Y=1|X)$, the conditional risk is $p(X)e^{-f(X)} + (1-p(X))e^{f(X)}$; setting its derivative with respect to $f(X)$ to zero gives
$$\underset{f(X)}{\mathrm{argmin}}\ \mathbb{E}_{Y|X}e^{-Yf(X)} = \frac{1}{2}\log \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=-1|X)},$$
which is a monotone function of the Bayes scoring function; hence minimizing the risk for the exponential loss leads to the Bayes classifier. That is why AdaBoost makes sense.

We can now relate the parameters from the algorithm to the parameters obtained by this optimization. Note that
$$\underset{\beta}{\mathrm{argmin}} \sum_{i=1}^{N} L (y_i, f^{(m-1)}(x_i) + \beta f_m(x_i)) = \underset{\beta}{\mathrm{argmin}} \sum_i w_i^{(m)}\exp(-\beta y_i f_m(x_i)) =
\underset{\beta}{\mathrm{argmin}} \left(e^{-\beta} \sum_{y_i = f_m(x_i)}w_i^{(m)} + e^\beta \sum_{y_i \neq f_m(x_i)}w_i^{(m)}\right),$$
where $w_i^{(m)} := \exp(-y_i f^{(m-1)}(x_i)).$ Differentiating with respect to $\beta$ and setting the derivative to zero, we get
$$\beta_m = \frac{1}{2} \log\left(\frac{\sum_{y_i = f_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq f_m(x_i)}w_i^{(m)}}\right) = \frac{1}{2}\log \left(\frac{1- err_m}{err_m}\right).$$
As you can see, $\beta_m$ equals half of the $\gamma_m$ from the algorithm, i.e. $\gamma_m = 2\beta_m$.
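As a quick numerical sanity check of this closed form, we can minimize $e^{-\beta}A + e^{\beta}B$ on a grid, where $A$ and $B$ stand for the weight sums over correctly and incorrectly classified points (the values below are arbitrary illustrative numbers), and compare with $\beta = \frac{1}{2}\log(A/B)$:

```python
import numpy as np

# A = sum of weights on correctly classified points, B = on misclassified ones.
# Minimize exp(-beta)*A + exp(beta)*B over a fine grid and compare with the
# closed-form minimizer beta = (1/2) * log(A / B).
A, B = 0.8, 0.2
betas = np.linspace(-5, 5, 200001)
obj = np.exp(-betas) * A + np.exp(betas) * B
beta_grid = betas[np.argmin(obj)]
beta_closed = 0.5 * np.log(A / B)
print(beta_grid, beta_closed)   # both approximately 0.693
```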
Now let's have a look at how the weights are updated. As defined above,
$$w_i^{(m+1)} := \exp\left(-y_if^{(m)}(x_i)\right) = \exp\left(-y_i\left(f^{(m-1)}(x_i) + \beta_m f_m(x_i)\right)\right) \\= w_i^{(m)}\exp\left(-\beta_m y_i f_m(x_i)\right).$$
Note that $-y_i f_m(x_i) = 2\cdot \mathbb{1}(y_i \neq f_m(x_i)) - 1,$
so we get
$$w_i^{(m+1)} = w_i^{(m)}e^{2\beta_m \mathbb{1}(y_i \neq f_m(x_i))}e^{-\beta_m}.$$
The factor $e^{-\beta_m}$ multiplies all weights by the same value, so after weight normalization it will cancel out. Finally, we obtain the result
$$w_i^{(m+1)} = w_i^{(m)} \exp \left(\gamma_m \mathbb{1}(y_i \neq f_m(x_i))\right),$$
where $\gamma_m = \log \left(\frac{1-err_m}{err_m}\right) = 2\beta_m,$ which is exactly the weight update used in the algorithm.
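We can also verify numerically that the algorithm's update with $\gamma_m = 2\beta_m$ and the derived update $w_i \exp(-\beta_m y_i f_m(x_i))$ agree after normalization; the weights, labels, predictions, and the value of $\beta_m$ below are arbitrary toy values:

```python
import numpy as np

# Compare the two update rules on random toy values: the factor exp(-beta_m)
# is constant across observations, so it cancels after normalization.
rng = np.random.default_rng(0)
n = 8
w = rng.random(n)
w /= w.sum()
y = rng.choice([-1, 1], n)
pred = rng.choice([-1, 1], n)          # stands in for f_m(x_i)
beta = 0.7                             # an arbitrary beta_m
gamma = 2 * beta

w_alg = w * np.exp(gamma * (y != pred))   # update from the algorithm
w_alg /= w_alg.sum()
w_der = w * np.exp(-beta * y * pred)      # update from the derivation
w_der /= w_der.sum()
print(np.allclose(w_alg, w_der))          # -> True
```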