I will justify the whole algorithm from the statistical point of view, i.e., risk minimization; I will get to the weights later.
First, just for the record, let me outline the steps of the AdaBoost algorithm.
Let $f_1, f_2,..., f_C$ be classifiers and $Y\in \{-1,1\}$.
- Initialize observations with weights $w_i = \frac{1}{N}, \ i=1,2,...,N$.
- For $c=1$ to $C$:
i) Fit a classifier $f_c$ to the training data using the weights $w_i$.
ii) Compute $$err_c = \sum_{i=1}^{N} w_i\mathbb{1}(Y_i \neq f_c(X_i))$$ and
$$\gamma_c = \log\left(\frac{1-err_c}{err_c}\right). $$
iii) Set $w_i := w_i \exp(\gamma_c \mathbb{1}(Y_i \neq f_c(X_i)))$ and normalize weights
$w_i := \frac{w_i}{\sum_{j=1}^{N} w_j}.$
- Output $f(x) = \mathrm{sign}\left(\sum_{c=1}^{C} \gamma_c f_c(x)\right).$
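The steps above can be sketched in Python. Here I use one-level decision stumps as the base classifiers $f_c$; the stump learner, the toy data, and the number of rounds are illustrative assumptions, not part of the algorithm itself:

```python
import numpy as np

def fit_stump(X, y, w):
    """Exhaustively find the threshold stump minimizing the weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sgn in (1, -1):
                pred = sgn * np.where(X[:, j] <= thr, 1, -1)
                err = np.sum(w * (pred != y))
                if best is None or err < best[0]:
                    best = (err, j, thr, sgn)
    err, j, thr, sgn = best
    return err, lambda Z: sgn * np.where(Z[:, j] <= thr, 1, -1)

def adaboost(X, y, C=5):
    n = len(y)
    w = np.full(n, 1.0 / n)                      # w_i = 1/N
    stumps, gammas = [], []
    for _ in range(C):                           # for c = 1 to C
        err, stump = fit_stump(X, y, w)          # i) fit f_c using weights w_i
        err = min(max(err, 1e-10), 1 - 1e-10)    # guard the log below
        gamma = np.log((1 - err) / err)          # ii) gamma_c
        w = w * np.exp(gamma * (stump(X) != y))  # iii) reweight ...
        w = w / w.sum()                          # ... and normalize
        stumps.append(stump)
        gammas.append(gamma)
    return lambda Z: np.sign(sum(g * s(Z) for g, s in zip(gammas, stumps)))

# toy one-dimensional data, separable by a single threshold
X = np.array([[-3.0], [-2.0], [-1.0], [0.5], [1.0], [2.0], [3.0]])
y = np.array([-1, -1, -1, 1, 1, 1, 1])
f = adaboost(X, y, C=3)
print((f(X) == y).all())
```

The clamp on $err$ is only there to keep the logarithm finite when a stump happens to classify the weighted sample perfectly.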
Why does AdaBoost work?
In general, we would like to find a risk-minimizing model $f$, i.e., a model that minimizes the expected loss over the joint distribution $\mathbb{P}(X,Y)$:
$$R(f) = \mathbb{E}_{X,Y}L(Y,f(X)),$$
where $L$ is a loss function.
AdaBoost minimizes the empirical risk for the exponential loss $L(y, f(x)) = e^{-yf(x)}$ over the following class of functions:
$$f^{(m)}(x) = \sum_{c=1}^m \gamma_cf_c(x),$$
where $f_1,f_2,...,f_m$ are classifiers.
One important observation is that $f^{(m)}(x) = f^{(m-1)}(x) + \gamma_m f_m(x)$, so we can optimize the parameters $\gamma_c$ sequentially rather than globally; global optimization would require computationally intensive numerical techniques.
The theoretical risk equals
$$\mathbb{E}_{X,Y}L(Y,f(X)) = \mathbb{E}_X\mathbb{E}_{Y|X} L(Y,f(X)),$$
so it is sufficient to minimize the conditional expected value with respect to $f$. Writing $p(X) = \mathbb{P}(Y=1|X)$, the conditional risk is $p(X)e^{-f(X)} + (1-p(X))e^{f(X)}$; setting its derivative with respect to $f(X)$ to zero gives
$$\underset{f(X)}{\mathrm{argmin}}\ \mathbb{E}_{Y|X}e^{-Yf(X)} = \frac{1}{2}\log \frac{\mathbb{P}(Y=1|X)}{\mathbb{P}(Y=-1|X)},$$
which is a monotone function of the Bayes scoring function; hence minimizing the risk for the exponential loss leads to the Bayes classifier. That is why AdaBoost makes sense.

We can now relate the parameters from the algorithm to the parameters obtained by this optimization. Note that
$$\underset{\beta}{\mathrm{argmin}} \sum_{i=1}^{N} L (y_i, f^{(m-1)}(x_i) + \beta f_m(x_i)) = \underset{\beta}{\mathrm{argmin}} \sum_i w_i^{(m)}\exp(-\beta y_i f_m(x_i)) =
\underset{\beta}{\mathrm{argmin}} \left(e^{-\beta} \sum_{y_i = f_m(x_i)}w_i^{(m)} + e^\beta \sum_{y_i \neq f_m(x_i)}w_i^{(m)}\right),$$
where $w_i^{(m)} := \exp(-y_i f^{(m-1)}(x_i)).$ Differentiating with respect to $\beta$ and setting the derivative to zero, we get
$$\beta_m = \frac{1}{2} \log\left(\frac{\sum_{y_i = f_m(x_i)} w_i^{(m)}}{\sum_{y_i \neq f_m(x_i)}w_i^{(m)}}\right) = \frac{1}{2}\log \left(\frac{1- err_m}{err_m}\right).$$
As you can see, $\beta_m$ equals half of the $\gamma_m$ from the algorithm, i.e. $\gamma_m = 2\beta_m$.
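As a quick numerical sanity check of this closed form, we can minimize $e^{-\beta}A + e^{\beta}B$ on a grid, where $A$ and $B$ stand for the weight sums over correctly and incorrectly classified points (the values below are arbitrary illustrative numbers), and compare with $\beta = \frac{1}{2}\log(A/B)$:

```python
import numpy as np

# A = sum of weights on correctly classified points, B = on misclassified ones.
# Minimize exp(-beta)*A + exp(beta)*B over a fine grid and compare with the
# closed-form minimizer beta = (1/2) * log(A / B).
A, B = 0.8, 0.2
betas = np.linspace(-5, 5, 200001)
obj = np.exp(-betas) * A + np.exp(betas) * B
beta_grid = betas[np.argmin(obj)]
beta_closed = 0.5 * np.log(A / B)
print(beta_grid, beta_closed)   # both approximately 0.693
```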
Now let's have a look at how the weights are updated. As defined above,
$$w_i^{(m+1)} := \exp\left(-y_if^{(m)}(x_i)\right) = \exp\left(-y_i\left(f^{(m-1)}(x_i) + \beta_m f_m(x_i)\right)\right) \\= w_i^{(m)}\exp\left(-\beta_m y_i f_m(x_i)\right).$$
Note that $-y_i f_m(x_i) = 2\cdot \mathbb{1}(y_i \neq f_m(x_i)) - 1,$
so we get
$$w_i^{(m+1)} = w_i^{(m)}e^{2\beta_m \mathbb{1}(y_i \neq f_m(x_i))}e^{-\beta_m}.$$
The factor $e^{-\beta_m}$ multiplies all weights by the same value, so after weight normalization it will cancel out. Finally, we obtain the result
$$w_i^{(m+1)} = w_i^{(m)} \exp \left(\gamma_m \mathbb{1}(y_i \neq f_m(x_i))\right),$$
where $\gamma_m = \log \left(\frac{1-err_m}{err_m}\right) = 2\beta_m,$ which is exactly the weight update used in the algorithm.
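We can also verify numerically that the algorithm's update with $\gamma_m = 2\beta_m$ and the derived update $w_i \exp(-\beta_m y_i f_m(x_i))$ agree after normalization; the weights, labels, predictions, and the value of $\beta_m$ below are arbitrary toy values:

```python
import numpy as np

# Compare the two update rules on random toy values: the factor exp(-beta_m)
# is constant across observations, so it cancels after normalization.
rng = np.random.default_rng(0)
n = 8
w = rng.random(n)
w /= w.sum()
y = rng.choice([-1, 1], n)
pred = rng.choice([-1, 1], n)          # stands in for f_m(x_i)
beta = 0.7                             # an arbitrary beta_m
gamma = 2 * beta

w_alg = w * np.exp(gamma * (y != pred))   # update from the algorithm
w_alg /= w_alg.sum()
w_der = w * np.exp(-beta * y * pred)      # update from the derivation
w_der /= w_der.sum()
print(np.allclose(w_alg, w_der))          # -> True
```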