
Let us consider the Adam optimizer with the equation given below:

$w_{t + 1} = w_{t} - \frac{\eta \times \bar{m_t}} {\sqrt{\bar{v_t} + \epsilon}}$

Here $w$ denotes the weights (at timesteps $t$ and $t + 1$) and $\eta$ denotes the learning rate. Then,

$\bar{m_t} = \frac{m_t}{1 - \beta^t_1}$

$\bar{v_t} = \frac{v_t}{1 - \beta^t_2}$

And ${m_t}$ and ${v_t}$ are given by:

${m_t} = \beta_1 \times {m_{t -1}} + (1 - \beta_1) \times g_t$

${v_t} = \beta_2 \times {v_{t -1}} + (1 - \beta_2) \times g_t^2$

Here $g_t$ is the gradient at timestep $t$.
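
For concreteness, here is a minimal Python sketch of one update step, transcribed directly from the equations above (the default hyperparameter values are just the commonly cited ones, not something I am asking about):

```python
import numpy as np

def adam_step(w, g, m, v, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step, written from the equations above (t starts at 1)."""
    m = beta1 * m + (1 - beta1) * g        # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g**2     # second-moment estimate v_t
    m_bar = m / (1 - beta1**t)             # bias-corrected m_t
    v_bar = v / (1 - beta2**t)             # bias-corrected v_t
    w = w - eta * m_bar / np.sqrt(v_bar + eps)
    return w, m, v
```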

I don't clearly understand why $\bar{m_t}$ and $\bar{v_t}$ are calculated. What I have heard is that this is done to remove the bias towards zero that exists initially during training, i.e. $m_t$ and $v_t$ are close to zero (or very small) at the start. Since $\beta_1$ and $\beta_2$ are close to one, dividing by $1 - \beta^t_1$ and $1 - \beta^t_2$ produces a larger value. But why is this needed? Shouldn't these values get larger naturally over time? Can someone please explain what this means?
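
To make the part about the initial bias concrete, here is a tiny numerical sketch (the constant gradient of 1.0 is only an assumed toy value): starting from $m_0 = 0$, the raw $m_t$ stays well below the gradient for several steps, while the corrected $\bar{m_t}$ does not.

```python
beta1 = 0.9
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * 1.0   # constant gradient g_t = 1.0
    m_bar = m / (1 - beta1**t)          # bias-corrected estimate
    print(t, round(m, 4), round(m_bar, 4))
# t=1: m=0.1,    m_bar=1.0
# t=5: m=0.4095, m_bar=1.0
```

So the correction rescales the early estimates; what I don't see is why this rescaling is necessary rather than simply letting $m_t$ and $v_t$ grow on their own.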

    Didn't you mean $\bar{m_t} = \frac{m_t}{1 - \beta^t_1}$ and $\bar{v_t} = \frac{v_t}{1 - \beta^t_2}$? (That's what the original Adam paper says - [Adam: A Method for Stochastic Optimization](https://arxiv.org/abs/1412.6980).) – Oren Milman Sep 24 '18 at 08:05

0 Answers