Let us consider the Adam optimizer, whose update equation is given below:
$w_{t + 1} = w_{t} - \frac{\eta \times \bar{m}_t}{\sqrt{\bar{v}_t} + \epsilon}$
Here $w$ denotes the weights (at timesteps $t$ and $t + 1$) and $\eta$ denotes the learning rate. Then,
$\bar{m}_t = \frac{m_t}{1 - \beta_1^t}$
$\bar{v}_t = \frac{v_t}{1 - \beta_2^t}$
And $m_t$ and $v_t$ (the first- and second-moment estimates) are given by:
${m_t} = \beta_1 \times {m_{t -1}} + (1 - \beta_1) \times g_t$
${v_t} = \beta_2 \times {v_{t -1}} + (1 - \beta_2) \times g_t^2$
Here $g_t$ is the gradient at timestep $t$.
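To make sure I am reading the formulas correctly, here is a minimal NumPy sketch of a single update step (the function name, signature, and defaults are just my illustration; $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-8}$ are the commonly cited defaults):

```python
import numpy as np

def adam_step(w, m, v, g, t, eta=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # m and v are initialized to zero before the first step; t starts at 1.
    m = beta1 * m + (1 - beta1) * g        # m_t = beta1 * m_{t-1} + (1 - beta1) * g_t
    v = beta2 * v + (1 - beta2) * g**2     # v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2
    m_bar = m / (1 - beta1 ** t)           # bias-corrected first moment
    v_bar = v / (1 - beta2 ** t)           # bias-corrected second moment
    w = w - eta * m_bar / (np.sqrt(v_bar) + eps)
    return w, m, v
```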
I don't clearly understand why $\bar{m}_t$ and $\bar{v}_t$ are calculated. What I have heard is that this is done to remove the bias towards zero which exists initially during training: since $m_0$ and $v_0$ are initialized to zero, $m_t$ and $v_t$ are very small values during the first steps. And since $\beta_1$ and $\beta_2$ are close to one, $1 - \beta_1^t$ and $1 - \beta_2^t$ are close to zero for small $t$, so the division scales the estimates up to larger values. But what is the need to do this? Shouldn't these values get larger naturally over time anyway? Can someone please explain what this means?
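To make my confusion concrete, here is a small numeric check I ran (the constant gradient $g_t = 1$ is purely illustrative): without the correction, $m_t$ starts far below the running mean of the gradients, while the corrected estimate matches it from the very first step.

```python
# Constant gradient g_t = 1.0 for illustration; m starts at zero.
beta1, g = 0.9, 1.0
m = 0.0
for t in range(1, 6):
    m = beta1 * m + (1 - beta1) * g   # raw first-moment estimate m_t
    m_bar = m / (1 - beta1 ** t)      # bias-corrected estimate
    print(f"t={t}: m_t = {m:.4f}, m_bar = {m_bar:.4f}")
# t=1: m_t = 0.1000, m_bar = 1.0000
# t=2: m_t = 0.1900, m_bar = 1.0000
# ... m_t only approaches 1.0 slowly, while m_bar is exact immediately.
```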