

Why is it possible to approximate E[g_i^2] with E[g_t^2]? By the time we reach timestep t we have already made weight updates, which means the gradients should be different, since they are taken at different points in parameter space.

I guess the error term (zeta) comes from making that same approximation, but does the so-called bias correction really get rid of it? I don't think so.
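
To make the concern concrete, here is a small numerical sketch. It assumes the derivation in question is the usual exponential moving average of squared gradients, v_t = beta2 * v_{t-1} + (1 - beta2) * g_t^2 with v_0 = 0, corrected by dividing by (1 - beta2^t). The function name `ema_second_moment` and the toy gradient sequences are made up for illustration. If g_i^2 is the same at every step, the correction recovers it exactly; if g_i^2 drifts, a residual remains after the correction, which is what I understand zeta to stand for.

```python
def ema_second_moment(sq_grads, beta2=0.999):
    """Return (v_t, bias-corrected v_t) after feeding every squared gradient.

    Hypothetical helper: v starts at 0 and follows
    v = beta2 * v + (1 - beta2) * g^2, then is divided by (1 - beta2^t).
    """
    v = 0.0
    for g2 in sq_grads:
        v = beta2 * v + (1.0 - beta2) * g2
    t = len(sq_grads)
    return v, v / (1.0 - beta2 ** t)


# Case 1: stationary squared gradient (g_i^2 = 4 for every i).
_, v_hat_const = ema_second_moment([4.0] * 10)

# Case 2: drifting squared gradient (g_i^2 = 1, 2, ..., 10).
drifting = [float(i) for i in range(1, 11)]
_, v_hat_drift = ema_second_moment(drifting)

# What is left over after the correction, relative to the latest g_t^2.
residual = v_hat_drift - drifting[-1]

print("stationary, corrected:", v_hat_const)
print("drifting, corrected:  ", v_hat_drift)
print("residual vs g_t^2:    ", residual)
```

In the drifting case the corrected value is a weighted average of all past g_i^2 rather than g_t^2 itself, so the division only removes the zero-initialization factor and not the part caused by the gradients changing between steps.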

