I don't claim to understand the paper in its entirety, but as I read it, the authors use the term "robust" to describe their Bayesian hierarchical model, which places an inverse-gamma prior on the variance (sigma squared) of the Gaussian distribution. The resulting predictive distribution is a Student-t, the usual substitute for the normal when the variance is unknown, and its heavier tails make it less sensitive to outliers.
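To make the Gaussian-to-Student-t step concrete, here is a minimal sketch (not taken from the paper) that numerically integrates a Gaussian likelihood against an inverse-gamma prior on the variance and compares the result with the closed-form Student-t density. The shape `a`, scale `b`, mean `mu`, and the function names are my own illustrative assumptions. Under this parameterization the textbook result is a Student-t with 2a degrees of freedom, location mu, and scale sqrt(b/a).

```python
import math


def student_t_pdf(x, nu, loc, scale):
    """Closed-form density of a location-scale Student-t distribution."""
    z = (x - loc) / scale
    log_norm = (math.lgamma((nu + 1) / 2) - math.lgamma(nu / 2)
                - 0.5 * math.log(nu * math.pi) - math.log(scale))
    return math.exp(log_norm - (nu + 1) / 2 * math.log1p(z * z / nu))


def marginal_pdf(x, mu, a, b, lo=-30.0, hi=30.0, steps=6000):
    """Integrate N(x | mu, v) * InvGamma(v | a, b) over the variance v.

    Uses the trapezoid rule on u = log(v), so dv = v du.
    The grid bounds and step count are illustrative choices.
    """
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        u = lo + i * h
        v = math.exp(u)
        log_normal = -0.5 * math.log(2 * math.pi * v) - (x - mu) ** 2 / (2 * v)
        log_inv_gamma = a * math.log(b) - math.lgamma(a) - (a + 1) * u - b / v
        weight = 0.5 if i in (0, steps) else 1.0
        # The extra "+ u" is the log of the Jacobian dv/du = v.
        total += weight * math.exp(log_normal + log_inv_gamma + u)
    return total * h


if __name__ == "__main__":
    mu, a, b = 0.0, 3.0, 2.0  # hypothetical values, chosen only for illustration
    for x in (-2.0, 0.0, 1.5):
        numeric = marginal_pdf(x, mu, a, b)
        closed = student_t_pdf(x, nu=2 * a, loc=mu, scale=math.sqrt(b / a))
        print(f"x={x:+.1f}  integrated={numeric:.8f}  student_t={closed:.8f}")
```

The two columns should agree to many decimal places, which is the sense in which the Student-t "falls out" once the unknown variance is integrated away.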
The authors use the HT algorithm to "unbias" the mini-batch estimator. They claim it has better convergence properties than the full-data estimate, since each mini-batch is sparser and there is less data to crunch per sample.