To understand Watanabe's discussion, it is important to be clear about what he means by "singularity". The (strict) singularity of a statistical model coincides with the geometric notion of a singular metric in his theory.
p.10 of [Watanabe]: "A statistical model $p(x\mid w)$ is said to be regular if it
is identifiable and has a positive definite metric. If a statistical
model is not regular, then it is called strictly singular."
In practice, singularity usually arises when the Fisher information metric induced by the model is degenerate on the manifold defined by the model, as in the low-rank or sparse cases that appear in machine learning.
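As a minimal illustration (a standard toy example, not one taken from the quotation above): take the normal model
$$p(x\mid a,b)=\mathcal{N}(x\mid ab,\,1),\qquad (a,b)\in\mathbb{R}^2 .$$
It is not identifiable, since every pair with the same product $ab$ gives the same distribution, and a direct computation gives the Fisher information metric
$$I(a,b)=\begin{pmatrix} b^2 & ab\\ ab & a^2\end{pmatrix},\qquad \det I(a,b)\equiv 0,$$
so the metric is degenerate everywhere and the model is strictly singular in the sense quoted above.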
What Watanabe says about the convergence of the empirical KL divergence to its theoretical value can be understood as follows. One origin of the notion of divergence is robust statistics. M-estimators, which include the MLE as the special case with contrast function $\rho(\theta,\delta(X))=-\log p(X\mid \theta)$, are usually discussed using the weak topology. It is reasonable to discuss their convergence behavior in the weak topology over the space $M(\mathcal{X})$ (the manifold of all possible measures on a Polish space $\mathcal{X}$) because we want to study the robustness behavior of the MLE. A classical theorem in [Huber] states that if the divergence function $D(\theta_0,\theta)=E_{\theta_{0}}\rho(\theta,\delta)$ is well separated, $$\inf_{|\theta-\theta_0|\geq\epsilon}\left(|D(\theta_0,\theta)-D(\theta_0,\theta_0)|\right)>0,$$
and the empirical contrast uniformly approximates the divergence,
$$\sup_{\theta}\left|\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))- D(\theta_0,\theta)\right|\rightarrow 0,n\rightarrow\infty$$
then, together with mild regularity conditions, we obtain consistency in the sense that the M-estimator
$$\hat{\theta}_n:=\mathop{\mathrm{arg\,min}}_{\theta}\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))$$
converges to $\theta_0$ in $P_{\theta_0}$-probability. This result requires far more delicate conditions than Doob's result [Doob] on the weak consistency of Bayesian estimators.
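To spell out the reasoning (this is the standard uniform-convergence argument, not a line-by-line rendering of [Huber]): abbreviate the empirical contrast as $D_n(\theta):=\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))$ and note that $\theta_0$ minimizes $D(\theta_0,\cdot)$. Since $\hat{\theta}_n$ minimizes $D_n$,
$$D(\theta_0,\hat{\theta}_n)-D(\theta_0,\theta_0)\;\leq\;\bigl(D(\theta_0,\hat{\theta}_n)-D_n(\hat{\theta}_n)\bigr)+\bigl(D_n(\theta_0)-D(\theta_0,\theta_0)\bigr)\;\leq\;2\sup_{\theta}\bigl|D_n(\theta)-D(\theta_0,\theta)\bigr|\;\rightarrow\;0,$$
and the separation condition then forces $P_{\theta_0}\bigl(|\hat{\theta}_n-\theta_0|\geq\epsilon\bigr)\rightarrow 0$ for every $\epsilon>0$.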
So here Bayesian estimators and the MLE diverge. If we use the weak topology to discuss the consistency of Bayesian estimators, the question is almost vacuous, because by Doob's theorem Bayesian estimators are always (with prior probability one) consistent. Therefore a more appropriate topology is the Schwartz distribution topology, which allows weak derivatives and lets von Mises' theory come into play. Barron has a very nice technical report on how Schwartz's theorem can be used to obtain consistency.
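For reference, the version of Doob's theorem being invoked can be stated as follows (a standard modern formulation, not a quotation of [Doob]): if the parameter space and $\mathcal{X}$ are Polish, $(x,\theta)\mapsto p(x\mid\theta)$ is jointly measurable, and $\Pi$ is the prior, then for $\Pi$-almost every $\theta_0$,
$$\Pi\bigl(U\mid X_1,\dots,X_n\bigr)\rightarrow 1\quad P_{\theta_0}\text{-almost surely, for every neighborhood }U\text{ of }\theta_0$$
(in non-identifiable models this should be read at the level of the induced distributions $p(\cdot\mid\theta)$ rather than the parameters). No positive-definiteness of the Fisher metric is needed; the only price is the "$\Pi$-almost every $\theta_0$" qualifier.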
From another perspective, Bayesian estimators are distributions, so their natural topology should be something different. What role, then, does the divergence $D$ play in that topology? The answer is that it defines the KL support of the prior, which is what allows Bayesian estimators to be strongly consistent.
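Concretely (this is the Kullback–Leibler support condition as it is usually stated in the Schwartz/Barron line of work, not something spelled out above): a true density $p_0$ lies in the KL support of the prior $\Pi$ if every KL neighborhood of $p_0$ receives positive prior mass,
$$\Pi\left(\left\{p:\ \int p_0\log\frac{p_0}{p}\,d\mu<\epsilon\right\}\right)>0\qquad\text{for every }\epsilon>0,$$
and it is this condition, combined with suitable testing or entropy conditions, that drives the strong consistency results for posteriors.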
The "singular learning result" is affected because, as we see, Doob's consistency theorem ensures that Bayesian estimators to be weakly consistent(even in singular model) in weak topology while MLE should meet certain requirements in the same topology.
One word of caution: [Watanabe] is not for beginners. It relies on deep results about real analytic sets, which require more mathematical maturity than most statisticians have, so it is probably not a good idea to read it without appropriate guidance.
$\blacksquare$ References
[Watanabe] Watanabe, Sumio. Algebraic geometry and statistical learning theory. Vol. 25. Cambridge University Press, 2009.
[Huber] Huber, Peter J. "The behavior of maximum likelihood estimates under nonstandard conditions." Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. No. 1. 1967.
[Doob] Doob, Joseph L. "Application of the theory of martingales." Le calcul des probabilités et ses applications (1949): 23-27.