To understand Watanabe's discussion, it is important to be clear about what he means by "singularity". The (strict) singularity of a statistical model coincides with the geometric notion of a singular metric in his theory.
p.10 of [Watanabe]: "A statistical model $p(x\mid w)$ is said to be regular if it
is identifiable and has a positive definite metric. If a statistical
model is not regular, then it is called strictly singular."
In practice, singularity usually arises when the Fisher information metric induced by the model is degenerate on the manifold defined by the model, as in the low-rank or sparse cases that appear in machine learning.
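As a minimal illustration (a standard toy example, not one taken from the quotation above): take the normal model
$$p(x\mid a,b)=\mathcal{N}(x\mid ab,\,1),\qquad (a,b)\in\mathbb{R}^2 .$$
It is not identifiable, since every pair with the same product $ab$ gives the same distribution, and a direct computation gives the Fisher information metric
$$I(a,b)=\begin{pmatrix} b^2 & ab\\ ab & a^2\end{pmatrix},\qquad \det I(a,b)\equiv 0,$$
so the metric is degenerate everywhere and the model is strictly singular in the sense quoted above.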
What Watanabe says about the convergence of the empirical KL divergence to its theoretical value can be understood as follows. One origin of the notion of divergence is robust statistics. M-estimators, which include the MLE as the special case with contrast function $\rho(\theta,\delta(X))=-\log p(X\mid \theta)$, are usually discussed using the weak topology. It is reasonable to discuss their convergence behavior in the weak topology over the space $M(\mathcal{X})$ (the manifold of all possible measures on a Polish space $\mathcal{X}$) because we want to study the robustness behavior of the MLE. A classical theorem in [Huber] states that if the divergence function $D(\theta_0,\theta)=E_{\theta_{0}}\rho(\theta,\delta)$ is well separated, $$\inf_{|\theta-\theta_0|\geq\epsilon}\left(|D(\theta_0,\theta)-D(\theta_0,\theta_0)|\right)>0,$$
and the empirical contrast uniformly approximates the divergence,
$$\sup_{\theta}\left|\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))- D(\theta_0,\theta)\right|\rightarrow 0,n\rightarrow\infty$$
then, together with mild regularity conditions, we obtain consistency in the sense that the M-estimator
$$\hat{\theta}_n:=\mathop{\mathrm{arg\,min}}_{\theta}\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))$$
converges to $\theta_0$ in $P_{\theta_0}$-probability. This result requires far more delicate conditions than Doob's result [Doob] on the weak consistency of Bayesian estimators.
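To spell out the reasoning (this is the standard uniform-convergence argument, not a line-by-line rendering of [Huber]): abbreviate the empirical contrast as $D_n(\theta):=\frac{1}{n}\sum_{i}\rho(\theta,\delta(X_i))$ and note that $\theta_0$ minimizes $D(\theta_0,\cdot)$. Since $\hat{\theta}_n$ minimizes $D_n$,
$$D(\theta_0,\hat{\theta}_n)-D(\theta_0,\theta_0)\;\leq\;\bigl(D(\theta_0,\hat{\theta}_n)-D_n(\hat{\theta}_n)\bigr)+\bigl(D_n(\theta_0)-D(\theta_0,\theta_0)\bigr)\;\leq\;2\sup_{\theta}\bigl|D_n(\theta)-D(\theta_0,\theta)\bigr|\;\rightarrow\;0,$$
and the separation condition then forces $P_{\theta_0}\bigl(|\hat{\theta}_n-\theta_0|\geq\epsilon\bigr)\rightarrow 0$ for every $\epsilon>0$.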
So here Bayesian estimators and the MLE diverge. If we use the weak topology to discuss the consistency of Bayesian estimators, the question is almost vacuous, because by Doob's theorem Bayesian estimators are always (with prior probability one) consistent. Therefore a more appropriate topology is the Schwartz distribution topology, which allows weak derivatives and lets von Mises' theory come into play. Barron has a very nice technical report on how Schwartz's theorem can be used to obtain consistency.
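For reference, the version of Doob's theorem being invoked can be stated as follows (a standard modern formulation, not a quotation of [Doob]): if the parameter space and $\mathcal{X}$ are Polish, $(x,\theta)\mapsto p(x\mid\theta)$ is jointly measurable, and $\Pi$ is the prior, then for $\Pi$-almost every $\theta_0$,
$$\Pi\bigl(U\mid X_1,\dots,X_n\bigr)\rightarrow 1\quad P_{\theta_0}\text{-almost surely, for every neighborhood }U\text{ of }\theta_0$$
(in non-identifiable models this should be read at the level of the induced distributions $p(\cdot\mid\theta)$ rather than the parameters). No positive-definiteness of the Fisher metric is needed; the only price is the "$\Pi$-almost every $\theta_0$" qualifier.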
From another perspective, Bayesian estimators are distributions, so their natural topology should be something different. What role, then, does the divergence $D$ play in that topology? The answer is that it defines the KL support of the prior, which is what allows Bayesian estimators to be strongly consistent.
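Concretely (this is the Kullback–Leibler support condition as it is usually stated in the Schwartz/Barron line of work, not something spelled out above): a true density $p_0$ lies in the KL support of the prior $\Pi$ if every KL neighborhood of $p_0$ receives positive prior mass,
$$\Pi\left(\left\{p:\ \int p_0\log\frac{p_0}{p}\,d\mu<\epsilon\right\}\right)>0\qquad\text{for every }\epsilon>0,$$
and it is this condition, combined with suitable testing or entropy conditions, that drives the strong consistency results for posteriors.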
The "singular learning result" is affected because, as we see, Doob's consistency theorem ensures that Bayesian estimators to be weakly consistent(even in singular model) in weak topology while MLE should meet certain requirements in the same topology.
One word of caution: [Watanabe] is not for beginners. It relies on deep results about real analytic sets, which require more mathematical maturity than most statisticians have, so it is probably not a good idea to read it without appropriate guidance.
$\blacksquare$ References
[Watanabe] Watanabe, Sumio. Algebraic geometry and statistical learning theory. Vol. 25. Cambridge University Press, 2009.
[Huber] Huber, Peter J. "The behavior of maximum likelihood estimates under nonstandard conditions." Proceedings of the fifth Berkeley symposium on mathematical statistics and probability. Vol. 1. No. 1. 1967.
[Doob] Doob, Joseph L. "Application of the theory of martingales." Le calcul des probabilités et ses applications (1949): 23-27.