
Suppose I have an i.i.d. sample $\{(Y_i, X_i)\}_{i=1}^n$ on which I am trying to estimate a conditional expectation model:

$$Y = g(X) + \varepsilon,\quad \mathbb E[\varepsilon | X] = 0$$

There is a large literature on the rates at which different estimators of $g$ converge to the true conditional expectation function under various regularity assumptions. For my application, however, I am not interested in $g$ directly; I only care about a single scalar summary of $g$, namely

$$\theta = \mathbb E[(g(X) - \mathbb E[Y])^2] \quad (\text{alternatively}, \eta = \mathbb E[\varepsilon ^2])$$

My question is the following: are there settings in which $\theta$ or $\eta$ can be estimated at a faster rate than $g$ itself?


1 Answer


We get some interesting results when we ask this question within a semi-parametric statistics framework. I will focus on estimating $\eta$: since $\mathbb E[\varepsilon \mid X] = 0$, the law of total variance gives $\mathbb V\mathrm{ar}[Y] = \theta + \eta$, so once we can estimate $\eta$, we can estimate $\theta = \mathbb V\mathrm{ar}[Y] - \eta$. Define the moment function (the uncentered influence function) for $\eta$ as

$$\psi_i = (Y_i - g(X_i))^2$$

so that, if $g$ were known, we could easily estimate $\eta$ via

$$\tilde\eta = \frac1n\sum_{i=1}^n \psi_i$$

However, we have to estimate $g(X_i)$. Therefore, a reasonable "plug-in" estimate of $\psi_i$ would be $\hat\psi_i = (Y_i - \hat g(X_i))^2$ and our updated estimate of $\eta$ would be

$$\hat\eta = \frac1n\sum_{i=1}^n \hat \psi_i$$
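To make this concrete, here is a minimal simulation sketch of the oracle and plug-in estimators. The data-generating process, sample size, and the random-forest choice for $\hat g$ are all assumptions made for illustration, not part of the question:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-2, 2, size=(n, 1))

def g_true(x):
    return np.sin(2 * x[:, 0])  # hypothetical true conditional expectation

eps = rng.normal(0.0, 0.5, size=n)  # so the true eta = 0.25
Y = g_true(X) + eps

# Oracle estimator: uses the true g, feasible only inside a simulation.
eta_tilde = np.mean((Y - g_true(X)) ** 2)

# Plug-in estimator: replace g with a nonparametric fit g-hat.
g_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=25, random_state=0)
g_hat.fit(X, Y)
eta_hat = np.mean((Y - g_hat.predict(X)) ** 2)

print(f"oracle: {eta_tilde:.3f}, plug-in: {eta_hat:.3f} (true eta = 0.25)")
```

Note that fitting and evaluating $\hat g$ on the same sample biases $\hat\eta$ downward through overfitting, which is one motivation for the sample-splitting discussed below.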

It is a simple exercise to check that $\psi$ is first-order insensitive to perturbations of $g$ around the true $g$, in the sense that its Gâteaux derivative vanishes: for any candidate function $h$,

$$\partial_r\, \mathbb E[\psi(g + r(h - g))]\Big|_{r=0} = -2\,\mathbb E[(h(X)-g(X))\cdot (Y - g(X))] = 0$$
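To spell the exercise out, write $\Delta = h - g$, note that $Y - g(X) = \varepsilon$, and use the tower property:

$$\partial_r\, \mathbb E\big[(Y - g(X) - r\,\Delta(X))^2\big]\Big|_{r=0} = -2\,\mathbb E[\Delta(X)\,\varepsilon] = -2\,\mathbb E\big[\Delta(X)\,\mathbb E[\varepsilon \mid X]\big] = 0$$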

This is the "Neyman orthogonality" condition described, for example, in Chernozhukov et al. (2017). It implies that even if $g$ is estimated more slowly than the parametric $O_P\left(n^{-1/2}\right)$ rate, $\hat\eta$ can still converge to the true value $\eta_0$ at the parametric rate, and moreover be asymptotically normal with feasibly estimated standard errors. In particular, if $g$ is smooth enough that $\hat g$ converges to it in $L^2$ at a rate $o_P\left(n^{-1/4}\right)$, then we should expect that (with some sample-splitting):

$$\frac1{\hat\sigma \sqrt n}\sum_{i=1}^n \left(\hat \psi_i - \eta_0\right) \Rightarrow \mathcal N(0,1),\quad \hat\sigma^2 = \frac1n\sum_{i=1}^n \left(\hat \psi_i - \hat\eta\right)^2$$

The $O_P\left(n^{-1/2}\right)$ convergence is not too surprising: if $\hat g \to g$ in $L^2$ norm at some rate, then the squared norm (which is essentially what $\hat\eta$ is estimating) mechanically converges at twice that rate in the exponent, so the nonparametric error $\|\hat g - g\|^2$ is asymptotically negligible; the asymptotic normality, though, is a potentially useful addition.
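As a sanity check on the sample-splitting recipe, here is a minimal cross-fitting sketch in the spirit of the cited paper (again, the data-generating process, the random-forest learner, and all tuning choices are hypothetical assumptions for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 5000
X = rng.uniform(-2, 2, size=(n, 1))
Y = np.sin(2 * X[:, 0]) + rng.normal(0.0, 0.5, size=n)  # true eta = 0.25

# Cross-fitting: each psi-hat_i is evaluated with a g-hat trained on the
# other folds only, so no observation scores its own fitted value.
psi_hat = np.empty(n)
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    g_hat = RandomForestRegressor(n_estimators=200, min_samples_leaf=25, random_state=0)
    g_hat.fit(X[train_idx], Y[train_idx])
    psi_hat[test_idx] = (Y[test_idx] - g_hat.predict(X[test_idx])) ** 2

eta_hat = psi_hat.mean()
se = psi_hat.std(ddof=1) / np.sqrt(n)  # feasible standard error from the CLT display
print(f"eta-hat = {eta_hat:.3f}, 95% CI = [{eta_hat - 1.96*se:.3f}, {eta_hat + 1.96*se:.3f}]")
```

Because each $\hat\psi_i$ comes from a $\hat g$ trained without observation $i$'s fold, the own-observation overfitting bias of the naive in-sample plug-in drops out.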
