
Given a random variable $Y$ and the typical squared loss function:

$$L(Y,\hat{Y}) = (Y-\hat{Y})^2$$

the minimizer of the expected loss $E[L(Y,\hat{Y})]$ is known to be the mean, $\hat{Y} = E[Y] = \mu$.

If we take $n$ i.i.d. samples from the distribution of $Y$, we can describe an Empirical Risk Minimization (ERM) procedure:

$$\hat{Y} = \arg\min_{Y^*} \sum_{i=1}^n (Y_i - Y^*)^2$$

$$\implies \hat{Y} = \frac{1}{n}\sum_{i=1}^nY_i$$

$$E[\hat{Y}] = \mu,$$

so the estimator is unbiased, and by the law of large numbers it is also consistent.
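A quick numerical check (a sketch with arbitrary choices of mean, standard deviation, and sample size) confirms that the ERM solution coincides with the sample mean:

set.seed(42)
y <- rnorm(100, mean = 3, sd = 2)
# empirical risk under squared loss, as a function of the candidate value Y*
risk <- function(y_star) sum((y - y_star)^2)
c(erm = optimize(risk, interval = range(y))$minimum, sample_mean = mean(y))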

Now let's assume that $Y \sim N(0,\sigma^2)$ and our loss function is as follows:

$$L(Y,\hat{Y}) = e^{2(Y-\hat{Y})} - 2(Y-\hat{Y}) - 1$$

The minimizer of the expected loss $E[L(Y,\hat{Y})]$ can be shown to be $\hat{Y} = \sigma^2$, using the fact that $e^{2Y}$ is lognormal with $E[e^{2Y}] = e^{2\sigma^2}$.
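A quick numerical check of this claim (a sketch with an arbitrary choice $\sigma^2 = 2$; it uses the closed form $E[L(Y,\hat{Y})] = e^{2\sigma^2}e^{-2\hat{Y}} + 2\hat{Y} - 1$, which follows from $E[Y] = 0$ and the lognormal moment above):

sigma2 <- 2
# closed-form expected loss for Y ~ N(0, sigma2)
expected_loss <- function(y_hat) exp(2 * sigma2) * exp(-2 * y_hat) + 2 * y_hat - 1
optimize(expected_loss, interval = c(0, 10))$minimum    # approximately sigma2 = 2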

If we again apply the ERM procedure:

$$\hat{Y} = \arg\min_{Y^*} \sum_{i=1}^n L(Y_i,Y^*)$$

$$\implies \hat{Y} = \frac{1}{2} \, \ln \left( \frac{1}{n} \sum_{i=1}^n e^{2Y_i} \right)$$

$$E[\hat{Y}] \not\rightarrow \sigma^2$$

I would like to understand why the procedure is not consistent in this case. Which assumptions of ERM am I violating?

Cagdas Ozgenc

1 Answer


The estimator is strongly consistent: by the strong law of large numbers and the continuity of the logarithm, $$ \hat{\theta}_n = \frac{1}{2} \, \ln \left( \frac{1}{n} \sum_{i=1}^n e^{2Y_i} \right) \to \frac{1}{2} \, \ln \left( \mathbb{E}[e^{2Y}]\right) = \sigma^2, $$ almost surely.
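A quick numerical illustration of this convergence (a minimal sketch; $\sigma^2 = 1$ and the sample sizes are arbitrary choices):

set.seed(123)
sigma2 <- 1
for (n in c(1e2, 1e4, 1e6)) {
    y <- rnorm(n, mean = 0, sd = sqrt(sigma2))
    theta_hat <- 0.5 * log(mean(exp(2 * y)))    # should approach sigma2 as n grows
    cat("n =", n, " theta_hat =", round(theta_hat, 4), "\n")
}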

Continuing the discussion started in the comments about the sampling distribution of the estimator, here is a small simulation:

sim <- function(true_var, n, N) {
    # N replications, each a sample of n draws from N(0, true_var)
    y <- matrix(rnorm(N * n, mean = 0, sd = sqrt(true_var)), ncol = n)
    # row-wise maxima, used to stabilize the computation (see the note below)
    M <- apply(y, 1, max)
    dy <- sweep(y, 1, M, "-")
    # estimator for each replication: theta_hat = M + 0.5 * log(mean(exp(2 * (y - M))))
    est <- M + apply(dy, 1, function(row) 0.5 * log(mean(exp(2 * row))))
    # histogram of the N estimates, with the true variance marked in red
    hist(est, freq = TRUE, breaks = "FD", col = "cyan")
    abline(v = true_var, col = "red", lwd = 2)
}

set.seed(1234)

sim(true_var = 2, n = 10^4, N = 10^3)

[Histogram of the estimates for true_var = 2, with the true variance marked in red]

sim(true_var = 10, n = 10^4, N = 10^3)

[Histogram of the estimates for true_var = 10, with the true variance marked in red]

Note: in the function, I subtracted the maximum to prevent overflows: $$ \hat{\theta}_n = M + \frac{1}{2} \, \ln \left( \frac{1}{n} \sum_{i=1}^n e^{2(Y_i-M)}\right), $$ in which $M = \max_{1\leq i\leq n} Y_i$.
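To see the effect of this stabilization in isolation (a toy sketch; the values below are artificial, chosen so that the naive computation overflows):

# naive computation overflows: exp(2 * 400) exceeds the largest representable double
y <- c(300, 350, 400)
naive <- 0.5 * log(mean(exp(2 * y)))              # Inf

# subtracting the maximum keeps every exponential at or below 1
M <- max(y)
stable <- M + 0.5 * log(mean(exp(2 * (y - M))))   # finite

c(naive = naive, stable = stable)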

Zen
  • The strong law says that the estimator is consistent, but the sampling distribution of the estimator gets weird when we increase the value of the true variance. – Zen Dec 27 '19 at 13:45
  • I don't know much theory but isn't $E[\ln f(X)] \neq \ln E[f(X)]$, how does that continuity thing work? – Cagdas Ozgenc Dec 27 '19 at 14:32
  • if $X_n$ converges a.s. to $X$, and $g$ is a continuous function, then $g(X_n)$ converges a.s. to $g(X)$. – Zen Dec 27 '19 at 14:35
  • 1
    It's possible for an estimator to be strongly consistent, but still be rubbish. See this for example: https://radfordneal.wordpress.com/2008/08/17/the-harmonic-mean-of-the-likelihood-worst-monte-carlo-method-ever/ – Zen Dec 27 '19 at 15:13
  • There is a long discussion in the comments here: https://stats.stackexchange.com/a/31038/20980. Is this one of those cases where a.s. convergence doesn't imply we are actually converging in the mean? – Cagdas Ozgenc Dec 28 '19 at 19:31
  • If you want to investigate this, check the simulation first. It looks like the bias isn't going to zero, and the variance is increasing. It may be an interesting pathological example. Then, you may try to find approximate expressions for the bias and the variance of the estimator using Taylor expansions. It may reveal something. Why did you consider this particular loss function? – Zen Dec 28 '19 at 20:49
  • Even with `true_var = 2`, the variance of a single $\exp(2*Y)$ is going to be very large relative to the mean: about 8.9 million, if my calculations are correct. This may hinder your ability to learn about the asymptotics through simulation. – jbowman Jan 13 '20 at 22:03
  • Basically, you are taking the log of the sample mean from a lognormal distribution, and I can't see any reason why eventually the asymptotics won't take over and give you a nicely-behaved close-to-Normal distribution with bias going to zero. The question is when "eventually" kicks in. – jbowman Jan 13 '20 at 22:06
  • @jbowman I used billions of samples for true_var = 10. Didn't cut it. It seems infinity is not just a big number, and non-linear biases are very dangerous. – Cagdas Ozgenc Jan 14 '20 at 07:20
  • Billions isn't going to cut it for true_var = 10; the variance is on the order of $\exp\{20\}$, which is roughly $10^{8.7}$. You would need your actual sample size, not just the # of samples drawn, to be on the order of a billion to get the s.e. of the mean down below 1. – jbowman Jan 14 '20 at 16:20
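Following up on the moment calculations in the comments: for $Y \sim N(0,\sigma^2)$, $e^{2Y}$ is lognormal with mean $e^{2\sigma^2}$ and variance $e^{8\sigma^2} - e^{4\sigma^2}$ (standard lognormal formulas). A quick sketch of these moments shows how fast the spread of $e^{2Y}$ grows with the true variance:

lognormal_moments <- function(sigma2) {
    # moments of exp(2 * Y) for Y ~ N(0, sigma2)
    m <- exp(2 * sigma2)
    v <- exp(8 * sigma2) - exp(4 * sigma2)
    c(mean = m, variance = v, cv = sqrt(v) / m)
}

lognormal_moments(2)     # variance around 8.9 million, matching the figure quoted above
lognormal_moments(10)    # vastly larger, consistent with billions of draws not sufficing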