
Let $f(x,y)$ be some density, and let the leave-one-out kernel density estimator $\widehat{f}_{-i}(x,y)$ be defined as $$\widehat{f}_{-i}(x,y)=\frac{1}{(n-1)h^2}\sum_{j=1,\,j\neq i}^n K\!\left(\frac{(X_j,Y_j)-(x,y)}{h}\right),$$ where $K(\cdot,\cdot)$ is the kernel function and $h\rightarrow 0$ at a speed such that $\underset{(x,y)\in J}{\sup} |\widehat{f}_{-i}(x,y)-f(x,y)|=o_{P}(n^{-1/4})$.
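(For concreteness, here is a minimal R sketch of this estimator. The product Gaussian kernel, the bandwidth $h = n^{-1/6}$, and the name `f_hat_loo` are illustrative choices of mine, not from the paper.)

# leave-one-out kernel density estimate at a point (x, y),
# using a product Gaussian kernel with bandwidth h
f_hat_loo <- function(x, y, i, X, Y, h) {
  # kernel terms K((X_j - x)/h) * K((Y_j - y)/h) / h^2 for all observations j
  K <- dnorm((X - x) / h) * dnorm((Y - y) / h) / h^2
  # average over all j except i, i.e. divide by (n - 1)
  mean(K[-i])
}

set.seed(1)
n <- 500
X <- rnorm(n); Y <- rnorm(n)    # sample from a standard bivariate normal
h <- n^(-1/6)                   # an illustrative bandwidth rate for two dimensions
f_hat_loo(0, 0, i = 1, X, Y, h) # compare with the true density f(0,0) = 1/(2*pi)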

In a paper, I came across the following statement:

"$ R_{n}=\frac{C}{n}\sum_{i=1}^{n} \underset{(x,y)\in J}{\sup} |\widehat{f}_{-i}(x,y)-f(x,y)| $, where $C$ is some positive constant and $J$ is a compact subset of the support.

As $\underset{(x,y)\in J}{\sup} |\widehat{f}_{-i}(x,y)-f(x,y)|=o_{P}(n^{-1/4})$, it follows that $R_n=o_p(n^{-1/2})$."

Here the $o_{p}(a_n)$ notation means convergence in probability to zero at rate $a_n$, i.e. $X_n = o_p(a_n)$ means $X_n/a_n \rightarrow 0$ in probability.

Why can we arrive at the conclusion that $R_n=o_p(n^{-1/2})$? Intuitively, why does taking an average make the convergence to zero faster? Thanks in advance!

T34driver
    Could you provide a link to the paper? – Sextus Empiricus Sep 13 '20 at 21:08
  • @SextusEmpiricus Hi, sorry I cannot, as it's still a preliminary paper circulated in a small group and cannot be distributed. But I can provide more details of the math. Do you want that? – T34driver Sep 13 '20 at 21:43
  • @SextusEmpiricus Yes, let me provide more detailed math in the question. $J$ is independent of $n$. I guess your confusion is caused by the bad notation in this paper, let me try to restate it as follows: $\sum_{i=1}^{n}\underset{x,y\in J}{\sup}|\widehat{f}_{-i}(x,y)-f(x,y)|$, so summation is over the $i$ in the $-i$. – T34driver Sep 14 '20 at 00:39
  • So $J$ is a set of points $(x,y)$. And the supremum in this case actually coincides with the maximum, right? – Sextus Empiricus Sep 14 '20 at 06:48
  • @SextusEmpiricus Yes – T34driver Sep 14 '20 at 07:30

1 Answer


Simulation

I have been doing a bit of modelling to see how the leave-one-out estimators converge. In my simulation (one-dimensional, but I don't believe that matters), I find that they are strongly correlated (i.e. there is not much variance between the different $-i$).

When $n$ gets large, the values of $$\underset{x\in J}{\sup} |\widehat{f}_{-i}(x)-f(x)|$$ are very similar for different values of $i$.

This makes sense: leaving out one $i$ versus another has little effect. I wonder whether something is missing from the question.

The simulation below is just a quick plot of some errors computed for different $n$ and different $i$. I guess that the $\mathcal{o}_P(a_n)$ relates to the variance, which is not exactly the same across $i$, but the plot shows that the different $i$ are not so different from each other, so the averaging won't have such a big effect for large $n$.

[example plot for convergence]

# sample size
ns <- 1000

# kernel estimator
f_hat <- function(x, i, obsf,obsx) {
  ### some function for the bandwidth
  h <- 1/length(obsf)  
  ### distance from the sample point
  d <- x-obsx
  ### Gaussian as kernel function
  K <- dnorm(d,mean=0,sd=h)*obsf
  ## an average over the kernel functions
  f <- mean(K[-i])
  return(f)
}
f_hat <- Vectorize(f_hat, vectorize.args = 'x')

# some function to be estimated
f <- function(x) {
  sin(x*10)+sin(x*2)
}

# the set of points to estimate
x <- seq(0,1,0.01)
ni <- length(x)
z <- f(x)

# the data
xs <- runif(ns)
fs <- f(xs)+rnorm(ns,0,0.1)

### what the estimation looks like
plot(x,z, type = "l", lwd = 2)
points(xs,fs, pch = 21, col = 1, bg = 1, cex = 0.1)
lines(x,f_hat(x,1,fs,xs), col = 2, lty = 2, lwd = 2)



### repeating for many different sample sizes
nrange <- floor(2^c(seq(6.5,16,0.25)))
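### error matrix: one row per sample size, one column per left-out index i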
err <- matrix(rep(0,length(nrange)*90),length(nrange))

j = 0
for (ns in nrange) {
  j=j+1
  xs <- runif(ns)
  fs <- f(xs)+rnorm(ns,0,0.1)
  for (i in 1:90) {
    ### the maximum error for the points x
    ### computed for 90 different i
    err[j,i] <- max(abs(f_hat(x,i,fs,xs)-f(x))) 
  }
}

plot(-1,-1, log = "xy", xlim = range(nrange), ylim = range(err),
     xlab = "n", ylab = "error size")
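### plot the error curves for the first 10 of the 90 values of i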
for (i in 1:10) {
  lines(nrange,err[,i],col = rgb(0,0,0,0.3))
}

[simulation plot: error size versus sample size $n$]

Intuition

At first, I thought that maybe the different $i$ have large differences such that the averaging procedure is reducing the variance/error by diluting the probability of selecting a 'bad' $i$.

But with this plot, I guess that either I misunderstand the concept, or the question is missing some details that would make the error values for the leave-one-out estimators more different for different $i$.

The idea that the variance of an average can converge faster than the variance of the elements is not strange.

Say you have

$$S = \frac{1}{n} \sum_{i=1}^n X_{i,n} $$

where $X_{i,n}$ are independent random variables (with the same mean) such that $\text{Var}(X_{i,n}) \in \mathcal{o}(f(n))$. Then $\text{Var}(S) = \frac{1}{n^2}\sum_{i=1}^n \text{Var}(X_{i,n}) \in \mathcal{o}(f(n)/n)$, so the standard deviation of the average gains an extra factor $n^{-1/2}$ compared to a single term.
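As a quick numerical check of this scaling (a sketch with plain iid standard normal terms, not the estimator from the question):

# numerical check: Var(average of n iid terms) = Var(single term) / n
set.seed(2)
n <- 100
reps <- 10000
# each row holds n iid standard normal draws; rowMeans gives the average S
X <- matrix(rnorm(reps * n), nrow = reps)
var(X[, 1])       # variance of a single term, close to 1
var(rowMeans(X))  # variance of the average, close to 1/n = 0.01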

I am not sure whether this is exactly what is behind the $\mathcal{o}_p(a_n)$ term: whether it is about the convergence of the variance of the error term, i.e. the difference with respect to its expected value, or about the convergence of the mean square error, i.e. the difference with respect to zero.

Sextus Empiricus
  • Thank you so much for the simulation and the intuition, it's very helpful. Just a minor remark: I guess you need to change the definition of your $S$ to the average of $X_{i,n}$ instead of the sum. I agree with your intuition, and I guess their conclusion that $R_n=o_{p}(n^{-1/2})$ could just be wrong. Since what they have is the average of absolute values, large positive and large negative errors cannot cancel out as in the CLT, where we have $\frac{1}{n}\sum_{i}(X_i-EX_i)$. They might need to change it to $\sup|\frac{1}{n}\sum_i \widehat{f}_{-i}(x)-f(x)|$. – T34driver Sep 14 '20 at 18:44
  • I am not sure whether my (preliminary) answer is already an acceptable answer. I just added this simulation to get a more direct/practical example of the estimator, and because I had troubles to get the question at once. The idea about the mean and the additional factor $n^{1/2}$ in the convergence gives an idea about how it could work but it is not complete yet since the paper has a factor $n^{1/4}$ which is not clear (maybe it is due to some correlation or other peculiarities). At least, I have still some open questions about the problem, or I do not get it completely/detailed. – Sextus Empiricus Sep 14 '20 at 19:14
  • "*They might need to change it to $\sup|\frac{1}{n}\sum_i \widehat{f}_{-i}(x)-f(x)|$*" This term $\frac{1}{n}\sum_i \widehat{f}_{-i}(x)$ would equal the regular estimator (without leaving one out); see the sketch after these comments. The difference is that the summation is outside the supremum. – Sextus Empiricus Sep 14 '20 at 19:21
  • I guess you are right. I need to think about it further. Thanks! – T34driver Sep 14 '20 at 19:38
  • Can you try this one $\frac{1}{n}\sum_{i}|\widehat{f}_{-i}(X_i)-f(X_i)|$ in simulation? – T34driver Sep 16 '20 at 02:19
    @JTS365 You probably mean $\frac{1}{n}\sum_{i}|\widehat{f}_{-i}(X)-f(X)|$ instead of $\frac{1}{n}\sum_{i}|\widehat{f}_{-i}(X_i)-f(X_i)|$. I am not sure what that's gonna add. Or do you mean to compare the two terms with the summation on the inside versus outside? Maybe you could add more context about the situation to help understand what this averaging of errors in estimators is supposed to do. – Sextus Empiricus Sep 16 '20 at 19:09
  • I see, thanks for the comments. I meant $A=\frac{1}{n}\sum_{i}|\widehat{f}_{-i}(X_i)-f(X_i)|$. The supremum they use in the current paper is to magnify $A$. I see your point on the inside versus outside too, it would be nice to also try $B=|\frac{1}{n}(\sum_{i}\widehat{f}_{-i}(X_i)-f(X_i))|$, and I guess $B$ could do better than $A$. More context: the paper tries to show that taking average can make the remainder shrink faster. – T34driver Sep 17 '20 at 17:58
    @JTS365 That sounds like some sort of [bagging](https://en.m.wikipedia.org/wiki/Bootstrap_aggregating). But, from the sidelines, it is very difficult to judge how it will work. I hope that you can add a reference to the original text later. Already the use of subscripts $_i$ is confusing (or maybe even wrong) so it is not strange if more things might be problematic for this preliminary article/result. – Sextus Empiricus Sep 17 '20 at 18:15
  • I see, that's very good advice. I guess I will try to learn the concept you mentioned as well as discuss some related problems so I have some sense of what is possible and why. Thanks again! – T34driver Sep 17 '20 at 18:19
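To illustrate the point raised in the comments that $\frac{1}{n}\sum_i \widehat{f}_{-i}(x)$ equals the regular full-sample estimator (each observation is left out exactly once, so the kernel terms recombine), here is a small R check. The one-dimensional setting, Gaussian kernel, and bandwidth are illustrative choices, not from the paper:

# check: (1/n) * sum_i f_hat_{-i}(x0) equals the full-sample estimator at x0
set.seed(3)
n <- 200
h <- n^(-1/5)                  # an illustrative bandwidth
X <- rnorm(n)
x0 <- 0                        # evaluation point
K <- dnorm((X - x0) / h) / h   # kernel terms K((X_j - x0)/h)/h for all j

f_full <- mean(K)                               # (1/(n h)) * sum_j kernel terms
f_loo  <- sapply(1:n, function(i) mean(K[-i]))  # f_hat_{-i}(x0) for each i
c(mean(f_loo), f_full)                          # equal up to floating point error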