If variable kernel widths are often good for kernel regression, why are they generally not good for kernel density estimation?

Question

This question is prompted by discussion elsewhere.

Variable kernels are often used in local regression. For example, loess is widely used and works well as a regression smoother, and is based on a kernel of variable width that adapts to data sparsity.

On the other hand, variable kernels are usually thought to lead to poor estimators in kernel density estimation (see Terrell and Scott, 1992).

Is there an intuitive reason why they would work well for regression but not for density estimation?
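For concreteness, here is a minimal one-dimensional sketch of the two kinds of density estimator being contrasted. It assumes a Gaussian kernel and a k-nearest-neighbour bandwidth in the "balloon" form, which is only one of several variable-width schemes. The function names and the values of `h` and `k` are illustrative choices, not taken from Terrell and Scott.

```python
import math
import random


def gauss(u):
    """Standard normal kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def kde_fixed(x, data, h):
    """Fixed-bandwidth estimate: the same width h at every x."""
    return sum(gauss((x - xi) / h) for xi in data) / (len(data) * h)


def kde_knn(x, data, k):
    """Variable-bandwidth estimate: h(x) is the distance from x to its
    k-th nearest sample, so the window widens where data are sparse."""
    h = sorted(abs(x - xi) for xi in data)[k - 1]
    return sum(gauss((x - xi) / h) for xi in data) / (len(data) * h)


def mass(f, lo, hi, steps):
    """Midpoint-rule integral of f over [lo, hi]."""
    dx = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * dx) for i in range(steps)) * dx


rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(200)]  # synthetic sample

fixed_mass = mass(lambda x: kde_fixed(x, data, h=0.4), -50.0, 50.0, 2000)
knn_mass = mass(lambda x: kde_knn(x, data, k=20), -50.0, 50.0, 2000)
print(f"fixed: {fixed_mass:.3f}   kNN: {knn_mass:.3f}")
```

One known property shows up when this is run: the kNN bandwidth grows roughly linearly with distance from the data, so the tails of the estimate decay only like 1/|x| and its mass over a wide interval exceeds one, whereas the fixed-bandwidth estimate integrates to one. This is only one of the issues discussed for nearest-neighbour density estimators, not a full answer to the question.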

You wrote "On the other hand, variable kernels are usually thought to lead to poor estimators in kernel density estimation". Which part of the paper you mention makes you believe that? I have plenty of references that go in the other direction; see for example the references mentioned in this paper: http://arxiv.org/PS_cache/arxiv/pdf/1009/1009.1016v1.pdf — robin girard, Oct 19 '10 at 12:37
The abstract of Terrell and Scott summarises it nicely: "Nearest neighbor estimators in all versions perform poorly in one and two dimensions". They only seem to find much advantage in multivariate density estimation. — Rob Hyndman, Oct 19 '10 at 12:50
"Nearest neighbor" is not the only variable kernel. The papers I mention use other tools, such as Lepskii's algorithm. I'll read the AOS paper, but as the performance of nearest neighbor should decrease with the dimension, I found it strange that increasing the dimension gives advantages to a "very non-parametric" estimator (if we admit constant bandwidth is less non-parametric than varying bandwidth). In this type of situation, the evaluation case that is used often determines the results ... — robin girard, Oct 19 '10 at 13:13
@Robin Girard: "found it strange that increasing the dimension gives advantages to a 'very non-parametric' estimator (if we admit constant bandwidth is more non-parametric than varying bandwidth)". Is there a typo in this sentence? Otherwise you would seem to agree with the authors, at least on an intuitive level. Please confirm or correct. — user603, Oct 19 '10 at 16:18
@kwak thanks for noticing that! This is a typo: I wanted to say constant bandwidth is less NP ... I can't modify my comment :( sorry about that. — robin girard, Oct 19 '10 at 19:27

score 2 · Answer 1 · answered Oct 20 '10 at 15:51

There seem to be two different questions here, which I'll try to split:

1) How is KS (kernel smoothing) different from KDE (kernel density estimation)? Well, say I have an estimator / smoother / interpolator

est( xi, fi -> gridj, estj )

and also happen to know the "real" densityf() at the xi. Then running est( x, densityf ) must give an estimate of densityf(): a KDE. It may well be that KSs and KDEs are evaluated differently — different smoothness criteria, different norms — but I don't see a fundamental difference. What am I missing ?

2) How does dimension affect estimation or smoothing, intuitively? Here's a toy example, just to help intuition. Consider a box of N=10000 points in a uniform grid, and a window (a line, square, or cube) of W=64 points within it:

                1d          2d          3d          4d
---------------------------------------------------------------
data            10000       100x100     22x22x22    10x10x10x10
side            10000       100         22          10
window          64          8x8         4x4x4       2.8^4
side ratio      .64 %       8 %         19 %        28 %
dist to win     5000        47          13          7

Here "side ratio" is window side / box side, and "dist to win" is a rough estimate of the mean distance of a random point in the box to a randomly-placed window.

Does this make any sense at all ? (A picture or applet would really help: anyone ?)

The idea is that a fixed-size window within a fixed-size box has very different nearness to the rest of the box, in 1d 2d 3d 4d. This is for a uniform grid; maybe the strong dependence on dimension carries over to other distributions, maybe not. Anyway, it looks like a strong general effect, an aspect of the curse of dimensionality.
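The "side" and "side ratio" rows of the table can be recomputed with a few lines. This is a sketch under the same assumptions as the toy example (a uniform grid, with N and W as given above); the variable names are mine, and the rough "dist to win" row is not reproduced.

```python
N = 10_000  # points in the box
W = 64      # points in the window

ratios = {}
for d in (1, 2, 3, 4):
    box_side = N ** (1.0 / d)  # points along one edge of the box
    win_side = W ** (1.0 / d)  # points along one edge of the window
    ratios[d] = win_side / box_side
    print(f"{d}d  box side {box_side:8.1f}  window side {win_side:4.1f}  "
          f"side ratio {100 * ratios[d]:5.2f} %")
```

The side ratio is just (W/N)^(1/d), so a window holding a fixed fraction of the points spans a larger and larger fraction of each axis as d grows.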

score 0 · Answer 2 · answered Apr 17 '19 at 02:20

Kernel density estimation means integration over a local (fuzzy) window, and kernel smoothing means averaging over a local (fuzzy) window.

Kernel smoothing: $\tilde y(x) \propto \frac{1}{\rho(x)} \sum_i K(\lVert x - x_i \rVert)\, y_i$.

Kernel density estimation: $\rho(x) \propto \sum_i K(\lVert x - x_i \rVert)$.
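In code, the two formulas differ only in whether the kernel weights are multiplied by the sample values and then normalised. The sketch below is one-dimensional and assumes an unnormalised Gaussian kernel of width `h`; the function names and the toy data are made up for illustration.

```python
import math


def kernel(r, h):
    """Gaussian weight for a distance r and width h (h is an assumption)."""
    return math.exp(-0.5 * (r / h) ** 2)


def density(x, xs, h):
    """rho(x), up to a constant: the sum of the kernel weights."""
    return sum(kernel(abs(x - xi), h) for xi in xs)


def smooth(x, xs, ys, h):
    """y~(x): kernel-weighted average of the y_i, i.e. weighted sum / rho(x)."""
    weighted = sum(kernel(abs(x - xi), h) * yi for xi, yi in zip(xs, ys))
    return weighted / density(x, xs, h)


xs = [0.0, 0.5, 1.0, 1.5, 2.0, 4.0]  # sample locations (toy data)
ys = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0]  # sample values (toy data)
print(density(1.0, xs, h=0.5), smooth(1.0, xs, ys, h=0.5))
```

Because `smooth` divides by `density`, it returns a constant when all the `ys` are equal, regardless of how the `xs` are spaced; `density` depends only on the spacing.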

How are these the same?

Consider samples of a boolean-valued function, i.e. a set containing both "true samples" (each with unit value) and "false samples" (each with zero value). Assuming the overall sample density is constant (as on a grid), the local average of this function is proportional to the local (partial) density of the true-valued subset. (With the false samples included, the denominator of the smoothing equation is constant and can be disregarded, while the false samples contribute only zero terms to the summation, so the smoothing equation reduces to the density estimation equation applied to the true samples.)

Similarly if your samples were represented as sparse elements on a boolean raster, you could estimate their density by applying a blur filter to the raster.
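The equivalence can be checked numerically. The sketch below assumes a one-dimensional integer grid, a Gaussian kernel of width `h`, and an arbitrary block of "true" samples; all of these are illustrative choices.

```python
import math

h = 3.0
grid = list(range(100))  # constant overall sample density
truth = [1.0 if 40 <= g < 60 else 0.0 for g in grid]  # boolean-valued y_i
true_pts = [g for g, t in zip(grid, truth) if t == 1.0]


def k(r):
    """Unnormalised Gaussian kernel of width h."""
    return math.exp(-0.5 * (r / h) ** 2)


def local_average(x):
    """Kernel smoothing of the boolean values over all samples."""
    w = [k(abs(x - g)) for g in grid]
    return sum(wi * ti for wi, ti in zip(w, truth)) / sum(w)


def true_density(x):
    """Unnormalised kernel density sum over the true samples only."""
    return sum(k(abs(x - g)) for g in true_pts)


# Away from the box edges the two should differ only by a constant factor.
ratios = [true_density(x) / local_average(x) for x in range(30, 71, 5)]
print(ratios)
```

The ratio is the total kernel weight of the grid at `x`, which is constant in the interior because the grid is uniform. Near the edges of the box, or with unevenly spaced samples, the denominator varies and the two quantities are no longer proportional.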

How are these different?

Intuitively, you might expect the choice of smoothing algorithm to depend on whether or not the sample measurements contain significant measurement error.

At one extreme (no noise) you simply need to interpolate between the exactly known values at the sample locations, say by Delaunay triangulation with piecewise-linear interpolation.

Density estimation resembles the opposite extreme: it is entirely noise, because a sample in isolation is not accompanied by a measurement of the density value at that point. (So there is nothing to simply interpolate. You might consider measuring the cell areas of a Voronoi diagram, but smoothing/denoising will still be important.)

The point is that despite the similarity these are fundamentally different problems, so different approaches may be optimal.
