I have read (for example, here) that the Epanechnikov kernel is optimal, at least in a theoretical sense, when doing kernel density estimation. If this is true, then why does the Gaussian show up so frequently as the default kernel, or in many cases the only kernel, in density estimation libraries?
-
Two questions conflated here: why not more commonly used? why is Gaussian often the default/only kernel? It may sound trivial, but the name Epanechnikov may seem hard to spell and pronounce correctly for people not fluent in that language. (I'm not even sure E. was Russian; I've failed to find any biographical details.) Also, if I show (e.g.) a biweight, comment on its bell shape, finite width and behaviour at the edges, that seems easier to sell. Epanechnikov is the default in Stata's `kdensity`. – Nick Cox Jun 02 '16 at 06:40
-
I would add that this theoretical optimality has little bearing in practice, if any. – Xi'an Jun 02 '16 at 07:36
-
@NickCox Fair enough. I'm more interested in the second of your two questions. If I may rephrase my question: What advantages does the Gaussian enjoy that make it so commonly used? You've suggested one, which is that people are familiar with its spelling and pronunciation (and, of course, its basic properties). Are there other, more technical, reasons to prefer the Gaussian? – John Rauser Jun 02 '16 at 17:56
-
It's a familiar name. If it makes sense to use a kernel that doesn't have a finite support, you should prefer it. So far as my experience goes, it doesn't make sense, so the choice appears social, not technical. – Nick Cox Jun 02 '16 at 18:18
-
@NickCox, yes, E was a Russian dude, it's not an abbreviation :) He was an enigmatic person; [this](http://www.mathnet.ru/php/person.phtml?option_lang=rus&personid=50050) is all you could ever find about him. I also remember a very useful [book](http://urss.ru/cgi-bin/db.pl?lang=Ru&blang=ru&page=Book&id=30525) someone with his name wrote on programmable calculators; yes, that was a big thing at the time – Aksakal Nov 16 '18 at 21:05
-
@Aksakal I spent some time now searching for E. in Russian and found nothing at all. Not a single biographical detail. Pretty amazing. – amoeba Nov 18 '18 at 20:07
-
@amoeba He worked at the [Kotelnikov Institute of Radio Engineering and Electronics of the Russian Academy of Sciences](http://www.cplire.ru/rus/); I bet he did classified research. His full name is Viktor Aleksandrovich Epanechnikov (Епанечников Виктор Александрович) – Aksakal Nov 19 '18 at 14:48
-
@NickCox here's one of his patents using the filter [RU 2319164](https://patents.google.com/patent/RU2319164C1/en?oq=RU+2319164) – Aksakal Nov 19 '18 at 14:56
-
@Aksakal Are you sure it's the same person? This patent seems to be from 2006... – amoeba Nov 19 '18 at 15:18
-
@amoeba, it's got to be him: the same name, citing the original paper from 1969 – Aksakal Nov 19 '18 at 15:22
-
I had the same experience as @Amoeba. I don't read Russian but his first name could be Viktor? Correct? – Nick Cox Nov 19 '18 at 16:43
-
@NickCox, Viktor Aleksandrovich, meaning that his dad's name was Aleksandr. The last name has the root ["епанечник"](http://www.gardenia.ru/pages/kopyt_001.htm), which among other things is also the name of a plant, [Asarum europaeum](https://en.wikipedia.org/wiki/Asarum_europaeum) – Aksakal Nov 19 '18 at 19:06
2 Answers
The reason why the Epanechnikov kernel isn't universally used for its theoretical optimality may very well be that the Epanechnikov kernel isn't actually theoretically optimal. Tsybakov explicitly criticizes the argument that the Epanechnikov kernel is "theoretically optimal" in pp. 16-19 of *Introduction to Nonparametric Estimation* (section 1.2.4).
To summarize: under some assumptions on the kernel $K$ and a fixed density $p$, the mean integrated squared error is of the form
$$\frac{1}{nh} \int K^2 (u) du + \frac{h^4}{4}S_K^2 \int (p''(x))^2 dx \,, \tag{1} $$
where $S_K = \int u^2 K(u)\, du$.
Tsybakov's main criticism is of minimizing over non-negative kernels: it is often possible to get better-performing estimators, which still yield non-negative density estimates, without restricting attention to non-negative kernels.
The argument for the Epanechnikov kernel proceeds by first minimizing $(1)$ over $h$ for a fixed kernel $K$, which gives the "optimal" bandwidth
$$ h^{MISE}(K) = \left( \frac{\int K^2}{nS_K^2 \int (p'')^2} \right)^{1/5}, $$
and then minimizing over all non-negative kernels (rather than over a wider class), which gives the "optimal" kernel (Epanechnikov)
$$K^*(u) = \frac{3}{4}(1-u^2)_+ $$
with corresponding "optimal" bandwidth
$$h^{MISE}(K^*) = \left( \frac{15}{n \int (p'')^2} \right)^{1/5} \,. $$
These, however, aren't feasible choices, since they depend on knowledge (via $p''$) of the unknown density $p$; they are therefore "oracle" quantities.
A proposition given by Tsybakov implies that the asymptotic MISE for the Epanechnikov oracle is:
$$\lim_{n \to \infty} n^{4/5} \mathbb{E}_p \int (p_n^E (x) - p(x))^2 dx = \frac{3^{4/5}}{5^{1/5}4} \left( \int (p''(x))^2 dx \right)^{1/5} \,. \tag{2} $$
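To see what these oracle quantities look like numerically, here is a minimal sketch (my illustration, not Tsybakov's; it assumes a standard normal $p$, for which $\int (p''(x))^2\, dx = 3/(8\sqrt{\pi})$, and $n = 1000$):

```python
import numpy as np
from scipy.integrate import quad

# Epanechnikov kernel K*(u) = (3/4)(1 - u^2)_+
K = lambda u: 0.75 * max(1.0 - u**2, 0.0)

# Kernel constants appearing in (1)
int_K2, _ = quad(lambda u: K(u)**2, -1, 1)    # int K^2 du          = 3/5
S_K, _ = quad(lambda u: u**2 * K(u), -1, 1)   # S_K = int u^2 K du  = 1/5

# For a standard normal density p: int (p'')^2 dx = 3 / (8 sqrt(pi))
roughness = 3 / (8 * np.sqrt(np.pi))

n = 1000
h_general = (int_K2 / (n * S_K**2 * roughness)) ** 0.2  # h^MISE(K) from the general formula
h_epan = (15 / (n * roughness)) ** 0.2                  # the closed form for K*
print(h_general, h_epan)  # both ~0.59: the two expressions agree

# Asymptotic constant in (2): the limit of n^{4/5} * MISE for the Epanechnikov oracle
c = 3**0.8 / (4 * 5**0.2) * roughness**0.2
print(c)  # ~0.32
```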
Tsybakov says (2) is often claimed to be the best achievable MISE, but then shows that, for every $\varepsilon > 0$, one can use kernels of order 2 (for which $S_K = 0$, which forces the kernel to take negative values) to construct kernel estimators $\hat{p}_n$ such that
$$ \limsup_{n \to \infty} n^{4/5} \mathbb{E}_p \int (\hat{p}_n (x) - p(x))^2 dx \le \varepsilon \,. $$
Even though $\hat{p}_n$ isn't necessarily non-negative, one still has the same result for the positive part estimator, $p_n^+ := \max(0, \hat{p}_n)$ (which is guaranteed to be non-negative even if $K$ isn't):
$$ \limsup_{n \to \infty} n^{4/5} \mathbb{E}_p \int (p_n^+ (x) - p(x))^2 dx \le \varepsilon \,. $$
Therefore, for $\varepsilon$ small enough, there exist genuine (non-oracle) estimators with smaller asymptotic MISE than the Epanechnikov oracle, under the same assumptions on the unknown density $p$.
In particular, the infimum of the asymptotic MISE for a fixed $p$, taken over all kernel estimators (or positive parts of kernel estimators), is $0$. So the Epanechnikov oracle is not even close to being optimal, even when compared to genuine estimators.
The argument for the Epanechnikov oracle was advanced in the first place because one often insists that the kernel itself be non-negative, on the grounds that the density being estimated is non-negative. But as Tsybakov points out, one doesn't have to assume that the kernel is non-negative in order to get non-negative density estimators, and by allowing other kernels one can get non-negative density estimators which (i) aren't oracles and (ii) perform arbitrarily better than the Epanechnikov oracle for a fixed $p$. Tsybakov uses this discrepancy to argue that it doesn't make sense to speak of optimality with respect to a fixed $p$, but only of optimality properties that hold uniformly over a class of densities. He also points out that the argument still works with the MSE in place of the MISE.
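For a concrete feel of the positive-part construction, here is a minimal sketch (my illustration; Tsybakov's argument is asymptotic and doesn't prescribe a particular kernel) using one standard fourth-order kernel, $K(u) = \tfrac{1}{2}(3-u^2)\varphi(u)$ with $\varphi$ the standard normal pdf, which integrates to 1 but has $S_K = 0$ and hence takes negative values:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=500)  # a sample from the (here, standard normal) density p

def phi(u):
    """Standard normal pdf."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def K4(u):
    # A standard fourth-order kernel: (1/2)(3 - u^2) * phi(u).
    # It integrates to 1, but S_K = int u^2 K(u) du = 0, which forces it
    # to be negative for |u| > sqrt(3).
    return 0.5 * (3.0 - u**2) * phi(u)

def kde(x, X, h, kernel):
    """Kernel density estimate at the points x."""
    u = (x[:, None] - X[None, :]) / h
    return kernel(u).mean(axis=1) / h

x = np.linspace(-4, 4, 401)
p_hat = kde(x, X, h=0.45, kernel=K4)  # \hat p_n: may dip below zero in the tails
p_plus = np.clip(p_hat, 0.0, None)    # positive-part estimator p_n^+
```

(As a comment below notes, $p_n^+$ in general no longer integrates to exactly 1.)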
EDIT: See also Corollary 1.1 on p. 25, where the Epanechnikov kernel is shown to be inadmissible by yet another criterion. Tsybakov really seems not to like the Epanechnikov kernel.

-
+1 for an interesting read, but this does not answer why the Gaussian kernel is used more often than the Epanechnikov kernel: they are both non-negative. – amoeba Nov 16 '18 at 21:04
-
@amoeba That is true. At the very least this answers the question in the title, which is only about the Epanechnikov kernel. (I.e. it addresses the premise for the question and shows that it is false.) – Chill2Macht Nov 16 '18 at 21:25
-
(+1) One thing to beware of with Tsybakov's scheme of taking the positive part of a possibly-negative kernel estimate – which is at least my memory of his suggestion – is that although the resulting density estimator might give better MSE convergence to the true density, the density estimate will in general _not be a valid density_ (since you're cutting off mass, and it no longer integrates to 1). If you _actually_ only care about MSE, it doesn't matter, but sometimes this will be a significant problem. – Danica Nov 16 '18 at 21:33
The Gaussian kernel is used, for example, in density estimation through derivatives, where the $i$-th derivative of the density is estimated by differentiating the kernel:
$$\frac{d^i\hat{f}}{dx^i}(x) = \frac{1}{Nh^{i+1}}\sum_{j=1}^N \frac{d^iK}{du^i}\!\left(\frac{x-X_j}{h}\right),$$
where $h$ is the bandwidth.
This is because the Epanechnikov kernel is a quadratic polynomial on its support, so its third and higher derivatives are identically zero there, unlike the Gaussian, which has infinitely many nonzero derivatives. See section 2.10 in your link for more examples.
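A minimal sketch of such a derivative estimator, assuming a Gaussian kernel (the helper names here are mine):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=1000)  # sample from the unknown density f

def phi(u):
    """Gaussian kernel (standard normal pdf)."""
    return np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)

def phi_deriv(u):
    # First derivative of the Gaussian kernel: phi'(u) = -u * phi(u)
    return -u * phi(u)

def kde_first_derivative(x, X, h):
    # \hat f'(x) = (1 / (N h^2)) * sum_j K'((x - X_j) / h)
    u = (x[:, None] - X[None, :]) / h
    return phi_deriv(u).mean(axis=1) / h**2

x = np.linspace(-3, 3, 301)
fprime_hat = kde_first_derivative(x, X, h=0.4)
# For a standard normal f, f'(x) = -x * phi(x); fprime_hat should track it.
```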

-
The first derivative of the Epanechnikov (note the second *n*, by the way) kernel is not continuous at the boundary of its support; that might be more of an issue. – Glen_b Jun 01 '16 at 22:13
-
@Glen_b: You're probably right, although having all derivatives vanish beyond some order $i$ would be silly too. – Alex R. Jun 01 '16 at 22:14
-
@AlexR. While what you say is true, I don't understand how it explains why the Gaussian is so common in ordinary density estimation (as opposed to estimating the derivative of the density). And even when estimating derivatives, section 2.10 suggests that the Gaussian is never the preferred kernel. – John Rauser Jun 01 '16 at 23:43
-
@JohnRauser: Keep in mind that you need to use higher order Epanechnikov kernels for optimality. Usually people use a Gaussian because it's just easier to work with and has nicer properties. – Alex R. Jun 02 '16 at 01:14
-
@AlexR I'd quibble with "[u]sually people use a Gaussian"; do you have any systematic data on frequency of use, or is this just an impression based on work you see? I see biweights often, but I wouldn't claim more than that. – Nick Cox Jun 02 '16 at 06:33