
As far as I understand, kernel density estimation makes no assumptions about the moments of the underlying density and only requires smoothness. The Cauchy density function is quite smooth. Even so, when I do KDE using density() in R on random draws from the Cauchy distribution, I get incredibly inaccurate answers:

set.seed(1)
foo <- seq(-50, 50, length = 1e3)
plot(foo, dt(foo, df = 1), type = 'l')
lines(density(rt(1e3, df = 1)), col = "red")

produces this plot: [plot: the true Cauchy density (black) with the KDE overlaid in red, which is wildly inaccurate]

Repeating the above with different seeds or an increased sample size gives similarly erratic estimates. The default kernel in R is Gaussian; changing the kernel to any of the other options doesn't improve the output.

Question: What assumptions do KDEs make that the Cauchy distribution violates? If it violates none, why do we see KDEs failing so miserably here?


Edit: @cdalitz has identified that the problem is where the KDE is evaluated. By default, density() evaluates the estimate at n = 512 points on a grid running from min(x) - 3*bw to max(x) + 3*bw (the default cut = 3), and for Cauchy samples that range can be enormous. This means that, by default, density tries to estimate the KDE at 512 points distributed very sparsely along the x-axis.
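
A quick check of this default (a minimal sketch; cut = 3 is the documented default of density.default, so the evaluation grid should span range(x) extended by 3*bw on each side):

set.seed(1)
samp <- rt(1e3, df = 1)
d <- density(samp)                # all defaults
range(d$x)                        # endpoints of the evaluation grid
range(samp) + c(-3, 3) * d$bw     # min(x) - 3*bw and max(x) + 3*bw: identical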

To test this, I change from and to in density() and run it twice with two different sets of evaluation points; the two density estimates then match:

set.seed(1)
samp <- rt(1e4, df = 1)

bd <- 10
den1 <- density(samp, from = -bd, to = bd, n = 512)
den2 <- density(samp, from = -2*bd, to = 2*bd, n = 512)

foo <- seq(-50, 50, length = 1e3)
plot(foo, dt(foo, df = 1), type = 'l')
lines(den2, col = "blue", type = "b")
lines(den1, col = "red", type = "b")

This produces the estimates below: [plot: den1 (red) and den2 (blue) overlaid on the true density; the two estimates match]

The quality here is much better than before. However, if I now change 2*bd to 50*bd, the density estimate is very different, even around 0!

set.seed(1)
samp <- rt(1e4, df = 1)

bd <- 10
den1 <- density(samp, from = -bd, to = bd, n = 512)
den2 <- density(samp, from = -50*bd, to = 50*bd, n = 512)

foo <- seq(-50, 50, length = 1e3)
plot(foo, dt(foo, df = 1), type = 'l', ylim = c(0,.7))
lines(den2, col = "blue", type = "b")
lines(den1, col = "red", type = "b")

[plot: den1 (red) and den2 (blue) overlaid on the true density; they now differ markedly, even around 0]

How does evaluating the density at sparse points change the density evaluation around $x = 0$ (the bandwidth chosen is the same for both den1 and den2)? The KD estimate at any point $x$ is $$ \hat{f}(x) = \dfrac{1}{nh} \sum_{i=1}^{n} K\left( \dfrac{x - x_i}{h}\right)\,. $$

The density estimate at a given point $x = a_1$ shouldn't change just because the density is also being evaluated at other points. What am I missing here?
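
For reference, here is a direct evaluation of the formula above at $x = 0$, using the Gaussian kernel and the bandwidth that density() selects (a by-hand check; by construction it cannot depend on any evaluation grid):

set.seed(1)
samp <- rt(1e4, df = 1)
h <- density(samp)$bw
# f-hat(0) = (1/(n*h)) * sum(K((0 - x_i)/h)) with Gaussian K
mean(dnorm((0 - samp) / h)) / h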

Greenparker
  • What Kernel are you using for the estimation? An RBF? – Sergio Feb 02 '22 at 04:59
  • The default in `R` is Gaussian kernel – Greenparker Feb 02 '22 at 05:02
  • I would expect that to be part of the problem: a Gaussian kernel is not robust against outliers. I could also imagine that not having a defined variance affects the estimation. So it might not even make sense to try it with increasing sample sizes + RBF (Gaussian). – Sergio Feb 02 '22 at 05:09
  • @Sergio thanks Sergio. Changing the kernels to any of the other options in R doesn't help much. Additionally, when I change the degrees of freedom to `df = 2` (mean exists but variance doesn't), things are ok. So the variance need not be the problem. – Greenparker Feb 02 '22 at 06:43
  • The default bandwidth `bw.nrd` uses `1.06 times the minimum of the standard deviation and the interquartile range divided by 1.34 times the sample size to the negative one-fifth power.` This quantity thus converges even in the case of a distribution with no moments. – Xi'an Feb 02 '22 at 07:19
  • Thanks @Xi'an for hunting this down. The use of IQR brings in robustness, so that shouldn't cause a problem for Cauchy. My only guess is that in arriving at this "Silverman's rule of thumb", there is an underlying assumption that the density is "close-to" Gaussian, and thus there is an underlying assumption of existence of moments. – Greenparker Feb 02 '22 at 07:55
  • This [1992 Biometrika paper by Chiu](https://www.jstor.org/stable/2337233?seq=1#metadata_info_tab_contents) examines automatic selections of the bandwidth for arbitrary densities, including the Cauchy as a test case, and with bandwidths that end up being close to those for the Normal test case. – Xi'an Feb 02 '22 at 08:21
  • There is also a general convergence result in [Devroye and Györfi](https://www.szit.bme.hu/~gyorfi/L1bookBW.pdf) (1986, Theorem 1, p.22) that L¹ convergence of a KDE occurs for all densities provided $h_n\to 0$ and $nh_n\to\infty$. – Xi'an Feb 02 '22 at 08:30
  • @Xi'an, regarding the Devroye and Györfi result, if [Silverman's rule of thumb](http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode15.html) is used, the bandwidth $h_n$ will diverge since $\hat{\sigma}$ will diverge. So possibly that's the issue here. – Greenparker Feb 02 '22 at 08:33
  • @Greenparker: since $\hat\sigma_n$ includes the interquartile range, it does not diverge. – Xi'an Feb 02 '22 at 08:39
  • To avoid further search in the wrong direction: The default bandwidth selection works well in this case! Beware however that `density` chooses sample points in the range of the data at which the estimator is evaluated. Your observed problem is just an artifact created by this sample point selection. – cdalitz Feb 02 '22 at 09:05
  • @cdalitz I understand your point, but how does choosing a different range change the density evaluation at a single point? I've tried to expand on this point in the updated question. – Greenparker Feb 03 '22 at 06:52
  • Strangely, when increasing the number of sample points for *den2* to `100*512`, the computed density is correct. It seems that you have found a bug in the density implementation in R. After some time trying to debug the code, I could not locate the error, but I must admit that I have not understood some parts of the code. Please consider writing a bug report: https://www.r-project.org/bugs.html – cdalitz Feb 03 '22 at 09:03
  • @cdalitz thanks for validating my suspicion! Like I said in the comment below, I am pretty sure this has something to do with the `approx` function, and possibly with numerical precision. I'll report it once I'm convinced it is a bug. – Greenparker Feb 03 '22 at 09:06
  • No, the error is already there before the call to `approx`. You can verify this by copying the code of `density.default` and setting a breakpoint (with `browser()`) before the call to `approx`. Note that you must replace `C_BinDist` with `stats:::C_BinDist`, because otherwise it is not found. – cdalitz Feb 03 '22 at 09:09

2 Answers


The problem lies in how density samples the x-range. By default, density evaluates the estimator at 512 points spanning the range of the data (extended by three bandwidths on each side). As the Cauchy distribution has heavy tails, the default settings therefore give only a very sparse sampling of the density function.

If we sample 512 values between -20 and +20, the density approximation is decent, even with the default bandwidth selection rule:

set.seed(1)
foo <- seq(-50, 50, length = 1e3)
plot(foo, dt(foo, df = 1), type = 'l')
lines(density(rt(1e3, df = 1), from=-20, to=20, n=512), col = "red")

[plot: the true Cauchy density (black) with the KDE on (-20, 20) in red, now a decent match]

Note also that from, to and n only affect the sample points at which the kernel density estimator is evaluated and have no effect on the bandwidth:

> set.seed(1)
> x <- rt(1e3, df = 1)
> density(x, from=-20, to=20, n=512)$bw
[1] 0.351773
> density(x)$bw
[1] 0.351773
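
To see the mechanism behind the artifact in the question (a sketch; the from/to values here are arbitrary and deliberately extreme): with a very wide window, the spacing between the 512 grid points (roughly 2 below) is far larger than the bandwidth, so the binned FFT approximation that density.default uses internally loses accuracy. Increasing n, as noted in the comments, restores agreement with a direct evaluation of the estimator:

set.seed(1)
samp <- rt(1e4, df = 1)
h <- density(samp)$bw
exact <- mean(dnorm((0 - samp) / h)) / h                # estimator evaluated directly at 0
wide <- density(samp, from = -500, to = 500, n = 512)   # grid spacing ~ 2
fine <- density(samp, from = -500, to = 500, n = 2^16)  # grid spacing ~ 0.015
c(exact = exact,
  wide  = wide$y[which.min(abs(wide$x))],   # grid value nearest 0
  fine  = fine$y[which.min(abs(fine$x))])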
cdalitz
  • Thanks! But there seems to be something going on in the `density` function. Consider running the below code where density is evaluated at three points: at 0, -a, +a, where a may change. The evaluated density at 0 should be the same, but it isn't: `set.seed(1) samp – Greenparker Feb 02 '22 at 11:00
  • The code typesetting is horrible, so I'll edit the question. – Greenparker Feb 02 '22 at 11:03
  • @Greenparker Please read the documentation of the parameter `n`: It is rounded to a power of 2 (for using FFT), and later it is interpolated. This explains slight differences in the values due to interpolation (3 is not a power of 2). I wonder, however, why this is necessary since there is an alternative algorithm (FFTW) instead of FFT that does not have this restriction. – cdalitz Feb 02 '22 at 11:08
  • The rounding to a power of 2 occurs only when `n > 512` (or at least that's what the documentation says). I've also edited the question to highlight my larger question. Thanks a ton for bringing some insight, but I still don't quite see how sparse evaluations can change the density estimate at a particular point (without interpolation -- that's why I now plot the points). – Greenparker Feb 02 '22 at 11:29
  • @Greenparker You can view the source code of *density* by entering `density.default` (without parentheses). Then you will see that `n` is automatically set to a minimum value of 512, even when a smaller value is provided by the user. This is later approximately reversed by interpolating at the actual values. – cdalitz Feb 03 '22 at 08:57
  • Yes, I did see that. My guess from looking at the source code is that `approx` being used along with the `fft` somehow creates a problem when the values are very spread out, since repeating my example even with Gaussian samples returns absurd estimates. – Greenparker Feb 03 '22 at 09:04

In R, the function density gives a kernel density estimate (KDE) of the data.

In order to get a suitable histogram of the very heavy-tailed Cauchy distribution, it is usually necessary to disregard more than a few values in the tails of the sample.

Then, in order to show the actual Cauchy density curve (black below), it is necessary to adjust the density by dividing by the proportion of the sample plotted in the histogram.

Also, to get the best KDE (dotted red) you may need to use the parameter adjust to tune the default bandwidth. (I used the default.) Of course, the histogram would be 'smoother' if it had fewer bins; the KDE is made entirely without reference to the histogram.

By adjusting the proportion of the sample plotted, the bandwidth of the KDE, and the sample size, you may be able to improve on my plot below. But the agreement of the density function and the KDE in the plot below is roughly typical for samples of size 500 to 1000.

[histogram of the truncated sample (skyblue) with the rescaled Cauchy density (black) and the KDE (dotted red)]

R code for figure:

k = diff(pt(c(-4,4), 1))  # proportion of Cauchy probability in (-4, 4)
set.seed(2022)
w = rt(1000, 1)           # whole sample, size 1000
y = w[w > -4 & w < 4]     # 861 plotted points
length(y)
[1] 861

hist(y, prob=T, ylim=c(0,.4), br=50, col="skyblue2")
 curve(dt(x,1)/k, add=T, lwd=2)
 lines(density(y), col="red", lwd=2, lty="dotted")

Note: For moderately large samples, the empirical CDF (ECDF) of the sample matches the CDF of the population very well. The test statistic $D$ of the Kolmogorov-Smirnov goodness-of-fit test is the maximum vertical distance between the two.

[plot: ECDF of the sample (dotted red) overlaid on the Cauchy CDF (black), with green reference lines at x = 0 and y = 0, 1]

plot(ecdf(w), lwd=3, col="red", lty="dotted")
 curve(pt(x,1), add=T)
  abline(v=0, col="green2")
  abline(h=0:1, col="green2")

ks.test(w, pt, 1)

        One-sample Kolmogorov-Smirnov test

data:  w
D = 0.018508, p-value = 0.8832
alternative hypothesis: two-sided
BruceET
  • Strictly speaking, there are no "outliers" in a Cauchy sample, all points within the sample are from the same Cauchy distribution. – Xi'an Feb 02 '22 at 07:07
  • Yes, I changed the wording a bit before I saw your comment. – BruceET Feb 02 '22 at 07:12
  • Thanks @BruceET. Certainly, I can understand how the code may be modified to make visually acceptable KDEs. I am more curious as to where the theory of KDEs breaks down in the practical implementation for the Cauchy. That is, why is it "usually necessary to disregard more than a few values in the tails of the sample"? – Greenparker Feb 02 '22 at 07:58
  • The optimal choice of bandwidth depends on the standard deviation. I may be crawling out on a limb here, but I think other parts of the theory of kernel density estimation may also depend on the existence of the population standard deviation. For now, I'll leave that for users who know more about KDE than I to clarify. – BruceET Feb 02 '22 at 08:18
  • Thanks @BruceET. From [here](http://sfb649.wiwi.hu-berlin.de/fedc_homepage/xplore/ebooks/html/spm/spmhtmlnode15.html) it depends on the norm of the second derivative of the target density, which for Gaussian densities depends on the variance. This naturally assumes that the target density has a finite variance, and I think that is the issue. – Greenparker Feb 02 '22 at 08:30
  • @Greenparker: The only characterisation of the target density $f$ that matters in the [mean integrated square error (MISE) and the choice of an optimal bandwidth](http://faculty.washington.edu/yenchic/17Sp_403/Lec7-density.pdf) is $\int (f'')^2(x)\,\mathrm{d}x$, which is well-defined for the Cauchy density. – Xi'an Feb 02 '22 at 16:11