10

What sort of kernel density estimator does one use to avoid boundary bias?

Consider the task of estimating a density $f_0(x)$ with bounded support, where the probability mass does not decrease to zero as the boundary is approached. To simplify matters, assume that the bound(s) of the support are known.

To fix ideas, consider the uniform distribution as an example:

Given a sample of $N$ iid draws from $\mathcal U(0,1)$, one could think of applying the kernel density estimator

$$\hat f(y) = \frac{1}{Ns}\sum_{i=1}^{N} K\left( \frac{x_i-y}{s} \right)$$

with a Gaussian kernel $K$ and some bandwidth (smoothing parameter) $s$. To illustrate the boundary bias, consider the following implementation in R:

N <- 10000
x <- runif(N)   # iid draws from U(0,1)
s <- 0.045      # bandwidth

# Evaluate the kernel density estimate on a grid over [0,1]
M <- 100
y <- seq(0, 1, length.out = M)
out <- rep(0, M)
for (i in 1:M) {
    weights <- dnorm((x - y[i]) / s)  # Gaussian kernel weights
    out[i] <- mean(weights) / s       # KDE at grid point y[i]
}
plot(y, out, type = "l", ylim = c(0, 1.5))

which generates the following plot:

[Plot: the kernel density estimate over $[0,1]$; it tracks the true density in the interior but falls off near the boundaries 0 and 1.]

Clearly the approach has a problem capturing the true value of the density $f_0(x)$ for $x$ close to the boundary.
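
To separate the systematic boundary bias from sampling variation, one can average the estimator over many replicates; a quick sketch, reusing `N`, `M`, `y`, and `s` from the code above:

# Average the kernel estimate over many fresh samples: the interior
# stays near the true value 1, while the estimate at 0 and 1 drops
# toward 0.5 (half the kernel mass falls outside the support).
R <- 100
avg <- rep(0, M)
for (r in 1:R) {
    xr <- runif(N)
    for (i in 1:M) avg[i] <- avg[i] + mean(dnorm((xr - y[i]) / s)) / s
}
lines(y, avg / R, col = 2)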

The logspline method works better, but it is certainly not free of boundary bias either:

library(logspline)
set.seed(1)
N <- 10000
x <- runif(N)
# Fit a logspline density with known bounds and 21 equally spaced knots
m <- logspline(x, lbound = 0, ubound = 1, knots = seq(0, 1, length.out = 21))
plot(m)

[Plot: the logspline density estimate; much closer to the true density, but with some remaining bias at the boundaries.]

Jesper for President
  • In principle the density could be estimated on the logit scale and back-transformed as (density on logit scale) $/\,[x (1 - x)]$, but in practice, even if no values are exactly 0 or 1, this doesn't work well for me (a sketch of this idea appears after these comments). Another answer is to reflect at the boundaries any probability mass smoothed outside $[0, 1]$. My favourite answer is to use a quantile plot instead! – Nick Cox Feb 20 '20 at 18:29
  • As R isn't yet universal, no harm is done by calling this R code (assuming that is so). I speak a different language. – Nick Cox Feb 20 '20 at 18:30
  • My prejudice is that an unbounded kernel is wrong in principle for such a distribution, even though the impact can be made minute by a suitable choice of bandwidth. – Nick Cox Feb 20 '20 at 18:32
  • I believe the answer by G. Simpson to a closely related question at https://stats.stackexchange.com/a/71291/919 generalizes directly to this case. It splines the log density. – whuber Feb 20 '20 at 18:37
  • @whuber - wow, that example is very impressive. – jbowman Feb 20 '20 at 19:01
  • See here: https://stats.stackexchange.com/questions/121744/kernel-density-estimation-on-bounded-support – MotiNK Feb 20 '20 at 19:34
  • Your example of bias is unconvincing, because it doesn't reproduce with other random datasets. I found it illuminating to plot a detailed histogram: the apparent bias is present there, too, strongly suggesting you are looking at sampling variation. (In repeating the code with new random numbers I do from time to time see clear bias in the form of a short spike at one or the other end of the range of data; that's something worth worrying about.) – whuber Feb 20 '20 at 21:34
  • Well, I actually tried it with several seeds and used one of the prettier-looking ones. But of course any non-superficial treatment would not depend on a single underlying true distribution, nor on a single sample; that is also lacking in the other answer. I deliberately avoided the seeds with the short spike. But I will look at the methods in the comment from MotiNK as well and compare. – Jesper for President Feb 20 '20 at 21:51
  • A key issue is whether you know the bounds of the domain or need to estimate them from the data. In the example you seem to know them beforehand: should we assume that? – whuber Feb 20 '20 at 22:46
  • It is fine to assume boundaries are known. I will add that to the question. – Jesper for President Feb 21 '20 at 06:54
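
As a minimal sketch of the logit-scale idea from the first comment above (assuming all observations lie strictly inside $(0,1)$, and reusing the sample `x` from the question):

# Sketch: estimate the density on the logit scale, then back-transform
# by the Jacobian x(1 - x) of the logit map.
z <- qlogis(x)                  # logit transform of the data
dz <- density(z)                # ordinary KDE on the unbounded scale
xx <- plogis(dz$x)              # map the evaluation grid back to (0,1)
fx <- dz$y / (xx * (1 - xx))    # back-transformed density estimate
plot(xx, fx, type = "l", ylim = c(0, 1.5))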

2 Answers

4

If you know the boundaries, then one approach mentioned in Silverman's great little book (Density Estimation for Statistics and Data Analysis) is the "reflection technique". One simply reflects the data about the boundary (or boundaries). (This is what @NickCox mentioned in his comment.)

# Generate numbers from a uniform distribution
set.seed(12345)
N <- 10000
x <- runif(N)

# Reflect the data at the two boundaries
xReflected <- c(-x, x, 2 - x)

# Construct density estimate on [0,1]
d <- density(xReflected, from = 0, to = 1)
plot(d$x, 3 * d$y, ylab = "Probability density", xlab = "x", ylim = c(0, 1.1), las = 1)

[Plot: density estimate from the reflected data over $[0,1]$.]

Note that in this case we end up with 3 times the number of data points, so we need to multiply the density that comes out of the density function by 3.

Below is an animated display of 100 simulations (as above) but with the true density and the two estimated densities (one from the original data and one from the reflected data). That there is bias near the boundaries is pretty clear when using density with just the original data.

[Animation: 100 simulations showing the true density together with the plain and reflection-based estimates; the plain estimate is visibly biased near the boundaries.]
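
A minimal sketch of a single replicate of that comparison (with an assumed sample size of 1000; the animation itself is not reproduced here):

# Sketch: one replicate comparing the plain and reflected estimates
set.seed(1)
x1 <- runif(1000)
dPlain <- density(x1, from = 0, to = 1)                  # no correction
dRefl <- density(c(-x1, x1, 2 - x1), from = 0, to = 1)   # reflected data
plot(dPlain$x, dPlain$y, type = "l", ylim = c(0, 1.3),
     xlab = "x", ylab = "Probability density")
lines(dRefl$x, 3 * dRefl$y, col = 2)   # times 3 for the tripled sample
abline(h = 1, lty = 2)                 # true Uniform(0,1) density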

JimB
  • +1, seems to be working fine. I read that the reflection method is particularly efficient when the derivative of the density, $f'(x)$, equals zero at the boundary, which is of course the case for the uniform distribution. It could be interesting to see how this method fares with, for example, the exponential distribution as the data-generating process (I expect the logspline may do better in that case). – Jesper for President Feb 21 '20 at 16:35
  • The logspline approach does indeed work better for estimating the density at the lower boundary of zero for the exponential distribution (a sketch of that case follows this comment). But whether there is any "practical" difference will depend on the application of the results. One computational issue (which should not be the most important issue) is that the reflection approach is more likely to be readily implemented in a variety of statistics packages, as not all have logspline functions available. – JimB Feb 21 '20 at 17:24
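
As a minimal sketch of that exponential case (reflecting at the single boundary 0, so the data are doubled and the estimate is multiplied by 2):

# Sketch: reflection at 0 for exponential data. Because f'(0) != 0,
# the reflected estimate flattens near 0 instead of matching f(0) = 1.
set.seed(1)
xe <- rexp(10000)
de <- density(c(-xe, xe), from = 0, to = 5)
plot(de$x, 2 * de$y, type = "l", xlab = "x", ylab = "Probability density")
curve(dexp(x), add = TRUE, col = 2)    # true exponential density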
3

I do not know whether this is interesting (given the original question and the answers it has already received), but I would like to suggest an alternative method. It may be useful to somebody in the future as well (I hope, at least) :-).

If you worry about boundary effects of your density smoothing method, I would suggest using P-splines (see Eilers and Marx, 1996; the authors specifically discuss boundary bias in density smoothing in Section 8). Quoting Eilers and Marx,

the P-spline density smoother is not troubled by boundary effects, as for instance kernel smoothers are.

In general, P-splines combine B-splines with finite-difference penalties on the coefficients. The density smoothing problem is a special case of a penalized GLM (Poisson regression on the bin counts), so we just need to parameterize our smoothing problem accordingly.

To answer the original question I will consider data grouped in histogram fashion. I will denote by $y_{i}$ the count of observations falling in the bin with midpoint $u_{i}$ (the reasoning can be adapted to the density case as well). To smooth these data I will use the following ingredients:

  • the smoother: a Whittaker smoother (a special case of P-splines in which the basis $B$ is the identity matrix)
  • a first-order difference penalty
  • the IWLS algorithm to maximize the penalized likelihood (eq. 36 in the reference; the explicit update step is written out after this list)
    $$ L = \sum_{i} y_{i} \log \mu_{i} - \sum_{i} \mu_{i} - \lambda \sum_{i} (\Delta^{(1)} \eta_{i})^{2} $$ with $\mu_{i} = \exp(\eta_{i})$.
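
For reference, the standard penalized-IWLS step (stated here to connect the likelihood above to the code below) solves, at each iteration,

$$ \left( B^{\top} \tilde W B + \lambda D_{1}^{\top} D_{1} \right) a = B^{\top} \tilde W \tilde z, \qquad \tilde z = \tilde \eta + (y - \tilde \mu)/\tilde \mu, $$

where $\tilde W = \operatorname{diag}(\tilde \mu)$ holds the current weights. Note that $\tilde W \tilde z = y - \tilde \mu + \tilde \mu \tilde \eta$, which is exactly the quantity `z` computed in the code.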

The results are produced by the code below for a fixed value of $\lambda$ (I left some comments to make it easier to read, I hope). As you will notice from the results, the $\lambda$ parameter regulates the smoothness of the final estimate: for a very high $\lambda$ we obtain a pretty flat line.


# Simulate data
set.seed(1)
N = 10000
x = runif(N)

# Construct histograms
his = hist(x, breaks = 50, plot = F)
X = his$counts
u = his$mids

# Prepare basis (identity matrix) and first-order difference penalty
B = diag(length(X))
D1 = diff(B, diff = 1)
lambda = 1e6 # fixed here, but can be selected (e.g. by AIC; see below)
P = lambda * t(D1) %*% D1

# Smooth by penalized IWLS
tol = 1e-8
eta = log(X + 1)                 # starting values on the log scale
for (it in 1:20)
{
    mu = exp(eta)
    z = X - mu + mu * eta        # weighted working response
    a = solve(t(B) %*% (c(mu) * B) + P, t(B) %*% z)
    etnew = B %*% a
    de = max(abs(etnew - eta))   # convergence criterion
    cat('Crit', it, de, '\n')
    if (de < tol) break
    eta = etnew
}

# Plot
plot(u, exp(eta), ylim = c(0, max(X)), type = 'l', col = 2)
lines(u, X, type = 'h')
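
The comment in the code above hints at selecting $\lambda$ by AIC. As a minimal sketch of how that could be done under the same Poisson/IWLS setup (the function name `fitWhittakerAIC`, the grid of $\lambda$ values, and the 50-iteration cap are my own illustrative choices, not from the original answer):

# Sketch: select lambda by AIC = deviance + 2 * effective dimension,
# refitting the Whittaker smoother over a grid of lambda values.
fitWhittakerAIC = function(X, lambda, tol = 1e-8)
{
    B = diag(length(X))
    P = lambda * crossprod(diff(B, diff = 1))
    eta = log(X + 1)
    for (it in 1:50)
    {
        mu = c(exp(eta))
        z = X - mu + mu * eta
        a = solve(t(B) %*% (mu * B) + P, t(B) %*% z)
        if (max(abs(B %*% a - eta)) < tol) break
        eta = B %*% a
    }
    mu = c(exp(eta))
    H = solve(t(B) %*% (mu * B) + P, t(B) %*% (mu * B)) # hat matrix
    dev = 2 * sum(ifelse(X > 0, X * log(X / mu), 0) - (X - mu))
    dev + 2 * sum(diag(H)) # AIC
}

lambdas = 10^seq(0, 8, by = 0.5)
aics = sapply(lambdas, function(l) fitWhittakerAIC(X, l))
lambdas[which.min(aics)] # lambda with the lowest AIC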

To conclude, I hope my suggestion is clear enough and answers (at least partially) the original question.

[Plot: histogram counts with the P-spline (Whittaker) smooth overlaid.]

Gi_F.