
Suppose I have $(\mathbf{X}_1, \cdots, \mathbf{X}_n)$ from a multivariate distribution $f$. The multivariate KDE is \begin{align*} \widehat{f}_\mathbf{H}(\mathbf{x}) = n^{-1}\sum_{i=1}^{n}K_\mathbf{H}(\mathbf{x} - \mathbf{X}_i) \end{align*} with bandwidth matrix $\mathbf{H}$. The bandwidth matrix can be selected via a plethora of methods; I'm leaning towards either plug-in selectors or smoothed cross-validation. In my specific example, each $\mathbf{X}_i \in \mathbb{R}^2$ and my $n$ is tremendous ($n \approx 130,000$). I'm thinking about reducing computing time by performing bandwidth selection on a subset of observations $(\mathbf{X}^*_1, \cdots, \mathbf{X}^*_m) \subset (\mathbf{X}_1, \cdots, \mathbf{X}_n)$ with $m \ll n$ to obtain some estimate $\widetilde{H}$. I know that the subset-based bandwidth estimate $\widetilde{H}$ is not optimal for the entire dataset $(\mathbf{X}_1, \cdots, \mathbf{X}_n)$ because people have shown that $\widetilde{H}$ decreases with respect to $n$. However, is it possible to simply scale $\widehat{H} = c \widetilde{H}$ such that $\widehat{H}$ is approximately optimal for the entire dataset? If it makes it any easier, I plan to also restrict $\mathbf{H}$ to be diagonal.
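For concreteness, here is a minimal NumPy sketch of the estimator above with a Gaussian kernel and a diagonal $\mathbf{H} = \operatorname{diag}(h_1^2, h_2^2)$; the synthetic data and the hand-picked bandwidths are only placeholders for my real points and a proper selector:

```python
import numpy as np

def kde_diag(x, data, h):
    """Product-Gaussian KDE with H = diag(h**2).

    x    : (q, d) evaluation points
    data : (n, d) sample
    h    : (d,)   per-coordinate bandwidths
    """
    d = data.shape[1]
    # standardized differences, shape (q, n, d); fine for a sketch, but a real
    # n ~ 130,000 would need chunking over the evaluation points
    z = (x[:, None, :] - data[None, :, :]) / h
    norm = (2 * np.pi) ** (d / 2) * np.prod(h)   # (2*pi)^(d/2) * |H|^(1/2)
    kern = np.exp(-0.5 * np.sum(z ** 2, axis=2)) / norm
    return kern.mean(axis=1)                     # average of the n kernels

# toy stand-in for the real 130,000 bivariate points
rng = np.random.default_rng(0)
data = rng.normal(size=(5_000, 2))
grid = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, -1.0]])
h = np.array([0.2, 0.3])                         # placeholder diagonal bandwidths
print(kde_diag(grid, data, h))
```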

Thanks!

Tom Chen
  • That's an interesting question. I wonder whether, with so many points in only 2 dimensions, there might be diminishing returns. The benefit of reducing the bandwidth is being able to fit sharper features of the underlying distribution. If the distribution is smooth, you might already be able to hit a good bandwidth using a subset of your large number of points. If that's the case, further decreasing the bandwidth and increasing the number of kernels might give only marginal improvement (yeah, that's a cop out). – user20160 Feb 22 '17 at 22:42
  • On an unrelated note, my post [here](http://stats.stackexchange.com/questions/219833/density-estimation-for-large-dataset/219913#219913) lists a couple of tricks that might be helpful for speeding up computation for KDEs with large datasets. – user20160 Feb 22 '17 at 22:44
  • btw, I meant marginal improvement in fit by reducing the bandwidth. Clearly computational performance will go way up using a subset of the data. – user20160 Feb 22 '17 at 23:40
  • I wouldn't call 130,000 (bivariate) points a "tremendously" big data set! Actually I probably wouldn't even refer to that as "big data" ;) – Adrien Feb 23 '17 at 10:10
  • Careful: $\widetilde{H}$ doesn't decrease with respect to $n$: $\widetilde{H}$ is a quantity you choose, and you can choose it to be increasing, decreasing, or however you want. What you probably meant is that the "optimal" bandwidth decreases (in a certain sense). – Adrien Feb 23 '17 at 10:17
  • Yes, Adrien, you are correct; I mean the optimal $\widetilde{H}$ that minimizes the AMISE or some similar loss function – Tom Chen Feb 23 '17 at 15:43

1 Answer


For anyone curious about an answer: I've tested a conjecture and verified it through simulation.

Silverman (1986) gives the (AMISE-)optimal bandwidth the form \begin{align*} h_n = C_f\, n^{-1/(d+4)} \end{align*} for a constant $C_f$ that depends only on the kernel and the true data-generating density; this is the familiar $n^{-1/5}$ in the univariate case and $n^{-1/6}$ for bivariate data as here. Hence \begin{align*} h_n = h_m\left(\frac{m}{n}\right)^{1/(d+4)}. \end{align*} Replacing $h_m$ with the estimate $\widehat{h}_m$ from smoothed cross-validation on a subset of $m$ observations then gives an approximation to $\widehat{h}_n$ for the entire dataset.
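As a sanity check, here is a minimal NumPy sketch of the rescaling. The normal reference rule (via a hypothetical helper `silverman_diag_bw`) stands in for the smoothed cross-validation selector, synthetic normal data stands in for the real sample, and $\mathbf{H}$ is taken diagonal as in the question:

```python
import numpy as np

def silverman_diag_bw(data):
    """Normal-reference-rule bandwidths, one per coordinate, for an (n, d) sample.
    Used here only as a cheap stand-in for a plug-in / SCV selector."""
    n, d = data.shape
    return data.std(axis=0, ddof=1) * (4.0 / ((d + 2.0) * n)) ** (1.0 / (d + 4.0))

rng = np.random.default_rng(1)
n, m = 130_000, 5_000                        # full sample and subset sizes
data = rng.normal(size=(n, 2))               # placeholder for the real data
subset = data[rng.choice(n, size=m, replace=False)]

d = data.shape[1]
h_m = silverman_diag_bw(subset)              # bandwidth selected on the subset
h_n = h_m * (m / n) ** (1.0 / (d + 4.0))     # rescaled to the full sample size

print("subset bw   :", h_m)
print("rescaled bw :", h_n)
print("full-data bw:", silverman_diag_bw(data))   # should be close to the rescaled one
```

Since the reference rule itself scales exactly as $n^{-1/(d+4)}$, the rescaled subset bandwidth here matches the full-data rule up to sampling noise in the subset's standard deviations; with a plug-in or SCV selector the agreement is only approximate.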

Tom Chen
  • Bandwidth selection is a very hard problem and there are no definitive answers. Most solutions do work the way you said: define an optimal bandwidth in some sense, which depends on some quantities of the unknown distribution, then estimate these quantities. – Adrien Feb 23 '17 at 10:23
  • Cross-validation is indeed computationally intensive; however, rather than throwing away data, I would rather perform a simplified cross-validation on all of the data. Which version of cross-validation are you using? – Adrien Feb 23 '17 at 10:25
  • Hi Adrien, I am performing smoothed cross-validation as defined here: https://en.wikipedia.org/wiki/Multivariate_kernel_density_estimation#Smoothed_cross_validation I'm doing this because I've read that many authors have found, time and again, that either plug-in or SCV has the best performance (statistically). Of course, computationally, it's still a challenge. – Tom Chen Feb 23 '17 at 15:44
  • One other thing: I only use a subset of the data to compute the bandwidth. I then plug this estimated bandwidth into the overall KDE, which includes all the data, so I may lose a bit of precision in estimating the optimal bandwidth, but I'm still using the full dataset. – Tom Chen Feb 23 '17 at 15:49
  • I gave it a little more thought: cross-validation is a computationally intensive algorithm meant to make up for a lack of data. In your case it seems you have limited computational power and more data than you can exploit. Wouldn't you be better off with a classical training set/test set split than with cross-validation? – Adrien Feb 24 '17 at 09:37