How does sklearn.neighbors.KernelDensity.fit()
fit a probability density to the dataset? The bandwidth is a parameter that we already provide; what other parameter values does it evaluate to fit the data with an estimated probability density?
1 Answer
Start by refreshing your knowledge of kernel density estimation. As you may know, KDE estimates the distribution of the data by looking at the distance of a point $x$ to each point $x_i$ in the training set, where the distances are weighted by a kernel $K$ parametrized by the bandwidth $h$:
$$ \hat{f_h}(x) = \frac{1}{nh} \sum_{i=1}^n K\Big(\frac{x-x_i}{h}\Big) $$
Since kernels $K$ have the property that they integrate to one, we can think of the result as a mixture distribution with equal $\tfrac{1}{n}$ weights. It follows that the mixture also integrates to one and has the properties of a probability density function.
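To make this concrete, here is a minimal sketch (not scikit-learn's actual code) of the formula above in NumPy, using a Gaussian kernel:

```python
import numpy as np

def naive_kde(x, data, h):
    """Evaluate the KDE estimate f_h at the points x, given training points and bandwidth h."""
    # Gaussian kernel K(u) = exp(-u^2/2) / sqrt(2*pi), which integrates to one
    u = (x[:, None] - data[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
    # average the n kernel "bumps" and rescale by h, exactly as in the formula
    return K.sum(axis=1) / (len(data) * h)

rng = np.random.default_rng(0)
data = rng.normal(size=1000)
grid = np.linspace(-5.0, 5.0, 2001)

density = naive_kde(grid, data, h=0.3)
print(density.sum() * (grid[1] - grid[0]))  # ~1.0: the mixture integrates to one
```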
As with algorithms like $k$-NN, training the KDE means just memorizing the training data (the $x_i$ points), so that the computation above can be done at prediction time.
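You can see this split in scikit-learn's API: fit() stores (and indexes) the points, and the sum above is only evaluated when you call score_samples(). A small sketch:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 1))        # scikit-learn expects 2-D input
grid = np.linspace(-5.0, 5.0, 201).reshape(-1, 1)

kde = KernelDensity(kernel="gaussian", bandwidth=0.3)
kde.fit(data)                            # "training": memorize/index the x_i points
log_density = kde.score_samples(grid)    # the actual KDE sum happens here
density = np.exp(log_density)            # score_samples returns the log-density
```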
From scikit-learn's documentation you can learn that it uses a slightly more clever algorithm. As you can imagine, looping over the whole training set is a computationally demanding step, and there are several ways to speed it up. R's `density` function uses the FFT to reduce the number of points used in the computation. Scikit-learn takes a different approach: it stores the training data in a tree structure (a ball tree or a KD tree), so that instead of looking at all the data, it evaluates only the neighborhood of the point $x$. These are technical details, but the takeaway message is that they make the computations faster at the cost of somewhat less precise results.
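For instance (a sketch, with made-up sizes), the atol/rtol tolerance parameters control how aggressively the tree may prune: with the defaults of zero the sums are exact, while loosening them trades precision for speed:

```python
import numpy as np
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 1))
grid = np.linspace(-4.0, 4.0, 400).reshape(-1, 1)

exact = KernelDensity(bandwidth=0.2).fit(X)              # atol=0, rtol=0: exact sums
approx = KernelDensity(bandwidth=0.2, rtol=1e-3).fit(X)  # allow ~0.1% relative error

# once a subtree's contribution is bounded within the tolerance, it is not descended
diff = np.abs(exact.score_samples(grid) - approx.score_samples(grid))
print(diff.max())  # small, controlled by rtol
```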
Answering your question, `bandwidth` and `kernel` are the core hyperparameters for KDE; they have the greatest influence on the results. The other parameters are the "technical" ones that can also influence the result, so it may be worth empirically checking how they influence the result as well.
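One standard way to do that check is cross-validation over a hyperparameter grid; since KernelDensity.score() returns the total log-likelihood of held-out data, GridSearchCV can use it directly:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1))

params = {
    "bandwidth": np.logspace(-1, 0, 10),
    "kernel": ["gaussian", "tophat", "epanechnikov"],
}
# GridSearchCV falls back to the estimator's own score(), i.e. the log-likelihood
search = GridSearchCV(KernelDensity(), params, cv=5)
search.fit(X)
print(search.best_params_)
```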
