
I am taking a look at http://pages.cs.wisc.edu/~jerryzhu/cs731/kde.pdf, where they define the following loss function for kernel density estimates:

$$J(h) = \int \hat{f}_n^2(x)\,dx - 2\int\hat{f}_n(x)f(x)\,dx,$$ which comes from expanding the loss $$\int\big(\hat{f}_n(x)-f(x)\big)^2dx,$$ called the integrated squared loss. This loss makes intuitive sense to me because it asks how well our kernel density estimate matches the true density.
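Writing out the expansion explicitly (as I understand it):
$$\int\big(\hat{f}_n(x)-f(x)\big)^2dx = \int\hat{f}_n^2(x)\,dx - 2\int\hat{f}_n(x)f(x)\,dx + \int f^2(x)\,dx,$$
and since the last term $\int f^2(x)\,dx$ does not depend on the bandwidth $h$, minimizing the integrated squared loss over $h$ is the same as minimizing $J(h)$.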

However, I am unable to follow the next step. They claim we can rewrite $J(h)$ as $$\hat{J}(h) = \int\hat{f}_n^2(x)\,dx-\frac{2}{n}\sum_{i=1}^n\hat{f}_{-i}(x_i),$$ meaning we approximate $J(h)$ with a leave-one-out approach (that is what the notation $\hat{f}_{-i}$ means: the density estimate built from all points except $x_i$).
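For concreteness, here is how I read $\hat{J}(h)$ in code: a minimal sketch of my own, assuming a 1-D Gaussian kernel and approximating the first integral numerically on a grid (the function and variable names are mine, not from the notes).

```python
import numpy as np

def loo_cv_score(x, h):
    """Leave-one-out CV score J_hat(h) for a 1-D Gaussian-kernel KDE."""
    n = len(x)
    K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)  # standard Gaussian kernel

    # First term: \int \hat{f}_n^2(x) dx, approximated by a Riemann sum on a fine grid
    grid = np.linspace(x.min() - 5 * h, x.max() + 5 * h, 2000)
    f_hat = K((grid[:, None] - x[None, :]) / h).mean(axis=1) / h
    term1 = np.sum(f_hat**2) * (grid[1] - grid[0])

    # Second term: (2/n) * sum_i \hat{f}_{-i}(x_i), the leave-one-out densities
    Kmat = K((x[:, None] - x[None, :]) / h) / h
    np.fill_diagonal(Kmat, 0.0)              # exclude the kernel centred at x_i itself
    f_loo = Kmat.sum(axis=1) / (n - 1)
    term2 = 2.0 * f_loo.mean()

    return term1 - term2
```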

I really don't understand the intuition behind this. Can anyone help clarify?

Thanks!

user2879934
  • The identity (in expectation, anyways) in question was uncovered simultaneously in a pair of famous papers by Bowman and Rudemo in the early 1980s. Like many great ideas, it's obvious once you know it's true -- but it was not obvious to several generations of statisticians who came before then, including quite a few brilliant ones! So don't feel bad if you do not immediately see it. – nth Aug 21 '19 at 02:49

2 Answers


Comparing your formulas, we have \begin{align} J(h) & \approx \hat{J}(h) & \implies \\ \int\hat{f}_n(x)f(x)\,dx & \approx \tfrac{1}{n}\sum_i\hat{f}_{-i}(x_i) & \implies \\ \mathbb{E}_{x\sim f}\left[\hat{f}_n(x)\right] &\approx \overline{\hat{f}_{-i}(x_i)}, \end{align} which says that the expected value of the full kernel estimate (over an evaluation point drawn from the true density $f$) is approximately equal to the sample average of the "leave the evaluation point out" kernel estimate.
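To see this numerically, here is a rough sketch of my own (a Gaussian kernel, with data drawn from a known standard normal so that the left-hand integral can be computed directly on a grid; none of the names come from the linked notes):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(2000)      # sample from a known true density f = N(0, 1)
h = 0.3                            # some fixed bandwidth
n = len(x)

K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)        # Gaussian kernel
f_true = lambda t: np.exp(-0.5 * t**2) / np.sqrt(2 * np.pi)   # the true density

# Left side: \int \hat{f}_n(x) f(x) dx, computable on a grid here because f is known
grid = np.linspace(-6, 6, 4000)
f_hat = K((grid[:, None] - x[None, :]) / h).mean(axis=1) / h
lhs = np.sum(f_hat * f_true(grid)) * (grid[1] - grid[0])

# Right side: the sample average of the leave-one-out densities at the observed points
Kmat = K((x[:, None] - x[None, :]) / h) / h
np.fill_diagonal(Kmat, 0.0)
rhs = (Kmat.sum(axis=1) / (n - 1)).mean()

print(lhs, rhs)   # the two numbers should agree closely
```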

Does this help the intuition?

GeoMatt22
  • it does, but why leave-one-out? why not full sample average? – user2879934 May 05 '17 at 11:15
  • Well, the point is to do cross-validation (i.e. the formula is not advertised as a training error?). As I understand it, the $\hat{f}_{-i}$ kernel density excludes the basis function centered at point $x_i$. Each term in the 2nd sum is the [likelihood](https://en.wikipedia.org/wiki/Likelihood_function) of a point under the kernel density based on the other points, so when the average leave-one-out likelihood is high, the CV error is low. (For example, if a point is an outlier relative to its nearest neighbor + bandwidth, it will have a low likelihood, and so will not contribute much to the sum.) – GeoMatt22 May 05 '17 at 12:53
  • Sorry, I realize I never framed this discussion properly. I think the point is bandwidth choice, so it is a training error in the sense that minimizing the error with respect to the bandwidth yields the optimal bandwidth. – user2879934 May 05 '17 at 13:18
  • Well first of all, the sum **is** a full-sample average, just an average of $\hat{f}_{-i}$ rather than $\hat{f}_n$. In terms of bandwidth, the "missing terms" would just "want" $h\to{0}$, to put all the probability mass of their kernel $K_i(x)$ at *their* sample point $x_i$, no? – GeoMatt22 May 05 '17 at 15:09

I think this is how leave-one-out cross-validation works. Let's assume we have ten data points $\{x_1, x_2, \ldots, x_{10}\}$. Each time we fit the estimate with one point left out, and that held-out point is used for validation. I agree with you that we are training and aiming to minimize the error, but we may face an over-fitting issue, and that is the reason we apply cross-validation. Actually, it is a full-sample average; otherwise the second term in $\hat{J}(h)$ would be divided by $n-1$ instead of $n$. In the ten-data-point example we train ten times, and each time we have 9 training points and 1 validation point; here $n=10$.
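Here is a rough sketch of that procedure (my own illustration, assuming a Gaussian kernel and made-up data; the explicit loop shows the "fit on 9, validate on 1" structure):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(10)        # the ten data points of the example, n = 10
K = lambda u: np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)   # Gaussian kernel

def j_hat(x, h):
    """Leave-one-out CV score for a Gaussian-kernel KDE with bandwidth h."""
    n = len(x)
    # First term: \int \hat{f}_n^2(x) dx, using the full-sample KDE on a grid
    grid = np.linspace(x.min() - 5 * h, x.max() + 5 * h, 1000)
    f_hat = K((grid[:, None] - x[None, :]) / h).mean(axis=1) / h
    term1 = np.sum(f_hat**2) * (grid[1] - grid[0])
    # Second term: for each of the n folds, fit on the other n-1 points
    # and evaluate that density at the held-out point
    loo = []
    for i in range(n):
        train = np.delete(x, i)
        loo.append(np.mean(K((x[i] - train) / h) / h))
    return term1 - 2.0 * np.mean(loo)

bandwidths = np.linspace(0.1, 2.0, 50)
scores = [j_hat(x, h) for h in bandwidths]
print("best h:", bandwidths[np.argmin(scores)])   # bandwidth with the lowest CV score
```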