I have been wondering about a passage from section 2.5.1, *Kernel density estimators* (p. 122), of Bishop's *Pattern Recognition and Machine Learning*. The passage follows an explanation of how a density function $p$ can be approximated, based on the following argument:
- Let the random variable $X \sim p$. Consider a small region $R$ containing the point $x$ at which we want to estimate the density, and let $P = P(X \in R) = \int_R p(x') \:\mathrm{d}x'$.
- Assume we have a data set of $N$ samples drawn independently from $p$. Then $K$, the number of samples that land inside $R$, is binomially distributed: $K \sim \mathrm{Bin}(N, P)$.
- If $N$ is large, the variance of the r.v. $K/N$, namely $P(1-P)/N$, is close to zero, and so we expect $K/N$, the maximum-likelihood estimator of $P$, to give a good estimate of $P$.
- Assume that $R$ is so small that $p$ is approximately constant over $R$. Then $P \approx p(x)V$, where $V$ is the volume of $R$.
- Combining the two gives the density estimate $p(x) = \frac{K}{NV}$ (2.246); a quick numerical check of this follows the list.
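As a sanity check on (2.246), here is a minimal sketch, assuming a 1-D standard normal for $p$; the values of $N$, the evaluation point $x_0$, and the half-width $h$ of $R$ are my own arbitrary choices, not Bishop's.

```python
# Numerical check of p(x) ≈ K/(NV), assuming p is a 1-D standard normal.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
N = 100_000                 # large N, so K/N concentrates around P
x0 = 0.5                    # point at which we estimate the density
h = 0.05                    # half-width of the region R = [x0 - h, x0 + h]
V = 2 * h                   # "volume" (length) of R in one dimension

samples = rng.normal(size=N)            # N draws from p = N(0, 1)
K = np.sum(np.abs(samples - x0) < h)    # samples falling inside R

print(f"estimate K/(NV) = {K / (N * V):.4f}")
print(f"true p(x0)      = {norm.pdf(x0):.4f}")
```

The passage that puzzles me then reads: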
> Note that the validity of (2.246) depends on two contradictory assumptions, namely that the region $R$ be sufficiently small that the density is approximately constant over the region and yet sufficiently large (in relation to the value of that density) that the number $K$ of points falling inside the region is sufficient for the binomial distribution to be sharply peaked.
Bishop, Christopher M. *Pattern Recognition and Machine Learning*. Springer, 2006.
I am not sure I understand the second assumption, i.e. why $R$ must be sufficiently large, relative to the value of the density, for the binomial distribution of $K$ to be sharply peaked.
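To try to see the tension concretely, I ran the sketch below, again assuming a 1-D standard normal, with $N$ fixed and the half-width of $R$ shrinking (all constants are my own choices). For the widest region the estimate looks biased, presumably because $p$ is not constant over $R$; for the narrowest one the expected $K$ is only a handful, and the spread of the estimate across repeated data sets blows up.

```python
# Fixed N, shrinking R: my attempt to reproduce the tension in the quote.
# The density (standard normal), N, x0, and the widths are my own choices.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N, x0, repeats = 1_000, 0.5, 200

for h in (0.5, 0.05, 0.005):      # half-widths of R = [x0 - h, x0 + h]
    V = 2 * h
    estimates = []
    for _ in range(repeats):
        samples = rng.normal(size=N)              # a fresh data set from p
        K = np.sum(np.abs(samples - x0) < h)      # points falling inside R
        estimates.append(K / (N * V))             # the estimate (2.246)
    estimates = np.asarray(estimates)
    print(f"h={h:5.3f}  mean={estimates.mean():.3f}  "
          f"std={estimates.std():.3f}  true p(x0)={norm.pdf(x0):.3f}")
```

Is this spread what Bishop means by the binomial distribution failing to be sharply peaked, and if so, how exactly does "sufficiently large in relation to the value of that density" capture it?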