
Suppose that I have a system that at each time $t_i$ produces $N$ i.i.d. samples from an unknown distribution $f(x;t)$. I want to estimate the distribution in an online manner. If I had only the observations at a single time $t_i$, I could use kernel density estimation: $$f(x;t_i)\approx \frac{1}{Nh}\sum_{j=1}^{N} K\left(\frac{x-x_j}{h}\right)$$ But the system generates $N$ data points at every time step. Hence, if the distribution does not depend on $t$, after $T$ observations I have the approximation $$f(x;t) = f(x)\approx \frac{1}{NTh}\sum_{i=1}^{T}\sum_{j=1}^{N} K\left(\frac{x-x_{j,t_i}}{h}\right)$$

In this expression the number of samples in the summation grows as time goes on, so eventually I have to store a large amount of raw data, and evaluating the sum becomes more and more expensive. I am therefore looking for a method that does not require keeping all of the previous raw data (for example, something like a moving average). The estimator above is also not suitable for time-varying distributions.
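To make the problem concrete, here is a minimal sketch of the naive estimator in R (with a Gaussian kernel and a hypothetical fixed bandwidth h); both the stored vector and the cost of a single evaluation grow with every new batch:

# naive approach: keep all raw samples and re-evaluate the full double sum
h <- 0.5                       # hypothetical fixed bandwidth
all.x <- c()                   # grows by N points at every time step

observe.batch <- function(x.new) {
  all.x <<- c(all.x, x.new)    # append the N new samples of the current time step
}

f.hat <- function(x) {
  # Gaussian-kernel KDE over everything seen so far: O(length(all.x)) work per evaluation point
  mean(dnorm(x, mean = all.x, sd = h))
}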

Are there extensions of kernel density estimation, or any other methods, that can estimate or learn the data distribution in an online manner without storing all of the past observations? Is it possible to learn time-varying distributions with such a method?

SMA.D

2 Answers


You can easily reuse the old estimate and only add in the new data. For example, if you have measured $N_1$ data points at $t_1$ and $N_2$ points at $t_2$, you have: $$f(x;t_i) = \frac{1}{N_i h}\sum_{j=1}^{N_i} K\left(\frac{x-x_j}{h}\right)$$ Combining both data sets, you have $$f(x)=\frac{1}{(N_1+N_2) h}\sum_{j=1}^{N_1+N_2} K\left(\frac{x-x_j}{h}\right) = \frac{1}{N_1+N_2}\Big(N_1 f(x;t_1) + N_2 f(x;t_2)\Big)$$ Added remark: how does this formula reduce the complexity of computing $f(x)$?

The R function density estimates $f(x)$ on a grid of n values of $x$ (default: n=512) and computes $f(x)$ at each of these values. Hence, if $N=N_1+\ldots+N_{i-1}$ is the total number of data points up to $t_{i-1}$, and $f$ is the density estimate up to this point, the estimate is updated as follows (beware that the parameter bw in density is NOT $h$, but proportional to it, so let's assume for simplicity that bw=h):

# assume that x.ti contains the data measured at time t_i
# and that f is to be estimated between x.min and x.max
n <- 512
N.i <- length(x.ti)
f <- (N * f + N.i * density(x.ti, n=n, from=x.min, to=x.max, bw=h)$y) / (N + N.i)
N <- N + N.i

The total space complexity is thus $O(n+N_i)$ and the time complexity of one update step is $O(n\cdot N_i)$.
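For completeness, here is a minimal sketch of how this update can be run over a stream of batches; the start values, the grid limits x.min and x.max, and the function get.next.batch() are illustrative placeholders, not part of R:

# running state: density values f on a grid of n points, and total sample count N
x.min <- 0; x.max <- 10; h <- 0.5; n <- 512
f <- rep(0, n)                  # with N = 0, the first update simply yields the first batch estimate
N <- 0

for (i in 1:n.steps) {          # n.steps: hypothetical number of time steps
  x.ti <- get.next.batch()      # placeholder for reading the samples of time step t_i
  N.i  <- length(x.ti)
  f.i  <- density(x.ti, n = n, from = x.min, to = x.max, bw = h)$y
  f    <- (N * f + N.i * f.i) / (N + N.i)   # weighted average of old and new estimate
  N    <- N + N.i
}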

cdalitz
  • The formula is also easily modified to estimate densities that vary in time: just add a weight factor to each contribution, with lower weights assigned to older measurements (see the sketch after these comments). – cdalitz Jun 19 '20 at 11:56
  • This method suffers from the problems mentioned in the question. It requires storing $N_1+N_2+N_3+\ldots$ points, and the computational complexity of calculating $f(x)$ for a single point $x$ grows with time, since the summation contains more and more points as time increases. – SMA.D Jun 19 '20 at 13:14
  • @SMA.D I have edited my answer to include an explanation of how the given formula can be used to reduce the space and time complexity. – cdalitz Jun 20 '20 at 05:21
  • Thank you for your answer. I conclude that if the distribution only needs to be evaluated at a finite number of values, then the complexity can be reduced. – SMA.D Jun 20 '20 at 07:19
  • @SMA.D: "if the distribution only needs to be evaluated at a finite number of values". You can (e.g. linearly) interpolate between the grid values to obtain the density for arbitrary *x* values. The kernel density estimator is an approximation anyway, so the interpolation will typically not add much error. The main drawback is that you must keep the bandwidth *h* and cannot adjust it later to some possibly more appropriate value. – cdalitz Jun 20 '20 at 07:33
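Regarding the weighting suggestion in the first comment above, here is a minimal sketch of one possible scheme, an exponentially weighted moving average of the per-batch estimates with a hypothetical forgetting factor lambda (this is one way to realize the idea, not code from the answer; it reuses the grid settings n, x.min, x.max and h from above):

# exponential forgetting: older batches are down-weighted by a factor (1 - lambda) per step
lambda <- 0.1                   # hypothetical forgetting factor in (0, 1]
# initialize f with the estimate of the first batch, then for every later batch x.ti:
f.i <- density(x.ti, n = n, from = x.min, to = x.max, bw = h)$y
f   <- (1 - lambda) * f + lambda * f.i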

Recall that kernel density estimation is closely related to finite mixture models, so for

$$ f(x) = \frac{1}{N} \sum_{i=1}^N \, K_h(x - x_i) $$

where $K_h(x) = K(x/h)/h$, the factor $\frac{1}{N}$ may be thought of as a weight, or mixing proportion, and the kernel $K_h$ as a distribution with mean equal to $x_i$. This makes the kernel density estimate a mixture of $N$ components with equal mixing proportions, where each component has fixed standard deviation $h$.

Now recall that $k$-means clustering is a special case of the Gaussian mixture model, and that there are online algorithms for $k$-means. So if you are willing to fix some pre-defined number of components $k$, you can iterate over the data: for each data point $x$, find the closest component using the kernel as a proximity metric $K_h(x - x_j)$, and then update the number of samples $n_j$ already assigned to it and its mean $x_j$:

$$\begin{align} j &:= \operatorname{arg\,max}_j \; K_h(x - x_j) \\ n_j &:= n_j + 1 \\ x_j &:= x_j + \tfrac{1}{n_j} ( x - x_j) \\ \end{align}$$

then your density estimate is

$$ f(x) = \sum_{j=1}^k \, \frac{n_j}{N} \, K_h(x - x_j) $$

where $N = \sum_{j=1}^k n_j$. What this does is collapse and shift the components, so it is similar to estimating a kernel density for binned data, except that the binning also happens online and the bin centres $x_j$ are shifted to fit the data better.

The downside is that you need to choose the hyperparameters $h$ and $k$ and somehow initialize the bin centres $x_j$ (e.g. as a uniform grid from the expected minimum to maximum), and it is surely not the most precise algorithm; on the other hand, the computation time, $O(Nk)$, is probably the fastest possible, and the memory usage is $O(k)$.
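A minimal sketch of this scheme in R, with a Gaussian kernel; the uniform-grid initialization and the function names are my own illustrative choices:

# online k-means-like kernel density estimate with k fixed components
h <- 0.5                                  # kernel bandwidth (hyperparameter)
k <- 50                                   # number of components (hyperparameter)
centers <- seq(0, 10, length.out = k)     # initial bin centres, e.g. a uniform grid over the expected range
counts  <- rep(0, k)                      # n_j: number of samples assigned to each component

update <- function(x) {
  # assign x to the closest component under the kernel proximity K_h(x - x_j) ...
  j <- which.max(dnorm(x, mean = centers, sd = h))
  # ... and update its count and running mean
  counts[j]  <<- counts[j] + 1
  centers[j] <<- centers[j] + (x - centers[j]) / counts[j]
}

f.hat <- function(x) {
  # mixture density: sum_j (n_j / N) * K_h(x - x_j); empty components contribute zero weight
  N <- sum(counts)
  sum(counts / N * dnorm(x, mean = centers, sd = h))
}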

By the way, while googling this I found that there are some less ad hoc algorithms for this problem that also collapse the components, but use a more sophisticated procedure for doing so. Additionally, they seem to be able to adapt the bandwidth, but since I had not heard of them before, I cannot comment on them.

Kristan, M., Skočaj, D., & Leonardis, A. (2010). Online kernel density estimation for interactive learning. Image and Vision Computing, 28(7).

Kristan, M., Leonardis, A., & Skočaj, D. (2011). Multivariate online kernel density estimation with Gaussian kernels. Pattern Recognition, 44(10-11), 2630–2642.

Tim