
Suppose I have a random process that generates a single number $x \in [0, 1]$ per time step $t$. Let's call the process $\pi$. At the beginning I assume that the outcome is uniformly distributed. Now, as I receive $x_t$, I update my belief over the process. As $t$ goes to $\infty$ I'll get an accurate representation.

Currently I'm keeping a set of particles $X$, which I initialize at the beginning by drawing uniformly from $[0, 1]$. I do this because at the start I'm assuming that all numbers in this range are equally likely.

Let's say I store 10,000 of them. Whenever I get a new observation I replace the oldest particle with that value (like a ring buffer or a FIFO queue). After some time the 10,000 particles will represent the underlying distribution well enough. To generate samples from $\pi$ I randomly draw from $X$, which is then like drawing from $\pi$.
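
For concreteness, here is a minimal sketch of what I'm currently doing (class and method names are only illustrative):

```python
import numpy as np

# Minimal sketch of the ring-buffer idea described above;
# class and method names are just illustrative.
class ParticleBuffer:
    def __init__(self, n_particles=10_000, rng=None):
        self.rng = np.random.default_rng() if rng is None else rng
        # Uniform prior over [0, 1]: initialize all particles uniformly.
        self.particles = self.rng.uniform(0.0, 1.0, size=n_particles)
        self.head = 0  # position of the oldest particle

    def add(self, x):
        # Overwrite the oldest particle (FIFO / ring-buffer behaviour).
        self.particles[self.head] = x
        self.head = (self.head + 1) % len(self.particles)

    def sample(self, size=1):
        # Drawing uniformly from the stored particles stands in for
        # drawing from the process once enough observations have arrived.
        return self.rng.choice(self.particles, size=size)
```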

To make it a bit clearer: my intention is not to learn the distribution per se, but rather to be able to sample from it using values I've already seen. So my idea was that the more samples I store, the better my approximation will be.

Is there a more efficient way? Is there perhaps a neural network that learns a representation? I've read about Restricted Boltzmann Machines. Would that be something appropriate?

hh32
  • What exactly do you want to learn about the distribution? PDF? Or maybe CDF? – Tim Aug 31 '17 at 14:37
  • I want to learn the PDF – hh32 Aug 31 '17 at 14:38
  • "randomly initialize" — https://en.wikipedia.org/wiki/Hacker_koan#Uncarved_block – Kodiologist Aug 31 '17 at 15:33
  • It looks like you're sampling to simulate Bayesian kernel density estimation with a noninformative prior – Aksakal Aug 31 '17 at 17:04
  • Btw, regarding your edit, sampling from kernel densities is pretty straightforward: https://stats.stackexchange.com/questions/286498/generate-random-draws-from-kernel-density-in-python/286557#286557 – Tim Sep 01 '17 at 08:32

2 Answers


If the bounds of the distribution are known in advance, you can use a binned kernel density estimator. If the standard kernel density estimator is

$$ f(x) = \frac{1}{nh} \sum_{i=1}^n K \left( \frac{x-x_i}{h} \right) $$

then you can define the binned kernel density estimator as

$$ g(x) = \frac{1}{nh} \sum_{i=1}^k n_i \, K \left( \frac{x-t_i}{h} \right) $$

for data binned into $k$ bins with sizes $n_1,\dots,n_k$ such that $\sum_i n_i = n$, with bin centers $t_1,\dots,t_k$.
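
A minimal sketch of this estimator, assuming a Gaussian kernel $K$ (the function name is only illustrative):

```python
import numpy as np

def binned_kde(x, bin_centers, bin_counts, h):
    """Evaluate the binned KDE g(x) with a Gaussian kernel K.

    bin_centers: t_1, ..., t_k
    bin_counts:  n_1, ..., n_k  (so that sum(n_i) = n)
    h:           bandwidth
    """
    bin_centers = np.asarray(bin_centers)
    bin_counts = np.asarray(bin_counts)
    n = bin_counts.sum()
    u = (np.asarray(x, dtype=float)[..., None] - bin_centers) / h  # shape (..., k)
    kernel = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)            # Gaussian K(u)
    return (bin_counts * kernel).sum(axis=-1) / (n * h)
```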

You can find more details in the following papers:

Scott, D. W., & Sheather, S. J. (1985). Kernel density estimation with binned data. Communications in Statistics-Theory and Methods, 14(6), 1353-1359.

Hall, P., & Wand, M. P. (1996). On the accuracy of binned kernel density estimators. Journal of Multivariate Analysis, 56(2), 165-184.

This approach only requires you to decide on $k$ and then simply count the observations that fall into each bin. The advantages are that the kernel density estimator can be re-calculated at any time and that you only need to store the $k$ bin centers plus their counts ($k$ does not have to be large). Moreover, it gives you the histogram estimator for free, as you already have the counts.
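
For a stream of observations on $[0, 1]$ the bookkeeping is minimal; with equal-width bins, finding the right bin is a constant-time index computation rather than a search. A rough sketch, building on the `binned_kde` sketch above (the names are again only illustrative):

```python
import numpy as np

k = 50                                    # number of bins on [0, 1]
bin_counts = np.zeros(k)                  # n_1, ..., n_k
bin_centers = (np.arange(k) + 0.5) / k    # t_1, ..., t_k

def update(x):
    # Equal-width bins on a known support: the bin index is a
    # constant-time computation, not a search.
    bin_counts[min(int(x * k), k - 1)] += 1

# The density can be re-evaluated at any time from the k counts alone,
# e.g. on a grid:
#   grid = np.linspace(0, 1, 200)
#   density = binned_kde(grid, bin_centers, bin_counts, h=0.05)
```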

Tim
  • I think you need to modify this approach to take into account OP's uniform prior. Say, if you have only one observation, the result must be the uniform density. In the ordinary kernel method it will be some kind of a bell shaped distribution, unless the window is very wide. – Aksakal Aug 31 '17 at 17:09
  • @Aksakal I don't know any meaningful Bayesian approach to density estimation (of unknown form). Moreover, I can't see how "assuming uniformity" changes anything in here, especially since OP says that he's going to have large sample size -- so it will diminish any prior and a flat one for sure. – Tim Aug 31 '17 at 17:20
  • OP says at the beginning it assumes uniform, so I'm simply looking at the boundary case of one observation. – Aksakal Aug 31 '17 at 17:22
  • @Aksakal I do not disagree with you, but I see it as a practical problem of efficiently dealing with large size data. From practical point of view, the initial prior doesn't matter. Thinking of it in Bayesian terms would be a nice intellectual puzzle, but I don't see practical reasons for it (* this is said by a person who in general prefers the Bayesian approach). – Tim Aug 31 '17 at 17:28
  • I suggested a simple hybrid method, where you start with a wide bandwidth then shrink it as the sample size increases. – Aksakal Aug 31 '17 at 18:33
  • Thanks for your answer! But wouldn't this approach be a little inefficient? I'm thinking about an implementation in a machine learning context, so I'd have to check for each incoming sample which bin it belongs to. If I have a lot of bins, that can take quite some time. Or am I wrong? – hh32 Sep 01 '17 at 08:05
  • @hh32 (1) you don't have to check all the bins, since you can stop as soon as you find the correct one, (2) when you know the number of bins and the distribution is bounded, you don't have to start searching at min or max, but can start with a smart guess, (3) you don't have to have a large number of bins as KDE will deal with "smoothing" the histogram, (4) if searching for the right bin is "inefficient" for you, then you won't find anything better as histogram is the most simple algorithm for density estimation and this approach is based on it. – Tim Sep 01 '17 at 08:15
  • Okay, Thanks! I'll go with this approach as it seems to be well suited. – hh32 Sep 01 '17 at 08:21

The way you describe your procedure, you're actually not learning the distribution per se. You create the uniform sample $\pi'$, then gradually replace its members with new observations from $\pi$.

First, if you keep doing this to infinity, as you said, then at some point all original members of $\pi'$ will have been replaced with observations from $\pi$. In this case, why bother initializing with $\pi'$ at all? All you need is to use the new observations from $\pi$.

Second, even after you have replaced the "old" observations with "new" ones, you simply sample from the modified dataset. You're not learning the probability distribution from it in the sense of building a representation of the distribution.

Now, the only reason to replace old observations with new ones gradually is if you start with a very small set of observations, so that the new data does not overwhelm the prior belief from the get-go. Only in this case would it make sense to try Bayesian kernel density estimation; see Sec. 27.5 here for an example.

UPDATE: In your case a very simple solution would be the ordinary kernel density estimator (KDE) with a shrinking bandwidth. The bandwidth is the most important parameter of a KDE. So, you start with a very wide bandwidth, so wide that it effectively produces the uniform distribution. For instance, if you set it to 10, any kernel will produce a nearly uniform distribution on $[0, 1]$.

Next, you shrink the bandwidth by some rule as the sample grows. It could be by $\ln n$ or something along those lines. Obviously, you only use the new observations; there's no need to initialize anything, because the wide kernel bandwidth represents your uniform prior. Once the sample grows, your distribution will start acquiring a shape driven by your observations.
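
A rough sketch of this idea, assuming a Gaussian kernel (the exact shrinkage schedule and all names here are only illustrative, not a prescription):

```python
import numpy as np

rng = np.random.default_rng()
observations = []   # only the new observations; no initialization needed
w0 = 10.0           # initial bandwidth: so wide the KDE is nearly uniform on [0, 1]

def bandwidth():
    # Shrink the bandwidth as the sample grows, e.g. by ln(n);
    # any slowly decreasing schedule in that spirit would do.
    n = len(observations)
    return w0 / max(1.0, np.log(n)) if n > 0 else w0

def update(x):
    observations.append(x)

def sample():
    # Sampling from the (truncated) Gaussian KDE: pick a stored observation
    # at random, add kernel noise, and reject draws outside [0, 1].
    if not observations:
        return rng.uniform(0.0, 1.0)   # pure uniform prior before any data
    h = bandwidth()
    while True:
        x = rng.choice(observations) + h * rng.standard_normal()
        if 0.0 <= x <= 1.0:
            return x
```

Sampling this way only requires storing the observations themselves and the current bandwidth; the estimated density never has to be materialized.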

Aksakal
  • Continuing our discussion in the comments, the last sentence from that section: "The final Bayesian density estimator is similar to the kernel density estimator. Of course, this raises the question of whether the Bayesian method is worth all the extra effort". This is basically the same conclusion as Rasmus Baath's about the Bayesian bootstrap. – Tim Aug 31 '17 at 17:56
  • @Tim, it's worth if OP starts with a small data set – Aksakal Aug 31 '17 at 18:01
  • How exactly would you shrink the bandwidth? (I am curious since this is an interesting problem.) In many cases, the bandwidth is *the* key issue for KDE. – Tim Aug 31 '17 at 18:37
  • @Tim, I just thought of this, I have no exact recipe. Maybe $w_0/\ln \ln n$, where $w_0$ is the initial bandwidth. The speed of decrease should itself decrease over time. Yes, the bandwidth is everything in KDE. – Aksakal Aug 31 '17 at 18:45
  • Thanks for your answer! I bother with the initialization because I'm sampling from $\pi$, or rather from an approximation of it, right at the beginning of the process. After a while I want to gradually improve my sampling as I see more data. Concerning the KDE: I compute it to get an idea of the approximation, but not after every data point I get. So overall I think a Bayesian approach would be best. – hh32 Sep 01 '17 at 07:57