
This is a pretty basic question, I believe. I'm trying to estimate a distribution (~a Gaussian) where I only have very few data points. If I had more data points, I'd just fit a Gaussian to the points.

What I thought of is using the following Bayesian method: I have a bunch of hypothesis distributions $P(a|i)$ each with weights $w_i = P(i)$, initially all equal, and then my estimated distribution is:

$$P(a) = \sum_i P(a|i) P(i)$$

Now, I receive a new data point $a^*$, and calculate new weights $w_i^*$ to replace $w_i$:

$$w_i^* = P(i|a^*) = \frac{P(a^*|i)}{P(a^*)} P(i)$$

where I just plug $a^*$ into the above formula for $P(a)$, and $P(i)$ is my known weights distribution. This works pretty well: after a few iterations the weight for the correct hypothesis approaches 1.
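
To make this concrete, here is a minimal sketch of the update in Python (the Gaussian hypotheses, candidate means, and data values are just made-up examples):

```python
import numpy as np
from scipy.stats import norm

# Made-up hypotheses: each hypothesis i is a Gaussian with a candidate mean
# and a shared, known standard deviation.
candidate_means = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
sigma = 1.0
weights = np.full(len(candidate_means), 1.0 / len(candidate_means))  # P(i), initially uniform

def update(weights, a_star):
    """One update: w_i* = P(a*|i) P(i) / P(a*), with P(a*) = sum_i P(a*|i) P(i)."""
    likelihoods = norm.pdf(a_star, loc=candidate_means, scale=sigma)  # P(a*|i)
    unnormalized = likelihoods * weights
    return unnormalized / unnormalized.sum()

for a_star in [0.3, -0.1, 0.5]:  # a few observed data points
    weights = update(weights, a_star)

print(weights)  # mass concentrates on the hypothesis closest to the data
```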

I just want to know if I'm doing it correctly. In particular, I'm doing it point by point. Should I put in all data points at once instead, and how do I do that? Does it make a difference? Is there a proof that this technique works as expected?


Update: I've made a little Python script that tries out both cases, point by point and adding all points at once (using the product of probabilities). It seems that adding data point by point or all at once gives the same results. I use $$ P(\vec a) = \sum_i \left( P(i) \cdot \prod_j P(a_j|i) \right)$$ where I sum over all hypotheses and $j$ indexes my data points. Previously this question had $$ P(\vec a) = P(a_0) \cdot P(a_1) \cdot P(a_2) \cdot \ldots $$ $$ = \prod_j \left[ \sum_i P(a_j|i) P(i) \right] \,,$$ which was wrong (and I'm still a bit confused why).
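
A stripped-down sketch of such a comparison (not the exact script; the hypotheses and data are made up) could look like this:

```python
import numpy as np
from scipy.stats import norm

# Made-up Gaussian hypotheses and data, as above.
candidate_means = np.array([-1.0, 0.0, 1.0])
sigma = 1.0
prior = np.full(3, 1.0 / 3.0)
data = np.array([0.2, 0.4, -0.1])

# P(a_j|i) for every data point j and hypothesis i: shape (n_points, n_hypotheses)
lik = norm.pdf(data[:, None], loc=candidate_means[None, :], scale=sigma)

# Point-by-point updating
w_seq = prior.copy()
for row in lik:
    w_seq = row * w_seq
    w_seq /= w_seq.sum()

# All at once: weights proportional to P(i) * prod_j P(a_j|i)
w_batch = prior * lik.prod(axis=0)
w_batch /= w_batch.sum()

print(np.allclose(w_seq, w_batch))  # True: same posterior weights either way

# The old formula multiplied the *mixtures* instead of mixing the *products*:
mixture_of_products = (prior * lik.prod(axis=0)).sum()  # sum_i P(i) prod_j P(a_j|i)
product_of_mixtures = (lik @ prior).prod()              # prod_j sum_i P(a_j|i) P(i)
print(mixture_of_products, product_of_mixtures)         # generally different numbers
```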

amoeba
jdm

1 Answer


You can indeed update point-by-point or via a batch of observations, so long as your observations are at least exchangeable. By de Finetti's representation theorem, exchangeable random variables are conditionally independent given an appropriate latent variable.

That is, you have

$$ p(X_{1}, \ldots, X_{n} \, | \, \theta) = \prod_{i = 1}^{n}p(X_{i} \, | \, \theta) $$

Since $p(\theta \, | \, X_{1}, \ldots, X_{n}) \propto \prod_{i}^{n} p(X_{i} \, | \, \theta)p(\theta)$, it doesn't matter in what order you multiply the $p(X_{i} | \theta)$ terms. You can do it one point at a time, in mini batches of $m < n$ points, all-at-once with $n$ points, etc.

Here's an inductive proof that performing $n$ point-by-point updates corresponds to doing a single $n$-point batch update. It suffices to show that

$$ p(\theta \, | \, X_{1}, \ldots, X_{n}) \propto p(X_{n} \, | \, \theta) \,p(\theta \, | \, X_{1}, \ldots, X_{n-1}). $$

Notice that $p(\theta \, | \, X_{1}) \propto p(X_{1} \, | \, \theta) p(\theta)$ holds by Bayes' theorem. Assume that the result holds for the $n^{th}$ case, and consider the case for $n+1$.

We have:

\begin{align} p(\theta \, | \, X_{1}, \ldots, X_{n + 1}) & \propto p(X_{1}, \ldots, X_{n+1} \, | \, \theta)\, p(\theta) & \text{Bayes' theorem} \\ & = p(X_{1}, \ldots, X_{n-1} \, | \, \theta) \,p(X_{n} \, | \, \theta) p(X_{n+1} \, | \, \theta) p(\theta) & \text{exchangeability} \\ & \propto p(X_{n+1} \, | \, \theta) \,p(X_{n} \, | \, \theta) \,p(\theta \, | \, X_{1}, \ldots, X_{n-1}) & \text{Bayes' theorem} \\ & = p(X_{n+1} \, | \, \theta) \,p(\theta \, | \, X_{1}, \ldots, X_{n}). \square & \text{inductive hypothesis} \end{align}
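
As a quick numerical sanity check, here is a small sketch using a conjugate Normal-Normal model (known observation variance; the prior parameters and data are made up), in which $n$ one-point updates reproduce the single $n$-point batch update:

```python
import numpy as np

rng = np.random.default_rng(0)

# Conjugate Normal-Normal example: Normal likelihood with known variance sigma2,
# Normal prior on the mean theta. All numbers here are arbitrary.
sigma2 = 1.0
mu0, tau2 = 0.0, 4.0                       # prior mean and variance for theta
x = rng.normal(1.5, np.sqrt(sigma2), 10)   # observations

def posterior(mu, tau2, xs, sigma2):
    """Batch posterior (mean, variance) for theta given observations xs."""
    post_var = 1.0 / (1.0 / tau2 + len(xs) / sigma2)
    post_mean = post_var * (mu / tau2 + xs.sum() / sigma2)
    return post_mean, post_var

# One n-point batch update
mu_b, var_b = posterior(mu0, tau2, x, sigma2)

# n one-point updates, feeding each posterior back in as the next prior
mu_s, var_s = mu0, tau2
for xi in x:
    mu_s, var_s = posterior(mu_s, var_s, np.array([xi]), sigma2)

print(np.isclose(mu_b, mu_s), np.isclose(var_b, var_s))  # True True
```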

Tim
jtobin