
Let $X_1$, $X_2$, ..., $X_n$ be iid RV's with range $[0,1]$ but unknown distribution. (I'm OK with assuming that the distribution is continuous, etc., if necessary.)

Define $S_n = X_1 + \cdots + X_n$.

I am given $S_k$, and ask: What can I infer, in a Bayesian manner, about $S_n$?

That is, I am given the sum of a sample of size $k$ of the RV's, and I would like to know what I can infer about the distribution of the sum of all the RV's, using a Bayesian approach (and assuming reasonable priors about the distribution).

If the support were $\{0,1\}$ instead of $[0,1]$, this problem would be well studied: with a uniform prior on the Bernoulli parameter, the posterior predictive for $S_n$ given $S_k$ is a beta-binomial compound distribution. But I'm not sure how to approach it with $[0,1]$ as the range...
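(For the record, assuming a uniform $\mathrm{Beta}(1,1)$ prior on the Bernoulli parameter $p$, the $\{0,1\}$ case works out as follows: the posterior is $p \mid S_k \sim \mathrm{Beta}(S_k+1,\,k-S_k+1)$, and the remaining terms have a beta-binomial predictive,
$$S_n - S_k \mid S_k \;\sim\; \mathrm{BetaBin}\!\big(n-k,\; S_k+1,\; k-S_k+1\big), \qquad
\Pr(S_n = i \mid S_k) \;=\; \binom{n-k}{\,i-S_k\,}\,\frac{B(i+1,\; n-i+1)}{B(S_k+1,\; k-S_k+1)},$$
for $i = S_k, \ldots, S_k+(n-k)$, where $B$ is the beta function.)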

Full disclosure: I already posted this on MathOverflow, but was told it would be better posted here, so this is a re-post.

  • I was about to write a comment to you on MO, but I'll write it here instead. If you feel the question is better suited to this forum, you might flag it on MO and ask to have it closed. – cardinal Mar 08 '12 at 20:55
  • I would like some clarification of your last statement. If the range is $\{0,1\}$ then any distribution that puts any mass on values not in $\{0,1,\ldots,n\}$ for the distribution of $S_k$ seems silly, so I'm wondering if I've understood your aim correctly. (Maybe a reference would be helpful.) – cardinal Mar 08 '12 at 20:58
  • What have I misunderstood? – cardinal Mar 08 '12 at 21:01
  • @cardinal: I agree with your "misunderstanding"... The beta distribution would be on the $\theta$ driving the Bernoulli distribution, not on the $S_n$... – Xi'an Mar 08 '12 at 21:13
  • Are you interested in Bayesian non-parametrics? If you do not want to make assumptions on the distribution of the $X_k$'s, you need a non-parametric framework. But then, given only $S_k$ you cannot say much... – Xi'an Mar 08 '12 at 21:14
  • These are good remarks; sorry that the problem was a little muddled. I was thinking that $n$ is very large in comparison to $k$, and that the posterior on $S_n$ would directly reflect the posterior on the parameters. Perhaps instead of $S_n$ I should have used $S'_n = S_n/n$, and asked for the posterior on $\lim S'_n$ as $n$ goes to infinity. Does this make sense now? – Ronald L Rivest Mar 08 '12 at 21:41
  • If $k$ is fixed, I believe essentially the same argument as I gave above shows that $S'_n$ converges to the *prior* uniform distribution in this case. By a different argument, one can construct a product space $[0,1] \times \Omega$ on which the convergence of $S'_n$ is even stronger than this. – cardinal Mar 09 '12 at 04:24
  • I don't yet understand your reasoning. For the simple case $k=1$, $n=2$, with $X_i \in \{0,1\}$: if we observe $X_1=1$, then the posterior on $p$ is $f(p;2,1)$, where $f(x;\alpha,\beta)$ is the beta density function. Then $X_2=1$ with probability 2/3 (doing the integration over $p$) and $X_2=0$ with probability 1/3, so the distribution on $S_n$ is *not* uniform. Note that each $X_i$ is from the same distribution, so information from $S_k$ *does* provide information on $S_n$... Does this make sense now? – Ronald L Rivest Mar 09 '12 at 19:04
  • Apologies. I have deleted the offending comment, which resulted from an unfortunate calculation error. I have left the other one to maintain some semblance of context to your comments. The question you are asking is a bit clearer now. You might consider editing the original post to reflect this. – cardinal Mar 11 '12 at 17:21
  • I think you actually mean exchangeable rather than iid (or conditionally iid perhaps?). For under iid we have $p(X_{i}|X_{j})=p(X_{i})$. Taking this further, under iid we have that $S_{k}$ is independent of $S_{n}-S_{k}$. Also, we must have $\frac{S_{k}}{n}\leq \frac{k}{n}\to 0$. So, under iid this means that in the limit, knowing the sum of the first $k$ terms doesn't help with the limiting proportion of ones. We are basically left with the central limit theorem and $\frac{S_{n}}{n}\sim N(p,\frac{\phi}{n})$ where $p=E(X_{i})$ and $\phi=V(X_{i})\leq p(1-p)$. – probabilityislogic Mar 12 '12 at 05:40
  • I hope you don't mind, but I made a suggested edit to the question as it should be a beta-binomial compound distribution on $S_n$ for the $\{0, 1\}$ case. – Neil G Mar 12 '12 at 07:25

3 Answers


Consider the following Bayesian nonparametric analysis.

Define $\mathscr{X}=[0,1]$ and let $\mathscr{B}$ be the Borel subsets of $\mathscr{X}$. Let $\alpha$ be a nonzero finite measure over $(\mathscr{X},\mathscr{B})$.

Let $Q$ be a Dirichlet process with parameter $\alpha$, and suppose that $X_1,\dots,X_n$ are conditionally i.i.d., given that $Q=q$, such that $\mu_{X_1}(B)=P\{X_1\in B\} = q(B)$, for every $B\in\mathscr{B}$.

From the properties of the Dirichlet process, we know that, given $X_1,\dots,X_k$, the predictive distribution of a future observation like $X_{k+1}$ is the measure $\beta$ over $(\mathscr{X},\mathscr{B})$ defined by $$ \beta(B) = \frac{1}{\alpha(\mathscr{X})+k} \left( \alpha(B) + \sum_{i=1}^k I_B(X_i)\right) \, . $$

Now, define $\mathscr{F}_k$ as the sigma-field generated by $X_1,\dots,X_k$, and use measurability and the symmetry of the $X_i$'s to get $$ E\left[ S_n \mid \mathscr{F}_k \right] = S_k + E\left[ \sum_{i=k+1}^n X_i \,\Bigg\vert\, \mathscr{F}_k \right] = S_k + (n-k) E\left[ X_{k+1} \mid \mathscr{F}_k \right] \, , $$ almost surely.

To find an explicit answer, suppose that $\alpha(\cdot)/\alpha(\mathscr{X})$ is $U[0,1]$. Defining $c=\alpha(\mathscr{X})>0$, we have $$ E\left[ S_n \mid X_1=x_1,\dots,X_k=x_k \right] = s_k + \frac{n-k}{c+k}\left(\frac{c}{2}+s_k\right) \, , $$ almost surely $[\mu_{X_1,\dots,X_k}]$ (the joint distribution of $X_1,\dots,X_k$), where $s_k=x_1+\dots+x_k$. In the "noninformative" limit $c\to 0$, this expectation reduces to $n\cdot (s_k/k)$: your posterior guess for $S_n$ is just $n$ times the mean of the first $k$ observations, which is about as intuitive an answer as one could hope for.
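A minimal numerical sanity check of this closed form (a Python sketch under the same assumptions: uniform base measure $U[0,1]$, with the concentration $c$ and the observed values chosen arbitrarily for illustration), comparing the formula with a Monte Carlo estimate from the Blackwell–MacQueen (Pólya urn) representation of the Dirichlet-process predictive:

```python
import numpy as np

def dp_posterior_mean_Sn(x_obs, n, c):
    """Closed-form E[S_n | X_1..X_k] for a Dirichlet-process prior with
    U[0,1] base measure and concentration c (the formula above)."""
    k, s_k = len(x_obs), float(np.sum(x_obs))
    return s_k + (n - k) / (c + k) * (c / 2.0 + s_k)

def dp_posterior_mean_Sn_mc(x_obs, n, c, n_sims=20_000, seed=0):
    """Monte Carlo check: draw X_{k+1}, ..., X_n sequentially from the
    Polya-urn predictive: a fresh U[0,1] draw with probability c/(c+m),
    otherwise a copy of one of the m values already generated."""
    rng = np.random.default_rng(seed)
    k = len(x_obs)
    totals = np.empty(n_sims)
    for t in range(n_sims):
        xs = list(x_obs)
        for m in range(k, n):
            if rng.random() < c / (c + m):
                xs.append(rng.random())            # new draw from the base measure
            else:
                xs.append(xs[rng.integers(m)])     # repeat an earlier value
        totals[t] = sum(xs)
    return totals.mean()

# hypothetical numbers: k = 5 observations, predict S_n for n = 50, with c = 1
x_obs = [0.2, 0.9, 0.4, 0.7, 0.5]
print(dp_posterior_mean_Sn(x_obs, n=50, c=1.0))      # 26.7 exactly
print(dp_posterior_mean_Sn_mc(x_obs, n=50, c=1.0))   # close to 26.7
```

Replacing `totals.mean()` with `totals.var()` gives a Monte Carlo estimate of $\text{Var}[S_n \mid X_1,\dots,X_k]$ under the same model.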

Zen
  • Is it possible to get a nice expression for $\text{Var}[S_n|S_k]$ under this model too? – Cyan Mar 13 '12 at 21:53

Forgive the lack of measure theory and the abuses of notation in what follows...

Since this is Bayesian inference, there must be some prior on the unknown in the problem, which in this case is the distribution of $X_1$, an infinite-dimensional parameter taking values in the set of distributions on $[0, 1]$ (call it $\pi$). By the CLT, the data distribution $S_k|\pi$ is approximately normal with mean $k\,\text{E}_\pi(X_1)$ and variance $k\,\text{Var}_\pi(X_1)$, so if $k$ is large enough (the Berry-Esseen theorem bounds the approximation error) we can just slap in that normal as an approximation. Furthermore, if the approximation is accurate, the only aspect of the prior $p(\pi)$ that matters in practical terms is the induced prior on $(\text{E}_\pi(X_1),\text{Var}_\pi(X_1))=(\mu,\sigma^2)$.

Now we do standard Bayesian prediction and put in the approximate densities. ($S_n$ is subject to the same approximation as $S_k$.)

$p(S_n|S_k) = \int p(\pi|S_k)p(S_n|\pi,S_k)d\pi$

$p(S_n|S_k) = \int \frac{p(\pi)p(S_k|\pi)}{p(S_k)}p(S_n|\pi,S_k)d\pi$

$p(S_n|S_k) \approx \frac{\int p(\mu,\sigma^2)\text{N}(S_k|k\mu,k\sigma^2)\text{N}(S_n|(n-k)\mu + S_k, (n-k)\sigma^2) d(\mu,\sigma^2)}{\int p(\mu,\sigma^2)\text{N}(S_k|k\mu,k\sigma^2) d(\mu,\sigma^2)}$

For the limits of the integral, $\mu \in [0, 1]$, obviously; I think $\sigma^2 \in [0,\frac{1}{4}]$?

Added later: no, $\sigma^2 \in [0,\mu(1-\mu)].$ This is nice -- the allowed values of $\sigma^2$ depend on $\mu$, so info in the data about $\mu$ is relevant to $\sigma^2$ too.
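A minimal numerical sketch of this approximation, assuming a prior that is uniform in $\mu$ and, given $\mu$, uniform in $\sigma^2$ on $(0, \mu(1-\mu)]$ (the numbers below are hypothetical):

```python
import numpy as np
from scipy import stats

def approx_predictive(s_n_grid, s_k, k, n, n_grid=80):
    """Evaluate the approximate p(S_n | S_k) above by summing the normal
    approximation over a grid of (mu, sigma^2).  The implicit prior of this
    grid sum is uniform in mu and, given mu, uniform in sigma^2 on
    (0, mu*(1 - mu)]."""
    s_n_grid = np.asarray(s_n_grid, dtype=float)
    numer = np.zeros_like(s_n_grid)
    denom = 0.0
    for mu in np.linspace(1e-3, 1 - 1e-3, n_grid):
        for sig2 in np.linspace(1e-4, mu * (1 - mu), n_grid):
            lik = stats.norm.pdf(s_k, loc=k * mu, scale=np.sqrt(k * sig2))
            pred = stats.norm.pdf(s_n_grid, loc=(n - k) * mu + s_k,
                                  scale=np.sqrt((n - k) * sig2))
            numer += lik * pred
            denom += lik
    return numer / denom   # approximate posterior predictive density on s_n_grid

# hypothetical numbers: k = 30 observations summing to 18, predict S_n for n = 100
grid = np.linspace(0, 100, 501)
density = approx_predictive(grid, s_k=18.0, k=30, n=100)
```

The posterior predictive mean and credible intervals can then be read off `density` numerically.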

Cyan
  • I don't understand your main paragraph. First of all, the convergence to a normal is only after a shift and rescale of $S_n$ and this is not by the Berry-Esseen theorem (which is a theorem on the *convergence rate* to normal), but the CLT. Furthermore, the shift and rescale will depend on the particular fixed parameter. Have you looked at a case where you have, say, a three point prior uniformly distributed on $\{0,1/2,1\}$? – cardinal Mar 11 '12 at 17:27
  • Let me clarify that when I write "normal" I don't mean standard normal. So the shift and re-scale change the mean and variance but the convergence is still to some element in the family of normal distributions. I meant for the link to the Berry-Esseen theorem to reference the phrase "if $k$ is large enough"; its current placement is a cut-n-paste error, and I'll change it. I don't understand your question about the fixed parameter -- can you clarify the question? – Cyan Mar 11 '12 at 21:20
  • Re: cardinal's question. Note that the prior is a distribution _on distributions_ with support in $[0, 1]$. If I take your question literally, you're asking about a prior that has support on three [constant random variables](http://en.wikipedia.org/wiki/Degenerate_distribution), which is trivial to analyze. But since you wrote in another comment "If the range is ${0,1}$ then any distribution that puts any mass on values not in ${0,1,…,n}$ for the distribution of $S_k$ seems silly," I think you're asking discrete data distributions. The short answer is, "no, it's not silly." Continued... – Cyan Mar 11 '12 at 21:49
  • It's OK to [approximate a discrete distribution with a continuous one](http://en.wikipedia.org/wiki/Binomial_distribution#Normal_approximation). – Cyan Mar 11 '12 at 21:56
  • I think there are several issues here: (a) The question statement could use some refinement to clarify the end goal, (b) the question, comments and answers have, unfortunately, been muddled through inadvertent typos, calculation errors and multiple threads of conversation, and (c) my comments referenced above appear taken a little out of context. My statement regarding $S_k$ (Typo: should have been $S_n$) concerns the *posterior* distribution of $S_n$ given $S_k$. If I know $S_n \in \{S_k,\ldots,n\}$ then any posterior distribution which does not put all its mass there should be inadmissible. – cardinal Mar 11 '12 at 22:09
  • Thanks for the clarification, particularly point (c). On that point, all I can do is repeat that _it really is okay_ to approximate a discrete distribution with a continuous one, even a posterior predictive distribution. Posterior predictive mean and standard deviation work fine; the only question is how to ascribe positive probability mass to the allowed discrete values, and $\Pr(S_n = i|S_k) \approx \frac{p^{*}(i|S_k)}{\sum_{j=0}^n p^{*}(j|S_k)}$ where $p^{*}(\cdot|S_k)$ is the approximate posterior predictive density above should be accurate enough for reasonably wide credible intervals. – Cyan Mar 11 '12 at 23:09
  • Approximation can work fine in practice and is useful in theory, particularly vis a vis asymptotics. But, *again*, I stress my original comments were made within a formal context, not one of post-hoc approximations or asymptotics. – cardinal Mar 11 '12 at 23:18
  • Fair enough. Is there any further aspect of your question, "Have you looked at a case where you have, say, a three point prior uniformly distributed on $\{0,1/2,1\}$?" that you'd like me to address? (I'd move this to chat if I could, but I haven't got enough reputation.) – Cyan Mar 11 '12 at 23:30
  • The main point, admittedly poorly stated, was to consider simple priors that should elucidate edge cases. In making that remark, I was still thinking of the $\{0,1\}$ case with $p$ distributed uniform on $\{0,1/2,1\}$. I realize that was not at all clear from the original comment. I think the most potential benefit and value in this answer would come from addressing your approximation statement more formally. While one sees the heuristic notion you are conveying; in my opinion, it needs some more substantiation to be convincing. – cardinal Mar 12 '12 at 00:02
  • Good point. I usually operate on intuition and simulation -- working things out with a high level of rigor doesn't come naturally to me. – Cyan Mar 12 '12 at 01:23

Let each $X_i$ belong to a distribution family $F$ with parameters $\theta$.

Given $S_k$, we have a posterior distribution on $\theta$:

\begin{align} \Pr(\theta \mid S_k) &= \frac1Z \Pr(\theta)\Pr(S_k \mid \theta) \end{align}

And, our distribution on $S_n$, $n \ge k$ is \begin{align} \Pr(S_n = i \mid S_k) &= \Pr(S_{n-k} = i - S_k | S_k) \\ &= \int \Pr(S_{n-k} = i - S_k | \theta)\Pr(\theta \mid S_k)d\theta \\ \end{align}

(and similarly for $n < k$)

Both of these equations have nice forms when $F$ is an exponential family that is closed under summation of iid elements, such as the normal, gamma, and binomial distributions. It also works for their special cases, such as the exponential and Bernoulli distributions.
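As a concrete instance of the two equations above (a sketch only), take $F$ to be the Bernoulli family with a conjugate $\mathrm{Beta}(a, b)$ prior on $\theta$; both integrals then collapse to a beta-binomial predictive:

```python
from scipy import stats

def predictive_Sn_given_Sk(s_k, k, n, a=1.0, b=1.0):
    """Bernoulli likelihood with a Beta(a, b) prior on theta.
    Posterior:   theta | S_k      ~ Beta(a + s_k, b + k - s_k)
    Predictive:  S_n - S_k | S_k  ~ BetaBinomial(n - k, a + s_k, b + k - s_k)
    Returns {i: Pr(S_n = i | S_k)} for i = 0, ..., n."""
    post_a, post_b = a + s_k, b + k - s_k
    return {i: stats.betabinom.pmf(i - s_k, n - k, post_a, post_b)
            for i in range(n + 1)}

# the k = 1, n = 2, X_1 = 1 example from the comments, with a uniform prior:
probs = predictive_Sn_given_Sk(s_k=1, k=1, n=2)
print(probs[1], probs[2])   # 1/3 and 2/3, as in the comment thread above
```

With `a = b = 1` this reproduces the uniform-prior beta-binomial compound distribution mentioned in the question.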

It might be interesting to consider $F$ to be the family of scaled (by $\frac1n$) binomial distributions with known "trials" $n$, and to take the limit as $n$ goes to infinity.

Neil G