
In a particular task I am given, I have to compute the Bayes estimator for $Bernoulli(\theta)$ random variables $X_1, ..., X_n$. As a prior distribution $p(\theta)$, I have to assume a $Beta(\alpha, \beta)$ distribution. When I multiply the likelihood function by the prior distribution and simplify by dropping the proportionality constants, I get:

$$ p(\theta \vert x_1, ... , x_n )= \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1} $$ where $Y = \sum_i X_i$. Now in the "standard procedure", the next step is to calculate the expected value of this term, in other words $\mathbf{E}(\theta \vert x_1, ... , x_n)$, correct?

So if I do that, I get:

$$\mathbf{E}(\theta \vert x_1, ... , x_n ) = \int_0^{\infty} \theta \cdot \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1} \text{d}\theta= \int_0^{\infty} \theta^{Y+\alpha } (1- \theta)^{n-Y + \beta - 1} \text{d}\theta $$

But when I compute this last integral (just using the "standard integration rule" for polynomials), it diverges to infinity, which would imply that the expected value doesn't exist.

Where am I wrong?

Additional question:

In particular, it is given that
$$ p(\theta \vert x_1, ... , x_n )= \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1} \sim Beta(Y + \alpha, n - Y + \beta)$$

Why is that true? I clearly see that the $\theta^{\,\cdot\,} (1-\theta)^{\,\cdot\,}$ part has the shape of a Beta density and could be written like that, but why can we just drop the constant part (why can we just drop $\frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha) \Gamma(\beta)}$)?

Edit: I want to add a side question to this, since I realised that I had not fully understood this:

From the comments I now know that $$ p(\theta \vert x_1, ... , x_n )= c \cdot \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1} \sim Beta(Y + \alpha, n - Y + \beta)$$ where $c$ is a constant chosen so that $p(\theta \vert x_1, ... , x_n )$ fulfils the requirements of a pdf. Now we know that the expected value of a $Beta(\alpha, \beta)$ distribution is $E(z) = \frac{\alpha}{\alpha + \beta}$. Now in the solution to this exercise, they continue by estimating $\theta$ as follows:

$\hat{\theta} = \frac{Y+ \alpha}{(Y + \alpha) + (n-Y+\beta)}$

Why is that true? Don't we have to multiply this by the constant? That is, isn't it $\hat{\theta} = c \cdot \frac{Y+ \alpha}{(Y + \alpha) + (n-Y+\beta)}$ (by pulling the constant out of the expected value)?

Peter Series

2 Answers


You drop the marginal density $p(x)$ (the normalizing constant) because it is a function of the data, which are fixed in the Bayesian context. Doing so means the resulting expression for the posterior density $p(\theta|x)$ loses some properties, such as integrating to 1 over the domain of $\theta$ (an improper density). This is not a big deal when we are not interested in integrating the function but in maximising it: multiplying the function by a constant does not change the $\theta$ that corresponds to the maximum point (the MAP estimate).
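As a quick illustration (a minimal sketch in Python, with made-up values for $n$, $Y$, $\alpha$, $\beta$ that are not from your exercise), rescaling the unnormalized posterior by any positive constant leaves the maximising $\theta$ unchanged:

```python
import numpy as np

# Made-up example values (not from the exercise)
n, Y = 20, 7            # number of trials, number of successes
alpha, beta = 2.0, 3.0  # Beta prior parameters

theta = np.linspace(1e-6, 1 - 1e-6, 100_000)

# Unnormalized posterior kernel: theta^(Y+alpha-1) * (1-theta)^(n-Y+beta-1)
kernel = theta**(Y + alpha - 1) * (1 - theta)**(n - Y + beta - 1)

# Multiplying by any positive constant does not move the argmax (the MAP)
c = 123.456
print(theta[np.argmax(kernel)])      # ~0.3478
print(theta[np.argmax(c * kernel)])  # ~0.3478

# Closed form: the mode of Beta(a, b) is (a-1)/(a+b-2) for a, b > 1
print((Y + alpha - 1) / (n + alpha + beta - 2))  # 8/23 ~ 0.3478
```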

Now consider a binomial likelihood for $r$ successes ($Y$ as you denote it) in $n$ independent Bernoulli trials, conditional on an unknown success parameter $\theta \in [0,1]$, with prior density $\theta \sim Beta(\alpha,\beta)$. If you drop the constants in the likelihood and the prior, you get the kernel of a Beta density (the posterior); that means the posterior is proportional (not equal) to likelihood $\times$ prior:

$$p(\theta|r,n) \propto {\theta}^r (1-\theta)^{n-r} \cdot {\theta}^{\alpha -1}(1-\theta)^{\beta-1} = {\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$

Now, to make this a proper density (a new Beta density), we multiply it by the constant $c$ that ensures the posterior integrates to 1:

$$p(\theta|r,n)= c\,{\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}$$ Note there is no proportionality anymore. Taking $$c=\frac{\Gamma(n+\alpha+\beta)}{\Gamma(r+\alpha)\Gamma(n-r+\beta)}$$ means that $$ \int_{0}^{1}{\theta}^{r+\alpha-1}(1-\theta)^{n-r+\beta-1}\,d\theta=c^{-1},$$ so $$\theta|r,n \sim Beta(\alpha+r,\beta +n-r)$$ and therefore $E(\theta|r,n)=\frac{\alpha+r}{\alpha+n+\beta}$.
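You can verify all of this numerically (a sketch using scipy, again with made-up values for $n$, $r$, $\alpha$, $\beta$): the kernel integrates to $c^{-1}$, and the normalized density has the stated mean.

```python
import numpy as np
from scipy.integrate import quad
from scipy.special import gamma

# Made-up example values
n, r = 20, 7
alpha, beta = 2.0, 3.0

def kernel(t):
    # theta^(r+alpha-1) * (1-theta)^(n-r+beta-1)
    return t**(r + alpha - 1) * (1 - t)**(n - r + beta - 1)

# c as defined above
c = gamma(n + alpha + beta) / (gamma(r + alpha) * gamma(n - r + beta))

integral, _ = quad(kernel, 0, 1)
print(np.isclose(integral, 1 / c))  # True: the kernel integrates to 1/c

# Posterior mean: integrate theta * c * kernel(theta), compare closed form
mean_numeric, _ = quad(lambda t: t * c * kernel(t), 0, 1)
print(np.isclose(mean_numeric, (alpha + r) / (alpha + n + beta)))  # True
```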

Bahgat Nassour
    Great explanation! But so from your post, what is written above after "In particular, it is given that" is wrong, correct? I especially mean the part after $\sim$. This would only hold true if I would set the constant c as explained in your post, right? – Peter Series Jan 14 '17 at 09:41
    @FabianFalck Yes, it is proportional (not equal) to it, i.e. it should be $$ p(\theta \vert x_1, ... , x_n ) \propto \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1}$$ or $$p(\theta \vert x_1, ... , x_n ) = c\, \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1} \sim Beta(Y + \alpha, n - Y + \beta)$$ – Bahgat Nassour Jan 14 '17 at 10:18
    I give you the credits, since you helped me to understand how to deal with the constant. – Peter Series Jan 14 '17 at 10:28

If you're taking an expectation, you need to explicitly know the pdf you're working with. For the expectation integral you're doing, you don't have the correct integrand because $p(\theta \vert x_1, ... , x_n )= \theta^{Y+\alpha -1 } (1- \theta)^{n-Y + \beta - 1}$ is not a pdf.

You have $p(\theta \mid x_1,\dots,x_n) = \frac{p(x_1,\dots,x_n \mid \theta) p(\theta)}{p(x_1,\dots,x_n)}$ and it's generally easier to drop the denominator to recognize the pdf. We use $\propto$ to represent this proportionality.

Thus, $p(\theta \mid x_1,\dots,x_n) \propto p(x_1,\dots,x_n \mid \theta) p(\theta) = \theta^{ \alpha -1+\sum_{i=1}^n x_i } (1-\theta)^{n+\beta-1 - \sum_{i=1}^n x_i}$. If we had kept the constants in, things would have gotten messy. Then, like you said, we recognize this as proportional to the Beta$(\alpha + Y, n+\beta-Y)$ density, and the missing constant is exactly what ensures that this is indeed a probability density function, i.e. that it integrates to one.
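To see this proportionality concretely, here is a small numerical sketch (made-up data; scipy assumed available): normalizing the likelihood-times-prior product numerically recovers exactly the Beta$(\alpha+Y,\, n+\beta-Y)$ density.

```python
import numpy as np
from scipy.integrate import trapezoid
from scipy.stats import beta as beta_dist

rng = np.random.default_rng(0)
alpha, beta_, n = 2.0, 3.0, 20
x = rng.binomial(1, 0.35, size=n)  # made-up Bernoulli data
Y = x.sum()

theta = np.linspace(1e-6, 1 - 1e-6, 10_000)

# likelihood * prior with all constants dropped
unnormalized = theta**(alpha - 1 + Y) * (1 - theta)**(n + beta_ - 1 - Y)

# normalize numerically, then compare to the Beta(alpha+Y, n+beta-Y) pdf
posterior = unnormalized / trapezoid(unnormalized, theta)
exact = beta_dist.pdf(theta, alpha + Y, n + beta_ - Y)
print(np.allclose(posterior, exact, rtol=1e-4))  # True, up to grid error
```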

So $\theta \mid x_1,\dots,x_n \sim$ Beta$(\alpha + Y, n+\beta-Y)$, and the mean of Beta$(a,b)$ is $a/(a+b)$, so the desired expectation is $\frac{\alpha+Y}{\alpha+n+\beta}$.

You could also get the expectation by integrating directly, writing $B(a,b)$ for the Beta function: \begin{align*} &\int_0^1 \theta \cdot \frac{1}{B(\alpha+Y, n + \beta - Y)} \theta^{\alpha+Y-1} (1-\theta)^{n+\beta-Y-1} \; d \theta \\ &=\frac{B(\alpha+Y+1, n + \beta - Y)}{B(\alpha+Y, n + \beta - Y)} \underbrace{\int_0^1 \frac{1}{B(\alpha+Y+1, n + \beta - Y)} \theta^{\alpha+Y} (1-\theta)^{n+\beta-Y-1} \; d \theta }_{=1} \end{align*} which equals $\frac{B(\alpha+Y+1, n + \beta - Y)}{B(\alpha+Y, n + \beta - Y)} = \frac{\alpha+Y}{ \alpha+n+\beta}$, as we expected.
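Numerically, the ratio of Beta functions matches the closed-form mean (a sketch with made-up values, using scipy's Beta function):

```python
from scipy.special import beta as B  # the Beta function B(a, b)

# Made-up example values
alpha, beta_, n, Y = 2.0, 3.0, 20, 7
a, b = alpha + Y, n + beta_ - Y

print(B(a + 1, b) / B(a, b))              # 0.36, ratio from the integration
print((alpha + Y) / (alpha + n + beta_))  # 0.36, closed-form posterior mean
```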

user365239