
Background
Suppose I have observations $y=\{y_1,y_2,\ldots,y_n\}$ of a binary variable $y_i\in\{0,1\}$ (assumed to be IID). The observations are generated by some complicated and unknown system $\mathcal F$. To simulate/predict the output of the system I use a function $f(\theta)$, parameterized by model parameters $\theta$, leading to model predictions $\hat y_i(\theta)$. My aim is to learn something about the underlying system by interpreting the model parameters $\theta$, i.e. the overall goal is to infer the posterior distribution over $\theta$.
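To make the notation concrete, here is a minimal sketch in Python. The data, the prediction function `f`, and the parameter value are all hypothetical stand-ins (in the real problem $f$ is a complicated, non-linear simulation model); it only illustrates how the predictions $\hat y_i(\theta)$ and the number of correct predictions (called $s(\theta)$ below) arise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for data produced by the unknown system F:
# n IID binary observations y_i in {0, 1}.
n = 10
y = rng.integers(0, 2, size=n)

def f(theta):
    """Toy stand-in for the prediction function f(theta): returns binary
    predictions yhat_i(theta). The real f is a complicated simulation model."""
    scores = np.sin(theta * np.arange(1, n + 1))
    return (scores > 0).astype(int)

theta = 0.7                              # example parameter value
y_hat = f(theta)                         # model predictions yhat_i(theta)
s = int(np.sum(y == y_hat))              # number of correct predictions, s(theta)
print(f"s(theta) = {s} of n = {n}")
```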

Likelihood function
I have considered two different likelihood functions to represent the observations:

Representation 1 - Bernoulli distribution. Represent the data by modeling the sequence of observations:
$$\begin{eqnarray}\displaystyle p(y|\gamma,\theta)=\prod_{i=1}^n\gamma^{I(y_i=\hat y_i(\theta))}(1-\gamma)^{I(y_i\neq\hat y_i(\theta))} = \gamma^{s(\theta)}(1-\gamma)^{n-s(\theta)}\end{eqnarray}, \qquad (1)$$
with $\gamma$ being the probability of success, i.e. of correspondence between model prediction and observation, $I(\cdot)$ being the indicator function that returns 1 if its argument is true and 0 otherwise, and $\displaystyle s(\theta)=\sum_i I(y_i=\hat y_i(\theta))$.

Representation 2 - Binomial distribution. Represent data by modeling the number of successes
$$\begin{eqnarray}\displaystyle p(y|\gamma,\theta)=\frac{n!}{s(\theta)!(n-s(\theta))!}\gamma^{s(\theta)}(1-\gamma)^{n-s(\theta)}\end{eqnarray}. \qquad (2)$$
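As a quick numerical cross-check (a sketch; the values of $n$, $s(\theta)$ and $\gamma$ are arbitrary example choices), the two representations differ only by the binomial coefficient $n!/(s(\theta)!\,(n-s(\theta))!)$:

```python
from scipy.special import comb
from scipy.stats import binom

n, s, gamma = 10, 7, 0.6        # example values of n, s(theta) and gamma

# Representation 1, eq. (1): product of Bernoulli terms
lik_bernoulli = gamma**s * (1 - gamma)**(n - s)

# Representation 2, eq. (2): binomial pmf of the number of successes
lik_binomial = binom.pmf(s, n, gamma)

print(lik_bernoulli)                # eq. (1)
print(lik_binomial)                 # eq. (2)
print(comb(n, s) * lik_bernoulli)   # eq. (1) times n-choose-s equals eq. (2)
```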

Prior
Beta prior
$$\begin{eqnarray}p(\gamma|\alpha,\beta)=\frac{1}{B(\alpha,\beta)}\gamma^{\alpha - 1}(1-\gamma)^{\beta - 1}\end{eqnarray} $$

Marginalize over "success probability"
I am primarily interested in inferences regarding the parameter vector $\theta$ and not $\gamma$. Therefore I marginalize over the success probability $\gamma$
$$\begin{eqnarray}p(y|\theta)=\int_0^1p(y|\gamma,\theta)p(\gamma|\alpha,\beta) d\gamma\end{eqnarray},$$ where I assume $\gamma$ and $\theta$ to be a priori independent.

Using "likelihood representation 1" above I get
$$\begin{eqnarray} \displaystyle p(y|\theta)&=& \int_0^1 \gamma^{s(\theta)}(1-\gamma)^{n-s(\theta)}\frac{1}{B(\alpha,\beta)}\gamma^{\alpha - 1}(1-\gamma)^{\beta - 1} d\gamma \\ &=&\frac{1}{B(\alpha,\beta)}\int_0^1 \gamma^{s(\theta)+\alpha-1}(1-\gamma)^{n-s(\theta)+\beta-1} d\gamma \\ &=& \frac{B(s(\theta)+\alpha,n-s(\theta)+\beta)}{B(\alpha,\beta)}.\qquad (3) \end{eqnarray} $$
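Numerically, eq. (3) is most conveniently evaluated on the log scale; a minimal sketch using `scipy.special.betaln` (the example values of $n$, $\alpha$, $\beta$ are assumptions):

```python
import numpy as np
from scipy.special import betaln

def log_marginal_rep1(s, n, alpha, beta):
    """log p(y|theta) under representation 1, eq. (3):
    log B(s + alpha, n - s + beta) - log B(alpha, beta)."""
    return betaln(s + alpha, n - s + beta) - betaln(alpha, beta)

# example: n = 10 observations, uniform Beta(1, 1) prior on gamma
for s in range(11):
    print(s, np.exp(log_marginal_rep1(s, 10, 1.0, 1.0)))
```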

Using "likelihood representation 2" above I get
$$\begin{eqnarray} \displaystyle p(y|\theta)&=& \int_0^1 \frac{n!}{s(\theta)!(n-s(\theta))!}\gamma^{s(\theta)}(1-\gamma)^{n-s(\theta)}\frac{1}{B(\alpha,\beta)}\gamma^{\alpha - 1}(1-\gamma)^{\beta - 1} d\gamma \\ &=&\frac{n!}{s(\theta)!(n-s(\theta))!B(\alpha,\beta)}\int_0^1 \gamma^{s(\theta)+\alpha-1}(1-\gamma)^{n-s(\theta)+\beta-1} d\gamma \\ &=& \frac{n!B(s(\theta)+\alpha,n-s(\theta)+\beta)}{s(\theta)!(n-s(\theta))!B(\alpha,\beta)}.\qquad (4) \end{eqnarray} $$
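Eq. (4) only adds the log binomial coefficient, which can be written with `gammaln`; the same sketch adapted (note that with $\alpha=\beta=1$ the result is constant in $s(\theta)$, namely $1/(n+1)$, which matches panel (B) below):

```python
import numpy as np
from scipy.special import betaln, gammaln

def log_marginal_rep2(s, n, alpha, beta):
    """log p(y|theta) under representation 2, eq. (4):
    log C(n, s) + log B(s + alpha, n - s + beta) - log B(alpha, beta)."""
    log_binom_coeff = gammaln(n + 1) - gammaln(s + 1) - gammaln(n - s + 1)
    return log_binom_coeff + betaln(s + alpha, n - s + beta) - betaln(alpha, beta)

# example: n = 10 observations, uniform Beta(1, 1) prior on gamma
for s in range(11):
    print(s, np.exp(log_marginal_rep2(s, 10, 1.0, 1.0)))   # constant 1/(n+1)
```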
Below, I plot eq. (3) in (A) and eq. (4) in (B), and their logarithms in (C) and (D), respectively, for an example with $n=10$ observations:

[Figure: $p(y|\theta)$ as a function of $s(\theta)$ for $n=10$; panels (A) eq. (3), (B) eq. (4), (C) log of eq. (3), (D) log of eq. (4).]
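For reference, the four panels can be reproduced with a short script along these lines (a sketch that recomputes eqs. (3) and (4) on the log scale; $n=10$ and a uniform Beta(1, 1) prior are assumed):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.special import betaln, gammaln

def log_p_rep1(s, n, a, b):                      # eq. (3)
    return betaln(s + a, n - s + b) - betaln(a, b)

def log_p_rep2(s, n, a, b):                      # eq. (4)
    return (gammaln(n + 1) - gammaln(s + 1) - gammaln(n - s + 1)
            + betaln(s + a, n - s + b) - betaln(a, b))

n, a, b = 10, 1.0, 1.0                           # n = 10, uniform Beta(1, 1) prior
s = np.arange(n + 1)

fig, axes = plt.subplots(2, 2, figsize=(8, 6))
axes[0, 0].plot(s, np.exp(log_p_rep1(s, n, a, b)), "o-")   # (A) eq. (3)
axes[0, 1].plot(s, np.exp(log_p_rep2(s, n, a, b)), "o-")   # (B) eq. (4)
axes[1, 0].plot(s, log_p_rep1(s, n, a, b), "o-")           # (C) log of eq. (3)
axes[1, 1].plot(s, log_p_rep2(s, n, a, b), "o-")           # (D) log of eq. (4)
for ax, title in zip(axes.ravel(), ["(A) eq. (3)", "(B) eq. (4)",
                                    "(C) log eq. (3)", "(D) log eq. (4)"]):
    ax.set_title(title)
    ax.set_xlabel(r"$s(\theta)$")
plt.tight_layout()
plt.show()
```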

The two likelihood representations in eqs. (1)-(2) lead to quite different behavior of $p(y|\theta)$. For example, with a uniform prior ($\alpha = 1, \beta = 1$), $p(y|\theta)$ is uniform in $s(\theta)$ under likelihood representation 2, eq. (2) (plots B, D), whereas likelihood representation 1, eq. (1), concentrates most of the probability mass of $p(y|\theta)$ near low and high values of $s(\theta)$ (plots A, C).

If I trust my model and believe it should perform better than chance, I can incorporate this assumption into the probabilistic model by changing the hyperparameters of the prior over $\gamma$. Fixing $\beta=1$ and letting $\alpha>1$, we see that likelihood representation 2, eq. (2), makes $p(y|\theta)$ monotonically increasing as a function of the number of successes $s(\theta)$ (plots B, D). Likelihood representation 1, eq. (1), also biases the probability mass towards large values of $s(\theta)$; however, we need $\alpha\geq n+\beta$ before $p(y|\theta)$ increases monotonically in $s(\theta)$ (a numerical check is sketched below).
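The threshold can be checked numerically; a minimal sketch for $n=10$, $\beta=1$ and a few values of $\alpha$:

```python
import numpy as np
from scipy.special import betaln

def log_p_rep1(s, n, a, b):                      # eq. (3)
    return betaln(s + a, n - s + b) - betaln(a, b)

n, beta = 10, 1.0
s = np.arange(n + 1)

for alpha in [2.0, 5.0, n + beta]:               # n + beta = 11 is the claimed threshold
    vals = log_p_rep1(s, n, alpha, beta)
    monotone = bool(np.all(np.diff(vals) > 0))
    print(f"alpha = {alpha:4.1f}: strictly increasing in s? {monotone}")
```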

Questions
1) Theoretical: Are there any theoretical reasons for preferring either likelihood representation 1, eq. (1), or likelihood representation 2, eq. (2)?

2) Practical: The above probabilistic model is part of a larger probabilistic setup, in which I also model the prior over $\theta$ as well as other types of observations. For inference over $\theta$ I use MCMC. With such an iterative procedure, one concern with likelihood representation 1, eq. (1), is that the sampler could get trapped in regions with low $s(\theta)$ because of the non-monotonic behavior of $p(y|\theta)$ when $\alpha<n+\beta$ (imagine $p(y|\theta)$ being sharply peaked at both low and high numbers of successes). Of course I could assign a high value to $\alpha$, but this, in turn, concentrates much of the probability mass near maximal values of $s(\theta)$ relative to likelihood representation 2 (compare A vs. B). Would this concern speak in favor of using likelihood representation 2?
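To make the practical setting concrete, here is a minimal random-walk Metropolis sketch over a scalar $\theta$. Everything specific in it (the toy data, the toy prediction function `f`, the standard-normal prior on $\theta$, the proposal scale) is a hypothetical stand-in rather than part of the actual model; the target uses eq. (3), and eq. (4) can be swapped in by adding the log binomial coefficient.

```python
import numpy as np
from scipy.special import betaln

rng = np.random.default_rng(1)

# Hypothetical data and toy prediction function f(theta) (stand-ins for the real setup)
n = 10
y = rng.integers(0, 2, size=n)

def f(theta):
    return (np.sin(theta * np.arange(1, n + 1)) > 0).astype(int)

def log_target(theta, alpha=1.0, beta=1.0):
    """log p(y|theta) + log p(theta), with p(y|theta) from eq. (3)
    and a standard-normal prior on theta as a placeholder."""
    s = np.sum(y == f(theta))
    log_lik = betaln(s + alpha, n - s + beta) - betaln(alpha, beta)
    log_prior = -0.5 * theta**2
    return log_lik + log_prior

# Random-walk Metropolis over theta
theta, samples = 0.0, []
log_p = log_target(theta)
for _ in range(5000):
    prop = theta + 0.5 * rng.standard_normal()
    log_p_prop = log_target(prop)
    if np.log(rng.uniform()) < log_p_prop - log_p:
        theta, log_p = prop, log_p_prop
    samples.append(theta)

print("posterior mean of theta (toy example):", np.mean(samples[1000:]))
```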

u144119
  • Are you familiar with the [beta-binomial model](http://stats.stackexchange.com/questions/238571/taking-into-account-the-uncertainty-of-p-when-estimating-the-mean-of-a-binomial/238582#238582)? It gives a closed-form solution to this problem; MCMC is not needed here. – Tim Jan 05 '17 at 10:06
  • @Tim, many thanks for your comment. I see that eq. (4) above is equal to the beta-binomial model. However, my goal is not to infer the posterior distribution over the success probability parameter but rather to infer the posterior distribution over the parameter vector $\theta$ governing my prediction function $f$. I do not have a fixed number of successes. Rather, the number of successes depends on the predictions of the model $f$, which in turn depends (non-linearly) on the parameter vector $\theta$. I also have a prior over the model parameter $\theta$. This is why I use MCMC. – u144119 Jan 05 '17 at 10:45
  • Then I do not understand your description. Are you dealing with i.i.d. r.v.'s? If not (e.g. the $\theta$'s differ), then the binomial representation is incorrect. If the number of trials is unknown, then using the binomial may be inappropriate (maybe you should use the negative binomial, etc.). – Tim Jan 05 '17 at 11:10
  • @Tim, but $\theta$ are the parameters of the prediction function $f$ and $\gamma$ is the parameter of the Bernoulli/binomial likelihoods. Does this make sense? Please let me know if I should clarify the Background description. – u144119 Jan 05 '17 at 11:20
  • Then what are $f$ and $s$? If I understand you correctly then what you observe is some data generated by $f$. If you are not interested at all in estimating parameters of binomial distribution, then what is the purpose of introducing such model? How would estimating $\gamma$ help in estimating $\theta$? – Tim Jan 05 '17 at 11:36
  • @Tim, observations are generated by some complicated and unknown system $\mathcal F$, and $f(\theta)$ is a function that I use to predict/simulate the output of the system. $s$ is the number of observations for which model prediction and observation agree. I introduce the models in eq. (1) or (2) in order to compare the predictions of the simulation model $f(\theta)$ with the observations. Hence, the purpose is to use eq. (1) or (2) as a measure that guides us towards model parameters $\theta$ that give good correspondence between model predictions and observations. – u144119 Jan 05 '17 at 11:58
  • An analogy for continuous y's could be a regression model, parameterized by parameter vector $\theta$, where we are interested in $p(\theta|y)$ and model the likelihood by a Gaussian distribution with mean $\mu(\theta)$ and variance $\sigma^2$, and then impose some prior over $\sigma^2$ (and also a prior over $\theta$). – u144119 Jan 05 '17 at 12:04
  • But then, from what you're saying, it seems that the distribution of your data is *fully determined* by $\gamma$, and so $\theta$ must be a deterministic function of $\gamma$; by learning about $\gamma$ you learn about $\theta$ and its distribution. – Tim Jan 05 '17 at 14:30

1 Answer


The binomial distribution is the distribution of a sum of i.i.d. Bernoulli random variables, and the two likelihoods are equivalent. If you are getting different results under the two representations of your data, then you are doing something wrong.

Moreover, unless I misunderstand your problem, you are overcomplicating things. The beta distribution is a conjugate prior for the success-probability parameter of the binomial distribution (your $\gamma$), which leads to the beta-binomial model. Under the beta-binomial model, the posterior predictive distribution of your data follows a beta-binomial distribution parametrized by the known number of trials $n$ and parameters

$$ \alpha' = \alpha + \text{total number of successes} \\ \beta' = \beta + \text{total number of failures} $$

where $\alpha,\beta$ are the parameters of the beta prior. Marginally, the success probability is beta-distributed with parameters $\alpha',\beta'$. MCMC is not needed here, since by conjugacy the posterior distribution is available in closed form.
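In code, the closed-form update described here amounts to the following (a sketch; the observed counts are assumed example values):

```python
from scipy.stats import beta, betabinom

alpha_prior, beta_prior = 1.0, 1.0       # Beta prior hyperparameters
n, successes = 10, 7                     # assumed observed counts
failures = n - successes

alpha_post = alpha_prior + successes     # alpha'
beta_post = beta_prior + failures        # beta'

posterior = beta(alpha_post, beta_post)              # posterior over the success probability
predictive = betabinom(n, alpha_post, beta_post)     # posterior predictive for n new trials

print("posterior mean:", posterior.mean())
print("P(8 successes in 10 new trials):", predictive.pmf(8))
```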

Tim
  • Thank you for providing this answer. However, I get different results under the two representations of the data because I considered the number of successes to be a function of the parameter vector $\theta$ governing the simulation model $f(\theta)$, rather than a fixed number of observed successes: hence the representation in eq. (4) is equivalent to eq. (3) scaled by the factor $n!/(s(\theta)!\,(n-s(\theta))!)$. (Note that $\theta$ is the parameter governing the simulation model; refer to the comments above.) – u144119 Jan 06 '17 at 07:34
  • @PMR but those two models are equivalent, so if you're getting different results then you're doing something wrong, as noted in the comments and in my answer. – Tim Jan 06 '17 at 07:44