Modelling Parameter $r = \max\limits_{i = 1, \dots , 10} p_i - \min\limits_{i = 1, \dots , 10} p_i$ of Binomial Random Variable in Stan/RStan/R

Question

I'm trying to use Stan and R to fit a model for the observed realisations $y_i = 16, 9, 10, 13, 19, 20, 18, 17, 35, 55$, which come from binomially distributed random variables $Y_i$ with parameters $m_i$ (the number of trials) and $p_i$ (the probability of success). So we have $Y_i \sim \text{Binomial}(m_i, p_i)$ for $1 \le i \le 10$.

For the purposes of this experiment, I'm going to assume that all of the $m_i$ are fixed and given by $m_i = 74, 99, 58, 70, 122, 77, 104, 129, 308, 119$.

I'm going to use the Jeffreys prior for each $p_i$, that is, $p_i \sim \text{Beta}(\alpha, \beta)$ with $\alpha=0.5$ and $\beta=0.5$.

I'm trying to:

1. Find the range of the $p_i$ (i.e., the parameter $r = \max\limits_{i = 1, \dots , 10} p_i - \min\limits_{i = 1, \dots , 10} p_i$).
2. Plot the posterior density of $r$.
3. Find a Bayesian estimate for $r$.
4. Find the standard deviation of the posterior distribution of $r$.

I will be using Stan/RStan/R to do this.

My code for this is as follows:

```{r}
library(rstan)
library(bayesplot)
```

```{stan output.var="BinMod_beta"}
  data {
    int <lower = 1> mi[10];
    int <lower = 0> yi[10];
    real <lower = 0> alpha;
    real <lower = 0> beta;
  }

  parameters {
    real <lower = 0, upper = 1> p[10];
  }

  transformed parameters {
    real r;
    real mx = max(p);
    real mn = min(p);
    r = mx - mn;
  }

  model {
    yi ~ binomial(mi, p);
    p ~ beta(alpha, beta);
  }
```

```{r}
data.in <- list(mi = c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119), yi = c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55), alpha = 0.5, beta = 0.5)
model.fit1 <- sampling(BinMod_beta, data=data.in)
```

```{r}
print(model.fit1, pars = c("p", "r"), probs=c(0.1,0.5,0.9), digits = 5)
```

```{r, out.width="0.8\\textwidth", fig.align='center'}
posterior <- as.array(model.fit1)  # posterior draws in the array form bayesplot accepts
mcmc_areas(posterior, pars="r", point_est="mean")
```
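As a cross-check that does not need Stan, one can use the fact that a Beta prior is conjugate to the binomial likelihood: with independent $\text{Beta}(0.5, 0.5)$ priors, each $p_i$ has posterior $\text{Beta}(0.5 + y_i,\ 0.5 + m_i - y_i)$, independently of the others. The following is a minimal sketch in base R, not the code used above; the seed, the number of draws (`n_draws`), and all variable names are assumptions chosen for illustration.

```r
# Data from the question.
m <- c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119)
y <- c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)
a0 <- 0.5  # Jeffreys prior: Beta(0.5, 0.5)
b0 <- 0.5

set.seed(1)      # arbitrary seed, only for reproducibility (assumption)
n_draws <- 4000  # assumed number of posterior draws

# Conjugacy: p_i | y_i ~ Beta(a0 + y_i, b0 + m_i - y_i), independently over i.
# Result is an n_draws x 10 matrix; column i holds the draws of p_i.
p_draws <- sapply(seq_along(y), function(i) {
  rbeta(n_draws, a0 + y[i], b0 + m[i] - y[i])
})

# The range is computed within each draw, then summarised across draws.
r_draws <- apply(p_draws, 1, max) - apply(p_draws, 1, min)

r_mean <- mean(r_draws)         # posterior mean of r (one Bayesian point estimate)
r_sd <- sd(r_draws)             # posterior standard deviation of r
r_density <- density(r_draws)   # kernel density estimate of the posterior of r

print(c(r_mean = r_mean, r_sd = r_sd))
```

If the Stan fit is healthy, `r_mean` and `r_sd` should be close to the `mean` and `sd` reported for `r` by `print(model.fit1, ...)`, up to Monte Carlo error, and `plot(r_density)` should resemble the `mcmc_areas` plot.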

My plot of the posterior density of $r$ is

I thought I had gone about this correctly, until I looked at the values I was getting:

The minimum value for $p_i$ in this table is $0.09535$, and the maximum value for $p_i$ in the table is $0.46167$. This would give us $r = 0.46167 - 0.09535 = 0.36632 \neq 0.37543$, the value reported for $r$. So did I do something wrong here? I only recently started learning MCMC, simulations, and Stan, so it's not clear to me whether I've done anything incorrectly.

I would greatly appreciate it if people could please take the time to review this and provide feedback.

EDIT: Results of model with single $p$ instead of individual $p_i$s:

score 2 · Accepted Answer · answered Apr 24 '18 at 18:55

$r$ is a non-linear function of the $p_i$, so its posterior mean need not equal that function applied to the posterior means of the $p_i$. We shouldn't be surprised by this inequality.

Also, if you look at the quantiles, there is considerable overlap in the posterior distributions of the $p_i$s, but your reasoning seems to be conditional on a fixed ordering.
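To spell out the first point, here is a short derivation that uses only the definitions in the question. In every posterior draw, $\max_i p_i \ge p_j$ and $\min_i p_i \le p_j$ for each $j$. Taking posterior expectations of both sides and then choosing the best $j$ gives

$$
\begin{aligned}
\mathrm{E}\Big[\max_i p_i \,\Big|\, y\Big] &\ge \max_j \mathrm{E}[p_j \mid y], \\
\mathrm{E}\Big[\min_i p_i \,\Big|\, y\Big] &\le \min_j \mathrm{E}[p_j \mid y], \\
\mathrm{E}[r \mid y] = \mathrm{E}\Big[\max_i p_i \,\Big|\, y\Big] - \mathrm{E}\Big[\min_i p_i \,\Big|\, y\Big] &\ge \max_j \mathrm{E}[p_j \mid y] - \min_j \mathrm{E}[p_j \mid y].
\end{aligned}
$$

The numbers quoted in the question are consistent with this: $0.37543 \ge 0.46167 - 0.09535 = 0.36632$. Equality would hold only if the same $p_i$ were the largest, and the same $p_i$ the smallest, in essentially every draw, which is the "fixed ordering" mentioned above.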

HStamper


Thanks for the response. Here's an interesting question: Since there is, according to the quantiles, considerable overlap in the posterior distributions of the $p_i$s, does that mean that, instead of simulating/modelling the individual $p_i$s, it would have been more appropriate, for this data, to model just a single $p$? – The Pointer Apr 24 '18 at 19:23

$p_2$ and $p_{10}$ show a pretty clear difference, no? Whether collapsing non-distinguishable parameters is appropriate or not depends on the context and the objective of the analysis. – HStamper Apr 24 '18 at 19:35
Yes, you are indeed correct. I have edited the main post with the results of the model with a single $p$. Wouldn't the relatively low standard deviation of the posterior distribution of $p$ indicate that it is a better model to select for this analysis than the model of the individual $p_i$s? Or no? – The Pointer Apr 24 '18 at 19:40
Generally, the posterior is more highly peaked if you use a pooled model. Your results are conditional on how you choose to model the data and if you choose a simpler model, you can generally fit it more precisely. That does not make it more appropriate; you might just be fitting a bad model really precisely. – HStamper Apr 24 '18 at 19:47
Hmm. So, in this case, which model would you say is more appropriate? – The Pointer Apr 24 '18 at 19:50
You didn't provide any context or specify an objective for the analysis. – HStamper Apr 24 '18 at 19:58
What if the $y_i$ represent the number of red cars, and the $m_i$ represent the total number of cars? – The Pointer Apr 24 '18 at 20:02
Based on significant differences between 2 and 10, I would have said that it is better to model the individual $p_i$s. But then you mentioned that the posterior is more highly peaked if using a pooled model, so wouldn't that suggest that the pooled model would more accurately measure the mean? So I'm confused in this regard, since these two facts seem contradictory for choosing one model as more appropriate over the other. – The Pointer Apr 24 '18 at 20:07
Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/76522/discussion-between-eric-mittman-and-the-pointer). – HStamper Apr 24 '18 at 20:10
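The point about a pooled model being "more highly peaked" can be illustrated with closed-form Beta posteriors. This is only a sketch under stated assumptions: the pooled model's code is not shown in the thread, so it is assumed here to use a single $p$ with the same $\text{Beta}(0.5, 0.5)$ prior, which gives the posterior $\text{Beta}(0.5 + \sum y_i,\ 0.5 + \sum (m_i - y_i))$. The helper `beta_sd`, the seed, and the number of draws are hypothetical choices for illustration.

```r
# Data from the question.
m <- c(74, 99, 58, 70, 122, 77, 104, 129, 308, 119)
y <- c(16, 9, 10, 13, 19, 20, 18, 17, 35, 55)
a0 <- 0.5
b0 <- 0.5

# Standard deviation of a Beta(a, b) distribution (hypothetical helper).
beta_sd <- function(a, b) sqrt(a * b / ((a + b)^2 * (a + b + 1)))

# Unpooled model: one posterior standard deviation per p_i.
sd_unpooled <- beta_sd(a0 + y, b0 + m - y)

# Pooled model (assumed form): a single p informed by all trials at once.
sd_pooled <- beta_sd(a0 + sum(y), b0 + sum(m) - sum(y))

# Under the unpooled model, how often does p_10 exceed p_2 across draws?
set.seed(1)      # arbitrary seed (assumption)
n_draws <- 4000  # assumed number of draws
p2 <- rbeta(n_draws, a0 + y[2], b0 + m[2] - y[2])
p10 <- rbeta(n_draws, a0 + y[10], b0 + m[10] - y[10])
prob_p10_gt_p2 <- mean(p10 > p2)

print(round(c(sd_pooled = sd_pooled, min_sd_unpooled = min(sd_unpooled),
              prob_p10_gt_p2 = prob_p10_gt_p2), 4))
```

The pooled standard deviation is smaller than every unpooled one simply because the pooled posterior uses all trials to estimate one number. That says nothing about whether a single $p$ describes the data well; the estimated probability that $p_{10} > p_2$ speaks to that question instead.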
