1

I would need help in order to write a specific Stan model.

The biological question

The idea of the model is modeling the number of Bones (NbBones : discret continuous count) according to Orders (3 different discret orders) and Time (continuous). And then show that The Orders 2 has a number of bones greater than the other Orders.

The Baysian model

I'm trying to write a GLM negative binomial model with zero inflated data since the mean of NbBones is not equal to the variance because of a huge number of zeros, and since within these zeros, I have a proportion p that is coming from a specific category (Bones was destroyed and then is not observable, and bones was not find because it never existed (but observable).

So far I wrote this model such as :

model
{
  for(i in 1 : N) 
  {
    observable[i] ~ dbern(p)
    NbBones[i] ~ dpois(nu[i] * theta[i] * observable[i])  
    theta[i] <-  intercept + Coef_Orders[Orders[i]] + Coef_Time*Time[i] 
    nu[i] ~ dgamma(1/alpha, 1/alpha)
  }
  p ~ dunif(0, 1)
  alpha ~ dunif(0, 10)
  Coef_Orders[1] <- 0
  Coef_Orders[2] ~ dexp(1)
  Coef_Orders[3] ~ dexp(1) 
  Coef_Time~ dexp(1)
  intercept ~ dexp(1)
}

Where in the negative binomial model with excess zeros, it is considered that only a proportion p of individuals are likely to have bones (observables). For observable individuals (p %), the number of bones follows a negative binomial distribution (allowing 0's, as an observable individual may not have any bones). Unobservable individuals (1 -p %) have a number of bones following a Poisson distribution with an expectation of 0 (so their number of bones is necessarily zero).

My question

But i cannot manage to write it on stan, can someone help me ? (I found thid code :

stan_glm1 <- stan_glm(NbBones ~ Orders * Time,
                      data = data, family = neg_binomial_2,
                      prior = normal(0, 2.5),
                      prior_intercept = normal(0, 5),
                      seed = 12345)

But it does not include the zero inflated model...

Thanks a lot for your help and time :-)

Here are the data :

  Orders    Time NbBones
1      1 0.21771       0
2      1 0.08723       0
3      1 0.03911       0
4      1 0.08236       2
5      1 0.08744       6
6      1 0.00417       1

SO far I tried:

//Declare the data 
data{
  int<lower=0> N;//
  real Time[N]; //
  real NbBones[N]; //
}
//Declare the parameters 
parameters{
  real  theta; //continuous unconstrained variable 
  real<lower=0>observable;//constrained to not be negative 
}
//Model block - declare the likelihood and priors 
model{
  //likelihood
  for(i in 1 : N){
    NbBones[i] ~poisson(theta[i] * observable[i]);
    theta[i] <-  intercept + Coef_orders[Orders[i]] + Coef_Time*Time[i];
    observable[i] ~ bernoulli(p);
  }
  //Priors 
  p ~ uniform(0, 1);
  alpha ~ uniform(0, 10);
  Coef_orders[1] <- 0;
  Coef_orders[2] ~ exponential(1);
  Coef_orders[3] ~ exponential(1) ;
  Coef_Time  ~ exponential(1);
  intercept ~ exponential(1);
}

In dput format if you want a toy example :

> dput(data)
structure(list(Orders = c("1", "1", "1", "1", "1", "1", "1", 
"1", "1", "3", "1", "2", "1", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "3", "2", "2", "3", "2", "3", "3", "1", "3", "1", "1", "3", 
"3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", "3", 
"3", "3", "3", "3", "3", "3", "1", "2", "1", "1", "1", "1", "2", 
"2", "3", "1", "3", "3", "1", "1", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "1", "2", "2", "2", "2", "2", "2", "2", "2", "1", 
"1", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", 
"3", "2", "2", "2", "2", "2", "2", "2", "2", "2", "2", "3", "3", 
"2", "2", "3", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", "1", 
"1", "1", "1", "1", "1", "1"), Time = c(0.21771, 0.08723, 0.03911, 
0.08236, 0.08744, 0.00417, 0.0162, 0.11631, 0.01712, 0.18276, 
0.0169, 0.01049, 0.04138, 0.03169, 0.05169, 0.01718, 0.11169, 
0.04654, 0.09138, 0.23242, 0.08843, 0.0185, 0.27725, 0.01363, 
0.01198, 0.06394, 0.0218, 0.05122, 0.09829, 0.07324, 0.07186, 
0.05843, 0.06693, 0.18545, 0.18037, 0.00655, 0.0122, 0.00572, 
0.01076, 0.07204, 0.00957, 0.00579, 0.02072, 0.01819, 0.02319, 
0.00948, 0.00263, 0.00698, 0.00364, 0.00502, 0.00597, 0.07171, 
0.09189, 0.01643, 0.00803, 0.00688, 0.00864, 0.01067, 0.01123, 
0.0078, 0.01015, 0.00944, 0.00755, 0.0227, 0.02841, 0.1087, 0.07557, 
0.04564, 0.03514, 0.00946, 0.011, 0.01385, 0.01837, 0.09937, 
0.10749, 0.01513, 0.01839, 0.01219, 0.01021, 0.01, 0.00844, 0.18295, 
0.22844, 0.12524, 0.14649, 0.03092, 0.02957, 0.05148, 0.08305, 
0.02372, 0.04403, 0.06083, 0.05597, 0.04202, 0.01476, 0.03361, 
0.04169, 0.00828, 0.01348, 0.06318, 0.06703, 0.02774, 0.03293, 
0.27072, 0.11012, 0.05687, 0.01636, 0.03637, 0.04128, 0.01157, 
0.01237, 0.13546, 0.01796, 0.07682, 0.0674, 0.188, 0.13317, 0.01142, 
0.01522, 0, 0, 0.01998, 0.07889, 0.02845, 0.08173, 0.01455, 0.06392, 
0.03808, 0.08862, 0.05304, 0.04987, 0.02237, 0.02612, 0.06138, 
0.05949, 0.00959, 0.01014, 0.00687, 0.00701, 0.21064, 0.01838, 
0.14369, 0.0136, 0.01749, 0.02122, 0.06394, 0.07276, 0.02882, 
0.01453, 0.05646, 0.0174, 0.0714, 0.05069, 0.03303, 0.01158, 
0.04024, 0.03759, 0.06414, 0.00566, 0.06868, 0.01201, 0.03361, 
0.03775, 0.03854, 0.00506, 0.01818, 0.03194, 0.04363, 0.01267, 
0.01589, 0.00328, 0.00965, 0.00229, 0.01062, 0.00908, 0.02851, 
0.02654, 0.00732, 0.01082, 0.02281, 0.02332, 0.04561, 0.01981, 
0.0493, 0.04335, 0.07554, 0.0074, 0.00729, 0.00631, 0.0631, 0.10638, 
0.03161, 0.02191, 0.00658, 0.00831, 0.03528, 0.01206, 0.02015, 
0.01668, 0.00712, 0.00762, 0.01905, 0.0512, 0.05175, 0.00956, 
0.02592, 0.00313, 0.00415, 0.0024, 0.00201, 0.00077, 0.00415, 
6e-04, 0.00275, 0.00109, 0.0023, 0.00156, 0.00601, 0.00223, 0.03884, 
0.00295, 0.04161, 0.02727, 0.00582, 0.00725, 0.01396, 0.00538, 
0.03965, 0.01708, 0.01833, 0.02333, 0.03832, 0.02266, 0.02942, 
0.00625, 0.0192, 0.00577, 0.01567, 0.00193, 0.01579, 0.00174, 
0.01351, 0.00787, 0.00599, 0.00326, 0.00456, 0.00482), NbBones = c(0L, 
0L, 0L, 2L, 6L, 1L, 0L, 3L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 2L, 3L, 3L, 2L, 5L, 
4L, 4L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 2L, 2L, 1L, 0L, 1L, 0L, 
1L, 0L, 0L, 0L, 2L, 2L, 2L, 2L, 4L, 3L, 0L, 0L, 0L, 2L, 2L, 3L, 
0L, 1L, 0L, 0L, 1L, 0L, 2L, 3L, 1L, 0L, 1L, 3L, 2L, 3L, 2L, 0L, 
2L, 0L, 4L, 2L, 2L, 2L, 1L, 3L, 1L, 1L, 2L, 6L, 6L, 1L, 5L, 4L, 
2L, 1L, 0L, 0L, 0L, 1L, 4L, 0L, 2L, 0L, 3L, 1L, 0L, 2L, 2L, 0L, 
5L, 4L, 1L, 1L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 0L, 
3L, 5L, 3L, 5L, 3L, 1L, 0L, 1L, 4L, 3L, 3L, 0L, 0L, 0L, 0L, 2L, 
0L, 0L, 2L, 0L, 1L, 1L, 2L, 1L, 1L, 1L, 2L, 0L, 3L, 0L, 1L, 0L, 
2L, 2L, 1L, 0L, 0L, 0L, 1L, 0L, 3L, 0L, 1L, 0L, 2L, 0L, 0L, 0L, 
2L, 1L, 0L, 1L, 1L, 0L, 2L, 2L, 0L, 0L, 0L, 0L, 3L, 4L, 0L, 0L, 
1L, 0L, 1L, 0L, 1L, 0L, 2L, 0L, 4L, 4L, 0L, 0L, 0L, 0L, 1L, 1L, 
0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 2L, 1L, 
0L, 0L, 0L, 1L, 2L, 1L, 2L, 0L, 0L, 0L, 0L, 0L, 1L, 0L, 1L, 0L, 
0L, 0L, 0L, 0L, 0L, 0L)), row.names = c("1", "2", "3", "4", "5", 
"6", "7", "8", "9", "10", "11", "12", "13", "14", "15", "16", 
"17", "18", "19", "20", "21", "22", "23", "24", "25", "26", "27", 
"28", "29", "30", "31", "32", "33", "34", "35", "36", "37", "38", 
"39", "40", "41", "42", "43", "44", "45", "46", "47", "48", "49", 
"50", "51", "52", "53", "54", "55", "56", "57", "58", "59", "60", 
"61", "62", "63", "64", "65", "66", "67", "68", "69", "70", "71", 
"72", "73", "74", "75", "76", "77", "78", "79", "80", "81", "82", 
"83", "84", "85", "86", "87", "88", "89", "90", "91", "92", "93", 
"94", "95", "96", "97", "98", "99", "100", "101", "102", "103", 
"104", "105", "106", "107", "108", "109", "110", "111", "112", 
"113", "114", "115", "116", "117", "118", "119", "120", "121", 
"122", "123", "124", "125", "126", "127", "128", "129", "130", 
"131", "132", "133", "134", "135", "136", "137", "138", "139", 
"140", "141", "142", "143", "144", "145", "146", "147", "148", 
"149", "150", "151", "152", "153", "154", "155", "156", "157", 
"158", "159", "160", "161", "162", "163", "164", "165", "166", 
"167", "168", "169", "170", "171", "172", "173", "174", "175", 
"176", "177", "178", "179", "180", "181", "182", "183", "184", 
"185", "186", "187", "188", "189", "190", "191", "192", "193", 
"194", "195", "196", "197", "198", "199", "200", "201", "202", 
"203", "204", "205", "206", "207", "208", "209", "210", "211", 
"212", "213", "214", "215", "216", "217", "218", "219", "220", 
"221", "222", "223", "224", "225", "226", "227", "228", "229", 
"230", "231", "232", "233", "234", "235", "236", "237", "238", 
"239", "240", "241", "242", "243", "244", "245", "246", "247"
), class = "data.frame")
chippycentra
  • 193
  • 6

1 Answers1

3

I don't think your Stan model is doing what you want it to do.

The density for such a model looks like

$$ f(y \vert X) = \begin{cases} \theta + (1-\theta)\cdot \operatorname{Poisson}(y \vert \exp(X\beta)) & \text{if}\> y=0\\ (1-\theta)\cdot\operatorname{Poisson}(y \vert \exp(X\beta)) \quad &\text{else} \end{cases}$$

Here, $\theta$ is the probability of getting a 0 from the binomial process. We can break this down a little more when writing a Stan program.

In our data block, we need to pass all the things we need to fit the model (including the outcome, a design matrix, and the number of rows).

That might look like

data{
  int N;
  matrix[N, 4] X;
  int y[N];
}

It doesn't matter that we don't use the names in the data. I'm using y and X for consistency with the mathematical model.

Next, we need to declare our parameters. There are two main ones: $\theta$ which is on the unit interval, and $\beta$ which is a vector. $\beta$ must have the same number of elements as X has columns.

parameters{
  real<lower=0, upper=1> theta;

  vector[4] beta;
}

From Poisson regression, we know that $\log(E(y)) = X\beta$ and so $E(y) = \lambda(X) = \exp(X\beta)$. We can compute this in the transformed parameters block. You could do this in the model block and save yourself the trouble of persisting $\lambda$ through each computation, but this is a bit cleaner in my opinion.

transformed parameters{
  vector[N] lambda = exp(X*beta);
}

Now comes the model block. There is a little trick we need to apply, called the log_sum_exp trick. It adds log probabilities on the linear scale (which we need to do when $y=0$. Remember, Stan works by accumulating log probabilities).

The model block looks like

model{
  beta ~ std_normal();
  theta ~ beta(1,1);
  
  for(n in 1:N){
  
    if(y[n]==0){
    target+= log_sum_exp(bernoulli_lpmf(1 | theta), bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda[n]));
    }
    else{
    target += bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n]| lambda[n]);
    }
    
  }

Here, I've been lazy and placed a standard normal prior on each coefficient of our model, and a uniform prior on the probability of getting a 0 (I'm sure you can do much better).

All together, our Stan model looks like


data{
  int N;
  matrix[N, 4] X;
  int y[N];
}
parameters{
  real<lower=0, upper=1> theta;

  vector[4] beta;
}
transformed parameters{
  vector[N] lambda = exp(X*beta);
}
model{
  beta ~ std_normal();
  theta ~ beta(1,1);
  
  for(n in 1:N){
  
    if(y[n]==0){
    target+= log_sum_exp(bernoulli_lpmf(1 | theta), bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n] | lambda[n]));
    }
    else{
    target += bernoulli_lpmf(0 | theta) + poisson_lpmf(y[n]| lambda[n]);
    }
    
  }

}

Alternatively, you can skip a lot of this by just using the brms library which supports zero inflated poisson models and formula syntax much like rstanarm

# Install
install.packages('brms')
library(brms)
fit = brm(NbBones ~ Time + Orders, data = d, family = zero_inflated_poisson())
Demetri Pananos
  • 24,380
  • 1
  • 36
  • 94
  • Waww, thanks a lot for all the explanation and the code it will be realy usefull for me !!! And by the was How did you plot the pp_check code since both ```rstanarm``` and ```brms``` have the same pp_check function name ? Even by using ```rstanarm::pp_check(hit, plotfun = "bars", stat = "mean")```it still runs the brms function.... – chippycentra Sep 06 '21 at 15:21
  • @chippycentra Try `brms::pp_check(fit, type='bars')` – Demetri Pananos Sep 06 '21 at 16:23
  • Thanks a lot for the help. Another last question, I found that the best model for my data is ```zero_inflated_negbinomial```, do you know if the coef values estimated need to be transformed back in order to be biologicaly interpeted (with an exponential right ) ? – chippycentra Sep 07 '21 at 06:48
  • And I also wondered if the zi posterior mean value of 0.40 means that overall, around 1-0.40 (60%) of zero are comming from observable values right ? – chippycentra Sep 07 '21 at 09:25
  • @chippycentra Coefficients are on the scale of the link function, so applying the inverse link should give you appropriate interpretations on the multiplicative scale. As for the zi, I'm sure this can be found in the documentation. – Demetri Pananos Sep 07 '21 at 22:31