
I am building the following model for logistic regression in Stan (pystan):

Predictor: $\eta_t = \beta_1 x_{1,t} + \beta_2 x_{2,t} + \beta_3 x_{3,t} + \beta_4 x_{4,t} + \sigma \epsilon_t$

Outcome: $y_t \sim \text{Bernoulli}(f(\eta_t))$

where $\beta_i$ are the regression coefficients, $\sigma$ is the scale of the noise, $x_{i,t}$ are the values of the regressors for trial $t$, and $\epsilon_t \sim N(0,1)$ is the noise in the predictor $\eta_t$. Finally, $f$ is the sigmoid (inverse-logit) function linking the predictor $\eta_t$ to the Bernoulli outcome $y_t$ on each trial $t$.

Without the noise term in the predictor $\sigma \epsilon_t$, the following code for the model runs fine:

choice_code_bernoulliLogit = """ 
data {
    int<lower=1> N;                 // number of training samples
    int<lower=0> K;                 // number of predictors
    matrix[N, K] x;                 // matrix of predictors
    int<lower=0, upper=1> y[N];     // observed/training choice accuracy
}
parameters {
    vector[K] beta;
}
model {
    // faster implementation
    vector[N] theta;
    theta = x * beta;

    // priors
    beta ~ normal(2,2);

    // model
    y ~ bernoulli_logit(theta);
}
generated quantities {
    vector[K] choiceAccuracy;
    choiceAccuracy = inv_logit(beta);
}
"""

# defining the choice data
# choice_data =  {...}

import stan

# fit, extract parameters, and print summary
posterior = stan.build(choice_code_bernoulliLogit, data=choice_data)
fit = posterior.sample(num_chains=4, num_samples=1000)
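To actually extract the parameters and print a summary, as the comment above says, something along these lines should work with the PyStan 3 interface used here (the draw shapes are my assumption, and to_frame() needs pandas installed):

# posterior draws of the coefficients; for vector[K] beta this has shape (K, num_chains * num_samples)
beta_draws = fit["beta"]
print(beta_draws.mean(axis=1))      # posterior mean of each coefficient

# all draws as a pandas DataFrame, with a quick summary
print(fit.to_frame().describe())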

However, I cannot figure out how or where to add the noise to the model. I keep getting semantic errors when building the model in Stan; I'm not sure where I'm going wrong, and I cannot find any examples in the Stan documentation or anywhere else. The following code is an example of what I tried:

choice_code_bernoulliLogit = """ 
data {} // same as before
parameters {
    vector[K] beta;
    real<lower=0> sigma;
}
model {
    // faster implementation
    vector[N] theta;
    theta = x * beta;

    // priors
    beta ~ normal(2,2);

    // model
    y ~ bernoulli_logit(normal(theta,sigma));
}
generated quantities {} // same as before
"""

I have also tried removing the vectorization with explicit for loops, and tried using normal_rng() in a transformed parameters block to make the predictor noisy, but none of that seems to work. Any help would be greatly appreciated! Thanks :)

lonelyOrca
  • [This](https://stats.stackexchange.com/questions/46523/how-to-simulate-artificial-data-for-logistic-regression/46525) uses `R` code, but the comments should help you understand the idea, even if you don't know `R`. – Dave Aug 24 '21 at 17:22
  • Thanks @Dave, it did help clear some doubts. With the comments in the question you linked and psboonstra's answer below, it seems that I may not need the noise term after all and that it was a misunderstanding on my part. – lonelyOrca Aug 25 '21 at 11:24
  • I think everyone goes through that misconception when they first learn about GLMs. – Dave Aug 25 '21 at 12:15

1 Answer


A GLM for a binary outcome does not have an additive measurement error term in the same way that a linear regression model for a continuous outcome does. The measurement error in your model is already built in without needing the $\sigma\epsilon_t$ term. It just doesn't seem that way at first glance because the Bernoulli variance is tied to and fully determined by its mean.
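To make that concrete: if $y_t \sim \text{Bernoulli}(p_t)$ with $p_t = f(\eta_t)$, then

$$\text{Var}(y_t) = p_t(1 - p_t) = f(\eta_t)\bigl(1 - f(\eta_t)\bigr),$$

so once the mean is specified there is no free variance (or noise) parameter left to estimate.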

Put differently: you are observing draws of Bernoulli random variables presumed to have mean $f(\eta_t)$. There is no statistical way for a single Bernoulli variable to be overdispersed (or underdispersed, for that matter), so $\sigma$ in your proposed model is not identifiable apart from its prior.

Your proposed model would only be appropriate if you could directly measure the values of the predictor $\eta_t$, and you presumed that they were being measured with some error.
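If you did want to write the noisy-predictor model in valid Stan syntax (for instance, to convince yourself that $\sigma$ is not identified), one way is to make the noise an explicit latent parameter rather than calling normal() inside bernoulli_logit(). A rough sketch, reusing your data block and variable names plus a latent epsilon of my own:

parameters {
    vector[K] beta;
    real<lower=0> sigma;
    vector[N] epsilon;                  // latent per-trial noise
}
model {
    vector[N] theta = x * beta + sigma * epsilon;   // noisy linear predictor

    // priors
    beta ~ normal(2, 2);
    sigma ~ normal(0, 1);               // the data alone cannot pin sigma down; the prior does the work
    epsilon ~ std_normal();

    // model
    y ~ bernoulli_logit(theta);
}

But, for the reasons above, with only binary outcomes the posterior for sigma will essentially reproduce its prior.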

As an aside: I'm not actually sure whether this is why you are encountering errors in your Stan program, because the issue as I see it is a statistical one (namely, your model is not identified) rather than a numerical one. The line `y ~ bernoulli_logit(normal(theta,sigma));` does look suspicious to me, though: in Stan, `normal(theta, sigma)` is a distribution statement, not a function that returns a random draw, so it cannot be passed as an argument like that.

psboonstra
  • Very helpful; thanks. You may be right that I do not need the additive noise, but I want to be sure. In my case, the $x_i$'s are indicator variables for the "state" in an experiment. They are mutually exclusive such that only one of them is hot (1) on a given trial. This way, the $\beta_i$'s measure the choice accuracy of the participants in the corresponding states $i$. So while we cannot measure the predictor $\eta_t$ directly, I thought it would be plausible that there is common noise across the 4 states with which participants generate their response. Do you still think it is unnecessary? – lonelyOrca Aug 25 '21 at 08:32
  • Upon some more reflection, I realise that you are totally correct and that the error term is totally redundant. There is nothing in the data to constrain it, since we cannot measure the predictor $\eta_t$ directly but only the outcomes $y_t$ on each trial, which are already stochastic by virtue of the Bernoulli process. – lonelyOrca Aug 25 '21 at 12:27