
I want to model data from an experiment using a hierarchical Bayesian logistic regression. The experiment involved many subjects, with many trials collected from each. The DV is the outcome of each trial, coded as 0s and 1s. I have numerous IVs, which vary on a trial-to-trial basis. I want to model the effects of these IVs on each individual subject's responses, so that I have one regression coefficient per IV per subject. However, I also want to estimate these coefficients hierarchically, so that the coefficient for a particular IV in a given subject is informed by the same coefficient in the other subjects. Here is the relevant portion of my code:

import numpy as np
import pymc3 as pm
from patsy import dmatrix
from theano.tensor import dot, stack


def make_model(group, formula, data):
    """Construct the hierarchical logistic linear model according to the
    formula. Assumes a fully balanced repeated-measures design (i.e., each
    subject is a complete sub-model). Assumed to be called inside a
    pm.Model() context.

    """
    # coefficient names
    names = dmatrix(formula, data).design_info.column_names

    # hyperpriors: group-level mean and spread for each coefficient
    mu = [pm.Normal(name='%s_mu_%s' % (group, n), sd=100.) for n in names]
    sd = [pm.HalfCauchy(name='%s_sd_%s' % (group, n), beta=30.) for n in names]

    for subj, subj_data in data.groupby('subj'):

        # subject-specific design matrix
        x = np.asarray(dmatrix(formula, subj_data))

        # subject-level priors, centered on the group-level hyperpriors
        beta = [pm.Normal(name='%s_%s' % (subj, n), mu=mu[i], sd=sd[i]) for
                i, n in enumerate(names)]

        # deterministic variables: linear predictor and response probability
        theta = dot(x, stack(*beta))
        p = pm.invlogit(theta)
        obs = subj_data.stay.values

        # likelihood, one per subject
        pm.Bernoulli(name='%s_Y' % subj, p=p, observed=obs)

Basically, I create a design matrix for the complete data set using patsy. I then create hyperpriors for each regression coefficient. Then I loop over the individual subjects, creating a subject-specific design matrix, subject-level priors for each coefficient, and a subject-level likelihood vector as I go.

Unfortunately, the model is so large that it never compiles. I suspect that I might get better performance by constructing a single likelihood vector rather than multiple ones, but I'm not sure how I could "match up" each item in that vector to the correct priors. Any suggestions?
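Concretely, I suspect the "matching up" could be done with an integer array that maps each trial (row) to its subject, so that fancy indexing pulls out the right row of coefficients per trial. A minimal NumPy sketch of the idea (all names and shapes hypothetical):

```python
import numpy as np

np.random.seed(0)
n_subj, n_coef, trials_per_subj = 3, 2, 4
n_trials = n_subj * trials_per_subj

beta = np.random.randn(n_subj, n_coef)   # one row of coefficients per subject
x = np.random.randn(n_trials, n_coef)    # stacked design matrix, all subjects
subj_idx = np.repeat(np.arange(n_subj), trials_per_subj)  # row -> subject map

# single linear predictor for all trials at once
theta_vec = (x * beta[subj_idx]).sum(axis=1)

# equivalent per-subject loop, for comparison
theta_loop = np.concatenate(
    [x[subj_idx == s].dot(beta[s]) for s in range(n_subj)])

assert np.allclose(theta_vec, theta_loop)
```

If something like this carries over to the priors, the whole model would need only a single likelihood over all trials.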

sammosummo

1 Answer


There are a few examples of this in the docs: http://pymc-devs.github.io/pymc3/notebooks/GLM-hierarchical.html and http://pymc-devs.github.io/pymc3/notebooks/hierarchical_partial_pooling.html, as well as http://pymc-devs.github.io/pymc3/notebooks/GLM-linear.html for using patsy.

Finally, you might also be interested in BAMBI: https://github.com/bambinos/bambi

twiecki
    These examples don't work in my case, unfortunately. In the radon example, each row in the data contains all the data belonging to one sub-model (county), which makes it easy to index the correct priors with `a[county_idx] + b[county_idx]*data.floor.values`. In my example, there are multiple rows per sub-model (subject). – sammosummo Dec 02 '16 at 15:26