
I'm trying to model the probability of an event Y based on three independent variables: one (X) is continuous (a log count) and the other two (A and B) are categorical (nominal). B is a subcategory of A. A has 4 levels, most of which are well populated; B has 3 to 15 levels depending on the level of A, and about half of those are well populated.

I could take all three variables and do a Bayesian logistic regression (one-hot encoding A and B, ending up with 1+4+15 columns). I could also proceed in steps: fit four distinct logistic regressions of Y on X, one for each level of A. Then, using the coefficients of each as priors on X, fit logistic regressions Y ~ X on each level of B (if level Bj of B belongs to level Ai of A, then I use the priors from model i above).
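For concreteness, here is a minimal PyMC3 sketch of the first (single-model) approach. The arrays `X`, `A_idx`, `B_idx`, and `y` are placeholders standing in for my real data; integer-indexing the coefficient vectors is equivalent to one-hot encoding without building the dummy columns explicitly:

```python
import numpy as np
import pymc3 as pm

# Placeholder data (stand-ins for the real arrays)
n = 500
X = np.random.randn(n)                # continuous predictor (log count)
A_idx = np.random.randint(0, 4, n)    # integer-coded level of A (4 levels)
B_idx = np.random.randint(0, 15, n)   # integer-coded level of B (15 levels total)
y = np.random.binomial(1, 0.5, n)     # binary outcome

with pm.Model() as flat_model:
    intercept = pm.Normal("intercept", mu=0, sigma=5)
    beta_x = pm.Normal("beta_x", mu=0, sigma=5)
    beta_a = pm.Normal("beta_a", mu=0, sigma=5, shape=4)
    beta_b = pm.Normal("beta_b", mu=0, sigma=5, shape=15)
    # Indexing beta_a/beta_b by level plays the role of the one-hot columns
    eta = intercept + beta_x * X + beta_a[A_idx] + beta_b[B_idx]
    pm.Bernoulli("obs", p=pm.math.sigmoid(eta), observed=y)
    trace = pm.sample(1000, tune=1000)
```

(Note that an intercept plus indicators for every level is over-parameterized; the proper priors keep it identified, but one could also drop one reference level per factor.)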

Does it make sense to proceed that way? What are the advantages/disadvantages of doing so? Are there alternatives? Any links/tutorials on mixing categorical and continuous variables in Bayesian logistic regression are also appreciated (particularly for PyMC3).


1 Answer


Short answer: your second proposal seems strange to me; go for the simpler first proposal. There is no problem in principle with mixing different kinds of variables as predictors in a regression. If the range of the continuous variable is not small, consider a spline (or, more simply, a quadratic polynomial). For the categorical variable with some sparse levels, consider regularization; see Principled way of collapsing categorical variables with many levels?
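One natural way to get that regularization in PyMC3 is a hierarchical (partial-pooling) prior that shrinks each B-level effect toward the effect of its parent A level, which also encodes your nesting directly. This is only a sketch under assumed data: `B_to_A` is a hypothetical lookup array mapping each B level to its parent A level, and `X`, `B_idx`, `y` are placeholders as in the question:

```python
import numpy as np
import pymc3 as pm

# Placeholder data and a hypothetical nesting map: B_to_A[j] is the A level
# that B level j belongs to (15 B levels nested in 4 A levels here).
n = 500
X = np.random.randn(n)
B_idx = np.random.randint(0, 15, n)
y = np.random.binomial(1, 0.5, n)
B_to_A = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 3])

with pm.Model() as hier_model:
    intercept = pm.Normal("intercept", mu=0, sigma=5)
    beta_x = pm.Normal("beta_x", mu=0, sigma=5)
    # A-level effects, shrunk toward zero
    sigma_a = pm.HalfNormal("sigma_a", sigma=1.0)
    beta_a = pm.Normal("beta_a", mu=0, sigma=sigma_a, shape=4)
    # B-level effects, shrunk toward their parent A-level effect, so
    # sparse B levels borrow strength from the rest of their A group
    sigma_b = pm.HalfNormal("sigma_b", sigma=1.0)
    beta_b = pm.Normal("beta_b", mu=beta_a[B_to_A], sigma=sigma_b, shape=15)
    eta = intercept + beta_x * X + beta_b[B_idx]
    pm.Bernoulli("obs", p=pm.math.sigmoid(eta), observed=y)
    trace = pm.sample(1000, tune=1000)
```

Sparse levels of B are then pulled toward their A-level mean automatically, which is the Bayesian analogue of the collapsing/regularization discussed in the linked question.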
