
I would like to know which is statistically more advisable and what are the advantages and disadvantages of each approach.

My data frame data has Y, the outcome, and A and B, the predictor variables. A and B are categorical with multiple levels each (the levels are A0, A1, A2, and A3 for A; and B0, B1, B2, and B3 for B). I want to explore the interaction A * B and calculate some epidemiological measures whose formulas are more manageable when A and B are each binary.

I could keep a meaningful interpretation of my results by splitting the data frame into several chunks and fitting a logistic regression with binary predictors to each chunk. This has the advantage that I can easily calculate the epidemiological measures that are of interest for my analysis. However, this approach reduces the sample size available to each model, and there might be other disadvantages that I am not aware of.
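To illustrate the kind of measure I mean (the counts below are made up purely for the sketch): when A is binary within one chunk, an odds ratio can be read straight off a 2x2 table.

```python
# Hypothetical illustration: odds ratio for a binary exposure (say A1 vs A0)
# within one chunk of the data, computed from a 2x2 table of counts.
# The counts are made-up numbers for the sketch.
#            Y=1   Y=0
#   A1:       30    70   (exposed)
#   A0:       10    90   (unexposed)

a, b = 30, 70   # exposed: cases, non-cases
c, d = 10, 90   # unexposed: cases, non-cases

odds_ratio = (a * d) / (b * c)
print(odds_ratio)  # (30*90)/(70*10), roughly 3.86
```

With the single categorical model, the same contrast is recoverable from the fitted coefficients, but it takes exponentiating a combination of terms rather than one cross-product.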

Alternatively, I could use the full data frame, fit a single logistic regression with categorical predictors, and do the same pairwise comparisons as above - more difficult but possible. This has the advantage of keeping the full sample size, and probably other good properties. But there might also be disadvantages that I am not aware of and would like to know about.

Thanks in advance for any help.

Krantz
  • Possible duplicate of [Should I run separate regressions for every community, or can community simply be a controlling variable in an aggregated model?](https://stats.stackexchange.com/questions/17110/should-i-run-separate-regressions-for-every-community-or-can-community-simply-b) Also see: https://stats.stackexchange.com/questions/329061/moderated-regression-and-separate-models-give-slightly-different-results – StatsStudent Feb 16 '19 at 21:53
  • I don't agree. My question explores a different aspect. – Krantz Feb 16 '19 at 21:56
  • I'm not sure I understand how your question is any different from fitting a model with an interaction effect between A & B (possibly after collapsing categories) and fitting separate models? This question has been asked many times on CV before. – StatsStudent Feb 16 '19 at 22:08
  • `model with an interaction effect between A & B after collapsing categories` is not part of my question. I am not doing that in my analysis. – Krantz Feb 16 '19 at 22:10
  • But you've written "I want to explore the interaction `A * B` and calculate some epidemiological measures whose formulas are more manageable when A and B are binary each. It is possible to keep a meaningful interpretation in my results if I split the data frame into several chunks and fit a logistic regression with binary predictors for each chunk of data." This sounds like trying to choose between interaction effects and fitting separate models. If that's not what you are asking, I would suggest editing this to make your question clearer. – StatsStudent Feb 16 '19 at 22:30
  • `if I split the data frame into several chunks and fit a logistic regression with binary predictors for each chunk of data."` is not `model with an interaction effect between A & B after collapsing categories`. – Krantz Feb 16 '19 at 22:31
  • Correct. It's the other side of the question: running separate regressions. If this is not what you are intending to ask, I'd suggest an editing of the question. – StatsStudent Feb 16 '19 at 22:33
  • Great! Which passage of the question should I edit in your opinion to minimize misunderstandings? – Krantz Feb 16 '19 at 22:37
  • Well, I don't know, because I am under the impression that this is the same question as deciding between a model with interactions vs. separate models. Perhaps completely specifying, in mathematical terms, the different types of models you are considering running would be helpful? A toy example of the models you are thinking about might also help. – StatsStudent Feb 16 '19 at 22:40
  • As you might have noticed, I answered all your concerns using passages from the question without even paraphrasing anything. Also, the answer provided below is in line with the scope of the question. Alternatively, I could simply delete the question. – Krantz Feb 16 '19 at 22:44
  • And that's likely why the confusion hasn't disappeared. I don't think you need to delete the question unless you no longer have a question. I think that fully specifying the models you are having difficulty choosing among will likely clarify any confusion. – StatsStudent Feb 16 '19 at 22:48

1 Answer


Logistic regression with categorical data entered as a single numeric predictor will assume that there is a scale between the categories; in that form it cannot handle unordered categories.

Having said that, you could split A and B into one-hot encoded vectors and perform a logistic regression on this representation, which will only include binary variables.
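As a minimal sketch of what that representation looks like (using the level names from the question; the model-fitting step itself is omitted and would depend on your library):

```python
# One-hot encode variable A with levels A0..A3 into binary indicator
# columns, treating A0 as the reference level (so it gets no column).
# Level names are taken from the question; the encoding scheme is the
# standard dummy coding most fitting routines apply behind the scenes.
LEVELS = ["A0", "A1", "A2", "A3"]
REFERENCE = "A0"

def one_hot(value, levels=LEVELS, reference=REFERENCE):
    """Return one binary indicator per non-reference level."""
    return {lvl: int(value == lvl) for lvl in levels if lvl != reference}

print(one_hot("A2"))  # {'A1': 0, 'A2': 1, 'A3': 0}
print(one_hot("A0"))  # all zeros: the reference level
```

The same expansion applied to B (and to the products of the A and B indicators for the interaction) gives a design matrix containing only binary variables.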

If your analysis still applies to this model, you are golden. Otherwise, there will be differences between the chunked model and the full model.

Ulfgard
  • Thanks, @Ulfgard. Could you provide some advantages and disadvantages of each approach along with some citable references if possible? – Krantz Feb 16 '19 at 16:51
  • I'm not sure I understand this. Logistic regression can handle unordered categories. For example, in public health and medicine, it's very common to see logistic regression used with race as a categorical predictor, and clearly there is no ordering of races. – StatsStudent Feb 16 '19 at 22:44
  • If `Logistic regression can handle unordered categories.`, then what would be the statistical disadvantage of `single logistic regression with categorical predictors` as compared to `multiple logistic regressions with binary predictors`? Any citable references? – Krantz Feb 16 '19 at 22:58
  • I'm not sure I understand the need for multiple logistic regressions. When you include a categorical predictor in a model, the program is simply converting the categories to dummy variables in the background. So as long as you coded the dummy variables the same way, a logistic regression model with dummies is the same as a logistic regression with a categorical variable. – StatsStudent Feb 16 '19 at 23:28
  • Thanks, @StatsStudent. I have the same intuition as you. But could you help with some citable references to support this: `So you as long as you coded the dummy variables the same way`, multiple logistic regressions with binary predictors `is the same as a` single logistic regression with categorical predictors? – Krantz Feb 16 '19 at 23:40
  • This is an elementary concept in statistical models, so any elementary statistical models book that discusses variable coding will do for a citation. You could even just write out the model to show it's true and then cite the documentation of the statistical program you are using to describe how it codes the variables. But note that running a model with multiple categories (>2) and obtaining the logit for Category A is not the same thing as recoding category A into 1 and all other values to 0, running a model, and then obtaining the logit for A. – StatsStudent Feb 16 '19 at 23:44
  • Just to clarify. My question is `multiple logistic regressions with binary predictors vs single logistic regression with categorical predictors`. There is no change in how the variables are coded. Just split and do multiple logistic regressions [plural] based on binary data, or use the full dataset and do a single logistic regression [singular] based on categorical data. – Krantz Feb 16 '19 at 23:47
  • There is no need to complete multiple (i.e. several) logistic regressions. You simply write a single model with the appropriate coding. – StatsStudent Feb 16 '19 at 23:48
  • The coding is the same. The levels are `A0`, `A1`, `A2`, and `A3` for `A`; and `B0`, `B1`, `B2`, and `B3` for `B`. That coding does not change. – Krantz Feb 16 '19 at 23:50
  • Could you kindly help by posting an answer, @StatsStudent? That could help summarize your view. – Krantz Feb 16 '19 at 23:52
  • "Logistic regression can handle unordered categories". No, it can't: the predictor is linear. If x is a random variable with several unordered categories, the logistic regression predictor is f(x) = ax + b. You will see that this predictor is not invariant under permutations of the categories. Now, you might see it work in practice, and that is because many implementations (e.g. R) will represent categorical variables as one-hot encoded vectors. – Ulfgard Feb 18 '19 at 09:32