Does it make sense to restrict the number of observations for logistic regression?

Question

This is a theoretical question.

Let's assume I have performed an ordinal logistic regression on a large dataset.

My model:

y~x1+x2+x3

where y is an ordered factor with four levels,
x1 is a binomial variable (0 or 1),
x2 is a continuous variable,
x3 is a binomial variable (0 or 1).

I got some odds ratios (ORs), let's say 1.5 for x1 = 1 and 1.5 for x2 as well as for x3 = 1.

I understand that each odds ratio is for a change in one variable, with the others held constant.

So my question is: If I restrict my observations to subjects who only had a certain x2 and had '1' in their x3 variable and run the regression with the same model --- would I expect to see a significant change in my x1 odds ratio?

If so would it be increasing or lowering the OR of that variable?

Is it even allowed to restrict the number of observations in such a manner, or would someone hang me for this?

You are right. Put a journal peer reviewer under the executioner's hood for illustrative purposes. — WojciechF, Mar 11 '15 at 14:04
How are we to judge what an unspecified reviewer with an unspecified background in statistics for an unspecified journal with unspecified editorial policies in an unspecified subject area might do? For example, a reviewer for the journal [*Basic and Applied Social Psychology*](http://stats.stackexchange.com/questions/139290/a-psychology-journal-banned-p-values-and-confidence-intervals-is-it-indeed-wise) might have a very different reaction to one for *Variance*. — Glen_b, Mar 11 '15 at 14:16
I think you are missing the point. The point is IF the odds are expected to change — WojciechF, Mar 11 '15 at 14:32
If the final question isn't to the point, why not remove it? — Glen_b, Mar 11 '15 at 14:34
Do you require an answer to "Is it even allowed to restrict the number of observations in such matter or would someone hang me for this?" or not? If you do, your question has problems that need to be addressed (well there are some issues with the rest of it, but that has the biggest issue). I'd like this to be an answerable question. — Glen_b, Mar 11 '15 at 15:14

kjetil b halvorsen · Accepted Answer · 2017-07-27T15:37:43.727

Why do you want to do this? If you have a good reason, there should be no problem. A better question (until you tell us your reason) is what can be achieved by doing this? Let us look at this in a simpler model (the issues will be the same for your more complicated model).

Let the response y be binomial, with two predictors: x1 is continuous and x2 is 0/1, that is, binomial. What happens if you estimate two models, one for the subset of data with x2=0 and another for the subset with x2=1?

Let us suppose that the model for the complete data is $$ \text{logit}(p) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_{12} x_1 x_2 $$ where the last term represents an interaction between x1 and x2 (in more concrete terms, the slope of the continuous predictor x1 is different for the two groups coded by x2). Within each subset x2 is constant, so $\beta_2$ and $\beta_{12}$ cannot be estimated separately from the intercept and slope, and we get two models $$ \text{logit}(p^0) = \beta_0^0 + \beta_1^0 x_1 $$ and $$ \text{logit}(p^1) = \beta_0^1 + \beta_1^1 x_1 $$ where the superscript (0 or 1) indicates the subset used (x2=0 or x2=1).

Comparing the models, we see first, for the subset x2=0, that $$ \text{logit}(p^0) = \beta_0 + \beta_1 x_1 \equiv \beta_0^0 + \beta_1^0 x_1 $$ and then, for the subset x2=1, that $$ \text{logit}(p^1) = \beta_0 + \beta_1 x_1 + \beta_2 + \beta_{12} x_1 \equiv \beta_0^1 + \beta_1^1 x_1 $$ From this we see (assuming the full model is the true model) that, from x2=0, $ \beta_0 = \beta_0^0 ; \beta_1 = \beta_1^0$, and from x2=1, $ \beta_0+\beta_2 = \beta_0^1 ; \beta_1 + \beta_{12}= \beta_1^1$. So from the two separately fitted subset models you can recover the interaction as $\beta_{12}=\beta_1^1 - \beta_1^0$ and the effect of x2 (at x1=0) as $ \beta_2 = \beta_0^1 - \beta_0^0$. In more practical terms, this says that the interaction can be recovered as the difference in slope between the two subsetted models, and the effect of x2 as the difference in intercept between the two subsetted models. You could develop similar relations for your more complex model by similar arguments.
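As a rough numerical check of these relations, here is a minimal sketch in Python using only the standard library. Everything in it is an assumption made for illustration: the simulated data, the "true" coefficient values, and the helper names `solve` and `fit_logit` (a hand-rolled Newton-Raphson fit, not any particular package's routine). It fits the interaction model to the complete data and the two simple models to the subsets, then compares the estimates.

```python
import math
import random

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

def fit_logit(X, y, iters=25):
    """Maximum-likelihood logistic regression via Newton-Raphson."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad = [0.0] * k
        hess = [[0.0] * k for _ in range(k)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            p = 1.0 / (1.0 + math.exp(-eta))
            w = p * (1.0 - p)
            for a in range(k):
                grad[a] += (yi - p) * xi[a]
                for c in range(k):
                    hess[a][c] += w * xi[a] * xi[c]
        step = solve(hess, grad)  # Newton step: (X'WX)^{-1} X'(y - p)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Simulate from the full interaction model (made-up coefficients).
rng = random.Random(0)
b0, b1, b2, b12 = -0.5, 0.8, 0.4, -0.6
rows = []
for _ in range(4000):
    x1 = rng.gauss(0.0, 1.0)
    x2 = rng.randint(0, 1)
    eta = b0 + b1 * x1 + b2 * x2 + b12 * x1 * x2
    y = 1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0
    rows.append((x1, x2, y))

# Full model: intercept, x1, x2 and the x1*x2 interaction.
full = fit_logit([[1.0, x1, x2, x1 * x2] for x1, x2, _ in rows],
                 [y for _, _, y in rows])
# Subset models: intercept and x1 only, fitted within each x2 group.
sub0 = fit_logit([[1.0, x1] for x1, x2, _ in rows if x2 == 0],
                 [y for _, x2, y in rows if x2 == 0])
sub1 = fit_logit([[1.0, x1] for x1, x2, _ in rows if x2 == 1],
                 [y for _, x2, y in rows if x2 == 1])

print("full model:   beta_2, beta_12 =", full[2], full[3])
print("from subsets: beta_2, beta_12 =", sub1[0] - sub0[0], sub1[1] - sub0[1])
```

In this sketch the two printed lines should agree up to numerical error. The reason is that the likelihood of the interaction model splits into one piece per x2 group, so the point estimates from the subsets carry the same information as the full fit. Without the interaction term in the full model, the slope of x1 would be shared across groups and the agreement would no longer be exact.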

So you can recover all the information from the two subsetted models. Whether this can be done in a statistically efficient manner is another question!
