When I duplicate a subset of observations and refit the same logistic regression model on the extended data, the coefficient estimates change. If I duplicate the whole dataset instead, they stay the same.
This confuses me because all the covariates are categorical, so duplicating a subset provides no new information: the same combinations of covariate values are simply repeated, and the outcomes associated with them do not change.
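The effect is easy to reproduce with simulated data. The following sketch (toy data, hypothetical variable names) fits the same model on the original data, on the whole data duplicated, and on the data with only a subset duplicated:

# toy data: two categorical covariates and a binary outcome
set.seed(1)
d <- data.frame(x1 = factor(sample(letters[1:3], 200, replace = TRUE)),
                x2 = factor(sample(LETTERS[1:2], 200, replace = TRUE)))
d$y <- rbinom(200, 1, 0.3)

f_orig   <- glm(y ~ x1 + x2, data = d, family = 'binomial')
f_double <- glm(y ~ x1 + x2, data = rbind(d, d), family = 'binomial')
f_subset <- glm(y ~ x1 + x2, data = rbind(d, d[d$x1 == 'a', ]),
                family = 'binomial')

all.equal(coef(f_orig), coef(f_double))  # TRUE (up to numerical tolerance)
coef(f_orig) - coef(f_subset)            # generally nonzero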
For example, I've modified the UCLA logistic regression data set used in this tutorial and created a data set in which all the covariates are discrete: gre and gpa are binned into three levels each (gre1–gre3, gpa1–gpa3), and rank already has four. This gives me the data set that you can see here (csv file).
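For what it's worth, the discretization was along these lines (a sketch, not my exact script; the cut points are assumptions, and raw stands for the original UCLA data frame with numeric gre and gpa columns):

# 'raw' is assumed to be the original UCLA admissions data; my real bin
# boundaries may differ from the equal-width cuts used here
raw$gre <- cut(raw$gre, breaks = 3, labels = c('gre1', 'gre2', 'gre3'))
raw$gpa <- cut(raw$gpa, breaks = 3, labels = c('gpa1', 'gpa2', 'gpa3'))
write.csv(raw, 'd:/temp/ucla-factored.csv', row.names = FALSE)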
When I run logistic regression on it, I get:
# read the discretized data and treat rank as a factor
dff <- read.table('d:/temp/ucla-factored.csv', sep = ',', header = TRUE)
dff$rank <- factor(dff$rank)
# fit the logistic regression on the original data
mylogit <- glm(admit ~ gre + gpa + rank, data = dff, family = 'binomial')
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
    data = dff)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5149 -0.8971 -0.6672  1.1441  2.0587

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.7025     0.4144  -1.695 0.090046 .
gregre2       0.4024     0.3404   1.182 0.237190
gregre3       0.6130     0.3571   1.717 0.086038 .
gpagpa2       0.3115     0.3121   0.998 0.318350
gpagpa3       0.8551     0.3428   2.495 0.012609 *
rankrank2    -0.6866     0.3166  -2.169 0.030101 *
rankrank3    -1.3850     0.3439  -4.027 5.65e-05 ***
rankrank4    -1.6000     0.4170  -3.837 0.000124 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.80  on 392  degrees of freedom
AIC: 475.8

Number of Fisher Scoring iterations: 4
Then I select the rows where gpa is gpa2, copy them, and append them to the data set, giving me the csv file here.
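The duplication itself is a one-liner like the following (a sketch, assuming the gpa column holds the factor labels):

# append a second copy of every row whose gpa level is 'gpa2'
dffbig <- rbind(dff, dff[dff$gpa == 'gpa2', ])
write.csv(dffbig, 'd:/temp/ucla-factoredbig.csv', row.names = FALSE)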
Logistic regression with this extended data set gives:
# read the extended data (original rows plus the duplicated gpa2 rows)
dffbig <- read.table('d:/temp/ucla-factoredbig.csv', sep = ',', header = TRUE)
dffbig$rank <- factor(dffbig$rank)
# fit the same model on the extended data
mylogitbig <- glm(admit ~ gre + gpa + rank, data = dffbig, family = 'binomial')
summary(mylogitbig)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
    data = dffbig)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.4775 -0.8931 -0.6533  1.1939  2.1207

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8668     0.3786  -2.290   0.0220 *
gregre2       0.5278     0.2814   1.875   0.0607 .
gregre3       0.7013     0.2964   2.366   0.0180 *
gpagpa2       0.2992     0.2877   1.040   0.2985
gpagpa3       0.8481     0.3364   2.521   0.0117 *
rankrank2    -0.5478     0.2600  -2.107   0.0351 *
rankrank3    -1.3187     0.2873  -4.589 4.45e-06 ***
rankrank4    -1.5695     0.3460  -4.536 5.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 727.61  on 587  degrees of freedom
Residual deviance: 672.97  on 580  degrees of freedom
AIC: 688.97

Number of Fisher Scoring iterations: 4
Why does this happen? As long as the outcomes do not change, why would observing the same covariate patterns again change the estimated parameters of the model?
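To state the puzzle more concretely: duplicating the gpa2 rows should be equivalent to fitting the original data with case weights of 2 on those rows, so I'd expect a sketch like the following (reusing dff from above) to reproduce the coefficients of mylogitbig:

# case weight 2 for the duplicated rows, 1 for everything else
w <- ifelse(dff$gpa == 'gpa2', 2, 1)
mylogitw <- glm(admit ~ gre + gpa + rank, data = dff,
                family = 'binomial', weights = w)
coef(mylogitw)   # should match coef(mylogitbig) up to numerical tolerance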
Background: This is related to my efforts to create a synthetic data set from an existing logistic regression model alone. Since all the variables are categorical, I thought I could enumerate every possible combination of inputs and generate input data from it. But things don't go as expected when the synthetic data is fed back into logistic regression: the fitted model appears to depend on more than the mere distribution of combinations of the discrete variables.
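For context, the enumeration step I have in mind looks roughly like this (a sketch; the column names follow the example above):

# enumerate every combination of the factor levels and attach the model's
# predicted admission probability to each cell
grid <- expand.grid(gre  = unique(dff$gre),
                    gpa  = unique(dff$gpa),
                    rank = levels(dff$rank))
grid$p_admit <- predict(mylogit, newdata = grid, type = 'response')
head(grid)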