When I duplicate a subset of observations and refit the same logistic regression model on the extended data, the coefficient estimates change. If I duplicate the whole dataset instead, they stay the same.
This confuses me because all the covariates are categorical, so duplicating a subset provides no new information: the same combinations of covariate values are simply repeated, and the outcomes associated with them do not change.
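The effect is easy to reproduce with simulated data. The following sketch (toy data, hypothetical variable names) fits the same model on the original data, on the whole data duplicated, and on the data with only a subset duplicated:

# toy data: two categorical covariates and a binary outcome
set.seed(1)
d <- data.frame(x1 = factor(sample(letters[1:3], 200, replace = TRUE)),
                x2 = factor(sample(LETTERS[1:2], 200, replace = TRUE)))
d$y <- rbinom(200, 1, 0.3)

f_orig   <- glm(y ~ x1 + x2, data = d, family = 'binomial')
f_double <- glm(y ~ x1 + x2, data = rbind(d, d), family = 'binomial')
f_subset <- glm(y ~ x1 + x2, data = rbind(d, d[d$x1 == 'a', ]),
                family = 'binomial')

all.equal(coef(f_orig), coef(f_double))  # TRUE (up to numerical tolerance)
coef(f_orig) - coef(f_subset)            # generally nonzero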
For example, I've modified the UCLA logistic regression data set used in this tutorial and created a data set in which all the covariates are discrete: gre and gpa are binned into three levels each (gre1–gre3, gpa1–gpa3), and rank already has four. This gives me the data set that you can see here (csv file).
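For what it's worth, the discretization was along these lines (a sketch, not my exact script; the cut points are assumptions, and raw stands for the original UCLA data frame with numeric gre and gpa columns):

# 'raw' is assumed to be the original UCLA admissions data; my real bin
# boundaries may differ from the equal-width cuts used here
raw$gre <- cut(raw$gre, breaks = 3, labels = c('gre1', 'gre2', 'gre3'))
raw$gpa <- cut(raw$gpa, breaks = 3, labels = c('gpa1', 'gpa2', 'gpa3'))
write.csv(raw, 'd:/temp/ucla-factored.csv', row.names = FALSE)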
When I run logistic regression on it, I get:
# read the discretized data and treat rank as a factor
dff <- read.table('d:/temp/ucla-factored.csv', sep = ',', header = TRUE)
dff$rank <- factor(dff$rank)
# fit the logistic regression on the original data
mylogit <- glm(admit ~ gre + gpa + rank, data = dff, family = 'binomial')
summary(mylogit)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
    data = dff)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.5149 -0.8971 -0.6672  1.1441  2.0587

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.7025     0.4144  -1.695 0.090046 .
gregre2       0.4024     0.3404   1.182 0.237190
gregre3       0.6130     0.3571   1.717 0.086038 .
gpagpa2       0.3115     0.3121   0.998 0.318350
gpagpa3       0.8551     0.3428   2.495 0.012609 *
rankrank2    -0.6866     0.3166  -2.169 0.030101 *
rankrank3    -1.3850     0.3439  -4.027 5.65e-05 ***
rankrank4    -1.6000     0.4170  -3.837 0.000124 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 459.80  on 392  degrees of freedom
AIC: 475.8

Number of Fisher Scoring iterations: 4
Then I select the rows where gpa is gpa2, copy them, and append them to the data set, giving me the csv file here.
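The duplication itself is a one-liner like the following (a sketch, assuming the gpa column holds the factor labels):

# append a second copy of every row whose gpa level is 'gpa2'
dffbig <- rbind(dff, dff[dff$gpa == 'gpa2', ])
write.csv(dffbig, 'd:/temp/ucla-factoredbig.csv', row.names = FALSE)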
Logistic regression with this extended data set gives:
# read the extended data (original rows plus the duplicated gpa2 rows)
dffbig <- read.table('d:/temp/ucla-factoredbig.csv', sep = ',', header = TRUE)
dffbig$rank <- factor(dffbig$rank)
# fit the same model on the extended data
mylogitbig <- glm(admit ~ gre + gpa + rank, data = dffbig, family = 'binomial')
summary(mylogitbig)
Call:
glm(formula = admit ~ gre + gpa + rank, family = "binomial",
    data = dffbig)

Deviance Residuals:
    Min      1Q  Median      3Q     Max
-1.4775 -0.8931 -0.6533  1.1939  2.1207

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -0.8668     0.3786  -2.290   0.0220 *
gregre2       0.5278     0.2814   1.875   0.0607 .
gregre3       0.7013     0.2964   2.366   0.0180 *
gpagpa2       0.2992     0.2877   1.040   0.2985
gpagpa3       0.8481     0.3364   2.521   0.0117 *
rankrank2    -0.5478     0.2600  -2.107   0.0351 *
rankrank3    -1.3187     0.2873  -4.589 4.45e-06 ***
rankrank4    -1.5695     0.3460  -4.536 5.74e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 727.61  on 587  degrees of freedom
Residual deviance: 672.97  on 580  degrees of freedom
AIC: 688.97

Number of Fisher Scoring iterations: 4
Why does this happen? As long as the outcomes do not change, why would observing the same covariate patterns again change the estimated parameters of the model?
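To state the puzzle more concretely: duplicating the gpa2 rows should be equivalent to fitting the original data with case weights of 2 on those rows, so I'd expect a sketch like the following (reusing dff from above) to reproduce the coefficients of mylogitbig:

# case weight 2 for the duplicated rows, 1 for everything else
w <- ifelse(dff$gpa == 'gpa2', 2, 1)
mylogitw <- glm(admit ~ gre + gpa + rank, data = dff,
                family = 'binomial', weights = w)
coef(mylogitw)   # should match coef(mylogitbig) up to numerical tolerance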
Background: This is related to my efforts to create a synthetic data set from an existing logistic regression model alone. Since all the variables are categorical, I thought I could enumerate every possible combination of inputs and generate input data from it. But things don't go as expected when the synthetic data is fed back into logistic regression: the fitted model appears to depend on more than the mere distribution of combinations of the discrete variables.
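For context, the enumeration step I have in mind looks roughly like this (a sketch; the column names follow the example above):

# enumerate every combination of the factor levels and attach the model's
# predicted admission probability to each cell
grid <- expand.grid(gre  = unique(dff$gre),
                    gpa  = unique(dff$gpa),
                    rank = levels(dff$rank))
grid$p_admit <- predict(mylogit, newdata = grid, type = 'response')
head(grid)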