In this post there is a script that generates data for a logistic regression:
set.seed(666)
x1 = rnorm(1000,0,1) # some continuous variables
x2 = rnorm(1000,0,1)
z = 1 + 2*x1 + 3*x2 # linear combination with a bias
pr = 1/(1+exp(-z)) # pass through an inv-logit function
y = rbinom(length(x1),1,pr) # bernoulli response variable
df = data.frame(y=y,x1=x1,x2=x2)
glm( y~x1+x2,data=df,family="binomial")
From this script I want to: 1. replace $x_2$ with a quadratic term in $x_1$; and 2. compare $x_1$ when it is not centred (change in mean) and more or less variable (change in sd). Below is my modified script:
set.seed(666)
x1 = rnorm(1000,0,1)
z = 1 + 2*x1 + 3*x1^2
pr = 1/(1+exp(-z))
y = rbinom(length(x1),1,pr)
df = data.frame(y=y,
x1=x1,
x2=x1^2)
glm( y~x1+x2,
data=df,
family="binomial")
Call: glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
(Intercept) x1 x2
1.002 2.437 3.490
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 795.3
Residual Deviance: 615.9 AIC: 621.9
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
The estimates seem a little off, but they are close to the coefficients that were theoretically put into the model. However, as soon as I change the mean, the algorithm does not converge:
set.seed(666)
x1 = rnorm(1000,10,1)
z = 1 + 2*x1 + 3*x1^2
pr = 1/(1+exp(-z))
y = rbinom(length(x1),1,pr)
df = data.frame(y=y,
x1=x1,
x2=x1^2)
glm( y~x1+x2,
data=df,
family="binomial")
Call: glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
(Intercept) x1 x2
2.657e+01 -2.351e-08 1.234e-09
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 0
Residual Deviance: 5.802e-09 AIC: 6
Warning message:
glm.fit: algorithm did not converge
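For what it's worth, the non-convergence can be traced back to the simulated probabilities themselves: with x1 centred at 10, z is roughly 1 + 20 + 300 on average, so every pr is numerically 1 and y has no variation at all (which is also why the null deviance is 0). A quick check, reusing the script's own quantities:

```r
set.seed(666)
x1 <- rnorm(1000, 10, 1)
z  <- 1 + 2*x1 + 3*x1^2              # roughly 1 + 20 + 300 on average
pr <- 1/(1 + exp(-z))
range(pr)                            # both endpoints are numerically 1
table(rbinom(length(x1), 1, pr))     # every simulated response is 1
```

With a constant response there is nothing for glm to estimate, so the failure is in the simulated data, not in the fitting algorithm.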
The only way I found to get back to the same estimates was to standardize the raw data (which makes the mean 0, so I am back to the first example).
set.seed(666)
x1 = rnorm(1000,10,1)
x1=scale(x1)
z = 1 + 2*x1 + 3*(x1)^2
pr = 1/(1+exp(-z))
y = rbinom(length(x1),1,pr)
df = data.frame(y=y,
x1=x1,
x2=x1^2)
glm( y~x1+x2,
data=df,
family="binomial")
Call: glm(formula = y ~ x1 + x2, family = "binomial", data = df)
Coefficients:
(Intercept) x1 x2
0.9872 2.4292 3.5237
Degrees of Freedom: 999 Total (i.e. Null); 997 Residual
Null Deviance: 787.8
Residual Deviance: 605.7 AIC: 611.7
Warning message:
glm.fit: fitted probabilities numerically 0 or 1 occurred
So how is it possible to get logistic estimates equal to what was theoretically put into the linear combination, but with an $x_1$ that is e.g. x1 = rnorm(1000,10,2)?
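One observation that may help frame this: the "theoretical" coefficients (1, 2, 3) only exist on the scale on which x1 entered the linear combination. If the generating model is written on a standardized variable u = (x1 - m)/s, expanding u gives the raw-scale coefficients implied by (1, 2, 3). A sketch (not from the original post; m and s here are assumed to be the known generating mean and sd):

```r
# If z = b0 + b1*u + b2*u^2 with u = (x1 - m)/s, expanding u gives
# the raw-scale (x1) coefficients:
#   x1^2 term:  b2/s^2
#   x1   term:  b1/s - 2*b2*m/s^2
#   constant :  b0 - b1*m/s + b2*m^2/s^2
set.seed(666)
m <- 10; s <- 2
x1 <- rnorm(1000, m, s)
u  <- (x1 - m)/s                 # standardized version of x1
z  <- 1 + 2*u + 3*u^2            # generate on the standardized scale
pr <- 1/(1 + exp(-z))
y  <- rbinom(length(x1), 1, pr)
# raw-scale coefficients implied by (b0, b1, b2) = (1, 2, 3):
c(1 - 2*m/s + 3*m^2/s^2, 2/s - 2*3*m/s^2, 3/s^2)   # 66, -14, 0.75
glm(y ~ x1 + I(x1^2), family = "binomial")
```

So fitting on the raw x1 should recover (approximately) the back-transformed coefficients rather than (1, 2, 3); the two parameterizations describe the same model.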
In addition, is it possible to add an error term here: z = 1 + 2*x1 + 3*x2 + rnorm(length(x1),0,1)? This would be the equivalent of saying $y = \alpha_0 + \beta_1 x_1 + \beta_2 x_1^2 + \varepsilon$.
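Mechanically, adding such a term is straightforward; a sketch (using x1^2 directly, since x2 = x1^2 in the modified script). Note, as an assumption on my part, that the noise here enters on the latent scale z rather than on the binary y, which makes this an overdispersed (latent-noise) logistic model, and the plain glm estimates will generally be attenuated toward zero relative to (1, 2, 3):

```r
set.seed(666)
x1 <- rnorm(1000, 0, 1)
# extra N(0,1) noise on the linear predictor (latent scale):
z  <- 1 + 2*x1 + 3*x1^2 + rnorm(length(x1), 0, 1)
pr <- 1/(1 + exp(-z))
y  <- rbinom(length(x1), 1, pr)
glm(y ~ x1 + I(x1^2), family = "binomial")
```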