
I'm relatively new to regression analysis in Python. I'm running a logistic regression on a dataset in a dataframe using the Statsmodels package.

I've seen several examples, including the one linked below, in which a constant column (e.g. 'intercept') is added to the dataset and populated with 1.0 for every row, and that intercept column is then included as a predictor in the regression.
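
Roughly, the pattern I mean looks like this (made-up data and placeholder column names; I believe adding the column by hand is the same as statsmodels' add_constant helper):

import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
df = pd.DataFrame({'x1': rng.normal(size=200)})
df['outcome'] = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + df['x1']))))

# The pattern from the examples: a constant column of 1.0 for every row ...
df['intercept'] = 1.0
fit_manual = sm.Logit(df['outcome'], df[['intercept', 'x1']]).fit(disp=0)

# ... which appears to be equivalent to using the built-in helper
fit_helper = sm.Logit(df['outcome'], sm.add_constant(df[['x1']])).fit(disp=0)
print(fit_manual.params)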

My question is: what is the purpose of this, and is it necessary? (How do I know if it's necessary?)

(Reference: Logistic Regression: Scikit Learn vs Statsmodels)

Thank you!

mountainave

2 Answers


It is almost always necessary. I say almost always because leaving out the intercept changes the interpretation of the other coefficients. Leaving out the column of 1s may be fine when you are regressing the outcome on categorical predictors only, but we often include continuous predictors.

Let's compare a logistic regression with and without the intercept when we have a continuous predictor. Assume the data have been mean centered. Without the column of 1s, the model looks like

$$ \log\left( \dfrac{p(x)}{1-p(x)} \right) = \beta x $$

When $x=0$ (i.e. when the covariate is equal to the sample mean), then the log odds of the outcome is 0, which corresponds to $p(x) = 0.5$. So what this says is that when $x$ is at the sample mean, then the probability of a success is 50% (which seems a bit restrictive).

If we do have the intercept, the model is then

$$ \log\left( \dfrac{p(x)}{1-p(x)} \right) = \beta_0 + \beta x $$

Now, when $x=0$ the log odds is equal to $\beta_0$ which we can freely estimate from the data.

In short, unless you have a good reason to exclude it, include the column of 1s.
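
To make this concrete, here is a rough sketch (simulated, mean-centered data; variable names are just placeholders) of fitting the model with and without the column of 1s in statsmodels. It shows that the no-intercept model is forced to predict a 50% probability at the mean of $x$:

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
x = rng.normal(size=500)
x = x - x.mean()                          # mean-center the predictor
p = 1 / (1 + np.exp(-(1.0 + 0.8 * x)))    # true intercept is 1, not 0
y = rng.binomial(1, p)

# Without the column of 1s: log-odds = beta * x
no_const = sm.Logit(y, x[:, None]).fit(disp=0)

# With the column of 1s: log-odds = beta_0 + beta * x
with_const = sm.Logit(y, sm.add_constant(x)).fit(disp=0)

# Predicted probability at x = 0 (the sample mean)
print(no_const.predict([[0.0]]))         # exactly 0.5, by construction
print(with_const.predict([[1.0, 0.0]]))  # close to 1/(1+exp(-1)), about 0.73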

Demetri Pananos

It appears that you may not have to manually include a constant for the model to have an intercept, at least with scikit-learn. Looking at the default parameters of the following class, there is a boolean fit_intercept parameter that defaults to True.

class sklearn.linear_model.LogisticRegression(penalty='l2', *, dual=False, tol=0.0001, C=1.0, fit_intercept=True, intercept_scaling=1, class_weight=None, random_state=None, solver='lbfgs', max_iter=100, multi_class='auto', verbose=0, warm_start=False, n_jobs=None, l1_ratio=None)

The explanation given for that parameter is as follows:

fit_intercept : bool, default=True
    Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Source: sklearn.linear_model.LogisticRegression
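
As a rough illustration (made-up data), scikit-learn fits the intercept for you unless you turn that off, so no column of 1s is needed there:

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 1))
y = rng.binomial(1, 1 / (1 + np.exp(-(0.5 + X[:, 0]))))

# fit_intercept=True is the default, so the intercept is estimated automatically
clf = LogisticRegression().fit(X, y)
print(clf.intercept_, clf.coef_)

# fit_intercept=False reproduces the "no column of 1s" situation
clf_no_int = LogisticRegression(fit_intercept=False).fit(X, y)
print(clf_no_int.intercept_)  # the intercept is fixed at zero

(Note that, unlike statsmodels' Logit, LogisticRegression also applies an L2 penalty by default, which is a separate difference from how the intercept is handled.)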

Alain