
If I create a regression model design matrix with 3 uncorrelated variables, I get a small condition number as expected. MWE:

> import numpy as np, pandas as pd
> n = 1000
> X = pd.DataFrame()
> X['x1'] = np.random.normal(size=n) * 500
> X['x2'] = np.random.normal(size=n) * 200
> X['x3'] = np.random.normal(size=n) * 300
> print(np.linalg.cond(X))

2.4566193714711306

But if I add a constant column to the design matrix (as statsmodels in Python expects), the condition number blows up:

> import numpy as np, pandas as pd
> n = 1000
> X = pd.DataFrame()
> X['x0'] = [1] * n
> X['x1'] = np.random.normal(size=n) * 500
> X['x2'] = np.random.normal(size=n) * 200
> X['x3'] = np.random.normal(size=n) * 300
> print(np.linalg.cond(X))

497.654501825216

Accordingly, I get an extremely high condition number and a multicollinearity warning when I estimate my model, even though none of my predictors (or the constant) are correlated. Why does the design matrix's condition number change so drastically when a constant is added?

1 Answer


Combining several questions/comments, I believe I have the answer. This is just a scaling problem.

The condition number is the ratio of the largest singular value of the design matrix to the smallest (equivalently, the square root of the ratio of the largest to smallest eigenvalue of X'X). The large condition number here results from scaling rather than from multicollinearity. If the predictor columns take values on the order of hundreds (i.e., large singular values) and we add a constant column of 1s (i.e., a small singular value), the ratio between them is large, and statsmodels warns of multicollinearity because it has a sensitive threshold and computes the condition number on the unstandardized design matrix.
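A minimal sketch of the same point (my own check, using a plain numpy array rather than the DataFrame from the question): the singular values show the scale mismatch directly, and rescaling the predictors to unit standard deviation brings the condition number back near 1.

> import numpy as np
>
> rng = np.random.default_rng(0)
> n = 1000
>
> # Same design matrix as the question: a constant plus three uncorrelated predictors
> X = np.column_stack([
>     np.ones(n),
>     rng.normal(size=n) * 500,
>     rng.normal(size=n) * 200,
>     rng.normal(size=n) * 300,
> ])
>
> # Condition number = largest singular value / smallest singular value
> s = np.linalg.svd(X, compute_uv=False)
> print(s.max() / s.min())        # ~500, the 1s column vs. the large-scale columns
> print(np.linalg.cond(X))        # same value
>
> # Rescale the predictors (not the constant) to unit standard deviation:
> # the scale mismatch disappears and the condition number drops to ~1
> X_scaled = X.copy()
> X_scaled[:, 1:] /= X_scaled[:, 1:].std(axis=0)
> print(np.linalg.cond(X_scaled))

Since statsmodels computes its reported condition number from the design matrix as given (unstandardized), the warning appears even though the predictors are uncorrelated; rescaling the predictors before fitting makes it go away.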
