
I am currently running this little piece of code:

import numpy as np
import sklearn.linear_model as skl_lm

# 1000 samples of a squared standard normal, plus a noisy linear response
X = np.random.normal(size = 1000)**2
y = X + np.random.normal(size = 1000)

# feature matrix whose two columns are both copies of X
Y = np.array([ [ X[i], X[i] ] for i in range(1000) ])

model = skl_lm.LinearRegression().fit(Y, y)
print(model.coef_)
print(np.dot(Y.transpose(), Y) / 999)

It just draws 1000 squares of a standard normal variable $X$ and uses them to create another, linearly correlated variable $y$ with some added noise. It then builds a feature matrix $Y$ whose two columns are both copies of $X$, and fits $y$ on $Y$. The model actually produces coefficients, while the second print shows that the Gram matrix of the columns of $Y$ is singular. How is sklearn.linear_model digesting this singular matrix? I believed it should not be able to determine the coefficients, since it should have to invert that matrix.

marco
1 Answer


Internally, sklearn uses the scipy.linalg.lstsq function to solve the linear system. It behaves like numpy.linalg.lstsq, and its inner workings are described in this post: How does NumPy solve least squares for underdetermined systems?
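To illustrate, here is a minimal sketch (my own, not taken from sklearn's source) calling scipy.linalg.lstsq directly on the same duplicated-column design. The SVD-based LAPACK driver it uses by default handles the rank deficiency by returning the minimum-norm solution, which splits the total weight equally between the two identical columns instead of failing:

```python
import numpy as np
from scipy.linalg import lstsq

rng = np.random.default_rng(0)
X = rng.normal(size=1000) ** 2
y = X + rng.normal(size=1000)

# Design matrix with two identical columns: rank 1, not 2
A = np.column_stack([X, X])

# lstsq detects the rank deficiency via the SVD and returns the
# minimum-norm least-squares solution among the infinitely many fits
coef, residues, rank, sv = lstsq(A, y)
print(rank)  # 1: only one effective feature
print(coef)  # the two coefficients are equal, summing to roughly 1
```

This is why the fit "generates something": among all coefficient vectors that achieve the same residual, the one with the smallest Euclidean norm is returned, and by symmetry it distributes the true slope evenly over the duplicated columns.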

hellpanderrr