I'm trying to understand what sklearn's LinearRegression (which should be using ordinary least squares) is doing when there are more features than observations.
import numpy as np
from sklearn.linear_model import LinearRegression
X = np.random.normal(size=(10, 20))  # 10 observations, 20 features (p > n)
y = np.random.normal(size=10)
reg = LinearRegression().fit(X, y)
reg.coef_
Result:
array([ 0.08483326, 0.10681214, 0.21719561, 0.09594577, -0.03162432,
-0.12966986, 0.06547396, 0.23470907, 0.03750261, -0.09405698,
-0.05079304, -0.06141368, 0.04811855, 0.19887924, -0.02054755,
0.21558906, 0.06054536, 0.08791492, 0.01750048, -0.03848975])
How were these coefficients generated? With 20 features and only 10 observations the system is underdetermined, so there are no residual degrees of freedom, and fitting the same regression in R returns NA for the extra coefficients. I'm aware that penalized regression can handle such cases, but I don't see how plain LinearRegression arrives at a unique set of coefficients here.
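
For what it's worth, here is the comparison I was planning to run. It assumes (just my guess, not something I've confirmed in the sklearn source) that the returned coefficients are the minimum-norm least-squares solution of the centered system, which np.linalg.lstsq also produces for underdetermined problems:

import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)   # fixed seed so the check is repeatable
X = rng.normal(size=(10, 20))    # 10 observations, 20 features
y = rng.normal(size=10)

reg = LinearRegression().fit(X, y)

# As far as I understand, LinearRegression handles the intercept by centering
# X and y, so compare coef_ against the minimum-norm least-squares solution
# of the centered system.
Xc = X - X.mean(axis=0)
yc = y - y.mean()
beta_min_norm, *_ = np.linalg.lstsq(Xc, yc, rcond=None)

print(np.allclose(reg.coef_, beta_min_norm))  # True if my guess is right

If this prints True, then the coefficients would just be the pseudoinverse (minimum-norm) solution, but I'd appreciate confirmation of what LinearRegression actually does internally.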