
I'm trying to understand what sklearn's LinearRegression (which should be using ordinary least squares) is doing when there are more features than observations.

import numpy as np
from sklearn.linear_model import LinearRegression

X = np.random.normal(size=(10,20))
y = np.random.normal(size=10)

reg = LinearRegression().fit(X, y)
reg.coef_

Result:

array([ 0.08483326,  0.10681214,  0.21719561,  0.09594577, -0.03162432,
       -0.12966986,  0.06547396,  0.23470907,  0.03750261, -0.09405698,
       -0.05079304, -0.06141368,  0.04811855,  0.19887924, -0.02054755,
        0.21558906,  0.06054536,  0.08791492,  0.01750048, -0.03848975])

How were these coefficients generated? My understanding is that there should be no residual degrees of freedom, and using R to perform linear regression results in coefficients with NAs. I'm aware of techniques like penalized regression to handle these cases, but I'm unsure how sklearn's LinearRegression is handling this situation.

dseok
  • At least for a while, that function included regularization unless you disabled it. What does the documentation say for your particular version? – Dave Aug 30 '21 at 20:55
  • @Dave The version of sklearn I'm using is 0.22.1. The documentation (https://scikit-learn.org/0.22/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linear%20regression#sklearn.linear_model.LinearRegression) doesn't seem to indicate any parameters for regularization. Toggling `normalize` between `True` and `False` doesn't change the result; it still gives back coefficients for all features. – dseok Aug 30 '21 at 21:16
  • `sklearn` uses `scipy.linalg.lstsq` (which is distinct from `np.linalg.lstsq`), but I think this answer still applies: https://stats.stackexchange.com/questions/240573/how-does-numpy-solve-least-squares-for-underdetermined-systems. I'll test it later. – Sycorax Aug 30 '21 at 23:39
  • Also relevant here: https://stackoverflow.com/questions/57066026/how-does-sklearn-linear-model-linearregression-work-with-insufficient-data – dseok Aug 31 '21 at 00:37
  • @Sycorax, the `scipy` [documentation page](https://docs.scipy.org/doc/scipy/reference/generated/scipy.linalg.lstsq.html) says it uses LAPACK's `gelsd` by default as well. – Ben Reiniger Aug 31 '21 at 00:41
  • @Dave, I don't believe that's true of `LinearRegression`, just `LogisticRegression`? Linear regression has separate classes for regularized versions. – Ben Reiniger Aug 31 '21 at 00:43
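
The explanation pointed to in the comments (a LAPACK `gelsd`-based solver returning the minimum-norm least-squares solution when the system is underdetermined) can be checked numerically. Below is a minimal sketch, assuming a reasonably recent scikit-learn and NumPy. The seed, the variable names, and the `rcond=1e-10` cutoff are illustrative choices that do not appear in the question, and the centering step is my assumption about how `fit_intercept=True` is handled for dense input.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Same shapes as in the question: 10 observations, 20 features.
# The seed is an arbitrary choice, used only for reproducibility.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 20))
y = rng.normal(size=10)

reg = LinearRegression().fit(X, y)

# Assumption: with fit_intercept=True the data are centered before solving.
Xc = X - X.mean(axis=0)
yc = y - y.mean()

# Minimum-norm least-squares solution via the Moore-Penrose pseudoinverse.
# rcond=1e-10 (illustrative) discards the numerically zero singular value
# that centering introduces.
coef_min_norm = np.linalg.pinv(Xc, rcond=1e-10) @ yc

# Any vector orthogonal to the rows of Xc can be added without changing
# the fit, so the interpolating solution is not unique.
_, _, vt = np.linalg.svd(Xc)      # rows 9..19 of vt span the null space of Xc
coef_other = coef_min_norm + vt[-1]

print("matches sklearn:", np.allclose(reg.coef_, coef_min_norm))
print("sklearn fits y exactly:", np.allclose(reg.predict(X), y))
print("other solution fits too:", np.allclose(Xc @ coef_other, yc))
print("norms:", np.linalg.norm(coef_min_norm), np.linalg.norm(coef_other))
```

If the assumption holds, the first three checks should print `True` and the second norm should be larger than the first: every such coefficient vector reproduces `y` exactly (zero residual degrees of freedom), and the fitted coefficients are simply the shortest one. That would also account for the difference from R's `lm`, which, as the question notes, reports `NA` for coefficients it cannot identify.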
