Using the Boston housing dataset as an example, I'm comparing the regression coefficients between scikit-learn's LinearRegression() and xgboost's XGBRegressor().
For XGBRegressor, I'm using booster='gblinear'
so that it uses a linear booster rather than a tree-based booster. According to this page, gblinear uses "delta with elastic net regularization (L1 + L2 + L2 bias) and parallel coordinate descent optimization."
Thus, I assume my comparison is apples to apples, since I am not comparing OLS to a tree-based learner.
Is my assumption correct? If so, would the interpretation of the coefficients in XGBoost be the same as in linear regression? That is, they represent "the mean change in the response variable for one unit of change in the predictor variable while holding other predictors in the model constant."
The coefficients from the two models are different. Why would this be? Is it because the regularization and optimization that XGBRegressor applies make them different?
import pandas as pd
from sklearn.datasets import load_boston

boston = load_boston()
X = pd.DataFrame(boston.data, columns = boston.feature_names)
Y = pd.DataFrame(boston.target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size = 0.2, random_state=5)
Linear Model:
from sklearn.linear_model import LinearRegression
linear_model = LinearRegression()
linear_model.fit(X_train, y_train)
print(linear_model.coef_)
Output:
[[-1.30799852e-01 4.94030235e-02 1.09535045e-03 2.70536624e+00
-1.59570504e+01 3.41397332e+00 1.11887670e-03 -1.49308124e+00
3.64422378e-01 -1.31718155e-02 -9.52369666e-01 1.17492092e-02
-5.94076089e-01]]
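Since the question is about interpreting each coefficient as the effect of one predictor, it may help (purely as a convenience, reusing the variables defined above) to pair the OLS coefficients with their feature names:

import pandas as pd

# linear_model.coef_ has shape (1, n_features) because Y is a DataFrame,
# so ravel() flattens it before pairing with the column names.
ols_coefs = pd.Series(linear_model.coef_.ravel(), index=X.columns)
print(ols_coefs)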
XGBoost Regression with gblinear:
from xgboost import XGBRegressor
xgb_model = XGBRegressor(n_estimators=100, learning_rate=0.06, gamma=1, subsample=0.8, objective='reg:squarederror', booster='gblinear', n_jobs=-1)
xgb_model.fit(X_train, y_train)
print(xgb_model.coef_)
Output:
[-0.192631 0.0966579 -0.00972393 0.34198 0.159105 1.09779
0.039317 0.289027 -0.00622574 0.00236915 0.171237 0.0164343
-0.398639 ]
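One rough way to check whether the gap comes from gblinear's regularization is to turn the penalties off and give coordinate descent more boosting rounds; the settings below (reg_alpha=0, reg_lambda=0, larger n_estimators, higher learning_rate) are my own assumptions for such a check, not part of the original comparison. If the coefficients move toward the OLS ones, the difference is largely down to the L1/L2 shrinkage; any remaining gap would reflect the boosted coordinate-descent fit itself.

from xgboost import XGBRegressor
import pandas as pd

# Same linear booster, but with the L1/L2 penalties disabled and more
# boosting rounds so the coordinate-descent updates have time to converge.
xgb_unreg = XGBRegressor(
    booster='gblinear',
    objective='reg:squarederror',
    n_estimators=1000,
    learning_rate=0.5,
    reg_alpha=0,      # no L1 penalty
    reg_lambda=0,     # no L2 penalty
    n_jobs=-1,
)
xgb_unreg.fit(X_train, y_train)

# Side-by-side comparison with the OLS coefficients from above.
comparison = pd.DataFrame({
    'OLS': linear_model.coef_.ravel(),
    'gblinear_no_penalty': xgb_unreg.coef_,
}, index=X.columns)
print(comparison)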