Dave has addressed your specific question and I suggest you accept that answer. The point about "unexpected performance" in machine learning is quite on target.
What follows is more to get you headed toward better ways to handle your particular data set.
First, don't do a test/train split with so few cases. Build your model on the entire data set, then validate the model-building process by resampling. One accepted approach is to repeat the entire modeling process on multiple bootstrap resamples of your data and then test the performance of the resulting models on the full original data set.
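Here is a minimal sketch of that idea, assuming a binary outcome and plain logistic regression as the "modeling process"; the synthetic `X`, `y`, and model stand in for your own data and whatever procedure you actually use:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Stand-in for your small data set; in practice use your own X, y.
X, y = make_classification(n_samples=100, n_features=8, random_state=0)

rng = np.random.default_rng(0)
n = len(y)
scores = []
for _ in range(200):                      # a few hundred resamples is typical
    idx = rng.integers(0, n, size=n)      # draw n cases with replacement
    m = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    # Score each resample's model against the FULL original data set.
    scores.append(roc_auc_score(y, m.predict_proba(X)[:, 1]))

print(f"mean AUC over resamples, evaluated on original data: {np.mean(scores):.3f}")
```

The key point is that every step of model building (including any predictor selection or penalty tuning) goes inside the loop, so the resampling evaluates the whole process rather than one frozen model.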
Second, LASSO or elastic net might not be a good choice here. You seem to have a lot of multi-level categorical predictors and at least one interaction. LASSO and elastic net won't necessarily keep all levels of a categorical predictor in the model, and they might keep an interaction term while omitting the individual contributors to the interaction. That's generally not a good idea; see this thread. There is a group LASSO, explained in Statistical Learning with Sparsity, that can keep specified sets of predictors together, but my guess is that your data set is too small for that to work adequately.
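A toy illustration of the level-dropping problem, on hypothetical data: after one-hot encoding a 4-level categorical predictor, plain LASSO is free to zero out some dummy columns while keeping others, so the predictor ends up neither fully in nor fully out of the model.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(1)
cat = rng.choice(list("ABCD"), size=60)             # 4-level categorical
x_cont = rng.normal(size=60)                        # one continuous predictor
effects = {"A": 0.0, "B": 0.2, "C": 1.5, "D": 0.1}  # only level C matters much
y = np.array([effects[c] for c in cat]) + 0.5 * x_cont \
    + rng.normal(scale=0.5, size=60)

X = pd.get_dummies(pd.DataFrame({"cat": cat}), drop_first=True).astype(float)
X["x_cont"] = x_cont
fit = Lasso(alpha=0.1).fit(X, y)
print(dict(zip(X.columns, fit.coef_.round(2))))
# Typically cat_C survives while cat_B and cat_D are shrunk to exactly zero,
# leaving only part of the categorical predictor in the model.
```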
Third, your many categorical predictors, some of them multi-level, pose a particular problem for penalized approaches like LASSO or elastic net. Continuous predictors in such models are usually normalized to zero mean and unit standard deviation so that they all start on similar scales. That doesn't always make sense with categorical predictors; with a multi-level predictor, the results might change depending on what you choose as the reference level! See this thread. Jumping ahead into penalization without thinking hard about your categorical predictors is an easy way to get into a lot of trouble.
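The reference-level dependence is easy to demonstrate on made-up data. With dummy coding, the penalty shrinks differences *from the reference level*, so changing which level you drop changes what gets shrunk; an unpenalized fit would give identical fitted values under either coding.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso

rng = np.random.default_rng(2)
cat = rng.choice(list("ABC"), size=90)
effects = {"A": 0.0, "B": 1.0, "C": 2.0}
y = np.array([effects[c] for c in cat]) + rng.normal(scale=0.5, size=90)

def fitted_means(ref):
    # Reorder the categories so `ref` becomes the dropped reference level.
    levels = [ref] + [l for l in "ABC" if l != ref]
    coded = pd.Categorical(cat, categories=levels)
    X = pd.get_dummies(pd.DataFrame({"cat": coded}), drop_first=True).astype(float)
    pred = Lasso(alpha=0.1).fit(X, y).predict(X)
    return {l: pred[cat == l].mean().round(2) for l in "ABC"}

print("reference A:", fitted_means("A"))
print("reference C:", fitted_means("C"))
# The penalized fitted group means differ between the two codings.
```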
Before you go further with the mechanics of your modeling, see if you can use your knowledge of the subject matter to reduce the number of predictors or to combine multiple predictors into single predictors, without first looking at their associations with the outcome. Chapter 4 of Frank Harrell's course notes provides a lot of useful guidance on such predictor simplification and other aspects of multivariable modeling.
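One concrete way to do that, in the spirit of those data-reduction notes, is to take the first principal component of a block of predictors that you judge, on subject-matter grounds, to measure the same underlying thing. Because this uses only the predictors, it doesn't peek at the outcome. The "three lab values" here are hypothetical:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# e.g., three lab values believed to reflect one underlying construct:
# a shared factor plus independent noise.
labs = rng.normal(size=(100, 1)) + 0.3 * rng.normal(size=(100, 3))

# One summary score replaces three columns, shrinking the predictor count
# before any outcome data are consulted.
score = PCA(n_components=1).fit_transform(StandardScaler().fit_transform(labs))
print(score.shape)  # (100, 1)
```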
I suspect that you will get the most reliable results by starting with such predictor combinations and then penalizing in a way that keeps all of those (combined) predictors in the model, avoiding the unstable predictor selection you get with LASSO. Ridge regression is one choice, but more general penalized estimation, which adjusts the relative penalization among predictors based on prior subject-matter knowledge and handles categorical predictors intelligently, will probably work better.
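If you use software whose ridge penalty is one-size-fits-all, you can still mimic per-predictor penalty weights (in the spirit of glmnet's `penalty.factor`) with a column-rescaling trick: penalizing $w_j\beta_j^2$ is equivalent to dividing column $j$ by $\sqrt{w_j}$, fitting ordinary ridge, and dividing the fitted coefficient by $\sqrt{w_j}$ afterward. A minimal sketch with hypothetical weights; in practice the weights come from prior subject-matter knowledge:

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -0.5, 0.8]) + rng.normal(scale=0.5, size=100)

w = np.array([1.0, 1.0, 4.0])      # penalize predictor 3 four times as hard
X_scaled = X / np.sqrt(w)          # rescale columns before fitting
fit = Ridge(alpha=1.0).fit(X_scaled, y)
beta = fit.coef_ / np.sqrt(w)      # map coefficients back to the original scale
print(beta.round(3))
```

Note that, unlike LASSO, this keeps every predictor in the model; the weights only control how strongly each coefficient is shrunk.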