
I ran a regression with tidymodels, following along with the random forest example here but using different data.

When I ran it with four variables or so, I got an R Squared of 0.94 but an RMSE of 20,000, which is high for what I'm trying to predict. I added more variables and got an R Squared of 0.97 and an RMSE of 40,000. Why would the RMSE increase if the R Squared supposedly indicates the model is better? I believe RMSE measures how far off my predictions are from the actual test data. I'm trying to bring my RMSE down, which is why I added more variables.

Richard Hardy
  • Is your RMSE measured in-sample, or on a holdout sample, or on an out-of-bag sample? – Stephan Kolassa Dec 30 '21 at 23:56
  • [You are correct to think that $R^2$ increasing corresponds to $MSE$ decreasing](https://stats.stackexchange.com/a/551916/247274)…if you are measuring those values on the same data. It is absolutely possible, however, that in-sample $R^2$ increases while out-of-sample $MSE$ also increases. (Indeed, we use out-of-sample metrics to detect such behavior.) Such a situation corresponds to in-sample MSE decreasing but out-of-sample MSE increasing. – Dave Dec 31 '21 at 01:13
  • Because this sounds mathematically impossible, please supply a reproducible example. – whuber Dec 31 '21 at 02:02
  • I think it’s in-sample @Stephan Kolassa – Learning_and_xbox Dec 31 '21 at 03:49

1 Answer


One visual interpretation of adding more variables to your model (which implies adjusting for them, and thus slicing your data) is that you may end up with only the points that lie closer to the curve you are fitting. This explains why your $R^2$ increases when you add more variables: your model fits the remaining data better. However, this is not always the case; adjusting for new variables can also worsen $R^2$ if the resulting data points are hard to fit with the curve you are using. More importantly, none of this prevents overfitting. If your model is overfitting, you can still get a high $R^2$ on the training data and a very large RMSE when assessing your model on test data (data that was not used to build your model).
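The question uses R/tidymodels, but the mechanism is easy to reproduce in any framework. Below is a minimal scikit-learn sketch on simulated data (not the asker's data) where only the first four of fifty predictors carry signal. Because the models are nested, in-sample $R^2$ can only go up as predictors are added, while the test-set RMSE gets worse, which is exactly the pattern described in the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 60, 200, 50

# Only the first 4 predictors matter; the remaining 46 are pure noise.
X = rng.normal(size=(n_train + n_test, p))
beta = np.zeros(p)
beta[:4] = [5.0, -3.0, 2.0, 4.0]
y = X @ beta + rng.normal(scale=2.0, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

results = {}
for k in (4, p):  # few predictors vs. all predictors
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    r2_train = r2_score(y_tr, fit.predict(X_tr[:, :k]))          # in-sample R^2
    rmse_test = mean_squared_error(y_te, fit.predict(X_te[:, :k])) ** 0.5  # out-of-sample RMSE
    results[k] = (r2_train, rmse_test)

print(results)
```

Running this shows training $R^2$ rising with the extra (noise) predictors while test RMSE rises too; the two metrics only move in lockstep when computed on the same data.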

mribeirodantas
  • *Caeteris paribus,* $R^2$ increases as the RMSE decreases: this is what requires an explanation. – whuber Dec 31 '21 at 02:01
  • Thank you all for your replies. I am examining the RMSE on test data after training. It does increase along with the R Squared. I will look into overfitting and providing a reproducible example tomorrow. – Learning_and_xbox Dec 31 '21 at 03:43
  • $R^2=1-\dfrac{RMSE^2}{\sum\big(y_i-\bar y\big)^2}$ so bigger $RMSE \iff$ smaller $R^2$ and smaller $RMSE\iff$ bigger $R^2$, since that denominator term is the same for every model of the same data. – Dave Dec 31 '21 at 03:43
  • Should be $n\times RMSE^2$ in the numerator of my earlier comment – Dave Dec 31 '21 at 03:56
  • @Learning_and_xbox, are RMSE and $R^2$ calculated on the exact same data? You said RMSE comes from test data. Where does $R^2$ come from? – Richard Hardy Dec 31 '21 at 08:27
  • @RichardHardy Hi - It is the R Squared from the training data model. – Learning_and_xbox Jan 07 '22 at 03:05
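To make Dave's corrected identity concrete: since $n \cdot RMSE^2$ is just the sum of squared errors, $R^2 = 1 - \dfrac{n \cdot RMSE^2}{\sum(y_i-\bar y)^2}$ holds whenever both metrics are computed on the same data. A quick numeric check in Python (with arbitrary simulated values, not the asker's data):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=3.0, size=100)   # "actual" values
y_hat = y + rng.normal(scale=1.0, size=100)     # imperfect predictions

n = len(y)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
sst = np.sum((y - y.mean()) ** 2)               # total sum of squares

# Dave's identity: n * RMSE^2 equals the sum of squared errors (SSE)
r2_from_rmse = 1 - n * rmse**2 / sst
r2_direct = 1 - np.sum((y - y_hat) ** 2) / sst  # usual definition, 1 - SSE/SST

print(r2_from_rmse, r2_direct)
```

The two values agree to machine precision, so on the same sample a larger RMSE forces a smaller $R^2$. The divergence in the question is only possible because, as the final comment confirms, $R^2$ came from the training data while RMSE came from the test data.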