
I ran a regression with tidymodels, following along with the random forest example here but using different data.

When I ran it with four variables or so, I got an R Squared of 0.94 but an RMSE of 20,000, which is high for what I'm trying to predict. I added more variables and got an R Squared of 0.97 and an RMSE of 40,000. Why would the RMSE increase if the R Squared supposedly indicates the model is better? I believe RMSE measures how far off my predictions are from the actual test data. I'm trying to bring my RMSE down, which is why I added more variables.

Richard Hardy
  • Is your RMSE measured in-sample, or on a holdout sample, or on an out-of-bag sample? – Stephan Kolassa Dec 30 '21 at 23:56
  • [You are correct to think that $R^2$ increasing corresponds to $MSE$ decreasing](https://stats.stackexchange.com/a/551916/247274)…if you are measuring those values on the same data. It is absolutely possible, however, that in-sample $R^2$ increases while out-of-sample $MSE$ also increases. (Indeed, we use out-of-sample metrics to detect such behavior.) Such a situation corresponds to in-sample MSE decreasing but out-of-sample MSE increasing. – Dave Dec 31 '21 at 01:13
  • Because this sounds mathematically impossible, please supply a reproducible example. – whuber Dec 31 '21 at 02:02
  • I think it’s in-sample @Stephan Kolassa – Learning_and_xbox Dec 31 '21 at 03:49

1 Answer


One visual interpretation of adding more variables to your model (which implies adjusting for them, and thus slicing your data) is that you may end up with only the points that lie closer to the curve you are fitting. This explains why your $R^2$ increases when you add more variables: your model fits the remaining data better. However, this is not always the case; adjusting for new variables can also worsen $R^2$ if the resulting data points are hard to fit with the curve you are using. More importantly, none of this prevents overfitting. If your model is overfitting, you can still get a high $R^2$ on the training data and a very large RMSE when assessing your model on test data (data that was not used to build your model).
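The question uses R/tidymodels, but the mechanism is easy to reproduce in any framework. Below is a minimal scikit-learn sketch on simulated data (not the asker's data) where only the first four of fifty predictors carry signal. Because the models are nested, in-sample $R^2$ can only go up as predictors are added, while the test-set RMSE gets worse, which is exactly the pattern described in the question:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.default_rng(0)
n_train, n_test, p = 60, 200, 50

# Only the first 4 predictors matter; the remaining 46 are pure noise.
X = rng.normal(size=(n_train + n_test, p))
beta = np.zeros(p)
beta[:4] = [5.0, -3.0, 2.0, 4.0]
y = X @ beta + rng.normal(scale=2.0, size=n_train + n_test)

X_tr, X_te = X[:n_train], X[n_train:]
y_tr, y_te = y[:n_train], y[n_train:]

results = {}
for k in (4, p):  # few predictors vs. all predictors
    fit = LinearRegression().fit(X_tr[:, :k], y_tr)
    r2_train = r2_score(y_tr, fit.predict(X_tr[:, :k]))          # in-sample R^2
    rmse_test = mean_squared_error(y_te, fit.predict(X_te[:, :k])) ** 0.5  # out-of-sample RMSE
    results[k] = (r2_train, rmse_test)

print(results)
```

Running this shows training $R^2$ rising with the extra (noise) predictors while test RMSE rises too; the two metrics only move in lockstep when computed on the same data.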

mribeirodantas
  • *Caeteris paribus,* $R^2$ increases as the RMSE decreases: this is what requires an explanation. – whuber Dec 31 '21 at 02:01
  • Thank you all for your replies. I am examining the RMSE on test data after training. It does increase along with the R Squared. I will look into overfitting and providing a reproducible example tomorrow. – Learning_and_xbox Dec 31 '21 at 03:43
  • $R^2=1-\dfrac{RMSE^2}{\sum\big(y_i-\bar y\big)^2}$ so bigger $RMSE \iff$ smaller $R^2$ and smaller $RMSE\iff$ bigger $R^2$, since that denominator term is the same for every model of the same data. – Dave Dec 31 '21 at 03:43
  • Should be $n\times RMSE^2$ in the numerator of my earlier comment – Dave Dec 31 '21 at 03:56
  • @Learning_and_xbox, are RMSE and $R^2$ calculated on the exact same data? You said RMSE comes from test data. Where does $R^2$ come from? – Richard Hardy Dec 31 '21 at 08:27
  • @RichardHardy Hi - It is the R Squared from the training data model. – Learning_and_xbox Jan 07 '22 at 03:05
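To make Dave's corrected identity concrete: since $n \cdot RMSE^2$ is just the sum of squared errors, $R^2 = 1 - \dfrac{n \cdot RMSE^2}{\sum(y_i-\bar y)^2}$ holds whenever both metrics are computed on the same data. A quick numeric check in Python (with arbitrary simulated values, not the asker's data):

```python
import numpy as np

rng = np.random.default_rng(1)
y = rng.normal(loc=10.0, scale=3.0, size=100)   # "actual" values
y_hat = y + rng.normal(scale=1.0, size=100)     # imperfect predictions

n = len(y)
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
sst = np.sum((y - y.mean()) ** 2)               # total sum of squares

# Dave's identity: n * RMSE^2 equals the sum of squared errors (SSE)
r2_from_rmse = 1 - n * rmse**2 / sst
r2_direct = 1 - np.sum((y - y_hat) ** 2) / sst  # usual definition, 1 - SSE/SST

print(r2_from_rmse, r2_direct)
```

The two values agree to machine precision, so on the same sample a larger RMSE forces a smaller $R^2$. The divergence in the question is only possible because, as the final comment confirms, $R^2$ came from the training data while RMSE came from the test data.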