
I have designed a neural network with two different choices of features.

Neither feature set stands out: one trains a neural network that scores the higher $R^2$ on test data, whereas the other produces a neural network that scores the lower RMSE.

Which of $R^2$ and RMSE should be the deciding factor for which feature set is better?

Gabi23
  • What happens if you use both feature sets? – JTH Sep 10 '21 at 00:59
  • They're basically the same thing (they both square differences to the mean such that outliers become somewhat important): $R^2_{adj} = 1 - \frac{{RMSE}^2}{\sigma^2_y}$. See [here](https://stats.stackexchange.com/questions/32596/what-is-the-difference-between-coefficient-of-determination-and-mean-squared) and [here](https://stats.stackexchange.com/questions/142248/difference-between-r-square-and-rmse-in-linear-regression) – Patrick Coulombe Sep 10 '21 at 01:01
  • @PatrickCoulombe watch out for doing $R^2$ out of sample. It makes no sense to me to compare to the variance of the training $y$, and the denominator is supposed to be the performance of the naïve model that always predicts the (training) $\bar y$, so it does not make sense to me to use the out-of-sample variance of $y$, either. See my comment to my answer. – Dave Sep 10 '21 at 01:37

1 Answer


They are equivalent up to computational issues that arise from doing math on a computer.

$$R^2=1-\dfrac{SSResiduals}{SSTotal}$$

$$ RMSE = \sqrt{\dfrac{SSResiduals}{n}} $$

Both are just functions of the sum of squared residuals (“errors,” in mostly-acceptable-even-if-technically-imprecise machine learning slang).

Since $SSResiduals = n \cdot RMSE^2$, substituting into the first formula gives

$$R^2 = 1 - \dfrac{n \cdot RMSE^2}{SSTotal}.$$

For a fixed dataset, $n$ and $SSTotal$ are constants, so $R^2$ is a strictly decreasing function of $RMSE$. Setting aside some issues that can arise from doing math on a computer, maximizing $R^2$ is the same as minimizing $RMSE$.
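As a quick numerical check, here is a minimal NumPy sketch (the toy targets and predictions are simulated for illustration, not taken from the question) showing that on a fixed test set the two metrics rank models identically:

```python
import numpy as np

# Toy stand-ins for test-set targets and two models' predictions
# (all values here are simulated for illustration).
rng = np.random.default_rng(0)
y_test = rng.normal(size=100)
preds_a = y_test + rng.normal(scale=0.5, size=100)  # model from feature set A
preds_b = y_test + rng.normal(scale=0.7, size=100)  # model from feature set B

def rmse(y, yhat):
    return np.sqrt(np.mean((y - yhat) ** 2))

def r2(y, yhat):
    ss_res = np.sum((y - yhat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot

# Both metrics are monotone in the same SSResiduals, so on a fixed test set
# the model with the lower RMSE is always the model with the higher R^2.
for name, preds in [("A", preds_a), ("B", preds_b)]:
    print(f"feature set {name}: RMSE={rmse(y_test, preds):.4f}, R^2={r2(y_test, preds):.4f}")
```

Whichever model prints the lower RMSE will also print the higher $R^2$, since both depend on the same sum of squared residuals over the same test set.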

Dave
  • I would consider the out-of-sample $SSTotal$ to be equal to $\sum \big( y_i - \bar y _{\text{in sample}} \big)^2$ for out-of-sample $y_i$. – Dave Sep 10 '21 at 01:33
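
A minimal sketch of that convention (the function name and signature here are mine, not something from the thread):

```python
import numpy as np

def out_of_sample_r2(y_test, yhat_test, y_train):
    """R^2 on held-out data, benchmarked against the naive model
    that always predicts the *training* mean."""
    ss_res = np.sum((y_test - yhat_test) ** 2)
    ss_tot = np.sum((y_test - np.mean(y_train)) ** 2)
    return 1.0 - ss_res / ss_tot
```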