In caret, the calculation of results$RMSE and results$Rsquared is not as simple as what you've indicated. They are in fact the averages of RMSE and $R^2$ over the ten holdout folds.
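For context, here is a minimal sketch of the kind of call that could produce an object like t1 below. The data are simulated stand-ins (the question's actual x and y aren't shown), so the numbers won't reproduce exactly; note that caret's default tuneLength = 3 is what yields the 3 x 3 alpha/lambda grid printed in the summary.

library(caret)

set.seed(42)
x <- matrix(rnorm(1000 * 20), ncol = 20)   # 1000 samples, 20 predictors
y <- rnorm(1000, sd = 18)                  # simulated response

t1 <- train(x, y, method = "glmnet",
            trControl = trainControl(method = "cv", number = 10))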
To confirm this, run the summary:
> t1
glmnet

1000 samples
  20 predictors

No pre-processing
Resampling: Cross-Validated (10 fold)
Summary of sample sizes: 900, 900, 900, 900, 900, 900, ...
Resampling results across tuning parameters:

  alpha  lambda      RMSE      Rsquared
  0.10   0.01065054  17.93931  0.1655746
  0.10   0.10650539  17.93720  0.1656599
  0.10   1.06505391  17.89291  0.1678166
  0.55   0.01065054  17.93838  0.1657046
  0.55   0.10650539  17.91755  0.1668356
  0.55   1.06505391  17.84962  0.1731936
  1.00   0.01065054  17.93824  0.1657245
  1.00   0.10650539  17.90045  0.1678998
  1.00   1.06505391  17.92535  0.1710923

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were alpha = 0.55 and lambda = 1.065054.
For the optimal parameter combination alpha = 0.55 and lambda = 1.065054, the performance on each held-out fold can be seen in the object t1$resample:
> t1$resample
       RMSE   Rsquared Resample
1  18.42848 0.04479504   Fold05
2  21.17820 0.10500276   Fold08
3  18.27933 0.20858027   Fold04
4  17.31308 0.19080079   Fold07
5  16.60865 0.21812706   Fold10
6  20.07291 0.18737052   Fold02
7  16.48082 0.24041654   Fold03
8  17.18363 0.18379930   Fold06
9  17.29819 0.13669866   Fold09
10 15.65289 0.21634546   Fold01
(Needless to say, each row's RMSE and Rsquared are computed on a different CV fold, so the two columns need not rank the folds in the same order.) If you average these columns, you'll get:
> mean(t1$resample$RMSE)
[1] 17.84962
> mean(t1$resample$Rsquared)
[1] 0.1731936
...which are exactly the RMSE and Rsquared reported for alpha = 0.55 and lambda = 1.065054 in row 6 of the summary's tuning table.
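Assuming t1 is the fitted train object as above, you can also pull that row out programmatically instead of reading it off the printed table:

merge(t1$bestTune, t1$results)   # row for alpha = 0.55, lambda = 1.065054:
                                 # RMSE = 17.84962, Rsquared = 0.1731936
getTrainPerf(t1)                 # the same fold-averaged metrics, via caret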
EDIT: Why does averaging over folds disrupt the rank ordering? Suppose we have split the data into $F$ folds, and we are considering $C$ tuning combinations. For each combo $c$ and held-out fold $f$, the relationship between the $R^2$ and MSE calculated on fold $f$ is:
$$\operatorname{Rsquared}(c,f)=1-\frac{\operatorname{MSE}(c,f)}{\operatorname{Var}(f)},\tag{1}$$
where $\operatorname{Var}(f)$ is shorthand for the variance of the observed responses in fold $f$. It is certainly true that for a given $f$, if we average over all $c$ then the monotonic relationship between $R^2$ and MSE is preserved, since by linearity:
$$\frac1C\sum_c\operatorname{Rsquared}(c,f)=1-\frac{\frac1C\sum_c\operatorname{MSE}(c,f)}{\operatorname{Var}(f)}.\tag{2}$$
However, if we average (1) over all $f$ we cannot assert a similar statement, since the denominator $\operatorname{Var}(f)$, which varies with the fold being held out, gets in the way:
$$\frac1F\sum_f\operatorname{Rsquared}(c,f)=1-\frac1F\sum_f\left(\frac{\operatorname{MSE}(c,f)}{\operatorname{Var}(f)}\right).\tag{3}$$
The RHS of (3) cannot be simplified further to reveal a monotonic relationship between the average $R^2$ over all folds and the average MSE over all folds.
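To see the failure concretely, here is a toy example (made-up numbers): two tuning combos evaluated on two folds whose held-out responses have very different variances.

var_f <- c(1, 100)                       # Var(f) for folds 1 and 2
mse   <- rbind(A = c(0.5, 50),           # MSE(c, f) for combo A
               B = c(0.9, 40))           # MSE(c, f) for combo B
r2    <- 1 - sweep(mse, 2, var_f, "/")   # Rsquared(c, f) via equation (1)

rowMeans(mse)   # A = 25.25, B = 20.45  -> B wins on fold-averaged MSE
rowMeans(r2)    # A = 0.50,  B = 0.35   -> yet A wins on fold-averaged R^2

So a selection rule based on fold-averaged MSE would prefer combo B, while one based on fold-averaged $R^2$ would prefer combo A.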
Since RMSE is the square root of MSE, and averaging does not commute with the square root, the relationship between fold-averaged $R^2$ and fold-averaged RMSE is even less direct. Indeed, for any given fold, there is not even an analog of (2) relating combo-averaged $R^2$ to combo-averaged RMSE.