I am confused. I know there are a couple of similar questions about $R^2$, but I hope to get some opinions on this particular matter.
I have trained a random forest and other nonparametric regression models, and I want to test their performance on unseen data, i.e., measure their predictive accuracy.
I am an engineering student who is not particularly good at statistics. I know we must differentiate between measuring goodness of fit (GoF) and predictive accuracy: the former is measured on the training data and the latter on test data. But that does not mean we must have different metrics for each. Correct me if I'm wrong, please.
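To make sure I understand the distinction, here is a minimal sketch of what I mean, evaluating the same metric once on training data (GoF) and once on held-out data (predictive accuracy). The data and model settings are placeholders, not my actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))                                   # placeholder features
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=200)   # placeholder target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

print("GoF (train):", r2_score(y_train, rf.predict(X_train)))
print("Predictive accuracy (test):", r2_score(y_test, rf.predict(X_test)))
```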
I have read some references arguing that $R^2$ should not be used to measure GoF if the model is not linear or cannot somehow be transformed into a linear model (Kvålseth, 1985; Spiess and Neumeyer, 2010).
Now you may ask: which definition of $R^2$? That's part of the confusion too. Let's take the two most common ones:
$$ R_1^2 = 1 - \dfrac{\Sigma (y_{true} - y_{pred})^2}{\Sigma (y_{true} - \bar y_{true})^2}$$
The above version is the one that is used in the popular scikit-learn package in Python.
And $R_2^2$ is the squared correlation coefficient between $y_{true}$ and $y_{pred}$ (Pearson's $r$). This one is used in the caret package in R.
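In code, as I understand it, the two definitions would be (a minimal sketch; `y_true` and `y_pred` stand for any arrays of observations and predictions):

```python
import numpy as np

def r2_1(y_true, y_pred):
    """R_1^2: one minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot          # can be negative

def r2_2(y_true, y_pred):
    """R_2^2: squared Pearson correlation between y_true and y_pred."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2   # always in [0, 1]
```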
The interpretation for both of them: the proportion of the total variance of $y_{true}$ explained by the fitted model.
Two things I gather from this:
- It is apparently only a measure of GoF.
- Since it is a proportion, it is meaningless for it to be negative, and it MUST lie between zero and one.
I want your opinion on this: in my field (hydrology), researchers use the Nash–Sutcliffe efficiency (NSE) score, which is calculated exactly as $R_1^2$, to measure the predictive accuracy or power of hydrological models, which are clearly not linear. Their rationale is that the model should do better than a benchmark, the benchmark being $\bar y_{true}$; negative values of NSE therefore mean that the model does worse than predicting the mean of the targets. I have a feeling that this is fundamentally wrong: the benchmark estimator is vague (how can we have $\bar y_{true}$ on unseen data to begin with?), and since NSE is basically $R_1^2$, we cannot use it as a measure of predictive accuracy.
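To make my worry about the benchmark explicit, here is a minimal sketch of NSE with the benchmark as an argument (the `nse` helper and the variable names in the comments are hypothetical, just for illustration):

```python
import numpy as np

def nse(y_true, y_pred, benchmark):
    """Nash-Sutcliffe efficiency against an explicit benchmark prediction."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_bench = np.sum((y_true - benchmark) ** 2)
    return 1.0 - ss_res / ss_bench

# The usual formula silently benchmarks against the mean of the *evaluation* data:
#   nse(y_test, preds, benchmark=y_test.mean())
# But on truly unseen data, only the training-period mean is available in advance:
#   nse(y_test, preds, benchmark=y_train.mean())
# These two choices generally give different scores.
```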
Now my questions:
- Should/can I use $R_1^2$ to measure the accuracy of my random forest's predictions?
- Can I use $R_2^2$ for the above-mentioned purpose?
- Besides metrics like MAE and RMSE, what are other options for quantifying the performance of nonparametric models on test data, in terms of accuracy or association?
Here is a subset of my test-set predictions and observations:
\begin{array}{|c|c|c|}
\hline
 & y\_true & y\_preds \\ \hline
0 & 3.745821 & 4.894624 \\ \hline
1 & 3.940449 & 5.743571 \\ \hline
2 & 2.849447 & 4.726890 \\ \hline
3 & 1.653091 & 2.659571 \\ \hline
4 & 2.934447 & 4.244686 \\ \hline
5 & 3.346146 & 5.269689 \\ \hline
6 & 2.450010 & 4.651610 \\ \hline
7 & 3.393356 & 5.122578 \\ \hline
8 & 0.791639 & 1.656736 \\ \hline
9 & 0.893791 & 1.935156 \\ \hline
10 & 0.129959 & 3.976739 \\ \hline
11 & 2.043000 & 4.072408 \\ \hline
12 & 4.298383 & 4.357470 \\ \hline
13 & 3.115428 & 4.432231 \\ \hline
14 & 4.325494 & 4.599493 \\ \hline
\end{array}
(The values are daily evapotranspiration in mm.)
For this subset and my random forest:
$R_1^2 = -0.87$ and $R_2^2 = 0.55$.
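These numbers can be reproduced from the table above with scikit-learn's r2_score and NumPy's corrcoef:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.745821, 3.940449, 2.849447, 1.653091, 2.934447,
                   3.346146, 2.450010, 3.393356, 0.791639, 0.893791,
                   0.129959, 2.043000, 4.298383, 3.115428, 4.325494])
y_pred = np.array([4.894624, 5.743571, 4.726890, 2.659571, 4.244686,
                   5.269689, 4.651610, 5.122578, 1.656736, 1.935156,
                   3.976739, 4.072408, 4.357470, 4.432231, 4.599493])

print(round(r2_score(y_true, y_pred), 2))                 # R_1^2: -0.87
print(round(np.corrcoef(y_true, y_pred)[0, 1] ** 2, 2))   # R_2^2:  0.55
```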