I am confused. I know there are a couple of similar questions about $R^2$, but I hope to get some opinions on this particular matter.
I have trained a random forest and other nonparametric regression models, and I want to test their performance on unseen data, i.e., measure their predictive accuracy.
I am an engineering student who is not particularly good at statistics. I know we must differentiate between measuring goodness of fit (GoF) and predictive accuracy: the former is measured on the training data and the latter on test data. But that does not mean we must have different metrics for each. Correct me if I'm wrong, please.
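To make sure I understand the distinction, here is a minimal sketch of what I mean, evaluating the same metric once on training data (GoF) and once on held-out data (predictive accuracy). The data and model settings are placeholders, not my actual setup:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

rng = np.random.RandomState(0)
X = rng.uniform(size=(200, 3))                                   # placeholder features
y = X[:, 0] + np.sin(3 * X[:, 1]) + 0.1 * rng.normal(size=200)   # placeholder target

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(X_train, y_train)

print("GoF (train):", r2_score(y_train, rf.predict(X_train)))
print("Predictive accuracy (test):", r2_score(y_test, rf.predict(X_test)))
```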
I have read some references arguing that $R^2$ should not be used to measure GoF if the model is not linear or cannot somehow be transformed into a linear model (Kvålseth, 1985; Spiess and Neumeyer, 2010).
Now you may ask: which definition of $R^2$? That's part of the confusion too. Let's take the two most common ones:
$$ R_1^2 = 1 - \dfrac{\Sigma (y_{true} - y_{pred})^2}{\Sigma (y_{true} - \bar y_{true})^2}$$
The above version is the one that is used in the popular scikit-learn package in Python.
And $R_2^2$ is the squared correlation coefficient between $y_{true}$ and $y_{pred}$ (Pearson's $r$). This one is used in the caret package in R.
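In code, as I understand it, the two definitions would be (a minimal sketch; `y_true` and `y_pred` stand for any arrays of observations and predictions):

```python
import numpy as np

def r2_1(y_true, y_pred):
    """R_1^2: one minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot          # can be negative

def r2_2(y_true, y_pred):
    """R_2^2: squared Pearson correlation between y_true and y_pred."""
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2   # always in [0, 1]
```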
The interpretation for both of them: the proportion of the total variance of $y_{true}$ explained by the fitted model.
Two things I gather from this:
- It is apparently only a measure of GoF.
- Since it is a proportion, it is meaningless for it to be negative, and it MUST lie between zero and one.
I want your opinion on this: in my field (hydrology), researchers use the Nash–Sutcliffe efficiency (NSE) score, which is calculated exactly as $R_1^2$, to measure the predictive accuracy or power of hydrological models, which are clearly not linear. Their rationale is that the model should do better than a benchmark, the benchmark being $\bar y_{true}$; negative values of NSE therefore mean that the model does worse than predicting the mean of the targets. I have a feeling that this is fundamentally wrong: the benchmark estimator is vague (how can we have $\bar y_{true}$ on unseen data to begin with?), and since NSE is basically $R_1^2$, we cannot use it as a measure of predictive accuracy.
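To make my worry about the benchmark explicit, here is a minimal sketch of NSE with the benchmark as an argument (the `nse` helper and the variable names in the comments are hypothetical, just for illustration):

```python
import numpy as np

def nse(y_true, y_pred, benchmark):
    """Nash-Sutcliffe efficiency against an explicit benchmark prediction."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_bench = np.sum((y_true - benchmark) ** 2)
    return 1.0 - ss_res / ss_bench

# The usual formula silently benchmarks against the mean of the *evaluation* data:
#   nse(y_test, preds, benchmark=y_test.mean())
# But on truly unseen data, only the training-period mean is available in advance:
#   nse(y_test, preds, benchmark=y_train.mean())
# These two choices generally give different scores.
```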
Now my questions:
- Should/can I use $R_1^2$ to measure the accuracy of my random forest's predictions?
- Can I use $R_2^2$ for the above-mentioned purpose?
- Besides metrics like MAE and RMSE, what are other options for quantifying the performance of nonparametric models on test data, in terms of accuracy or association?
Here is a subset of my test-set predictions and observations:
\begin{array}{|c|c|c|}
\hline
 & y\_true & y\_preds \\ \hline
0 & 3.745821 & 4.894624 \\ \hline
1 & 3.940449 & 5.743571 \\ \hline
2 & 2.849447 & 4.726890 \\ \hline
3 & 1.653091 & 2.659571 \\ \hline
4 & 2.934447 & 4.244686 \\ \hline
5 & 3.346146 & 5.269689 \\ \hline
6 & 2.450010 & 4.651610 \\ \hline
7 & 3.393356 & 5.122578 \\ \hline
8 & 0.791639 & 1.656736 \\ \hline
9 & 0.893791 & 1.935156 \\ \hline
10 & 0.129959 & 3.976739 \\ \hline
11 & 2.043000 & 4.072408 \\ \hline
12 & 4.298383 & 4.357470 \\ \hline
13 & 3.115428 & 4.432231 \\ \hline
14 & 4.325494 & 4.599493 \\ \hline
\end{array}
(The values are daily evapotranspiration in mm.)
For this subset and my random forest:
$R_1^2 = -0.87$ and $R_2^2 = 0.55$.
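These numbers can be reproduced from the table above with scikit-learn's r2_score and NumPy's corrcoef:

```python
import numpy as np
from sklearn.metrics import r2_score

y_true = np.array([3.745821, 3.940449, 2.849447, 1.653091, 2.934447,
                   3.346146, 2.450010, 3.393356, 0.791639, 0.893791,
                   0.129959, 2.043000, 4.298383, 3.115428, 4.325494])
y_pred = np.array([4.894624, 5.743571, 4.726890, 2.659571, 4.244686,
                   5.269689, 4.651610, 5.122578, 1.656736, 1.935156,
                   3.976739, 4.072408, 4.357470, 4.432231, 4.599493])

print(round(r2_score(y_true, y_pred), 2))                 # R_1^2: -0.87
print(round(np.corrcoef(y_true, y_pred)[0, 1] ** 2, 2))   # R_2^2:  0.55
```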