Can $R^2$ be used to measure the performance of a random forest model? My explanatory and dependent variables are linearly related.

Agi
  • This has been written about before on this site. Check out, for instance, https://stats.stackexchange.com/q/13869/40036 – josliber Dec 11 '20 at 19:39
  • Why are you using a random forest if the relationship between your variables is known to be linear? – Dave Dec 11 '20 at 19:58
  • @Dave At some points it might not yield a linear model. – Agi Dec 11 '20 at 20:29
  • @josliber I did, actually, but it doesn't give a straight answer on whether it is theoretically wrong to do it. – Agi Dec 11 '20 at 20:29

2 Answers

There is no advantage over using MSE or RMSE. Any model that has better $R^2$ than another model also will have better MSE and RMSE (assuming the same data). In that sense, all three are equivalent loss functions and measures of performance.
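The equivalence follows from $R^2 = 1 - \text{MSE}/\widehat{\operatorname{Var}}(y)$: for a fixed data set the variance of $y$ is a constant, so ranking models by $R^2$ is the same as ranking them by MSE (or RMSE). A minimal sketch of this, using made-up numbers and hypothetical helper names (`mse`, `rmse`, `r2`), not anything from the question's actual data:

```python
import math

def mse(y, y_hat):
    """Mean squared error; the formula is the same for any model."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

def rmse(y, y_hat):
    """Root mean squared error."""
    return math.sqrt(mse(y, y_hat))

def r2(y, y_hat):
    """R^2 = 1 - SS_res / SS_tot = 1 - MSE / (population variance of y)."""
    y_bar = sum(y) / len(y)
    var_y = sum((a - y_bar) ** 2 for a in y) / len(y)
    return 1.0 - mse(y, y_hat) / var_y

# Made-up observations and two made-up sets of predictions (assumptions).
y = [1.0, 2.0, 3.0, 4.0, 5.0]
pred_a = [1.1, 1.9, 3.2, 3.8, 5.1]
pred_b = [1.5, 2.5, 2.0, 4.5, 4.0]

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, mse(y, pred), rmse(y, pred), r2(y, pred))
```

Whichever set of predictions has the lower MSE also has the lower RMSE and the higher $R^2$, because all three are monotone transformations of the same sum of squared errors.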

A common reason for wanting to use $R^2$ over MSE or RMSE is the desire to say that $R^2=0.94$ means that $94\%$ of the variation in the data is explained by the model, and $94\%$ is an $\text{A}$ grade in school. This interpretation of $R^2$ fails for nonlinear models like random forest, as the residuals and predictions are not orthogonal. (That there is a linear relationship between your variables is not relevant to this point.)
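The orthogonality point can be checked numerically. In general $SS_{tot} = SS_{reg} + SS_{res} + 2\sum_i (y_i-\hat{y}_i)(\hat{y}_i-\bar{y})$, and the "fraction of variance explained" reading needs the cross term to vanish. The sketch below uses made-up data and, as a stand-in for a non-least-squares model, an OLS fit shrunk halfway toward the mean. That stand-in is an assumption chosen to keep the example dependency-free; it is not a random forest.

```python
def mean(v):
    return sum(v) / len(v)

def ols_fit(x, y):
    """Fitted values from closed-form simple linear regression with intercept."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * a for a in x]

def decomposition(y, y_hat):
    """Return (SS_tot, SS_reg, SS_res, cross) with
    SS_tot = SS_reg + SS_res + 2 * cross."""
    y_bar = mean(y)
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    ss_reg = sum((p - y_bar) ** 2 for p in y_hat)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    cross = sum((a - p) * (p - y_bar) for a, p in zip(y, y_hat))
    return ss_tot, ss_reg, ss_res, cross

# Made-up data (assumption).
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
y = [1.2, 1.9, 3.4, 3.9, 5.3, 5.8]

ols_pred = ols_fit(x, y)
y_bar = mean(y)
# Stand-in for a model that is not a least-squares fit: shrink toward the mean.
shrunk_pred = [y_bar + 0.5 * (p - y_bar) for p in ols_pred]

print("OLS:   ", decomposition(y, ols_pred))
print("Shrunk:", decomposition(y, shrunk_pred))
```

For the OLS fit the cross term is zero up to rounding, so $SS_{reg}/SS_{tot}$ and $1 - SS_{res}/SS_{tot}$ agree. For the shrunk predictions the cross term is nonzero, the two quantities differ, and $1 - SS_{res}/SS_{tot}$ can no longer be read as a share of variance.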

So you can use $R^2$ when MSE or RMSE would be viable loss functions, but I don’t see a reason to do so.

Dave
  • Are you trying to compare the RF performance to the linear regression $R^2$ value? – Dave Dec 11 '20 at 19:59
  • Yes. I want to compare the performance of different models with one another. My concern is that RF might build a non-linear model for some training sets (I'm using a rolling time window), which is why I wonder if it makes sense to use RF for non-linear models. – Agi Dec 11 '20 at 20:25
  • I do not follow your concerns. Perhaps post them as a new question. It looks like it will be interesting to discuss. – Dave Dec 11 '20 at 20:28
  • Could you provide the formula for MSE to be used for non-linear models? Is it $MSE=\frac{1}{n}\sum_{i=1}^n(y_i-\hat{y}_i)^2$ for all sorts of models? – Agi Dec 13 '20 at 18:15
  • @Agi The formula for MSE does not depend on the model, so what you wrote applies in general. – Dave Dec 13 '20 at 18:29

If you define $R^2$ as a ratio of log-likelihood differences (see this answer), then it is a perfectly valid choice. Under this framework, you can interpret $R^2$ as the fraction of explained deviance.
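As a rough numerical illustration, suppose the ratio takes the form $(\ell_{model} - \ell_{null})/(\ell_{sat} - \ell_{null})$ with a Gaussian likelihood and a common, fixed variance. Both the exact form of the ratio and the fixed variance are assumptions here, since the linked answer is not reproduced in this thread, and the function names are hypothetical. Under those assumptions the ratio reduces to the familiar $1 - SS_{res}/SS_{tot}$ for any vector of predictions, whatever model produced them.

```python
import math

def gaussian_loglik(y, mu, sigma2):
    """Log-likelihood of y under independent N(mu_i, sigma2)."""
    n = len(y)
    ss = sum((a - m) ** 2 for a, m in zip(y, mu))
    return -0.5 * n * math.log(2 * math.pi * sigma2) - ss / (2 * sigma2)

def likelihood_r2(y, y_hat, sigma2=1.0):
    """(ll_model - ll_null) / (ll_saturated - ll_null), assumed definition."""
    y_bar = sum(y) / len(y)
    ll_model = gaussian_loglik(y, y_hat, sigma2)
    ll_null = gaussian_loglik(y, [y_bar] * len(y), sigma2)  # intercept-only model
    ll_sat = gaussian_loglik(y, y, sigma2)                  # perfect fit
    return (ll_model - ll_null) / (ll_sat - ll_null)

def classic_r2(y, y_hat):
    """1 - SS_res / SS_tot."""
    y_bar = sum(y) / len(y)
    ss_res = sum((a - p) ** 2 for a, p in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Made-up observations and predictions (assumptions); the predictions could
# come from a random forest or any other model.
y = [1.0, 2.0, 3.0, 4.0, 5.0]
y_hat = [1.1, 1.9, 3.2, 3.8, 5.1]

print(likelihood_r2(y, y_hat), classic_r2(y, y_hat))
```

The fixed variance cancels in the ratio, so its value does not affect the result. What does matter is whether a Gaussian likelihood is a sensible description of the errors, which is the assumption the interpretation rests on.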

Firebug
  • Ben’s answer seems to be specific to GLMs, though I should keep it in mind if I want some kind of scaled performance metric for a logistic regression. – Dave Dec 11 '20 at 20:10
  • @Dave it's actually valid for any likelihood you choose, since it's simply a comparison of likelihoods. GLMs do, however, maximize that quantity directly, if the respective distribution conforms to the GLM framework. – Firebug Dec 11 '20 at 20:52
  • @Firebug Do non-linear models fit into the GLM category? I ask because the way it computes $R^2$ reduces to the traditional $R^2$ definition for linear models. – Agi Dec 11 '20 at 21:06
  • @Agi Your random forest is not a GLM, no. – Dave Dec 11 '20 at 23:11
  • @Dave So why did Firebug state that it is a valid choice? – Agi Dec 11 '20 at 23:22
  • @Agi because it is. The log-likelihood can be evaluated with any model (it's model agnostic). – Firebug Dec 12 '20 at 00:04
  • @Firebug So if I define $R^2$ as in the link you gave (which ultimately yields the same traditional $R^2$ definition), then I can use it for, say, RF, ARIMAX, VARX and even MRF models? – Agi Dec 12 '20 at 00:06
  • @Agi sure. But you'd need to discuss its value as the proportion of explained deviance based on a Gaussian likelihood. That's an assumption that you alone must ensure makes sense. – Firebug Dec 12 '20 at 00:35
  • @Firebug But the link that you provided is only for GLMs, and Dave said that models like RF don't fall into this category. I'm really confused! – Agi Dec 12 '20 at 01:43
  • @Agi you’re mixing up two questions. One is if random forest is a generalized linear model (GLM), and the answer is that it is not. The other is about how proportion of deviance explained works in general, and I will let Firebug address that. – Dave Dec 12 '20 at 03:55
  • That's correct @Dave, and as I said before to Agi, it makes sense to compare likelihoods even if they were not used in optimization. It's exactly the same thing as comparing the MSE (the Gaussian negative log-likelihood, up to constants) between random forests that were not trained to optimize it. – Firebug Dec 13 '20 at 14:49