I understand that in order to have a good, stable model the $R^2$ has to be high and the RMSE must be low (depending on the type of variables examined). There are many questions about the "best RMSE", but my question is about something different.

Are there any possible causes for high $R^2$ values AND high RMSE values? How can I find out more about this so that I can explain the results?

Ferdi
g.f.l

1 Answer

The two are directly related:

$R^2 = 1-\frac{\sum(y_i-\hat{y}_i)^2}{\sum(y_i-\bar{y})^2}$

$RMSE = \sqrt{\frac{\sum(y_i-\hat{y}_i)^2}{n-k}}$

so

$R^2 = 1-\frac{RMSE^2\times(n-k)}{\sum(y_i-\bar{y})^2}$
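
Spelling out the substitution (writing $\hat{y}_i$ for the fitted value of observation $i$): squaring the RMSE and multiplying by $n-k$ recovers the residual sum of squares,

$RMSE^2\times(n-k) = \sum(y_i-\hat{y}_i)^2$

which is exactly the numerator of the fraction in the definition of $R^2$.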

Now the unit of the RMSE is the unit of the dependent variable, while the $R^2$ is a unitless proportion. So numerically you can change the RMSE arbitrarily while keeping the $R^2$ constant by changing the unit of the dependent variable. Such a change has no substantive meaning: you can say that something weighs 1000 grams or 1 kg; the numbers are different but the meaning is exactly the same.
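
A minimal sketch of the unit argument, using simulated data rather than anything from the question. The helper `fit_metrics`, the variable names, and the choice of a straight-line model are all illustrative assumptions; $k$ is taken here to be the number of estimated coefficients including the intercept, which is one common reading of the $n-k$ denominator.

```python
import numpy as np


def fit_metrics(x, y):
    """Fit y = a + b*x by least squares; return (R^2, RMSE with n-k denominator)."""
    # Design matrix with an intercept column, so k = 2 estimated coefficients (assumption).
    X = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sse = np.sum(resid ** 2)              # residual sum of squares
    sst = np.sum((y - y.mean()) ** 2)     # total sum of squares
    n, k = X.shape
    r2 = 1 - sse / sst
    rmse = np.sqrt(sse / (n - k))
    return r2, rmse


# Simulated data: a hypothetical weight outcome measured in kg.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
weight_kg = 50 + 3 * x + rng.normal(0, 2, size=200)
weight_g = 1000 * weight_kg               # same measurements expressed in grams

r2_kg, rmse_kg = fit_metrics(x, weight_kg)
r2_g, rmse_g = fit_metrics(x, weight_g)

print(f"kg: R^2 = {r2_kg:.4f}, RMSE = {rmse_kg:.4f}")
print(f"g : R^2 = {r2_g:.4f}, RMSE = {rmse_g:.4f}")
```

The two printed $R^2$ values should agree up to floating-point error, while the RMSE in grams should be 1000 times the RMSE in kg. So a "high" RMSE next to a high $R^2$ can simply reflect the scale of the dependent variable.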

Maarten Buis
  • I think you forgot to take mean instead of sum. – Richard Hardy Feb 23 '17 at 09:58
  • @Maarten Buis thank you for your response. You're right about this, but since there is a clear relation, I cannot understand what causes a result like R2=0.80 and RMSE=200 (both high). Would something like R2=0.80 and RMSE=0.5 be reasonable? I just cannot see any possible reasons. – g.f.l Feb 23 '17 at 10:02
  • @g.f.l You cannot say if 200 for a RMSE is high or low, that crucially depends on the unit of the dependent variable. – Maarten Buis Feb 23 '17 at 10:06
  • @MaartenBuis exactly, I said that in my question (depending on the type of variables examined). But if it is high, is there something I can read to explain this and find possible causes? I only find something like http://stats.stackexchange.com/questions/56302/what-are-good-rmse-values – g.f.l Feb 23 '17 at 10:11
  • Maybe an example helps? Let's say your dependent variable is weight, and you measure it once in grams and once in kg. The RMSE in your first model will be 1000 times the RMSE in your second model. So an RMSE of 200 in the first model will correspond to an RMSE of 0.2 in the second. Now I don't think there is a problem here that needs explaining other than that 1 kg = 1000 g. – Maarten Buis Feb 23 '17 at 10:18
  • @MaartenBuis (my English is not very good, sorry), I see what you mean, but for example, if according to papers (e.g. https://bcal.boisestate.edu/docs/CJRS_Cheng_Glenn_2008.pdf) the RMSE **for this variable** is between 0.7 and 1.5 and I get RMSE=100, then something is wrong. – g.f.l Feb 23 '17 at 10:35
  • Now I also wonder if the denominator should be $n-k$ or $n$. I understand $k$ serves as a degrees-of-freedom correction, but is this how RMSE is typically defined? This is of little conceptual importance, but still... – Richard Hardy Feb 23 '17 at 12:10
  • @RichardHardy In statistics "typical" surprisingly often differs from discipline to discipline, but it is how it is defined in mine (sociology). – Maarten Buis Feb 23 '17 at 12:12
  • @g.f.l Either you are not using the same model or you are using different units for the dependent variable. For such a large difference, it is almost certain that the latter is the case. – Maarten Buis Feb 23 '17 at 12:14
  • In all likelihood the height was in cm in your raw data, while in the analysis reported in the article it was rescaled to meters. – Maarten Buis Feb 23 '17 at 12:48