
I am currently writing my master's thesis about random forests and just started to work with the R software. When I am running my model the output looks like this:

Mean of squared residuals: 0.0002441535
% Var explained: -8.82

Can anyone explain me why I get a negative $R^2$? I always thought that a negative $R^2$ is not possible.

Sycorax
Alex_cgn
  • Related: http://stats.stackexchange.com/questions/7357/manually-calculated-r2-doesnt-match-up-with-randomforest-r2-for-testing – WetlabStudent Apr 03 '15 at 16:24
  • Thank you @MHH. Then I hope I can find a way to improve my model and get a positive "% Var explained"... – Alex_cgn Apr 03 '15 at 17:16

1 Answer


Explained variance is here defined as R² = 1 − SSresidual / SStotal = 1 − sum((ŷ−y)²) / sum((y−mean(y))²) = 1 − mse / var(y).

It is correct that the squared Pearson product-moment correlation cannot be negative, but this pseudo R² is a different quantity and can be.

The documentation of the randomForest function states in its Value section: rsq (regression only) "pseudo R-squared": 1 − mse / Var(y).

A simple interpretation of this negative R² is that you would have been better off simply predicting every sample as the grand mean; the model therefore performs poorly.
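This can be seen directly from the formula above. A minimal sketch in plain R (no packages; the y and ŷ values are made up for illustration): predictions that deviate from the observations by more than the grand mean does yield mse > var(y), hence a negative pseudo R².

    y    <- c(1, 2, 3, 4, 5)
    yhat <- c(3, 1, 5, 2, 4)  # hypothetical, badly shuffled predictions

    mse <- mean((y - yhat)^2)                 # 2.8
    r2  <- 1 - mse / mean((y - mean(y))^2)    # 1 - 2.8/2 = -0.4
    r2

Here the mean squared deviation of y from its mean is 2, while the model's mse is 2.8, so the "model" explains less variance than a constant prediction at the mean would.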

The predictions on the training set, RF$predicted, are out-of-bag (OOB) cross-validated, and any R² or other performance measure should likewise be computed from them.

library(randomForest)

obs  <- 500
vars <- 100

# pure noise: the predictors are unrelated to the response
X <- replicate(vars, factor(sample(1:5, obs, replace = TRUE)))
y <- rnorm(obs)

RF <- randomForest(X, y)

# % Var explained as printed by randomForest (based on OOB predictions)
print(RF)
cat("% Var explained:\n",
    100 * (1 - sum((RF$y - RF$predicted)^2) /
               sum((RF$y - mean(RF$y))^2)))

# squared Pearson correlation R² (cannot be negative)
cat("% Pearson cor:\n", 100 * cor(RF$y, RF$predicted)^2)
# squared Spearman correlation R² (cannot be negative)
cat("% Spearman cor:\n", 100 * cor(RF$y, RF$predicted, method = "spearman")^2)