
I'm specifically referring to Random Forest regression.

The first statistics printed after running a random forest regression in R (randomForest package, randomForest:::print.randomForest) are the Mean of squared residuals and the % Var explained.

When tuning the model, a "good" change usually yields both a lower Mean of squared residuals and a higher % Var explained: is this always the case? (I'm aware that randomForest reports the variation rather than the variance explained, as discussed here -> Manually calculated $R^2$ doesn't match up with randomForest() $R^2$ for testing new data.)

If it is not, should I prefer a lower Mean of squared residuals or a higher % Var explained?
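For reference, a minimal reproducible example of the output in question (the mtcars data and settings are arbitrary, chosen only for illustration):

```r
library(randomForest)

set.seed(42)  # make the fit reproducible
fit <- randomForest(mpg ~ ., data = mtcars, ntree = 500)
print(fit)    # prints "Mean of squared residuals" and "% Var explained"
```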


1 Answer


Suppose the observed outputs are $Y = \{y_1, \dots, y_n\}$ with mean $\bar{y}$, and the predicted outputs are $\{\hat{y}_1, \dots, \hat{y}_n\}$. The mean squared error (MSE) is the mean of the squared residuals:

$$MSE = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2$$
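In R this is a one-liner (assuming `y` and `y_hat` are numeric vectors of observed and predicted values; the names are illustrative):

```r
mse <- mean((y - y_hat)^2)  # mean of squared residuals
```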

The fraction of variance explained is defined as:

$$R^2 = 1 - \frac{\sum_{i=1}^n (y_i - \hat{y}_i)^2}{\sum_{i=1}^n (y_i - \bar{y})^2}$$
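Computed directly from the definition (same illustrative `y` and `y_hat` as above):

```r
r2 <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)  # fraction of variance explained
```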

Notice that, after dividing the numerator and denominator by $n$, the fraction above equals the mean squared error divided by the (population) variance of $Y$:

$$R^2 = 1 - \frac{MSE}{Var(Y)}$$

Within the context of a particular set of $Y$ values, the variance of $Y$ is a constant, so $MSE$ and $R^2$ have a fixed relationship. A decrease in MSE implies an increase in $R^2$ and vice versa. Consequently, minimizing $MSE$ and maximizing $R^2$ are equivalent.
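A quick numerical sketch of this equivalence on toy data (note the identity uses the population variance with a $1/n$ denominator, not R's `var()`, which divides by $n - 1$):

```r
set.seed(1)
y     <- rnorm(100)
y_hat <- y + rnorm(100, sd = 0.5)  # toy "predictions"

mse   <- mean((y - y_hat)^2)
var_y <- mean((y - mean(y))^2)     # population variance of Y

r2_def <- 1 - sum((y - y_hat)^2) / sum((y - mean(y))^2)
r2_mse <- 1 - mse / var_y

all.equal(r2_def, r2_mse)  # TRUE: the two expressions agree
```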
