
I've read some literature claiming that random forests can't overfit. While this sounds great, it seems too good to be true. Is it possible for RFs to overfit?

screechOwl
    If it can fit, it can overfit. In terms of RF, think about what happens if your forest doesn't contain enough trees (say your forest is a single tree to make the effect obvious). There are more issues than this one, but this is the most obvious. – Marc Claesen Aug 25 '14 at 16:54
  • I've just responded to another thread on RF that it could easily overfit if the number of predictors is large. – horaceT Jun 11 '16 at 04:07
  • "Can't" is a very dangerous word. It takes a lot of abuse or a bit of unluck to make it happen but it absolutely can happen. The RF is somewhat more abuse-resistant than other methods, but no method is perfect. Too short, too tall, to fat, too skinny, ... it feels like a zefrankism. – EngrStudent Jul 16 '20 at 15:09
  • This question would benefit from some context. Where did you find the claim that random forest cannot overfit? Can you [edit] to include a quotation that makes the claim & its citation? – Sycorax Feb 22 '22 at 03:48

3 Answers


Random forests can overfit. I am sure of this. What is usually meant is that using more trees will not cause the model to overfit.

Try, for example, to estimate the model $y = \log(x) + \epsilon$ with a random forest. You will get an almost zero training error but a bad prediction error.
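
For concreteness, here is a minimal sketch of that experiment, assuming scikit-learn and NumPy (the sample size and noise level are arbitrary choices for illustration):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

def make_data(n=500):
    # data from y = log(x) + eps, with Gaussian noise of sd 0.5
    x = rng.uniform(1.0, 100.0, size=(n, 1))
    y = np.log(x).ravel() + rng.normal(scale=0.5, size=n)
    return x, y

x_train, y_train = make_data()
x_test, y_test = make_data()

# Fully grown trees (the scikit-learn default) memorise much of the training noise.
rf = RandomForestRegressor(n_estimators=500, random_state=0).fit(x_train, y_train)

print("train MSE:", mean_squared_error(y_train, rf.predict(x_train)))  # far below the noise variance of 0.25
print("test  MSE:", mean_squared_error(y_test, rf.predict(x_test)))    # at best around the noise variance
```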

Donbeo
  • Random Forests principally reduce variance, so how can they overfit? @Donbeo could it perhaps be because decision tree models do not perform well on extrapolation? Say, for an anomalous predictor value, a DT could give a bad prediction. – Itachi Jun 22 '17 at 18:34
  • One clear indication of overfitting is that the residual variance is reduced *too much.* What, then, are you trying to imply with your first remark? – whuber Jun 22 '17 at 18:47
  • In the bias-variance trade-off, when we try to reduce bias, we pay for it with variance. For example, if x = 80 gives y = 100 but x = 81 gives y = -100, that would be **overfitting**. Isn't overfitting the same as having high variance? @whuber I assumed overfitting is only caused by high variance. I do not understand how reducing residual variance results in overfitting. Can you please share a paper for me to read on this? – Itachi Jun 23 '17 at 11:23
  • This doesn't require any paper! You can try it yourself. Take a small simple bivariate dataset, such as $x_i=1,2,\ldots,10$ and *any* collection of corresponding $y_i$ you care to produce. Using least squares (because this aims to reduce the variance of the residuals), fit the series of models $y=\beta_0+\beta_1 x+\beta_2 x^2 + \cdots + \beta_k x^k$ for $k=0, 1, \ldots, 9$. Each step will reduce the variance until at the last step the variance is zero. At some point, almost anyone will agree, the models have begun to overfit the data (see the sketch after this comment thread). – whuber Jun 23 '17 at 13:11
  • @whuber I think you're missing the point on what "variance reduction" is. Random Forest (and bagging in general) do not reduce the variance of the residuals, but the variance of your predictions. So in your example, each step you talk about INCREASES variance :) – Davide ND Jan 30 '20 at 10:34
  • The only reason a Random Forest overfits is if the majority of its trees are overfit "in the same way". This can easily happen if the dataset is quite easy and your trees deep. – Davide ND Jan 30 '20 at 10:35
  • @Davide Your remark shows I should have explicitly stated I was offering my example not as a statement about random forests, but about the underlying concepts of variance reduction and overfitting. But your first comment is opaque because it is irrelevant (and, as I read it, is incorrect): the residual variance matters in this sequence of OLS models, not the prediction variance. Indeed--returning to the general question of fitting models--if reducing variance of the predictions were the objective, then any model that always predicts zero would be optimal! – whuber Jan 30 '20 at 13:26
  • Sorry, my answer was related to the sentence: "one clear indication of overfitting is that the residual variance is reduced too much" as an answer to "Random Forest principally reduces variance". The point was that these are two different variances. But maybe I misinterpreted to who your comment was referring :) – Davide ND Jan 30 '20 at 13:33
  • @whuber however I still find the point you're trying to make hard to interpret: bagging and RF in general want to reduce the variance of the model (which is linked with that of the predictions) - what does the residual variance of OLS have to do with this? – Davide ND Jan 30 '20 at 16:13
  • @Davide You seem to have a strange concept of "variance of the model." Virtually all modeling methods attempt to reduce the variability of the *residuals,* which in OLS is assessed by their variance. – whuber Jan 30 '20 at 16:55
  • @whuber the Variance of the Model (as opposed to the Bias) is technically the variance of the parameters of the Model. Nothing to do with the residuals. The more parameters your model has, the more it's complex and its variance INCREASES. – Davide ND Jan 30 '20 at 16:58
  • @whuber the goal is still to reduce residuals variance ofc, but when we talk about variance reduction in a Bagging setting it's not what we are referring to – Davide ND Jan 30 '20 at 16:59
  • @Davide Thank you for clarifying your meaning of "variance of the model," because it's an unusual one and doesn't apply to the example I gave. – whuber Jan 30 '20 at 17:04
  • @whuber it's actually very common in this setting, and it's what OP and the first comment were referring to. And that's why I could not understand the goal of your example.. https://en.m.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff – Davide ND Jan 30 '20 at 17:10
  • @Davide I seem to be unable to explain this misconception to you, because the OP clearly was *not* referring to your interpretation. Overfitting is not measured in terms of parameter variance. – whuber Jan 30 '20 at 18:26
  • @whuber I'm not going to discuss this any further cause it seems pointless. Please, read more about variance reduction methods, bagging, RFs. Random Forests are simply a variance reduction method for Decision Trees - that's not up for discussion - and this means that they reduce the variance of Decision Trees. It does not even make sense to talk about variance of residuals, cause they were originally developed for classification. – Davide ND Jan 31 '20 at 08:47
  • @Davide You will persuade neither me, nor anyone else, merely by claiming you're right and it's not up for discussion. But if you could offer an accessible, authoritative reference in support of your interpretation, we all could learn from you. In fact, I don't think this is an issue of right or wrong, but rather one of interpretation, definitions, and perhaps even the use of English. – whuber Jan 31 '20 at 16:12
  • @whuber "Bagging or bootstrap aggregation (section 8.7) is a technique for reducing the variance of an estimated prediction function. Bagging seems to work especially well for high-variance, low-bias procedures, such as trees." Elements of statistical learning, section 15.1 – Davide ND Jan 31 '20 at 16:22
  • Let us [continue this discussion in chat](https://chat.stackexchange.com/rooms/103939/discussion-between-davide-nd-and-whuber). – Davide ND Jan 31 '20 at 16:24
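
whuber's polynomial illustration in the comments above takes only a few lines to try yourself; this is a sketch assuming NumPy, with arbitrary $y$ values as the comment suggests:

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)
x = np.arange(1, 11, dtype=float)      # x_i = 1, 2, ..., 10
y = rng.normal(size=x.size)            # any y values will do

for k in range(10):                    # polynomial degree k = 0, 1, ..., 9
    p = Polynomial.fit(x, y, deg=k)    # least-squares fit of degree k
    residuals = y - p(x)
    print(f"degree {k}: residual variance = {residuals.var():.6f}")
# The residual variance never increases and hits (numerically) zero at k = 9,
# yet almost anyone would agree the high-degree fits have begun to overfit.
```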

I will try to give a more thorough answer building on Donbeo's answer and Itachi's comment.

Can Random Forests overfit?
In short, yes, they can.

Why is there a common misconception that Random Forests cannot overfit?
The reason is that, from the outside, the training of Random Forests looks similar to that of other iterative methods such as Gradient Boosted Machines or Neural Networks.
Most of these other iterative methods, however, reduce the model's bias over the iterations, as they make the model more complex (GBM) or more suited to the training data (NN). It is therefore common knowledge that these methods suffer from overtraining and will overfit the training data if trained for too long, since bias reduction comes with an increase in variance.
Random Forests, on the other hand, simply average trees over the iterations, reducing the model's variance instead while leaving the bias unchanged. This means that they do not suffer from overtraining, and indeed adding more trees (therefore training longer) cannot be a source of overfitting. This is where they get their non-overfitting reputation from!
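
One quick way to see this last point is to watch the test error as trees are added: it levels off instead of climbing back up. A minimal sketch, assuming scikit-learn and a synthetic dataset chosen only for illustration:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=600, n_features=20, noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for n_trees in (1, 10, 50, 200, 1000):
    rf = RandomForestRegressor(n_estimators=n_trees, random_state=0).fit(X_tr, y_tr)
    print(f"{n_trees:4d} trees -> test MSE: {mean_squared_error(y_te, rf.predict(X_te)):.1f}")
# The test error drops quickly and then levels off; adding more trees does not push it
# back up in any systematic way, which is the (limited) sense in which
# "more trees cannot overfit".
```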

Then how can they overfit?
Random Forests are usually built from high-variance, low-bias fully grown decision trees, and their strength comes from the variance reduction achieved by averaging these trees. However, if the predictions of the trees are too close to each other, then the variance-reduction effect is limited and the forest might end up overfitting.
This can happen, for example, if the dataset is relatively simple, so that the fully grown trees learn its patterns perfectly and predict very similarly. A high value of mtry, the number of features considered at every split, also makes the trees more correlated, which limits the variance reduction and might cause some overfitting
(it is important to know that a high value of mtry can still be very useful in many situations, as it makes the model more robust to noisy features).
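
The effect of mtry on how similarly the trees predict can be checked directly, for instance by looking at the average correlation between the individual trees' test-set predictions. A sketch assuming scikit-learn, where mtry corresponds to the max_features parameter and the dataset is synthetic:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=20, noise=5.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=1)

for max_features in (0.2, 0.5, 1.0):   # fraction of features tried at each split (mtry)
    rf = RandomForestRegressor(n_estimators=200, max_features=max_features,
                               random_state=1).fit(X_tr, y_tr)
    per_tree_preds = np.array([tree.predict(X_te) for tree in rf.estimators_])
    corr = np.corrcoef(per_tree_preds)                       # tree-by-tree correlation matrix
    avg_corr = corr[np.triu_indices_from(corr, k=1)].mean()  # average off-diagonal correlation
    print(f"max_features={max_features}: average correlation between trees = {avg_corr:.2f}")
# Higher mtry makes the trees more alike, so averaging them removes less variance.
```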

Can I fix this overfitting?
As always, more data helps.
Limiting the depth of the trees has also been shown to help in this situation, as has reducing the number of features selected at each split, which keeps the trees as uncorrelated as possible.
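
In scikit-learn terms these two knobs are max_depth and max_features; here is a rough sketch of the comparison, with illustrative settings rather than recommendations:

```python
from sklearn.datasets import make_friedman1
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_friedman1(n_samples=400, n_features=10, noise=1.0, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=2)

settings = {
    "fully grown, all features":      {},
    "max_depth=6, max_features=0.33": {"max_depth": 6, "max_features": 0.33},
}
for name, params in settings.items():
    rf = RandomForestRegressor(n_estimators=300, random_state=2, **params).fit(X_tr, y_tr)
    print(f"{name}: "
          f"train MSE {mean_squared_error(y_tr, rf.predict(X_tr)):.2f}, "
          f"test MSE {mean_squared_error(y_te, rf.predict(X_te)):.2f}")
# The restricted forest gives up some training fit and narrows the train/test gap;
# whether the test error itself improves depends on the dataset and the noise level.
```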

For reference, I really suggest reading the relevant chapter of Elements of Statistical Learning, which I think gives a very detailed analysis and dives deeper into the math behind it.

Davide ND

Hastie et al. address this question very briefly in Elements of Statistical Learning (page 596).

Another claim is that random forests “cannot overfit” the data. It is certainly true that increasing $\mathcal{B}$ [the number of trees in the ensemble] does not cause the random forest sequence to overfit... However, this limit can overfit the data; the average of fully grown trees can result in too rich a model, and incur unnecessary variance. Segal (2004) demonstrates small gains in performance by controlling the depths of the individual trees grown in random forests. Our experience is that using full-grown trees seldom costs much, and results in one less tuning parameter.

Sycorax