Random forest is overfitting

Question

I am trying to use Random Forest Regression in scikit-learn. The problem is that I am getting a really high test error:

train MSE, 4.64, test MSE: 252.25.

This is how my data looks (blue: real data, green: predicted):

[Figure: Forest regression cleaned]

I am using 90% for training and 10% for test. This is the code I am using after trying several parameter combinations:

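from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from matplotlib.pyplot import plot

# X_train, y_train, X_test, y_test: the 90%/10% split of X, y described above
rf = RandomForestRegressor(n_estimators=10, max_features=2, max_depth=1000, min_samples_leaf=1, min_samples_split=2, n_jobs=-1)
rf.fit(X_train, y_train)  # fit on the training portion only
test_mse = mean_squared_error(y_test, rf.predict(X_test))
train_mse = mean_squared_error(y_train, rf.predict(X_train))

print("train MSE, %.4f, test MSE: %.4f" % (train_mse, test_mse))
plot(rf.predict(X))  # predictions over all observations
plot(y)              # real data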

What are possible strategies to improve my fitting? Is there something else I can do to extract the underlying model? It seems incredible to me that after so many repetitions of the same pattern the model behaves so badly with new data. Do I have any hope at all trying to fit this data?

Are you training this periodic function with the x axis as the input, and the y axis as the label for x <= 245, and then testing for x > 245? Or am I misinterpreting your plot? – rrenaud, Dec 12 '12 at 16:32
Kind of. Actually, the x axis is the observation index. In total there are 300 observations, so from 245 on it is test data not used for training the model. The input feature vector consists of integers, has shape (300,2), and closely resembles a linear function of the observation index, so I didn't add info about it in order not to overcomplicate the question. – elyase, Dec 12 '12 at 17:28
You might want to remove the cycle (seasonal part) from your data first (and the trend). – R. Prost, Mar 04 '18 at 07:35
Have you looked into time series analysis? It's not clear to me what's on your x-axis, but it seems periodic to me. Check here and let me know if this helps: https://www.otexts.org/fpp/7/5 – Bram Van Camp, Mar 04 '18 at 09:02

score 21 · Answer 1 · answered May 03 '13 at 11:12

I think you are using the wrong tool; if your whole X is equivalent to the index, you basically have some sampled function $f:\mathbb{R}\rightarrow\mathbb{R}$ and are trying to extrapolate it. Machine learning is all about interpolating history, so it is not surprising that it fails spectacularly in this case.

What you need is time series analysis (i.e. extracting the trend, analysing the spectrum and autoregressing or HMMing the rest) or physics (i.e. thinking about whether there is an ODE that may produce such output and trying to fit its parameters via conserved quantities).
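
The first two steps named above (extracting the trend and analysing the spectrum) might look roughly like the following sketch. It assumes a synthetic stand-in series, since the question's data are not available; the period of 25 and the slope are made-up values, and a straight-line fit is only one possible way of removing a trend.

```python
import numpy as np

# Synthetic stand-in for the training part of the series (an assumption):
# a repeating parabola of period 25 on top of a linear trend.
t = np.arange(250, dtype=float)
series = ((t % 25) - 12.0) ** 2 - 0.3 * t

# 1. Extract the trend with a straight-line fit and remove it.
slope, intercept = np.polyfit(t, series, deg=1)
detrended = series - (slope * t + intercept)

# 2. Analyse the spectrum: find the strongest non-constant frequency.
spectrum = np.abs(np.fft.rfft(detrended))
freqs = np.fft.rfftfreq(len(detrended), d=1.0)   # cycles per observation
peak = 1 + np.argmax(spectrum[1:])               # skip the zero-frequency bin
period = 1.0 / freqs[peak]

print("estimated slope: %.3f, dominant period: %.1f observations" % (slope, period))
```

Once the trend and the dominant period are known, what remains can be modelled separately, for instance by autoregression.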

Isn't machine learning about extracting generalizable models from the data? Once one has a set of models which interpolate the data well, we can choose the ones with better extrapolation/generalization properties using, for example, cross validation. Is there something wrong in my understanding? – elyase May 03 '13 at 12:40
Extrapolation is different from generalization -- imagine you are the subject of the following experiment: you see a screen and have a red and a green button. First, the screen shows a video of the room you're in, where another person pressed the green button for a cat, lion and tiger shown on the screen and then the red one for a wolf and dog, and this way gathered 5 delicious cookies. – May 03 '13 at 15:45
Now, the screen shows a bobcat; you perform proper, generalizable interpolation of the history, press the green button and get an electric shock instead of a cookie. Why has this happened? Because the solution is a cycle (g-g-g-r-r-r) and animal pictures are just a deception. You have done the same to your forest -- lured it into a dumb reproduction of your training set while hiding the real information. – May 03 '13 at 15:45
Good example, but I don't see it the way you do. In your example we have the following data: a target (`g` or `r`) and 2 features (`index` (temporal) and `animal`). From this data I could fit multiple models which give more or less weight to feature 1 or 2 (or equal weight to both). Cross validation (assuming enough data) should arrive at a model with feature 2 (animal) having less importance. I can see that my model is overfitting the data, but I still think that I should be able to extract a model which follows this pattern (because the behavior hasn't changed) with a large enough model space. – elyase May 03 '13 at 16:33
Nope; even if you ask for more data, the experimentalist can still extend the animal deception and further obfuscate the pattern to keep it from being obvious. I.e., extrapolation simply can't be done with learning because by definition it requires information that is not present in the training data -- so you must either make some assumptions or gather additional data so that the problem becomes interpolation. – May 04 '13 at 17:14
BTW there is a trick for converting time series extrapolation into interpolation, assuming some consistency of the series -- autoregression, i.e. learning a value $f_i$ from its history, $f_{i-1},\ldots,f_{i-H}$. If $f$ is not diverging (and if $f$ is, maybe $f'$ or $f''$ is not?), the set of histories is restricted and can be interpolated. – May 04 '13 at 17:30
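
A minimal sketch of the autoregression trick from the comment above, under stated assumptions: the series is a synthetic stand-in for the question's data, the history length `H = 20` is an arbitrary choice, and `make_lagged` is a hypothetical helper, not a library function. The forest never sees the observation index, only the last `H` values.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def make_lagged(series, H):
    """Hypothetical helper: rows (f[i-H], ..., f[i-1]) with target f[i]."""
    X = np.array([series[i - H:i] for i in range(H, len(series))])
    y = series[H:]
    return X, y

# Synthetic stand-in for the question's data: a repeating pattern of period 25.
t = np.arange(300)
series = ((t % 25) - 12.0) ** 2

H = 20          # history length (an assumption; it would need tuning)
n_train = 245   # chronological split, as in the question's plot
X_train, y_train = make_lagged(series[:n_train], H)

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Forecast recursively: each prediction is appended to the history
# and becomes an input for the next step.
history = list(series[n_train - H:n_train])
forecast = []
for _ in range(len(series) - n_train):
    nxt = rf.predict(np.array(history[-H:]).reshape(1, -1))[0]
    forecast.append(nxt)
    history.append(nxt)

forecast = np.array(forecast)
ar_mse = np.mean((forecast - series[n_train:]) ** 2)
print("autoregressive test MSE: %.2f" % ar_mse)
```

Because every test-time history resembles histories seen in training, the forest is interpolating again. On real, noisy data the recursive forecast accumulates error, so the horizon over which this works is limited.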
This has nothing to do with machine learning vs something else. Saying machine learning is interpolation of history is just FUD and giving a completely wrong impression to readers. There are plenty of tools which can be used to solve this problem which people would classify as "machine learning". – bayerj Jul 12 '13 at 15:21
Maybe you have a different definition, but in mine ML is a subset of modelling that only uses input data as a source of information about the process. If so, it can only interpolate because, as I showed before in this thread, reliable extrapolation requires additional assumptions. I don't know why you call this FUD, though; excluding a class of futile cases does not make ML any less spectacular. – Jul 13 '13 at 14:48

Daniel Mahler · Answer 2 · 2014-05-08T02:54:51.010

9

The biggest problem is that regression trees (and algorithms based on them, like random forests) predict piecewise constant functions, giving a constant value for all inputs falling under each leaf. This means that when extrapolating outside their training domain, they just predict the same value as they would for the nearest point at which they had training data. @mbq is correct that there are specialized tools for learning time series that would probably be better than general machine learning techniques. However, random forests are particularly bad for this example, and other general ML techniques would probably perform much better than what you are seeing. SVMs with nonlinear kernels are one option that comes to mind. Since your function has periodic structure, this also suggests working in the frequency domain, using Fourier components or wavelets.
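
The piecewise constant behaviour is easy to see on a toy problem. The following sketch assumes made-up data with a pure linear trend (not the question's data) and scikit-learn's `RandomForestRegressor`:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Toy data (an assumption): y = 2x on the training domain [0, 100).
X_train = np.arange(100, dtype=float).reshape(-1, 1)
y_train = 2.0 * X_train.ravel()

rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_train, y_train)

# Inputs beyond the training domain all end up in the right-most leaf
# of every tree, so they receive one and the same prediction.
X_new = np.array([[150.0], [200.0], [1000.0]])
pred_new = rf.predict(X_new)
pred_edge = rf.predict(np.array([[99.0]]))

print(pred_new)    # the same value three times
print(pred_edge)   # prediction at the edge of the training domain
```

The forest cannot follow the trend past the last training point: its predictions there never exceed the largest target value seen in training.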

edited May 08 '14 at 02:54

answered May 08 '14 at 02:22

Daniel Mahler

AFAIK SVMs have the same problem as random forests. They do not predict well outside of the space where they have been trained. A neural network would probably be a better solution. – Donbeo Aug 25 '14 at 22:26
If the data lie on a curve and the kernel is of the right kind to fit that curve, then an SVM will be able to extrapolate along that curve. E.g., if the data has a linear trend, then a linear SVM will fit the line and will extrapolate along that line. More complex kernels can fit and extrapolate more complex behaviours. It depends on having the right kind of kernel. That does not make SVMs the right tool for extrapolation and TS prediction, but it makes them better than random forests. – Daniel Mahler Feb 20 '17 at 02:13

score 3 · Answer 3 · edited Apr 13 '17 at 12:44

Some suggestions:

1. Tune your parameters using a rolling window approach (your model must be optimized to predict the next values in the time series, not to predict values among the ones supplied).
2. Try other models (even simpler ones, with the right feature selection and feature engineering strategies, might prove better suited to your problem).
3. Try to learn optimal transformations of the target variable (tune this too; there's a negative linear/exponential tendency that you may be able to estimate).
4. Try spectral analysis, perhaps.
5. The maxima/minima seem to be equally spaced. Learn where they are given your features (no operator input; make an algorithm discover it to remove bias) and add this as a feature. Also engineer a "nearest maximum" feature. Dunno, it might work, or perhaps not; you can only know if you test it :)
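
A sketch of the first suggestion, assuming scikit-learn's `TimeSeriesSplit` as the windowing scheme and a synthetic stand-in for the data. The "position in cycle" column is a hypothetical engineered feature in the spirit of the last suggestion, with the period of 25 simply assumed. Every fold trains on earlier observations and scores on later ones.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import TimeSeriesSplit

# Synthetic stand-in (an assumption): periodic pattern plus a negative trend.
t = np.arange(300)
X = np.column_stack([t, t % 25])   # observation index, hypothetical position-in-cycle feature
y = ((t % 25) - 12.0) ** 2 - 0.1 * t

tscv = TimeSeriesSplit(n_splits=5)
fold_mse = []
for train_idx, test_idx in tscv.split(X):
    # Training indices always precede the test indices within a fold.
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X[train_idx], y[train_idx])
    fold_mse.append(mean_squared_error(y[test_idx], model.predict(X[test_idx])))

print("forward-looking MSE per fold:", np.round(fold_mse, 2))
```

By default the training window grows from fold to fold; passing `max_train_size` to `TimeSeriesSplit` caps it, which gives a fixed-length rolling window. Hyper-parameters would then be chosen to minimize the average of `fold_mse`.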

answered Apr 25 '16 at 18:42

Firebug

But, as Daniel said in his answer, random forest won't work for this kind of problem by design, since it is not able to predict values outside of the range observed in the training sample. Tuning parameters etc. would lead nowhere. – Tim Mar 04 '18 at 12:30
Suggestion #2 @Tim. And Random Forests won't work naively on this data, but clever feature extraction might make it work. – Firebug Mar 04 '18 at 18:40

Vikram · Answer 4 · 2013-05-03T09:38:48.107

This is a textbook example of overfitting: the model does very well on the training data but collapses on any new test data. This is one of the strategies to address it: run a ten-fold cross validation on the training data to optimize the parameters.

Step 1. Create an MSE-minimizing function using Nelder-Mead (NM) optimization. An example can be seen here: http://glowingpython.blogspot.de/2011/05/curve-fitting-using-fmin.html

Step 2. Within this minimization function, the objective is to reduce the MSE. In order to do this, create a ten-fold split of the data where a new model is learned on 9 folds and tested on the 10th fold. This process is repeated ten times to obtain the MSE on each fold. The aggregated MSE is returned as the result of the objective.

Step 3. The fmin in Python will do the iterations for you. Check which hyper-parameters need to be fine-tuned (n_estimators, max_features etc.) and pass them to fmin.

The result will be the best hyper-parameters, which will reduce the possibility of overfitting.
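
The three steps could be sketched as follows. This is an illustration under assumptions, not the answerer's code: the data are synthetic, `fmin` is taken to be `scipy.optimize.fmin` (a Nelder-Mead implementation), the tuned hyper-parameters are `max_depth` and `min_samples_leaf`, and `to_hyperparams` and `cv_mse` are hypothetical helper names.

```python
import numpy as np
from scipy.optimize import fmin
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic training data (an assumption; substitute your own X_train, y_train).
rng = np.random.RandomState(0)
X_train = rng.uniform(0, 10, size=(120, 2))
y_train = np.sin(X_train[:, 0]) + 0.1 * rng.randn(120)

def to_hyperparams(params):
    """Map fmin's real-valued proposals to valid integer hyper-parameters."""
    max_depth = int(np.clip(np.rint(params[0]), 1, 30))
    min_samples_leaf = int(np.clip(np.rint(params[1]), 1, 20))
    return max_depth, min_samples_leaf

def cv_mse(params):
    """Step 2: mean MSE over a ten-fold split of the training data."""
    max_depth, min_samples_leaf = to_hyperparams(params)
    model = RandomForestRegressor(n_estimators=10, max_depth=max_depth,
                                  min_samples_leaf=min_samples_leaf,
                                  random_state=0)
    folds = KFold(n_splits=10, shuffle=True, random_state=0)
    scores = cross_val_score(model, X_train, y_train, cv=folds,
                             scoring="neg_mean_squared_error")
    return -scores.mean()

# Step 3: let fmin iterate. A wide initial simplex is supplied because the
# rounded objective does not change under small perturbations of the parameters.
simplex = np.array([[5.0, 5.0], [12.0, 5.0], [5.0, 1.0]])
best = fmin(cv_mse, x0=simplex[0], initial_simplex=simplex,
            maxiter=10, maxfun=30, disp=False)
best_depth, best_leaf = to_hyperparams(best)
print("max_depth=%d, min_samples_leaf=%d, CV MSE=%.4f"
      % (best_depth, best_leaf, cv_mse(best)))
```

Since tree hyper-parameters are integers, the objective is piecewise constant and a simplex search can stall; an exhaustive grid over a few candidate values is a common alternative with the same cross-validated objective.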

Yes, it appears to be overfitting (which Random Forest Regression normally doesn't do, hence the question). I have observed that changing the parameters has little effect with RF regressors. Cross validation requires an underlying model flexible enough to be optimized. Which kind of ML model/algorithm do you recommend for this kind of data? – elyase, May 03 '13 at 12:43

score 1 · Answer 5 · edited Jul 12 '13 at 15:28

This is an interesting problem. Your data suggests some regularity (periodic $x^2$-like functions) but has sharp peaks at the transitions. All this suggests a slightly complex model. I would model these data by a succession of $x^2$ functions parametrized by a coefficient and a displacement parameter.
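
One possible reading of this suggestion, as a sketch: a parabola repeated with a fixed period, fitted with `scipy.optimize.curve_fit`. The period `T`, the extra offset parameter `c`, the starting guess `p0` and the synthetic data are all assumptions made for illustration; `periodic_parabola` is a hypothetical name.

```python
import numpy as np
from scipy.optimize import curve_fit

T = 25.0  # period, assumed known here (for example read off the plot)

def periodic_parabola(x, a, d, c):
    """Succession of parabolas: coefficient a, displacement d, offset c."""
    phase = np.mod(x - d, T) - T / 2.0   # position within the cycle, centred on zero
    return a * phase ** 2 + c

# Synthetic stand-in for the data (an assumption), with a little noise.
rng = np.random.RandomState(0)
x = np.arange(300, dtype=float)
y = periodic_parabola(x, 1.0, 3.0, 0.0) + rng.normal(scale=2.0, size=x.size)

train = x < 245                 # chronological split, as in the question's plot
p0 = [0.5, 2.0, 1.0]            # rough starting guess (an assumption)
popt, _ = curve_fit(periodic_parabola, x[train], y[train], p0=p0)

train_mse_start = np.mean((periodic_parabola(x[train], *p0) - y[train]) ** 2)
train_mse_fit = np.mean((periodic_parabola(x[train], *popt) - y[train]) ** 2)
test_mse = np.mean((periodic_parabola(x[~train], *popt) - y[~train]) ** 2)
print("a=%.3f, d=%.3f, c=%.3f, test MSE=%.2f" % (popt[0], popt[1], popt[2], test_mse))
```

Because the model is a formula in `x`, it can be evaluated beyond the training range; whether it extrapolates well depends on the assumed period and shape actually holding there.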

answered Jul 12 '13 at 15:03

Vladislavs Dovgalecs

score 0 · Answer 6 · answered Apr 03 '17 at 14:58

After reading the posts above, I want to give a different answer.

Tree-based models, such as random forests, can't extrapolate beyond the values in the training set. So I don't think it is an overfitting problem, but a wrong modeling strategy.

So, what can we do for time series prediction with tree models?

One possible way is to combine them with linear regression: first, detrend the time series (or model the trend with linear regression), then model the residuals with trees (the residuals are bounded, so tree models can handle them).

Besides, there is a tree model combined with linear regression that can extrapolate, called Cubist; it fits a linear regression in each leaf.
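
The detrend-then-trees idea can be sketched like this, assuming a synthetic series (linear trend plus a cycle of period 25) and a hypothetical "position in cycle" feature for the forest; a forest trained directly on the time index is included as a baseline.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic series (an assumption): linear trend plus a cycle of period 25.
t = np.arange(300, dtype=float)
y = 0.5 * t + 10.0 * np.sin(2 * np.pi * t / 25)

n_train = 245
X_time = t.reshape(-1, 1)           # input of the trend model: time index
X_cycle = (t % 25).reshape(-1, 1)   # hypothetical bounded feature for the trees

# 1. Model the trend with linear regression.
trend = LinearRegression().fit(X_time[:n_train], y[:n_train])
residual = y[:n_train] - trend.predict(X_time[:n_train])

# 2. Model the bounded residual with trees.
rf = RandomForestRegressor(n_estimators=50, random_state=0)
rf.fit(X_cycle[:n_train], residual)

# 3. Forecast = extrapolated trend + predicted residual.
pred = trend.predict(X_time[n_train:]) + rf.predict(X_cycle[n_train:])
combined_mse = np.mean((pred - y[n_train:]) ** 2)

# Baseline: a forest alone on the time index cannot follow the trend.
rf_only = RandomForestRegressor(n_estimators=50, random_state=0)
rf_only.fit(X_time[:n_train], y[:n_train])
rf_only_mse = np.mean((rf_only.predict(X_time[n_train:]) - y[n_train:]) ** 2)

print("trend + forest MSE: %.2f, forest only MSE: %.2f" % (combined_mse, rf_only_mse))
```

The linear model carries the extrapolation, and the forest only has to reproduce values it has already seen during training.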

score 0 · Answer 7 · answered Mar 04 '18 at 07:23

If you simply want to predict within the bounds of the graph, then randomizing the observations before splitting the data set should solve the problem. It then becomes an interpolation problem instead of the extrapolation problem shown.
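
A small sketch of the difference between the two splits, assuming synthetic data and scikit-learn's `train_test_split`; `fit_and_score` is a hypothetical helper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in (an assumption): periodic pattern with a trend,
# with the observation index as the only feature.
t = np.arange(300, dtype=float)
X = t.reshape(-1, 1)
y = ((t % 25) - 12.0) ** 2 - 0.5 * t

def fit_and_score(X_tr, X_te, y_tr, y_te):
    """Hypothetical helper: fit a forest and return its test MSE."""
    model = RandomForestRegressor(n_estimators=50, random_state=0)
    model.fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

# Chronological split: the last 10% is held out (extrapolation).
chrono_mse = fit_and_score(*train_test_split(X, y, test_size=0.1, shuffle=False))

# Randomized split: held-out points lie between training points (interpolation).
shuffled_mse = fit_and_score(
    *train_test_split(X, y, test_size=0.1, shuffle=True, random_state=0))

print("chronological MSE: %.2f, shuffled MSE: %.2f" % (chrono_mse, shuffled_mse))
```

The shuffled score only says how well the model fills gaps inside the observed range; it says nothing about forecasting future observations.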

Deepon GhoseRoy
