1

I am building a simple random forest to predict soccer results in scikit-learn. I simply train the model to predict each team's score based on various features. However, I am trying to work out how I can weight the data so that more recent fixtures count for more than historic fixtures.

Any ideas?

Hidden Markov Model
Marcus
  • Related: [Weighting more recent data in Random Forest model](http://stats.stackexchange.com/q/83104/1352) and [Handling case weight in the Random Forest packages in R](http://stats.stackexchange.com/q/166424/1352) – Stephan Kolassa Jan 28 '16 at 20:44

2 Answers

3

I agree with user Hidden Markov Model when the underlying phenomenon generating the time series is constant. On the other hand, if the dynamics of wins and losses change as new football tactics appear, then very old observations are less representative of tomorrow than recent ones.

As S. Kolassa points out, time series cannot be plugged directly into RF or any other supervised regression method.

Typically for RF, a time series is handled with a rolling window that generates learning examples of how some past events ($X_{t-1}$ to $X_{t-k}$) coincided with some future outcome $X_t$. The model is free to up-weight any of the $k$ recent periods within a window. By default, however, the regression model does not up-weight the learning examples/windows where $t$ is closest to the present day. One can help RF up-weight recent examples by stratification: e.g., for each tree, bootstrap (with replacement) 200 learning examples from the last 200 periods of $t$ plus 200 learning examples from the last 1000 periods of $t$.
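The rolling window and the stratified bootstrap can be sketched roughly as follows. This is only an illustration under assumptions, not a built-in scikit-learn feature: the helper names `make_windows`, `fit_recency_stratified_forest`, and `predict_forest` are hypothetical, and the "forest" is assembled by hand from `DecisionTreeRegressor` so that each tree's sample can be drawn per stratum. Rows are assumed to be ordered from oldest to newest.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def make_windows(series, k):
    """Build rows (X_{t-k}, ..., X_{t-1}) with target X_t from a 1-D series."""
    series = np.asarray(series, dtype=float)
    n = len(series)
    X = np.column_stack([series[i:n - k + i] for i in range(k)])
    y = series[k:]
    return X, y


def fit_recency_stratified_forest(X, y, n_trees=50, n_recent=200,
                                  n_older=1000, n_draw=200, seed=0):
    """Hand-rolled forest: every tree sees n_draw rows drawn from the last
    n_recent rows plus n_draw rows drawn from the last n_older rows."""
    rng = np.random.default_rng(seed)
    n = len(y)
    recent = np.arange(max(0, n - n_recent), n)  # most recent stratum
    older = np.arange(max(0, n - n_older), n)    # wider stratum (includes recent)
    trees = []
    for _ in range(n_trees):
        idx = np.concatenate([
            rng.choice(recent, size=n_draw, replace=True),
            rng.choice(older, size=n_draw, replace=True),
        ])
        tree = DecisionTreeRegressor(
            max_features="sqrt",  # random feature subsets, as in a forest
            random_state=int(rng.integers(2**31 - 1)),
        )
        trees.append(tree.fit(X[idx], y[idx]))
    return trees


def predict_forest(trees, X):
    """Average the per-tree predictions."""
    return np.mean([t.predict(X) for t in trees], axis=0)
```

Because the recent stratum is also contained in the wider one, recent windows are drawn more often than old ones, which is the intended up-weighting.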

Thus, when the underlying system may be transient, it makes good sense to down-sample learning examples/windows from the distant past.
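A related option, which is weighting rather than down-sampling, is to pass per-row weights to scikit-learn through the `sample_weight` argument of `fit`. The sketch below is a minimal example on synthetic data; the exponential decay, the `half_life` value, and the helper name `recency_weights` are assumptions chosen for illustration and would need tuning on real fixtures.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor


def recency_weights(age, half_life):
    """Weight 1.0 at age 0, halving every `half_life` units of age."""
    age = np.asarray(age, dtype=float)
    return 0.5 ** (age / half_life)


# Synthetic stand-in for fixture features and scores (not real data).
rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 4))
y = X[:, 0] + rng.normal(scale=0.1, size=n)

# Rows are assumed ordered oldest to newest, so the last row has age 0.
age = np.arange(n)[::-1]
w = recency_weights(age, half_life=100)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X, y, sample_weight=w)  # older rows count for less in each tree
```

With real data, `age` could be the number of days (or match rounds) between each fixture and the most recent one.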

If your system of interest is both transient and noisy you're in trouble.

2

The random forest should pick this up automatically from the data itself. If data closer to the present has a stronger effect on your predicted y than earlier lags, then the coefficients will be larger for more recent lags and smaller for earlier lags. The regression coefficients, i.e., "weights", should be wholly determined by the data. If you do want to try a weighted time series, you can use Holt-Winters triple exponential smoothing on your score variable and then see whether a regression model with additional lagged features beats the time series model.

Hidden Markov Model
  • How would the RF know about lags? – Stephan Kolassa Jan 28 '16 at 20:43
  • How would it not? Random forest is just like any other regression model - it generates an answer using however many lags of variables that you feed into it. – Hidden Markov Model Jan 28 '16 at 20:44
  • I don't see how Holt-Winters - or indeed *any* classical time series algorithm - would help with predictions based on *features*. – Stephan Kolassa Jan 28 '16 at 20:45
  • I guess you never heard of ARIMAX models then .... – Hidden Markov Model Jan 28 '16 at 20:45
  • Uh, no. You would at least need to transform your data and feed the lags in in some way. RFs don't automatically transform input data, at least not the implementations I'm familiar with. Maybe you could edit your answer to give an example? – Stephan Kolassa Jan 28 '16 at 20:46
  • I do, in fact, think I have some slight idea about causal forecasting. For instance, I happen to be quite certain that ARIMAX and Holt-Winters are two different things, and that ARIMAX will not be able to deal with larger numbers of features. Maybe you would like to be a little more specific? – Stephan Kolassa Jan 28 '16 at 20:49
  • The author mentions "time series" and "features" which therefore implies lags. RF do not require data transformations. I use Random Forests in my current work for revenue prediction. I assure you that they are very good. Please read up on them. To your second comment, this unfortunately is wrong. Holt Winters additive seasonality is an ARIMA model. ARIMAX simply extends ARIMA models. – Hidden Markov Model Jan 28 '16 at 20:52
  • If the correctness of a comment is in dispute, that's not really a matter for mod-flags (mods are not really arbiters of correctness of *content*). Just explain why you think it's wrong. If discussion is extended you can then take it to chat. – Glen_b Jan 29 '16 at 00:23
  • Yes, the author's question sounds as if lags were relevant. I'm just unsure about how a RF would automatically determine that something at time $t$ is more important than at time $t-1$. I think it would be helpful if you could edit your question to include an example, or to point to some references. ARIMAX will likely not be useful, because the external factors need to *vary* in the estimation sample so their parameters can be estimated, and "features" are commonly stable over time. In addition, you will need *a lot* of data for ARIMAX to be able to estimate multiple parameters. – Stephan Kolassa Jan 29 '16 at 07:44
  • RF's variable importance plot tells us which lags for which variables are important and unimportant. – Hidden Markov Model Jan 29 '16 at 20:14
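For reference, the lag discussion in the comments above can be made concrete with a small sketch. It assumes the lags are built by hand as feature columns (scikit-learn does not create them) and uses a synthetic AR(1) series rather than real match data; the column names `lag_1` to `lag_3` are illustrative. The fitted model's `feature_importances_` attribute then gives one importance value per lag column.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic AR(1) series: each value depends mainly on the previous one.
rng = np.random.default_rng(0)
n, k = 600, 3
series = np.zeros(n)
for t in range(1, n):
    series[t] = 0.8 * series[t - 1] + rng.normal()

# Hand-built lag features: column i holds lag (k - i), target is X_t.
X = np.column_stack([series[i:n - k + i] for i in range(k)])
y = series[k:]
lag_names = [f"lag_{k - i}" for i in range(k)]

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# One importance per lag column; the values sum to 1.
importances = dict(zip(lag_names, model.feature_importances_))
for name, imp in importances.items():
    print(f"{name}: {imp:.3f}")
```

Note that this ranks lag columns within each learning example; it does not by itself make recent learning examples count for more than old ones.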