
I've built many models on customer data over time, across many different types of datasets, and it puzzles me that logistic regression consistently shows only small performance deviations across months, in both the training and test periods. Random forests, on the other hand, tend to show a vast performance gap between the months used for training and the months used for testing. Test performance is similar between the two approaches.

The training periods are highly correlated, since the same customers appear each month, but why is there such a sharp drop when moving to the test set, and only for RF? I'm hesitant to call it overfitting, since the same thing happens no matter how much I tune the hyperparameters, and test performance is in any case similar to the logistic model's.

Is there any theoretical reason for this?
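As an illustration of the pattern (a minimal sketch on synthetic data, not the asker's actual setup; the feature coefficients and split are made up for the example): an unconstrained random forest typically scores near-perfect AUC on its own training rows, while logistic regression does not, yet the two can land close together out of sample.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

# Synthetic data with a true logistic signal (hypothetical example)
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 5))
p = 1 / (1 + np.exp(-(X[:, 0] + 0.5 * X[:, 1])))
y = rng.binomial(1, p)

# Stand-in for "training months" vs "test months"
X_tr, X_te, y_tr, y_te = X[:1500], X[1500:], y[:1500], y[1500:]

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
lr = LogisticRegression().fit(X_tr, y_tr)

for name, m in [("RF", rf), ("LR", lr)]:
    tr = roc_auc_score(y_tr, m.predict_proba(X_tr)[:, 1])
    te = roc_auc_score(y_te, m.predict_proba(X_te)[:, 1])
    print(f"{name}: train AUC {tr:.3f}, test AUC {te:.3f}")
```

With fully grown trees, the RF's train AUC sits far above its test AUC, while logistic regression's train and test AUCs stay close, which mirrors the month-to-month pattern described above.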

Blopblop
  • Are you doing Forecasting or any kind of extrapolation? – Jon Nordby Aug 27 '20 at 19:09
  • You say that the test performance is around the same. So what is the problem? That the train perf is higher with RF? – Jon Nordby Aug 27 '20 at 19:10
  • What kind of hyperparameters did you try to limit overfitting? – Jon Nordby Aug 27 '20 at 19:10
  • Start by checking all your input variables to see how their profiles compare between train and test. I'm guessing one of the variables either describes the time period itself (like Month(date)) or is vastly different in the test set. I'm not suggesting using the test set to tune the model, but this could help give you a business intuition behind what's happening. Remember that RF bins historical observations, so if the test set yields new values it can still only put them in an existing bin, while regression can extrapolate the effect of the new, unseen value. – Josh Aug 27 '20 at 19:27
  • @jonnor the models are classifiers. Typically my target would be something like "customer did X in the next 3 months", common for predicting churn, credit scoring, and similar tasks. – Blopblop Aug 27 '20 at 19:40
  • @jonnor Yes the train perf is higher with RF. I wouldn't say it's a problem, just thought it was curious how often it happens and if there's anything here beyond "your RF is overfitted". Played around with number of trees, depth, branching numbers, minimum number of examples per leaf... – Blopblop Aug 27 '20 at 19:43
  • @Josh A variable describing the time period would explain a sharp drop between train/test, but why would it affect only the RF? – Blopblop Aug 27 '20 at 19:51
  • How are you measuring performance? Out-of-bag? – Scortchi - Reinstate Monica Aug 27 '20 at 20:59
  • @Scortchi-ReinstateMonica Gini coefficient / AUC – Blopblop Aug 27 '20 at 22:39
  • Out-of-bag AUC? (It would help to add more detail about what you're doing to the question; otherwise answers are going to be speculative.) See e.g. https://stats.stackexchange.com/q/66543/17230 for a possible explanation. – Scortchi - Reinstate Monica Aug 27 '20 at 23:14
  • I read some answers about OOB and it seems to be what I was looking for! Thank you. It seems I was asking the wrong questions :) – Blopblop Aug 28 '20 at 02:35
  • You're welcome. I've closed this question as a duplicate of that one. – Scortchi - Reinstate Monica Aug 28 '20 at 07:30

0 Answers