Predict Based on Prediction?

Question

I am working on a binary classification task with a pretty straightforward input set of numeric features. One of these features is particularly good, but it cannot be used in real life because it is a measure that only becomes available after the event has occurred. Is it possible to predict this measure based on the other features, and then build a model including this predicted measure?

In more detail, I am building a classifier for this challenge from the UCI repo: https://archive.ics.uci.edu/ml/datasets/bank+marketing

The feature that cannot be used is the call duration because one can't know how long a call will last before it takes place. So I am wondering, could I build a regression model or at least a binned classifier to predict how long a call will last before it takes place, then feed this prediction to the model and replace the provided call duration feature?

That's basically a form of [stacking](http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/), so yes. But you have to be very careful about how you do your train-test splits to make certain you haven't included info from that variable somewhere you shouldn't have. — Dan, Jan 04 '19 at 18:04
Thank you for your reply! In other words, I will have two models; model 1 doesn't include the extra feature, and model 2 will. If I am understanding correctly, you are saying model 2 should use the same train-test split from model 1? — Odisseo, Jan 04 '19 at 18:08
I suggest this: let's say you have your target feature $T$ and the other feature you can't normally use, $T'$. You can make two models, $T = F(x)$ and $T' = F'(x)$, using the same training/validation data. But then you'll want to stack these into a third model, so $T'' = F''(T,T')$. For this model $F''$ you have to use a different (future) train/valid set. — Dan, Jan 04 '19 at 18:16
If instead you want to build a model $T'' = F''(T',x)$, then you need to use one dataset for building $T'=F'(x)$ and separate, future data for $T'' = F''(T',x)$. — Dan, Jan 04 '19 at 18:22
But what if all the features (except call duration) are not good at predicting call duration? Will that predicted call duration feature be of any help in predicting the actual target variable? — Harshit Mehta, Jan 04 '19 at 19:45
That is exactly what ended up happening! Unfortunately. So I am a bit stuck, I would love to do this stacking ensemble but if I can't create a good model for Z I might have to give up. Is there anything you would suggest? I tried data augmentation but I really have nowhere to get more info on these call durations. Thanks! — Odisseo, Jan 09 '19 at 07:18

Aksakal · Accepted Answer · 2019-01-04T18:20:12.960

Let's formulate the problem. You have two sets of features, $X_t$ and $Z_t$. The former is available in the future, $X_{t+1}$, while the latter is not: $Z_{t+1}=?$

You want to forecast some quantity $Y_t$ conditional on these features: $\hat Y_{t+1}=f(X_{t+1},Z_{t+1})$. The trouble is that $Z_{t+1}$ is not known, so you propose to first obtain $\hat Z_{t+1}=g(X_{t+1},Z_t)$, then plug it into your first model: $\hat Y_{t+1}=f(X_{t+1},g(X_{t+1},Z_t))$. Now you can use information that is available in the future, i.e. $X_{t+1}$.

First, this can be done, and is done in practice.

However, conceptually, it is similar to simply building a model on what's available in the future: $$Y_{t+1}=h(X_{t+1})$$

So, isn't it better to simply fit the second model $h()$ instead of the two-step approach with $f()$ and $g()$? It depends. On one hand, the second approach is simpler, and thus can be more robust. On the other hand, the first approach may allow you to capture something that is not easy to incorporate in the second model.
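As a rough illustration of that trade-off, here is a minimal sketch on synthetic data (not the bank marketing set). Everything in it is an assumption made for the example: the data-generating process, the helper names `fit_ls`, `predict_ls`, `quad` and `out_of_fold_z`, and the choice of a quadratic $g$ with linear $f$ and $h$. It also produces $\hat Z$ for the training rows out of fold, in the spirit of the comments about not leaking the unavailable variable.

```python
import numpy as np

def fit_ls(A, b):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A1 = np.column_stack([np.ones(len(A)), A])
    coef, *_ = np.linalg.lstsq(A1, b, rcond=None)
    return coef

def predict_ls(coef, A):
    """Predictions from coefficients produced by fit_ls."""
    return np.column_stack([np.ones(len(A)), A]) @ coef

def quad(X):
    """Hypothetical feature map for g: raw features plus their squares."""
    return np.column_stack([X, X ** 2])

def out_of_fold_z(X, Z, n_folds=5, seed=0):
    """Predict Z for every row with a g that never saw that row."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(X)), n_folds)
    z_hat = np.empty(len(X))
    for k, held_out in enumerate(folds):
        rest = np.concatenate([f for j, f in enumerate(folds) if j != k])
        coef = fit_ls(quad(X[rest]), Z[rest])
        z_hat[held_out] = predict_ls(coef, quad(X[held_out]))
    return z_hat

# Synthetic data (assumed): Z depends nonlinearly on X, Y depends on X and Z.
rng = np.random.default_rng(42)
n = 2000
X = rng.normal(size=(n, 2))
Z = X[:, 0] ** 2 + 0.5 * X[:, 1] + rng.normal(scale=0.3, size=n)
Y = 2.0 * Z + X[:, 0] + rng.normal(scale=0.3, size=n)
train, test = np.arange(0, 1500), np.arange(1500, n)

# Step 1: g predicts Z from X. Training rows get out-of-fold predictions;
# test rows get predictions from g refit on all training rows.
z_hat_train = out_of_fold_z(X[train], Z[train])
g_coef = fit_ls(quad(X[train]), Z[train])
z_hat_test = predict_ls(g_coef, quad(X[test]))

# Step 2: f is fit on (X, Z_hat); the true Z is never used at prediction time.
f_coef = fit_ls(np.column_stack([X[train], z_hat_train]), Y[train])
y_two_step = predict_ls(f_coef, np.column_stack([X[test], z_hat_test]))

# Baseline: the direct model h(X), here restricted to be linear in X.
h_coef = fit_ls(X[train], Y[train])
y_direct = predict_ls(h_coef, X[test])

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

rmse_two_step = rmse(Y[test], y_two_step)
rmse_direct = rmse(Y[test], y_direct)
print(f"two-step RMSE: {rmse_two_step:.3f}, direct RMSE: {rmse_direct:.3f}")
```

In this constructed case the two-step route wins only because $g$ is allowed a nonlinearity that the linear $h$ cannot represent. Give $h$ the same quadratic features and the advantage should largely disappear, which is the sense in which the two approaches are conceptually similar.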

I run into this issue all the time, and pick a different path case by case.

Here's a trivial example where you want to take the second approach. Suppose you're limited to linear modeling: $y=X\beta_x+Z\beta_z$ and $Z=X\beta_{zx}$. Then you have $y=X\beta_x+X\beta_{zx}\beta_z=X(\beta_x+\beta_{zx}\beta_z)$. This is equivalent to the second approach of modeling on just $X_t$, so you don't bother with two steps and fit $y=X\beta$ directly.
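The collapse in the linear case can be checked numerically. The sketch below uses synthetic data and arbitrary coefficients chosen only for illustration. It fits all three regressions by ordinary least squares on the same sample, with no intercept so that it matches the formula above.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.normal(size=(n, 3))
# Assumed data-generating process; the coefficients are arbitrary.
Z = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=n)
y = X @ np.array([0.3, 0.1, -1.0]) + 1.5 * Z + rng.normal(scale=0.5, size=n)

# f: y = X beta_x + Z beta_z, fit with the true Z
beta_f, *_ = np.linalg.lstsq(np.column_stack([X, Z]), y, rcond=None)
beta_x, beta_z = beta_f[:-1], beta_f[-1]

# g: Z = X beta_zx
beta_zx, *_ = np.linalg.lstsq(X, Z, rcond=None)

# h: y = X beta, fit directly
beta_direct, *_ = np.linalg.lstsq(X, y, rcond=None)

# Plugging g into f gives X (beta_x + beta_zx * beta_z)
beta_collapsed = beta_x + beta_zx * beta_z
y_two_step = X @ beta_x + (X @ beta_zx) * beta_z
y_direct = X @ beta_direct

print(np.allclose(beta_collapsed, beta_direct))
print(np.allclose(y_two_step, y_direct))
```

When all three fits use the same rows, the two coefficient vectors agree up to floating-point error, so the two-step model adds nothing over $y=X\beta$.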

edited Jan 04 '19 at 18:20

answered Jan 04 '19 at 18:14

Thank you for the formalization, this makes my question clearer even to me. Unfortunately, I am going to have to go with Option 1 because that variable is very good. Would you recommend a particular approach? What seems to have worked in your experience? Would you recommend stacking? Thanks again! – Odisseo Jan 04 '19 at 18:18
@Odisseo, see my update to the answer. Basically, in a linear setup you have to be careful not to end up in a situation where it's a waste of time to model in two steps – Aksakal Jan 04 '19 at 18:21
Thanks, this makes sense. Aside from using g(X,Z), I think it would also be helpful to add more models that use the known variable X, each over a different time window, as inputs to f(X, g). So the final ensemble would be f(X_30days, X_60days, X..., Z). Do you think this would be a good idea? Basically, doing your option 1 in conjunction with an approach similar to what is described here: https://stats.stackexchange.com/questions/47950/ensemble-time-series-model – Odisseo Jan 04 '19 at 18:38
@Odisseo, in my answer $X_t$ really means $X_{s\le t}$, i.e. all information known at time $t$ and earlier. So, you can have any number of lags that make sense. Also, you can have dynamic models where $\hat Y_{t+1}=f(X_t,Y_t)$ etc. I kept the notation simple to get to the root of the problem. – Aksakal Jan 04 '19 at 18:50
To test this idea and create a model for Z(X), I built a regression model, but unfortunately I realized during implementation that I have very poor predictors, and I never got above an $R^2$ of 0.3. Is there anything else you would recommend doing in this situation? It seems like I am running out of options. Thanks! – Odisseo Jan 09 '19 at 07:17
