
Assume we are receiving a continuous time-series: $$X_1 = \{x_{1,1},\ldots,x_{1,n}\} \in \mathbb{R}^n$$ $$\vdots$$ $$X_i = \{x_{i,1},\ldots,x_{i,n}\} \in \mathbb{R}^n$$

At each step $i$ (knowing the full past history $X_1,\ldots,X_i$) we want to predict some other variable $y_i\in\mathbb{R}$. The time series $y_1,\ldots,y_i$ is also continuous and is generally mean-reverting to 0 (i.e. it evolves around 0).

We are given some training data $X_1,\ldots,X_N$ with $y_1,\ldots,y_N$ of length $N$, which we can use to analyse the relationship between $X$ and $y$, find a model that fits this data, etc.

Now, in production (with a time-series of length $M$ where $N\ll M$), we only receive $X_i$ at each time step $i$, but never $y_i$. So while in production, we have no way of knowing whether our current estimate $\hat{y}_i$ has diverged from $y_i$ (since we don't even receive/know past values of $y_1,\ldots,y_{i-1}$ at step $i$).

How would one approach a problem like this? I am asking because I have only come across setups where one knows $X_1,\ldots,X_i$ and $y_1,\ldots,y_{i-1}$ at step $i$, so that the model can be re-calibrated at each step. But if we never receive any $y_i$, then I feel the only option is to calibrate the model once on the training data, e.g. with multiple linear regression (and then hope for the best). However, I suspect a single fit is probably not going to fit all the data well.
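To make the "calibrate once and hope for the best" baseline concrete, here is a minimal sketch in numpy. All data here is synthetic and the sizes are made up for illustration; the point is only that the coefficients are frozen after training and production predictions use $X_i$ alone:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up training data: N observations of n features with a noisy linear y.
N, n = 500, 3
X_train = rng.normal(size=(N, n))
true_beta = np.array([0.5, -1.0, 0.25])
y_train = X_train @ true_beta + 0.1 * rng.normal(size=N)

# Single calibration: ordinary least squares on the training window only.
beta_hat, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)

# In production we only ever see X_i; every prediction uses the frozen fit,
# and we never get to compare y_hat against the true y_i.
X_new = rng.normal(size=(10, n))
y_hat = X_new @ beta_hat
```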

So perhaps one could split/identify different regimes in the training data (based on what the current $X_i$ is), and then fit a model for each regime independently; that is, do localised calibration (almost like a hash table). The question then is: what sort of technique would one use to identify these local clusters?
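One simple instance of this idea: cluster the training $X_i$ (e.g. with k-means), fit one regression per cluster, and at prediction time route each new $X_i$ to the nearest centroid's model. A numpy-only sketch, with two entirely synthetic regimes whose $X \to y$ mapping flips sign (all numbers are made up; a real application would need to choose the number of clusters and the clustering method carefully):

```python
import numpy as np

rng = np.random.default_rng(1)

def kmeans(X, k, iters=50):
    """Plain k-means: farthest-point initialisation, then Lloyd iterations."""
    centroids = [X[0]]
    for _ in range(k - 1):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        centroids.append(X[np.argmax(d)])  # next seed: farthest from current seeds
    centroids = np.array(centroids)
    for _ in range(iters):
        labels = np.argmin(((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# Synthetic training data with two regimes (X clustered around -3 and +3)
# where the linear relationship between X and y flips sign.
N, n, k = 600, 2, 2
X_train = np.vstack([rng.normal(-3, 1, size=(N // 2, n)),
                     rng.normal(3, 1, size=(N // 2, n))])
regime_betas = [np.array([1.0, -0.5]), np.array([-1.0, 0.5])]
y_train = np.concatenate([X_train[:N // 2] @ regime_betas[0],
                          X_train[N // 2:] @ regime_betas[1]]) + 0.1 * rng.normal(size=N)

# Localised calibration: one OLS fit per cluster -- the "hash table" of models.
centroids, labels = kmeans(X_train, k)
models = {j: np.linalg.lstsq(X_train[labels == j], y_train[labels == j], rcond=None)[0]
          for j in range(k)}

def predict(x):
    j = int(np.argmin(((centroids - x) ** 2).sum(axis=1)))  # nearest-centroid regime
    return x @ models[j]
```

A global linear fit cannot represent both regimes at once here, whereas the per-cluster fits recover each local relationship.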

Any thoughts greatly appreciated.


Edit: It's not that I am missing $y_{N+1},\ldots,y_M$ (the values which are not in the training data). I theoretically have them, but I have to pretend that whenever I receive the next $X_i$ I do not know the corresponding $y_i$ if $i>N$. That is, the model can only ever know $y_1,\ldots,y_N$ (the values from the training data).

Jamie LL

1 Answer


Whether conditioning on specific $X$ regimes makes sense is conceptually independent of whether you only observe a small subset of your actual outcomes. It may make sense (or not) regardless of whether you are predicting "far" or "soon" beyond your training sample.

So, by all means, if you believe it makes sense, then try it. However, don't expect magic from this model.

If you have enough training data, it might make sense to simulate your proposed model (or whatever else you envisage): fit it once knowing only $y_1, \dots, y_k$ and once knowing $y_1, \dots, y_i$, with $k\ll i$, and check the difference in predictive power. This will give you an idea of whether it would be worthwhile to invest resources to potentially obtain the missing data $y_{k+1}, \dots, y_i$.
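A sketch of such a simulation, using made-up synthetic data with a slowly drifting $X \to y$ relationship so that the gap between the two fits is visible (all names, sizes, and the drift model are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)

T, n, k = 1000, 2, 200  # production length, features, training cutoff (k << T)
X = rng.normal(size=(T, n))
# Slowly drifting coefficients: a fit frozen at step k degrades over time.
betas = np.array([1.0, -1.0]) + 0.002 * np.arange(T)[:, None] * np.array([1.0, 1.0])
y = (X * betas).sum(axis=1) + 0.1 * rng.normal(size=T)

def ols(A, b):
    return np.linalg.lstsq(A, b, rcond=None)[0]

frozen = ols(X[:k], y[:k])  # calibrated once, on the training window only
err_frozen, err_refit = [], []
for i in range(k, T):
    err_frozen.append((y[i] - X[i] @ frozen) ** 2)
    refit = ols(X[:i], y[:i])  # pretends y_1, ..., y_{i-1} were observed
    err_refit.append((y[i] - X[i] @ refit) ** 2)

# The "price tag" of not observing y in production: extra mean squared error.
price_tag = float(np.mean(err_frozen) - np.mean(err_refit))
```

Comparing the two error series shows how much predictive accuracy the unavailability of $y$ in production actually costs you in this scenario.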

Often, when you can put a price tag on data like this, data that used to be "unavailable" suddenly becomes available. I would expect much more magic to happen through a process like this than through a highly nifty model. Or alternatively, you may find that the degradation in predictive performance is not so large after all, in which case you can relax about not having the data.

Stephan Kolassa
  • Hi Stephan, it's not that I am missing $y_{N+1},\ldots,y_M$ (the values which are not in the training data). I theoretically have them, but I have to pretend that whenever I receive the next $X_i$ I do not know the corresponding $y_i$ if $i>N$. That is, the model can only ever know $y_1,\ldots,y_N$ (the values from the training data). – Jamie LL Jul 08 '19 at 15:59
  • Good. Pretend you don't have the data. Simulate having and not having the data in the way I suggest. Put a price tag on not having the data. Take that price tag to whoever makes you pretend. Start a discussion. – Stephan Kolassa Jul 08 '19 at 16:03
  • Thanks, Stephan. But there is no "price tag" on the data. It's a theoretical exercise to see how to predict time-series $y$, given $X$ whilst not knowing any new/recent values of $y$. I then judge the model based on the differences between $y_i - \hat{y}_i$ (true vs predicted) to see how it performs. But I don't know a concrete model that does this. Multi-linear regression (calibrating once to the training data) does not yield good results. – Jamie LL Jul 08 '19 at 16:13
  • The price tag is exactly the deterioration in predictive capability caused by the unavailability of data. Regarding "does not yield good results", you may be interested in [Is my model any good, based on the diagnostic metric (R2 / AUC / accuracy / etc) value?](https://stats.stackexchange.com/q/414349/1352) and [How to know that your machine learning problem is hopeless?](https://stats.stackexchange.com/q/222179/1352) – Stephan Kolassa Jul 08 '19 at 16:19