
We have a model which predicts the start time of an event (let's call it predicted_start). We also have a default start time for each event (let's call it default_start), but it's usually not correct, which is why we built the model to predict a more accurate start time.

The model is doing great, but sometimes it's wrong and predicted_start differs greatly from the actual start (let's call it actual_start). Also, sometimes it's right and predicted_start can differ greatly from default_start and still be correct.

It would be nice to know the probability of predicted_start being correct, i.e. close to actual_start. It's not a random guess, so there has to be a probability distribution somewhere... right? This validation would also probably depend on the offset from default_start, and maybe the previous event's offset from default_start; not sure, maybe this doesn't need to be that complicated?

Can't really wrap my head around this and would greatly appreciate any pointers.

EDIT: I have considered logistic regression of some sort, but was hoping someone knew a better solution.

  • I think you are asking for prediction intervals. If you use linear regression (e.g. with nonlinear inputs), you can do this analytically; otherwise bootstrap... – seanv507 Dec 27 '19 at 10:30
  • Thanks. As I wrote to Tim: to clear up my reasoning for not using logistic regression, a larger offset does not mean a higher probability of an incorrect prediction, and vice versa. That's what makes this problem a bit too tricky for me. – NorwegianClassic Dec 27 '19 at 16:48
  • Offset from what? default_start? This should be "irrelevant", based on how the model is set up: e.g. the model might be `predicted_start = default_start if time < midnight else 3*default_start`. If you predict `actual_start`, then a prediction interval tells you the "plausible" range a new `actual_start` will be in, taking into account the amount of data you have to estimate the parameters and the typical error size. https://en.wikipedia.org/wiki/Prediction_interval – seanv507 Dec 27 '19 at 17:41
  • Thank you! I thought the offset had to be included in some way, but of course it doesn't. I will have a look at your suggestion; it seems to be what I was looking for. – NorwegianClassic Dec 27 '19 at 18:54
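
Following the prediction-interval suggestion in the comments, here is a minimal bootstrap sketch (one of several possible variants), assuming a feature matrix `X`, observed `actual_start` times `y`, and features `x_new` for a new event; the toy data and all names are placeholders:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch of a bootstrap prediction interval for actual_start.
# X, y, x_new are placeholder names for your own data.
rng = np.random.default_rng(0)

n, k = 200, 3
X = rng.normal(size=(n, k))                      # toy features
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.3, size=n)
x_new = rng.normal(size=(1, k))                  # the new event

n_boot = 2000
preds = np.empty(n_boot)
for b in range(n_boot):
    idx = rng.integers(0, n, size=n)             # resample rows with replacement
    model = LinearRegression().fit(X[idx], y[idx])
    resid = y[idx] - model.predict(X[idx])
    # point estimate from the refit model + one resampled residual
    preds[b] = model.predict(x_new)[0] + rng.choice(resid)

lo, hi = np.percentile(preds, [2.5, 97.5])
print(f"approx. 95% prediction interval for actual_start: [{lo:.2f}, {hi:.2f}]")
```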

1 Answer


What about predicting the "corrections" for the default time? If $y$ is the time you want to predict, and $\tilde y$ is the "default" time you already know, then instead of predicting $y$ you would predict $y - \tilde{y}$ (i.e. $\tilde{y}$ would be an offset variable in the regression).
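
A minimal sketch of this offset idea, assuming your features and times are available as NumPy arrays (the toy data and names below are placeholders):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Sketch: regress the correction (actual_start - default_start) on features,
# then add the default back to get the final prediction.
# X, default_start, actual_start are placeholder names for your own data.
rng = np.random.default_rng(1)
X = rng.normal(size=(500, 4))
default_start = rng.uniform(0, 100, size=500)
actual_start = default_start + X @ np.array([2.0, 0.0, -1.0, 0.5]) + rng.normal(size=500)

correction_model = LinearRegression().fit(X, actual_start - default_start)
predicted_start = default_start + correction_model.predict(X)
```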

If this doesn't work, you can use logistic regression, as you mentioned, or combine everything into a single model. The model could be something like

$$ y = \pi \hat y + (1-\pi) \tilde y + \varepsilon $$

where $\pi \in [0, 1]$. So $\pi$ would tell you about the model's confidence in the prediction $\hat y$. By doing this in a single model, it could learn when to use $\tilde y$ and when to ignore it and rely on the $\hat y$ values. This would simplify the task of predicting $\hat y$ as well.

To predict $\hat y$ and $\pi$ you could modify the model to

$$ (\mu_1, \mu_2) = f(\mathbf{X}) \\ \hat y = \mu_1, \qquad \pi = \sigma(\mu_2) $$

where $f$ is some model, and $\sigma$ is the sigmoid function.

The simplest case for $f$ could be a linear regression model with $k-1$ features plus an intercept in $\mathbf{X}$:

$$ \overbrace{\boldsymbol{\mu}}^{(n \times 2)} = \overbrace{\mathbf{X}}^{(n \times k)} \overbrace{\boldsymbol {\beta}}^{(k \times 2)} $$
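
A minimal sketch of this linear case, assuming `X` already includes an intercept column, using NumPy and `scipy.optimize.minimize` to fit $\boldsymbol\beta$ by minimizing the squared error of the blended prediction (the toy data and names are placeholders):

```python
import numpy as np
from scipy.optimize import minimize

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(beta_flat, X, y, y_default):
    """Squared loss for y ~ pi * y_hat + (1 - pi) * y_default,
    where (mu1, mu2) = X @ beta, y_hat = mu1, pi = sigmoid(mu2)."""
    beta = beta_flat.reshape(X.shape[1], 2)
    mu = X @ beta
    y_hat, pi = mu[:, 0], sigmoid(mu[:, 1])
    y_pred = pi * y_hat + (1.0 - pi) * y_default
    return np.mean((y - y_pred) ** 2)

# Placeholder data: X has an intercept column, y_default is default_start,
# y is actual_start. Replace these with your own arrays.
rng = np.random.default_rng(2)
n, k = 400, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k - 1))])
y_default = rng.uniform(0, 100, size=n)
y = y_default + rng.normal(size=n)

beta0 = np.zeros(k * 2)
res = minimize(loss, beta0, args=(X, y, y_default), method="L-BFGS-B")
beta_hat = res.x.reshape(k, 2)

mu = X @ beta_hat
y_hat, pi = mu[:, 0], sigmoid(mu[:, 1])   # pi ~ model's confidence in y_hat
```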

But if you want to consider changes over time, you could use something like

$$ (\mu_1, \mu_2) = \mathsf{LSTM}(\mathbf{X}) $$

where $\mathsf{LSTM}$ is an LSTM-based recurrent neural network.
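
A rough sketch of that variant, assuming PyTorch is available; the module, shapes, and the short training loop below are illustrative placeholders rather than a prescribed architecture:

```python
import torch
import torch.nn as nn

class StartTimeLSTM(nn.Module):
    """For each event in a sequence, emit (mu1, mu2) and map them to
    y_hat = mu1, pi = sigmoid(mu2), then blend with the default time."""
    def __init__(self, n_features, hidden_size=32):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.head = nn.Linear(hidden_size, 2)

    def forward(self, x, y_default):
        h, _ = self.lstm(x)                  # (batch, seq_len, hidden)
        mu = self.head(h)                    # (batch, seq_len, 2)
        y_hat = mu[..., 0]
        pi = torch.sigmoid(mu[..., 1])
        return pi * y_hat + (1.0 - pi) * y_default, y_hat, pi

# Placeholder shapes: 8 sequences of 20 events with 5 features each.
model = StartTimeLSTM(n_features=5)
x = torch.randn(8, 20, 5)
y_default = torch.rand(8, 20) * 100
y = y_default + torch.randn(8, 20)

opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(10):                          # squared loss on the blended prediction
    opt.zero_grad()
    y_pred, _, _ = model(x, y_default)
    ((y_pred - y) ** 2).mean().backward()
    opt.step()
```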

If you cannot incorporate the model that makes the $\hat y$ predictions into a single model, as described above, you can build a higher-level model that either chooses between the predictions $\hat y$ and the default values $\tilde y$, or computes a weighted average of them. In the first case, you would use a classifier such as logistic regression or a random forest to make the choice. Alternatively, you could build a model that learns to weight the two outcomes by $\pi$, as described above: such a model would predict the weights, and you would train it by minimizing a loss (e.g. squared error) between the weighted mean of the predicted and default values and the true values.
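
A minimal sketch of the higher-level classifier with scikit-learn, where the training label is simply whether `predicted_start` was closer to `actual_start` than `default_start` on past events; all names and the toy data are placeholders, and the last line also shows the weighted-average variant:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Sketch: given event features, decide whether to trust predicted_start
# or fall back to default_start. Labels come from which of the two was
# closer to actual_start on historical events.
rng = np.random.default_rng(3)
n = 1000
features = rng.normal(size=(n, 6))           # anything describing the event
predicted_start = rng.uniform(0, 100, size=n)
default_start = rng.uniform(0, 100, size=n)
actual_start = rng.uniform(0, 100, size=n)

trust_prediction = (np.abs(predicted_start - actual_start)
                    < np.abs(default_start - actual_start)).astype(int)

clf = LogisticRegression().fit(features, trust_prediction)
p_trust = clf.predict_proba(features)[:, 1]  # P(predicted_start is the better choice)

chosen = np.where(p_trust > 0.5, predicted_start, default_start)
blended = p_trust * predicted_start + (1.0 - p_trust) * default_start
```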

Tim
  • Thank you for replying. The "model" is really nothing more than a simple algorithm looking for human-defined characteristics; shouldn't have used the word "model". The algorithm is _usually_ pretty spot on; wouldn't your single-model suggestion potentially shift the predicted start time if pi isn't strictly 1? Also, what is epsilon in this case? – NorwegianClassic Dec 27 '19 at 14:09
  • @NorwegianClassic Yes, it would "shift" the predictions; it'd make them a weighted average between the "default" and "predicted" values. What the error term would be depends on the details of the model definition you choose, since I defined it in very broad terms. – Tim Dec 27 '19 at 14:46
  • @NorwegianClassic If the "model" is just a rule-based algorithm, then using logistic regression (or other classifier) as you described is probably enough. – Tim Dec 27 '19 at 14:51
  • Thank you again @Tim. Just to clear up my reasoning for not using logistic regression: a larger offset does not mean a higher probability of an incorrect prediction, and vice versa. That's what makes this problem a bit too tricky for me. – NorwegianClassic Dec 27 '19 at 16:47
  • @NorwegianClassic I never said that the size of the offset has anything to do with either of the approaches, logistic regression included. – Tim Dec 27 '19 at 16:50
  • Good point... I was caught up with the idea of using the offset. Thanks again! – NorwegianClassic Dec 27 '19 at 18:50