Machine learning models are fit to a response variable observed within a given range. This can lead to weak, and sometimes disastrous, performance on instances whose true response falls outside that range.
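To make the failure mode concrete, here is a minimal sketch (a toy example of my own; the data and model choice are assumptions, not taken from the links below) of a tree-based model trained on x in [0, 10] and queried well outside that range:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Train on x in [0, 10]; the true relationship is y = 2x plus noise.
X_train = rng.uniform(0, 10, size=(200, 1))
y_train = 2 * X_train.ravel() + rng.normal(0, 0.5, size=200)

model = RandomForestRegressor(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# A forest can only average leaf values it saw during training, so far
# outside the range it plateaus near max(y_train) ~ 20 instead of
# following the trend up to 30.
print(model.predict([[15.0]]))
```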
When the underlying mechanism (e.g., a physics-based formula) is known, an ML model performs better if it incorporates that formula as a descriptor (as pointed out by this answer). But we don't always have the luxury of knowing the underlying mechanism.
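A small sketch of what I mean by incorporating the formula as a descriptor (again a toy example; I assume the known mechanism is quadratic, e.g. a drag-like y = 0.5 v²):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)

# Known mechanism: y = 0.5 * v**2 plus noise, observed only on [0, 10].
v_train = rng.uniform(0, 10, size=(200, 1))
y_train = 0.5 * v_train.ravel() ** 2 + rng.normal(0, 1.0, size=200)

# Raw feature only: the straight-line fit goes badly wrong off-range.
raw = LinearRegression().fit(v_train, y_train)

# Physics-informed: add v**2 as a descriptor, so the model's functional
# form matches the mechanism and extrapolation becomes interpolation
# in the transformed feature.
informed = LinearRegression().fit(np.hstack([v_train, v_train**2]), y_train)

v_new = np.array([[20.0]])  # well outside the training range; truth = 200
print(raw.predict(v_new))
print(informed.predict(np.hstack([v_new, v_new**2])))
```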
There are also examples of how poorly certain models extrapolate (here is a blog post comparing several models, and here is a related question sitting in the SE archive of unanswered favorites).
So the questions are:
1- Model selection: Are there established models that are less vulnerable to the extrapolation problem? (For example, are neural network models better at extrapolation than regression-based models? See the first sketch after this list.)
2- Diagnosis: Are there performance metrics (if any) specifically designed to characterize the extrapolation capability of a model? One obvious approach is to test the model on out-of-range instances and report the error (the second sketch below does this), but that is neither systematic nor statistically sound.
3- Improvement: Besides the obvious (expanding the range of the training set), are there ways to improve the extrapolation performance of a model? Biased sampling of the training set, or tweaking the loss function to penalize errors on extreme responses more heavily (the third sketch below tries the weighting idea), could potentially help. Are there systematic methods or published articles that provide guidance on this?
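For question 1, a quick comparison sketch (the model choices and data are my own assumptions) illustrating that extrapolation behaviour is tied to a model's functional form: a linear model carries its trend outward, a ReLU network extrapolates piecewise-linearly (with no guarantee the slope is right), and a forest plateaus:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 2 * X.ravel() + rng.normal(0, 0.5, size=300)

models = {
    "linear": LinearRegression(),
    "forest": RandomForestRegressor(n_estimators=200, random_state=0),
    "mlp": MLPRegressor(hidden_layer_sizes=(32,), max_iter=5000, random_state=0),
}

X_out = np.array([[15.0]])  # true value is 30
for name, m in models.items():
    m.fit(X, y)
    # The linear model follows the trend, the forest plateaus near
    # max(y), and the ReLU MLP extrapolates linearly but with a slope
    # that depends on how training went.
    print(name, m.predict(X_out))
```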
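For question 2, the "test on out-of-range instances" idea can at least be made repeatable by splitting on response quantiles instead of at random, so the held-out set is out of range by construction (a sketch of my own, not an established metric):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(500, 1))
y = 0.5 * X.ravel() + np.sin(X.ravel()) + rng.normal(0, 0.1, size=500)

# Hold out the top 20% of the response as an out-of-range test set,
# plus a random slice of the remainder as an in-range control.
cut = np.quantile(y, 0.8)
X_in, y_in = X[y < cut], y[y < cut]
X_out, y_out = X[y >= cut], y[y >= cut]
X_tr, X_te, y_tr, y_te = train_test_split(X_in, y_in, test_size=0.25, random_state=0)

model = GradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
print("in-range MAE:     ", mean_absolute_error(y_te, model.predict(X_te)))
print("out-of-range MAE: ", mean_absolute_error(y_out, model.predict(X_out)))
```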
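For question 3, the loss-tweaking idea could look like the following (sample weights that grow with how extreme the response is; whether this actually helps off-range is exactly what I am asking):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(500, 1))
y = 0.5 * X.ravel() ** 1.5 + rng.normal(0, 0.5, size=500)

# Up-weight instances with extreme responses so the fitting loss pays
# more attention to the tails of the response distribution.
weights = 1.0 + 2.0 * np.abs(y - y.mean()) / y.std()

plain = GradientBoostingRegressor(random_state=0).fit(X, y)
weighted = GradientBoostingRegressor(random_state=0).fit(X, y, sample_weight=weights)

x_edge = np.array([[9.9]])  # near the edge of the training range
print(plain.predict(x_edge), weighted.predict(x_edge))
```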