I have a problem that inherently requires extrapolation. I am aware that this is a critical issue for most (if not all) machine learning models.
Yet, given the physical phenomenon underlying the experiment, there is expert knowledge that could be used to validate the extrapolation in this case. For example, with linear regression one could combine the fitted model with prior knowledge and argue that its parameters are safe to use for extrapolation in that particular setting.
My case: after training a Random Forest model on a large data set, I applied it to a new data set in which 20% of the data points fall outside the calibration zone. Since RF is non-parametric, I compared its behaviour on the two data sets using partial dependence plots. These clearly show the threshold-based behaviour of decision trees: any value outside the calibration zone is grouped together with the extremes of the calibration zone.
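For reference, here is a minimal, self-contained sketch of the kind of comparison I did. The data are synthetic placeholders (one feature, training range [0, 10] standing in for my calibration zone), not my actual data:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.default_rng(0)

# Synthetic stand-in: train inside [0, 10] (the "calibration zone"),
# then score a new set whose values extend beyond it.
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 1, 500)
X_new = rng.uniform(0, 14, size=(500, 1))  # partly outside the zone

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Overlay the partial dependence of the same feature on both data sets;
# the curve on X_new flattens past x = 10 because the trees cannot
# split beyond the training range.
disp = PartialDependenceDisplay.from_estimator(
    rf, X_train, features=[0], line_kw={"label": "calibration data"})
PartialDependenceDisplay.from_estimator(
    rf, X_new, features=[0], ax=disp.axes_, line_kw={"label": "new data"})
disp.axes_[0, 0].legend()
plt.show()
```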
Yet, the plots do suggest a general trend that could be "safely" extrapolated by a multiple linear regression model, as long as we consider the fitted parameters plausible/realistic enough.
My initial idea is therefore to use some form of stacking to combine the RF with a linear or polynomial model, as sketched below. Are there any fundamental flaws in this rationale, or could it actually work?
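A sketch of what I have in mind, using scikit-learn's `StackingRegressor` and reusing the synthetic setup above (again, names and hyperparameters are placeholders):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, StackingRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))          # calibration zone
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 1, 500)

# passthrough=True is the key choice here: the linear meta-learner sees
# the raw features as well as the RF predictions, so it can extend the
# trend linearly where the RF output saturates at the zone's edge.
stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(n_estimators=200, random_state=0))],
    final_estimator=LinearRegression(),
    passthrough=True,
)
stack.fit(X_train, y_train)

X_out = np.array([[12.0], [14.0]])                   # outside the zone
print(stack.predict(X_out))                          # no longer clipped at the edge
```

Without `passthrough=True`, the meta-learner would only see the RF's (bounded) predictions and would inherit the same flat extrapolation, which is why I suspect that option matters for this idea.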
A second question: are there alternatives for this issue, such as neural networks or other models that can extrapolate up to a point, so that I could validate their behaviour against the expert knowledge and decide whether it is realistic?
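For instance, a minimal sketch of the sanity check I have in mind: a small ReLU network extrapolates piecewise-linearly rather than flattening, so its out-of-zone predictions can at least be eyeballed against the expected physical trend (synthetic data again, architecture chosen arbitrarily):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(500, 1))
y_train = 2.0 * X_train[:, 0] + rng.normal(0, 1, 500)

rf = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)
mlp = MLPRegressor(hidden_layer_sizes=(32, 32), max_iter=5000,
                   random_state=0).fit(X_train, y_train)

# Evaluate both models on a grid extending past the calibration zone:
# the RF flattens at x = 10, while the ReLU network continues linearly.
grid = np.linspace(0, 14, 200).reshape(-1, 1)
plt.plot(grid, rf.predict(grid), label="Random Forest")
plt.plot(grid, mlp.predict(grid), label="MLP (ReLU)")
plt.axvline(10, linestyle="--", color="grey")  # edge of the calibration zone
plt.legend()
plt.show()
```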