I am trying to build a machine learning model using Microsoft ML.NET to predict road surfacing life. I have a set of observed road surfacing lives with associated data such as traffic counts, number of bus lanes, material type, etc. In all, I have about four categorical variables, which I am encoding with one-hot encoding, and two numerical variables, which I have tried in and out of the model, both with and without normalization (using MeanVariance and LogMeanVariance normalization).
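For context, my pipeline looks roughly like the sketch below. The column names, file path and schema class are placeholders rather than my real data, but the structure is the same:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 1);

// Placeholder path -- my real data comes from elsewhere.
var data = mlContext.Data.LoadFromTextFile<SurfacingRecord>(
    "surfacing.csv", hasHeader: true, separatorChar: ',');

// Categorical columns are one-hot encoded; numerical columns are normalized
// (I have tried MeanVariance, LogMeanVariance and no normalization at all).
var featurePipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[]
    {
        new InputOutputColumnPair("MaterialTypeOneHot", "MaterialType"),
        new InputOutputColumnPair("RoadClassOneHot", "RoadClass")
    })
    .Append(mlContext.Transforms.NormalizeMeanVariance("TrafficCountNorm", "TrafficCount"))
    .Append(mlContext.Transforms.NormalizeLogMeanVariance("BusLaneCountNorm", "BusLaneCount"))
    .Append(mlContext.Transforms.Concatenate("Features",
        "MaterialTypeOneHot", "RoadClassOneHot", "TrafficCountNorm", "BusLaneCountNorm"));

// Placeholder schema (two of the four categorical and both numerical variables shown).
public class SurfacingRecord
{
    [LoadColumn(0)] public string MaterialType { get; set; }
    [LoadColumn(1)] public string RoadClass { get; set; }
    [LoadColumn(2)] public float TrafficCount { get; set; }
    [LoadColumn(3)] public float BusLaneCount { get; set; }
    [LoadColumn(4), ColumnName("Label")] public float SurfacingLifeYears { get; set; }
}
```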
There is a lot of noise in the data, and I know that the value I am trying to predict may be influenced by many factors that are not in my model (unknown information such as construction quality, the reason the surfacing was replaced, etc.). So I am expecting a rather low R2 and high MAE.
After trying many different combinations of predictor variables in and out of the model, I selected the best-performing model (ML.NET's FastForest regression trainer) based on 10-fold cross validation. For this model, I get an average R2 across the 10 folds of about 0.38.
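The model selection step was essentially this, continuing from the pipeline sketch above (FastForest requires the Microsoft.ML.FastTree package):

```csharp
// Append the FastForest regression trainer and run 10-fold cross validation
// on the full data set. (Average below needs System.Linq.)
var trainingPipeline = featurePipeline.Append(
    mlContext.Regression.Trainers.FastForest(
        labelColumnName: "Label", featureColumnName: "Features"));

var cvResults = mlContext.Regression.CrossValidate(
    data, trainingPipeline, numberOfFolds: 10, labelColumnName: "Label");

// Average metrics across the 10 folds -- this is where I get R2 of about 0.38.
var avgR2  = cvResults.Average(r => r.Metrics.RSquared);
var avgMae = cvResults.Average(r => r.Metrics.MeanAbsoluteError);
Console.WriteLine($"Mean R2 = {avgR2:F2}, mean MAE = {avgMae:F2}");
```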
However, when I then train that model on a 90% training set and predict on the remaining 10%, I get the result shown below (the black line is equality, the red dotted line is a linear fit between predicted and observed):
As you can see, the model consistently over-predicts when the observed surfacing life is low (say, less than 12 years) and under-predicts when it is higher (say, above 12 years).
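For reference, the 90/10 evaluation that produced the plot was done roughly like this (again continuing from the code above; ScoredRecord is just a helper for pulling out the plotted values):

```csharp
// 90/10 split: train on 90% of the data, predict and evaluate on the held-out 10%.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.1, seed: 1);
var model = trainingPipeline.Fit(split.TrainSet);
var scoredTest = model.Transform(split.TestSet);

var metrics = mlContext.Regression.Evaluate(scoredTest, labelColumnName: "Label");
Console.WriteLine($"Hold-out R2 = {metrics.RSquared:F2}, MAE = {metrics.MeanAbsoluteError:F2}");

// (Observed, predicted) pairs used for the scatter plot.
var pairs = mlContext.Data
    .CreateEnumerable<ScoredRecord>(scoredTest, reuseRowObject: false)
    .Select(p => (Observed: p.Label, Predicted: p.Score))
    .ToList();

// Helper class: "Label" is the observed surfacing life, "Score" is the prediction.
public class ScoredRecord
{
    public float Label { get; set; }
    public float Score { get; set; }
}
```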
My question is twofold:
- Why does the ML model not automatically detect and correct for this bias?
- Is there a way to adjust or calibrate an ML model to address bias such as this (a rough sketch of the kind of correction I have in mind is below)? Would that be a form of circular logic, or is it something that is sometimes done when building predictive models?
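To make the second question concrete: the kind of adjustment I have in mind would be fitting a simple linear correction between predicted and observed values on a held-out calibration set, and then applying that correction to future predictions. This is purely a hypothetical sketch, not something I have implemented:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical post-hoc calibration: fit observed ≈ a * predicted + b by
// ordinary least squares on calibration pairs the model was not trained on,
// then report a * rawPrediction + b as the corrected prediction.
static (double Slope, double Intercept) FitLinearCorrection(
    IReadOnlyList<(float Predicted, float Observed)> calibrationPairs)
{
    double n     = calibrationPairs.Count;
    double sumX  = calibrationPairs.Sum(p => (double)p.Predicted);
    double sumY  = calibrationPairs.Sum(p => (double)p.Observed);
    double sumXY = calibrationPairs.Sum(p => (double)p.Predicted * p.Observed);
    double sumXX = calibrationPairs.Sum(p => (double)p.Predicted * p.Predicted);

    double slope     = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    double intercept = (sumY - slope * sumX) / n;
    return (slope, intercept);
}

// Corrected prediction for a new observation:
//   var (a, b) = FitLinearCorrection(calibrationPairs);
//   double corrected = a * rawPrediction + b;
```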
Any advice or suggestions appreciated!