I am trying to build a machine learning model using Microsoft ML.NET to predict road surfacing life. I have a set of observed road surfacing lives with associated data such as traffic counts, number of bus lanes, material type, etc. In all, I have about four categorical variables, which I am encoding with one-hot encoding, and two numerical variables, which I have tried in and out of the model, both with and without normalization (using MeanVariance and LogMeanVariance normalization).
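For context, my pipeline looks roughly like the sketch below. The column names, file path and schema class are placeholders rather than my real data, but the structure is the same:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;

var mlContext = new MLContext(seed: 1);

// Placeholder path -- my real data comes from elsewhere.
var data = mlContext.Data.LoadFromTextFile<SurfacingRecord>(
    "surfacing.csv", hasHeader: true, separatorChar: ',');

// Categorical columns are one-hot encoded; numerical columns are normalized
// (I have tried MeanVariance, LogMeanVariance and no normalization at all).
var featurePipeline = mlContext.Transforms.Categorical.OneHotEncoding(new[]
    {
        new InputOutputColumnPair("MaterialTypeOneHot", "MaterialType"),
        new InputOutputColumnPair("RoadClassOneHot", "RoadClass")
    })
    .Append(mlContext.Transforms.NormalizeMeanVariance("TrafficCountNorm", "TrafficCount"))
    .Append(mlContext.Transforms.NormalizeLogMeanVariance("BusLaneCountNorm", "BusLaneCount"))
    .Append(mlContext.Transforms.Concatenate("Features",
        "MaterialTypeOneHot", "RoadClassOneHot", "TrafficCountNorm", "BusLaneCountNorm"));

// Placeholder schema (two of the four categorical and both numerical variables shown).
public class SurfacingRecord
{
    [LoadColumn(0)] public string MaterialType { get; set; }
    [LoadColumn(1)] public string RoadClass { get; set; }
    [LoadColumn(2)] public float TrafficCount { get; set; }
    [LoadColumn(3)] public float BusLaneCount { get; set; }
    [LoadColumn(4), ColumnName("Label")] public float SurfacingLifeYears { get; set; }
}
```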
There is a lot of noise in the data, and I know that the value I am trying to predict may be influenced by many factors that are not in my model (unknown information such as construction quality, the reason the surfacing was replaced, etc.). So I am expecting a rather low R2 and high MAE.
After trying many different combinations of predictor variables in and out of the model, I selected the best-performing model (ML.NET's FastForest regression trainer) based on 10-fold cross validation. For this model, I get an average R2 across the 10 folds of about 0.38.
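The model selection step was essentially this, continuing from the pipeline sketch above (FastForest requires the Microsoft.ML.FastTree package):

```csharp
// Append the FastForest regression trainer and run 10-fold cross validation
// on the full data set. (Average below needs System.Linq.)
var trainingPipeline = featurePipeline.Append(
    mlContext.Regression.Trainers.FastForest(
        labelColumnName: "Label", featureColumnName: "Features"));

var cvResults = mlContext.Regression.CrossValidate(
    data, trainingPipeline, numberOfFolds: 10, labelColumnName: "Label");

// Average metrics across the 10 folds -- this is where I get R2 of about 0.38.
var avgR2  = cvResults.Average(r => r.Metrics.RSquared);
var avgMae = cvResults.Average(r => r.Metrics.MeanAbsoluteError);
Console.WriteLine($"Mean R2 = {avgR2:F2}, mean MAE = {avgMae:F2}");
```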
However, when I then train that model on a 90% training set and predict on the remaining 10%, I get the result shown below (the black line is equality, the red dotted line is a linear fit between predicted and observed):
As you can see, the model consistently over-predicts when the observed surfacing life is low (say, less than 12 years) and under-predicts when it is higher (say, above 12 years).
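For reference, the 90/10 evaluation that produced the plot was done roughly like this (again continuing from the code above; ScoredRecord is just a helper for pulling out the plotted values):

```csharp
// 90/10 split: train on 90% of the data, predict and evaluate on the held-out 10%.
var split = mlContext.Data.TrainTestSplit(data, testFraction: 0.1, seed: 1);
var model = trainingPipeline.Fit(split.TrainSet);
var scoredTest = model.Transform(split.TestSet);

var metrics = mlContext.Regression.Evaluate(scoredTest, labelColumnName: "Label");
Console.WriteLine($"Hold-out R2 = {metrics.RSquared:F2}, MAE = {metrics.MeanAbsoluteError:F2}");

// (Observed, predicted) pairs used for the scatter plot.
var pairs = mlContext.Data
    .CreateEnumerable<ScoredRecord>(scoredTest, reuseRowObject: false)
    .Select(p => (Observed: p.Label, Predicted: p.Score))
    .ToList();

// Helper class: "Label" is the observed surfacing life, "Score" is the prediction.
public class ScoredRecord
{
    public float Label { get; set; }
    public float Score { get; set; }
}
```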
My question is twofold:
- Why does the ML model not automatically detect and correct for this bias?
- Is there a way to adjust or calibrate an ML model to address bias such as this (a rough sketch of the kind of correction I have in mind is below)? Would that be a form of circular logic, or is it something that is sometimes done when building predictive models?
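To make the second question concrete: the kind of adjustment I have in mind would be fitting a simple linear correction between predicted and observed values on a held-out calibration set, and then applying that correction to future predictions. This is purely a hypothetical sketch, not something I have implemented:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical post-hoc calibration: fit observed ≈ a * predicted + b by
// ordinary least squares on calibration pairs the model was not trained on,
// then report a * rawPrediction + b as the corrected prediction.
static (double Slope, double Intercept) FitLinearCorrection(
    IReadOnlyList<(float Predicted, float Observed)> calibrationPairs)
{
    double n     = calibrationPairs.Count;
    double sumX  = calibrationPairs.Sum(p => (double)p.Predicted);
    double sumY  = calibrationPairs.Sum(p => (double)p.Observed);
    double sumXY = calibrationPairs.Sum(p => (double)p.Predicted * p.Observed);
    double sumXX = calibrationPairs.Sum(p => (double)p.Predicted * p.Predicted);

    double slope     = (n * sumXY - sumX * sumY) / (n * sumXX - sumX * sumX);
    double intercept = (sumY - slope * sumX) / n;
    return (slope, intercept);
}

// Corrected prediction for a new observation:
//   var (a, b) = FitLinearCorrection(calibrationPairs);
//   double corrected = a * rawPrediction + b;
```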
Any advice or suggestions appreciated!