I'm working on a regression model that predicts a time. The targets range from a few seconds up to 30 minutes and beyond.
I calculated the sMAPE over 1-minute bins of the target and noticed the following:
- Target 0-1 minutes: up to 200% sMAPE (unstable)
- Target 1-2 minutes: ~50% sMAPE
- Target 2-20 minutes: ~25% sMAPE
- Target >20 minutes: ~35% sMAPE
Most of my data falls in the 0-2 minute bins, and very little is above 20 minutes.
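For reference, this is roughly how I compute the per-bin sMAPE (a simplified sketch; `y_true`/`y_pred` are placeholder names for the targets and predictions in seconds):

```python
import numpy as np
import pandas as pd

def smape(y_true, y_pred):
    # Symmetric MAPE in percent; the epsilon guards against 0/0.
    denom = (np.abs(y_true) + np.abs(y_pred)) / 2
    return 100 * np.mean(np.abs(y_pred - y_true) / np.maximum(denom, 1e-9))

def smape_by_bin(y_true, y_pred):
    # Bin the *true* target into the ranges discussed above and report
    # the sMAPE inside each bin.
    minutes = np.asarray(y_true) / 60
    edges = [0, 1, 2, 20, np.inf]
    labels = ["0-1 min", "1-2 min", "2-20 min", ">20 min"]
    df = pd.DataFrame({
        "y_true": np.asarray(y_true),
        "y_pred": np.asarray(y_pred),
        "bin": pd.cut(minutes, bins=edges, labels=labels),
    })
    return df.groupby("bin", observed=True).apply(
        lambda g: smape(g["y_true"].to_numpy(), g["y_pred"].to_numpy()))

# Usage: smape_by_bin(y_test_seconds, model_predictions_seconds)
```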
I figured the sMAPE might be large for small targets because even tiny absolute errors translate into high percentages. For targets above 20 minutes, I assumed the errors were due to the lack of training data.
My next step was to collect more data in the 0-2 minute and/or >20 minute ranges and see what happened.
The extra data reduced the errors in those bins but substantially degraded them in the others. For example, depending on the type of oversampling, I could either get the 0-1 minute sMAPE down to 50% or widen the range of target values with ~25% sMAPE, but in every case the results for the remaining bins got worse.
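To give an idea of what I tried, here is a simplified sketch of one variant, expressed as per-row sample weights rather than literal row duplication (the data, the bin thresholds, and the boost factor are placeholders I made up for this example):

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X_train = rng.normal(size=(1000, 5))             # placeholder features
y_train = rng.exponential(scale=300, size=1000)  # placeholder times in seconds

def bin_weights(y_seconds, boost=3.0):
    # Up-weight the bins I wanted to improve (0-2 min and >20 min);
    # the boost factor here is arbitrary and was tuned by hand.
    minutes = np.asarray(y_seconds) / 60
    w = np.ones_like(minutes, dtype=float)
    w[(minutes < 2) | (minutes > 20)] = boost
    return w

model = lgb.LGBMRegressor(objective="regression_l1")
model.fit(X_train, y_train, sample_weight=bin_weights(y_train))
```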
I have some intuition, but I'm not very confident in it: I believe that when I add the new data, the model shifts its optimization toward that range of values at the expense of the other ranges.
I thought about creating three or four different models for the different ranges: first a general model that predicts which range a sample falls into (0-2 minutes / 2-20 minutes / >20 minutes), and then a range-specific model that makes the actual prediction. Alternatively, an ensemble of three models trained on three different subsets. I don't know if this makes sense, and I feel like it shouldn't be necessary.
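To make the two-stage idea concrete, this is roughly what I have in mind (just a sketch with made-up names, thresholds, and placeholder data, not something I've built or validated): a classifier routes each sample to one of three range-specific regressors.

```python
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
X = rng.normal(size=(3000, 5))             # placeholder features
y = rng.exponential(scale=300, size=3000)  # placeholder times in seconds

def to_range(y_seconds):
    # 0: 0-2 min, 1: 2-20 min, 2: >20 min
    return np.digitize(np.asarray(y_seconds) / 60, [2, 20])

ranges = to_range(y)

# Stage 1: a classifier that predicts the target range.
router = lgb.LGBMClassifier()
router.fit(X, ranges)

# Stage 2: one regressor per range, each trained only on its own slice.
experts = {}
for r in np.unique(ranges):
    reg = lgb.LGBMRegressor(objective="regression_l1")
    reg.fit(X[ranges == r], y[ranges == r])
    experts[r] = reg

def predict(X_new):
    # Route each sample, then let the corresponding expert predict it.
    routed = router.predict(X_new)
    out = np.empty(len(X_new))
    for r, reg in experts.items():
        mask = routed == r
        if mask.any():
            out[mask] = reg.predict(X_new[mask])
    return out

predictions = predict(X)
```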
For now, I've been working with LightGBM and the Mean Absolute Error (L1) objective.