Suppose you have a model, evaluated using mean absolute percentage error (MAPE), that has made predictions on 1000 different examples. Each example has an associated MAPE that reflects how good the model's predictions were on that example. Is there a default standard for combining these MAPE values into a single number that reflects how well the model performs overall?
The options that I am aware of seem to be:
- Arithmetic mean - seems like the default choice here, but I cannot find any literature supporting this choice (note that the arithmetic mean is sensitive to outlier MAPEs)
- Geometric mean - unreasonable choice as MAPEs are not multiplicative processes
- Harmonic mean - unreasonable choice as MAPEs are not reciprocal processes
- Median - seems like a reasonable choice if the MAPE values contain outliers that ruin the overall average MAPE (note that median is robust to outlier MAPEs)
- Mode - unreasonable choice as MAPE values are usually all different if the input examples are different
- Mid-range - unreasonable choice as it lacks efficiency as an estimator for most distributions of interest, ignores all intermediate points, and lacks robustness, since outliers change it significantly
As I am now faced with a set of MAPE values that contains some extremely large ones, it seems to me that best practice should be to report both the arithmetic mean MAPE and the median MAPE. Is this common practice, and if not, why not?
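To illustrate what I mean, here is a minimal sketch with made-up numbers (the `mapes` list and the `summarize_mapes` helper are hypothetical and only there for illustration, not my real data). A single extreme value pulls the arithmetic mean far away from the typical example, while the median barely notices it:

```python
import statistics

def summarize_mapes(mapes):
    """Return the arithmetic mean and median of per-example MAPE values."""
    return statistics.mean(mapes), statistics.median(mapes)

# Hypothetical per-example MAPE values in percent; the last one is an extreme outlier.
mapes = [4.0, 6.0, 5.0, 7.0, 3.0, 5.0, 6.0, 4.0, 5.0, 955.0]

mean_mape, median_mape = summarize_mapes(mapes)
print(f"mean MAPE:   {mean_mape:.1f}%")    # 100.0%
print(f"median MAPE: {median_mape:.1f}%")  # 5.0%
```

Reporting the two side by side at least makes it visible that the mean is driven by a handful of examples.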
Also, I have read the central tendency wiki page. Are there any other reasonable options listed there that I have overlooked?
A somewhat related but different question can be found here.