
Can a decrease in mean absolute percentage error (MAPE) be correlated with an increase in the standard deviation (SD) of the errors? Is that counter-intuitive? What about MAPE vs. mean absolute error (MAE)? I'm wondering whether these metrics should move together, or not necessarily.

Update

The general concern is as follows: when performing cross-validation, establishing a scoring criterion that combines the average (e.g., MAE, MSE) and the spread (e.g., SD) of the errors doesn't necessarily minimize the MAPE. In other words, one shouldn't expect to reduce the MAPE by minimizing those other metrics. Does this sound reasonable?

Bruno

1 Answer


Note that the MAPE goes down as the actuals go up - and the standard deviation of the errors doesn't. So for a given time series of errors (with potentially increasing SD), we could simply have a time series of actuals with a positive trend, and once the positive trend is strong enough, the MAPE will start going down.

# Simulate errors with increasing spread, and actuals with a positive trend
set.seed(10)
nn <- 100
error <- rnorm(nn, 0, seq(10, 15, length.out=nn))  # error SD grows from 10 to 15
actuals <- seq(20, 50, length.out=nn)              # actuals trend upward

# Cumulative MAPE, and cumulative (expanding-window) SD of the errors
cumulative.mape <- cumsum(abs(error)/actuals)/(1:nn)
cumulative.sd <- sapply(1:nn, function(xx) sd(error[1:xx]))

# Plot side by side, dropping the first few noisy observations
opar <- par(mfrow=c(1,2))
    plot(cumulative.mape[-(1:10)],type="l",main="Cumulative MAPE",ylab="",xlab="")
    plot(cumulative.sd[-(1:10)],type="l",main="Cumulative Error SD",ylab="",xlab="")
par(opar)

[Figure: the cumulative MAPE falls over time while the cumulative SD of the errors rises]

So the issue is that the MAPE depends on both the errors and the actuals, whereas the SD of the errors doesn't depend on the actuals at all (beyond the actuals influencing the errors themselves, of course). This divergence should therefore typically not happen between the SD and the MAE, since the MAE again depends only on the errors, not on the actuals.
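We can check this by continuing the simulation above (a quick sketch, reusing error and nn from the snippet): the cumulative MAE rises along with the growing error spread, tracking the SD rather than the MAPE.

# The cumulative MAE depends only on the errors, not the actuals,
# so it rises with the growing error spread, just like the SD
cumulative.mae <- cumsum(abs(error))/(1:nn)
plot(cumulative.mae[-(1:10)], type="l", main="Cumulative MAE", ylab="", xlab="")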

EDIT: In general, different error measures move somewhat in tandem - but not perfectly so. Minimizing different error types is the same as optimizing different loss functions - and the minimizer for one loss function is typically not the minimizer of a different loss function.

For an extreme example, minimizing the MAE will pull your forecasts toward the median of the future distribution, while minimizing the MSE will pull them toward its expectation. If the future distribution is asymmetric, these two differ, so minimizing the MAE will yield biased predictions. I just discussed this yesterday.
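A quick simulation illustrates this; just a sketch, with the lognormal as an arbitrary asymmetric example:

# For an asymmetric future distribution, scan candidate point
# forecasts and see which one minimizes the MAE vs. the MSE
set.seed(1)
future <- rlnorm(1e5, meanlog=0, sdlog=1)  # median 1, mean exp(0.5), about 1.65
candidates <- seq(0.5, 3, by=0.01)

mae <- sapply(candidates, function(ff) mean(abs(future-ff)))
mse <- sapply(candidates, function(ff) mean((future-ff)^2))

candidates[which.min(mae)]  # close to the median, 1
candidates[which.min(mse)]  # close to the mean, about 1.65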

So: no, minimizing one error measure will not necessarily minimize a different one.

I regularly read the International Journal of Forecasting, and accepted best practice there is to report multiple error measures - and yes, sometimes they imply that different methods are "best", which authors and readers take in stride. I'd say that point forecasts are not overly helpful anyway, and that you should always aim at full predictive densities.

(Incidentally, I can't recall ever having seen the SD of the errors reported in the IJF, and I don't really see the point of it as an error measure. An error time series can be badly biased and constant over time, with a zero SD - what's good about that?)
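To make that concrete, a minimal sketch:

# A constant, badly biased error series: the SD is zero, yet
# every forecast is off by 5 units
biased.error <- rep(5, 100)
sd(biased.error)         # 0 - "perfect" by the SD criterion
mean(abs(biased.error))  # 5 - the MAE still exposes the bias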

EDIT 2: I no longer believe assessing point forecasts using different error measures is useful. To the contrary, I believe it's actively misleading. My argument can be found in Kolassa (2020), "Why the 'best' point forecast depends on the error or accuracy measure", International Journal of Forecasting.

Stephan Kolassa
  • Thanks for that. So it seems that one can't generalize trends of MAPE to those of SD. So it wouldn't make sense to minimize one and expect the other to also decrease necessarily, right? I updated the question with my general concern. – Bruno May 30 '17 at 16:34
  • You don't actually need a trend, though; consider the MAPE of a series with $\mu = \sigma = 1$ vs the MAPE of a series with $\mu = 100, \sigma = 2$, although this doesn't affect your fundamental point. – jbowman May 30 '17 at 16:46
  • I edited my answer in reply to your edit. Hope it helps! – Stephan Kolassa May 30 '17 at 17:03
  • Thanks a lot, Stephan! The motivation for minimizing SD in addition to MSE or MAE (it seems MSE is preferable, right?) is to not only attempt to make the error distribution zero-centered, but also "thin" (i.e., small spread). Any comments on that? – Bruno May 30 '17 at 17:54
  • A "thin" error distribution is good. But it seems to me like the MSE already captures this, doesn't it? – Stephan Kolassa May 30 '17 at 19:46
  • Yes, I agree. The MSE is the second moment (centered at the origin) of the errors. So it does account for both bias and variance. Going back to my question: I'm not sure if looking at the MAPE is informative or not when comparing the accuracy of different models... – Bruno May 31 '17 at 00:03
  • The MAPE is informative, it's just that the information it provides can be misleading. Percentages are reassuringly comprehensible, and if you have a MAPE of zero, then your forecasts are perfect. The problem is that the MAPE is asymmetric, which can bias your forecasts if the actuals have a high coefficient of variation ([see here](https://www.researchgate.net/publication/224009542_Percentage_Errors_Can_Ruin_Your_Day_and_Rolling_the_Dice_Shows_How)). So if you use MAPE, always also look at bias. – Stephan Kolassa May 31 '17 at 06:33
  • From your link, I have the impression that MASE or wMAPE seem to be preferable... can they be applied to non time-series data? What if I don't have time-stamped data? What happens to $Y_t - Y_{t - 1}$? – Bruno May 31 '17 at 11:58
  • Yes, they can. wMAPE is the sum of the absolute errors over the sum of the actuals ([see here](https://ideas.repec.org/a/for/ijafaa/y2007i6p40-43.html)), which works just as well for non-time series. MASE divides the MAD by the MAD achieved in-sample by a benchmark method - typically we use the random walk, where this error is $Y_t-Y_{t-1}$, but any other reasonable benchmark works as well. [See here](https://stats.stackexchange.com/a/108963/1352), or other threads tagged "mase"; a sketch of both computations follows below. – Stephan Kolassa May 31 '17 at 12:02
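Here is that sketch; the toy numbers are made up, purely for illustration, and the MASE benchmark is the naive random-walk forecast mentioned in the comment:

# Toy actuals and point forecasts (made-up numbers)
actuals.toy <- c(10, 12, 9, 14, 11)
forecasts.toy <- c(11, 10, 10, 13, 12)

# wMAPE: sum of absolute errors over the sum of the actuals -
# no time ordering required, so it also works for non-time series
wmape <- sum(abs(actuals.toy-forecasts.toy))/sum(actuals.toy)

# MASE: MAD of the forecast errors, divided by the in-sample MAD
# of the benchmark - here the random walk (naive) forecast
mad.forecast <- mean(abs(actuals.toy-forecasts.toy))
mad.benchmark <- mean(abs(diff(actuals.toy)))  # |Y_t - Y_{t-1}|
mase <- mad.forecast/mad.benchmark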