
I am evaluating two machine learning models. The output is count data ranging from 0 to 30, with most of the output values being small. Large output values are rare.


One model has lower MAE and RMSLE and the other model has lower RMSE. I am not sure which model is performing better. It is worth noting that Model 2 was fit after log-transforming the output variable.

jkjsdf fod
  • Which car is better when they aren't equally fast, cheap or stylish? It's not even axiomatic that minimising a measure of global lack of fit is the way to choose a model. Other criteria might include simplicity; ease of interpretation; matching patterns qualitatively; and on and on and on. – Nick Cox Aug 20 '19 at 18:17
  • Thank you for your reply. I have actually transformed the logged predictions back so that the results for these two models should be on the same scale. I think they are comparable? Just want to know if results are better by taking log transformation. – jkjsdf fod Aug 20 '19 at 18:24
  • I wouldn't want to choose a model on this information alone. – Nick Cox Aug 20 '19 at 18:27
  • Did you [adjust for bias when back-transforming predictions on a log scale](https://otexts.com/fpp2/transformations.html)? – Stephan Kolassa Aug 20 '19 at 22:45
  • @StephanKolassa No, I didn't. I saw many people just use exp(pred) to back transform the predictions. Does that induce bias? – jkjsdf fod Aug 21 '19 at 00:45
  • Yes, it does. Look at the "bias adjustments" section at the bottom of the page I linked. You need to include a correction term for the predicted variance on the log scale. It's similar to the [lognormal distribution](https://en.wikipedia.org/wiki/Log-normal_distribution): if the mean on the log scale is $\mu$, then the mean on the original scale is *not* $\exp(\mu)$, but $\exp(\mu+\frac{\sigma^2}{2})$. ... – Stephan Kolassa Aug 21 '19 at 06:21
  • ... That said, if your future distribution on the log scale is symmetric and you have an unbiased prediction on the log scale, then this is also an unbiased prediction for the *median* on the log scale. Since $\exp$ is monotone, exponentiating will get you an unbiased prediction for the median on the original scale. Which is what minimizes the MAE. So you essentially need to figure out which functional of the unknown future distribution you want to elicit. ... – Stephan Kolassa Aug 21 '19 at 06:23
  • ... See the proposed duplicate, or relatedly [What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?](https://stats.stackexchange.com/q/299712/1352) – Stephan Kolassa Aug 21 '19 at 06:24
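The lognormal bias correction mentioned in the comments can be checked numerically. This is a minimal sketch (not from the original thread) using simulated lognormal data, assuming an unbiased prediction $\mu$ and residual variance $\sigma^2$ on the log scale:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: the target is lognormal, so log(y) ~ Normal(mu, sigma).
mu, sigma = 1.0, 0.8
y = rng.lognormal(mean=mu, sigma=sigma, size=100_000)

log_pred = mu        # unbiased prediction on the log scale
sigma2 = sigma**2    # residual variance on the log scale

naive = np.exp(log_pred)                   # back-transform without correction
corrected = np.exp(log_pred + sigma2 / 2)  # bias-adjusted back-transform

# The naive back-transform underestimates the mean on the original scale;
# it instead targets the median (exp is monotone, so quantiles carry over).
print(naive, corrected, y.mean())
```

Here the naive `exp(mu)` lands near the sample median, while the corrected value matches the sample mean, which is the point of the adjustment.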

1 Answer


Comparatively, RMSE penalizes large gaps more harshly than MAE, and RMSLE penalizes large gaps among small output-values more harshly than large gaps among large output-values (in fact, penalizes according to the ratio rather than the difference).

So, it seems that on average, Model 2 makes more (or bigger) large-scale errors and fewer (or smaller) small-scale errors than Model 1, and tends to make its large-scale errors in the large-output range. (And this was probably to be expected, since you fit that model on the log-transformed outputs.)

Which one is better is then up to your use case. Given only your description (count data, primarily small values), I would personally prefer Model 2, but YMMV. Perhaps ask yourself: is a prediction of 3 for a true value of 2 worse than, better than, or roughly the same as a prediction of 18 for a true value of 12? I'd also support additional investigation, as suggested in @NickCox's comments.
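That 3-for-2 versus 18-for-12 question can be put in numbers (an illustrative check, not part of the original answer): both predictions are off by the same 1.5x ratio, so RMSLE scores them almost identically, while RMSE scores the large-value miss six times worse.

```python
import numpy as np

def rmse(y, yhat):
    return np.sqrt(np.mean((yhat - y) ** 2))

def rmsle(y, yhat):
    # log1p keeps zero counts valid; errors are compared on the log scale
    return np.sqrt(np.mean((np.log1p(yhat) - np.log1p(y)) ** 2))

# predicting 3 for a true 2, vs. predicting 18 for a true 12 (same 1.5x ratio)
print(rmsle(np.array([2.0]), np.array([3.0])),
      rmsle(np.array([12.0]), np.array([18.0])))  # roughly comparable penalties

print(rmse(np.array([2.0]), np.array([3.0])),
      rmse(np.array([12.0]), np.array([18.0])))   # 1.0 vs 6.0
```

(The RMSLE penalties are close but not identical because of the `+1` inside `log1p`, which matters most for small counts.)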

Ben Reiniger
  • Thank you for your reply. Would you recommend log-transforming count data for non-parametric machine learning models if the data is extremely right-skewed? – jkjsdf fod Aug 20 '19 at 18:41