Compare value with distribution

Question

I have developed a model that returns flight times between two airports, e.g., Paris to Rome, 115 minutes. I would like to compare this value with the distribution of values from real flights, e.g., (123, 110, 130, 120, 111, ...).

My model calculates a time for each distinct origin-destination airport pair, so I want to compare each of these values with the distribution for the same origin-destination pair.

I wonder whether it would help to compute, for each flight time in my model, its percentile within the corresponding distribution, as a measure of fit.

Any suggestions?

Does your model return a *single* flight time for each pair of airports? Or does it include, e.g., flight paths, weather etc. and therefore output different predictions for the same airport pair? — Stephan Kolassa, Jul 03 '19 at 11:09
My model only returns a single value for each pair of origin destination airports; it does not account for wind effect and the path between 2 airports is calculated using the coordinates of the airports in the Haversine formula. — Francisco Lemos, Jul 03 '19 at 17:04

Stephan Kolassa · Answer 1 · 2021-10-13T05:51:25.763

What you have is, on the one hand, a distribution of observed values, and, on the other hand, a one-number summary (ONS) that attempts to condense your knowledge of this distribution into a single number. And what you are looking for is a way to assess whether your ONS is a "good" one.

The first question you need to ask yourself is what a "good" ONS would be. This will depend on what you will do with your ONS, i.e., which decisions you will take based on it. For instance, you may want an unbiased expectation prediction. Or you may want an ONS such that half of the actuals are above it, and half below, i.e., the median of the distribution. If you want to plan some kind of flight schedule (i.e., capacity), then it makes sense to build some slack into it, and a "good" ONS would be something like a 90% quantile.
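As a minimal sketch of the candidate targets mentioned above, the snippet below computes the mean, the median, and a 90% quantile of a toy sample. The sample reuses only the five values quoted in the question (the real distribution is longer), and the names `observed`, `model_value`, and `targets` are illustrative assumptions, not part of the asker's model.

```python
import numpy as np

# Toy sample: the five observed flight times quoted in the question (minutes).
# The real distribution for a route would contain many more flights.
observed = np.array([123, 110, 130, 120, 111], dtype=float)

# Single value returned by the model for the same airport pair (from the question).
model_value = 115.0

# Three possible "one-number summaries" the model could be aiming at.
targets = {
    "mean": float(np.mean(observed)),             # unbiased expectation
    "median": float(np.median(observed)),         # half above, half below
    "q90": float(np.quantile(observed, 0.9)),     # slack for scheduling
}

# Share of observed flights at or below the model value, i.e. the
# empirical percentile of the model value within this sample.
share_at_or_below = float(np.mean(observed <= model_value))

print(targets)
print(share_at_or_below)
```

For this toy sample the mean is 118.8, the median is 120, and the 90% quantile is 127.2 (using NumPy's default linear interpolation), while 40% of the observed flights are at or below the model value of 115. Which of these the model "should" match depends on the decision it supports.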

Once you know what kind of ONS you are aiming at, you can choose an appropriate error measure. For instance, the (root) mean squared error between your single value and the observations will be minimized in expectation by an unbiased expectation prediction, so if that is the kind of ONS you want, you should use the RMSE. If you want the median of the distribution as your ONS, you should use the mean absolute error, which is minimized in expectation by the median. If you want a quantile, you should use an asymmetrically weighted linear loss (often called the pinball or quantile loss), where the weight parameter depends on which quantile you want (this is done in quantile regression; see the textbook Quantile Regression by Roger Koenker or any of his publications).
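The three error measures can be sketched as follows. This is an illustration under stated assumptions, not a definitive implementation: the function names `rmse`, `mae`, and `pinball_loss` are hypothetical helpers, and the sample again reuses only the five observed values quoted in the question.

```python
import numpy as np

def rmse(pred, obs):
    """Root mean squared error of a single prediction against observations."""
    obs = np.asarray(obs, dtype=float)
    return float(np.sqrt(np.mean((obs - pred) ** 2)))

def mae(pred, obs):
    """Mean absolute error of a single prediction against observations."""
    obs = np.asarray(obs, dtype=float)
    return float(np.mean(np.abs(obs - pred)))

def pinball_loss(pred, obs, q):
    """Asymmetric linear (quantile) loss for target quantile q in (0, 1).

    Observations above the prediction are weighted by q,
    observations below it by (1 - q).
    """
    obs = np.asarray(obs, dtype=float)
    diff = obs - pred
    return float(np.mean(np.maximum(q * diff, (q - 1) * diff)))

# Toy sample from the question (minutes) and the model's single value.
observed = [123, 110, 130, 120, 111]
model_value = 115

print("RMSE:", rmse(model_value, observed))
print("MAE:", mae(model_value, observed))
print("Pinball (q=0.9):", pinball_loss(model_value, observed, 0.9))

# On the sample itself, the mean gives the lowest RMSE and the median the lowest MAE.
print("RMSE at sample mean:", rmse(np.mean(observed), observed))
print("MAE at sample median:", mae(np.median(observed), observed))
```

On this toy sample the model value of 115 gives an RMSE of about 8.43, an MAE of 7.4, and a pinball loss of 5.22 at q = 0.9. Replacing 115 with the sample mean (118.8) lowers the RMSE, and replacing it with the sample median (120) lowers the MAE, which is the sense in which each error measure rewards a different kind of ONS. To compare across many airport pairs, one could average the chosen loss over all pairs.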

I illustrate some of these points in "What are the shortcomings of the Mean Absolute Percentage Error (MAPE)?" and in a commentary on the M4 forecasting competition that is forthcoming in the International Journal of Forecasting. Feel free to contact me for the manuscript if you believe it would be helpful.

Dear Stephan, many thanks for your insights; they are extremely helpful and I intend to follow your advice. As for my work, I would like to mention that I am modeling flights from gate to gate, which means including taxi-out, take-off, climb-out until 3000 ft, climb, cruise, descent until 3000 ft, descent + approach, and finally taxi-in. — Francisco Lemos, Jul 04 '19 at 07:00
