Short Version
I have to compare two vectors of predictions (from different methods) against one vector of measurements to find out which prediction performs better. Note that this is not a statistical model but a (somewhat) physical one, so AIC/BIC and the like do not apply. In my field, people use error measures similar to Pearson's correlation coefficient, the mean square error, etc. for this. They are usually a bit more sophisticated, like the normalized mean square error
$\mathrm{NMSE} = \frac{1}{n \overline{s}\overline{o}} \sum_{i=1}^n(o_i - s_i)^2$,
where $o_i$ is the value of one observation, $s_i$ is the corresponding value of the simulation for $n$ pairs, $\overline{s}$ is the arithmetic mean of all simulation results and $\overline{o}$ the same for the observations.
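In code, this measure is simply the following (a minimal NumPy sketch; `nmse` is just an illustrative helper name, not from any particular library):

```python
import numpy as np

def nmse(obs, sim):
    """Normalized mean square error as defined above:
    mean squared difference divided by the product of the means."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return np.mean((obs - sim) ** 2) / (obs.mean() * sim.mean())
```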
The potential problem I see with this is that it assumes the averages $\overline{o}$ and $\overline{s}$ are meaningful. If you compared methods of predicting women's body heights against their actual heights, where you could roughly assume a normal distribution around the mean height, this would make sense to me. In my case, however, the values are sparse point samples from a 3D field where many values are low or zero, so in my view these averages are completely meaningless.
Does this invalidate the use of such "comparison to mean" measures?
Longer Background
I have a model that simulates the dispersion of atmospheric pollutants. To evaluate this type of model, people have undertaken field measurements in which a known quantity of gas is released at one point and the resulting concentrations are measured at several (10-80) positions downwind. The model uses meteorological conditions, the gas release rate and information about the surface as input and simulates the gas concentration at the measurement positions. It is then possible to compare the simulated concentrations with the measured ones, but this is exactly where I see a problem. I basically have a few samples of a complex 3D gas concentration field and now need to compare two simulations of this field to the measurements. Due to budget and physical constraints, the measurements can only be taken at a few (10-80) positions, which can be distributed over hundreds of meters to several kilometers downwind and to the sides.
All in all, I have 5 measurement campaigns (from different cities, widely different in scale and in the gas used), totaling 80 experiments ("snapshots" of concentration fields) and about 3700 data points (about half of which are near zero). Because of the different measurement techniques and gases used, the units are not the same between campaigns. I can normalize the values using the release rate and thus make them dimensionless, but they still span about 9 orders of magnitude, because of how different the setups are and how big the difference is between measuring close to the source and 6 kilometers downstream.
I noticed the problem because the NMSE explodes when I simply calculate it over all data points. For individual experiments it is on the order of 1, but for all experiments together it is about 400. This makes sense, because the squared deviations are dominated by the largest values while the normalizing means lose their meaning when data sets in $[10^{-4},10^{-1}]$ are thrown together with data sets in $[10^{1},10^{4}]$ instead of being evaluated individually. (That's why nobody does this.)
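To illustrate the pooling effect with made-up numbers (a toy Python/NumPy sketch, not my actual data; the scales and error factors are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def nmse(obs, sim):
    # NMSE as defined above: mean squared difference over the product of the means
    return np.mean((obs - sim) ** 2) / (obs.mean() * sim.mean())

# Ten invented "experiments", each on a different concentration scale,
# each simulated to within roughly a factor of two of the observations.
all_obs, all_sim = [], []
for k in range(10):
    scale = 10.0 ** k                               # scales span 9 orders of magnitude
    obs = scale * rng.uniform(1.0, 10.0, size=40)
    sim = obs * rng.uniform(0.5, 2.0, size=40)      # simulation off by at most a factor of 2
    print(f"experiment {k}: NMSE = {nmse(obs, sim):.2f}")   # each well below 1
    all_obs.append(obs)
    all_sim.append(sim)

pooled = nmse(np.concatenate(all_obs), np.concatenate(all_sim))
print(f"pooled NMSE = {pooled:.2f}")   # several times larger than any individual value
```

Each simulated experiment is equally "good" relative to its own observations, yet the pooled NMSE comes out much worse, purely because of the mix of scales.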
However, isn't using things like the NMSE on only one such experiment (which is done) doing essentially the same thing, just on a smaller scale?
[Scatter plot of the normalized data]
PS: The tag is probably not right, but I could not find better ones.