Short Version
I have to compare two vectors of predictions (from different methods) against one vector of measurements to find out which prediction performs better. Note that this is not a statistical model but a (somewhat) physical one, so AIC/BIC and the like do not apply. In my field, people use error measures similar to Pearson's correlation coefficient, the mean square error, etc. for this. They are usually a bit more sophisticated, like the normalized mean square error
$\mathrm{NMSE} = \frac{1}{n \overline{s}\overline{o}} \sum_{i=1}^n(o_i - s_i)^2$,
where $o_i$ is the value of one observation, $s_i$ is the corresponding value of the simulation for $n$ pairs, $\overline{s}$ is the arithmetic mean of all simulation results and $\overline{o}$ the same for the observations.
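In code, this measure is simply the following (a minimal NumPy sketch; `nmse` is just an illustrative helper name, not from any particular library):

```python
import numpy as np

def nmse(obs, sim):
    """Normalized mean square error as defined above:
    mean squared difference divided by the product of the means."""
    obs = np.asarray(obs, dtype=float)
    sim = np.asarray(sim, dtype=float)
    return np.mean((obs - sim) ** 2) / (obs.mean() * sim.mean())
```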
The potential problem I see with this is that it assumes the averages $\overline{o}$ and $\overline{s}$ are meaningful. If you compared methods of predicting women's body heights against their actual heights, where you could roughly assume a normal distribution around the mean height, this would make sense to me. In my case, however, the values are sparse point samples from a 3D field where many values are low or zero, so in my view these averages are completely meaningless.
Does this invalidate the use of such "comparison to mean" measures?
Longer Background
I have a model that simulates the dispersion of atmospheric pollutants. To evaluate this type of model, people have undertaken field measurements in which a known quantity of gas is released at one point and the resulting concentrations are measured at several (10-80) positions downwind. The model uses meteorological conditions, the gas release rate and information about the surface as input and simulates the gas concentration at the measurement positions. It is then possible to compare the simulated concentrations with the measured ones, but this is exactly where I see a problem. I basically have a few samples of a complex 3D gas concentration field and now need to compare two simulations of this field to the measurements. Due to budget and physical constraints, the measurements can only be taken at a few (10-80) positions, which can be distributed over hundreds of meters to several kilometers downwind and to the sides.
All in all, I have 5 measurement campaigns (from different cities, widely different in scale and in the gas used), totaling 80 experiments ("snapshots" of concentration fields) and about 3700 data points (about half of which are near zero). Because of the different measurement techniques and gases used, the units are not the same between campaigns. I can normalize the values using the release rate and thus make them dimensionless, but they still span about 9 orders of magnitude, because of how different the setups are and how big the difference is between measuring close to the source and 6 kilometers downstream.
I noticed the problem because the NMSE explodes when I simply calculate it over all data points. For individual experiments it is on the order of 1, but for all experiments together it is about 400. This makes sense, because the squared deviations are dominated by the largest values while the normalizing means lose their meaning when data sets in $[10^{-4},10^{-1}]$ are thrown together with data sets in $[10^{1},10^{4}]$ instead of being evaluated individually. (That's why nobody does this.)
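To illustrate the pooling effect with made-up numbers (a toy Python/NumPy sketch, not my actual data; the scales and error factors are invented):

```python
import numpy as np

rng = np.random.default_rng(0)

def nmse(obs, sim):
    # NMSE as defined above: mean squared difference over the product of the means
    return np.mean((obs - sim) ** 2) / (obs.mean() * sim.mean())

# Ten invented "experiments", each on a different concentration scale,
# each simulated to within roughly a factor of two of the observations.
all_obs, all_sim = [], []
for k in range(10):
    scale = 10.0 ** k                               # scales span 9 orders of magnitude
    obs = scale * rng.uniform(1.0, 10.0, size=40)
    sim = obs * rng.uniform(0.5, 2.0, size=40)      # simulation off by at most a factor of 2
    print(f"experiment {k}: NMSE = {nmse(obs, sim):.2f}")   # each well below 1
    all_obs.append(obs)
    all_sim.append(sim)

pooled = nmse(np.concatenate(all_obs), np.concatenate(all_sim))
print(f"pooled NMSE = {pooled:.2f}")   # several times larger than any individual value
```

Each simulated experiment is equally "good" relative to its own observations, yet the pooled NMSE comes out much worse, purely because of the mix of scales.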
However, isn't using things like the NMSE on only one such experiment (which is done) doing essentially the same thing, just on a smaller scale?
[Scatter plot of the normalized data]
PS: The tag is probably not right, but I could not find better ones.