Similarity / dissimilarity of two large bimodal datasets

Question

I am interested in assessing the divergence, or the similarity or dissimilarity, of two datasets that come from two different lidar instruments. Each dataset has over 90,000 values of a continuous variable, let's say altitude. Each has a bimodal distribution with a long tail on the second mode, and the two are highly correlated at 0.8 (Spearman). I was looking into both the Kolmogorov-Smirnov test and the Kullback-Leibler divergence. I am very concerned that my distributions are bimodal and very large. I know that tests that give a p-value are very sensitive to the type of distribution and the number of measurements. For this reason I think the KS test is not appropriate, which leaves me with only the KL divergence. But I wonder how it is influenced by the number of data and the distribution, and how to interpret the result.

For example, I gather that if I have two KL results, with a minimum value of 0.5 in one case and 0.9 in the other, then in the first case the two datasets are less divergent than in the second. But how small should the KL value be before I can say that "one dataset is a good approximation of the other"? Granted, even that is debatable, because I can look at divergence itself, that is, how different one dataset is from the other in a statistical sense (this is one exercise), or look at the similarity / dissimilarity of data interpretation, that is, what the data "tell" about the variable measured. So is it actually enough to look only at correlations in this case? Maybe I should classify my data into a number of convenient classes and compare one classification to the other through kappa statistics? Maybe this will tell me at least whether the interpretation of one dataset is similar enough to the interpretation of the second dataset? Do you have any other ideas? References? I would appreciate any thoughts on the matter.

Edit

Thanks for your answers. I now realize I should have included a few more details. Both raw lidar datasets go through a standard processing phase that results in a regular xyz grid of predetermined resolution (in my case 5 m by 5 m). Both datasets have the same xy coordinates and only z varies. I know that one sensor usually has a vertical error of +/- 15 cm, and the other one is probably very similar (I can find out, actually). Both sensors were collecting data over the same area. We don't have any ground truth for altitude there. We want to compare these processed data (in other words, the resulting DEMs) and not the actual xyz lidar point clouds. I hope this is now a little clearer.

Thanks.

The final purpose is to decide if the 2 lidar sensors are comparable enough to not care which is used. — Monica Palaseanu-Lovejoy, Sep 09 '11 at 20:28
The new information would not cause me to change my reply, Monica. Although the grids may have the same xy coordinates, that does not mean the coordinates are determined with absolute accuracy. Having gridded DEMs makes comparing the two datasets straightforward; the interpolation has already been done. So feel free to ignore the first paragraph, but everything else still applies. Bear in mind that without any ground truth, there is no way you can use these data to decide which sensor is better; you can only determine whether they are close enough not to matter. — whuber, Sep 12 '11 at 15:46
Thanks. We don't need to establish which sensor is better, only that it does not matter which we use. From the technical paper we know that one has a +/- error of about 18.5 cm and the other between +/- 15 and 20 cm. The differences between the 2 measurements are within +/- 40 cm (well, at least 96% of the data are). If we are satisfied that the results are comparable we can proceed further with the analysis. — Monica Palaseanu-Lovejoy, Sep 15 '11 at 13:00
That additional information helps, because it bounds the potential bias between the two sensors. If the errors are independent, for instance, then the standard error of the difference is no less than $\sqrt{18.5^2 + 15^2} = 23.8$ cm, implying (for errors with approximately Normal distributions) that 96% of them should be less than $23.8 \times 2.05 = 49$ cm in absolute value. Your result of $40$ cm suggests there is a positive correlation among the errors and little or no inherent bias in either one. Both these results are plausible and comforting. — whuber, Sep 15 '11 at 16:14
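
The arithmetic in the last comment can be checked in a few lines. This is only a sketch: it assumes the quoted +/- figures (18.5 cm and 15 cm) can be treated as standard deviations of independent, approximately Normal errors, which the thread does not confirm, and the variable names are illustrative.

```python
import math
from statistics import NormalDist

# Assumption: the quoted +/- errors are standard deviations, in cm.
sd_a, sd_b = 18.5, 15.0

# Standard deviation of the difference of two independent errors.
sd_diff = math.sqrt(sd_a**2 + sd_b**2)

# Two-sided 96% multiplier: 2% in each tail of a standard Normal.
z96 = NormalDist().inv_cdf(0.98)

# 96% of the absolute differences should fall below this bound.
bound = z96 * sd_diff
print(f"sd of difference: {sd_diff:.1f} cm, z: {z96:.2f}, bound: {bound:.0f} cm")
```

An observed 96% bound noticeably below the computed one is what would point toward positively correlated errors, as the comment argues.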

score 3 · Answer 1 · edited Sep 10 '11 at 01:57

If the spatial positions of points in the two datasets are the same, you should compare them directly by subtracting one elevation from the other and mapping the differences. If the positions are not the same, things get trickier, but you might be OK interpolating the values of one of the datasets to the locations of the other and comparing those. (This requires a good interpolator and some caution to make sure you're not just evaluating the accuracy of the interpolator.)

Some characteristic features of this problem that preclude standard statistical comparisons are:

The univariate distribution of the data reflects the actual set of altitudes but says little or nothing about the accuracy of the LIDAR data, either relative to each other or absolutely. Therefore measures of divergence of distributions would appear to be irrelevant or misleading. They certainly won't generalize to the sensors themselves, because the divergence depends so strongly on the particular elevation distribution.
Without any ground truth or surveyed elevations, there is no standard for determining which dataset is better: you can only compare them to each other.
Typically, errors occur in both the altitude measurements and the position determination. A change in position by a vector amount $(dx, dy)$ at a location with gradient $(u,v)$ causes a change in altitude by $u dx + v dy$ (the directional derivative of the surface in the $(dx, dy)$ direction). This change will, on average, be proportional to the tangent of the slope. This causes the positional errors to influence the altitude measurements much more in high-slope areas than in low-slope areas.
Errors tend to have strong spatial correlations, especially the positional errors.
Given the rich spatial structure of these data and their expected spatial correlations, simple correlation coefficients or kappa statistics will likely not reveal anything useful. (For instance, a correlation of 0.80 would be considered extremely bad for LIDAR elevations of terrain with large elevation variations, where correlations ought to be 99+%, but might be decent on very flat terrain.)
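
The effect of positional error described in the list above can be illustrated on a synthetic planar surface. This is a minimal sketch, assuming a hypothetical plane with gradient `(u, v)` and a hypothetical horizontal offset `(dx, dy)`; none of these values come from the lidar data.

```python
import math

def elevation(x, y, u=0.30, v=-0.10, z0=100.0):
    """Hypothetical planar terrain with gradient (u, v)."""
    return z0 + u * x + v * y

u, v = 0.30, -0.10   # gradient components (rise per unit distance)
dx, dy = 0.5, 0.2    # assumed positional error, same units as x and y

# Elevation change caused by reading the surface at the shifted position.
observed = elevation(10.0 + dx, 20.0 + dy) - elevation(10.0, 20.0)

# Prediction from the gradient: u*dx + v*dy.
predicted = u * dx + v * dy

# The tangent of the slope bounds the effect of a shift of given length.
slope_tangent = math.hypot(u, v)
worst_case = slope_tangent * math.hypot(dx, dy)

print(observed, predicted, worst_case)
```

On a plane the two values agree up to rounding; on real terrain the relation holds only to first order, which is why the effect is expected to scale with slope rather than match exactly.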

One useful way of comparing any two digital elevation models, LIDAR or not, therefore consists of subtracting the values of one from those of the other, point by point, and comparing (with a scatterplot, for instance) that to the average tangent of the slope determined by the two DEMs. Fitting a line to this and mapping out the residuals can reveal locations where the two DEMs differ by unusual amounts, giving you a spatial picture of their relative consistency. From this you can estimate the expectation of the difference (and of the absolute difference) between the two as a function of the slope. This could be digested into a single number, but the expected variation of error with slope indicates you would be better off using the entire curve to characterize the relationship between the two datasets. Alternatively, you could attempt to separate out the two components of variation into (a) a function that depends on slope, reflecting positional error, and (b) the variance of the difference of elevations. Ideally, the function in (a) will be accurately described by a small number of parameters (perhaps only one, which would be related to the variance of the positional error).
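
The comparison described above can be sketched as follows. This is an illustration under stated assumptions, not the answerer's own code: the DEMs are synthetic nested lists, slope is estimated by central differences on interior cells, and the helper names (`slope_tangent`, `fit_line`, `compare_dems`) are made up for this example. The line is fitted to the signed differences; absolute differences can be passed through the same fit.

```python
import math

def slope_tangent(dem, i, j, cell):
    """Tangent of the slope at interior cell (i, j) by central differences."""
    dz_dx = (dem[i][j + 1] - dem[i][j - 1]) / (2 * cell)
    dz_dy = (dem[i + 1][j] - dem[i - 1][j]) / (2 * cell)
    return math.hypot(dz_dx, dz_dy)

def fit_line(xs, ys):
    """Ordinary least squares fit y = a + b*x; returns (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    return my - b * mx, b

def compare_dems(dem_a, dem_b, cell):
    """Differences vs. average slope tangent, fitted line, and residuals."""
    slopes, diffs = [], []
    for i in range(1, len(dem_a) - 1):
        for j in range(1, len(dem_a[0]) - 1):
            t_a = slope_tangent(dem_a, i, j, cell)
            t_b = slope_tangent(dem_b, i, j, cell)
            slopes.append(0.5 * (t_a + t_b))
            diffs.append(dem_a[i][j] - dem_b[i][j])
    a, b = fit_line(slopes, diffs)
    residuals = [d - (a + b * t) for t, d in zip(slopes, diffs)]
    return slopes, diffs, a, b, residuals

# Synthetic example (not the real lidar data): the second DEM is the same
# surface read 1 m too far east, plus a 0.05 m vertical offset.
def surface(x, y):
    return 40.0 * math.sin(x / 60.0) + 25.0 * math.cos(y / 45.0)

cell, n = 5.0, 30
dem_a = [[surface(j * cell, i * cell) for j in range(n)] for i in range(n)]
dem_b = [[surface(j * cell + 1.0, i * cell) + 0.05 for j in range(n)]
         for i in range(n)]

slopes, diffs, intercept, gradient, residuals = compare_dems(dem_a, dem_b, cell)
worst = max(range(len(residuals)), key=lambda k: abs(residuals[k]))
print(f"intercept={intercept:.3f} m, gradient={gradient:.3f} m per unit tangent")
print(f"largest residual {residuals[worst]:.3f} m at slope tangent {slopes[worst]:.3f}")
```

The residuals are stored in row-major order over interior cells, so index `k` maps back to grid cell `(1 + k // (n - 2), 1 + k % (n - 2))` when they are mapped.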

If these two components of relative error--spatial and elevation--are sufficiently small, you can conclude the two sensors have essentially the same accuracy. How much accuracy they have cannot be determined without additional information. If the components of error are large for your application, you only know that one (or perhaps both!) of the sensors is inadequate.

I wonder if you have any reference for your statement that 80% correlation is bad for terrain with large elevation variations but acceptable for flat terrain. Thanks, Monica — Monica Palaseanu-Lovejoy, Sep 15 '11 at 13:03
@Monica That's not a reference; that's an easy deduction. I wouldn't even go so far as to say 80% is generally "acceptable," only that I can imagine some situations where it would not raise any concerns. Note that correlation depends as much on the spatial extent of the DEM as on anything else and therefore is a poor indicator of how well the DEMs match: see http://stats.stackexchange.com/q/13314. *At a minimum* you should be using an absolute measure of the differences instead, such as root mean square difference or median absolute difference or even maximum absolute difference. — whuber, Sep 15 '11 at 16:08
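
The three absolute measures named in the comment above are simple to compute for co-located cells. A minimal sketch, with a hypothetical helper name and made-up elevations used only for illustration:

```python
import math
from statistics import median

def difference_summary(z_a, z_b):
    """RMS, median absolute, and maximum absolute difference of paired values."""
    diffs = [a - b for a, b in zip(z_a, z_b)]
    abs_diffs = [abs(d) for d in diffs]
    return {
        "rmsd": math.sqrt(sum(d * d for d in diffs) / len(diffs)),
        "median_abs": median(abs_diffs),
        "max_abs": max(abs_diffs),
    }

# Made-up elevations in metres, for illustration only.
z_a = [101.2, 98.7, 105.0, 110.3, 99.9]
z_b = [101.0, 98.9, 105.4, 110.2, 99.9]
summary = difference_summary(z_a, z_b)
print(summary)
```

Unlike a correlation coefficient, all three are expressed in elevation units, so they can be compared directly with the sensors' stated vertical errors.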

score 0 · Answer 2 · answered Sep 09 '11 at 20:42

Given the final purpose, I suggest that a statistical test is not what you want. It's not clear to me if these datasets are on the same people (or whatever the subjects are) or not. If the subjects are the same, I suggest taking the difference between the two tests and plotting that. What proportion of the differences are large enough that you would care about them? You could also use a scatterplot of one variable vs. the other, and a qq plot of one variable vs. the other.

If the measurements are on different subjects, then we have to assume that the two samples of 90,000 are drawn in the same way from the same population. In this case, the differences don't exist, and the scatterplot makes no sense, but the qq plot could still be useful.
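
Both suggestions can be sketched briefly: the proportion of paired differences exceeding a tolerance, and the matching quantiles that a qq plot would display. The helper names, the sample values, and the 0.4 m tolerance are assumptions made for this illustration.

```python
from statistics import quantiles

def proportion_exceeding(z_a, z_b, tolerance):
    """Fraction of paired differences larger in magnitude than tolerance."""
    diffs = [a - b for a, b in zip(z_a, z_b)]
    return sum(abs(d) > tolerance for d in diffs) / len(diffs)

def qq_pairs(sample_a, sample_b, n=10):
    """Matching quantile pairs of two samples: the points of a qq plot."""
    return list(zip(quantiles(sample_a, n=n), quantiles(sample_b, n=n)))

# Made-up elevations in metres; the tolerance is an assumed threshold.
z_a = [101.2, 98.7, 105.0, 110.3, 99.9, 103.1, 97.5, 108.8]
z_b = [101.0, 98.9, 105.6, 110.2, 99.9, 103.0, 97.4, 109.3]

print(proportion_exceeding(z_a, z_b, tolerance=0.4))
for q_a, q_b in qq_pairs(z_a, z_b, n=4):
    print(f"{q_a:.2f}  {q_b:.2f}")
```

The first function needs paired measurements; the quantile pairs do not, which is why the qq plot remains usable when the two samples are on different subjects.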

LIDAR is a remote sensing method used for 3D surveying or, when mounted on aircraft or satellites, for digital elevation mapping. It tends to produce high-resolution but irregular networks of 3D locations. — whuber, Sep 09 '11 at 21:39
Thanks for the info @whuber. It would have been nice if that was in the original post — Peter Flom, Sep 10 '11 at 12:17
