How to check whether a sample is representative across two dimensions simultaneously?

Question

I'm attempting to develop a standardized method to check whether one set of locations is representative of a larger set. In this particular case, I'm looking specifically at their geographical representativeness.

One method is to run a two-sample t-test on latitude and longitude *independently*, but that clearly ignores the possibility that the two values are correlated. Another option is to use a categorical grouping of location (e.g. state, market, or any other gridding of the network) and apply a chi-squared test. Neither of these strikes me as optimal, however.

Is anyone familiar with a test that can check bias of a sample based on two dimensions simultaneously? Any thoughts would be greatly appreciated.

This question suggests you may have different ideas of what "representative" means. The use of t-tests implies you just want the sample to have a comparable mean to the population. The reference to correlation implies you are looking at a second-order comparison between the sample and the population. A chi-squared test examines the full distribution of the sample in comparing it to the population. So, the first thing you could do is reflect on what precisely you mean by "representative," then come back here and share that with us. Could you also say *why* you're checking "representativeness"? — whuber, Jun 08 '12 at 15:23
Fair questions-- let me clarify. The check is aimed at ensuring that the smaller sample would be a representative proxy of the larger group if used in a test. The desire is to extrapolate the findings of a test on the smaller group to the larger group. — Greg, Jun 08 '12 at 15:43
In terms of the different tests mentioned-- I realize the Chi-squared test looks at the full distribution, while the t-test only looks at the means. Given the goal, a test for checking the distributions is more valuable than simply checking the mean. — Greg, Jun 08 '12 at 15:47
(+1) Thank you for the clarifications. You might get some interesting replies, because this goes to a fundamental issue about how one obtains representative samples in the first place. I have added some tags to reflect that. — whuber, Jun 08 '12 at 15:54
@Macro Good point. The concern with obtaining representative samples is a big deal in many fields. (A Google search turns up a half million hits for "representative sample," for instance.) It might not be a sexy technical subject, but it's a fundamental issue. It seemed to me to be sufficiently broad to warrant a tag, even though nobody had thought of creating this tag during the first 10K questions on this site. Arguably it could be rolled into the concept of "sampling," but *some* way to search for the concept "representative/representativeness/typical/accurate" seems useful. — whuber, Jun 08 '12 at 16:12
@whuber - I'm under the impression that I've read several other questions, perhaps many, which deserve this tag. I can't say I'll hunt them down and tag them, but I think it'll get a fair amount of use going forward. — jbowman, Jun 08 '12 at 16:44

score 1 · Answer 1 · answered Jun 08 '12 at 15:55

You can still do the chi-square test. Nothing says that the bins have to be 1 dimensional. Divide the globe into longitude by latitude segments and count the number of cases in each bin for the two samples. The same chi square test applies.
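As a rough illustration of the two-dimensional binning described above, here is a minimal Python sketch. It assumes NumPy and SciPy are available, treats the two sets of locations as independent samples, and uses an equal-width grid; the function name `binned_chi2_2d` and the default of 4 by 4 bins are illustrative choices, not something prescribed in this answer.

```python
import numpy as np
from scipy.stats import chi2_contingency


def binned_chi2_2d(sample_a, sample_b, n_lon_bins=4, n_lat_bins=4):
    """Chi-square test of homogeneity on a shared longitude-by-latitude grid.

    sample_a, sample_b: arrays of shape (n, 2) holding (longitude, latitude).
    Returns (statistic, p_value, degrees_of_freedom).
    """
    sample_a = np.asarray(sample_a, dtype=float)
    sample_b = np.asarray(sample_b, dtype=float)
    pooled = np.vstack([sample_a, sample_b])

    # Equal-width edges spanning both samples, so every point lands in a cell.
    lon_edges = np.linspace(pooled[:, 0].min(), pooled[:, 0].max(), n_lon_bins + 1)
    lat_edges = np.linspace(pooled[:, 1].min(), pooled[:, 1].max(), n_lat_bins + 1)

    counts_a, _, _ = np.histogram2d(
        sample_a[:, 0], sample_a[:, 1], bins=[lon_edges, lat_edges]
    )
    counts_b, _, _ = np.histogram2d(
        sample_b[:, 0], sample_b[:, 1], bins=[lon_edges, lat_edges]
    )

    # One row per sample, one column per two-dimensional cell.
    table = np.vstack([counts_a.ravel(), counts_b.ravel()]).astype(int)
    # Cells that are empty in both samples would give zero expected counts.
    table = table[:, table.sum(axis=0) > 0]

    statistic, p_value, dof, _expected = chi2_contingency(table)
    return statistic, p_value, dof


# Synthetic example: both sets drawn from the same uniform distribution.
rng = np.random.default_rng(0)
larger_set = rng.uniform(0.0, 1.0, size=(500, 2))
candidate = rng.uniform(0.0, 1.0, size=(200, 2))
print(binned_chi2_2d(larger_set, candidate))
```

Flattening the grid into columns is what makes the bins "two dimensional": the test itself only sees one count per cell for each sample.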

Michael R. Chernick


So, are you suggesting discretizing a continuous distribution and then doing a $\chi^2$ test as though this were a discrete distribution? How exactly would you choose the binning to produce the categorization? (that's a question that interests me in a completely different context - http://stats.stackexchange.com/questions/13054/determining-an-optimal-discretization-of-data-from-a-continuous-distribution - but it is relevant to this question) – Macro Jun 08 '12 at 16:02
@Macro That is exactly what is done with the chi-square test in 1 dimension. I wouldn't characterize it as treating it "as though it were a discrete distribution". It just tests that the areas under the curve in the bins are nearly what they should be if the continuous distributions are identical. As far as binning goes, the same problem exists in the 1-dimensional case. You want to take enough bins to represent the shape of the distribution but not so many that you get a lot of low frequency or empty cells. – Michael R. Chernick Jun 08 '12 at 16:12
I think I would try to make the cells the same size as much as possible, using the same values for delta longitude x delta latitude in each cell. – Michael R. Chernick Jun 08 '12 at 16:13
Re: "You want to take enough bins to represent the shape of the distribution but not so many that you get a lot of low frequency or empty cells." - are there methods for doing that or is this generally done in an ad-hoc way? – Macro Jun 08 '12 at 16:13
This is a pretty good idea, but there are some subtleties that deserve explanation. The most important perhaps is that the degrees of freedom will be just one less than the number of bins. @Macro brings up some others, but they are less of an issue when one is planning to compare a sample to a *known* population which is measured within standardized administrative units (such as states, counties, etc.). If such units are not available, then Macro's point gains more force: it's the [MAUP](http://en.wikipedia.org/wiki/Modifiable_areal_unit_problem). – whuber Jun 08 '12 at 16:16
Michael, "size" here is probably not measured with latitude and longitude, due to the tendency for human populations to have highly heterogeneous distributions. But your point is a good one: one would seek a partition of the study region into areas that have approximately equal populations. In principle, an optimal number of such areas could be found as a function of the sample size; it's no different than the one-dimensional problem, except that the solutions are no longer unique. – whuber Jun 08 '12 at 16:19
@Macro I don't know of any formal method to choose bin size. In 1 dimension you look at the range of the data and then split it into k equal pieces, where k is chosen ad hoc (maybe by trial and error). – Michael R. Chernick Jun 08 '12 at 16:23
@whuber You make a good point about points being clustered. It doesn't sound like the points represent populations. But they still may have vast empty spaces and clustering shapes. If the two samples have large common empty regions, I would suggest leaving those regions out entirely and just forming the bins in the common areas where the points fall. This may allow near-equal-size binning even though points are clustered. – Michael R. Chernick Jun 08 '12 at 16:28
These are good thoughts-- and the binned Chi-squared test was the option we were considering. Do the arbitrary cutoffs of the bins worry anyone, however? My other concern with comparing the categorical values (instead of continuous values) is the small sample size. Take the case of State-- if I'm only choosing 10 sites, and choose them state by state to match the actual US population distribution as closely as possible, isn't the chi-squared test going to reject the null every time? – Greg Jun 08 '12 at 16:52
@Greg I don't think there is much to worry about in using ad hoc cutoffs. That is typically what is done with chi-square tests. The specific binning doesn't affect the test unless you make it far too narrow or too wide, and good judgement prevents that. – Michael R. Chernick Jun 08 '12 at 17:28
@Greg Regarding the sample size: if it is small, it will be difficult, not easy, to reject the null hypothesis. I think the bins should be chosen in a way that best represents how the data are spread out, and not necessarily by state boundaries. However, if it is too hard to reject the null hypothesis, then failing to reject will not be a good indicator that the two populations are the same. But this will always be a problem with small samples. On the other hand, for very large samples it may be too easy to reject, and rejection could be based on small differences that don't matter. – Michael R. Chernick Jun 08 '12 at 17:34
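
The comments above raise two points that can be sketched together: cells with approximately equal population, and a comparison against a *known* population with degrees of freedom equal to the number of bins minus one. The following Python sketch is one possible reading, assuming NumPy and SciPy. The strip-then-split partition, the function name `equal_population_chi2`, and the 3 by 3 layout are assumptions made for illustration; the chi-square approximation is also only rough when expected counts per cell are small (as with 10 sites).

```python
import numpy as np
from scipy.stats import chisquare


def _interior_quantiles(values, n_parts):
    """Cut points that split `values` into n_parts groups of about equal count."""
    return np.quantile(values, np.linspace(0.0, 1.0, n_parts + 1)[1:-1])


def equal_population_chi2(population, sample, n_strips=3, n_per_strip=3):
    """Goodness-of-fit of `sample` cell counts against population proportions.

    Both inputs are arrays of shape (n, 2) holding (longitude, latitude).
    Returns (statistic, p_value, population_counts_per_cell).
    """
    population = np.asarray(population, dtype=float)
    sample = np.asarray(sample, dtype=float)

    # Longitude strips with about equal population, then latitude cuts per strip.
    lon_cuts = _interior_quantiles(population[:, 0], n_strips)
    pop_strip = np.searchsorted(lon_cuts, population[:, 0], side="right")
    lat_cuts = [
        _interior_quantiles(population[pop_strip == s, 1], n_per_strip)
        for s in range(n_strips)
    ]

    def cell_index(points):
        strip = np.searchsorted(lon_cuts, points[:, 0], side="right")
        cell = np.zeros(len(points), dtype=int)
        for s in range(n_strips):
            in_strip = strip == s
            cell[in_strip] = s * n_per_strip + np.searchsorted(
                lat_cuts[s], points[in_strip, 1], side="right"
            )
        return cell

    n_cells = n_strips * n_per_strip
    pop_counts = np.bincount(cell_index(population), minlength=n_cells)
    sample_counts = np.bincount(cell_index(sample), minlength=n_cells)

    # Expected sample counts if the sample follows the population proportions.
    expected = pop_counts / pop_counts.sum() * sample_counts.sum()
    # With the default ddof=0, the reference distribution has n_cells - 1 df.
    statistic, p_value = chisquare(sample_counts, f_exp=expected)
    return statistic, p_value, pop_counts
```

Because the cuts come from the population itself, every cell has a positive expected count, which sidesteps the empty-cell problem of an equal-width grid over clustered points.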

score 1 · Answer 2 · answered Jun 09 '12 at 13:10

Fasano and Franceschini suggested a multi-dimensional version of the Kolmogorov-Smirnov test, which they show to be preferable to the $\chi^2$-test for 2- and 3-dimensional data, in Monthly Notices of the Royal Astronomical Society 225:155-170. The paper is freely available here.
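
To give a feel for the idea, here is a simplified Python sketch of a two-sample, two-dimensional statistic in the spirit of this test, assuming only NumPy. It uses each data point as an origin, compares the fractions of both samples falling in the four surrounding quadrants, and averages the largest differences found with each sample's points as origins. The p-value here comes from a permutation test, which is a swapped-in substitute for the significance approximation given in the paper; the function names are hypothetical.

```python
import numpy as np


def max_quadrant_diff(origins, a, b):
    """Largest |fraction of a - fraction of b| over all quadrants and origins."""

    def fractions(points):
        # Boolean arrays of shape (n_origins, n_points).
        right = points[None, :, 0] > origins[:, None, 0]
        up = points[None, :, 1] > origins[:, None, 1]
        return np.stack(
            [
                (right & up).mean(axis=1),
                (~right & up).mean(axis=1),
                (~right & ~up).mean(axis=1),
                (right & ~up).mean(axis=1),
            ],
            axis=1,
        )

    return np.abs(fractions(a) - fractions(b)).max()


def ff_statistic(a, b):
    """Average of the maxima obtained with origins taken from a and from b."""
    return 0.5 * (max_quadrant_diff(a, a, b) + max_quadrant_diff(b, a, b))


def ff_permutation_test(a, b, n_perm=200, seed=0):
    """Return (statistic, permutation p-value) for two (n, 2) point arrays."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    observed = ff_statistic(a, b)
    pooled = np.vstack([a, b])
    n_a = len(a)
    hits = 0
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        if ff_statistic(pooled[perm[:n_a]], pooled[perm[n_a:]]) >= observed:
            hits += 1
    # Add-one correction keeps the p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic example: two samples from the same bivariate normal distribution.
rng = np.random.default_rng(0)
print(ff_permutation_test(rng.normal(size=(100, 2)), rng.normal(size=(40, 2))))
```

Unlike the binned $\chi^2$ approach, nothing here depends on a choice of grid, which is the main attraction for continuous coordinates such as latitude and longitude.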

gui11aume · Answer 3 · 2012-06-09T16:23:49.410

Actually, I had the same question recently. By scanning rapidly through the published literature, I came to realize that a general test has been developed by Friedman & Rafsky. Their approach is to use a minimum spanning tree, which is the smallest tree that connects the points of the cloud in $n$ dimensions, and compute a statistic from it that is distributed as a Student's t. Unfortunately, I am not aware of any implementation of that test.

All I can suggest is the trick that consists of normalizing your variables into the square $(0,1)\times(0,1)$, applying the inverse erf function to get a bivariate Gaussian, then squaring the two components and summing them, which should give you a sample distributed as a $\chi^2(2)$, which you can check with your favorite goodness-of-fit test.

Update: There is a C library to test for uniformity in several dimensions, written by Ben Pfaff. At the section Uniformity testing library you can download the source code and the documentation. If I understood well, this is an implementation of the Smith & Jain test, which is a refinement of the Friedman & Rafsky test for the case where the boundaries of the domain are not defined.
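
A minimal Python sketch of this trick, assuming NumPy and SciPy, and assuming the points have already been mapped into the unit square where they should be uniform and independent under the null hypothesis. The "inverse erf" step is written here as the standard normal quantile function `norm.ppf`, which equals $\sqrt{2}\,\operatorname{erf}^{-1}(2u-1)$; the Kolmogorov-Smirnov test is just one possible choice of goodness-of-fit test, and the function name is hypothetical.

```python
import numpy as np
from scipy.stats import kstest, norm


def radial_chi2_check(unit_square_points):
    """KS test of z1^2 + z2^2 against chi2(2) after a normal-quantile transform.

    unit_square_points: array of shape (n, 2) with values in (0, 1).
    Returns (ks_statistic, p_value).
    """
    # Clip to keep the quantile function finite at the edges of the square.
    u = np.clip(np.asarray(unit_square_points, dtype=float), 1e-12, 1 - 1e-12)
    z = norm.ppf(u)               # uniform -> standard normal, per coordinate
    r2 = np.sum(z ** 2, axis=1)   # sum of squares of the two components
    result = kstest(r2, "chi2", args=(2,))
    return result.statistic, result.pvalue


# Synthetic example: uniform points versus points clustered near the centre.
rng = np.random.default_rng(0)
print(radial_chi2_check(rng.uniform(size=(500, 2))))
print(radial_chi2_check(rng.beta(5, 5, size=(500, 2))))
```

Note that reducing each point to a single squared radius discards angular information, so this check can miss departures from uniformity that leave the distribution of the radius unchanged.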
