11

I have a 2D square containing a set of points, say 1000 of them. I need a way to tell whether the points are spread out (more or less uniformly distributed) or whether they tend to gather together in some spot inside the square.

I need a mathematical/statistical (not programming) way to determine this. I googled and found things like goodness-of-fit tests, Kolmogorov, etc., and wonder whether there are other approaches to achieve this. I need this for a class paper.

Inputs: a 2D square, and 1000 points. Output: yes/no (yes = evenly spread out, no = gathering together in some spots).

whuber
Van
  • 1
    You haven't articulated precisely enough what "uniformly distributed" means for you. Do you mean an evenly tiled 2D uniform cube, or something else? For example, an evenly spaced chain of points? Or a circle of points? In a sense, these figures are uniform spreads, too. – ttnphns Oct 22 '12 at 11:05
  • 3
    @ttnphns In this context, "uniform" has a well-established conventional meaning. It corresponds to a Poisson process with constant intensity. It is often known as "CSR" [completely spatially random](http://en.wikipedia.org/wiki/Complete_spatial_randomness). – whuber Oct 22 '12 at 14:28
  • 2
    @Van You want to research "spatial point processes." Good keywords include "Ripley K function," "CSR", and "Poisson". An accessible reference for you would be O'Sullivan & Unwin, *Geographical Information Analysis.* A classic is Ripley, *Spatial Statistics*: it focuses on point processes. For applications, take a quick look at [CrimeStat](http://www.icpsr.umich.edu/CrimeStat/download.html). If you're comfortable with `R`, there are [plenty of tools for this task](http://cran.r-project.org/web/views/Spatial.html). – whuber Oct 22 '12 at 14:35

3 Answers

5

I think @John's idea of a chi-square test is one way to go.

You would divide the square into 2D patches (cells), but you would test the cell counts using a one-way (goodness-of-fit) chi-square test; that is, with equal-sized cells the expected value for each cell would be $\frac{1000}{N}$, where $N$ is the number of cells.
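
As a sketch of the statistic this describes, write $O_i$ for the observed count in cell $i$ (a symbol introduced here for illustration only) and assume equal-sized cells:

$$\chi^2 = \sum_{i=1}^{N} \frac{\left(O_i - \frac{1000}{N}\right)^2}{\frac{1000}{N}}$$

Under uniformity this is approximately chi-square distributed with $N - 1$ degrees of freedom, provided the expected count per cell is not too small.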

But it's possible that different numbers of cells would give different conclusions.

Another possibility is to compute the average distance between points and then compare this to simulated results of that average. That avoids the problem of an arbitrary number of cells.

EDIT (more on average distance)

With 1000 points, there are $\frac{1000 \times 999}{2} = 499{,}500$ pairwise distances between points. Each can be computed (using, say, Euclidean distance), and the distances can then be averaged.

Then you can generate N sets (N being a large number, unrelated to the number of cells above) of 1000 uniformly distributed points. Each of those N sets also has an average distance among its points.

Compare the average distance for the actual points with the distribution of the simulated averages, either to get a p-value or just to see where the observed value falls.
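
The simulation described above could be sketched roughly as follows in Python. This is only an illustration under stated assumptions: the square is taken to be the unit square, the function names (`mean_pairwise_distance`, `uniform_points`, `monte_carlo_p_value`) and the default of 199 simulations are hypothetical choices, and the p-value is one-sided in the direction of clustering (an average distance smaller than expected under uniformity).

```python
import math
import random


def mean_pairwise_distance(points):
    """Average Euclidean distance over all n*(n-1)/2 unordered pairs."""
    n = len(points)
    total = 0.0
    for i in range(n - 1):
        xi, yi = points[i]
        for j in range(i + 1, n):
            xj, yj = points[j]
            total += math.hypot(xi - xj, yi - yj)
    return total / (n * (n - 1) / 2)


def uniform_points(n, rng):
    """n points drawn independently and uniformly from the unit square (assumed domain)."""
    return [(rng.random(), rng.random()) for _ in range(n)]


def monte_carlo_p_value(points, n_sims=199, seed=0):
    """One-sided Monte Carlo p-value: is the observed average distance unusually small?"""
    rng = random.Random(seed)
    observed = mean_pairwise_distance(points)
    simulated = [
        mean_pairwise_distance(uniform_points(len(points), rng))
        for _ in range(n_sims)
    ]
    # Count simulated averages at least as small as the observed one.
    as_small = sum(1 for s in simulated if s <= observed)
    # The +1 terms count the observed pattern itself as one draw under the null.
    return observed, (as_small + 1) / (n_sims + 1)


if __name__ == "__main__":
    # Hypothetical clustered data: 200 points squeezed into one corner.
    rng = random.Random(1)
    clustered = [(0.2 * rng.random(), 0.2 * rng.random()) for _ in range(200)]
    observed, p = monte_carlo_p_value(clustered, n_sims=99)
    print(f"mean distance = {observed:.3f}, p-value = {p:.3f}")
```

The demo uses 200 points rather than 1000 because the pure-Python double loop is slow: each set of 1000 points needs 499,500 distances, and that cost is repeated for every simulated set.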

Peter Flom
  • I agree that one-sample chi-square ("agreement chi-square test") is among the reasonable ways. But can you elaborate more on your "average distance" proposal? I didn't quite understand it. – ttnphns Oct 22 '12 at 11:09
  • @ttnphns, the ones used in spatial analysis are the nearest neighbor test (aka the Clark and Evans test) and Ripley's K. See the R library [spatstat](http://www.spatstat.org/spatstat/) or the [CrimeStat documentation](http://www.icpsr.umich.edu/CrimeStat/download.html) for examples. Another possibility based on simulation is "scan" tests, but these aren't based on average distances. – Andy W Oct 22 '12 at 11:14
3

Another possibility is a chi-squared test. Divide the square into equally sized non-overlapping patches, and test the counts of the points falling into the patches against their expected counts under a hypothesis of uniformity (the expectation for a patch is total_points / total_patches if they're all equally sized) using the chi-squared test. For 1000 points, 9 patches should be sufficient, but you may want to use more granularity depending on what your data look like.
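
A minimal Python sketch of this procedure, assuming the square is the unit square and using a 3×3 grid (9 patches). The function names and the constant `CRITICAL_VALUE_DF8` are hypothetical; the value 15.507 is the standard 5% critical value of the chi-squared distribution with 8 degrees of freedom (9 cells minus 1).

```python
import random


def grid_counts(points, k, side=1.0):
    """Count points in each cell of a k-by-k grid covering [0, side] x [0, side]."""
    counts = [0] * (k * k)
    for x, y in points:
        # Clamp so that points lying exactly on the far edge land in the last cell.
        col = min(int(x / side * k), k - 1)
        row = min(int(y / side * k), k - 1)
        counts[row * k + col] += 1
    return counts


def chi_square_statistic(counts):
    """Goodness-of-fit statistic against equal expected counts in every cell."""
    expected = sum(counts) / len(counts)  # total_points / total_patches
    return sum((c - expected) ** 2 / expected for c in counts)


# 5% critical value for 8 degrees of freedom (9 cells - 1), from standard tables.
CRITICAL_VALUE_DF8 = 15.507

if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical data: 1000 points crowded into the lower-left quarter.
    points = [(0.5 * rng.random(), 0.5 * rng.random()) for _ in range(1000)]
    stat = chi_square_statistic(grid_counts(points, k=3))
    if stat > CRITICAL_VALUE_DF8:
        verdict = "not uniform across these cells"
    else:
        verdict = "no evidence against uniformity"
    print(f"chi-squared = {stat:.1f}: {verdict}")
```

A statistic below the cutoff only means the test found no evidence against uniformity for this particular grid; rerunning with a different `k` needs the critical value for `k * k - 1` degrees of freedom.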

Ben Allison
  • 1
    I think you're onto something but a goodness of fit chi-square comparing the actual counts in each cell against an expected count of equal cells would be what you'd want. Using a contingency test would NOT test if there was uniform distribution among your cells, only if row depended on column. – John Oct 22 '12 at 10:22
  • Also, the chi-square test would only tell you if they weren't uniform across the cells you selected. It would not tell you if they were uniform. – John Oct 22 '12 at 10:23
  • Yes I meant the counts against their expected counts under a null hypothesis of uniformity, my apologies if it wasn't clear. You can just visualise it as a table which helps to understand what's going on for the uninitiated! And obviously you're limited to testing against the cells you select rather than uniformity in the abstract sense – Ben Allison Oct 22 '12 at 11:00
  • @John, when one does this "dispersion test" one typically does a two-sided test. If you really wanted to see whether the pattern was more uniform than expected by chance, you could simply check whether the chi-square statistic fell in the left tail of the distribution (at whatever cut-off you prefer). – Andy W Oct 22 '12 at 11:08
  • Andy, you should provide an answer that details this two-sided goodness of fit test. Typically two sided tests just test two different alternatives to null but still cannot demonstrate the null. Your proposal is intriguing. – John Oct 22 '12 at 12:00
1

Why not use the Kolmogorov-Smirnov test? That's what I would do, especially considering that your sample size is big enough to compensate for the lack of power.

Alternatively, you could do some simulation. It's not rigorous, but it provides some evidence as to whether the data are uniformly distributed.


@whuber The 2-dimensional extension of the KS is well known (see here). In this case, we are investigating whether these 1000 draws (coordinates (x,y)) could be drawn from the 2-dimensional jointly uniform distribution - at least that's how I read "evenly spread out". @John I might've expressed myself clumsily (neither maths nor English are my first languages). What I meant was that the exact p-value can be computed using a test such as the KS, whereas the p-value (or whatever you call the equivalent) only tends asymptotically when doing simulations.

abaumann