I have a dataset of events that happened during the same period of time. Each event has a type (there are a few different types, fewer than ten) and a location, represented as a 2D point.

I would like to check if there is any correlation between types of events, or between the type and the location. For example, maybe events of type A usually don't occur where events of type B do. Or maybe in some area, there are mostly events of type C.

What kind of tools could I use to do this? Being a novice in statistical analysis, my first idea was to use some kind of PCA (Principal Component Analysis) on this dataset to see whether each type of event had its own component, or whether some shared the same one (i.e. were correlated).

I have to mention that my dataset is of the order of 500'000 points $(x, y, type)$, thus making things a bit harder to deal with.

EDIT: As noted in the answers below and in the comments, the way to go is to model this as a marked point process and then use R to do all the heavy lifting, as explained in detail in this workshop report: http://www.csiro.edu.au/resources/Spatial-Point-Patterns-in-R.html

Wookai
  • Is this a raster dataset, such as a (processed) remotely sensed image, or is it an irregular dataset? – whuber Mar 14 '11 at 04:58
  • Well, I think you'd call it irregular: it's a set of records of crimes that happened in the UK during a given month, available here: http://www.police.uk/data. – Wookai Mar 14 '11 at 20:44
  • @Wookai 500,000,000 crimes in the UK in *one month*?? Has anarchy descended on the British Isles unreported by the press, only at last to be revealed in the police files? :-) I could believe about 1/100th that amount--barely. – whuber Mar 14 '11 at 22:26
  • Wow, I'm really sorry for this "typo" ;) ! It's actually 1000 times less: 500'000 crimes (counting "vehicle crimes", i.e. speeding tickets, etc.). – Wookai Mar 15 '11 at 07:23
  • @Wookai That's good for you, because it means your problem is readily addressed with a GIS and may be accessible to R-based solutions. – whuber Mar 15 '11 at 15:06
  • Yes, R looks like the way to go! I found a very complete report of a workshop on the spatstat package for R, which does exactly what I'm looking for: http://www.csiro.edu.au/resources/Spatial-Point-Patterns-in-R.html – Wookai Mar 15 '11 at 16:25

2 Answers

The type of data you describe is usually called a "marked point pattern". R has a task view for spatial statistics that offers many good packages for this type of analysis, though most of them are probably not able to deal with the kind of humongous data you have :(

"For example, maybe events of type A usually don't occur where events of type B do. Or maybe in some area, there are mostly events of type C."

These are two fairly different types of questions. The second asks about the positioning of one type of mark/event. Buzzwords to look for in this context are, e.g., intensity estimation, or K-function estimation if you are interested in discovering patterns of clustering (events of a kind tend to group together) or repulsion (events of a kind tend to be separated). The first asks about the correlation between different types of events. This is usually measured with mark correlation functions.
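To make the clustering/repulsion idea concrete, here is a minimal Python sketch of a naive cross-type K estimate between two event types. Note that this is the bivariate variant of the K-function, not a mark correlation function, and it applies no edge correction, so it is only an illustration. The function name `cross_k`, the toy data and the unit-square window are all assumptions made for the example. For independent types the estimate should sit near $\pi r^2$; values clearly above that suggest attraction between the two types, values clearly below suggest repulsion.

```python
import math
import random

def cross_k(points_a, points_b, r, window_area):
    """Naive cross-type K estimate at distance r (no edge correction)."""
    pairs = 0
    for ax, ay in points_a:
        for bx, by in points_b:
            if math.hypot(ax - bx, ay - by) <= r:
                pairs += 1
    return window_area * pairs / (len(points_a) * len(points_b))

def jitter(value, rng, sd):
    """Move a coordinate by Gaussian noise, clamped to the unit interval."""
    return min(max(value + rng.gauss(0, sd), 0.0), 1.0)

rng = random.Random(0)
# Hypothetical toy data in the unit square: types A and B are placed independently.
type_a = [(rng.random(), rng.random()) for _ in range(300)]
type_b = [(rng.random(), rng.random()) for _ in range(300)]
# Type C is generated close to the type A points, so A and C attract each other.
type_c = [(jitter(x, rng, 0.01), jitter(y, rng, 0.01)) for x, y in type_a]

r = 0.05
k_reference = math.pi * r ** 2                       # value expected under independence
k_independent = cross_k(type_a, type_b, r, 1.0)      # should be close to k_reference
k_attracted = cross_k(type_a, type_c, r, 1.0)        # should exceed k_reference
print(k_reference, k_independent, k_attracted)
```

The double loop is quadratic in the number of points, so this is only usable on small samples or aggregated data; dedicated packages use edge corrections and faster neighbour searches.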

I think subsampling the data to get a more tractable data size is dangerous (see my comment on @hamner's reply), but maybe you could aggregate your data: divide the observation window into a manageable number of cells of equal size and tabulate the event counts in each. Each cell is then described by the location of its centre and a vector of counts, one entry per mark type. You should be able to use the standard methods for marked point processes on this aggregated process.
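The aggregation step could be sketched in Python roughly as follows. The helper names `aggregate_to_grid` and `cell_centre`, the square cells and the toy events are assumptions made for illustration; the input is taken to be a list of $(x, y, type)$ tuples as in the question.

```python
from collections import Counter, defaultdict

def aggregate_to_grid(events, x_min, y_min, cell_size):
    """Count events per grid cell and per type; events are (x, y, type) tuples."""
    counts = defaultdict(Counter)
    for x, y, event_type in events:
        col = int((x - x_min) // cell_size)
        row = int((y - y_min) // cell_size)
        counts[(col, row)][event_type] += 1
    return counts

def cell_centre(cell, x_min, y_min, cell_size):
    """Return the coordinates of the centre of a (col, row) cell."""
    col, row = cell
    return (x_min + (col + 0.5) * cell_size, y_min + (row + 0.5) * cell_size)

# Hypothetical toy events, not real data.
events = [(0.2, 0.3, "A"), (0.7, 0.1, "A"), (0.9, 0.4, "B"),
          (1.5, 0.5, "B"), (1.6, 1.2, "C")]
grid = aggregate_to_grid(events, x_min=0.0, y_min=0.0, cell_size=1.0)

for cell, type_counts in sorted(grid.items()):
    print(cell_centre(cell, 0.0, 0.0, 1.0), dict(type_counts))
```

A single pass over the events is enough, so this scales linearly with the number of points; the cell size controls the trade-off between tractability and spatial resolution.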

fabians
  • I am familiar with marked point processes and some related theoretical tools; I should have thought of this before. Thanks a lot for the keywords. Do you maybe have a few pointers for these? Thanks also for the aggregation idea; I had a similar one and will try it. – Wookai Mar 14 '11 at 20:40
  • Peter Diggle has written a book, "Model-based Geostatistics". He also has an analysis of Lancashire crime data on this page: http://www.lancs.ac.uk/staff/diggle/MADE/ that might give you some good ideas. – fabians Mar 14 '11 at 22:56

First, the size of the dataset. I recommend taking small, tractable samples of the dataset (either by randomly choosing N data points, or by randomly choosing several relatively small rectangles in the X-Y plane and taking all points that fall within those rectangles) and then honing your analysis techniques on this subset. Once you have an idea of the form of analysis that works, you can apply it to larger portions of the dataset.
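Both sampling schemes can be sketched in a few lines of Python. The function names `sample_points` and `sample_window` and the synthetic events are assumptions made for illustration. Keep in mind that a random subset thins the pattern and a single window may not be representative of the whole region, so results obtained on either kind of sample should be treated as exploratory.

```python
import random

def sample_points(events, n, seed=0):
    """Pick n events uniformly at random, ignoring their locations."""
    rng = random.Random(seed)
    return rng.sample(events, min(n, len(events)))

def sample_window(events, x0, y0, width, height):
    """Keep every event that falls inside one rectangular window."""
    return [e for e in events
            if x0 <= e[0] < x0 + width and y0 <= e[1] < y0 + height]

# Hypothetical synthetic events (x, y, type) in the unit square.
rng = random.Random(1)
events = [(rng.random(), rng.random(), rng.choice("ABC")) for _ in range(10_000)]

subset = sample_points(events, 500)                # random thinning of the pattern
window = sample_window(events, 0.25, 0.25, 0.1, 0.1)  # all points in one small rectangle
print(len(subset), len(window))
```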

PCA is primarily used as a dimensionality reduction technique; your dataset is only three dimensions (one of which is categorical), so I doubt it would apply here.

Try working with Matlab or R to visualize the points you are analyzing in the X-Y plane (or their relative density, if working with the entire dataset), both for individual types and for all types combined, and see what patterns emerge visually. That can help guide a more rigorous analysis.

benhamner
  • Whether this is appropriate depends on what you already know or assume about your data-generating process. Subsampling the data by region (i.e. taking all points in some predefined smaller window) can be dangerous if the process is not homogeneous, because using a different window could change your conclusions. Sampling the data without regard to position to get a training set has the effect of "thinning out" the observed process, and invalidates conclusions you might want to draw about, e.g., the range of correlations between marks or clustering/repulsion processes. – fabians Mar 14 '11 at 17:10
  • Yes, I know that PCA is for dimensionality reduction; this is why I was confused about how I could apply it to my dataset. The idea was to see if each event type had its own "direction", or if some "shared the same direction". But I guess I was simply thinking of correlation. – Wookai Mar 14 '11 at 20:45