9

I often need make $(x,y)$ scatter plots that have many ($>10^5$) points. I've experimented with different ways of representing this many points that capture the distribution while getting around the messiness of actually putting a million points in an image. I've done the obvious things like thinning points, and plotting the local density of points as either contours or a colour map. I've found that thinning tends to be ineffective as the distributions I deal with are often exponentially peaked, so I just get the same problem in a narrower area, and local density representations are sensitive to the details of smoothing/sampling used to get the density field.

Lately I've been thinking about trying a different strategy, but I haven't been able to come up with an algorithm to make it work (which is where this question comes in). This sort of plot, which I've heard cosmologists colloquially call a 'banana plot', is the inspiration, and I think is fairly common in Bayesian analysis:

likelihood contours

Here the contours show the $1\sigma$ and $2\sigma$ confidence intervals for a combination of two parameters. In this application a 2D probability density function is generated, so making the plot is easy. I want to do something similar.

I'd like to draw contours enclosing 99%, 95% and 68% (and more generally any percentage) of the data points in an $(x,y)$ dataset. Obviously such a contour is not unique, so an extra constraint is needed. I think something like minimising the area or maximising the average density of points inside the contour should give the desired effect. Does anyone know an algorithm that would produce such a contour? I eventually want to make a plot using matplotlib, so bonus points if you happen to know such an algorithm that is implemented in python, but I'll do the implementation myself if I have to.

Not terribly familiar with the tags here on CV.SE, so if I'm missing any obvious ones please suggest/edit them in.

Kyle
  • 263
  • 1
  • 3
  • 7
  • You may be interested in [bag plots](http://amstat.tandfonline.com/doi/abs/10.1080/00031305.1999.10474494#.UhZUDdK-qzk). – Andy W Aug 22 '13 at 18:13
  • 2
    Also see [highest density region](http://robjhyndman.com/papers/computing-and-graphing-highest-density-regions/) (if it is ok that the contour is not entirely enclosed in one polygon). – Andy W Aug 22 '13 at 18:44
  • 2
    A closely related question is asked and answered at http://stats.stackexchange.com/questions/63447/integrating-kernel-density-estimator-in-2d. In this case, if you are not firmly wedded to the idea of imposing an extra constraint explicitly, an answer is easy to come by: just compute a kernel density estimate on a fine grid and select the largest 68%, 95%, or 99% from the cumulative sum of its sorted values: they form a region within the grid. A smoothed version of the boundary of that region will do. – whuber Aug 22 '13 at 19:56
  • @whuber I've played around with density methods (similar to the one you linked) and they work alright, but I wanted to try something that is independent of any smoothing kernel choice or similar. – Kyle Aug 22 '13 at 20:49
  • @AndyW looks interesting, do you happen to know if the paper is available anywhere without a paywall? Otherwise I can probably access it next time I'm in my university library I guess. – Kyle Aug 22 '13 at 20:51
  • You might want to view your choice of kernel as tantamount to satisfying an additional constraint. BTW, the constraints you suggest won't work: the infimum of the areas is always zero and the supremum of the densities within those areas is, of course, infinite. Look instead to constraining the tortuosities of the contour regions (which is more or less what is varied when you change kernel widths). – whuber Aug 22 '13 at 20:54
  • @whuber Hm that's a good point. Might be that the best answer is just to smooth, but I'm still curious along these lines. Maybe the constraint should be to find a non-self-intersecting contour which links between points with straight line segments and minimises... not area because that would give some sort of very spiky shape I suspect, but... something? – Kyle Aug 22 '13 at 21:09
  • 1
    I hear you. My concern is that this way of framing the question is likely to lead to huge computational difficulties. Controlling a density bandwidth (or something in that spirit) is easy to do, computationally reasonable, and should be sufficiently flexible to meet most needs. – whuber Aug 22 '13 at 21:18
  • I just found a copy of the [bagplot paper](http://venus.unive.it/romanaz/ada2/bagplot.pdf), and each are implemented in open-source R libraries. See [aplpack](http://rss.acs.unt.edu/Rdoc/library/aplpack/html/bagplot.html) and [hdrcde](http://rss.acs.unt.edu/Rdoc/library/hdrcde/html/hdr.html). The approach whuber is describing is synonymous with the highest density region by Hyndman. The bagplot is a bit different though, and follows from defining the halfspace depth of points. – Andy W Aug 23 '13 at 13:14

0 Answers0