I'm dealing with a very large data set. Is there a way to create a boxplot from one portion of the data, then load the next portion and just update the values of the existing boxplot?

As requested in the comments, I'd like to clarify that the question is about updating the underlying statistics for the features of a boxplot (hinges, whiskers, etc.) when a new data batch is available for the sample.

ravl1084
  • Please add a [reproducible example](http://stackoverflow.com/q/5963269/1217536) for people to work with. – gung - Reinstate Monica Aug 07 '16 at 00:00
  • I'm voting to close this question as off-topic because it is about how to use R without a reproducible example. – gung - Reinstate Monica Aug 07 '16 at 00:00
  • As it stands this is probably going to be off topic. However, there's an important underlying question that seems to be pretty clearly on-topic -- can you update the statistics from which the boxplot is drawn (i.e. the median, hinges, inner fences/whisker ends, list of points outside the inner fences), when the data can be loaded only a piece at a time. If you'd consider editing your question, that would be a great one for our site. – Glen_b Aug 07 '16 at 02:46

1 Answer

If it's already plotted but you want to redraw it rather than plot beside it, you'll need a new plot. I presume that what you really need is to update the calculations by which the plot is drawn without having all the data available at one time.

I'll discuss an approach as a general algorithm (but parenthetically mention a couple of specific hints relating to R implementation; similar considerations will apply in many other languages).

If you can make multiple passes through those portions of data (or at least know beforehand good bounds on the variable over all portions), then certainly something can be done.

Let's ignore issues like anti-aliasing and imagine we plot purely in monochrome; then our device is limited to some resolution -- however in practice we can probably go considerably coarser, since the eye is unlikely to discern much finer than some moderate number of positions.

Either way, let's say we would like to have a resolution of $M$ (e.g. 1000) "pixels". (1000 would probably be four times as many as there would be any practical purpose to having, by the way.)

Since the maximum and minimum are represented on the plot, we need a scale that will include them. So we would need to pass through the data to find (or otherwise bound) the maximum and minimum, from which we would then compute the boundaries of our scale.
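As a rough sketch of that first pass in R, assuming the portions can be iterated over one at a time: here `chunks` is a hypothetical stand-in (a list of simulated vectors) for whatever actually loads each portion, and `lo`, `hi`, `M` and `breaks` are illustrative names.

```r
# Hypothetical stand-in for "load the next portion": a list of numeric vectors.
set.seed(1)
chunks <- split(rnorm(10000), rep(1:10, each = 1000))

# Running minimum and maximum, updated one portion at a time.
lo <- Inf
hi <- -Inf
for (x in chunks) {
  r <- range(x, na.rm = TRUE)
  lo <- min(lo, r[1])
  hi <- max(hi, r[2])
}

# M + 1 equally spaced boundaries give M bins covering [lo, hi].
M <- 1000
breaks <- seq(lo, hi, length.out = M + 1)
```

Only two numbers are carried between portions, so this pass needs essentially no memory beyond the portion currently loaded.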

We then create an integer vector with the required number of bins (one per notional "pixel" position) and pass through the data constructing a histogram (i.e. at each data value we add +1 to a bin-count if we find a point within its bin boundaries).

(In R you could actually use hist to do this by passing the precomputed bin boundaries (breaks) to it along with the given data portion, then accumulating the bin-counts for that portion into the overall bin-counts.)
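A minimal sketch of that accumulation, again using a hypothetical list `chunks` of simulated portions in place of real data loading. The bounds are taken from a quick first pass here; known bounds would do just as well.

```r
set.seed(1)
chunks <- split(rnorm(10000), rep(1:10, each = 1000))  # hypothetical data portions

# First pass (or prior knowledge): overall bounds, then M equal-width bins.
bounds <- range(sapply(chunks, range))
M <- 1000
breaks <- seq(bounds[1], bounds[2], length.out = M + 1)

# Second pass: add each portion's bin counts to the running totals.
# (Use numeric(M) instead if any bin could exceed .Machine$integer.max.)
counts <- integer(M)
for (x in chunks) {
  counts <- counts + hist(x, breaks = breaks, plot = FALSE)$counts
}
```

With `plot = FALSE`, hist only returns the binning results, so nothing is drawn during the passes.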

From the histogram we can then identify the bins within which each required statistic (like the median and the hinges) is located and then from those find the fences, whisker ends, points outside the fences and so on, by treating all the observations in a bin as occurring at the bin-center.
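One way this step might look in R is sketched below. The helper name `binned_boxplot_stats` and its arguments are made up for illustration; it places the hinges by the same rank rule as fivenum and uses the usual 1.5 × IQR fences, but that is one reasonable reading of the description rather than the only one. The small `counts` vector is toy data.

```r
# Hypothetical helper: boxplot statistics from bin counts, treating every
# observation in a bin as sitting at the bin centre.
binned_boxplot_stats <- function(counts, breaks, coef = 1.5) {
  mids <- (head(breaks, -1) + tail(breaks, -1)) / 2
  n <- sum(counts)
  cum <- cumsum(counts)

  # Bin centre holding the k-th smallest observation (k may be a vector).
  value_at <- function(k) {
    mids[vapply(k, function(r) which(cum >= r)[1], integer(1))]
  }

  # Ranks of lower hinge, median, upper hinge, following fivenum()'s rule.
  n4 <- floor((n + 3) / 2) / 2
  d <- c(n4, (n + 1) / 2, n + 1 - n4)
  q <- 0.5 * (value_at(floor(d)) + value_at(ceiling(d)))

  # Inner fences, then whisker ends = most extreme occupied bins inside them.
  iqr <- q[3] - q[1]
  occupied <- counts > 0
  inside <- mids >= q[1] - coef * iqr & mids <= q[3] + coef * iqr
  whisk <- range(mids[occupied & inside])
  out_bins <- occupied & !inside

  list(stats = c(whisk[1], q, whisk[2]), n = n,
       out = rep(mids[out_bins], counts[out_bins]))
}

# Toy example: 10 unit-width bins on [0, 10] holding 15 observations.
breaks <- 0:10
counts <- c(1, 0, 0, 2, 3, 4, 3, 2, 0, 0)
res <- binned_boxplot_stats(counts, breaks)
```

On this toy input the result agrees with calling boxplot.stats on the bin centres repeated by their counts, which is a convenient check for small cases.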

(If the variable is actually on a discrete lattice, you may be able to use many fewer bins, centered at the possible values.)
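For the lattice case, a small sketch assuming the possible values are known in advance (here the hypothetical set 0 to 20, with simulated portions in `chunks`): each value gets its own bin, so the counts are exact rather than approximated by bin centres.

```r
set.seed(1)
vals <- 0:20   # assumed known lattice of possible values
chunks <- replicate(5, sample(vals, 2000, replace = TRUE), simplify = FALSE)

# One bin per possible value; accumulate counts portion by portion.
counts <- integer(length(vals))
for (x in chunks) {
  counts <- counts + tabulate(match(x, vals), nbins = length(vals))
}
```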

We can then plot the resulting boxplot. This could work with almost any size of data set as long as you can count the number of observations in a bin.

(In R, an integer count can go up to 2147483647L = $2^{31}-1$, but you could consider modifying it to work with double precision floating point if you needed really big numbers.
Note that in R boxplot.default is the default boxplot function; it in turn calls boxplot.stats to compute the things you would calculate above and bxp to actually produce the plot. One could simply adapt that default boxplot code to work through the data portions in lieu of the call to boxplot.stats.)
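As an alternative to editing boxplot.default itself, a sketch of handing precomputed summaries straight to bxp: the list below mirrors the components that boxplot returns (stats, n, conf, out, group, names), and the numbers are hypothetical values standing in for the output of the binned calculation.

```r
# Hypothetical summary values, as might come out of the binned calculation:
# lower whisker, lower hinge, median, upper hinge, upper whisker.
five <- c(3.5, 4.5, 5.5, 6.5, 7.5)
n_obs <- 15
outliers <- 0.5

z <- list(
  stats = matrix(five, ncol = 1),
  n     = n_obs,
  # Notch limits as boxplot.stats computes them (only used if notch = TRUE).
  conf  = matrix(five[3] + c(-1.58, 1.58) * (five[4] - five[2]) / sqrt(n_obs),
                 ncol = 1),
  out   = outliers,
  group = rep(1, length(outliers)),
  names = "x"
)
bxp(z)
```

Several variables or groups can be drawn side by side by giving `stats` and `conf` one column per box and using `group` to say which box each outlier belongs to.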

Glen_b