
I want to compare the variability of a time series in two different periods using box-and-whisker plots.

My time series covers the period 1901-2010. I want to see the variability in the post-1970 period with respect to the pre-1970 period.

Is it the right approach to compare the box plot of 70 years of data (1901-1970) with the box plot of 40 years of data (1971-2010)? Since the former series has more data points, might its range be larger for that reason alone?

or

Should I compare the box plot for 1931-1970 with the box plot for 1971-2010, so that both datasets have the same length?

Edit-1: I am dealing with rainfall data. I want to see the variation in extreme rainfall events and in seasonal rainfall before and after 1970, to find how the rainfall pattern has changed in the recent period.

Edit-2: There are no zero values in the data, as I am picking the extreme values at the 95th percentile from each year. I am also making a time series of total annual rainfall, which is again a nonzero number. I divide the year into seasons according to the monsoon: monsoon (the major rainfall months), post-monsoon and pre-monsoon months.
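The sample-size concern in the question (70 years versus 40 years) can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the gamma distribution and its parameters are a hypothetical stand-in for annual rainfall, not the actual data, and `mean_spread` and `range_and_iqr` are made-up helper names. It draws many samples of 70 and of 40 values from the same distribution and compares the average range with the average interquartile range (the box of a box plot).

```python
import random
import statistics


def range_and_iqr(values):
    """Return (range, interquartile range) of one sample."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    return max(values) - min(values), q3 - q1


def mean_spread(sample_size, n_trials=2000, seed=0):
    """Average range and IQR over many synthetic samples of a given size."""
    rng = random.Random(seed)
    ranges, iqrs = [], []
    for _ in range(n_trials):
        # Hypothetical stand-in for annual rainfall: gamma-distributed,
        # strictly positive. Shape and scale are arbitrary assumptions.
        sample = [rng.gammavariate(4.0, 250.0) for _ in range(sample_size)]
        sample_range, sample_iqr = range_and_iqr(sample)
        ranges.append(sample_range)
        iqrs.append(sample_iqr)
    return statistics.mean(ranges), statistics.mean(iqrs)


range_70, iqr_70 = mean_spread(70)
range_40, iqr_40 = mean_spread(40)
print(f"n=70: mean range {range_70:.0f}, mean IQR {iqr_70:.0f}")
print(f"n=40: mean range {range_40:.0f}, mean IQR {iqr_40:.0f}")
```

Under these assumptions, the range should tend to grow with the number of years even though the underlying distribution is identical, while the interquartile range should stay roughly the same. This suggests that unequal period lengths mainly affect the whiskers and outlying points rather than the box itself.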

dSb
  • Welcome to CV. A boxplot would provide a visual summary of the distributions. This is useful as exploratory work. A better metric that would be comparable across periods would be the coefficient of variation, calculated as the standard deviation divided by the mean. – Mike Hunter Mar 29 '16 at 10:14
  • The problem you raise I regard as secondary. It's true that sample size has an effect on what range is expected but that in itself need not stop the box plots being helpful. A larger concern is that the box plots won't give much detail. You have scope for much more informative displays, e.g. quantile-quantile plots comparing the distributions. – Nick Cox Mar 29 '16 at 10:26
  • The recommendation from @DJohnson to look at coefficient of variation is I think pertinent if (and only if) your variable is entirely positive and it tends to vary multiplicatively rather than additively, but you haven't told us that either of those applies. See http://stats.stackexchange.com/questions/118497/how-to-interpret-the-coefficient-of-variation – Nick Cox Mar 29 '16 at 10:27
  • Why chop the series into blocks anyway? You can look at any measure of variability you like within windows that move across the data. Unless something dramatic happened in 1931 or 1970/1971, blocks are arbitrary. – Nick Cox Mar 29 '16 at 10:28
  • @NickCox's comment makes an excellent point, and it would be helpful to know what your data look like. In the absence of additional information, there are many measures of dispersion that are dimensionless and scale invariant. One metric that seems particularly promising is *entropy*. This CV thread has an extended discussion that compares results for discrete vs continuous versions. http://stats.stackexchange.com/questions/73891/why-does-entropy-increase-with-dispersion-for-continuous-but-not-for-discrete-di – Mike Hunter Mar 29 '16 at 11:05
  • Thanks, all, for your suggestions. I am dealing with rainfall data, so there are no negative values. I have updated the question with additional information regarding the dataset. – dSb Mar 29 '16 at 11:17
  • Original data are yearly, monthly, daily, ...? Are you looking for extremes within a distribution or is your distribution extremes only? Any zeros in the data? How do you define seasons? You didn't explain your compulsion to divide the data into blocks: broadly speaking, climate distributions don't change by jumps at distinct points. – Nick Cox Mar 29 '16 at 11:27
  • @NickCox Please see the Edit-2 – dSb Mar 29 '16 at 19:21
  • Thanks for the extra detail, but many of my questions remain unanswered, e.g. the nature of the original data. I can't see box plots as being useful for the top 5% of each group; their interpretation is quite awkward and I see absolutely no advantage in using them over plotting all the extremes directly. – Nick Cox Mar 29 '16 at 19:35
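Two suggestions from the comments, the coefficient of variation and moving windows instead of fixed blocks, can be combined in a short sketch. Everything here is an assumption for illustration: the annual totals are synthetic gamma draws rather than the real rainfall series, the 30-year window is arbitrary, and `rolling_cv` is a hypothetical helper name. As noted in the comments, the coefficient of variation is arguably appropriate only for strictly positive data that vary multiplicatively.

```python
import random
import statistics


def rolling_cv(values, window):
    """Coefficient of variation (sample SD / mean) in each moving window."""
    result = []
    for start in range(len(values) - window + 1):
        chunk = values[start:start + window]
        result.append(statistics.stdev(chunk) / statistics.mean(chunk))
    return result


rng = random.Random(1)
years = list(range(1901, 2011))
# Hypothetical annual rainfall totals; replace with the real series.
annual_total = [rng.gammavariate(4.0, 250.0) for _ in years]

window = 30  # window length in years (arbitrary choice)
cv = rolling_cv(annual_total, window)

# Label each window by its final year and print every tenth one.
for end_year, value in zip(years[window - 1::10], cv[::10]):
    print(end_year, round(value, 3))
```

Plotting `cv` against the window's end year would show whether variability drifts gradually or shifts around 1970, without committing to a block boundary in advance.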

0 Answers