How to best display the distribution of a large dataset with many outliers?

Question

I have a very large data set (~300,000 data points) and a subset of it (6,000 data points). Both record the difference in travel time (in seconds) of agents before and after a road closure. I want to show how the two distributions differ.

However, the data set is so large that there are still so many outliers that nothing can be read from a boxplot. A simple table would of course be an option, but I believe that a graph, done right, would show the difference more clearly.

For the analysis, the really interesting part is the range of differences between -1000 and 1000 seconds. So I wonder: is it OK to simply truncate the data set, or is it more appropriate to transform the data?
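
To make the truncation option concrete, here is a minimal sketch of what restricting the view to that window could look like while still reporting how much of the data falls outside it. The function name `windowed_histogram`, the bin count and the use of NumPy are assumptions for illustration only, not something I have settled on.

```python
import numpy as np

def windowed_histogram(diff, lo=-1000.0, hi=1000.0, bins=50):
    """Histogram of the values inside [lo, hi], plus the share of points left out.

    `diff` holds travel-time differences in seconds (hypothetical input).
    """
    diff = np.asarray(diff, dtype=float)
    inside = (diff >= lo) & (diff <= hi)
    counts, edges = np.histogram(diff[inside], bins=bins, range=(lo, hi))
    # Normalise by ALL points, so the large set and the subset stay comparable
    # even if they lose different shares of their data to the tails.
    share_per_bin = counts / diff.size
    share_below = np.mean(diff < lo)  # fraction cut off on the left
    share_above = np.mean(diff > hi)  # fraction cut off on the right
    return share_per_bin, edges, share_below, share_above
```

Calling this once for the full data set and once for the subset would give two sets of bin shares on the same edges, and the two tail shares could be stated next to the plot so that the truncation is not hidden.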

One approach would be to create 100 or 1,000 equal-sized buckets based on frequency and then plot the average (median) value of each bucket. A second approach would use the same logic but, instead of using frequency, create the buckets based on the observed values, like a weighted set of buckets. This is commonly done in finance. Yet another approach would be to transform the data by compressing the tails using, e.g., the natural log (for positive-valued data only). The inverse hyperbolic sine function would compress the tails of both positive- and negative-valued data. Do a search for *IHS*. — , Jan 17 '20 at 17:29
I believe your question is largely covered by the accepted answer here: [Boxplot equivalent for heavy-tailed distributions?](https://stats.stackexchange.com/a/63542/805). If you have any issues left over after that, please post a question that is clearly distinguished from the issues covered there. — Glen_b, Jan 17 '20 at 23:47
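
Two of the suggestions in the comments above, the inverse hyperbolic sine (IHS) transform and the equal-frequency buckets summarised by their medians, can be sketched roughly as follows. This is only an illustration assuming NumPy; the helper names `ihs` and `equal_frequency_medians` and the `scale` parameter are hypothetical and do not come from the comments.

```python
import numpy as np

def ihs(x, scale=1.0):
    """Inverse hyperbolic sine transform: asinh(x / scale).

    Roughly linear near zero and logarithmic in both tails, so it accepts
    negative differences, unlike the natural log. `scale` is an assumed
    tuning knob that sets where the compression starts.
    """
    return np.arcsinh(np.asarray(x, dtype=float) / scale)

def equal_frequency_medians(x, n_buckets=100):
    """Sort the data, split it into buckets of (nearly) equal count,
    and return the median of each bucket."""
    x = np.sort(np.asarray(x, dtype=float))
    buckets = np.array_split(x, n_buckets)
    # Skip empty buckets, which occur when n_buckets exceeds the data size.
    return np.array([np.median(b) for b in buckets if b.size > 0])
```

Plotting `equal_frequency_medians(full)` against `equal_frequency_medians(subset)` with the same `n_buckets` amounts to a quantile-quantile comparison of the two sets, and applying `ihs` to the values before drawing a boxplot or histogram would pull in both tails without dropping any points.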
