1

My company wants to track request latencies for the project I'm working on.
Specifically, the report I'm building should have the 95th and 99th percentile values of the latencies over all events.

However, there's an intermediate summarization step that occurs during the processing, and I'm not quite sure what calculations I need to do at the intermediate step to provide accurate percentiles at the final aggregation step.

The primary data is organized in a set of sessions S, where each session has a series of timestamped event pairs (e1, e2).

I need to summarize the latencies (i.e., the timestamp differences from the event pairs) into a small set of numbers (vectors are not available at the intermediate step) for each session S.

In other words, I have sessions $S_{1..n}$, where session $S_k$ has event pairs $ep_{k,1..m_k}$ (each session may contain a different number of pairs).

I want the 95th (and 99th) percentile values of the timestamp differences across all event pairs, but I can't directly access all the event pairs in the final calculation; the final calculation can only access a set of summary values calculated per session.

I can perform some arbitrary computation on the event pairs contained in each session, but I cannot have a vector of values in the session summary.
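
For concreteness, here is a minimal sketch of the pipeline shape (the names `Session`, `summarize_session`, and `final_report` are just illustrative placeholders, not anything I already have):

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Session:
    # Each pair is (t1, t2); the latency of a pair is t2 - t1.
    event_pairs: List[Tuple[float, float]]

def latencies(session: Session) -> List[float]:
    """Per-session latencies, available only at the intermediate step."""
    return [t2 - t1 for (t1, t2) in session.event_pairs]

def summarize_session(session: Session) -> Tuple[float, ...]:
    """Must return a small, fixed-size tuple of numbers -- no raw vectors.
    The open question is what to compute here."""
    ...

def final_report(summaries: List[Tuple[float, ...]]) -> Tuple[float, float]:
    """Only sees the per-session summaries, never the raw latencies,
    and should produce the overall 95th and 99th percentiles."""
    ...
```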

I hope this is clear enough; I'm a statistics novice.

Eric Brown
  • 111
  • 3
  • What do you mean by latency? Just the difference between the timestamps? In terms of your overall summary, what do you want to know about the sessions? Or what does the company want to know? I'm unclear what role the sessions play. – dankernler May 08 '18 at 20:19
  • @dankernler Yes, the latency is the timestamp difference. The company wants reports for the latency across all event pairs, but the reporting requires that each session be aggregated into a (finite, and relatively small) number of columns. I've edited the question to be (hopefully) clearer. – Eric Brown May 08 '18 at 22:37
  • Interesting. Ok, does it have to be the exact 95th percentile of all the data? I'm not sure that's possible. – dankernler May 09 '18 at 01:30
  • @dankernler I think there's some fuzziness available. Of course, now that I've edited the question, an interesting [related question](https://stats.stackexchange.com/questions/87904/is-there-a-way-to-compute-daily-percentiles-median-and-95th-using-24-hourly-pe?rq=1) shows up.... – Eric Brown May 09 '18 at 17:47
  • OK. I was thinking of just using the median of the 95th percentiles, but that should really be weighted somehow, and it could be very inaccurate if the distributions of the latencies vary significantly between sessions. Depending on your purposes, that should give a rough estimate (see the sketch after the comments). I did find [this post](https://stats.stackexchange.com/questions/87904/is-there-a-way-to-compute-daily-percentiles-median-and-95th-using-24-hourly-pe?noredirect=1&lq=1) with a very thorough answer to essentially the same question. – dankernler May 09 '18 at 17:56
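
A rough sketch of the idea in that last comment (a count-weighted median of per-session 95th percentiles; the `numpy` calls and names are only illustrative, and this gives an approximation rather than the exact percentile):

```python
import numpy as np

def session_summary(latencies):
    """Intermediate step: just two numbers per session."""
    return float(np.percentile(latencies, 95)), len(latencies)

def weighted_median(values, weights):
    """Median of `values` where each value counts `weights[i]` times."""
    order = np.argsort(values)
    values = np.asarray(values, dtype=float)[order]
    weights = np.asarray(weights, dtype=float)[order]
    cum = np.cumsum(weights)
    # First value whose cumulative weight reaches half of the total weight.
    return float(values[np.searchsorted(cum, cum[-1] / 2.0)])

def rough_p95(summaries):
    """Final step: count-weighted median of the per-session 95th percentiles."""
    p95s, counts = zip(*summaries)
    return weighted_median(p95s, counts)
```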

1 Answer

0

You can use a Q-digest or T-digest: these are fixed-size summaries that can later be merged (multiple digests aggregated into one) and used to estimate percentiles with a bounded error, where the bound depends on the size of the digest.
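
For example, a minimal sketch assuming the `tdigest` package from PyPI (its `batch_update`/`percentile` methods and digest addition; other t-digest implementations expose similar operations):

```python
from tdigest import TDigest  # pip install tdigest

def session_digest(latencies):
    """Intermediate step: build a fixed-size digest from one session's latencies."""
    d = TDigest()
    d.batch_update(latencies)
    return d

def report_percentiles(session_digests):
    """Final step: merge the per-session digests and read off the percentiles."""
    merged = TDigest()
    for d in session_digests:
        merged = merged + d
    return merged.percentile(95), merged.percentile(99)
```

Internally a digest is a bounded list of (centroid, weight) pairs, so whether it fits the "small set of numbers per session" constraint depends on how small that set has to be; the digest's size/compression setting trades summary size against percentile error.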

Pavel T
  • 101
  • 1
  • Here are some overviews https://dataorigami.net/blogs/napkin-folding/19055451-percentile-and-quantile-estimation-of-big-data-the-t-digest and https://liveramp.com/engineering/computing-distributions-of-large-datasets-with-cascading-and-the-q-digest-algorithm/ – Pavel T Aug 24 '18 at 08:37