
I have a source that generates computer files varying between 20,000 and 40,000 bytes in length. I think (and strongly assume) that the lengths follow a Normal distribution. They average roughly 30,000 bytes, but I haven't determined this exactly because that depends on the answer to this question.

I want to say that only 0.001% of the files are less than X bytes long.

What is X (in bytes) or how do I calculate it?

Paul Uszak
  • This is unclear in several senses. (1) If you want to exclude 99.999% of values, then your interval includes 0.001% of values. That is a very narrow interval. Do you mean **exclude**? (2) The stated interval applies when data are exactly normal (Gaussian) and not in general, and otherwise cannot be trusted at all when you are looking so far out in the tails. (3) If you are looking at sample sizes $\sim 10$ then fractions of 1/100000 and 99999/100000 are either negligible or essentially 1. (It's a numerical slip to say 1 in a million.) – Nick Cox Apr 29 '15 at 13:09
  • Statistical people say "sample size" where you say "number of samples". That is common usage for scientists and others working with (say) soil, water, blood, etc. That is easy to clarify. – Nick Cox Apr 29 '15 at 13:10
  • I wonder if you are confusing confidence intervals and fraction of the data within certain intervals. – Nick Cox Apr 29 '15 at 13:11
  • It is unclear whether you seek a *tolerance interval* or a *prediction interval.* I suspect it's the former, but there are several types of tolerance interval. Both intervals, being estimated from data, are uncertain. A tolerance interval is intended to estimate an interval of values that comprise a specified proportion of the population. It can be one-sided or two-sided, symmetric or not. It can be constructed to have a minimally high chance of actually including the specified proportion, or to include that proportion on average, or to have some chance of including it. What do you want? – whuber Apr 29 '15 at 13:59
  • While the edits have definitely improved the question, I'm still not sure it's as clear as it could be. However, I'm going to vote for reopening. – Glen_b May 01 '15 at 09:45
  • Glen_b, thanks for re-opening the question. I'm struggling to ask the question in a format that's acceptable. I want to know the value of X in bytes so that only 0.001% of files are shorter... – Paul Uszak May 04 '15 at 21:59
  • If you state your problem differently: what is the probability of certain extreme outcomes given your data, then you can use [Extreme Value Theory](http://en.wikipedia.org/wiki/Extreme_value_theory)-based methods. – Tim May 04 '15 at 22:28
  • @PaulUszak That simply rearranges a sentence that's already in your question; it doesn't clarify what you're computing. Can you read about the definitions of tolerance and prediction intervals in whuber's answer [here](http://stats.stackexchange.com/questions/26702/prediction-and-tolerance-intervals) to confirm whether you want one or the other (or possibly neither)? I suspect you may want a tolerance interval but it's still not 100% clear. If we can't clarify this it may have to be put on hold again. – Glen_b May 04 '15 at 23:17
  • Your question title and question contents still seem at odds with each other. – Nick Cox May 05 '15 at 00:10

1 Answer


> I want to say that only 0.001% of the files are less than X bytes long.

Count the number of files that are less than X bytes long. If you have N files, of which k are less than X bytes long, then you can claim: "$100\frac{k}{N}$ percent of files are less than X bytes long".
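
As a minimal sketch of this counting approach in Python (the file sizes and the 25,000-byte cutoff here are made-up placeholders, not values from the question):

```python
# Empirical fraction of files shorter than a cutoff X (hypothetical values).
file_sizes = [29_841, 31_022, 27_560, 33_410, 24_995]  # byte lengths of your N files
X = 25_000  # candidate cutoff in bytes

k = sum(size < X for size in file_sizes)  # number of files shorter than X
N = len(file_sizes)
print(f"{100 * k / N:.3f}% of files are less than {X} bytes long")
```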

Alternatively, find the 0.001th percentile directly. For instance, if you have 1,000,000 files, find the 10th-smallest file; say it is 21kB long, then claim "0.001% of files are less than 21kB long".
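
A sketch of the empirical-percentile route, assuming `numpy` is available; the data here are simulated for illustration, and in practice you would substitute your own array of file sizes:

```python
import numpy as np

# Hypothetical file sizes in bytes; replace with your observed data.
rng = np.random.default_rng(0)
file_sizes = rng.normal(loc=30_000, scale=3_000, size=1_000_000)

# Empirical 0.001th percentile: roughly the 10th-smallest of a million files.
X = np.percentile(file_sizes, 0.001)  # q is given in percent
print(f"0.001% of files are less than {X:.0f} bytes long")
```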

There are two things to be aware of. If you have fewer than 100,000 files, you obviously can't calculate the 0.001th percentile empirically. The "work-around" is to fit a normal distribution to your data, then estimate the 0.001th percentile parametrically. For instance, suppose you find that the average file size is 30kB and the standard deviation is 3kB. You can now compute any percentile you wish: the 0.001% point of a normal distribution lies about 4.26 standard deviations below the mean, so here X ≈ 30kB − 4.26 × 3kB ≈ 17kB. Of course, anything as far out as 0.001% will not be reliable if you have fewer than 100,000 files :)
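
A sketch of the parametric version, assuming `scipy` is available and using the example fit above (mean 30kB, standard deviation 3kB) rather than any real data:

```python
from scipy.stats import norm

mean_size = 30_000  # fitted mean in bytes (example value from above)
sd_size = 3_000     # fitted standard deviation in bytes (example value)

# 0.001% quantile of the fitted normal: about 4.26 standard deviations below the mean.
X = norm.ppf(0.00001, loc=mean_size, scale=sd_size)
print(f"Parametric estimate: X is about {X:.0f} bytes")  # roughly 17,200 bytes
```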

The second issue is how "normal" your data actually are. For the first two, empirical, methods it doesn't matter. For the last, parametric, method it matters a lot: since you're looking far out in the tail of the distribution, the estimate is only as good as your distributional assumption.
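
One simple way to eyeball that assumption is a Q-Q plot against a fitted normal; this sketch assumes `scipy` and `matplotlib` are installed and again simulates the data for illustration:

```python
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

# file_sizes would be your observed byte lengths; simulated here for illustration.
file_sizes = np.random.default_rng(1).normal(30_000, 3_000, size=5_000)

# Q-Q plot: points that drift away from the line in the lower-left corner mean the
# left tail is not normal, which is exactly where the 0.001% estimate lives.
stats.probplot(file_sizes, dist="norm", plot=plt)
plt.show()
```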

In finance, there's a technique called [value-at-risk (VaR)](https://en.wikipedia.org/wiki/Value_at_risk), which is very similar in setup to your problem, btw. Take a look at the link, and you will find a ton of methods that could answer your question. In finance people fall into a similar trap: they calculate parametric VaR at a very high confidence level without a large enough sample, e.g. 0.1% with fewer than 1000 observations.

Finally, in finance there's also something called CVaR, or conditional VaR. The idea is to calculate the average size of the files that fall below X (equivalently, below the $\alpha$-percentile). The claim then reads "files below the 0.001th percentile have an average size of Y" or "files that are X bytes or smaller are on average Y bytes long".
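
A minimal sketch of that conditional-tail summary, reusing the same kind of hypothetical `numpy` array as in the earlier snippets:

```python
import numpy as np

# Hypothetical file sizes in bytes; replace with your observed data.
file_sizes = np.random.default_rng(2).normal(30_000, 3_000, size=1_000_000)

# Cutoff at the empirical 0.001th percentile, then average the files at or below it.
X = np.percentile(file_sizes, 0.001)
tail_mean = file_sizes[file_sizes <= X].mean()
print(f"Files of {X:.0f} bytes or less are on average {tail_mean:.0f} bytes long")
```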

Aksakal
  • you relate 100,000 files to 0.001% twice in your answer. I'm going to infer that this is the answer to the question. – Paul Uszak May 15 '15 at 22:16