
I have a giant array of integers like this:

giantArray = [2048,128,512,512,128,8,....]

The content of this array changes every day and I need to estimate its mean really quickly. I have limited processing resources for this job. Moreover, there is no guarantee that the distribution and standard deviation of the array remain the same every day. The smallest integer in the array is 8 and the largest is 2048. All other integers are powers of two.

The size of the array is about 500 million entries and I cannot afford to read it all. Therefore, I think I need to use a simple random sample (SRS) to estimate the mean.
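
For concreteness, here is a minimal sketch of the sampling step in Python, assuming the array lives in a flat binary file with random access via `numpy.memmap`; the file name, dtype, and the sample size are illustrative assumptions (the sample size is justified in the answer below):

```python
import numpy as np

# Hypothetical storage: the ~500M integers live in a flat binary file.
giant_array = np.memmap("giant_array.bin", dtype=np.int32, mode="r")

n = 62_450  # sample size; see the answer below for where this comes from
# Sampling indices with replacement, which is effectively SRS
# when n is tiny relative to ~500M entries.
idx = np.random.randint(0, len(giant_array), size=n)
sample_mean = giant_array[idx].mean()
print(sample_mean)
```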

However, since I don't know my standard deviation, I cannot calculate the minimum sample size needed for a given confidence level of 95% and a margin of error of ±8.

Do you know how I can calculate the standard deviation and then determine the minimum sample size?

Can I use this online algorithm to calculate the standard deviation while sampling, and then use it to find out if I have sampled enough?
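
Assuming the online algorithm in question is Welford's method for a running mean and variance (an assumption; the referenced algorithm is not spelled out above), a minimal sketch looks like this:

```python
class RunningStats:
    """Welford's online algorithm for a running mean and variance."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def push(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

    def std(self):
        return self.variance() ** 0.5
```

Each `push` is O(1) in time and memory, so it adds little to the per-sample cost.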

ARH
  • How many entries can you afford to read? Must they be sequential? (if so, this will not be a random sample) Note also that the maximum standard deviation would be when the data is evenly split between max and min, so it will be less than 1024 (e.g. see [here](https://stats.stackexchange.com/questions/45588/variance-of-a-bounded-random-variable)). – GeoMatt22 Apr 25 '17 at 04:55
  • @GeoMatt22, I can read randomly from this array. That is why I think SRS is doable. Moreover, there is no limit on reading entries. However, I just don't want to waste time reading entries after I reach a given confidence level. That is why I am desperate to calculate the minimum sample size that can give me, for example, 95% confidence in the mean calculation. – ARH Apr 25 '17 at 14:01
  • You can create a 95% confidence interval of the mean by taking a *single* random value from this array. It will be a very broad interval! The larger the sample size, the narrower the confidence interval will tend to be. *How narrow would you like it to be?* – whuber Apr 25 '17 at 14:52
  • @whuber, I am looking for a 95% confidence interval with a margin of error of 8. For example, the sample mean could be something like 1128 ± 8. – ARH Apr 25 '17 at 15:24
  • You can certainly accomplish that with a random sample of size $(Z_{0.975}(2048 - 8)/2/8)^2\approx 62000$. Is that small enough? If you wish to improve on this--which you very likely could do--then see https://www.jstor.org/stable/2983720?seq=1#page_scan_tab_contents. – whuber Apr 25 '17 at 17:16
  • You could compute variance online or use the worst-case upper bound of $\sigma<1024$ that I gave before (like whuber suggested). You will have to judge if the extra computation required for the online variance calculation and convergence test is worth it. – GeoMatt22 Apr 25 '17 at 17:20
  • Thanks @GeoMatt22 and whuber, I think you answered my question. 62K is small enough. However, why does the sample size not change if the population size grows? For example, should I still sample 62K entries even if my array size increased by 100x? – ARH Apr 25 '17 at 17:45
  • [This](https://stats.stackexchange.com/questions/38290/why-does-standard-error-not-involve-population-size) may help clarify? – GeoMatt22 Apr 25 '17 at 18:02
  • @GeoMatt22, I did not get that. Do you have a better reference? – ARH Apr 25 '17 at 18:21
  • If you are sampling only a small fraction of the population, there is no finite-population correction needed. The finite-population correction comes in only when the sample is *too large* relative to the population, e.g. as the linked question notes, if the sample *is* the population then there is obviously no error at all in the "estimated" mean. By the way, you should probably update *your question* to include your desired confidence interval (95%) and margin of error (+/-8), and perhaps also your maximum # samples constraint (e.g. 100k?). – GeoMatt22 Apr 25 '17 at 18:27
  • @GeoMatt22, I did edit the question. – ARH Apr 25 '17 at 18:46

1 Answer


Since the array entries $x\in[8,2048]$ are bounded, no matter what their frequency distribution is, the standard deviation is bounded by $$\sigma_x\leq\frac{x_\max-x_\min}{2}=\frac{2048-8}{2}=1020$$

So for a random sample of size $n$, the standard deviation of the sample mean is bounded by $$\sigma_\bar{x}=\frac{\sigma_x}{\sqrt{n}}\leq\frac{1020}{\sqrt{n}}$$

Assuming that $n$ is large enough that $\bar{x}$ is approximately normal, then for a 95% confidence interval of $\pm{8}$ you need $$1.96\times\sigma_\bar{x}\approx{8}\implies{n}\approx{62450}$$

So, as noted in the comments, $n=63000$ samples are enough to ensure your required bound.
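
As a quick arithmetic check of that bound (a sketch in Python; the constants are taken from the derivation above):

```python
z = 1.96                    # 0.975 quantile of the standard normal
sigma_max = (2048 - 8) / 2  # worst-case standard deviation, 1020
margin = 8                  # desired 95% CI half-width

n = (z * sigma_max / margin) ** 2
print(round(n))             # 62450
```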

You could compute the variance online to get a better estimate, but you will have to judge whether the extra computation required for the online variance calculation and convergence test is worth it.
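
A hedged sketch of what that sequential procedure could look like, combining a Welford-style running variance with the stopping rule discussed in the comments ($1.96\,\hat{\sigma}/\sqrt{n}\leq 8$); the file name, dtype, and the minimum of 30 samples before testing are assumptions:

```python
import numpy as np

Z = 1.96              # 0.975 normal quantile
MARGIN = 8.0          # target 95% CI half-width
MIN_SAMPLES = 30      # don't trust the variance estimate too early (assumption)
MAX_SAMPLES = 62_450  # worst-case bound derived above

# Hypothetical storage: flat binary file with random access via memmap.
giant_array = np.memmap("giant_array.bin", dtype=np.int32, mode="r")
N = len(giant_array)

n, mean, m2 = 0, 0.0, 0.0
while n < MAX_SAMPLES:
    x = float(giant_array[np.random.randint(N)])
    n += 1
    delta = x - mean           # Welford update of running mean/variance
    mean += delta / n
    m2 += delta * (x - mean)
    if n >= MIN_SAMPLES:
        std_err = (m2 / (n - 1)) ** 0.5 / n ** 0.5
        if Z * std_err <= MARGIN:  # stop: estimated half-width is small enough
            break

print(f"mean ~ {mean:.1f} after {n} samples")
```

If the true standard deviation is much smaller than the worst-case 1020, this typically stops well before the 62K cap, consistent with the "few thousand at most" estimate in the comments below.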

GeoMatt22
  • This is an awesome technical answer, but I think it's important to emphasize that $n = 62,450$ is a _minimum_ value. And it's a pretty small minimum. OP could probably get away with sampling a half-million on a local computer. Probably a lot more. – Tim Atreides Apr 25 '17 at 20:15
  • @TimAtreides the OP's constraints are not clear to me. (In the comments my first question was if random access was allowed, vs. sequential access e.g. in some embedded system). I would suggest that if it is *at all possible* to save all the numbers for a sample of days, then an (offline) validation study would be good to do, to give confidence in the online (production/deployed) approach. – GeoMatt22 Apr 25 '17 at 20:21
  • +1. Some comments: (1) @Tim the 62000 value is a *maximum* because it is based on the largest possible variance. The actual variance likely is much smaller. (2) I don't think you have to worry about normality, given the bounds on the possible values of the data. (3) A sequential sampling procedure easily could stop after a few dozen or few hundred observations. (4) A two-phase procedure would be routine: sample, say, 30 values; estimate the variance from those; use that to (overestimate) the sample size needed; and collect the required sample. It's highly likely to be much less than 62000. – whuber Apr 25 '17 at 20:32
  • I think I agree with @whuber that 62000 is a maximum due to overestimation of the standard deviation. I cannot afford to do more sampling. I am also interested to know if I can do less sampling with the same error margin and confidence, since I don't know what my standard deviation is every day. It might be very small and not need 62K samples. Taking each sample is not as simple as reading an entry from disk; it requires some computation in addition to disk access. I am interested to know more about the sequential sampling procedure. However, I don't know what my null hypothesis should be. – ARH Apr 26 '17 at 13:58
  • There is no null hypothesis, because this procedure is designed to estimate a value. Sequential sampling will give you the smallest *expected* sample size, but on rare occasions it could require larger samples than recommended by traditional fixed-size designs. If this is something you will be repeating over and over, a sequential sampling method would be an excellent choice. Based on the (few) numbers you gave in the question, it looks like your sample sizes would be up to a few thousand, but unlikely any greater than that. – whuber Apr 26 '17 at 14:02
  • @GeoMatt22, why do I need $\sigma_\bar{x}$? Can I just use $\sigma_x$ and plug it into this formula to calculate $n$: https://www.isixsigma.com/tools-templates/sampling-data/how-determine-sample-size-determining-sample-size/ – ARH Apr 26 '17 at 14:21
  • ARH, it is simply an alternative notation; your link just uses it without introducing a new symbol: $\sigma/\sqrt{n}$ **is** $\sigma_\bar{x}$. – GeoMatt22 Apr 26 '17 at 14:25
  • Thanks @whuber, this is interesting. Would you please elaborate on your sequential sampling method a little? Why should I start with 30 samples, and how do I know if the mean of 30 samples is good enough? – ARH Apr 26 '17 at 14:25
  • @whuber, I think my question now is: what should the stopping rule be for a sequential sampling method to solve this problem? – ARH Apr 26 '17 at 14:55
  • ARH, the stopping rule is the same as above -- $1.96\times\sigma_\bar{x}\approx{8}$ -- except you estimate the standard error $\sigma_\bar{x}\approx\hat{\sigma}/\sqrt{n}$ using your *online* estimate of the standard deviation, $\hat{\sigma}$, rather than the upper bound $\sigma_\max$. – GeoMatt22 Apr 26 '17 at 22:46