I'm trying to estimate how many values fall within a portion of the standard deviation

Let's say I have:

A sample size of 100, an average of 50, and a standard deviation of 10.

[Image: diagram of the empirical rule]

Using the Empirical Rule, we can say that ~34 of the values should be between 50 and 60. But what if I wanted to determine how many values are between 50 and 55?

Since this is a curve, I know the data is not linear, so 34/2 is not the answer.

How can I calculate how many values fall within a fraction of a deviation?

Note: I do not fully understand mathematical symbols, so please explain the algorithm in English instead.

johnny 5
  • The empirical rule is just a more specific statement about the CDF of a standard normal distribution. **Every** cumulative distribution function is defined as $F_X(x) = \mathbb{P}(X \le x)$. If you want to know the probability that a sample is in an interval $[a,b]$, then you can use the difference of CDFs: $$\mathbb{P}(X \in [a,b])=F_X(b) - F_X(a).$$ In a sample of 100, all you know for sure is that between 0 and 100 of the values fall in any particular interval; but you can easily determine the *distribution* of this number. – Sycorax Mar 15 '21 at 02:18
  • Thanks, I appreciate the answer. I'll research the cumulative distribution function. I come mostly from a coding background, so some of the formulas are difficult for me to interpret. – johnny 5 Mar 15 '21 at 02:46

2 Answers

You're correct that the "empirical rule" that you've presented isn't sufficient to answer the question, because the interval $(50, 55]$ isn't illustrated on the diagram.

But the empirical rule is just a more specific statement of a very general fact about CDFs. For every distribution, the cumulative distribution function (CDF) is defined as $F_X(x) = \mathbb{P}(X \le x)$. If you want to know the probability that a sample is in an interval $(a,b]$, then you can use the difference of CDFs: $$\mathbb{P}(X \in (a,b])=F_X(b) - F_X(a).$$ This fact is important because it's true for any probability distribution and for any interval. In the special case where $F_X$ is a standard normal CDF, we can show that the empirical rule that you've presented just reproduces this identity.

For $\mu=50, \sigma=10$, we have $$\mathbb{P}(X \in (50,55])\approx 0.1914625 = p$$
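Since the question asks for an explanation in code-friendly terms, here is a minimal Python sketch of the CDF difference. It assumes Python 3.8+ for `statistics.NormalDist`; the helper name `interval_probability` is made up for this illustration.

```python
from statistics import NormalDist

def interval_probability(a, b, mu, sigma):
    """P(a < X <= b) for X ~ Normal(mu, sigma), computed as a difference of CDFs."""
    dist = NormalDist(mu=mu, sigma=sigma)
    return dist.cdf(b) - dist.cdf(a)

# Empirical-rule check: one standard deviation above the mean (50 to 60).
p_50_60 = interval_probability(50, 60, mu=50, sigma=10)

# The interval from the question: half a standard deviation above the mean.
p_50_55 = interval_probability(50, 55, mu=50, sigma=10)

print(f"P(50 < X <= 60) = {p_50_60:.4f}")  # 0.3413
print(f"P(50 < X <= 55) = {p_50_55:.4f}")  # 0.1915
```

The first value is the "~34%" from the empirical rule, and the second is the $p$ above, which is noticeably more than half of 34% because the curve is tallest near the mean.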

In a random sample of 100 values, all we know for sure is that the number of values $Y$ in the interval $(50, 55]$ can be 0, 1, 2, ... or 100. But not all of these cases have the same probability. Assuming that the data are drawn independently from this normal distribution, then we know that

$$Y \sim \text{Binom}(100, p)$$

and this distribution looks like this:

[Image: plot of the $\text{Binom}(100, p)$ distribution]

On average, there will be $100p\approx 19.14625$ values in $(50,55]$.
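As a rough sketch of the same idea in code, the binomial probabilities can be computed with the standard library alone. The helper name `binom_pmf` and the text "plot" are illustrative choices, not part of the original answer, and the independence assumption is the same one stated above.

```python
from math import comb
from statistics import NormalDist

n = 100  # sample size from the question
dist = NormalDist(mu=50, sigma=10)
p = dist.cdf(55) - dist.cdf(50)  # chance that a single value lands in (50, 55]

def binom_pmf(k, n, p):
    """Probability that exactly k of n independent values land in the interval."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

pmf = [binom_pmf(k, n, p) for k in range(n + 1)]

mean = sum(k * prob for k, prob in enumerate(pmf))     # equals n * p
most_likely = max(range(n + 1), key=lambda k: pmf[k])  # peak of the distribution

print(f"average count:     {mean:.2f}")
print(f"most likely count: {most_likely}")

# Crude text version of the plot: one '#' per 0.5% of probability.
for k in range(8, 33):
    print(f"{k:3d} {'#' * round(pmf[k] * 200)}")
```

Any single sample of 100 can land above or below 19; the list `pmf` says how likely each count is.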

Sycorax
  • Thanks, I appreciate the help. I know this probably isn't the proper place to ask this but do you know of a simple legend I can use with all of the math symbols and their meanings? – johnny 5 Mar 15 '21 at 23:09
  • @johnny5 https://en.wikipedia.org/wiki/Glossary_of_mathematical_symbols – Sycorax Mar 15 '21 at 23:16
  • Thanks, I apologize, I really don't understand many math symbols. I've provided a way to find the answer in a more programmatic way. TBH it's a terrible answer because I can't describe it efficiently, and I also don't really know why it works or the math behind it, but I wanted to at least try to provide an answer for anyone who is looking for the same thing but doesn't have a full comprehension of math. – johnny 5 Mar 15 '21 at 23:24

I've found a way to get the answer I was looking for. I don't know much math or many math terms, so I'll try to explain it the way I was told and understand it.

For a normal distribution, the fraction of values falling within k⋅σ of the mean is given by erf(k/sqrt(2)), where erf is the error function. For example, erf(1/sqrt(2)) = 0.682689... and erf(3/sqrt(2)) = 0.997300...

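To summarize: suppose you want the estimated number of samples between 50 (the mean) and 55.

First, calculate the fraction of a deviation that the interval covers. The portion we want lies between the mean and 55, so the value deviates from the mean by 5, and the standard deviation is 10.

So we're trying to calculate (5/10), or 1/2, of a deviation.

erf(.5/sqrt(2)) gives the fraction of values within half a deviation on both sides of the mean (45 to 55). Dividing by 2 keeps only the side above the mean, so erf(.5/sqrt(2))/2 ≈ 0.1915 is the fraction we're looking for. Multiplied by the sample size of 100, that is about 19 values.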
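For anyone who prefers code, the steps above can be sketched in Python using `math.erf` from the standard library. The function name `fraction_between_mean_and` is just a label chosen for this example, and the sketch only handles intervals that start at the mean.

```python
from math import erf, sqrt

def fraction_between_mean_and(x, mean, std_dev):
    """Fraction of a normal distribution lying between the mean and x."""
    k = abs(x - mean) / std_dev   # how many deviations x is from the mean
    both_sides = erf(k / sqrt(2)) # fraction within k deviations on both sides
    return both_sides / 2         # keep only one side of the mean

sample_size = 100
fraction = fraction_between_mean_and(55, mean=50, std_dev=10)
expected_count = sample_size * fraction

print(f"fraction between 50 and 55: {fraction:.4f}")        # 0.1915
print(f"expected count out of 100:  {expected_count:.1f}")  # 19.1
```

Calling it with 60 instead of 55 gives about 0.3413, which matches the ~34 values from the Empirical Rule.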

johnny 5
  • This answer shows that the error function can be used to compute the CDF of the normal distribution by using the close relationship between the two functions. Explaining how it works just means understanding what a CDF is, and understanding the algebra relating the two functions. https://stats.stackexchange.com/questions/187828/how-are-the-error-function-and-standard-normal-distribution-function-related – Sycorax Mar 15 '21 at 23:26
  • @Sycorax again I apologize for the ill formed answer I wrote here. One day when I understand statistics better I will update and edit the answer – johnny 5 Mar 15 '21 at 23:45
  • No need to apologize, I'm just trying to show how the content in this answer is related to more general facts about probability distributions. – Sycorax Mar 15 '21 at 23:47
  • @Sycorax Thanks, I'm just used to the other Stack Exchange sites, where if you don't write the answer perfectly you get downvoted and deleted. – johnny 5 Mar 16 '21 at 00:23