
Not a stats major here. The RAND Corporation recently released a study on the US Army's junior enlisted ranks. Here's an excerpt from Table A.1:

Variable                             Mean   Min, Max  Standard Deviation
36-month failure-to-adapt attrition  0.133  0, 1      0.340
Fast promotion to E-5                0.318  0, 1      0.466
Demotion from E-5 (among E-5s)       0.061  0, 1      0.239
48-month attrition                   0.173  0, 1      0.378
36-month continuation                0.772  0, 1      0.420
48-month continuation                0.541  0, 1      0.498

Several lines in the table show a standard deviation that, when added to or subtracted from the mean, produces a value outside the variable's 0-to-1 bounds. Maybe I'm misinterpreting this, but how is that possible?

Erich
  • Exceeds which bounds? They all lie between 0 and 1. – adityar Oct 18 '18 at 15:16
  • as added to or subtracted from the mean – Erich Oct 18 '18 at 15:18
  • I still don't understand. Do you mean why is the mean minus the standard deviation below zero? – adityar Oct 18 '18 at 15:20
  • 1
    For the simplest case calculate the standard deviation of a data-set size 2 containing just 0 and 1. You should find the sd is 0.707.So the calculation is fine but something else must be puzzling you. Can you edit the question to clarify what is your concern? – mdewey Oct 18 '18 at 15:22
  • 1
    Hint: let $p_i$ denote the means (where $i$ indexes the six lines). Have you noticed that the values in the "Standard Deviation" column equal $\sqrt{p_i(1-p_i)}$? For more about this, investigate the [Bernoulli distribution.](https://en.wikipedia.org/wiki/Bernoulli_distribution#Variance) – whuber Oct 18 '18 at 15:59
  • It's not clear why you think that it should not be the case that mean+sd and mean-sd should lay within the range of the data - that is, I don't understand what prompts the question. It's demonstrably possible and your question contains examples where it's the case. You often see it with skewed variates, particularly with two point distributions, but also with continuous ones (consider, for example, a gamma distribution with shape parameter <1; even though no value from that population can be negative, $\mu-\sigma$ is negative.) ... – Glen_b Oct 19 '18 at 01:24
  • 1
    You may find the discussion here relevant: https://stats.stackexchange.com/q/124450/805 – Glen_b Oct 19 '18 at 01:32
  • sorry for the delayed response. it is quite possible i just have a fundamental misunderstanding. those links are indeed helpful to correct this. my follow-on question might be, then, why sd matters at all in a binary data set -- why not just use a simpler percentage? – Erich Oct 25 '18 at 18:10
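The two comment hints above can be checked numerically in R. This is a sketch, not output from the study's data: the vector of means below is transcribed from the table excerpt (omitting the demotion row), and the last line reproduces mdewey's two-point example.

```r
# whuber's hint: a 0/1 (Bernoulli) variable with mean p has
# population standard deviation sqrt(p * (1 - p)).
p <- c(0.133, 0.318, 0.173, 0.772, 0.541)  # means from the table excerpt
round(sqrt(p * (1 - p)), 3)                # 0.340 0.466 0.378 0.420 0.498

# mdewey's example: the sample sd of {0, 1} is 1/sqrt(2)
sd(c(0, 1))                                # 0.7071068
```

The computed values match the "Standard Deviation" column of the excerpt, which is what suggests the variables are all binary indicators.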

1 Answer


If I understand the OP's question correctly, the gist of the problem is this:

Suppose we have a variable $X$ with mean $\mu$, standard deviation $\sigma$, and range $[a, b]$. Suppose further that we estimate the mean and standard deviation of $X$ and denote these estimates by $\bar{x}$ and $s$. Is it possible for $\bar{x} \pm s$ to produce an interval with at least one endpoint outside the range $[a, b]$?

The answer to that question is yes, it is possible.

I present a small reproducible example below. We start with a variable that takes values on the range $[1, 10]$ but is skewed, with more observations near 1 than near 10.

# Skewed fictional data: values on [1, 10], bunched near the low end
fictional.data <- c(1, 2, 3, 4, 10)
mean(fictional.data)                       # 4
sd(fictional.data)                         # 3.535534
mean(fictional.data) - sd(fictional.data)  # 0.4644661

The mean of this fictional data is 4 and the standard deviation is roughly 3.5. The mean minus the standard deviation is therefore less than 1, and so falls outside the range of the variable.

So why does this happen? Each observation of $X$ contributes to the mean in direct proportion to its value, but to the variance quadratically, through its squared deviation. So while outliers (or skewness) affect the mean, they affect the variance even more.
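The binary variables in the question's table behave the same way. As a small sketch in the same spirit (the counts below are invented to reproduce the first row's mean of 0.133; they are not the study's data):

```r
# 1000 hypothetical 0/1 observations with 133 ones, giving a mean of 0.133
binary.data <- rep(c(1, 0), times = c(133, 867))
mean(binary.data)                        # 0.133
sd(binary.data)                          # ~0.340, close to sqrt(0.133 * 0.867)
mean(binary.data) - sd(binary.data)      # ~ -0.207, below the minimum of 0
```

Because most of the mass sits at 0, the mean is small, yet the spread between the two points keeps the standard deviation large enough that $\bar{x} - s$ drops below the variable's minimum.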

Phil