9

A simple question. I know in theory, it is possible to calculate standard deviation for two numbers. I am wondering if it is plausible to do that. My objective is to compare two arbitrary time series data for the same phenomenon and plot mean and standard deviation as error bars for every time point. I know that you could compare the two time series by taking Pearson correlation and such, but I want to compare how much the absolute values were in agreement at every time point. Any insights will be appreciated.

Update: Thank you for the answers. Let us forget about the time series. It is an unnecessary complication. My question is more fundamental. I am doing a biological experiment to measure a biologically relevant quantity, say concentration of a chemical in my cells. Ideally, I would do 3 or 5 or some number of replicates of my experiment to get an estimate of mean and standard deviation. But due to time limitation, complexity of my experiment and costs involved, I can only do two replicates. Now, I end up with two estimates of concentration. No one questioned me when I took the mean of these two quantities. But people were uncomfortable when I calculated the standard deviation. I could understand their concern but I want to get more insights into why it is ok or not ok to take standard deviation in this case? If it is not ok, what are my options?

Mr K
  • If your data is Normally distributed, then don't forget to apply the necessary multiplicative factor to the sample standard deviation to get, say, a 95% two-sided confidence interval. That is based on a Student t distribution with 1 degree of freedom, and is a whopping 12.71, in contrast to the value of 1.96 to which fans of the Normal distribution are so accustomed. – Mark L. Stone Aug 16 '16 at 22:37
  • I think my comment above, from before your update, is getting to the nub of the matter. Try running that by the uncomfortable people. – Mark L. Stone Aug 16 '16 at 23:17
  • So you mean that the estimate of std dev should be multiplied by 12.71 to get the error bars? – Mr K Aug 16 '16 at 23:43
  • Yes, 12.71, not 1.96. Of course, the error bars will be wide. But that is the penalty you incur for having such a small sample. – Mark L. Stone Aug 16 '16 at 23:47
  • That is disheartening but I see your point. Thanks! Please let me know if you think there is any better way of quantifying the agreement between replicates for my data. It seems error bars will make my data look bad. Not that I want to deceive my audience but I believe my data is good. – Mr K Aug 16 '16 at 23:56
  • If you can get up to 3 data points, the multiplicative factor on the sample standard deviation to get a two-sided confidence interval goes down from 12.71 to 4.30. That's a 66% savings. And if you act in the next 15 minutes, I'll give you some additional "savings" in the form of n-1 vs. n not dinging you as badly in the denominator of the formula for sample standard deviation when n = 3 vs. 2. – Mark L. Stone Aug 17 '16 at 00:03
  • Hmm, life is tough. No gains without pains. I'll work on the 3rd data point. :-/ – Mr K Aug 17 '16 at 00:08
  • Do you have two observations, or two samples containing multiple observations? It's not clear. – david25272 Aug 17 '16 at 00:25
  • @david25272 I have both. – Mr K Aug 17 '16 at 12:24
  • Dear @MarkL.Stone , I regret that I have only one "upvote" to give to your "up to 3 data points" comment above. However, as a special bonus offer, I will also upvote your compilation and expansion below! Extra credit for your patient dialog with Jon. – Rob Fagen Jan 22 '18 at 17:12

4 Answers

15

Compilation and expansion of comments:

Let's presume your data is Normally distributed.

If you want to form two-sided error bars (or confidence intervals), say at the 95% level, you will need to base that on the Student t distribution with n-1 degrees of freedom, where n is the number of data points. You propose to have 2 data points, therefore requiring use of Student t with 1 degree of freedom.

95% two-sided error bars for n = 2 data points require a multiplicative factor of 12.71 on the sample standard deviation, not the familiar factor of 1.96 based on the Normal distribution (Student t with $\infty$ degrees of freedom). The corresponding multiplicative factor for n = 3 data points is 4.30.

The situation gets even more extreme for two-sided 99% error bars (confidence intervals).

As you can see, at either confidence level, there's a big "savings" in the multiplicative factor if you have 3 data points instead of 2. And you don't get dinged as badly by the use of n-1 vs. n in the denominator of sample standard deviation.

  n  Confidence Level  Multiplicative Factor
  2       0.95              12.71
  3       0.95               4.30
  4       0.95               3.18
  5       0.95               2.78
 infinity 0.95               1.96

  2       0.99              63.66
  3       0.99               9.92
  4       0.99               5.84
  5       0.99               4.60
 infinity 0.99               2.58
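
For concreteness, here is a minimal sketch (my addition, not part of the original answer) that reproduces these multiplicative factors from the Student t quantiles using scipy.stats and then applies one to a pair of made-up replicate values; the numbers 4.1 and 4.7 are purely illustrative, and the interval shown uses the usual $\bar{x} \pm t \, s/\sqrt{n}$ form for a confidence interval on the mean.

    import numpy as np
    from scipy import stats

    # Reproduce the table: the multiplicative factor is the two-sided
    # Student t quantile with n - 1 degrees of freedom.
    for conf in (0.95, 0.99):
        for n in (2, 3, 4, 5):
            factor = stats.t.ppf(1 - (1 - conf) / 2, df=n - 1)
            print(f"n = {n}, {conf:.0%} factor = {factor:.2f}")

    # Illustrative 95% confidence interval for the mean of two replicates
    # (the values 4.1 and 4.7 are made up).
    x = np.array([4.1, 4.7])
    n = len(x)
    m, s = x.mean(), x.std(ddof=1)            # here s = |x2 - x1| / sqrt(2)
    t_factor = stats.t.ppf(0.975, df=n - 1)   # 12.71 for n = 2
    half_width = t_factor * s / np.sqrt(n)
    print(f"mean = {m:.2f}, 95% CI = ({m - half_width:.2f}, {m + half_width:.2f})")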
Mark L. Stone
  • "But people were uncomfortable when I calculated the standard deviation. I could understand their concern but I want to get more insights into why it is ok or not ok to take standard deviation in this case? If it is not ok, what are my options?" – this question still remains. I disagree with using CIs for only two samples. With n = 2, you basically get a meaningless SD and CIs. – Jon Aug 17 '16 at 15:30
  • @Johnnyboycurtis The confidence intervals are wide, as seen from the very large multiplicative factors arising from a t distribution with 1 degree of freedom. Whether they are "satisfying" is another matter. The typical uncomfortable people were probably thinking in terms of the Normal Z for the multiplicative factor, not t with 1 degree of freedom. – Mark L. Stone Aug 17 '16 at 15:33
  • Whether t or Z, you're still assuming the data is approx. Normal after some large sample size; this assumption does not hold well with n = 2. – Jon Aug 17 '16 at 15:36
  • I think you need to ask yourself, would you be comfortable presenting this to a room of statisticians? – Jon Aug 17 '16 at 15:37
  • The Normal assumption is just as valid whether 2 data points or a million. There's no averaging going on here as in the Central Limit Theorem. But yes, the Normal assumption is crucial; that's why at the beginning of my answer I stated "Let's presume your data is Normally distributed." I would discuss the Normality assumption if presenting this to anyone, whether 2 data points or a gazillion. – Mark L. Stone Aug 17 '16 at 15:39
  • Alright, well, you've discussed what would happen if the data were to be approx. Normal. Now, what would happen if it were not? – Jon Aug 17 '16 at 15:42
  • Then all the error bars, confidence intervals, or whatever you want to call them would be bogus, regardless of the number of data points. The bogusness is likely to be most severe at high confidence levels, i.e., in the tails, where deviations from Normality are likely to be most severe. – Mark L. Stone Aug 17 '16 at 15:46
  • Great answer. Now how would you test if the data were approx. Normal for n=2? – Jon Aug 17 '16 at 15:47
  • I wouldn't. It would have to be based on a priori knowledge. And with that, I shall now move on to other endeavors. Thanks for the chat. – Mark L. Stone Aug 17 '16 at 15:49
6

Setting aside your initial explanation of the time-series context, it might be useful to look at this as a simple case of observing two data points. For any two observed values $x_1 , x_2$ the sample standard deviation is $s = |x_2 - x_1| / \sqrt2$. This statistic is exactly as informative as giving the sample range of the two values (since it is just a scalar multiple of that statistic). There is nothing inherently wrong with using this statistic as information on the standard deviation of the underlying distribution, but obviously there is a great deal of variability to this statistic.

The sampling distribution of the sample standard deviation depends on the underlying distribution for the observable values. In the special case where $X_1, X_2 \sim \text{IID N}(\mu, \sigma^2)$ are normal values you have $S \sim \sigma \cdot \chi_1$ which is a scaled half-normal distribution. Obviously this means that your sample standard deviation is quite a poor estimator of the standard deviation parameter (biased and with high variance), but that is to be expected with so little data.
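
As a quick numerical check of both facts (a simulation sketch I have added; it is not part of the original answer), one can draw many pairs of normal values and compare the sample standard deviation with the scaled half-normal prediction:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma, reps = 10.0, 2.0, 100_000

    # Many samples of size n = 2 from the same normal distribution.
    x = rng.normal(mu, sigma, size=(reps, 2))
    s = x.std(axis=1, ddof=1)

    # Identity check: s = |x2 - x1| / sqrt(2) for two data points.
    assert np.allclose(s, np.abs(x[:, 1] - x[:, 0]) / np.sqrt(2))

    # For S ~ sigma * chi_1 (scaled half-normal), E[S] = sigma * sqrt(2 / pi),
    # so S is biased low as an estimator of sigma even though E[S^2] = sigma^2.
    print(s.mean(), sigma * np.sqrt(2 / np.pi))   # both roughly 1.60, below sigma = 2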

Ben
3

If you only have 2 values, just present those 2 values. It doesn't make sense to convert 2 measurements into 2 other quantities (mean and stdev) if your audience is going to argue about the significance of one or the other.

If you want to estimate uncertainty, these other responses are right on, but don't forget to add other potential sources of error (measurement instrument bias errors, resolution, etc.).
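
A minimal sketch of one common way to do that, assuming the error sources are independent and can be combined in quadrature (the numbers below are made up for illustration):

    import numpy as np

    s_replicates = 0.42                  # sample SD of the two replicates (made up)
    u_instrument = 0.10                  # instrument accuracy from the spec sheet (made up)
    u_resolution = 0.05 / np.sqrt(12)    # standard uncertainty of a 0.05 readout resolution

    # Independent contributions combine in quadrature (root sum of squares).
    u_total = np.sqrt(s_replicates**2 + u_instrument**2 + u_resolution**2)
    print(f"combined uncertainty ~ {u_total:.2f}")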

0

" I know that you could compare the two time series by taking Pearson correlation and such" -- this is incorrect. Pearson Correlation assumes observations are independent, but time series data is by nature not independent. You would actually need to use a cross-correlation. Reference: https://onlinecourses.science.psu.edu/stat510/node/74

Also, you shouldn't use the typical variance (if you really must calculate a variance); I suggest using something like the Mean Absolute Deviation (MAD). You can then create a histogram to summarise the distribution of similarity/dissimilarity.
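
A minimal sketch of both suggestions (my own illustration; the two series and all variable names are made up):

    import numpy as np

    rng = np.random.default_rng(1)

    # Two made-up time series measuring the same phenomenon.
    t = np.arange(50)
    series_a = np.sin(t / 5) + rng.normal(0, 0.1, t.size)
    series_b = np.sin(t / 5) + rng.normal(0, 0.1, t.size)

    # Cross-correlation of the mean-removed series at all lags.
    a = series_a - series_a.mean()
    b = series_b - series_b.mean()
    xcorr = np.correlate(a, b, mode="full") / (a.std() * b.std() * t.size)
    lags = np.arange(-t.size + 1, t.size)
    print("peak cross-correlation at lag", lags[np.argmax(xcorr)])

    # Pointwise disagreement summarised by the mean absolute deviation.
    mad = np.mean(np.abs(series_a - series_b))
    print(f"mean absolute deviation = {mad:.3f}")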

Jon