
I feel like this question has a very silly (simple) answer, so I apologize. I have a data set with only 5 points:

41.9
32.2
113.3
110.2
102.6

Clearly something significant happens between the 2nd and 3rd data points (but not between the 1st and 2nd, 3rd and 4th, etc.). What statistical test could I employ to show this in a more rigorous sense?

EDIT: Note, I was not clear with my question (sorry). The data is an ordered set. I would like to determine that the difference between 2-3 is statistically larger than 1-2, 3-4, and 4-5.

Thanks!

Shinobii
    Some statistical writers have compared your hypothesis to randomly shooting five bullets at a large wall, circling two that happened to be close to each other (and far from the other three), and saying "Wow! What were the chances of *that* happening!" – whuber Jun 16 '15 at 20:44
  • Well said, clearly (if I understand you correctly) not much in a statistical sense can be inferred here. – Shinobii Jun 16 '15 at 20:46
  • That's right. However, if *before collecting these data* you had stated exactly the same hypothesis--is the difference at 2-3 the largest of all (four) differences?--then you could get more traction, because your hypothesis would be independent of the data. Consequently it would be valid to use the data to test that hypothesis. As your intuition might tell you, there's a one in four chance that the 2-3 difference is the largest. That's a tiny bit unusual, but not enough to get anybody's attention. – whuber Jun 16 '15 at 20:51

2 Answers


Of the $5! = 120$ distinct sequences that can be formed of those five numbers,

  • $4\times 2!\times 3! = 48$ of them will have the small values $41.9$ and $32.2$ next to each other. (There are four places for this pair to occur, $2!$ ways of ordering them, and $3!$ ways to order the other three numbers.)

  • Yet another $2! \times 3! = 12$ sequences will alternate between a high value in $\{102.6, 110.2, 113.3\}$ and a low value in $\{32.2, 41.9\}$.

  • Another $2! \times 3! = 12$ will bracket the three high values with a low value on either end.

I have enumerated $72$ so far, which is $60\%$ of all the possible sequences. Thus, depending on what kinds of patterns might catch your notice (which is a matter for your psychologist to explore), the total number of such "clearly something significant" sequences could easily be more frequent than sequences that do not clearly have something significant! From this we may draw two conclusions:

  1. Not a single one of these patterns is rare enough to be considered "statistically significant" at a conventional ($5\%$, or $6/120$) level.

  2. Any conclusion about "significance" derived after recognizing a "clear, significant" pattern while exploring a dataset must be considered subjective.

(This is not to say such conclusions are without value. It only maintains that statistics, correctly applied, will not sanctify the conclusions of an open-ended exploratory analysis with any level of "significance," because it cannot.)
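The enumeration above can be checked by brute force. The following sketch (Python is my choice here, not part of the original answer) walks all $120$ orderings of the five values and counts those matching one of the three "notable" patterns:

```python
# Count, over all 120 orderings of the five values, how many exhibit one of
# the three patterns enumerated above: adjacent lows, strict alternation,
# or lows bracketing the three highs. The three patterns are disjoint.
from itertools import permutations

values = [41.9, 32.2, 113.3, 110.2, 102.6]
low = {41.9, 32.2}

def adjacent_lows(seq):
    # the two small values appear next to each other (48 sequences)
    return any(set(seq[i:i + 2]) == low for i in range(4))

def alternating(seq):
    # strict high/low alternation, i.e. H L H L H (12 sequences)
    return all((seq[i] in low) != (seq[i + 1] in low) for i in range(4))

def bracketed(seq):
    # a low value on each end, the three highs in the middle (12 sequences)
    return seq[0] in low and seq[4] in low and not any(v in low for v in seq[1:4])

count = sum(1 for p in permutations(values)
            if adjacent_lows(p) or alternating(p) or bracketed(p))
print(count, count / 120)  # prints: 72 0.6
```

The three predicates are mutually exclusive (the low values are adjacent, separated by one high, or separated by all three highs), so the counts add to $48 + 12 + 12 = 72$, confirming the $60\%$ figure.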


Such quantitative reasoning leads generally to the following statistico-psychological metatheorem:

In any collection of random patterns, the majority will be unusual.

Those of you familiar with Garrison Keillor may recognize an echo of the Lake Wobegon population: "... and all the children are above average." However, I privately refer to this as the Shirley MacLaine principle, in honor of her well-known work as a "spiritual missionary," a seer of things and causes that do not exist.

whuber
  • Sorry, I should have been more clear. They must remain ordered, i.e. there are only 4 possibilities (1-2, 2-3, 3-4, 4-5). I would like to show (if possible) that the difference between 2-3 is significantly larger than 1-2, 3-4, and 4-5. – Shinobii Jun 16 '15 at 20:38
  • 1
    They *do* remain ordered. That's what a permutation is. When you produce a sequence of five values and ask about its "significance," you are wondering about how those five values compare to what they *could have been* had they appeared in some other order--because if those five values were actually independent, all orders would be equally likely. I have, in effect, conducted a *post hoc* statistical test that demonstrates that if we pretended a formal test were applicable, your p-value would have to be considered to be $60\%$ or greater. – whuber Jun 16 '15 at 20:41
  • Ah, I see. Clearly need to brush up on my statistics. It is the one subject that is required for ALL fields of study, yet not emphasized enough in University. I appreciate your very detailed answer. – Shinobii Jun 16 '15 at 20:44
  • This is the fun part of statistics: the math is easy, the stakes low, and the fundamental ideas--which can be truly counter-intuitive--come to the fore. Quite a few of the quotations at http://stats.stackexchange.com/questions/726/famous-statistician-quotes allude to this situation where we develop a hypothesis from our data and then are tempted to use the same data to confirm that hypothesis. – whuber Jun 16 '15 at 20:48
  • If we assume that we would only try to find significance if we saw 2 low values followed by 3 high values, or 3 high values followed by 2 low values, then there are only 24 such sequences. Then if we apply a two-sample t-test to the observed sequence and adjust the p-value via Bonferroni, we would still end up with a pretty small p-value. – James Jun 17 '15 at 13:51
  • @James Although the software would report a low p-value, it would be incorrect to interpret it in the usual way, for many reasons. Chief of those is the fact that the hypothesis was formulated *post hoc*: after seeing a pattern. The reasoning provided in my post here illustrates just how deceptive that low p-value would be. This process, which is well known and understood, is called "data snooping," "data dredging," and worse. A Bonferroni adjustment is not applicable. – whuber Jun 17 '15 at 14:04
  • Didn't you say yourself that it depends on how many patterns will catch one's notice? I'm saying that possibly there are only 24, and in that case the familywise p-value will be pretty small. I don't see why Bonferroni is not applicable. – James Jun 17 '15 at 14:07
  • @James Just to be clear: the value depends on how many patterns *might* catch one's notice. That's a property of the observer, not of the experiment, which is why I termed it "subjective." The purpose of applying statistical procedures to experimental results is to help ensure we are not deceiving ourselves. Applying a *post hoc* test to these data is a sure route to self-deception. – whuber Jun 17 '15 at 14:10
  • If we were to agree on how many patterns might catch the notice, then Bonferroni is applicable, right? – James Jun 17 '15 at 14:12
  • @James I don't see how, for several reasons. One is that in this small dataset there is little support for the t-test assumptions (of near-normal distribution of the sample means, especially). Another is that those tests are not independent. The third is that though you and I might indeed agree on a number of patterns, there is no assurance any third party would, so unless we are going through this exercise only between ourselves, the results would still be of little value to anyone else. – whuber Jun 17 '15 at 14:17
  • My question was narrowed to Bonferroni only. It works for any dependence structure (at the expense of power). – James Jun 17 '15 at 17:20
  • @James Bonferroni simply does not apply here. One way to recognize the problem is to consider a dataset drawn iid from an equal mixture of a Normal$(0,1)$ and Normal$(10^6,1)$ distribution. Conditional on observing two values near $0$ and three near $10^6$, there is a $48/120=40\%$ chance the two low values will be next to each other in the sequence. Your t-test to compare those two to the other three will have an astronomically low p-value. No amount of Bonferroni correction will bring it anywhere near to a reasonable value of $40+\%$. – whuber Jun 17 '15 at 18:21
  • In that case, we essentially try to use the t-test to assign observations to the two components. Given the variance of one and a huge difference in means, it will be obvious which component each observation came from, and the p-value should be way below 40%. However, from your perspective, each observation came from the same mixture distribution with mean about 500,000, so the p-value should be large. – James Jun 17 '15 at 19:47
  • @James If I understand your proposal correctly, that approach (of applying multiple t-tests to a large number of partitions of one dataset) is closely related to k-means clustering. Although the p-values are meaningless, it possibly could lead to good cluster identification--but only by virtue of its relationship to k-means. If you're interested in such problems of classification, look over posts on our site about that, or focus on specific methods like SVM, and check out references to over-fitting. – whuber Jun 17 '15 at 20:47
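Whuber's mixture example in the comments above is easy to check by simulation. This sketch (Python, with numpy assumed; not from the original thread) draws five component labels at a time, keeps only the samples with exactly two "low" and three "high" draws, and estimates how often the two lows land next to each other:

```python
# Simulate whuber's mixture example: conditional on exactly two low and
# three high values, how often are the two lows adjacent in the sequence?
# The claimed answer is 48/120 = 4/10 = 40%.
import numpy as np

rng = np.random.default_rng(0)
hits = trials = 0
while trials < 20000:
    comp = rng.integers(0, 2, size=5)   # 0 = low component, 1 = high component
    if comp.sum() != 3:                 # condition on exactly 2 lows, 3 highs
        continue
    lows = np.where(comp == 0)[0]       # positions of the two low values
    trials += 1
    hits += abs(lows[0] - lows[1]) == 1 # the two lows are adjacent
print(hits / trials)                    # ≈ 0.40
```

Only the component labels matter here: given two lows among five exchangeable positions, 4 of the $\binom{5}{2} = 10$ position pairs are adjacent, hence $40\%$, regardless of how extreme the t statistic for that split would be.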

It looks like the simplest way is to use the Chow test. If your sequence is assumed to be iid with no covariates, then it's probably equivalent to a two-sample test (e.g., a t-test) for the equality of means with 2 and 3 observations per sample. However, based on the Chow article, I don't see how it adjusts for data snooping, i.e., for the fact that the split into the two groups is suggested by the data.
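A hedged sketch of that equivalence (Python with numpy/scipy assumed; not part of the original answer): for an intercept-only model, the Chow F statistic with $k = 1$ parameter per regime reduces to the square of the pooled two-sample t statistic.

```python
# Chow test at the split suggested by the data, for an intercept-only model,
# compared against the pooled-variance two-sample t-test.
import numpy as np
from scipy import stats

y = np.array([41.9, 32.2, 113.3, 110.2, 102.6])
y1, y2 = y[:2], y[2:]           # the split suggested by the data (snooping!)

def rss(v):
    # residual sum of squares around the sample mean (intercept-only fit)
    return np.sum((v - v.mean()) ** 2)

k, n = 1, len(y)                # one estimated parameter (the mean) per regime
chow_F = ((rss(y) - rss(y1) - rss(y2)) / k) / ((rss(y1) + rss(y2)) / (n - 2 * k))

t, p = stats.ttest_ind(y1, y2)  # scipy's default is the pooled-variance t-test
# chow_F equals t**2, so the two procedures agree on these data
```

Of course, as the comments above stress, the resulting p-value cannot be interpreted at face value precisely because the split point was chosen after looking at the data.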

James