Suppose I have a set of terms, each with an associated frequency (the number of times the term appears in a fixed corpus of papers). Is the following method of significance testing valid?

  1. calculate the median absolute deviation (MAD) of the GO term frequencies in the given corpus,

    for a sample $S = \{x_i\}$: ${\rm MAD}(S) = 1.4826 \times {\rm median}(|x_{i} - {\rm median}(S)|)$

  2. get ${\rm thresh} = 2.7 \times {\rm MAD}(S) + {\rm median}(S)$

  3. use ${\rm thresh}$ as a threshold: GO terms with frequencies above it are deemed significantly associated with the given corpus, and those below it are deemed non-significant.
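
The three steps above can be sketched in a few lines of Python using only the standard library. This is a minimal illustration, not a reference implementation: the term names and counts are made up, and `mad`, `mad_threshold` and `flag_terms` are hypothetical helper names rather than functions from any library.

```python
from statistics import median

MAD_SCALE = 1.4826  # makes the MAD consistent with the standard deviation for Gaussian data
TUKEY_K = 2.7       # multiplier used in step 2


def mad(values, scale=MAD_SCALE):
    """Scaled median absolute deviation of a sample (step 1)."""
    values = list(values)
    centre = median(values)
    return scale * median([abs(x - centre) for x in values])


def mad_threshold(values, k=TUKEY_K):
    """Threshold median(S) + k * MAD(S) (step 2)."""
    values = list(values)
    return median(values) + k * mad(values)


def flag_terms(freqs, k=TUKEY_K):
    """Return the terms whose frequency lies above the threshold (step 3)."""
    thresh = mad_threshold(freqs.values(), k)
    return sorted(term for term, f in freqs.items() if f > thresh)


# Hypothetical counts: term -> number of occurrences in the corpus.
freqs = {"term_a": 1, "term_b": 2, "term_c": 2, "term_d": 3,
         "term_e": 3, "term_f": 4, "term_g": 50}
print(mad_threshold(freqs.values()))  # approximately 7.003
print(flag_terms(freqs))              # ['term_g']
```

Here the median is 3 and the unscaled MAD is 1, so the threshold is $3 + 2.7 \times 1.4826 \approx 7.0$ and only the term with 50 occurrences is flagged.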

Macro
user1447630
  • I have corrected the definition of your thresholding rule to include the adjustment constant of the MAD and the Tukey multiple (calibrated for Gaussian data) for the threshold itself. – user603 Jun 11 '12 at 09:56

1 Answer

I doubt it. Most probably, the distribution of term frequencies is highly skewed. In that case, a threshold rule that assumes the underlying data come from a symmetric distribution will give highly misleading thresholds (and, as a result, potentially misleading results).
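
To make the concern concrete, here is a small sketch on synthetic, Zipf-like counts. The data are an assumption chosen only to mimic a heavily right-skewed frequency distribution; they are not the asker's data. Under a Gaussian model, a cut at the median plus 2.7 scaled MADs would leave only about 0.35% of the values above it, so a much larger flagged share suggests the rule is not calibrated for data of this shape.

```python
from statistics import median

# Synthetic, Zipf-like counts: the r-th most frequent term appears about 1000 / r times.
counts = [round(1000 / r) for r in range(1, 101)]

centre = median(counts)
# Scaled median absolute deviation, as in the question.
scaled_mad = 1.4826 * median([abs(c - centre) for c in counts])
thresh = centre + 2.7 * scaled_mad

# Fraction of terms that the rule would call "significant".
flagged_share = sum(c > thresh for c in counts) / len(counts)
print(f"threshold = {thresh:.1f}, share flagged = {flagged_share:.0%}")
```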

You could try applying the thresholding rule you propose to a transformed version of your data, using a transformation such as the arcsin. The threshold rule you proposed is based on order statistics, meaning that the result should not depend on which transformation you use, so long as it is a valid transformation (i.e. a monotone function on the domain of your inputs).

An alternative solution, which I personally favor because it simplifies interpretation, is to use adjusted boxplots.
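
For readers unfamiliar with the adjusted boxplot, its usual definition (due to Hubert and Vandervieren; the answer does not spell it out, so take this as an assumed reading) replaces Tukey's symmetric fences with fences that depend on the medcouple ${\rm MC}$, a robust measure of skewness. With $m = {\rm median}(S)$, the medcouple is

$${\rm MC}(S) = \underset{x_i \le m \le x_j,\; x_i \ne x_j}{\rm median} \; \frac{(x_j - m) - (m - x_i)}{x_j - x_i}$$

and, with quartiles $Q_1$, $Q_3$ and ${\rm IQR} = Q_3 - Q_1$, the fences are

$$\left[\, Q_1 - 1.5\, e^{-4\,{\rm MC}}\, {\rm IQR},\;\; Q_3 + 1.5\, e^{3\,{\rm MC}}\, {\rm IQR} \,\right] \quad \text{if } {\rm MC} \ge 0,$$

$$\left[\, Q_1 - 1.5\, e^{-3\,{\rm MC}}\, {\rm IQR},\;\; Q_3 + 1.5\, e^{4\,{\rm MC}}\, {\rm IQR} \,\right] \quad \text{if } {\rm MC} < 0.$$

For ${\rm MC} = 0$ this reduces to Tukey's usual fences. For right-skewed data (${\rm MC} > 0$), which is the likely case for term frequencies, the upper fence moves outward, so fewer terms are flagged than under the symmetric rule.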

user603
  • So using the adjusted boxplot, does this mean that the terms significantly-enriched in the given corpus of data will be those above the fence? – user1447630 Jun 11 '12 at 21:07
  • Yes, the unusually frequent terms will be above the fence. More importantly, by using the adjusted distance to the median: |S_i - median(S)| / A_i, where A_i = UW(S) - median(S) if S_i > median(S) and A_i = median(S) - LW(S) if S_i < median(S) … – user603 Jun 12 '12 at 21:21