I'm just starting to work with some count data and I'm still trying to understand some of the complexities of it, so any help would be greatly appreciated. First the simple version, and then the potentially more complex version.

I have a dataset that looks something like this:

Group   Count
L         4
R         5
C         9
L         16
L         3
R         8
C         5

Etc.

A central part of my research question is within/between group variation; my hypothesis suggests that observations in L, for instance, will be more like each other than they will be like observations in C and R. There are some 0 observations, but not a great number of them. (Edit: The data describes the number of articles over a time period by a news outlet on a given topic. Group is a characteristic of the news outlet.)

Because it's count data, I understand I can't really use a straightforward one way ANOVA, so what should I do?

Now the more complex version: I also have those counts expressed as a proportion of each individual outlet's total, across 20 different test cases. So this data looks more like:

Group   Perc1     Perc2   ...  PercN
L         .3         .04         .2
R         .15        .6          .02
C         .9         .04         .2
L         .21        .08         .34
L         .13        .75         .02

Etc. (Edit: Each row represents the proportion of that outlet's coverage on each topic measured. Perc1 = Count1 / Sum(Count1..CountN).)
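For concreteness, the row-wise calculation described in the edit can be sketched as follows. This is only an illustration: the function name `coverage_proportions` and the example counts are made up, not taken from the actual dataset.

```python
def coverage_proportions(counts):
    """Convert one outlet's per-topic article counts into proportions of its total."""
    total = sum(counts)
    if total == 0:
        # An outlet with no articles at all has no defined proportions.
        raise ValueError("cannot compute proportions for an all-zero row")
    return [c / total for c in counts]


# Hypothetical counts for one outlet across four topics (not real data).
example_counts = [6, 1, 3, 10]
example_props = coverage_proportions(example_counts)
print(example_props)
```

Each row sums to 1 by construction, so the proportions within a row cannot vary independently of one another.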

What would be the best approach? I'm comfortable using R or Stata, whichever is best/easiest. This is somewhat similar to this post, but I'm not sure it fully applies.

Thank you in advance.

John Henckle

1 Answer

I assume that were your data normally distributed with nearly equal variances, you would like to try something like an omnibus test/pairwise multiple comparisons tests. And I assume that you would want to adjust your p-values (or, alternatively, your rejection criterion) for multiple comparisons.

Your first example (with count data) could be approached using the Kruskal-Wallis test as a nonparametric analog of the one-way ANOVA, followed by Dunn's test, which is akin to performing pairwise rank sum tests but reuses the rankings from the Kruskal-Wallis test and a pooled variance term based on rank sum distributions. The accuracy of both test statistics will be somewhat compromised by ties (technically, these tests assume continuous data), but the adjustments for ties typically implemented in software packages performing these tests will help compensate: the larger the range of count values you have, the better. Multiple comparisons adjustments here for the win!
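To make the mechanics concrete, here is a minimal Python sketch of the tie-corrected Kruskal-Wallis statistic followed by Dunn's pairwise z statistics with a Bonferroni adjustment. It is written from the textbook formulas using only the standard library, not taken from any particular package, and the names `midranks` and `kruskal_dunn` are my own. The data are just the seven rows shown in the question, which is far too few for a real analysis. In practice you would more likely use an existing implementation such as `scipy.stats.kruskal` in Python, or the equivalents in R or Stata.

```python
from itertools import combinations
from math import exp, sqrt
from statistics import NormalDist


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def kruskal_dunn(groups):
    """groups: dict of label -> list of counts. Returns (H, {pair: (z, adjusted p)})."""
    labels = list(groups)
    pooled = [v for g in labels for v in groups[g]]
    n = len(pooled)
    ranks = midranks(pooled)

    # Split the pooled ranks back into per-group sizes and mean ranks.
    mean_rank, size, start = {}, {}, 0
    for g in labels:
        size[g] = len(groups[g])
        mean_rank[g] = sum(ranks[start:start + size[g]]) / size[g]
        start += size[g]

    # Tie term: sum of (t^3 - t) over every set of t tied values.
    tie_sum = sum(t**3 - t for t in (pooled.count(v) for v in set(pooled)))

    # Kruskal-Wallis H, divided by the usual tie correction factor.
    h = 12 / (n * (n + 1)) * sum(size[g] * mean_rank[g] ** 2 for g in labels)
    h -= 3 * (n + 1)
    h /= 1 - tie_sum / (n**3 - n)

    # Dunn's z for each pair, using the pooled (tie-adjusted) rank variance.
    pooled_var = n * (n + 1) / 12 - tie_sum / (12 * (n - 1))
    pairs = list(combinations(labels, 2))
    results = {}
    for a, b in pairs:
        se = sqrt(pooled_var * (1 / size[a] + 1 / size[b]))
        z = (mean_rank[a] - mean_rank[b]) / se
        p = 2 * (1 - NormalDist().cdf(abs(z)))
        results[(a, b)] = (z, min(1.0, p * len(pairs)))  # Bonferroni adjustment
    return h, results


# The seven rows shown in the question, used purely for illustration.
counts = {"L": [4, 16, 3], "R": [5, 8], "C": [9, 5]}
h_stat, dunn = kruskal_dunn(counts)

# With exactly three groups, H is referred to a chi-square with 2 df,
# whose upper tail is exp(-H / 2). This shortcut only holds for df = 2.
p_kw = exp(-h_stat / 2)
print(h_stat, p_kw)
for pair, (z, p_adj) in dunn.items():
    print(pair, z, p_adj)
```

Bonferroni is used here only because it is the simplest adjustment to write down; other adjustments (Holm, for example) would slot into the same place. The sketch does not guard against the degenerate case where every observation is tied.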

Your second example looks like a repeated measures/blocked design, so you might consider Cochran's Q test as a nonparametric analog of the one-way repeated measures ANOVA specifically for binary outcomes (which your percentages/proportions are one representation of), followed either by Cochran's Q tests between pairs or by McNemar's test between pairs (these are equivalent). Multiple comparisons adjustments here also for the win!
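As a rough illustration of what Cochran's Q computes, here is a standard-library sketch of the statistic. Note the assumptions: the test needs 0/1 outcomes, so proportions would first have to be dichotomized somehow (the table below is a made-up binary example, not derived from the question's data), and the function name `cochran_q` is my own.

```python
def cochran_q(table):
    """table: list of blocks (rows), each a list of 0/1 outcomes, one per condition.

    Returns (Q, degrees of freedom).
    """
    k = len(table[0])
    if any(len(row) != k for row in table):
        raise ValueError("every row must have the same number of conditions")
    if any(x not in (0, 1) for row in table for x in row):
        raise ValueError("Cochran's Q requires binary (0/1) outcomes")

    col_totals = [sum(row[j] for row in table) for j in range(k)]
    row_totals = [sum(row) for row in table]
    n = sum(row_totals)

    denom = k * n - sum(r * r for r in row_totals)
    if denom == 0:
        raise ValueError("every row is all 0s or all 1s; Q is undefined")

    q = (k - 1) * (k * sum(c * c for c in col_totals) - n * n) / denom
    return q, k - 1


# Hypothetical binary table: 4 blocks (rows) by 3 conditions (columns).
binary_table = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 0],
]
q_stat, q_df = cochran_q(binary_table)
print(q_stat, q_df)
# A p-value would come from the chi-square distribution with q_df degrees
# of freedom, e.g. scipy.stats.chi2.sf(q_stat, q_df) if SciPy is available.
```

Rows that are all 0s or all 1s contribute nothing to Q, which is why only blocks with mixed outcomes carry information in this test.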


References
Cochran, W. G. (1950). The comparison of percentages. Biometrika, 37(3/4):256–266.

Dunn, O. J. (1964). Multiple comparisons using rank sums. Technometrics, 6(3):241–252.

Kruskal, W. H. and Wallis, W. A. (1952). Use of ranks in one-criterion variance analysis. Journal of the American Statistical Association, 47(260):583–621.

Alexis
  • Thanks @Alexis for a very thorough response. After consideration, I can't go with the Count data because there is extreme variation within each case. For clarification, this data is number of articles on a specific topic in a news outlet, and some outlets just publish a lot more (some are weekly, some have more sections, etc.). Each CountN col in the second example, then, is the proportion of total articles that deal with that topic; in example 1, they're just the raw counts. (continued..) – John Henckle Aug 23 '14 at 02:29
  • Oh, and group refers to characteristics of the outlet (similar to a gender variable, but with 3 categories). So I have to go with the second dataset. I read through Cochran's paper and I'm not sure if it applies - I'm less interested in whether or not they cover a topic than in **how much** they cover it. The only way I can see of implementing that is if I have a threshold that defines success or failure, but that's not what my research question is asking - the research question is how similar the coverage is. Sorry if that was misleading. – John Henckle Aug 23 '14 at 02:38
  • @JohnHenckle It's not clear to me why extreme variability would make a difference. The null hypotheses of both the Kruskal-Wallis and the Dunn tests are that the probability that a random observation from one group exceeds a random observation from another group is one half (i.e. it is no more likely to be greater than to be less). The variance in the observed data doesn't enter into it for these tests. – Alexis Aug 23 '14 at 04:27
  • @JohnHenckle In response to your second comment: what these tests would give you is an answer to the questions (1) do any of these groups have a higher likelihood of a random draw exceeding any other? and (2) *which* groups (significantly) differ from which in this respect (with the sign of the test statistic indicating the direction). Which may or may not be useful. :) – Alexis Aug 23 '14 at 04:30
  • Kruskal-Wallis assumes continuous data. The permutations of ranks are not all equally likely for counts, and if the counts are small this is a particular problem (ties will be heavy). – Glen_b Aug 24 '14 at 09:04
  • @Glen_b See my second paragraph. :) – Alexis Aug 24 '14 at 15:00
  • Actually I was responding to the second paragraph. (I think you understand the issues quite well already.) The first sentence and a half was more by way of brief explication, for the OP's benefit; the last half sentence expresses the source of my concern that the usual adjustments will not be sufficient. I had intended to finish with a sentence suggesting that a direct permutation test on the actual ranks present could be used for both the Kruskal Wallis and a post hoc test but I seem not to have included it. That was what I was leading into, at least. – Glen_b Aug 24 '14 at 22:22
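
The direct permutation test on the observed ranks mentioned in the last comment could be sketched roughly as below. This is my own Monte Carlo reading of that suggestion, not code from any commenter: only the omnibus (Kruskal-Wallis style) statistic is shown, the names `rank_stat` and `permutation_kruskal` are invented for the example, and the data are again just the seven rows displayed in the question.

```python
import random


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def rank_stat(ranks, sizes):
    """Sum over groups of (rank sum)^2 / n.

    For a fixed set of ranks this is a monotone function of the
    Kruskal-Wallis H, so it orders permutations the same way H does.
    """
    stat, start = 0.0, 0
    for n_g in sizes:
        r = sum(ranks[start:start + n_g])
        stat += r * r / n_g
        start += n_g
    return stat


def permutation_kruskal(groups, n_perm=9999, seed=0):
    """Monte Carlo permutation p-value for the omnibus rank statistic."""
    sizes = [len(g) for g in groups.values()]
    pooled = [v for g in groups.values() for v in g]
    ranks = midranks(pooled)  # ties keep their actual midranks
    observed = rank_stat(ranks, sizes)

    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ranks)  # reassign the observed ranks to groups at random
        if rank_stat(ranks, sizes) >= observed - 1e-9:
            hits += 1
    # Count the observed arrangement itself so the p-value is never zero.
    return (hits + 1) / (n_perm + 1)


# The seven rows shown in the question, used purely for illustration.
counts = {"L": [4, 16, 3], "R": [5, 8], "C": [9, 5]}
p_perm = permutation_kruskal(counts)
print(p_perm)
```

Because the reference distribution is built from the ranks actually present, heavy ties are handled directly rather than through a chi-square approximation with a correction factor.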