
I have a large dataset (n = 170,000), distributed roughly equally among 5 groups (each is 34,000 ± ~1,000). I want to test whether the mean in any of the first 4 groups differs from the control's. Before the data came in, I figured the best approach would be Dunnett's test; however, now I'm unsure how best to handle this.

After doing some research, it seems there are at least 2 ways to go about it:

  1. Normalize the values (which still yields a skewed distribution)
  2. Use a non-parametric counterpart

I'd like to get an idea of the pros/cons of each, so I can decide on the best approach.

Khashir
  • It's quite likely a standard ANOVA is the best procedure, but that depends on which "distribution" the title refers to--is it the distribution of the responses or of their residuals?--and how exactly it departs from Normality. Could you provide this important information by editing your question? – whuber Apr 30 '18 at 20:59
  • Thanks for the comment, whuber. My understanding was that, since Dunnett's compares the means of the control to the treatments, it's the distribution of the dependent variable that needs to be normally distributed. Did I misunderstand that? – Khashir Apr 30 '18 at 21:14
  • What needs to be close to Normally distributed is the *sampling distribution of the difference of means.* With very large groups it would take extreme skewnesses to violate that assumption to the degree it would affect any but tiny p-values. – whuber Apr 30 '18 at 22:37
  • What do you mean by "normalize"? – Glen_b May 01 '18 at 00:36
  • Hey all, thanks for the responses. @whuber: I'll go ahead with an ANOVA and Dunnett, to see if the outcome makes sense. – Khashir May 01 '18 at 01:02
  • @Glen_b: By normalize, I mean apply the scale() function in R, which coerces the data to have a mean = 0 and var = 1. Based on my reading so far, it's a common technique (granted, not without detractors) – Khashir May 01 '18 at 01:04
  • How would that help? You'd remove any differences in mean (which is what you're interested in), without removing the skewness that you're trying to deal with. – Glen_b May 01 '18 at 06:24
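Glen_b's point can be shown with a quick simulation (in Python rather than R; the group sizes and exponential distributions here are invented for illustration). Standardizing each group separately forces every group mean to zero, erasing exactly the difference the test is meant to detect, while the skewness survives untouched:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Two hypothetical skewed groups with different means (sizes invented).
a = rng.exponential(scale=1.0, size=10_000)
b = rng.exponential(scale=2.0, size=10_000)

def standardize(x):
    # The Python equivalent of R's scale(): subtract the mean, divide by the SD.
    return (x - x.mean()) / x.std(ddof=1)

za, zb = standardize(a), standardize(b)

# Both standardized means are now (numerically) zero, so the very
# difference in means we wanted to test has been removed...
print(abs(za.mean()) < 1e-8, abs(zb.mean()) < 1e-8)

# ...while the skewness is unchanged, because standardizing is a
# linear (location-scale) transformation.
print(np.isclose(stats.skew(a), stats.skew(za)))
```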

1 Answer


You have several considerations.

  1. Significance level. Unless you have extremely skewed distributions, the correctness of the significance level is probably not at issue, provided the group variances don't differ very much.

    If significance level were still a concern for you, you could avoid that problem by simply performing a permutation test, perhaps with the usual statistic or some suitable simplification of it.

    If variances differ, that will affect significance level, but a Welch-Satterthwaite approach may be adequate to deal with that.

  2. Power. Power may indeed be an issue. You might want to consider a more suitable distributional model (e.g. possibly an exponential-family model -- a GLM -- which will still allow you to compare means). What sort of quantity are you measuring? Are these times? Incomes? Counts? Angles?

  3. A rank-based nonparametric test will not be a test for equality of means without additional assumptions (assumptions that would keep the means equal under the null and unequal under the alternative). If such assumptions don't hold, you can have equal means and yet be highly likely to reject, or unequal means and be highly likely to fail to reject, even in very large samples.
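As a concrete sketch of point 1 (in Python; the data below are simulated stand-ins, since the original data aren't shown): a permutation test built on the Welch statistic, which also covers the unequal-variance case via the Welch-Satterthwaite adjustment.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Simulated stand-ins for one treatment group vs. the control
# (exponential to mimic skew; sizes shrunk from 34,000 for speed).
control = rng.exponential(scale=1.00, size=2_000)
treatment = rng.exponential(scale=1.08, size=2_000)

# Welch t statistic: does not assume equal group variances.
t_obs = stats.ttest_ind(treatment, control, equal_var=False).statistic

# Permutation test: under the null, group labels are exchangeable,
# so reshuffle them and recompute the statistic each time.
pooled = np.concatenate([treatment, control])
n_t = len(treatment)
n_perm = 1_000
exceed = 0
for _ in range(n_perm):
    rng.shuffle(pooled)
    t_perm = stats.ttest_ind(pooled[:n_t], pooled[n_t:],
                             equal_var=False).statistic
    if abs(t_perm) >= abs(t_obs):
        exceed += 1

# Add-one correction keeps the estimated p-value away from exactly zero.
p_value = (exceed + 1) / (n_perm + 1)
print(p_value)
```

With four treatment groups you would run one such comparison per group against the control and then adjust for multiplicity (e.g. Bonferroni, or the Dunnett-style adjustment originally planned).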

Glen_b
  • Heya, thanks for submitting an answer. We are measuring average counts (e.g., pizza slices per person). – Khashir May 01 '18 at 01:07
  • If you know the denominators in your average counts, why not use a model for counts? A Poisson or negative binomial regression model, perhaps (there are a number of other alternatives). These would give you a model for the mean, as you require, and the ability to carry out a test that would correspond to the kind of comparison of means you're after (fit a full and reduced model, use an asymptotic chi-squared test). A suitable choice may deal with the skewness and the heteroskedasticity you definitely expect with count data all at the same time (those aspects are in the distributional model). – Glen_b May 01 '18 at 06:22
  • Hmm, this might be beyond me at the moment. Would you have a resource for me to dive in, or a useful query to throw at google? I want to have a good grasp of what I'd be doing in that case. – Khashir May 01 '18 at 14:21
  • I'd start with googling *generalized linear models*, which extend regression models to a much wider class of practical models; here's Wikipedia on [GLMs](https://en.wikipedia.org/wiki/Generalized_linear_model). Also [here](http://statmath.wu.ac.at/courses/heather_turner/glmCourse_001.pdf)'s lecture slides for an introductory lecture on them (don't worry about the stuff on estimation methods), but it focuses on using R to fit them, which may not suit you but it will give some of the gist. It might be a little more mathematical than you want but you may find parts of it useful. ...ctd – Glen_b May 01 '18 at 22:37
  • ctd... John Fox's book (applied regression analysis and generalized linear models) isn't bad, or you may prefer a specific book on GLMs. I'd also search for *poisson regression* and - eventually - for *negative binomial regression*. There are many questions here on site about GLMs and on Poisson and negative binomial regression models, including a fair number of introductory questions, and some have good answers. Strictly speaking, negative binomial models are not GLMs unless you specify a parameter, but they're often treated as GLMs; learning GLMs/Poisson regression will be valuable as a first step. – Glen_b May 01 '18 at 22:40
  • Thanks a lot for this—I think these might be worth moving into the answer ("If you want more info..."), since comments sometimes get deleted, and future readers would be missing out. – Khashir May 02 '18 at 00:27
  • Thanks. It doesn't look to me like it fits well with the present answer, nor really with the question. I'll think about whether I can edit some of it to fit. – Glen_b May 02 '18 at 00:30
  • Fair point. One way to hook-up one with the other would be to compare/contrast with Dunnett's test (if you think that would be relevant). – Khashir May 02 '18 at 03:26