Power calculation for non-normal distribution

Question

I'm trying to do a power analysis for a research study aimed at characterizing between group differences in depression for 2 different populations. As I understand it, depression scores are a non-normal distribution in the general population.

For a P value <0.05, specified power of 0.8 and a difference of PHQ-9 scores between populations of 7 (scale is 0-27), what is the best/proper power analysis equation?

Are there any other variables I need given that PHQ-9 scores are a non-normal distribution (right skewed)?

Thank you so much!!

Looking for a seven point difference on a 28 point scale seems rather optimistic. If you are not happy with assuming normality then you can proceed using simulation. — mdewey, Jan 23 '22 at 15:31
For PHQ-9 a 7-point difference is approximately that between the major depression class and the other depression class, and also between the other depression class and the no depression class. Think carefully if that's a reasonable magnitude of difference to look for between your populations. The distribution in the general population probably isn't relevant to your study. — EdM, Jan 23 '22 at 20:31
@mdewey do you think it is fair to assume normality given that this is a very simple project at the medical student level? I need to do a simple power calculation and I need to tell my PI how large my sub-study population should be. — Steven Pinkler, Jan 24 '22 at 00:55

EdM · Answer 1 · 2022-01-24T14:25:54.937

From J Gen Intern Med 2001 Sep; 16(9): 606–613:

The mean PHQ-9 score was 17.1 (SD, 6.1) in the 41 patients diagnosed by the [medical health professional] as having major depression; 10.4 (SD, 5.4) in the 65 patients diagnosed as other depressive disorder; and 3.3 (SD, 3.8) in the 474 patients with no depressive disorder. The vast majority of patients (93%) with no depressive disorder had a PHQ-9 score less than 10, while most patients (88%) with major depression had scores of 10 or greater.

The distribution of PHQ-9 scores within a population thus depends on the prevalence of depressive-disorder classes in the population. As part of the power calculation you must specify values for some combination of depression-class prevalence and within-class distributions of scores. This would seem to be best handled by simulation.

It would be good if you had such estimates of score distributions from the particular populations with which you will be working, for example from a pilot study. If you do, then Frank Harrell's suggestion in another answer to base your power calculations and analysis on a two-sample Wilcoxon test would be a good way to proceed.

If you don't have your own information, the above quote gives you a starting point for those within-class distributions of scores to combine with your estimates of depression-class prevalence.

Unfortunately, within-disorder-class value distributions in the above quote seem to be non-normal. Thus you need to take some care in simulation. In particular, the no-disorder score SD of 3.8 is greater than the mean of 3.3, so a normal distribution would imply many impossible negative values.

You might generate a set of non-negative integer values for that class from a negative binomial distribution. For the R parameterization of the NegBinomial distribution, you could estimate a corresponding prob parameter from the ratio of the mean (3.3) to the standard deviation squared (3.8^2), or 0.2285319. The corresponding size parameter would be mean * prob/(1-prob), or 0.9775586. Then generate a random sample of N scores for that class with rnbinom(N, 0.9775586, 0.2285319). Those parameter estimates agree well with the above value of 93% of such patients having scores below 10.
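The moment matching above can be checked numerically. This is a minimal sketch in Python (standard library only) rather than the R call in the answer; the helper names `nb_params` and `nb_pmf` are illustrative and not taken from any library. In R, `dnbinom` and `pnbinom` give the same quantities directly.

```python
import math

def nb_params(mean, sd):
    """Match a negative binomial (R-style size/prob) to a given mean and SD."""
    var = sd ** 2
    if var <= mean:
        raise ValueError("a negative binomial needs variance > mean")
    prob = mean / var                # because var = mean / prob
    size = mean * prob / (1 - prob)  # because mean = size * (1 - prob) / prob
    return size, prob

def nb_pmf(k, size, prob):
    """P(X = k); size may be non-integer, so use log-gamma instead of factorials."""
    log_p = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
             + size * math.log(prob) + k * math.log(1 - prob))
    return math.exp(log_p)

# No-depressive-disorder class: mean 3.3, SD 3.8 (from the quoted paper).
size, prob = nb_params(3.3, 3.8)
p_below_10 = sum(nb_pmf(k, size, prob) for k in range(10))
print(f"size={size:.7f}  prob={prob:.7f}  P(score < 10)={p_below_10:.3f}")
```

The printed `P(score < 10)` should land close to the 93% reported for that class, which is the check described in the paragraph above.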

The problem with a normal-distribution assumption about scores for that class of patients is illustrated here:

The solid line shows a normal distribution with mean 3.3 and SD 3.8, while the points show the probability masses for the corresponding negative binomial distribution from the prior paragraph.

You could take the same approach to simulating integer values for the two depression-disorder classes. A quick check suggests about 5% of scores above 27 with the above approach to negative-binomial simulation for the major-depression class. Thus negative binomial simulation for that class might better be based on counts below 27 (mean = 27-17.1 = 9.9, SD = 6.1), then subtracting the simulations from 27. That agrees well with 88% of those patients having scores of 10 or greater. Among all 3 classes you will get a very small fraction of simulations outside the range of 0 - 27. Those could be eliminated or just set to the corresponding limits.
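The "quick check" for the major-depression class can be sketched the same way. This is an illustration under the same moment-matching assumptions, again in standard-library Python; `nb_params`, `nb_pmf` and `nb_cdf` are hypothetical helper names, not library functions.

```python
import math

TOP = 27  # maximum PHQ-9 score

def nb_params(mean, sd):
    """Moment-matched negative binomial parameters (R-style size/prob)."""
    prob = mean / sd ** 2
    return mean * prob / (1 - prob), prob

def nb_pmf(k, size, prob):
    log_p = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
             + size * math.log(prob) + k * math.log(1 - prob))
    return math.exp(log_p)

def nb_cdf(k, size, prob):
    """P(X <= k) by direct summation of the probability masses."""
    return sum(nb_pmf(j, size, prob) for j in range(k + 1))

# Direct fit to the major-depression scores (mean 17.1, SD 6.1):
size_d, prob_d = nb_params(17.1, 6.1)
p_above_top = 1 - nb_cdf(TOP, size_d, prob_d)       # impossible scores > 27

# Reflected fit: model X = 27 - score, so X has mean 9.9 and SD 6.1.
size_r, prob_r = nb_params(TOP - 17.1, 6.1)
p_score_ge_10 = nb_cdf(TOP - 10, size_r, prob_r)    # score >= 10  <=>  X <= 17
p_below_zero = 1 - nb_cdf(TOP, size_r, prob_r)      # score < 0    <=>  X > 27

print(f"direct fit, P(score > 27)     = {p_above_top:.3f}")
print(f"reflected fit, P(score >= 10) = {p_score_ge_10:.3f}")
print(f"reflected fit, P(score < 0)   = {p_below_zero:.3f}")
```

The first number corresponds to the "about 5%" of out-of-range scores from the direct fit, the second to the 88% of major-depression patients scoring 10 or more, and the third to the small fraction that would need to be dropped or set to 0.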

Finally, as the scores aren't going to be normally distributed overall within each population, you shouldn't count on power calculations from the relatively simple formulas that apply to t-tests, which aren't appropriate here anyway. You should choose an analysis method suited to your type of integer-scale data. I made some suggestions here. Then run multiple analyses on your simulated data to estimate power as a function of sample size. You would also want to repeat that power analysis over ranges of assumptions about depression-class prevalence and score distributions to see how sensitive your power estimates are to those assumptions.
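A rough sketch of that simulation loop follows, in standard-library Python. Several things here are assumptions made for illustration only: the class prevalences, the use of a Wilcoxon rank-sum test with a normal approximation (tie-corrected, no continuity correction) as the analysis, the reflection of the major-depression class only, and every function name. Substitute your own populations and your chosen analysis method.

```python
import math
import random
from statistics import NormalDist

TOP = 27  # PHQ-9 scores run from 0 to 27
# (mean, SD, reflect) per class; means and SDs are those quoted from the paper.
CLASSES = {"ND": (3.3, 3.8, False), "ODD": (10.4, 5.4, False), "MDD": (17.1, 6.1, True)}

def nb_pmf(k, size, prob):
    log_p = (math.lgamma(k + size) - math.lgamma(k + 1) - math.lgamma(size)
             + size * math.log(prob) + k * math.log(1 - prob))
    return math.exp(log_p)

def class_pmf(mean, sd, reflect, kmax=400):
    """Score probabilities for 0..27 from a moment-matched negative binomial.
    Mass falling outside 0..27 is moved to the nearest limit."""
    m = TOP - mean if reflect else mean
    prob = m / sd ** 2
    size = m * prob / (1 - prob)
    pmf = [0.0] * (TOP + 1)
    for k in range(kmax):
        score = TOP - k if reflect else k
        pmf[min(max(score, 0), TOP)] += nb_pmf(k, size, prob)
    return pmf

def population_pmf(prevalence):
    """Mix class distributions using assumed class prevalences (summing to 1)."""
    parts = {c: class_pmf(*CLASSES[c]) for c in prevalence}
    return [sum(w * parts[c][s] for c, w in prevalence.items()) for s in range(TOP + 1)]

def wilcoxon_p(x, y):
    """Two-sided rank-sum p-value: normal approximation with tie correction."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    cx, cy = [0] * (TOP + 1), [0] * (TOP + 1)
    for v in x:
        cx[v] += 1
    for v in y:
        cy[v] += 1
    rank_sum_x, seen, tie_term = 0.0, 0, 0
    for s in range(TOP + 1):
        t = cx[s] + cy[s]
        rank_sum_x += cx[s] * (seen + (t + 1) / 2)  # midrank of this tied block
        tie_term += t ** 3 - t
        seen += t
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var <= 0:
        return 1.0
    z = (u - n1 * n2 / 2) / math.sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))

def estimate_power(pmf_a, pmf_b, n_per_group, n_sims=1000, alpha=0.05, seed=1):
    """Fraction of simulated studies in which the test rejects at level alpha."""
    rng = random.Random(seed)
    scores = range(TOP + 1)
    hits = 0
    for _ in range(n_sims):
        x = rng.choices(scores, weights=pmf_a, k=n_per_group)
        y = rng.choices(scores, weights=pmf_b, k=n_per_group)
        hits += wilcoxon_p(x, y) < alpha
    return hits / n_sims

# Hypothetical populations: replace these prevalences with your own assumptions.
pop_a = population_pmf({"ND": 0.5, "ODD": 0.5})
pop_b = population_pmf({"ODD": 0.5, "MDD": 0.5})
for n in (10, 15, 20, 30):
    print(n, estimate_power(pop_a, pop_b, n))
```

Rerunning the last loop with different prevalence dictionaries is one way to do the sensitivity analysis described above. Feeding the same distribution in for both populations should give a rejection rate near alpha, which is a useful sanity check on the test itself.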

In response to comment:

Whether normal approximations will be adequate and, if so, what to assume about standard deviations depend on what you expect the compositions of your populations to be. Let's call them PopA and PopB.

Based on the cited literature, say that those with major depressive disorder (MDD) have mean scores of 17 with SD of 6, those with other depressive disorders (ODD) have mean scores of 10 with SD of 6, and those not depressed (ND) have mean scores of 3 with SD of 4. Then there is a difference of 7 between ND and ODD and a difference of 7 between ODD and MDD, consistent with the question.

If PopA is all MDD and PopB is all ODD, then a normality assumption for each group with SD of 6 in each group might do fine.
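For this one scenario, where the normality assumption might be tolerable, the textbook normal-approximation sample-size formula can be sketched as below. It assumes equal group sizes, a common SD, and a two-sided test; the function name `n_per_group` is made up for this example. A t-test-based calculation would give a slightly larger number, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist

def n_per_group(delta, sd, alpha=0.05, power=0.80):
    """Per-group n for a two-sided, two-sample z-test (equal SDs, equal group sizes):
    n = 2 * ((z_{1-alpha/2} + z_{power}) * sd / delta)^2, rounded up."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return math.ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)

# All-MDD vs all-ODD: difference of 7, SD of 6 in each group.
print(n_per_group(delta=7, sd=6))  # 12
```

The small number reflects how large a 7-point difference is relative to an SD of 6; it says nothing about the scenarios where the populations are mixtures or contain many non-depressed people.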

If PopA is all ND and PopB is all ODD, then you also have a mean difference of 7 between the PopA and PopB, but the normality assumption about the PopA/ND group is questionable.

If PopA is 50:50 ND:ODD and PopB is 50:50 ODD:MDD, then you still have a mean difference of 7 between PopA and PopB. But the within-Pop SD will be larger than the 4-6 for homogeneous populations (roughly 6.2 for PopA and 6.9 for PopB under these assumptions), and the within-Pop assumption of normality is tenuous.
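The mean and SD of such a mixed population follow from the class means, SDs and prevalences. Here is a small sketch using the rounded class values assumed above; the function name and the dictionary layout are illustrative only.

```python
import math

# Rounded class summaries assumed in this answer: (mean, SD).
CLASSES = {"ND": (3, 4), "ODD": (10, 6), "MDD": (17, 6)}

def mixture_mean_sd(prevalence):
    """Mean and SD of a mixture: E[X] = sum w*mu, E[X^2] = sum w*(sd^2 + mu^2)."""
    mean = sum(w * CLASSES[c][0] for c, w in prevalence.items())
    second = sum(w * (CLASSES[c][1] ** 2 + CLASSES[c][0] ** 2)
                 for c, w in prevalence.items())
    return mean, math.sqrt(second - mean ** 2)

mean_a, sd_a = mixture_mean_sd({"ND": 0.5, "ODD": 0.5})
mean_b, sd_b = mixture_mean_sd({"ODD": 0.5, "MDD": 0.5})
print(f"PopA: mean {mean_a:.1f}, SD {sd_a:.2f}")  # PopA: mean 6.5, SD 6.18
print(f"PopB: mean {mean_b:.1f}, SD {sd_b:.2f}")  # PopB: mean 13.5, SD 6.95
```

Only the first two moments are captured here. The shape of each mixture, which is what makes the normality assumption doubtful, still has to be examined by simulation.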

That's why I said at the beginning of the answer:

The distribution of PHQ-9 scores within a population thus depends on the prevalence of depressive-disorder classes in the population. As part of the power calculation you must specify values for some combination of depression-class prevalence and within-class distributions of scores.

Thanks for this detailed response -- this is obviously the ideal way to go for the most appropriate power calculation given my study. Unfortunately, I'm only a med student and much of the higher level statistics is over my head. It's also worth noting that my project is relatively simple, and I just need to generate a power calculation for my PI by Friday. In short -- is there a way to make some simple assumptions (i.e. assume normality, assume depression prevalence in my populations) in order to produce a power calculation, while noting that it was made with assumptions that may not hold? — Steven Pinkler, Jan 24 '22 at 00:52
@StevenPinkler the main non-normality problem comes from the non-depressed group, whose values are bunched up close to 0. If you are trying to compare a major-depression group against an “other depressive” group (difference about 7 units, SD about 6 within each group, means of both groups pretty far from the limits of the scale) then you might get by with a normality assumption. But if there are many non-depressed in your populations I would worry. It all depends on just what you expect your 2 populations to contain. — EdM, Jan 24 '22 at 03:45
@StevenPinkler I added to the answer to illustrate some scenarios. Whether a normality assumption is reasonable (and, if so, what you should assume about SDs for power calculations) depends heavily on your assumptions about the populations you are comparing. If your PI hasn't made those assumptions clear yet, get them clarified. Trying to do power calculations is an excellent way to expose and clarify potentially hidden assumptions. — EdM, Jan 24 '22 at 12:54
This is all excellent, but it's worth pointing out that PHQ-9 scores aren't generally bimodally distributed in the general population, as you might expect, because depression isn't a discrete present/absent thing. Instead, you see a heavy-tailed right-skewed distribution, with people increasingly likely to be diagnosed as "depressed" as their scores increase. — Eoin, Jan 24 '22 at 13:17
@Eoin I recognize that, but I think the main point here is to get the OP (and the OP's PI) to think explicitly about their assumptions about the populations they are trying to compare. If they had one population independently assessed as MDD and another as ODD, then assuming within-population normal distributions of PHQ-9 scores might be good enough. Otherwise... — EdM, Jan 24 '22 at 13:20
100% agree, I just wanted to make sure nobody came away from this thinking of a bimodal distribution with two well-separated peaks for ND and MDD cases. Not that I'm accusing you of making that mistake! — Eoin, Jan 24 '22 at 13:45

score 2 · Answer 2 · answered Jan 24 '22 at 13:07

If you have a representative sample of the scores you can compute the power or sample size for a Wilcoxon test. These are functions of the entire frequency distribution of the scores, and the effect size is stated in terms of an odds ratio, which can be converted to a difference in means for interpretation. Detailed examples with R code are in Chapter 7 of BBR. Note that the Wilcoxon test is a special case of the proportional odds model, and this model can handle arbitrarily strong clumping at zero.
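To give a feel for how the frequency distribution and the odds ratio enter such a calculation, here is a sketch of one commonly cited approximation for ordinal outcomes under proportional odds (Whitehead, 1993), written in standard-library Python. It is not the BBR code. The function name, the four-category distribution, and the odds ratio of 2 are all hypothetical, and `marginal_probs` is assumed to be the category distribution averaged over the two equal-sized groups.

```python
import math
from statistics import NormalDist

def po_total_n(marginal_probs, odds_ratio, alpha=0.05, power=0.80):
    """Approximate total N (two equal groups) under proportional odds:
    N = 12 * (z_{1-alpha/2} + z_{power})^2 / (log(OR)^2 * (1 - sum(p_i^3)))."""
    if abs(sum(marginal_probs) - 1) > 1e-8:
        raise ValueError("marginal_probs must sum to 1")
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    spread = 1 - sum(p ** 3 for p in marginal_probs)  # shrinks when scores clump
    return 12 * z ** 2 / (math.log(odds_ratio) ** 2 * spread)

# Hypothetical coarse grouping of scores into four bands, heavily clumped at the bottom.
example_probs = [0.55, 0.25, 0.12, 0.08]
print(math.ceil(po_total_n(example_probs, odds_ratio=2.0)))
```

The `1 - sum(p_i^3)` term is where clumping shows up: the more the scores pile into one category, the smaller that term and the larger the required sample.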
