From J Gen Intern Med 2001 Sep; 16(9): 606–613:
The mean PHQ-9 score was 17.1 (SD, 6.1) in the 41 patients diagnosed by the [medical health professional] as having major depression; 10.4 (SD, 5.4) in the 65 patients diagnosed as other depressive disorder; and 3.3 (SD, 3.8) in the 474 patients with no depressive disorder. The vast majority of patients (93%) with no depressive disorder had a PHQ-9 score less than 10, while most patients (88%) with major depression had scores of 10 or greater.
The distribution of PHQ-9 scores within a population thus depends on the prevalence of depressive-disorder classes in the population. As part of the power calculation you must specify values for some combination of depression-class prevalence and within-class distributions of scores. This would seem to be best handled by simulation.
It would be good if you had such estimates of score distributions from the particular populations with which you will be working, for example from a pilot study. If you do, then Frank Harrell's suggestion in another answer to base your power calculations and analysis on a two-sample Wilcoxon test would be a good way to proceed.
If you don't have your own information, the above quote gives you a starting point for those within-class distributions of scores to combine with your estimates of depression-class prevalence.
Unfortunately, within-disorder-class value distributions in the above quote seem to be non-normal. Thus you need to take some care in simulation. In particular, the no-disorder score SD of 3.8 is greater than the mean of 3.3, so a normal distribution would imply many impossible negative values.
You might generate a set of non-negative integer values for that class from a negative binomial distribution. For the R parameterization of the NegBinomial distribution, you could estimate a corresponding prob parameter from the ratio of the mean (3.3) to the standard deviation squared (3.8^2), or 0.2285319. The corresponding size parameter would be mean * prob/(1-prob)
, or 0.9775586. Then generate a random sample of N scores for that class with rnbinom(N, 0.9775586, 0.2285319)
. Those parameter estimates agree well with the above value of 93% of such patients having scores below 10.
The problem with a normal-distribution assumption about scores for that class of patients is illustrated here:

The solid line shows a normal distribution with mean 3.3 and SD 3.8, while the points show the probability masses for the corresponding negative binomial distribution from the prior paragraph.
You could take the same approach to simulating integer values for the two depression-disorder classes. A quick check suggests about 5% of scores above 27 with the above approach to negative-binomial simulation for the major-depression class. Thus negative binomial simulation for that class might better be based on counts below 27 (mean = 27-17.1 = 9.9, SD = 6.1
), then subtracting the simulations from 27. That agrees well with 88% of those patients having scores of 10 or greater. Among all 3 classes you will get a very small fraction of simulations outside the range of 0 - 27. Those could be eliminated or just set to the corresponding limits.
Finally, as the scores aren't going to be normally distributed overall within each population, you shouldn't count on power calculations from the relatively simple formulas that apply to t-tests, which aren't appropriate here anyway. You should choose an analysis method suited to your type of integer-scale data. I made some suggestions here. Then run multiple analyses on your simulated data to estimate power as a function of sample size. You would also want to repeat that power analysis over ranges of assumptions about depression-class prevalence and score distributions to see how sensitive your power estimates are to those assumptions.
In response to comment:
Whether normal approximations will be adequate and, if so, what to assume about standard deviations depend on what you expect the compositions of your populations to be. Let's call them PopA and PopB.
Based on the cited literature, say that those with major depressive disorder (MDD) have mean scores of 17 with SD of 6, those with other depressive disorders (ODD) have mean scores of 10 with SD of 6, and those not depressed (ND) have mean scores of 3 with SD of 4. Then there is a difference of 7 between ND and ODD and a difference of 7 between ODD and MDD, consistent with the question.
If PopA is all MDD and PopB is all ODD, then a normality assumption for each group with SD of 6 in each group might do fine.
If PopA is all ND and PopB is all ODD, then you also have a mean difference of 7 between the PopA and PopB, but the normality assumption about the PopA/ND group is questionable.
If PopA is 50:50 ND:ODD and PopB is 50:50 ODD:MDD, then you still have a mean difference of 7 between PopA and PopB. But the within-Pop SD will be much larger than the 4-6 for homogeneous populations, and the within-Pop assumption of normality is tenuous.
That's why I said at the beginning of the answer:
The distribution of PHQ-9 scores within a population thus depends on the prevalence of depressive-disorder classes in the population. As part of the power calculation you must specify values for some combination of depression-class prevalence and within-class distributions of scores.