I have a data set of roughly 10,000 observations split across 64 categories. The mean of most categories is close to the mean of the entire data set, but some differ noticeably.
If I understand correctly, I can apply a t-test to determine whether a category's mean differs significantly from the rest of the data. However, some categories are very small relative to the whole (< 50 observations), which, again if I understand correctly, reduces the power of the t-test and makes the resulting p-value less reliable.
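To make the setup concrete, here is a minimal sketch of the comparison I have in mind, using simulated data (the group size of 40 and the distributions are hypothetical, just to illustrate the shape of the problem). I use Welch's t-test via `scipy.stats.ttest_ind` since I am unsure the variances match:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical data: ~10,000 observations, one small category of 40
rest = rng.normal(loc=0.0, scale=1.0, size=9960)   # all other categories
small_group = rng.normal(loc=0.5, scale=1.0, size=40)  # category under test

# Welch's t-test (equal_var=False), which does not assume equal variances
t_stat, p_value = stats.ttest_ind(small_group, rest, equal_var=False)
print(t_stat, p_value)
```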
One source suggests that a solution to this is to "monte carlo the data", which I interpret as repeatedly sampling from the 10k data set (excluding the category under test) to build synthetic samples of the same size as the category, and running the t-test against each. I presume I then take the mean of those p-values as a more reliable p-value. Is this the correct approach?
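For reference, here is a sketch of what I understand a Monte Carlo resampling test to look like in this situation. Note that this version compares the observed group mean directly against the means of many same-sized random subsamples, rather than averaging t-test p-values; I am not sure which of the two the source intends. All names and sizes here are my own illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: one small category of 40 plus the remaining observations
data = rng.normal(size=10_000)
group = data[:40]   # the category under test (n = 40)
rest = data[40:]    # everything else

observed = group.mean()
n, n_iter = len(group), 2_000

# Repeatedly draw random subsamples of the same size from the rest of
# the data and record their means.
sim_means = np.array([rng.choice(rest, size=n, replace=False).mean()
                      for _ in range(n_iter)])

# Two-sided Monte Carlo p-value: the fraction of simulated means at
# least as far from the rest-of-data mean as the observed group mean.
center = rest.mean()
p_mc = np.mean(np.abs(sim_means - center) >= abs(observed - center))
print(p_mc)
```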
If so, there is also the question of establishing whether the variances are equivalent. Should I run Levene's test on the real sample vs. the synthetic sample and use that result to decide which form of the t-test to run?
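What I mean by "feed that result into the t-test" is sketched below: run `scipy.stats.levene` first, then use its outcome to choose between the pooled-variance Student's t-test and Welch's t-test. The data, sizes, and the 0.05 cutoff are illustrative assumptions on my part:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
group = rng.normal(0.0, 2.0, size=40)    # small category, larger spread
rest = rng.normal(0.0, 1.0, size=1000)   # comparison sample

# Levene's test for equality of variances
lev_stat, lev_p = stats.levene(group, rest)

# Use the result to pick the t-test variant: pooled-variance Student's t
# if the variances look equal, Welch's t otherwise.
equal_var = bool(lev_p > 0.05)
t_stat, p = stats.ttest_ind(group, rest, equal_var=equal_var)
print(lev_p, equal_var, p)
```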
(I have read How should one interpret the comparison of means from different sample sizes?)