The question is: formally speaking, what should the minimum sample size be in order to trust statistical metrics like the p-value and power?
Let's say we have populations A and B, each with values between 0 and 1, and we want to know whether there is any significant difference between them.
My typical approach is to use the p-value and power to decide. However, this only seems to work with "sufficiently" sized samples, and I am failing to formally define what counts as sufficient.
For example, let's say the mean value for A is 0.5, with 1000 samples and a standard deviation of 0.02, and the mean value for B is 1.0 with 2 samples. Plug this into your favorite p-value calculator or formula and it will claim the result is significant, even though we know there's a 25% chance of getting B's result under the null hypothesis.
With Za = 2.3263, the z critical value for a one-sided 1% significance level:
power z-score = 2.3263 - (1 - 0.5) / (0.02 / 2^(1/2)) ≈ -33
And
p-value z-score = (1 - 0.5) / ((0.5 * 0.5 / 1000)^(1/2)) ≈ 31.6
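For reference, here is a quick numerical sanity check of those figures (a minimal sketch in Python; the means, standard deviation, sample sizes, the 0/1-valued reading under which the 25% figure holds, and the proportion-style standard error are all just the hypothetical assumptions of this example):

```python
from math import sqrt
from scipy.stats import norm  # only used to show where 2.3263 comes from

# Hypothetical summary statistics from the example above
mu_A, sd_A, n_A = 0.5, 0.02, 1000
mu_B, n_B = 1.0, 2

# Chance of B's result under the null, if individual values are 0/1 with probability 0.5 each
print(0.5 ** n_B)                                    # 0.25

z_alpha = norm.ppf(0.99)                             # 2.3263 for a one-sided 1% level

# Power z-score: z_alpha - (mu_B - mu_A) / (sd_A / sqrt(n_B))
print(z_alpha - (mu_B - mu_A) / (sd_A / sqrt(n_B)))  # about -33

# p-value z-score, using a proportion-style standard error for A
print((mu_B - mu_A) / sqrt(0.5 * 0.5 / n_A))         # about 31.6
```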
We intuitively and mathematically know that 2 samples shouldn't be enough to conclude anything about the result, so how many samples should be the minimum?
If you propose some fixed minimum (e.g. 10), why that number? For example, if the probability of a value of 1 in the population were 0.001% and both samples in population B were 1, I might think, intuitively and mathematically, that there is some significance there.
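To make that concrete, the exact probability of B's result under the null can be computed directly (a sketch; the 0.001% rate is the hypothetical figure from the previous sentence):

```python
# Probability that both of B's two samples equal 1, if a 1 occurs with rate p under the null
p = 0.00001      # 0.001%, the hypothetical rate above
n_B = 2
print(p ** n_B)  # 1e-10: far below any conventional significance level, despite only 2 samples
```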