The sample size applied to a non-normal distribution

Question

I have a single variable that represents my population values (sample of data):

[1]  94.51  59.81  63.84  94.51  94.51  94.51  94.51  94.51  94.51  94.51
[11]  59.81  94.51  94.51  94.51  47.90  29.16  50.36  23.51  44.41  33.14
[21]  47.90  29.16  47.90  29.16  47.90  29.16  47.90  29.16  47.90  29.16
...
[331]  23.44  24.52  12.37  29.12  24.52  12.37  29.12  24.52  12.37  29.12
[341]  24.52  12.37  29.12  24.52  12.37  29.12  24.52  12.37  29.12  24.52
[351]  12.37  29.12  24.52  12.37  45.25  25.78  49.84  29.12  24.52  12.37
[361]  29.12  24.52  12.37  29.12  24.52  12.37


> summary(group$V1)
Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
6.11   35.94   59.13   62.31   86.10  111.50 
> mean(group$V1)
 [1] 62.30546
> sd(group$V1)
 [1] 29.55491

The corresponding histogram is:

[Figure: Histogram of BitScores of the Population]

And the Shapiro test of normality:

Shapiro-Wilk normality test

data:  group$V1
W = 0.9466, p-value = 3.161e-10

Given this, my conclusion is that the population is not normally distributed. The objective is to extract a sample from this population, but I am having trouble choosing a method to determine the sample size, because some methods assume that the population is normal (according to this reference). The sample is needed to compare this group with a random group of the same sample size, and the single variable to evaluate is the BitScore.

Any references, suggestions, or approaches? Thanks in advance.

score 2 · Accepted Answer · edited Apr 13 '17 at 12:44

I'm going to start by accepting your claim that you have the population, but I'll come back to this issue at the end.

1) If you actually have the target population, then hypothesis tests - which are based on assuming you have samples, not populations - are pointless. You can answer such questions by inspection. If that's the population about which you wish to make inferences, it's plainly not normally distributed. The p-value is irrelevant.

2) Before worrying about whether your population is normal, first worry about whether you do actually need that assumption for something ... and then work out how much of an issue non-normality might be. So which particular things do you need to assume normality to use, and how critical is some degree of non-normality to their results?

3) For this kind of purpose, hypothesis tests of distribution shape don't really answer the right question in any case. e.g.1, e.g.2
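As a rough illustration of point 2, one way to gauge how much non-normality matters for a procedure based on the sample mean is to draw repeated samples from the population and look at how the sample mean behaves. This is only a sketch: `population` is a hypothetical bimodal stand-in (the full 366 values are not reproduced here; in practice you would use `group$V1`), and the sample size of 30 and the helper `sample_means` are arbitrary choices for illustration.

```r
set.seed(1)

# Hypothetical stand-in for the 366 population values (group$V1 in the question).
# A bimodal mixture is used purely for illustration; it is not the real data.
population <- c(runif(200, 10, 50), runif(166, 80, 110))

# Draw `reps` samples of size n (without replacement) and return their means.
sample_means <- function(pop, n, reps = 5000) {
  replicate(reps, mean(sample(pop, n)))
}

means_n30 <- sample_means(population, n = 30)

# The simulated means should centre on the population mean.
c(pop_mean = mean(population), mean_of_means = mean(means_n30))

# Standardise the simulated means and compare a few quantiles with the
# standard normal values (about -1.96, 0, 1.96) to judge how close the
# sampling distribution of the mean is to normal at this n.
z <- (means_n30 - mean(means_n30)) / sd(means_n30)
round(quantile(z, c(0.025, 0.5, 0.975)), 2)
```

If those quantiles are close to the normal ones, normal-theory procedures for the mean are probably tolerable at that sample size even though the population itself is clearly non-normal. A normal Q-Q plot of `means_n30` (via `qqnorm`) gives the same information visually.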

--

Now, to try to address your underlying question, which relates to determining sample sizes for hypothesis tests.

a) You say you have the population. Why do you need hypothesis tests at all? Just look at the population. Want to see if some mean value differs from some hypothesized value? You have the population mean already, so just look at the number! Is it the same number as the hypothesized value or not?

b) Let's say there is some reason to do a hypothesis test when you have the population. You can just simulate samples from your population (by drawing randomly from the population of values) in order to find the minimum sample size with the required characteristics. But since the simulations would actually be the samples, your question would already be answered by then choosing one of your simulated samples at random and labelling it 'My Sample'. [Quite why one would be interested in such a performance is beyond me, but when you have the population, that is drawing a sample.]
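The simulation in (b) could be sketched roughly as follows. Everything specific here is an assumption made for illustration and does not come from the question: the stand-in `population` (use `group$V1` in practice), the choice of a one-sample t-test, the hypothesized value `mu0`, the 5% level, the 80% power target, and the grid of candidate sizes.

```r
set.seed(42)

# Hypothetical stand-in for the population values (group$V1 in the question).
population <- c(runif(200, 10, 50), runif(166, 80, 110))

# Assumed for illustration only: a one-sample t-test against mu0 at
# alpha = 0.05, with a target power of 0.80.
mu0 <- mean(population) - 10
alpha <- 0.05
target_power <- 0.80

# Estimated power at sample size n: the proportion of simulated samples
# (drawn without replacement from the population) in which H0 is rejected.
power_at <- function(n, reps = 1000) {
  mean(replicate(reps, t.test(sample(population, n), mu = mu0)$p.value < alpha))
}

candidate_n <- seq(10, 150, by = 10)
power_est <- sapply(candidate_n, power_at)

# Smallest candidate size whose estimated power reaches the target.
min_n <- candidate_n[which(power_est >= target_power)[1]]

data.frame(n = candidate_n, power = round(power_est, 3))
min_n
```

Because the samples are drawn from the actual population values, no normality assumption is needed to get the power estimates; the test is simply evaluated under the distribution you really have. The estimates are noisy at 1000 replications, so a finer grid or more replications near the threshold would give a more reliable answer.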

At the end it sounds like you want to compare a particular subgroup with the population as a whole on a particular variable, but you don't say what you want to compare about them - means? some general notion of location? spread? distributional shape?

Why would you need the groups to be the same size?
You say you have the population. The subgroup is therefore the population of that subgroup. Whatever you want to compare, you just compare the numbers and see if they're the same. (Of course, they won't be - you know this before you start. This is a dumb exercise, because you're trying to answer a question you already know the correct answer to.)

[Finally, I'm going to make a little bet. I bet you don't actually have the population about which you wish to make inferences. I bet you wish to extend your inference outside of those 366 values to something broader - your actual target population. This is no doubt part of the reason why you retain some urge to do hypothesis tests.]

My idea is: based on this data (all of the data generated), obtain a sample size statistically significant (95%), then generate a random group and evaluate it to obtain the corresponding values. Also, randomly select a sample of that size from the 366 values and compare these values against the randomly generated ones. I believed that sample size techniques depended on an assumption that the original data are normally distributed. — Cristian Velandia, Apr 15 '13 at 00:18
What does "*obtain a sample size statistically significant (95%)*" mean? What is significant? I understand the sampling you describe ... but the question remains - *why* would one choose to do such a thing? To what purpose? What does it tell you that direct examination of the population does not? — Glen_b, Apr 15 '13 at 00:54
So in this case the sampling strategy isn't necessary, because extracting a sample from the population does not provide any information beyond what the population gives directly? But obtaining the BitScores for large populations is very CPU-intensive (in some cases ~100,000 records as input), so I supposed that sampling with a specific size was the way to reduce time and cost. — Cristian Velandia, Apr 15 '13 at 02:12
Hold on, I'm confused. Are the 366 values you presented your population or not? If they are, then cost is irrelevant, you already have them. If not, then what are we even talking about? — Glen_b, Apr 15 '13 at 02:51
Yes, in this population group I have 366, but I have other independent groups that I need to process this way... In some groups (or 'populations') I have close to 100,000 records. — Cristian Velandia, Apr 15 '13 at 03:05
It seems like your question would be improved by you being more explicit about the circumstances and your reasoning behind using sampling. — Glen_b, Apr 15 '13 at 03:06
let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/8345/discussion-between-cristian-velandia-and-glen-b) — Cristian Velandia, Apr 15 '13 at 03:08
