I am comparing the scores of two student groups using a t-test (in Python with SciPy; scikit-learn itself does not provide t-tests). The groups have different numbers of students. Since there is a considerable number of 0 scores in my data, the standard deviation is even higher than the mean for the groups. So I wonder: in this case, is it still acceptable to keep the 0 scores? I appreciate any information!
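For reference, a minimal sketch of such a comparison in SciPy (the score arrays below are made up purely to illustrate the zero-inflated, unequal-size setting described in the question):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Hypothetical data: two groups of unequal size with many zero scores.
group_a = np.concatenate([np.zeros(30), rng.normal(60, 15, 70)])
group_b = np.concatenate([np.zeros(20), rng.normal(55, 15, 50)])

# Welch's t-test: equal_var=False, so equal variances are not assumed,
# which matters here because the zeros inflate the spread of each group.
t_stat, p_value = stats.ttest_ind(group_a, group_b, equal_var=False)
print(t_stat, p_value)
```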


-
What do the scores mean? Do students actually get scores of 0? The issue for the t-test is what the two distributions look like. It doesn't make sense to drop 0 scores if they are real. The t-test can be used with unequal sample sizes. It is usually assumed that the two variances are equal when applying the t-test for comparing two means, but even in cases where the two variances are obviously different, Welch's test, which approximates a t distribution under the null hypothesis, can be applied. If the distributions differ greatly from the normal, Wilcoxon's rank-sum test can be used. – Michael R. Chernick Jan 11 '17 at 18:50
-
Thanks for the detailed reply. I wish you would post this as an answer! So, if scores of 0 are meaningful in my data, then I should keep them, and high variance does not violate anything, as far as I understand. – renakre Jan 11 '17 at 18:55
-
@renakre I have put the answer part of my comment into an answer, as you requested. – Michael R. Chernick Jan 11 '17 at 20:52
-
"Having a lot of 0 scores" could be a serious problem if you want the t-statistic to have a t-distribution (i.e., if you want p-values calculated as if you had normally distributed populations to tell you anything). – Glen_b Jan 12 '17 at 00:30
1 Answer
1. If the data come from normal distributions with a common variance, the t distribution is exactly the correct distribution under the null hypothesis. The sample sizes do not need to be equal. (In practice we don't know whether or not these assumptions are true, but they may be reasonable assumptions based on other knowledge.) For the test statistic, use the pooled estimate of variance.
2. Under the same assumptions as in 1, except that the variances are known to be very different, you use separate estimates of variance and apply Welch's test (approximately t, with possibly a non-integer number of degrees of freedom).
3. The fact that there are many zeros means that the normality assumption is suspect. It is not justifiable to remove the zeros, because they are legitimate scores. The Wilcoxon rank-sum test only requires that the samples are independent and have the same distribution under the null hypothesis. As Bill Huber points out, because the test uses ranks and you have zeros that could lead to several ties, it may not be the best alternative. Another alternative is the bootstrap, which does not involve ranks and does not require any normality assumption.
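The three approaches above, plus the bootstrap, can be sketched in Python with SciPy and NumPy. The score arrays here are invented for illustration, not the OP's data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Hypothetical zero-inflated scores: unequal group sizes, many zeros.
group_a = np.concatenate([np.zeros(30), rng.normal(60, 15, 70)])
group_b = np.concatenate([np.zeros(20), rng.normal(55, 15, 50)])

# 1. Pooled two-sample t-test (assumes equal variances).
t_pooled = stats.ttest_ind(group_a, group_b, equal_var=True)

# 2. Welch's test: separate variance estimates, approximate
#    (possibly non-integer) degrees of freedom.
t_welch = stats.ttest_ind(group_a, group_b, equal_var=False)

# 3. Wilcoxon rank-sum test; note the many ties at zero.
w = stats.ranksums(group_a, group_b)

# 4. Bootstrap of the difference in means: resample each group with
#    replacement and form a percentile confidence interval.
n_boot = 10_000
diffs = np.empty(n_boot)
for i in range(n_boot):
    a = rng.choice(group_a, size=group_a.size, replace=True)
    b = rng.choice(group_b, size=group_b.size, replace=True)
    diffs[i] = a.mean() - b.mean()
ci = np.percentile(diffs, [2.5, 97.5])
print(t_pooled.pvalue, t_welch.pvalue, w.pvalue, ci)
```

If the bootstrap interval for the mean difference excludes 0, that is evidence of a group difference without any normality assumption.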

-
Isn't the issue whether the *sampling distributions* of the means are normal, not whether the underlying distributions of the data are normal? One difficulty with any rank sum test in this situation is the potential for a large number of ties at the rank of zero. – whuber Jan 11 '17 at 22:21
-
@whuber I was trying to give the OP an understanding of some of the assumptions that might lead to a t test being acceptable. I gave most of this in my comment and the OP encouraged me to put it in an answer. The other issue, which seemed to be more important to the OP, was whether or not the zeros should be kept. With regard to Wilcoxon, I agree that ties coming about because of the zeros can be a problem. I could easily have recommended the bootstrap as well. Mainly I wanted the OP to understand that there are alternatives. – Michael R. Chernick Jan 11 '17 at 22:48
-
Regarding the sampling distribution being normal, that will occur if the data are normal. My point was that if the data are approximately normal, the t test will work. I wasn't going to go into the CLT. It depends on the sample size and how far from normal the original samples are. – Michael R. Chernick Jan 11 '17 at 22:51
-
Those issues would seem to be immediately relevant and of interest to the OP. Unless they are addressed, it seems difficult to justify a blanket recommendation of the Wilcoxon RS test. Furthermore, if your interest lies in helping the OP understand the test, wouldn't it be better to focus the discussion on what really matters--the sampling distribution--rather than something that is less directly relevant (the data distribution)? – whuber Jan 11 '17 at 22:56
-
@whuber I didn't realize that I made such a strong recommendation. I will edit the post! – Michael R. Chernick Jan 11 '17 at 22:58
-
(+1) Thank you for the clarifying remarks. BTW, a nice example of problematic data--perhaps like the data described here--was the subject of a thread at http://stats.stackexchange.com/questions/69898. It was interesting to see just how large a sample might be needed for an accurate t-test in this case with real data (involving "lots of zeros"). – whuber Jan 11 '17 at 23:10
-
@whuber You found an interesting question which you probably remember so well because you gave the "best" answer. I encourage the OP and others to check out that thread. – Michael R. Chernick Jan 11 '17 at 23:17
-
Michael, you are correct that I found it easily because I remembered it. However, to locate the thread I needed to search our site. The first keywords I thought of using were "t-test skewness," but the first page of hits looked unpromising. I tried again using ["t-test skewed"](http://stats.stackexchange.com/search?q=t-test+skewed)--and guess what the first hit is? :-) In many cases of searching it can help to remember a user who participated. For instance, twice this week I have looked up Douglas Zare's nice description of the Cauchy distribution by including his user ID in the search. – whuber Jan 11 '17 at 23:26
-
I'm just trying to make the point that this site is sufficiently mature that in the majority of cases you will find an excellent answer through a search, and therefore you might consider doing a search before posting an answer to any question. Plenty of times I have had to migrate or delete one of my answers when I neglected to do so and somebody later found a duplicate. Thus, gaining some experience with search tools and their idiosyncrasies should be considered a valuable skill for using this site well. – whuber Jan 11 '17 at 23:31
-
@whuber I don't agree. The question is more like asking whether to remove zeros, not about "t-test skewness". – SmallChess Jan 11 '17 at 23:33
-
@Student You are focusing on a superficial aspect of the question rather than the underlying phenomenon. The sole reason for asking about removing zeros is that the data are so highly skewed. – whuber Jan 11 '17 at 23:38