I have 12 samples with approx. 50,000 data points each. I got them from a content analysis of Reddit comments. The data were generated by analyzing the comments with the LIWC tool, which gives me information such as the percentage of pronouns in a comment. The comments cover one year and consist of every publicly available comment on Reddit for five particular subreddits.

The central limit theorem says that with a certain number of samples, the data will be normally distributed. This is also what I was told when I asked friends who are quite good at statistics (Master's in econometrics etc.) whether I can compare groups with the t-test. But when I plot the data, it looks quite skewed and not normal at all, and I wonder: at what point can you no longer rely on the central limit theorem?

How can I be sure that the attributes of the population are not naturally non-normally distributed?

  • 2
    (1) See http://stats.stackexchange.com/questions/69898 for a similar question. (2) This is not at all what the CLT says. Before proceeding, you might want to review it. Discussions can be found by [searching our site](http://stats.stackexchange.com/search?tab=votes&q=CLT). – whuber Aug 08 '16 at 16:08
  • Can you show us a plot of your data distribution? – kjetil b halvorsen Aug 08 '16 at 20:20

1 Answer

Confirming Whuber's comment, this is not what the central limit theorem says. The distribution does not get less skewed as the sample size increases. All you get is a more and more accurate picture of the shape of the true distribution in the population (just as you get a more accurate estimate of the mean, the SD, etc).
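To make this concrete, here is a minimal sketch in Python. It assumes an exponential population as a stand-in for skewed data (it is not the LIWC data from the question), and the helper names `skewness` and `raw_skewness_by_size` are made up for illustration. It draws samples of increasing size and measures the skewness of the raw values.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def raw_skewness_by_size(sizes, seed=0):
    """Draw one sample per size from an exponential population (an assumed
    stand-in for skewed data) and return the skewness of each raw sample."""
    rng = random.Random(seed)
    return {n: skewness([rng.expovariate(1.0) for _ in range(n)]) for n in sizes}


if __name__ == "__main__":
    for n, s in raw_skewness_by_size([100, 5_000, 50_000]).items():
        print(f"n={n:>6}: skewness of raw data = {s:.2f}")
```

The exponential distribution has a population skewness of 2, so the larger samples should settle near that value rather than drift towards 0: more data sharpens the picture of the skew, it does not remove it.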

What the central limit theorem says (amongst other things) is that the sampling distribution of the mean gets closer to normal as the sample size gets bigger. This sampling distribution is the distribution of the means of samples: in other words, if you took lots of samples of 50,000 items and plotted the means of those samples as a new distribution in their own right, that histogram would tend to normality, regardless of the distribution of the original data (provided it has finite variance). It is this that allows you to carry out a t-test regardless of the normality of the original distribution, when the sample size is large enough, and there can surely be no doubt that 50,000 is going to be 'large enough' in this context. [Note: I clarified "of the mean" in the first sentence and added "surely" in the final sentence after reading comments on my answer.]
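If you want to check this rather than take it on trust, a small simulation along the following lines may help. It is only a sketch under stated assumptions: the population is exponential (a stand-in, not the Reddit data), the sample size is 500 rather than 50,000 to keep the run short, and `skewness` and `simulate_sample_means` are hypothetical helper names.

```python
import random


def skewness(values):
    """Sample skewness: third central moment over the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def simulate_sample_means(sample_size, n_samples, seed=0):
    """Draw n_samples independent samples from an exponential population
    (an assumed stand-in for skewed data) and return the mean of each one."""
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        total = sum(rng.expovariate(1.0) for _ in range(sample_size))
        means.append(total / sample_size)
    return means


if __name__ == "__main__":
    rng = random.Random(1)
    raw = [rng.expovariate(1.0) for _ in range(50_000)]
    means = simulate_sample_means(sample_size=500, n_samples=2_000)
    print(f"skewness of raw data:     {skewness(raw):.2f}")
    print(f"skewness of sample means: {skewness(means):.2f}")
```

For an exponential population the skewness of the mean of n observations is 2/sqrt(n), so the sample means should be far less skewed than the raw values. With your own data you could do the analogous check by resampling each group with replacement and looking at a histogram of the resampled means.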

MikeG
  • 1
    (+1) Nitpick, but I think it's important. The *sampling distribution* is, I believe, the distribution of the samples themselves, i.e. the distribution of datasets of some size, where each datapoint is drawn independently from the base distribution. It is the *sampling distribution of the mean* which, as you describe, the central limit theorem applies to. – Matthew Drury Aug 08 '16 at 17:23
  • Re "can be no doubt": although your conclusion is likely correct, I would urge some moderation of that certainty in light of the data analyzed at http://stats.stackexchange.com/questions/69898. If one were to interpret the present question generously and constructively, one might be inclined to offer suggestions about how to verify whether your conclusion is true. – whuber Aug 08 '16 at 17:34
  • You mean to say "the sampling distribution of the mean", not just "the sampling distribution". Any statistic has a sampling distribution – Glen_b Aug 09 '16 at 08:52