t-test for a non-normal population

Question

Disclaimer: I'm a programmer. I am not a statistician. My last statistics class was (ermumble) years ago.

I read "t-distribution for sample mean from non normal population" and while I'm sure you all know exactly what you're talking about, I don't, so I'll ask for my specific case.

This is part of a class. The data set is (apparently) from Kaggle. The data is "medical charges". There are 1338 records. Naive statistics on the set give us:

mean: 13270.42 
std. dev.: 12110.01
median: 9382.03

and a histogram as shown:

Decidedly not a normal distribution.

The class exercise asks: "The administrator is concerned that the actual average charge has fallen below 12000. On the assumption that these data represent a random sample of charges, how would you justify that these data allow you to answer that question? What would be the most appropriate frequentist test to apply? What is the appropriate confidence interval in this case? A one-sided or two-sided interval? Calculate the critical value and the relevant 95% confidence interval for the mean and comment on whether the administrator should be concerned?"

I figured, Central Limit Theorem, resample the means and work toward a better distribution.

import numpy as np
from numpy.random import seed

# `medical` is assumed to be a pandas DataFrame already loaded with the data set
m = medical.charges.to_numpy()

seed(47)
sample_mean = []

# calculate 100 means, each from 50 records drawn (with replacement) from the data set
for _ in range(100):
    this_sample = np.random.choice(m, 50)
    sample_mean.append(np.mean(this_sample))

mean_of_means = np.mean(sample_mean)
std_of_means = np.std(sample_mean, ddof=1)

print("mean", mean_of_means, "\nstd. dev.:", std_of_means)

I ended up with a resampled mean of 13326 and a resampled std. dev. of 1476, and plugged them into a t-test.

However, the exercise apparently expects me to simply plug the initial mean and std. dev. into the t-test. (I know this because the following exercise tells me what answer they expected me to get for this one.)

Can I just blithely push the "mean" and "std dev" of this extremely non-normal data set through a t-test? And if so... why? (given that the docs all say "for a normal distribution...")

"Can I just blithely push the 'mean' and 'std dev' of this extremely non-normal data set through a t-test?" Yes. The distribution of the simple random sample means is approximately normal, no matter what the distribution of the individual data are. That's the magic of the central limit theorem, and it's why people write it in all caps Central Limit Theorem. — Him, Sep 26 '19 at 23:40
I used to teach statistics 101 at university, and when I show people the central limit theorem, and they're not like "whhhhhhaaaaaatttttttt", I'm a little heartbroken. Yours is the appropriate response. :) — Him, Sep 26 '19 at 23:41
@LSC The OP [may be interested](https://stats.stackexchange.com/questions/428917/t-test-for-a-non-normal-population/428921#comment800273_428921) to observe that the $p$-value is perhaps best thought of as a "metric of evidence" rather than a probability at all. Trying to interpret it further [is perilous](https://www.americanscientist.org/article/the-statistical-crisis-in-science) — Him, Sep 27 '19 at 00:37
Vicki, it might be instructive to perform a little experiment. Since you've already got some code to resample the mean, try treating your 1338 records as the population, and take a whole bunch of random samples of size 30. Find the mean of each of those samples. Make a histogram of those values. — Him, Sep 27 '19 at 01:55
@Scott, well aware of problems the lay people have interpreting p-values. That article is rife with many of the common errors in interpretation (just in their first sentences they show their lack of understanding). Sure, the p-value can be thought of as a continuous measure of (in)compatibility between the observed data and a particular null hypothesis: this flows naturally from the technical definition that "if the null is true, the p-value is the probability of obtaining a result at least as extreme as the observed one." I think there is a bit of running in circles on this thread. Cheers. — LSC, Sep 28 '19 at 01:30
@Scott "try treating your 1338 records as the population, and take a whole bunch of random samples of size 30." -- I did that. I believe in that one. That's where I got a slightly smaller mean and a much smaller std dev (as expected for a more normal dist) and used those values for the t test. I can believe in using those values. So... because I believe that, I should trust the function and use the original sample values? — Vicki B, Sep 28 '19 at 19:40

score 3 · Accepted Answer · answered Sep 26 '19 at 23:22

The solution shouldn't plug the standard deviation directly into the t-test; it should use the standard error (the standard deviation divided by the square root of the sample size). If that is what they did, it sounds appropriate. Because you are not a statistician (nor, dare I say, statistically savvy), I'll try to explain without any math.

The sample mean is just a scaled sum: Add up the data and divide by the number of data points. Because the sample mean is a sum, we can use the Central Limit Theorem to do inference about this sum.

The Central Limit Theorem says that when we have enough data, the sampling distribution of the sample mean is approximately Normal, with expectation equal to the population mean and standard deviation equal to the standard error.

Go ahead; plot your resampled means. I bet they look normal. Check the standard deviation of those resampled means. I bet it is close to the standard error for samples of that size.
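Here is a minimal sketch of that check. The charges data is not reproduced in this thread, so the sketch assumes a synthetic right-skewed stand-in population; the lognormal parameters, the sample size of 50, and the number of resamples are illustrative choices, not values from the exercise.

```python
import numpy as np

rng = np.random.default_rng(47)

# Stand-in "population": 1338 right-skewed values (an assumption, not the real charges).
population = rng.lognormal(mean=9.0, sigma=1.0, size=1338)

n = 50               # size of each resample
n_resamples = 5000   # how many sample means to collect

# Draw many samples of size n (with replacement) and record each sample mean.
samples = rng.choice(population, size=(n_resamples, n), replace=True)
sample_means = samples.mean(axis=1)

# Standard error predicted for the mean of n draws from this population.
predicted_se = population.std(ddof=0) / np.sqrt(n)
observed_sd = sample_means.std(ddof=1)

print("std. dev. of resampled means:", observed_sd)
print("population sd / sqrt(n):     ", predicted_se)
```

The two printed numbers should be close to each other. Note that the result depends on the resample size: with the figures from the question, 12110.01 / sqrt(50) is about 1713, while 12110.01 / sqrt(1338) is about 331. That is why the spread of means of 50 records is much larger than the standard error of the mean of all 1338 records.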

The sample mean is thus a draw from this distribution. We can perform inference on it by subtracting the mean (in this case, the hypothesized population mean 12000) and dividing by the standard deviation (in this case, the standard error).

Because we don't know the standard deviation exactly, we have to estimate it. When we use an estimate of the standard deviation in our test, we switch from using a z-test to a t-test.

So, because the sample mean is assumed to come from a normal distribution (thanks to the Central Limit Theorem), we can use a t-test on this non-normal data.
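As a rough sketch of the calculation, using only the summary statistics quoted in the question: this assumes the quoted standard deviation is the sample standard deviation, and that a one-sided interval (a lower bound on the mean) is the framing the exercise wants. The variable names are mine.

```python
import numpy as np
from scipy import stats

# Summary statistics quoted in the question.
n = 1338
sample_mean = 13270.42
sample_sd = 12110.01
mu_0 = 12000.0  # the value the administrator is worried about

# Standard error of the mean: sample sd divided by sqrt(n).
standard_error = sample_sd / np.sqrt(n)

# t statistic: distance of the sample mean from mu_0, in standard errors.
t_stat = (sample_mean - mu_0) / standard_error

# One-sided 95% critical value from the t distribution with n - 1 degrees of freedom.
df = n - 1
t_crit = stats.t.ppf(0.95, df)

# One-sided 95% lower confidence bound for the population mean.
lower_bound = sample_mean - t_crit * standard_error

print("standard error:", standard_error)
print("t statistic:   ", t_stat)
print("critical value:", t_crit)
print("lower bound:   ", lower_bound)
```

Under these assumptions the standard error is about 331, the critical value is about 1.646, and the lower bound comes out well above 12000, so the data give the administrator no reason for concern.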

If you can get your hands on "An Introduction to Medical Statistics" by Martin Bland, he does a very good job outlining this rationale in chapter 8.

You may dare. I am programmatically savvy, problem-solving savvy, and technically savvy. I am statistically befuddled. — Vicki B, Sep 28 '19 at 19:48
@VickiB Is there anything I can clarify? You seem to have two answers to your question. Have you thought about accepting one? — Demetri Pananos, Sep 29 '19 at 00:11

Him · Answer 2 · 2019-09-27T01:50:57.993

3

The average of your sample of 1338 records is a single value, $\bar x$. That $\bar x$ is, itself, a member of a distribution of averages of all possible simple random samples of 1338 records. You don't have access to any other members of this distribution, so you can't plot it. You can't resample it. It's out there, somewhere, in the world, waiting for someone else to gather exactly 1338 records and take an average. You did that thing, and obtained a single observation from that distribution. Your $\bar x$.

Even though we only have one observation from that distribution (of all possible $\bar x$ from samples size 1338), with some minimal assumptions about the distribution of $x$ (individual records), we can figure out a whole lot from the single observation of $\bar x$ that we have.

Specifically, in relation to your question, we can compute the probability of having obtained a sample with that mean, (your actual, measured $\bar x$), if we hypothesize a potential mean for the whole population mean $\mu$.

Fortunately for you, someone else has made just such a hypothesis. Specifically, the administrator is concerned that the actual average charge has fallen below 12000. In other words, the administrator has said: "I hypothesize that the actual average is perhaps below 12000! We should gather evidence to test that hypothesis. SCIENCE!"

This means that we can compute the probability of obtaining something as large as our measurement of $\bar x$ or larger in the event that the actual average were less than 12000. If this probability is low, then our $\bar x$ provides some evidence that the actual average is not, in fact, less than 12000 (because, if it were, then how did we obtain this totally improbable $\bar x$?)

That probability is $p$ from your t-test. Note that your sample distribution doesn't factor into it. The central limit theorem comes into play when one wants to prove that the t-test works for situations like this. If you're not writing a proof, then you don't really need to use it.

If you are curious as to just how close to a normal distribution the distribution of the sample mean is:

This is a beta distribution with $\alpha=\beta=0.1$

It's pretty not normal. It's about the opposite of normal. But let's suppose that our population is distributed this way.

If we plot the distributions of the sum of a simple random sample of $n$ observations from our population distribution (not to scale):

Even for a sample as small as 16, the results are looking pretty normal-ish. Of course, we're usually trying to approximate the tails, so the shape of the middle there might be a little misleading, but this is converging really fast.
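If you want to reproduce something like this without the plots, here is a small simulation sketch. It assumes excess kurtosis as a crude one-number measure of "how normal" (it is 0 for a normal distribution); the number of simulated samples and the set of $n$ values are arbitrary choices of mine.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_draws = 50_000  # number of simulated samples per value of n

excess_kurtosis_by_n = {}
for n in (1, 2, 4, 16):
    # Each row is one simple random sample of n observations from Beta(0.1, 0.1).
    samples = rng.beta(0.1, 0.1, size=(n_draws, n))
    sums = samples.sum(axis=1)
    # Excess kurtosis of the simulated sums; 0 would match a normal distribution.
    excess_kurtosis_by_n[n] = stats.kurtosis(sums)

for n, value in excess_kurtosis_by_n.items():
    print(f"n = {n:2d}: excess kurtosis of the sum = {value:.3f}")
```

The excess kurtosis starts far from 0 for a single observation and should shrink toward 0 roughly in proportion to $1/n$. As noted above, this says more about the overall shape than about the tails.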

edited Sep 27 '19 at 01:50

answered Sep 26 '19 at 23:29

Him


"When the population is non-normal in distribution: the t test should be valid if achieving a sufficient sample size." This is a quote from the question that you linked to. Your sample is quite large, and so, interestingly, the distribution of all possible $\bar x$ is pretty close to being normal, *no matter what the distribution of $x$ is*. That's right. *No matter what*. This is a pretty remarkable result, and this is what Central Limit Theorem(s) prove. That distributions of sample statistics converge to very nice distributions as $n \rightarrow \infty$. – Him Sep 26 '19 at 23:35
tiny, tiny addendum "_No matter what if_ the variance of the distribution is less than infinity." there are obscure exceptions like the Cauchy distribution--but yeah it'll never come up for any real application. – Huy Pham Sep 26 '19 at 23:54
Your answer is either imprecise or misinterpreting the p-value in 2 ways. First the p-value is not just about the single observation, but all those that are also more extreme (hence, the probability of a statistic at least as extreme as the observed one, assuming the null is true). Next, you've misinterpreted the alternative hypothesis as the null. The former is mu <12000 and the null is mu >=12000 (often shown mu =12000). – LSC Sep 27 '19 at 00:13
@HuyPham. Sure. I'd bet you could figure out a distribution so that the sample mean at any given $n$ is however [far away from normal](https://en.wikipedia.org/wiki/Statistical_distance). So for 1338, specifically, there's probably a population distribution whose sample mean distro is, say, uniform. – Him Sep 27 '19 at 00:14
@LSC it's for a single observation *of a sample mean*. – Him Sep 27 '19 at 00:15
[Cauchy Distribution](https://en.wikipedia.org/wiki/Cauchy_distribution). – Him Sep 27 '19 at 00:15
@Scott, you still can't calculate a probability for a continuous variable taking on a single value (it's zero to be equal to exactly 1 value). The other issue is that you explicitly linked a p-value to an incorrect interpretation. This is definitional regarding a p-value; it is not about "probability of obtaining a single mean" but rather this observation and all those more extreme. https://link.springer.com/article/10.1007/s10654-016-0149-3#Sec6 (point 9 is very specific to your misinterpretation, and the intro provides an adequate p value definition). – LSC Sep 27 '19 at 00:20
@LSC agreed. Will edit. – Him Sep 27 '19 at 00:28
@LSC which hypothesis is the null depends on who in the story is on who's side, I suppose. – Him Sep 27 '19 at 00:32
@Scott, in Frequentist statistical theory, it doesn't work that way. The null must be framed congruently with how the p-value is calculated. Equality is always in the null. – LSC Sep 28 '19 at 01:21
"which hypothesis is the null depends on who in the story is on who's side, I suppose." --- Aieeeeee. – Vicki B Sep 28 '19 at 19:45
If I understand correctly (and I don't promise that I do), the null hypothesis can be simplified to "no, you're wrong, nothing changes, nothing is different, nothing to see here, please move along". ??? – Vicki B Sep 28 '19 at 19:47
