Disclaimer: I'm a programmer. I am not a statistician. My last statistics class was (ermumble) years ago.
I read t-distibution for sample mean from non normal population and while I'm sure you all know exactly what you're talking about, I don't, so I'll ask for my specific case.
This is part of a class. The data set is (apparently) from Kaggle. The data is "medical charges". There are 1338 records. Naive statistics on the set give us:
mean: 13270.42
std. dev.: 12110.01
median: 9382.03
Decidedly not a normal distribution.
The class exercise asks: "The administrator is concerned that the actual average charge has fallen below 12000. On the assumption that these data represent a random sample of charges, how would you justify that these data allow you to answer that question? What would be the most appropriate frequentist test to apply? What is the appropriate confidence interval in this case? A one-sided or two-sided interval? Calculate the critical value and the relevant 95% confidence interval for the mean and comment on whether the administrator should be concerned?"
I figured, Central Limit Theorem, resample the means and work toward a better distribution.
m = medical.charges.to_numpy()
seed(47)
sample_mean = []
# calculate 100 means sampled from the larger dataset
for n in range(100):
this_sample = np.random.choice(m, 50)
sample_mean.append(np.mean(this_sample))
mean_of_means = np.mean(sample_mean)
std_of_means = np.std(sample_mean, ddof=1)
print("mean", mean_of_means, "\nstd. dev.:", std_of_means)
I ended up with a resampled mean of 13326 and a resampled
std. dev.: 1476 and plugged them into a t-test.
However apparently, the exercise expects me to simply plug the initial mean and std. dev into the t-test. (I know this because the following exercise tells me what answer they expected me to get for this one.)
Can I just blithely push the "mean" and "std dev" of this extremely non-normal data set through a t-test? And if so... why? (given that the docs all say "for a normal distribution...")