Problem: There are two groups of customers, group A and group B. Group A has been targeted by a marketing and e-mail campaign, while group B has not been exposed to anything. By looking at these customers' spending over the last 12 months (or for as long as the experiment has been running), I want to know whether the average spending of customers in A and B differs as a result of the marketing.

The distribution of spending for both groups looks like this:

[Image: distribution of spending for groups A and B]

This is expected, since many customers do not buy anything within the period we look at, so spending is not normally distributed. According to my co-worker, one could still run a two-sample t-test here, with the following motivation:

"In many cases one can do a t-test to compare two means from a non-normal population, since the two means being compared can, given a large enough sample size, always be assumed to be normally distributed by the CLT. The assumption of normality is made about the parameter being tested and its distribution rather than about the distribution of the population itself."

I feel there are some pitfalls here because of the large number of zeroes. Also, by the CLT, it seems as if z- and t-tests are the only tests ever needed, since everything apparently becomes normal given a sufficiently large sample size.

Is my co-worker right?

Parseval
  • What is your concern about the number of zeroes? // The central limit theorem does not apply to all distributions (it has assumptions), and it does not say at what point the distribution becomes "normal enough". Also, "normal enough" will be situation-dependent. // Make sure you know what the central limit theorem does and does not say. [There is a common wrong interpretation.](https://stats.stackexchange.com/questions/473455/debunking-wrong-clt-statement) // All that said, plenty of people would be comfortable running a t-test on your data. Depending on the situation, I might not be one. – Dave Nov 03 '21 at 15:48
  • Check this: https://stats.stackexchange.com/questions/187824/how-to-model-non-negative-zero-inflated-continuous-data – Amin Shn Nov 03 '21 at 15:51
  • Based on [the proposed duplicate](https://stats.stackexchange.com/q/479566/1352) and your humongous amount of data, I would use a t test without a second thought. Try bootstrapping your means - I would be very surprised if they were not "quite normal enough". As to why we would ever use something other than a t/z test: sometimes we simply don't have enough data (e.g., data may be expensive to acquire). – Stephan Kolassa Nov 03 '21 at 15:56
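
The bootstrap check suggested in the comments can be sketched roughly as follows. This is only an illustration on simulated data, not the data from the question: the group sizes, purchase probabilities and lognormal parameters are made-up assumptions, and `simulate_spending`, `bootstrap_mean_diffs`, `skewness` and `welch_t` are hypothetical helper names. The idea is to resample each group with replacement, look at how the difference in means is distributed, and compare its skewness with that of the raw spending.

```python
import random
import statistics


def simulate_spending(n, p_buy, mean_log, sd_log, rng):
    """Hypothetical zero-inflated spending: 0 with probability 1 - p_buy,
    otherwise a lognormal amount. All parameters are illustrative assumptions."""
    return [
        rng.lognormvariate(mean_log, sd_log) if rng.random() < p_buy else 0.0
        for _ in range(n)
    ]


def bootstrap_mean_diffs(a, b, n_boot, rng):
    """Resample each group with replacement and record mean(a*) - mean(b*)."""
    diffs = []
    for _ in range(n_boot):
        resample_a = rng.choices(a, k=len(a))
        resample_b = rng.choices(b, k=len(b))
        diffs.append(statistics.fmean(resample_a) - statistics.fmean(resample_b))
    return diffs


def skewness(xs):
    """Sample skewness (third standardized moment); near 0 for symmetric data."""
    m = statistics.fmean(xs)
    s = statistics.pstdev(xs)
    return sum(((x - m) / s) ** 3 for x in xs) / len(xs)


def welch_t(a, b):
    """Welch two-sample t statistic (unequal variances allowed)."""
    se = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    return (statistics.fmean(a) - statistics.fmean(b)) / se


rng = random.Random(0)  # fixed seed so the sketch is reproducible
group_a = simulate_spending(2000, 0.30, 3.0, 1.0, rng)  # assumed "campaign" group
group_b = simulate_spending(2000, 0.25, 3.0, 1.0, rng)  # assumed "control" group

n_boot = 1000
diffs = bootstrap_mean_diffs(group_a, group_b, n_boot, rng)

# Approximate 95% percentile interval for the difference in means.
sorted_diffs = sorted(diffs)
ci_low = sorted_diffs[int(0.025 * n_boot)]
ci_high = sorted_diffs[int(0.975 * n_boot) - 1]

print("share of zeros in A:", group_a.count(0.0) / len(group_a))
print("skewness of raw spending in A:", skewness(group_a))
print("skewness of bootstrapped mean differences:", skewness(diffs))
print("approx. 95% bootstrap interval:", (ci_low, ci_high))
print("Welch t statistic:", welch_t(group_a, group_b))
```

If the bootstrapped differences look roughly symmetric while the raw spending is heavily skewed, that supports treating the difference in means as approximately normal at this sample size. With much smaller groups or heavier tails the same check may well point the other way, which is the situation-dependence mentioned in the first comment.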
