
I am trying to run some t-tests to compare census data between two groups.

I have a few niggling questions and am looking for advice.

Let's say I want to find significant differences between two types of shoppers: those who use product XYZ, and those who don't. There are a total of 50,000 customers, and only 140 of them use product XYZ.

Here's a sample of what my data looks like (there are about 90 more census data columns not shown here):

       customer_age  Electric_Heat  Asian  EU_National  product_XYZ_user
26393          41.0          15.07  10.67         2.81               0
39621          43.0           1.28   0.00         6.05               0
47382          49.0           1.15   0.00         2.79               0
48356          25.0           0.96   0.00         2.46               0
21870          53.0           0.00   0.00         3.37               0
23977          19.0           0.00   0.00         6.29               0
44377          25.0          13.51   3.49         3.49               0
8800           82.0           0.00   0.00         3.12               0
2937           47.0           4.00   7.91         9.35               1
17972          53.0           5.26   0.35         8.51               0

The goal is to get information that basically says "We have found that people who use XYZ Product tend to be young, urban, educated to college level and born in Europe".

I used Python to run the (Welch's) t-tests, and the procedure basically goes like this:

for each census feature (e.g. EU_National):

    compare the feature mean between the two groups with a t-test

    record the p-value
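A minimal sketch of that loop, assuming the data sits in a pandas DataFrame `df` shaped like the sample above with a 0/1 `product_XYZ_user` column (column and function names here are illustrative, not the asker's actual code):

```python
import pandas as pd
from scipy.stats import ttest_ind

def welch_tests(df, group_col="product_XYZ_user"):
    """Run a Welch's t-test per feature, comparing users vs. non-users."""
    users = df[df[group_col] == 1]
    non_users = df[df[group_col] == 0]
    results = []
    for feature in df.columns.drop(group_col):
        # equal_var=False makes this Welch's t-test (unequal variances)
        stat, p = ttest_ind(users[feature], non_users[feature],
                            equal_var=False)
        results.append({"feature": feature, "t": stat, "p": p})
    # sort so the most "significant" differences appear first
    return pd.DataFrame(results).sort_values("p").reset_index(drop=True)
```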

After doing this I can see where the significant differences are (p-value < 0.05 or < 0.01) and say that census feature X seems to contribute to whether a customer prefers product XYZ or not. Sorting by lowest p-value shows me where the most significant differences lie. Am I going in the right direction?

The "problem" is that I am getting highly significant differences for a lot of features. That may or may not be a good sign. It just seems too easy, so I thought I'd ask here to double-check.

Here is a sample of my results. The feature column holds the census data. I also tried adjusting for inflated familywise error rates (Holm–Bonferroni method), but it makes no difference: the results are still highly significant.
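For reference, the Holm–Bonferroni step-down adjustment can be applied in a single call with statsmodels; the p-values below are hypothetical placeholders, not the asker's actual results:

```python
from statsmodels.stats.multitest import multipletests

# hypothetical raw p-values from the per-feature t-tests
pvals = [0.0001, 0.003, 0.04, 0.2, 0.6]

# reject[i] is True where the null is rejected at alpha after Holm correction;
# p_adj holds the Holm-adjusted p-values
reject, p_adj, _, _ = multipletests(pvals, alpha=0.05, method="holm")
```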

What am I doing wrong?

By the way - these findings match what I already thought about this group, so it's not as if the results make no sense.

[image: table of t-test results by feature]

  • Very large samples with even quite modest effect sizes will give highly significant results (very low p-values). This is discussed in a number of posts already on site. I'll see if I can locate one. – Glen_b Feb 13 '20 at 11:30
  • It sounds like your predictor variables should be mostly categorical. Isn't "Peat_Heat" a yes-or-no variable? If so, wouldn't an appropriate test of association be something like a chi-square test of association rather than a t-test? – Sal Mangiafico Feb 13 '20 at 11:38
  • On the low p-value issue, see 1. https://stats.stackexchange.com/questions/2516/are-large-data-sets-inappropriate-for-hypothesis-testing 2. https://stats.stackexchange.com/questions/44465/how-to-correct-for-small-p-value-due-to-very-large-sample-size 3. https://stats.stackexchange.com/questions/125750/sample-size-too-large – Glen_b Feb 13 '20 at 11:43
  • @SalMangiafico Peat_Heat is the percentage of households in the area using peat to heat their homes. These are small-area statistics. – SCool Feb 13 '20 at 12:13
  • Okay. That makes sense. If all predictor variables are on a scale of 0–100%, you could look at the difference in percent, which may be helpful in interpretation. A more thoughtful effect size statistic is Cohen's *d*, which is essentially the difference in means divided by the standard deviation. This may be helpful in determining which features matter more. – Sal Mangiafico Feb 13 '20 at 12:29
  • @SalMangiafico Yes, they are all percentages. However, many features in the census data do not go up to 100%. For example, the percentage of houses with no heating has a maximum of 25% across all small areas in the entire country. Could this affect results? Perhaps I should normalize everything first? Also, you mention Cohen's *d* — why is this more thoughtful? I'll gladly try it out. You say it's the "difference in means divided by the standard deviation". Standard deviation of what? The feature including both groups, or the feature for the whole country? – SCool Feb 13 '20 at 13:59
  • No, I think it makes sense to look at the percents as reported. I mean, for wood heat, you are comparing 1.7% wood heat to 0.5%. So it's only a difference of about 1%, and both values are really low. Are areas with 1.7% wood heat meaningfully different from areas with 0.5% wood heat for you? (The answer is up to you.) – Sal Mangiafico Feb 13 '20 at 14:08
  • With Cohen's *d*, it's the pooled standard deviation of the two samples you are comparing. It's best to look up Cohen's *d* or pooled sd (assuming unequal sd of two groups). I think this statistic will be helpful to you. If the difference in means is relatively low compared to the sd in the samples, the difference in means is hardly noticeable compared with all the noise. But for a large Cohen's *d*, the difference is really noticeable. Cohen provided some interpretations of this statistic ("small", "medium", "large"), which are arbitrary, but people sometimes use them as goal posts. – Sal Mangiafico Feb 13 '20 at 14:14
  • Thanks for the help, I'll check it out. – SCool Feb 13 '20 at 14:25
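The pooled-SD Cohen's *d* described in the comments can be sketched as follows (a minimal illustration with made-up inputs, not the asker's data):

```python
import numpy as np

def cohens_d(x, y):
    """Cohen's d: difference in means divided by the pooled standard deviation."""
    nx, ny = len(x), len(y)
    # pooled SD weights each group's sample variance (ddof=1) by its
    # degrees of freedom; this handles unequal group sizes
    pooled_sd = np.sqrt(((nx - 1) * np.var(x, ddof=1) +
                         (ny - 1) * np.var(y, ddof=1)) / (nx + ny - 2))
    return (np.mean(x) - np.mean(y)) / pooled_sd
```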

0 Answers