I'm currently looking into the gender pay gap using data from Glassdoor (found via Kaggle). The dataset has columns for gender, age, performance evaluations, seniority, pay, etc.
For context: I have learned a lot of Data Science/Machine learning/programming over the past few years, and am just doing a few of my own basic portfolio projects for practice, before applying for jobs.
I have done a fairly naive t-test comparing average pay for men vs. average pay for women. I am now looking to add controls, comparing similar age groups, seniority levels, education levels, etc. I want to run more t-tests, as well as a chi-squared test and/or ANOVA.
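For reference, the naive test I ran looks roughly like this. The data below is simulated and the group sizes, means, and spreads are stand-ins I made up, not values from the Glassdoor dataset:

```python
# Minimal sketch of the naive two-group comparison on simulated data.
# The real dataset's columns (gender, pay) are only assumed here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Simulated stand-ins for the two pay samples
pay_men = rng.normal(loc=98_000, scale=25_000, size=500)
pay_women = rng.normal(loc=92_000, scale=25_000, size=480)

# Welch's t-test (equal_var=False) avoids assuming equal variances
t_stat, p_value = stats.ttest_ind(pay_men, pay_women, equal_var=False)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

Once I add controls, the idea would be to run the same test within each subgroup (e.g. per age band) rather than on the pooled samples.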
As I run multiple A/B tests, I want to avoid p-hacking. I have a few hypotheses; for example, I expect the pay gap to be greater for older age groups. But this is mostly exploring the data: I don't have a single hypothesis I am looking to prove for the entire study, nor do I have a political agenda.
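One thing I was considering, to guard against the multiple-testing problem, is applying a correction such as Holm-Bonferroni across all the subgroup tests. A sketch of that step-down procedure (the p-values below are hypothetical, standing in for results from several subgroup t-tests):

```python
# Pure-Python Holm-Bonferroni step-down correction.
def holm_correction(p_values, alpha=0.05):
    """Return a list of booleans: reject H0 for each test, in input order."""
    m = len(p_values)
    # Sort p-values ascending, remembering their original positions
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # Compare the k-th smallest p-value against alpha / (m - k)
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # step-down: stop at the first non-rejection
    return reject

# Hypothetical p-values from, say, five age-group comparisons
pvals = [0.001, 0.04, 0.03, 0.20, 0.008]
print(holm_correction(pvals))  # → [True, False, False, False, True]
```

Note how 0.03 and 0.04 would each pass an uncorrected 0.05 threshold but are no longer significant once the five comparisons are accounted for.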
I'm not sure it would really count as p-hacking as long as I choose the comparisons up front and report everything. I would think it's only p-hacking if I selectively reported the t-test results that support a hypothesis. Is this fair?
And another question (forgetting my data for a moment): since ANOVA compares multiple groups at once to look for significance, is that not itself p-hacking?