How to correct for multiple comparison when performing separate logistic regression analyses?

Question

I did research on the effect of a certain score on a cognitive test (range 0-100) on performance on another test years later (ordinal categorical variable, low-mid-high). This was done over the course of six years for approximately 200 participants, i.e. they were measured yearly on the cognitive test.

In my confirmatory analysis, I wanted to see whether a participant's mean cognitive score (over the years) significantly predicts performance on the other test (low-mid-high). I also wanted to see, for every age cohort (12-13, 13-14, 14-15 y/o and so on), whether the cognitive measurement at that specific age significantly predicts future performance.

So every regression analysis has one IV and one DV. Due to supervisor demands, I had to perform one overall (mean) analysis plus six separate age cohort analyses, and I could not include age as a predictor in order to fit a single model.

FYI, all described analyses are confirmatory. I was told I have to correct for multiple comparisons because I performed seven separate regression analyses: one with each participant's mean cognitive score over the years, and one for each of the six age cohorts. My question is: can I just use Bonferroni, i.e. divide the threshold p-value by the number of regression analyses I ran? If so, should I do this only for the six age cohort analyses (threshold p = 0.05/6 ≈ 0.0083), or should I also count the overall regression on the mean of the yearly scores and divide by 7 (threshold p = 0.05/7 ≈ 0.0071)?

Is there another, better correction for this multiple testing? If so, how does it work? And if Bonferroni is actually okay, why would 6 or 7 be 'better'?
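
To make the options concrete, here is a minimal pure-Python sketch of three common adjustments: Bonferroni and Holm (both control the FWER) and Benjamini-Hochberg (controls the FDR). The seven p-values are made up purely for illustration and are not results from this study; the function names are hypothetical as well.

```python
def bonferroni(pvals):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]


def holm(pvals):
    """Holm step-down adjusted p-values (FWER control, never larger than Bonferroni)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * pvals[i])
        running_max = max(running_max, candidate)  # enforce monotonicity
        adjusted[i] = running_max
    return adjusted


def benjamini_hochberg(pvals):
    """Benjamini-Hochberg step-up adjusted p-values (FDR control)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m - 1, -1, -1):  # walk from the largest p downwards
        i = order[rank]
        candidate = min(1.0, pvals[i] * m / (rank + 1))
        running_min = min(running_min, candidate)  # enforce monotonicity
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values: index 0 = overall (mean) model, 1-6 = age cohorts.
example_pvals = [0.004, 0.012, 0.030, 0.041, 0.20, 0.008, 0.60]
alpha = 0.05

for name, method in [("Bonferroni", bonferroni),
                     ("Holm", holm),
                     ("Benjamini-Hochberg", benjamini_hochberg)]:
    adj = method(example_pvals)
    rejected = [i for i, p in enumerate(adj) if p <= alpha]
    print(name, [round(p, 4) for p in adj], "rejected:", rejected)
```

Comparing an adjusted p-value with 0.05 is equivalent to comparing the raw p-value with the corrected threshold. Whether to pass six or seven p-values depends on which set of tests is treated as one family.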

Is there a reason age is discretized? You can reduce the number of parameters and (if age is a variable whose significance matters) reduce the number of comparisons considerably if you model age as a continuous variable with a linear effect. If a linear effect is not reasonable, you could still try a sensible transformation to save degrees of freedom. — Frans Rodenburg, Jul 08 '19 at 13:03
Yes: I am doing this research for a company that wants to know from what age on the cognitive score might be predictive of future performance. Because of this, I discretized age. — Pannie, Jul 08 '19 at 13:09
The question really is, what type of correction is best? An FWER or FDR correction? So: Bonferroni / Holm or Benjamini-Hochberg? It should be noted that investments are at play to some degree, so if someone is labelled as 'high performance' based on the model while he is not, that is quite a waste. On the other hand, missing out on too many 'high performance' individuals because the model is too strict is also bad for business, and maybe even worse. What correction is best here, in your opinion? — Pannie, Jul 08 '19 at 13:13
A spline would seem more appropriate than discretizing, which is [rarely a good idea](https://stats.stackexchange.com/a/68839/176202). Reducing the number of tests reduces the multiplicity problem, so it is desirable regardless of the type of correction. FWER and FDR are different things; which one you want to control depends on which assumption makes more sense: all nulls are true (FWER), or, if a single test is significant after correcting, then surely not all nulls are true (FDR). — Frans Rodenburg, Jul 08 '19 at 13:16
"*due to supervisor demands I really had to perform one overall (mean) + six separate age cohort regression analyses*" Even if your subsets share only the error structure, a single model is still more efficient. Perhaps suggest to your supervisor that you include an [interaction term](https://en.wikipedia.org/wiki/Interaction_(statistics)) instead of running separate models. — Frans Rodenburg, Jul 08 '19 at 13:28
Do you mean: PerformanceLevel ~ CognitiveScore * Age? And if so, age as a discretized variable or as a continuous one? — Pannie, Jul 08 '19 at 13:35
Yes, that's what I mean. `Age` as a continuous variable is likely a better option for the reasons listed above (see the linked question about binning/discretizing explanatory variables in my earlier comment). — Frans Rodenburg, Jul 08 '19 at 13:37
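
For concreteness, here is one way the formula `PerformanceLevel ~ CognitiveScore * Age` could be written out. This assumes a proportional-odds (cumulative logit) model with continuous age, which the thread does not specify; the symbols $\alpha_k$ and $\beta_1, \beta_2, \beta_3$ are introduced here for illustration only.

```latex
\operatorname{logit} P(Y_i \le k)
  = \alpha_k - \left( \beta_1\,\text{Score}_i + \beta_2\,\text{Age}_i
  + \beta_3\,\text{Score}_i \times \text{Age}_i \right),
  \qquad k \in \{\text{low}, \text{mid}\}
```

Under that assumption, the question of whether the score's predictive value changes with age comes down to the single interaction coefficient $\beta_3$, rather than six separate cohort tests.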
