
The analysis I am dealing with consists in determining whether some gene is differentially expressed between two groups of people. To this end, 20 people from each group were sampled and the expression of the gene was measured on each person. Let's assume that the expression follows a normal distribution within each group.

  • Analysis 1: t-test with gene expression as response;
  • Analysis 2: Logistic regression with group as binary response and gene expression as explanatory variable.

Are both analyses valid? Is one of them more appropriate?

kjetil b halvorsen
user7064
  • The key question is what is given and what are you trying to explain or predict. Here it seems that the division between groups is given, so Analysis 2 makes limited sense. But it doesn't follow that a t test is best. A t test focuses on difference between means; the ideal condition is that each distribution is normal, and if that is far from true, then some other analysis will be (much) better. – Nick Cox Jun 25 '21 at 08:01
  • Thank you for your comment. For the sake of discussion, let's assume that the gene expression is normally distributed within each group. I have made an edit that reflects that hypothesis. – user7064 Jun 25 '21 at 08:20
  • @NickCox: can you expand a little bit on why the fact that the division between groups is given makes Analysis 2 of limited sense? – user7064 Jun 25 '21 at 08:29
  • The point is negative and best answered yourself. In what sense are you trying to predict group from gene expression rather than gene expression from group? – Nick Cox Jun 25 '21 at 08:31
  • Well... that's exactly why I am requesting some help... – user7064 Jun 25 '21 at 08:37
  • @Lewian's answer and comments seem to echo my thinking. – Nick Cox Jun 25 '21 at 10:04
  • [This is a related post](https://stats.stackexchange.com/questions/190156/t-tests-manova-or-logistic-regression-how-to-compare-two-groups) with an interesting answer, different from the answer by @Lewian below. – kjetil b halvorsen Jun 25 '21 at 17:08
  • @kjetilbhalvorsen: The difference in that post is that group sizes there are not equal and don't seem to be fixed in advance (details about the experimental design are not given); so my objection does not directly apply there. – Christian Hennig Jun 26 '21 at 09:17

2 Answers


If binary $X$ can predict continuous $Y$, continuous $Y$ can predict binary $X$. So it is OK to reverse the problem and use logistic regression. The main disadvantages are (1) the power of this approach relative to the $t$-test needs to be further explored and (2) interpretation of parameter estimates is harder. But on the plus side you can turn a multivariate response problem into a univariate problem using this reversal.
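A minimal sketch of the two analyses on simulated data, to make the reversal concrete. Everything specific here is an assumption for illustration and not from the post: the group means (1 and 0), the unit variance, the seed, and the helper names `pooled_t` and `logistic_slope_z`. It uses only the standard library to stay self-contained; in practice one would reach for a statistics package.

```python
import math
import random
import statistics

random.seed(0)  # arbitrary seed, for reproducibility only

# Hypothetical data: 20 people per group, normal expression, unit variance.
group_a = [random.gauss(1.0, 1.0) for _ in range(20)]  # assumed mean 1
group_b = [random.gauss(0.0, 1.0) for _ in range(20)]  # assumed mean 0


def pooled_t(a, b):
    """Two-sample t statistic with pooled variance (equal-variance t-test)."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * statistics.variance(a)
           + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))


def logistic_slope_z(y, x, iters=25):
    """Fit logit P(x = 1 | y) = b0 + b1 * y by Newton-Raphson.

    Returns the slope estimate b1 and its Wald z statistic.
    """
    b0 = b1 = 0.0
    for _ in range(iters):
        p = [1.0 / (1.0 + math.exp(-(b0 + b1 * yi))) for yi in y]
        w = [pi * (1.0 - pi) for pi in p]
        # Score vector (gradient of the log-likelihood).
        g0 = sum(xi - pi for xi, pi in zip(x, p))
        g1 = sum((xi - pi) * yi for xi, pi, yi in zip(x, p, y))
        # Entries of the 2x2 information matrix and its determinant.
        i00 = sum(w)
        i01 = sum(wi * yi for wi, yi in zip(w, y))
        i11 = sum(wi * yi * yi for wi, yi in zip(w, y))
        det = i00 * i11 - i01 * i01
        # Newton step: inverse information times score.
        b0 += (i11 * g0 - i01 * g1) / det
        b1 += (i00 * g1 - i01 * g0) / det
    # Information from the last iteration; at convergence it matches the MLE's.
    return b1, b1 / math.sqrt(i00 / det)


# Analysis 1: expression as response, group as fixed factor.
t_stat = pooled_t(group_a, group_b)

# Analysis 2: group as binary response, expression as predictor.
expression = group_a + group_b
is_group_a = [1] * len(group_a) + [0] * len(group_b)
slope, wald_z = logistic_slope_z(expression, is_group_a)

print(f"t statistic: {t_stat:.2f}")
print(f"logistic slope: {slope:.2f}, Wald z: {wald_z:.2f}")
```

The two statistics share their sign with the difference in sample means, so they agree on direction. Whether they agree on strength of evidence is the power question in point (1), which repeating the simulation many times would probe.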

This approach was described by O'Brien here. Another advantage of the approach is that it extends past comparison of mean $Y$. If you expand (now) predictor $Y$ into a quadratic polynomial, you are able to detect differences between the $X$ groups in both the mean and the variance of $Y$. Adding a cubic term would allow the skewness to vary.

There is a relation to discriminant analysis. When assumptions of linear discriminant analysis hold, a reversed binary logistic regression has to fit. When using quadratic discriminant analysis, a quadratic logistic regression works.
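The link can be made explicit with a short derivation. This is a sketch under stated assumptions, with notation introduced here for illustration: $X \in \{0, 1\}$ is the group, $Y$ the expression, $\pi_k = P(X = k)$, and $Y \mid X = k \sim N(\mu_k, \sigma_k^2)$. With a common variance $\sigma_0 = \sigma_1 = \sigma$ (the linear discriminant setting), Bayes' rule gives

$$
\log \frac{P(X = 1 \mid Y = y)}{P(X = 0 \mid Y = y)}
= \log \frac{\pi_1}{\pi_0} - \frac{\mu_1^2 - \mu_0^2}{2 \sigma^2} + \frac{\mu_1 - \mu_0}{\sigma^2} \, y ,
$$

which is linear in $y$, so a logistic model with $Y$ as the only predictor holds exactly, and its slope is zero if and only if $\mu_1 = \mu_0$. With unequal variances (the quadratic discriminant setting) the same calculation gives

$$
\log \frac{P(X = 1 \mid Y = y)}{P(X = 0 \mid Y = y)}
= \log \frac{\pi_1 \sigma_0}{\pi_0 \sigma_1} + \frac{\mu_0^2}{2 \sigma_0^2} - \frac{\mu_1^2}{2 \sigma_1^2}
+ \left( \frac{\mu_1}{\sigma_1^2} - \frac{\mu_0}{\sigma_0^2} \right) y
+ \left( \frac{1}{2 \sigma_0^2} - \frac{1}{2 \sigma_1^2} \right) y^2 ,
$$

so the coefficient of $y^2$ is nonzero exactly when the variances differ.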

Frank Harrell
  • What's your take on my observation that if they fix the group sizes to be equal, the binary responses in the reverted problem cannot be independent? Having observed "group 1" 19 times and "group 2" 20 times, we *know* the next observation is "group 1". – Christian Hennig Jun 27 '21 at 23:48
  • Thank you @FrankHarrell. I'll wait for your answer to the above comment and I'll accept your answer. Best – user7064 Jun 28 '21 at 06:07
  • Yes that would create very mild, probably ignorable, dependence. The analysis is not informed of this information. – Frank Harrell Jun 28 '21 at 18:55
  • @FrankHarrell: How can you be sure it's very mild? If group sizes are fixed equal and one group is in fact much more likely, how can you be sure that the probability for that group is not underestimated substantially? – Christian Hennig Jun 29 '21 at 10:48
  • The dependence you described would be important if the total number of observations is < 8 for example. As $N$ increases the dependence becomes unnoticeable so I'm not clear why you are worrying about that from all the things you could worry about. – Frank Harrell Jun 29 '21 at 12:50
  • @FrankHarrell: But the logistic regression predicts the probability for the occurrence of a certain group, given the covariates. Now if one group is hugely overrepresented in the sample by sampling design, isn't there a possibility that this leads to overestimation of its probabilities? – Christian Hennig Jul 01 '21 at 19:09
  • That affects the intercept only. – Frank Harrell Jul 05 '21 at 11:22

The t-test will treat the group memberships as fixed and the gene expression as random. The logistic regression treats group membership as a random variable to be "explained" from the gene expression. But if I understand the design correctly, you have chosen 20 people from each group based on known group membership, so the group should not be treated as a random outcome. Therefore the logistic regression seems inappropriate.

Responding to some comments: despite my objection to logistic regression for such data, it seems to be applied in this way in a (maybe not small) number of case-control studies. I'd insist that this is problematic in a situation like the one given here, where the number of observations in the two groups (i.e., the number of regression outcomes taking a certain value) is fixed in advance: even if such outcomes are treated as random (which is already questionable but may not cause problems with the results), they can't be independent. I don't know the literature well enough to say whether this is discussed somewhere; it could be seen as acceptable if somebody has shown that the potential bias introduced in this way is negligible (maybe under some conditions). I certainly accept that logistic regression does something roughly in line with what is required in such case-control studies, and will therefore likely produce results that point in the right direction (if there is a true and clear enough "right direction").
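One part of the sampling-design concern can be written down explicitly. This is a sketch under assumptions introduced here for illustration: a person from group $k$ enters the sample with probability $s_k$ that depends on the group only, and in the population $\operatorname{logit} P(X = 1 \mid Y = y) = \beta_0 + \beta_1 y$, with $X$ the group and $Y$ the expression. Writing $S = 1$ for "sampled", Bayes' rule gives

$$
\frac{P(X = 1 \mid Y = y, S = 1)}{P(X = 0 \mid Y = y, S = 1)}
= \frac{s_1 \, P(X = 1 \mid Y = y)}{s_0 \, P(X = 0 \mid Y = y)} ,
\qquad \text{so} \qquad
\operatorname{logit} P(X = 1 \mid Y = y, S = 1) = \left( \beta_0 + \log \frac{s_1}{s_0} \right) + \beta_1 y .
$$

Under these assumptions, choosing how many people to take from each group shifts the intercept and leaves the slope unchanged. It says nothing about the dependence between outcomes that fixed group sizes induce, which is a separate point.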

Christian Hennig
  • Thank you. Let's assume that the gene is up-regulated in group A compared to group B. Then, shouldn't we expect logistic regression able to detect that the gene expression is an important predictor? – user7064 Jun 25 '21 at 09:09
  • Yes, I'd expect that, because what logistic regression does is roughly related to doing something correct, but for the reason given I still think that it isn't appropriate. – Christian Hennig Jun 25 '21 at 09:13
  • If you want to predict group from gene expression, you should maybe select people at random without fixing the groups rather than choosing 20 of each group. – Christian Hennig Jun 25 '21 at 09:15
  • So if I understand correctly, logistic regression would provide the correct answer to the question but it is not a valid way to obtain it. Have I put it right? – user7064 Jun 25 '21 at 12:30
  • @Lewian Isn't the odds ratio also estimated correctly in case-control studies, which this kind of is? In that sense, does logistic regression - possibly for matched pairs - not actually answer the question? – Björn Jun 25 '21 at 12:35
  • @Björn: I'm not actually an expert in experimental design, and it will depend on how precisely the study is run. The way I understand the question, I don't think one can justify the group as random in this case. – Christian Hennig Jun 25 '21 at 13:22
  • Regarding "logistic regression would provide the correct answer" - difficult to comment on because it depends on how exactly you interpret the result of the logistic regression. The logistic regression may not address your question precisely, so interpreting it as "providing the correct answer" may be something of a misinterpretation, despite the result, let's say, "pointing in the right direction". Also it may have worse power, i.e., the probability to "point in the right direction" may be lower than for the t-test, although both probabilities will be high if the true effect is strong. – Christian Hennig Jun 25 '21 at 13:23
  • But in case-control studies the outcome is not random, in your sense, but still logistic regression is used? – kjetil b halvorsen Jun 25 '21 at 16:41
  • It seems people do that, fair enough. It seems problematic to me anyway. If you fix the number of cases and controls, your outcomes become technically dependent, don't they? Maybe a literature search would show that this is addressed somewhere, with some discussion of how much of a problem it is (hopefully not that much, for those who do it...). – Christian Hennig Jun 25 '21 at 21:55
  • @kjetilbhalvorsen, Björn: I have added something to my response addressing this. – Christian Hennig Jun 26 '21 at 09:14