
Consider this scenario: scientists hypothesize that a particular disease occurs when levels of a particular hormone are high. They gather data: 1000 people with the disease, 1000 without, and measure their hormone levels.

Assuming the data are normally distributed, the scientists can perform a t-test or a one-way ANOVA to test whether the difference in mean hormone levels between the disease and non-disease groups is significantly different from zero. (Since the hypothesis is directional, this would arguably be a one-sided t-test.) In R, the ANOVA model would be expressed as

model <- lm( HormoneLevel ~ Disease, ...)

where Disease is 0/1 according to disease|non-disease, and HormoneLevel is a continuous value (amount per litre or something).
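As a minimal sketch of this first direction, here is the model fit on simulated data (the data frame, group means, and seed are invented purely for illustration). With the equal-variance assumption, the p-value for the Disease coefficient from lm matches the two-sided pooled two-sample t-test:

```r
# Hypothetical simulated data; variable names mirror the question
set.seed(1)
d <- data.frame(
  Disease      = rep(c(0, 1), each = 1000),
  HormoneLevel = c(rnorm(1000, mean = 10), rnorm(1000, mean = 10.5))
)

model <- lm(HormoneLevel ~ Disease, data = d)
summary(model)  # t-value and p-value for the Disease term

# Equivalent two-sample t-test with pooled variance:
t.test(HormoneLevel ~ Disease, data = d, var.equal = TRUE)
```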

HOWEVER, can the test also be done in the other direction, with HormoneLevel as the independent variable?

model <- lm( Disease ~ HormoneLevel, ...)

Perhaps this makes more sense conceptually as a regression, since the scientists believe that the hormone level may be a cause of the disease.

So the question: is it valid and desirable to switch the dependent and independent variables in this way? If so, are there restrictions on when it can be done?

sportscan
    If you want to build a valid model for Disease ~ HormoneLevel, you might use logistic regression, such as glm(Disease ~ HormoneLevel, family = binomial(link = "logit")). – Sal Mangiafico Jul 09 '18 at 16:12
  • ok, yes. However my question is more about the validity and advantage/disadvantage (if valid) of flipping the independent and dependent variables. – sportscan Jul 12 '18 at 12:40
  • As you say, if you think that the hormone causes the disease or that the hormone can predict the disease, then the model Disease ~ Hormone makes sense. There's no reason to think that Hormone ~ Disease is the default or natural model to use. If there's no prediction or causation intended, correlation could be used instead. – Sal Mangiafico Jul 12 '18 at 12:57
  • Thank you. So if we use a logistic regression, we can get a prediction of Disease from Hormone, divide our data into training (,validation), test, and report some sort of accuracy. But is there a way to report a p-value or "significance", as is possible by formulating it in the other direction? – sportscan Jul 13 '18 at 09:23
  • Logistic regression in R gives the usual GLM output, including asymptotic p-values for individual variables and the null and residual deviance, from which an asymptotic [overall test](https://en.wikipedia.org/wiki/Likelihood-ratio_test#Distribution:_Wilks%E2%80%99_theorem) can be performed. This is covered in most statistical treatments of generalized linear models, and in several questions on site, e.g. https://stats.stackexchange.com/questions/108995/interpreting-residual-and-null-deviance-in-glm-r – Glen_b Jul 13 '18 at 10:45
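
The approach described in the comments can be sketched as follows (the simulated data frame, means, and seed are hypothetical, just to make the snippet self-contained):

```r
# Hypothetical simulated data; variable names mirror the question
set.seed(1)
d <- data.frame(
  Disease      = rep(c(0, 1), each = 1000),
  HormoneLevel = c(rnorm(1000, mean = 10), rnorm(1000, mean = 10.5))
)

fit <- glm(Disease ~ HormoneLevel, data = d, family = binomial(link = "logit"))
summary(fit)  # Wald z-statistics and p-values for each coefficient

# Overall likelihood-ratio test against the intercept-only null model,
# based on the null and residual deviance:
fit0 <- glm(Disease ~ 1, data = d, family = binomial)
anova(fit0, fit, test = "LRT")
```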

0 Answers