
I'm looking at the risk of seizure in patients with metastasizing brain cancer, and so far I have several variables that I want to test against my dependent variable, seizure (yes/no).

These are variables such as age, sex, tumor size and much more.

Now, if I understand correctly, I first run a univariate regression, and the result tells me whether there is a statistically significant relationship between one variable and my dependent variable.

However, I see in similar studies that they also do multivariate regression, but do not specify how, exactly.

Can someone help me understand why I would do a multivariate regression in such a project?

Thank you.

Paze
  • Just a note on terminology: *multivariate* regression means you have multiple dependent variables, which can get complicated. *Multiple* regression means you have multiple IVs, and it's quite straightforward, as the answer below shows. Also, using *multiple* IVs in the same model gives you an estimate for each IV *controlling* for the other ones, which is very beneficial. – Huy Pham Dec 02 '18 at 17:16

1 Answer


For the sake of simplicity, let's say your independent variables consist of age, sex and tumour size only.

When you fit a univariate binary logistic regression model relating your dependent variable (seizure, yes or no) to the independent variable age, the model is ultimately enabling you to answer this question:

How does age affect the probability of seizure in the target patient population (i.e., for all patients in the population, regardless of their sex and tumour size)?

When you fit a univariate binary logistic regression model relating your dependent variable (seizure, yes or no) to the independent variable sex, the model is ultimately enabling you to answer this question:

How does sex affect the probability of seizure in the target patient population (i.e., for all patients in the population, regardless of their age and tumour size)?

When you fit a univariate binary logistic regression model relating your dependent variable (seizure, yes or no) to the independent variable tumour size, the model is ultimately enabling you to answer this question:

How does tumour size affect the probability of seizure in the target patient population (i.e., for all patients in the population, regardless of their age and sex)?

When you fit a multiple binary logistic regression model relating your dependent variable (seizure, yes or no) to the independent variables age, sex and tumour size, the model is ultimately enabling you to answer more pointed questions (assuming you only include main effects for these independent variables in your model):

  1. How does age affect the probability of seizure for patients in the target patient population having the same sex and the same tumour size?

  2. How does sex affect the probability of seizure for patients in the target population having the same age and the same tumour size?

  3. How does tumour size affect the probability of seizure for patients in the target population having the same age and the same sex?

Of course, if you include interactions between any of the independent variables in your multiple binary logistic regression model, that expands the list of questions you can ask.
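To make the contrast between the univariate and the multiple model concrete, here is a minimal sketch on simulated data. Everything in it is an assumption made for illustration: the data-generating process (tumour size grows with age, while seizure depends only on sex and tumour size), the coefficient values, and the small Newton-Raphson fitter `fit_logistic`, which stands in for whatever routine your statistical software provides.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Fit a binary logistic regression by Newton-Raphson.
    X must already contain an intercept column; returns log-odds coefficients."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))      # predicted probabilities
        w = p * (1.0 - p)                        # observation weights
        grad = X.T @ (y - p)                     # score vector
        hess = X.T @ (X * w[:, None])            # observed information
        beta = beta + np.linalg.solve(hess, grad)
    return beta

# Simulated (hypothetical) patients, not real data.
rng = np.random.default_rng(0)
n = 5000
age = rng.normal(60, 10, n)
age_c = age - age.mean()                          # centred age, in years
sex = rng.integers(0, 2, n).astype(float)         # 0/1 indicator
size = 2.0 + 0.05 * age_c + rng.normal(0, 0.5, n) # tumour size depends on age

# Assumed truth: seizure log-odds depend on sex and size, but NOT on age.
true_log_odds = -3.0 + 0.5 * sex + 1.0 * size
seizure = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_log_odds))).astype(float)

ones = np.ones(n)
b_uni = fit_logistic(np.column_stack([ones, age_c]), seizure)
b_multi = fit_logistic(np.column_stack([ones, age_c, sex, size]), seizure)

print("age coefficient, univariate model:", b_uni[1])
print("age coefficient, multiple model:  ", b_multi[1])
print("size coefficient, multiple model: ", b_multi[3])
```

In this simulated setup, the univariate age coefficient picks up the effect of tumour size, because older patients tend to have larger tumours. The multiple model compares patients of the same sex and tumour size, so its age coefficient should sit close to zero.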

Isabella Ghement
  • Thank you. That explains a lot. So when setting up the multivariate analysis in my statistical software (Stata), if I want to answer the question "How does age affect the probability of seizure for patients in the target patient population having the same sex and the same tumour size?", do I select seizure, sex and tumor size as the dependent variables, and age as the independent variable? – Paze Dec 02 '18 at 16:41
  • You're welcome! No, you select only seizure as the dependent variable and then include age, sex and tumour size together as the independent variables. The 3 questions I listed in my answer (1., 2. and 3.) can all be answered based on this one model! That's the beauty of multiple binary logistic regression! – Isabella Ghement Dec 02 '18 at 16:44
  • Note that, when you fit the model with seizure as the dependent variable and age, sex and tumour size as independent variables, the model results will be reported on the so-called log-odds scale. In other words, the model tells you how age affects the log-odds of seizure for patients in the target population having the same sex and the same tumour size, and likewise for the other variables. You can exponentiate the coefficients of your independent variables to move from the log-odds scale to the odds scale. It's also possible to move from the odds scale to the probability scale used in my answer. – Isabella Ghement Dec 02 '18 at 16:48
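The three scales mentioned in the comment above are linked by two simple transformations: odds = exp(log-odds) and probability = odds / (1 + odds). A small sketch follows; the intercept, the coefficients and the patient profile are hypothetical values chosen only for illustration, not estimates from any real model.

```python
import math

def log_odds_to_odds(log_odds):
    """Exponentiate to move from the log-odds scale to the odds scale."""
    return math.exp(log_odds)

def odds_to_probability(odds):
    """Convert odds to a probability between 0 and 1."""
    return odds / (1.0 + odds)

# Hypothetical fitted coefficients on the log-odds scale (assumed values).
intercept, b_age, b_sex, b_size = -4.0, 0.03, 0.4, 0.5

# Hypothetical patient: 60 years old, sex coded 1, tumour size 2.0 cm.
log_odds = intercept + b_age * 60 + b_sex * 1 + b_size * 2.0
odds = log_odds_to_odds(log_odds)
prob = odds_to_probability(odds)

print(f"log-odds = {log_odds:.3f}, odds = {odds:.3f}, probability = {prob:.3f}")
```

Note that a coefficient has a constant effect only on the log-odds scale. On the probability scale, the change associated with one extra year of age depends on the values of the other variables.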
  • Right. So to wrap it up: I'll first run a univariate regression to see whether these variables have an effect on their own, and if they do, I can include them in a multivariate regression and then see how they perform when considered together. If I may keep prodding, shouldn't I be very careful with these variables, as many of them may depend on one another, such as age and tumor size? One may hypothesize that older patients have had their tumors for longer or waited longer to get medical help, and therefore have larger tumors. Would multivariate regression eliminate this problem? – Paze Dec 02 '18 at 16:59
  • You would have to look at potential multicollinearity between predictors. – Isabella Ghement Dec 02 '18 at 17:01
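One common way to look at multicollinearity is the variance inflation factor (VIF): regress each predictor on the others and compute 1 / (1 - R²). The sketch below uses simulated data in which tumour size is assumed to depend on age. The helper `vif` and all data are hypothetical illustrations, not part of any particular package.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of X.
    X holds the predictors only (no intercept column)."""
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        # Regress predictor j on an intercept plus all other predictors.
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r_squared = 1.0 - resid.var() / target.var()
        factors.append(1.0 / (1.0 - r_squared))
    return np.array(factors)

# Simulated (hypothetical) predictors: size is correlated with age by design.
rng = np.random.default_rng(1)
n = 2000
age = rng.normal(60, 10, n)
sex = rng.integers(0, 2, n).astype(float)
size = 2.0 + 0.05 * (age - 60) + rng.normal(0, 0.5, n)

vifs = vif(np.column_stack([age, sex, size]))
for name, value in zip(["age", "sex", "size"], vifs):
    print(f"VIF({name}) = {value:.2f}")
```

A VIF near 1 means a predictor is nearly uncorrelated with the others. Values above roughly 5 to 10 are often treated as a warning sign, although that cutoff is only a rule of thumb.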
  • Independent variables can show no relationship with the dependent variable on their own but show one when included in the model alongside other independent variables. For this reason, some people use more liberal p-value thresholds (e.g., 0.20) when judging the significance of the independent variables in the univariate models. – Isabella Ghement Dec 02 '18 at 17:09
  • Okay, so what I've done so far is use a threshold of p < 0.2 in my univariate models to look for associations between my variables and seizures. I've run all my variables through a univariate model and found 7 variables that were significant at p < 0.2. I included all 7 as independent variables in a multivariate regression, with seizure as my dependent variable. 5 of these have p < 0.05. Can we now say that these 5 variables may be linked with increased (or decreased, if the coefficient is negative) risk of seizure? – Paze Dec 02 '18 at 17:18
  • Paze, Isabella provided you with some nice guidance, but there are many other factors you'll want to consider to ensure you understand what you are doing. Get a copy of "Applied Logistic Regression" by Hosmer, Lemeshow, and Sturdivant. You'll want to make sure your continuous variables are linear in the log-odds, and you'll want to understand why a variable can be statistically significant in a simple regression but insignificant in a multiple regression. The book does a great job of taking you from square 1 to proficiency in logistic regression. Statistics is very much garbage in garbage out. – ColorStatistics Dec 02 '18 at 18:04
  • And thank you very much Isabella for your invaluable help! – Paze Dec 02 '18 at 18:37
  • Great advice, @ColorStatistics! The only thing I don't agree with is "Statistics is very much garbage in garbage out". I would qualify that statement as follows: "When applied thoughtlessly, statistics can very much be garbage in, garbage out." – Isabella Ghement Dec 02 '18 at 18:54
  • @Paze: You are not modelling risk directly, you are modelling the log-odds of seizure. So if age, for example, has a significant positive coefficient b = 0.206 in your final model, you would say: Each additional year of age is associated with an (additive) increase of 0.206 units in the log-odds of seizure, controlling for all other independent variables in your model. If you were to exponentiate b, then you would say: For each additional year of age, the odds of seizure are estimated to increase by a multiplicative factor of exp(0.206) = 1.23 (all else being equal). – Isabella Ghement Dec 02 '18 at 19:07
  • The last statement in my previous comment would also be fine if re-expressed like this: Each additional year of age is associated with a 23% increase in the odds of seizure (all else being equal). This is because (1.23 - 1) x 100% = 23%. – Isabella Ghement Dec 02 '18 at 19:10
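The arithmetic in the two comments above can be checked in a few lines. The coefficient b = 0.206 is the example value from the comments. The ten-year extension is an added illustration and assumes the effect of age is linear on the log-odds scale.

```python
import math

b_age = 0.206                                # example log-odds coefficient per year of age

odds_ratio = math.exp(b_age)                 # multiplicative change in odds per year
percent_change = (odds_ratio - 1.0) * 100.0  # same effect expressed as a percentage

# Assumption: linearity in the log-odds, so a 10-year difference
# multiplies the odds by exp(10 * b), not by 10 * exp(b).
odds_ratio_10_years = math.exp(10 * b_age)

print(f"odds ratio per year:     {odds_ratio:.2f}")
print(f"percent change per year: {percent_change:.0f}%")
print(f"odds ratio per 10 years: {odds_ratio_10_years:.2f}")
```

Because the odds ratio compounds multiplicatively, effects over several units grow quickly, so it helps to state the unit of the predictor when reporting an odds ratio.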
  • See https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/1471-2288-12-14 for how you can estimate relative risks rather than odds ratios in a multiple binary logistic regression setting. – Isabella Ghement Dec 02 '18 at 19:26