I'm trying to build a linear mixed model for 5 outcome variables ...

  • Cholesterol 1,Cholesterol 2,Cholesterol 3,Cholesterol 4,Cholesterol 5

which will be melted into a single Cholesterol variable, since statsmodels does not support multivariate LMMs so far.

The independent variables are 38 specific pathogenetic features built from GenePy scores.

I have to correct for the following confounders: Age, Sex, Group, Alcohol, Smoking and Levodopa treatment. All of them might contribute to the Cholesterol outcome. Sex, Group and Levodopa treatment are binary categorical (0 or 1).

My question is: how do I properly set up the equation for my model and express it in statsmodels syntax?

My guess so far is: I treat the 38 specific pathogenetic features as fixed effects, and the confounders would be random effects. All categorical confounders are put into the "groups" option of the statsmodels syntax.

Based on the statsmodels syntax:

model = sm.MixedLM.from_formula("Cholesterol ~ pathogenetic_feature_1 + pathogenetic_feature_2 + ... + pathogenetic_feature_38", data, re_formula="~Age+Alcohol+Smoking", groups=data["Group,Sex,Levodopa"])

Is that correct or nonsense? I'm a rookie in this topic and apologize for my weak understanding of it. Thanks so much in advance !

Robert Long
  • How do you know those variables are confounders? Have you drawn a causal diagram? Are there backdoor paths? – Adrian Keister Aug 14 '20 at 15:58
  • What are the outcome variables? Cholesterol 1,Cholesterol 2,Cholesterol 3,Cholesterol 4,Cholesterol 5 ? – Robert Long Aug 14 '20 at 18:26
  • In particular, are Cholesterol 1 through Cholesterol 5 cholesterol levels determined at 5 different visits over time? And how many cases do you have? – EdM Aug 14 '20 at 19:07
  • They are considered confounders because they might influence the levels of certain cholesterol types in the human body. For example, massive alcohol consumption may contribute to higher HDL (a cholesterol type) levels in your body. In this model, I want to correct for that and investigate the role of the genetic features (GenePy scores). – Thomas Lordick Aug 16 '20 at 18:52
  • Cholesterol 1 to 5 are measured at the same time and considered as the outcome variables. – Thomas Lordick Aug 16 '20 at 18:53

1 Answer

Confounders can be controlled for by treating them as fixed or random. The usual considerations for treating variables as fixed or random apply (there are many questions and answers on our site on that topic).

The variables in your formula, Age, Alcohol and Smoking typically would be modelled as fixed, not random.

To be a confounder, a variable is generally a cause, or a proxy for a cause, of both the exposure and the outcome. Where you have multiple exposures, as you seem to have, a confounder for one causal path may be a mediator for another. Mediators should be excluded. This means that great care must be taken when choosing the set of variables to include in a model.

A causal diagram or directed acyclic graph (DAG) can be of great benefit in this type of situation. For example see here:
How do DAGs help to reduce bias in causal inference?

It is very important not to just put all your variables into one model.
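
As a rough illustration of the points above, here is a minimal sketch in statsmodels where the covariates enter as fixed effects and nothing with only two levels is used as a grouping factor. Several things in it are assumptions rather than anything stated in the question: the data are synthetic, the column names (`subject`, `feature_1`, `feature_2`, `chol_type`) are hypothetical, only two of the 38 features are shown, and the choice of a random intercept per subject (because each subject contributes five cholesterol values after melting) is one plausible structure, not the only one. Which covariates actually belong in the formula should still come from the causal diagram.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 200  # hypothetical number of subjects

# Synthetic wide-format data; all names and values are made up for illustration.
wide = pd.DataFrame({
    "subject": np.arange(n),
    "Age": rng.normal(0, 1, n),        # standardized age
    "Sex": rng.integers(0, 2, n),
    "Group": rng.integers(0, 2, n),
    "Alcohol": rng.normal(0, 1, n),
    "Smoking": rng.normal(0, 1, n),
    "Levodopa": rng.integers(0, 2, n),
    "feature_1": rng.normal(0, 1, n),  # stand-ins for two of the 38 scores
    "feature_2": rng.normal(0, 1, n),
})

# Five cholesterol measurements per subject sharing a subject-level effect.
subject_effect = rng.normal(0, 1, n)
chol_cols = [f"Cholesterol_{k}" for k in range(1, 6)]
for k, col in enumerate(chol_cols):
    wide[col] = (5 + 0.1 * k + 0.3 * wide["feature_1"] + 0.2 * wide["Alcohol"]
                 + subject_effect + rng.normal(0, 0.5, n))

# Melt the five outcome columns into one long "Cholesterol" column.
id_cols = [c for c in wide.columns if c not in chol_cols]
long_df = wide.melt(id_vars=id_cols, value_vars=chol_cols,
                    var_name="chol_type", value_name="Cholesterol")

# Covariates are fixed effects; the only random effect is a subject intercept.
model = smf.mixedlm(
    "Cholesterol ~ feature_1 + feature_2 + Age + C(Sex) + C(Group)"
    " + Alcohol + Smoking + C(Levodopa) + C(chol_type)",
    data=long_df,
    groups=long_df["subject"],
)
result = model.fit()
print(result.summary())
```

Note that this sketch forces each feature to have the same coefficient for all five cholesterol types; if that is not plausible, interactions with `chol_type` would be one way to relax it.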

Robert Long
  • Hey Robert, thank you for your answer! This was really helpful! How would you model the **group** and **gender** variables? They are directly related to certain Cholesterol outcomes as causes (among others). Would you put them in the "groups" parameter of the statsmodels syntax? – Thomas Lordick Aug 16 '20 at 19:12
  • Gender should not be a grouping variable for random intercepts. The software will estimate a variance for it and there are (presumably) only 2 levels. A variance estimate for a sample size of 2 is not reliable and almost meaningless. As for Group, you would need to provide more detail about that variable. Context is very important. – Robert Long Aug 16 '20 at 19:32
  • Btw, I figured out that the variables I defined as confounders are actually just competing exposures, because they do affect the cholesterol outcomes but do not affect the exposure variables, which are my pathogenetic scores! That's why I can't define them as confounders. The same goes for my other categorical variables **gender** and **group**. So they are basically all just fixed effects in my model, right? – Thomas Lordick Aug 16 '20 at 19:36
  • The **group** variable denotes whether the subject has Parkinson's disease (1) or not (0). I'm investigating the genetic effect of Parkinson's and trying to link genetic data in the form of GenePy scores to certain Cholesterol levels, because Cholesterol might play a role in the disease outbreak. This is basically my research project: linking genetic features of Parkinson's to the levels of certain cholesterols. – Thomas Lordick Aug 16 '20 at 19:41
  • Ok. So with only 2 levels you shouldn't treat it as random. It sounds like a very interesting study. – Robert Long Aug 16 '20 at 19:48
  • Competing exposures should be included, along with confounders, but not mediators. Causal inference is not easy, but a principled approach as per the answer I linked to should get you started. – Robert Long Aug 16 '20 at 19:52
  • You may also want to consider propensity score modeling first and then using your propensity scores to further control for the confounding: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3144483/ Of course, even more ideal is to conduct an experiment, if feasible, and address confounding with randomization to treatment groups. – StatsStudent Aug 25 '20 at 21:46
  • Does this answer your question? If so, please consider marking it as the accepted answer, and if not please let us know why. – Robert Long Aug 28 '20 at 06:18
  • I actually don't know how to mark it ! :/ – Thomas Lordick Sep 30 '20 at 14:19
  • @StatsStudent Can you recommend me a tool that calculates such scores from my data ? – Thomas Lordick Sep 30 '20 at 14:30
  • @ThomasLordick I'd recommend the `twang` package in R: https://cran.r-project.org/web/packages/twang/vignettes/twang.pdf. It uses generalized boosted regression to estimate propensity scores and this method has been shown to outperform other methods for achieving "covariate balance" – StatsStudent Sep 30 '20 at 14:41