
I have a longitudinal dataset with two time points: the body mass index (BMI) of patients was recorded in 2018 and in 2020. The research question is to investigate the evolution of BMI over time (from 2018 to 2020) in different covariate groups (age, gender, region, education level, …).

To answer the research question, I assume I should include in my model the interaction of every covariate with time (categorical, with two levels: 2018 and 2020).

The issue is that, for example, when I fit the model with only main effects (no interactions), some covariates are significant and others are not. When I include the interactions, some of the main effects that were significant become non-significant, and vice versa.

The same issue arises with the interaction effects themselves: their significance is affected by the presence of other covariates in the model.

What should my final model look like? I don't know what to include and what to exclude.

Thank you in advance.

ilyas
  • Please stop focusing on statistical significance. It should not inform the decision about whether to include variables and interactions. When you include an interaction, it changes the meaning of the main effects, so it is entirely normal, and expected, for the p-values to change. When a variable is interacted with another, the main effect for that variable is conditional on the other variable being zero (or at its reference level in the case of a categorical variable), which is completely different from the case with no interaction. – Robert Long May 19 '21 at 18:10
  • Thank you for your reply, that was a good clarification. However, the question remains: which covariates should I include to answer the research question? If not statistical significance, then what should the choice be based on? – ilyas May 19 '21 at 18:19
  • You're welcome :) I went into some detail about that in my answer to [your previous question](https://stats.stackexchange.com/questions/524793/what-is-the-criteria-for-including-and-excluding-variables-in-longitudinal-model) – Robert Long May 19 '21 at 18:21
  • The method you recommended has a data science flavor, and I don't think I should apply it here. This is an epidemiological research question: I am interested in inference, not prediction. I am not looking for the best model or the best fit; I simply want to answer "which factors impact the evolution over time of BMI". – ilyas May 19 '21 at 18:38
  • What?! The method I recommended is the exact opposite of a data science flavour. My background is in causal inference in the biostatistics department of a well-known UK university. A data science approach would be based on an automated procedure, and that is exactly what I am advising against. Did you read the other answer that I linked to: [How do DAGs help to reduce bias in causal inference?](https://stats.stackexchange.com/questions/445578/how-do-dags-help-to-reduce-bias-in-causal-inference/) It's literally about causal inference! – Robert Long May 19 '21 at 19:21
  • For your information, the main tag you are using for this question, "Feature Selection", is terminology from data science. Statisticians talk about "Variable Selection". Anyway, the main point is that when you choose variables based on p-values, you may introduce bias due to confounding, mediation, and differential selection, to name a few. It is crucial to reduce bias when we are interested in inference, for obvious reasons, and that is what my suggestions are about. In data science, we don't usually care about bias in estimates, only about the accuracy of predictions of the response. – Robert Long May 19 '21 at 19:39
  • Robert is right. A DAG is perhaps the most appropriate way to go about doing this. – Demetri Pananos May 19 '21 at 20:12
  • @ilyas if you are still not convinced that a DAG is a good way to proceed, simply do a Google search for **DAG "causal inference" epidemiology** and you will find references to thousands of academic papers and books by distinguished statistical epidemiologists who specialise in causal inference, such as Hernan, Robins, Greenland, Gilthorpe and Tennant, as well as output from perhaps the world-leading expert in causality, whose primary work is about DAGs, Judea Pearl. – Robert Long May 19 '21 at 21:51
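
The point made in the comments above, that adding an interaction changes what a main effect means, can be illustrated with a small simulation. This is only a sketch under assumptions that are not in the original question: the data are simulated, the sex difference in BMI is set to 0 in 2018 and 2 units in 2020, the design is balanced, and within-patient correlation is ignored for simplicity. The helper `ols` and all variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_per_cell = 1000  # hypothetical number of observations per (time, sex) cell

# Balanced design: time is 0 (2018) or 1 (2020); female is 0 or 1.
time = np.repeat([0.0, 1.0], 2 * n_per_cell)
female = np.tile([0.0, 1.0], 2 * n_per_cell)

# Assumed truth: no sex difference in 2018, a 2-unit difference in 2020.
bmi = 25.0 + 0.5 * time + 2.0 * time * female + rng.standard_normal(time.size)


def ols(columns, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    X = np.column_stack([np.ones_like(y), *columns])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef


main_only = ols([time, female], bmi)
with_interaction = ols([time, female, time * female], bmi)

# Without the interaction, the 'female' coefficient averages over both years.
print("female, main effects only:", round(main_only[2], 2))
# With the interaction, 'female' is the difference at time = 0 (2018) only.
print("female, with interaction: ", round(with_interaction[2], 2))
print("time x female interaction:", round(with_interaction[3], 2))
```

In the main-effects model the `female` coefficient estimates the sex difference averaged over both years, while in the interaction model it estimates the difference in 2018 only. The two numbers answer different questions, so a change in their p-values says nothing about which model is better.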

1 Answer


To add to the context here, finding significant predictors is easy. If you want p < .05, then all you need is 100 or so predictors and you'll get a handful that come out as significant by chance alone. Significance of predictors is hardly interesting unless there's good theoretical reason to believe that statistical significance corresponds to clinical interest.
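
As a rough illustration of that claim, the sketch below simulates an outcome and 100 predictors that are all pure noise, tests each predictor separately at the 5% level, and counts how many come out "significant". The setup is an assumption made for illustration (independent standard normal data, one correlation test per predictor, a normal approximation to the critical value), not an analysis of the BMI data; `count_false_positives` is a hypothetical helper.

```python
import numpy as np


def count_false_positives(n_obs=200, n_predictors=100, rng=None):
    """Count noise predictors whose correlation with a noise outcome has |t| > 1.96."""
    rng = np.random.default_rng() if rng is None else rng
    X = rng.standard_normal((n_obs, n_predictors))  # predictors unrelated to y
    y = rng.standard_normal(n_obs)                  # outcome unrelated to X

    # Pearson correlation of each column of X with y.
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    r = (Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))

    # t statistic for testing zero correlation; 1.96 approximates the 5% cutoff.
    t = r * np.sqrt((n_obs - 2) / (1.0 - r**2))
    return int(np.sum(np.abs(t) > 1.96))


rng = np.random.default_rng(0)
counts = [count_false_positives(rng=rng) for _ in range(200)]
mean_false_positives = float(np.mean(counts))
print("average number of 'significant' noise predictors:", mean_false_positives)
```

Averaged over repeated simulated datasets, roughly 5 of the 100 noise predictors pass the 5% threshold, even though none of them has any relationship with the outcome.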

You want to know "which factors impact the evolution over time of BMI," so say that education is a significant predictor. Do I gain or lose weight because I went to school longer than someone else? Education does not directly cause weight or height changes that would impact BMI over time, so finding that education is a significant predictor does not mean that education is causing BMI changes over a two-year period. If education is significant, then you might have some beliefs about why that is the case.

What is being recommended is a model-based approach to statistical thinking. The goal of the model is to simplify the world. In this case, you're simplifying BMI changes into a set of linear predictors that are all only indirectly related to BMI (i.e., it doesn't seem to me that you're collecting data on caloric intake, activity level, genetic predispositions, medications and side effects, etc.). Once we are simplifying the world for the sake of building a model, it behooves us to think about which factors would help us recreate the data-generating process.

In this way of thinking, something like education or region starts to make sense as informative about BMI. For example, I might think that people with more education have more health literacy and may thus eat healthier, or they may have higher average wages that let them afford better foods. Similarly, people in certain regions may all have similar kinds of diets, which would make region a useful way of predicting BMI compared to people from other regions with different diets.

Significance of individual predictors is irrelevant if the model is capturing meaningful aspects of the underlying data generation (i.e., the true processes affecting BMI change over a two-year interval). You'll just want to guard against violating assumptions (e.g., including a bunch of interactions can sometimes inflate collinearity) and against overfitting the model. To avoid overfitting, it is important to make very careful a priori statements about what is giving rise to your data (e.g., through DAGs). Just as anyone can get significant predictors, it's not hard to get a "good fitting" model. It's much harder to develop a meaningful model.
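
The remark about interactions inflating collinearity can be made concrete with a small sketch. It assumes a hypothetical continuous covariate (baseline age, simulated with mean 50 and standard deviation 10, held fixed across both waves) and measures how strongly the time indicator correlates with the time × age product term, before and after centering age. None of these numbers come from the original data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_patients = 1000  # hypothetical number of patients, each measured twice

# Baseline age, repeated for both waves (long format: 2018 rows, then 2020 rows).
age = rng.normal(50.0, 10.0, size=n_patients)
age_long = np.concatenate([age, age])
time = np.repeat([0.0, 1.0], n_patients)  # 0 = 2018, 1 = 2020


def corr(a, b):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])


# Product term built from raw age: almost collinear with the time indicator.
corr_raw = corr(time, time * age_long)

# Product term built from mean-centered age: essentially uncorrelated with time.
age_centered = age_long - age_long.mean()
corr_centered = corr(time, time * age_centered)

print("corr(time, time x age), raw age:     ", round(corr_raw, 3))
print("corr(time, time x age), centered age:", round(corr_centered, 3))
```

Centering does not change the estimated time × age interaction. It only changes what the main effect of time refers to: the change in BMI at the mean age rather than at age zero.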

Billy
  • (+1) Those are some very good, relevant points! BMI is a fantastic source of all kinds of interesting causal inference statistical issues. All we need now is to bring birthweight and mother's lifestyle variables into the equation and we can really get going. – Robert Long May 19 '21 at 22:32