0

I am building a predictive model and hope to improve its model fit. I have 2 predictors, BMI (continuous variable) and smoking status (binary variable), and my outcome is disease status (yes/no).

Can someone please share how we can improve the model fit? I can only think of two possible approaches: 1) log transformation and 2) adding interaction or squared terms. Are there any other ways we can try?

R Beginner
  • It is a broad question and we have little information… What have you done so far? Do you have other variables to include in your model? What type of regression are you carrying out? – Pitouille Dec 02 '21 at 05:47
  • Hi, @Pitouille I am carrying out multivariable logistic regression. These two are my final predictors (I tried others and ruled them out). But now I want to make my model even better with just these 2 predictors (no more new predictors). Are those methods I proposed above valid? Hope this provides a bit more context! – R Beginner Dec 02 '21 at 07:26
  • 2
    If you have checked that the relationship with the predictors and the outcome (logit) is linear, adding a squared term should not be relevant here. You mentioned an interaction term, have you observed any? – Pitouille Dec 02 '21 at 07:49
  • "If you have checked that the relationship with the predictors and the outcome (logit) is linear, adding a squared term should not be relevant here." can you pls explain this part? Also, does this part apply to log transformation as well (i.e. if I observe a linear relationship, I won't need to log-transform my predictor)? – R Beginner Dec 02 '21 at 08:25
  • 1
    In logistic regression, the relationship between the logit of the response and your predictors is assumed to be linear. The response is not modeled as is (meaning 0 and 1); we are dealing with its log odds. To avoid duplicating content, please have a look at this simple illustration: https://stats.stackexchange.com/a/88607/321901 – Pitouille Dec 02 '21 at 08:35

2 Answers

1

Your analysis should always be guided by existing knowledge and theory. Don't blindly run many models and pick the "best" one, even if you cross-validate, because you may simply be overfitting to the test set.

To my understanding, it is reasonable to assume that there may be an interaction between smoking and BMI, so it would make sense to include one.

The relationship between BMI and almost anything else is probably nonlinear, so it would make sense to use a spline transform of the BMI. Frank Harrell's Regression Modeling Strategies provides an excellent overview of splines. These are better than square or other polynomial transformations. Of course, you can use interactions between smoking and spline-transformed BMI.
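As a sketch (assuming a data frame `dat` with columns `disease` (0/1), `bmi`, and `smoking` — all names illustrative), a natural-spline transform of BMI with a smoking interaction might look like:

```r
library(splines)  # ships with base R

# Baseline model vs. spline-transformed BMI interacted with smoking status
fit_lin    <- glm(disease ~ bmi + smoking,             data = dat, family = binomial)
fit_spline <- glm(disease ~ ns(bmi, df = 3) * smoking, data = dat, family = binomial)

# Does the spline + interaction improve fit? Likelihood-ratio test and AIC:
anova(fit_lin, fit_spline, test = "LRT")
AIC(fit_lin, fit_spline)
```

The `df = 3` is a placeholder; Harrell's book discusses how to choose the number and placement of knots (his `rms` package offers `rcs()` as an alternative basis).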

A binary smoking variable is very crude. "Smoking" can mean two cigarettes per day, or forty. There will be a difference. If you can get an estimate of the actual number of cigarettes smoked per day, it would probably make more sense to include this numerical predictor. You can also spline-transform it. Or use an interaction with BMI (which becomes hard to interpret).

In any case, use proper scoring rules to assess your model, not accuracy.
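For instance, two common proper scoring rules, the Brier score and the log loss, can be computed directly from predicted probabilities (a sketch, assuming a fitted logistic model `fit` and a data frame `dat` with a 0/1 outcome `disease` — names are illustrative):

```r
p <- predict(fit, type = "response")  # predicted probabilities
y <- dat$disease                      # observed 0/1 outcomes

brier   <- mean((p - y)^2)                           # Brier score: lower is better
logloss <- -mean(y * log(p) + (1 - y) * log(1 - p))  # log loss: lower is better
```

Ideally these would be evaluated out of sample (e.g. via cross-validation) rather than on the training data.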

Stephan Kolassa
  • I am thinking of using both the likelihood ratio test and AIC (information criteria) to assess my model. What do you think of these 2 approaches? – R Beginner Dec 02 '21 at 17:27
  • 1
    Both are useful. To be honest, I think the difference between these two approaches is less important than making sure you avoid overfitting (which these will only guard against to a certain degree). – Stephan Kolassa Dec 02 '21 at 17:29
  • Thank you! I actually tried to include an interaction term, and here is what I have: the model fit gets better (as evidenced by the LR test), but since I now have 3 predictors (smoking status, BMI, smoking status × BMI) instead of just 2, their respective p-values became higher & insignificant in this final model. My question is: should I still include this interaction term? – R Beginner Dec 02 '21 at 17:33
  • 1
    Hm. There is no automatic effect that p values would increase if you add an interaction term. However, the main effects' coefficients are hard to interpret in the presence of an interaction. If your LR test believes the interaction makes sense, then I would go with the model with the interaction, and only interpret the interaction. Best to plot the estimated logit of the response against BMI, separately for smokers and non-smokers, that will show the effect of the interaction. – Stephan Kolassa Dec 02 '21 at 17:42
  • So, I know p-value isn't important when it comes to model prediction, but if the p-value of my interaction term is >0.05 (but this interaction term leads to better model fit), based on what we discussed, you would still include this interaction term in the model, correct? – R Beginner Dec 02 '21 at 17:47
  • 1
    Yes, exactly. It would definitely be good to look at the kind of interaction plot I suggested above. – Stephan Kolassa Dec 03 '21 at 08:59
  • Can you pls explain (or provide an example) when the spline transformation will be absolutely necessary for the logistic regression? – R Beginner Dec 07 '21 at 18:26
  • 1
    One can always simulate some data to illustrate a situation like this, did you have something like this in mind? If so, perhaps post a new question, because that would be a bit too large for a comment (feel free to link to the new question here, and I'll try to take a look). Alternatively, here is a paper I was involved in where a spline transform (of traumatic load, interacting with genotype) did improve a logistic regression model (to develop PTSD): https://www.psychiatrist.com/jcp/trauma/ptsd/association-study-trauma-load-slca-promoter-polymorphism/ – Stephan Kolassa Dec 07 '21 at 18:34
  • This is such a great paper! It helps me answer some of my questions. I have one question regarding the timing of the use of the spline transformation. Do you apply it before the analysis to ensure linearity, or after the analysis to improve the model fit? – R Beginner Dec 07 '21 at 19:44
  • 1
    We spline transformed traumatic load before feeding the splines into the logistic model. – Stephan Kolassa Dec 07 '21 at 20:41
  • Do you know if it is possible to report the odds ratio from the spline-transformed predictor (or from the GAM)? – R Beginner Dec 08 '21 at 04:56
  • 1
    Yes, of course. It's just a question of `predict`ing the response for two different values of the predictor (so you would spline transform the predictor and feed it into the model, then predict), then calculate the OR, or any other KPI of interest, from those predictions. – Stephan Kolassa Dec 08 '21 at 07:55
  • Really? I thought because the spline function creates multiple curvatures so a single odds ratio associated with a 1-unit increase of predictor is no longer possible. Can you pls provide me with an R code (or direct me to any other resource/posts) for how to get an odds ratio for the spline-transformed predictor? – R Beginner Dec 08 '21 at 19:31
  • 1
    Can you please create a new post for that question, including a [Minimal Working Example](https://stackoverflow.com/q/5963269/452096)? A helpful answer won't fit into a comment, and a specific thread on this will be helpful for later generations. Also, you might get answers from other people than me. Of course, feel free to link to that new question from here, and I'll try to take a look. – Stephan Kolassa Dec 08 '21 at 20:45
  • 1
    Meanwhile, perhaps one of [these questions](https://stats.stackexchange.com/search?q=odds+ratio+%5Bsplines%5D) is helpful? These two look especially promising: https://stats.stackexchange.com/q/328545/1352 and https://stats.stackexchange.com/q/127134/1352. You will notice that Frank Harrell answered both of them, his book *Regression Modeling Strategies* is very good. – Stephan Kolassa Dec 08 '21 at 20:46
  • Hi Stephan - I have posted a new question here. All the feedback is extremely appreciated. https://stats.stackexchange.com/questions/555416/restricted-cubic-spline-function-summary-intepretation – R Beginner Dec 10 '21 at 17:30
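The interaction plot and the prediction-based odds ratio discussed in these comments could be sketched as follows (assuming a fitted model `fit_spline <- glm(disease ~ ns(bmi, 3) * smoking, family = binomial, data = dat)` with `smoking` a factor with levels `"no"`/`"yes"` — all names and the 25-vs-30 comparison are illustrative):

```r
library(splines)

# Estimated logit against BMI, separately for smokers and non-smokers
grid <- expand.grid(bmi     = seq(15, 40, by = 0.5),
                    smoking = factor(c("no", "yes")))
grid$logit <- predict(fit_spline, newdata = grid, type = "link")
plot(logit ~ bmi, data = subset(grid, smoking == "no"), type = "l",
     ylim = range(grid$logit), xlab = "BMI", ylab = "estimated logit")
lines(logit ~ bmi, data = subset(grid, smoking == "yes"), lty = 2)

# Odds ratio comparing two chosen BMI values (here 25 vs 30) for non-smokers
p2 <- predict(fit_spline,
              newdata = data.frame(bmi = c(25, 30),
                                   smoking = factor("no", levels = c("no", "yes"))),
              type = "response")
odds <- p2 / (1 - p2)
or_30_vs_25 <- odds[2] / odds[1]  # with a spline, the OR depends on which values you compare
```

As the final comment notes, because the spline relationship is nonlinear there is no single "per 1-unit" odds ratio; you report the OR between specific predictor values of interest.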
0

You could also create additional categorical features from the numerical BMI, ranging from 'very underweight' up to 'highly overweight'. This might make it easier for your model to produce improved predictions.

It might also make sense to change the algorithm. Why rule out algorithms like SVM and XGBoost before trying to tune them? These algorithms can actually perform very well after tuning their parameters to your specific data set.

janrth
  • Do you mean including BMI as a categorical variable (rather than a continuous variable) may improve prediction? – R Beginner Dec 02 '21 at 17:20
  • 1
    Yes, but create meaningful categories. Maybe you split it up into 'very_low_bmi' (x<20), 'very_high_bmi' (x>35), and 'mid_bmi' for the rest. These are just examples; I am not very familiar with how best to categorise BMI, but in general this would be the idea. – janrth Dec 03 '21 at 19:43
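Using the illustrative thresholds from this comment (x<20 and x>35), such a categorisation could be done in R with `cut()` (assuming a data frame `dat` with a `bmi` column):

```r
# Bin BMI into three illustrative categories: [-Inf, 20), [20, 35), [35, Inf)
dat$bmi_cat <- cut(dat$bmi,
                   breaks = c(-Inf, 20, 35, Inf),
                   labels = c("very_low_bmi", "mid_bmi", "very_high_bmi"),
                   right  = FALSE)
table(dat$bmi_cat)
```

Note that categorising a continuous predictor discards information, which is one reason the spline approach in the other answer is often preferred.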