Modeling covariates in multiple regression

Question

My aim is to find the association between intake of chocolate (continuous predictor) and blood pressure (continuous outcome) in a multiple linear regression. I have to include many covariates in order to adjust for confounders (many of them are continuous variables).

But I don't understand whether I should keep the continuous covariates as continuous or convert them into categorical variables. When I categorize some of the continuous variables, I see that they are not linearly related to the outcome (which is one of the assumptions for linear regression models).

For example, with fiber intake, using the lowest intake group as the reference group, the beta coefficients for the other intake groups don't increase or decrease linearly as intake rises across the groups. And for many of my covariates, the p-values are lower and the $R^2$ is higher when the covariate is categorical than when it is continuous. My questions are:

1. Should I be concerned when a continuous covariate isn't linearly related to the outcome? (Or is this not important because I am only interested in finding the association between the predictor and the outcome?)
2. In choosing whether to model a covariate as continuous or categorical, should I look at which gives the lowest p-value and the highest $R^2$?

I would be very grateful for answers as I have been struggling to understand this for a long time.

Turning a continuous variable into a categorical one is pretty much always a bad idea; we have posts on here about that. If you want to model the nonlinear behavior of your covariates, perhaps consider some kind of spline. (We also have posts on here about that.) — Dave, Dec 16 '20 at 12:56
@Dave thank you for the comment! Do you have some thoughts on question number 1? — Mira, Dec 16 '20 at 13:24
The assumptions behind multiple regression do not include that *individual* variables are linearly related to the outcome. What one hopes is that each variable, *after controlling for all the other variables,* will be related in an *approximately* linear way to the outcome. A standard way to evaluate this is the [Added Variable Plot](https://stats.stackexchange.com/questions/125561), *aka* Partial Regression Plot. — whuber, Dec 16 '20 at 14:21
See (about why not bin) https://stats.stackexchange.com/questions/68834/what-is-the-benefit-of-breaking-up-a-continuous-predictor-variable and https://stats.stackexchange.com/questions/390705/why-should-binning-be-avoided-at-all-costs — kjetil b halvorsen, Dec 16 '20 at 16:16
Thank you for commenting, @Whuber and @kjetil-b-halvorsen! Now I understand that I misunderstood the linearity assumption for multiple linear regression. I guess I can evaluate the linearity of my variables with a residual plot after I have included all my variables in the multiple regression. I have, however, read that continuous variables should be checked beforehand to see whether they are linearly related to the outcome. And now I don't understand why I should do that beforehand if the residual plot with all the included variables indicates a linear relationship. — Mira, Dec 16 '20 at 17:16
That's an excellent point. The main reason one evaluates these bivariate relations (rather than the multivariate ones) is because it's easy to do: for instance, a scatterplot matrix (SPM) does that in a comprehensive way. In many cases, nonlinearities in the SPM really do reflect nonlinearities in the multiple regression; at the very least, they suggest a need for further evaluation. — whuber, Dec 16 '20 at 19:27
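
To make the added variable plot from the comments concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the data are simulated, the variable names (`chocolate`, `fiber`, `bp`) and the coefficients are made up, and only one covariate is adjusted for. It computes the two sets of residuals that an added variable plot displays (outcome given the covariate, predictor given the covariate) and checks that the slope through them matches the chocolate coefficient from the regression on both variables.

```python
import random

def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line for y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def residuals(x, y):
    """Residuals of y after removing its linear dependence on x."""
    intercept, slope = simple_ols(x, y)
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

def coef_of_x_given_z(x, z, y):
    """Coefficient of x in the regression of y on x and z (with intercept),
    from the 2x2 normal equations on centered variables."""
    n = len(x)
    mx, mz, my = sum(x) / n, sum(z) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    szz = sum((b - mz) ** 2 for b in z)
    sxz = sum((a - mx) * (b - mz) for a, b in zip(x, z))
    sxy = sum((a - mx) * (c - my) for a, c in zip(x, y))
    szy = sum((b - mz) * (c - my) for b, c in zip(z, y))
    return (szz * sxy - sxz * szy) / (sxx * szz - sxz ** 2)

# Simulated data; all numbers are invented for illustration only.
random.seed(1)
n = 200
fiber = [random.gauss(20, 5) for _ in range(n)]
chocolate = [0.5 * f + random.gauss(0, 3) for f in fiber]
bp = [130 - 0.4 * c - 0.6 * f + random.gauss(0, 4)
      for c, f in zip(chocolate, fiber)]

# The two axes of the added variable plot for chocolate:
bp_resid = residuals(fiber, bp)           # outcome, adjusted for fiber
choc_resid = residuals(fiber, chocolate)  # predictor, adjusted for fiber

# Slope through the plotted points vs. the multiple-regression coefficient.
_, av_slope = simple_ols(choc_resid, bp_resid)
full_slope = coef_of_x_given_z(chocolate, fiber, bp)
print("added-variable slope:", av_slope)
print("multiple-regression coefficient:", full_slope)
```

Plotting `bp_resid` against `choc_resid` gives the added variable plot itself; curvature in that scatter, rather than in the raw bivariate scatterplot, is what would suggest a nonlinear term for the variable after adjustment. With several covariates the same idea applies, with each residual taken from a regression on all the other variables.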
