I am trying to run a multiple regression analysis of a continuous health variable (yval) with age, gender, height (cm), weight (kg) and waist circumference (cm) as predictor variables, in a database of about 7000 children. Age, height, weight and waist are correlated with each other (r ranging from 0.53 to 0.88; all p-values < 0.0001). Will the results of the multiple regression be valid, or should I use some other method? My impression is that the large sample size will reduce the error in my situation.
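For concreteness, a minimal R sketch of the setup described above; the data frame `mydata` and the column names (`yval`, `age`, `gender`, `ht`, `wt`, `waist`) are assumptions based on the command quoted in the comments below:

```r
# Minimal sketch: pairwise correlations among the continuous predictors,
# then the full model. `mydata` and all column names are assumed.
round(cor(mydata[, c("age", "ht", "wt", "waist")]), 2)

fit <- lm(yval ~ age + gender + ht + wt + waist, data = mydata)
summary(fit)
```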
- Agree with @gung that, if you're worried about validity, your biggest worry here should be multicollinearity. However, having correlated predictors is a problem when it comes to interpreting your results. The overall variance explained by the predictors will be the same, but depending on how the model is run, one of them may appear to be non-significant. Moreover, one predictor will take variance from another, which turns it into something harder to pin down (e.g., what does waist size represent once it has been adjusted for weight?) – Mensen Jan 25 '15 at 15:22
- What is the best way to do this analysis? Should I do univariate analyses first and then put only the significant predictors in the model? But they are all significantly associated with yvar. I am running the following command in R: `lm(yvar ~ age + gender + ht + wt + waist, data = mydata)` – rnso Jan 25 '15 at 16:21
- @rnso, you should be fine including all the variables you are interested in. It is not recommended to first test for bivariate relationships & then only enter the significant ones. This amounts to selecting based on a random variable, which invalidates p-values, e.g., & degrades out-of-sample predictive accuracy. It may help to read my answer here: [algorithms for automatic model selection](http://stats.stackexchange.com/a/20856/7290). – gung - Reinstate Monica Jan 25 '15 at 17:31
- @gung: thanks for clarifying. I am reading the link now. – rnso Jan 25 '15 at 17:36
1 Answer
There is nothing invalid here. These correlations will only increase your standard errors, but you should have enough data to compensate. The term to search for is "multicollinearity", if you'd like to learn more; there is a good deal of information on the site.
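As a concrete illustration of that pointer, a minimal R sketch of one way to inspect the degree of multicollinearity; it assumes the `lm` call from the comments above and the `car` package, neither of which the answer itself prescribes:

```r
# Minimal sketch of a multicollinearity check; assumes `mydata` and the
# model from the comments, plus the `car` package (install.packages("car")).
library(car)

fit <- lm(yval ~ age + gender + ht + wt + waist, data = mydata)

# Variance inflation factors: values much larger than roughly 5-10 suggest a
# coefficient's standard error is being inflated substantially by collinearity.
# (With a factor predictor such as gender, car reports generalized VIFs.)
vif(fit)

# The coefficient table shows those (possibly inflated) standard errors directly.
summary(fit)$coefficients
```

High VIFs would not make the fit invalid; they would simply show up as the wider standard errors mentioned above.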

gung - Reinstate Monica
- Thanks for your prompt reply. I will certainly search and read about multicollinearity in multiple regression. – rnso Jan 25 '15 at 04:33