
I am trying to do a regression analysis with the blood level of a chemical as the dependent variable and the age, gender and weight of children as predictors. The sample size is about 5000. Age and weight are highly correlated in children. My questions are:

  1. Should I use z-scores or percentiles for weight rather than raw values?

  2. Should I use some other technique rather than ordinary linear regression?

  3. Do I need to check whether the data are normally distributed at this sample size?

Edit: To clarify what I mean by z-scores or percentiles: ages are recorded as whole years (5, 6, 7, 8, etc.), with no fractional ages. For each age, I could calculate the z-score or percentile of each child's weight and use it instead of raw weight. That would let me answer the question 'Does being overweight for age have any effect on the blood level of the chemical?' Is this a reasonable argument? Also, this question differs from the earlier question and is not a duplicate: my questions 2 and 3 do not figure in the title.
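The within-age standardization described in this edit can be sketched in base R as follows. This is only an illustration on simulated data: the data frame `d`, the column names `wt_z` and `wt_pct`, and every number in the simulation are assumptions made for the example, not values from the study.

```r
set.seed(42)

# Hypothetical stand-in data: whole-year ages 5-15 and a weight in kg
d <- data.frame(age = sample(5:15, 500, replace = TRUE))
d$wt <- exp(rnorm(nrow(d), mean = 2.6 + 0.09 * d$age, sd = 0.18))

# z-score of weight within each whole-year age group
d$wt_z <- ave(d$wt, d$age, FUN = function(x) (x - mean(x)) / sd(x))

# Percentile rank (between 0 and 100) within each age group;
# tied weights share their average rank
d$wt_pct <- ave(d$wt, d$age,
                FUN = function(x) 100 * (rank(x) - 0.5) / length(x))

# Raw weight rises with age, while the mean z-score is zero in every group
aggregate(cbind(wt, wt_z) ~ age, data = d, FUN = mean)
```

Note that these z-scores and percentiles are relative to the sample itself, not to an external growth reference, so they describe how a child compares with the other children of the same age in this particular data set.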

Regarding the comment on biological issues by @DLDahly: the ages are 5-15 years. Biologically, I want to determine whether weight is a predictor of the blood level of the chemical, independent of age. The chemical level rises with age, but it is not clear whether being overweight increases it further. In fact, one cannot rule out the possibility that this rise is related mainly to weight and not to age as such.

rnso
  • I would like to address your second question. Since age and weight are highly correlated, the standard errors will be inflated, which can leave the coefficients insignificant. The problem is known as *multicollinearity*. The simplest thing you can do is drop one of these variables from the model, but if you don't want to do that, ridge regression is a solution. – JohnK Jun 02 '15 at 11:29
  • See also [When should you center your data & when should you standardize?](http://stats.stackexchange.com/q/29781/17230) for the use of z-scores. Note it's residuals that should be normally distributed - [What if residuals are normally distributed, but y is not?](http://stats.stackexchange.com/q/12262/17230). Larger sample sizes do help with confidence intervals for coefficients, but not with prediction intervals - [Regression when the OLS residuals are not normally distributed](http://stats.stackexchange.com/q/29731/17230). – Scortchi - Reinstate Monica Jun 02 '15 at 11:46
  • @JohnK : Any choice between ridge, elasticnet or lasso? – rnso Jun 02 '15 at 11:54
  • I personally like lasso because it sets coefficients equal to zero, rather than shrinking them as ridge regression does. This often simplifies things. Others might prefer a different technique though. – JohnK Jun 02 '15 at 11:56
  • lasso sometimes simplifies things by selecting a list of variables to get non-zero coefficients that does not reproduce when running lasso on another sample or using the bootstrap. – Frank Harrell Jun 02 '15 at 12:12
  • @JohnK: Note LASSO also shrinks remaining coefficients, & that it isn't likely to perform as well as ridge regression when the predictors are highly collinear. – Scortchi - Reinstate Monica Jun 02 '15 at 12:13
  • For this type of problem percentile is certainly not recommended. Weight will affect an individual in a physical way, not in a way related to how many other individuals in the sample have similar weight. – Frank Harrell Jun 02 '15 at 12:14
  • Please see my edit as to how I intended to use percentiles or z-scores. – rnso Jun 02 '15 at 12:25
  • On multicollinearity see [Dealing with correlated regressors](http://stats.stackexchange.com/q/3561/17230) & other posts with that tag. – Scortchi - Reinstate Monica Jun 02 '15 at 12:33
  • @rnso: You asked three questions in one & they're all dealt with here - that's why I've added other links besides the duplicate of the title. Please do search the site before asking questions. I have re-opened this though, as your edit makes the question a very different one. – Scortchi - Reinstate Monica Jun 02 '15 at 12:36
  • This is more a biological question than a statistical one. Does your sample span multiple stages of development, so that age describes different physiological stages? Should absolute size be more relevant than relative size (probably)? What is the hypothesized biological process you are trying to capture? IMO that is what needs sorting before moving onto the statistical nitty-gritty. – D L Dahly Jun 02 '15 at 13:07
  • I have appended my question with clarification regarding this. – rnso Jun 02 '15 at 13:24

1 Answer


The difficulty lies in equating, say, an eight-year-old whose weight is two standard deviations above the mean for his age (fat) with a fourteen-year-old whose weight is two standard deviations above the mean for his age (shooting up). And even if you're happy with that for the population, you still need to be happy with it for your sample.

Rather than try to stipulate how age moderates the effect of weight on the blood concentration of some chemical, as you've got 5000 observations you can afford to be more flexible: an additive model with some non-linear terms in age already allows the effect of weight to be controlled for age; including interaction terms allows the slope to vary.

Suppose you were considering $$ \operatorname{E} Y = \beta_0 + \beta_a a + \beta_w w' $$ where $Y$ is blood concentration of the chemical, $a$ is age, $w'$ weight standardized within each age, & the $\beta$s the coefficients;

then the model $$ \operatorname{E} Y = \beta_0 + \beta_{a} a + \dots + \beta_{a^{10}}a^{10} + \beta_w w + \beta_{wa} wa + \dots + \beta_{wa^{10}} w a^{10} $$ where $w$ is unstandardized weight, would include the first as a special case while being much more flexible: it doesn't rigidly assume that it's the number of standard deviations from the mean weight within each age group that counts, while still allowing the slope & intercept for weight to vary within each age group. (With whole-year ages from 5 to 15 there are only 11 distinct ages, so a 10th-order polynomial in age can reproduce any age-specific intercept & slope exactly.) Of course you likely needn't go up to a 10th-order polynomial for a good fit, & it'd be sensible to allow for non-linearity in the effect of weight as well (I'd suggest a natural spline basis).
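As a rough sketch of how these models might be fitted in R, the code below uses simulated data, so the data frame `d`, its column names, and the assumed data-generating process are illustrative only. `factor(age) * wt` stands in for the 10th-order polynomial terms: with ages recorded as whole years from 5 to 15 the two parameterizations fit the same model, and the factor version is numerically better behaved. `ns()` from the `splines` package gives a natural spline basis; 3 degrees of freedom per term is an arbitrary choice for the example.

```r
library(splines)  # part of the standard R distribution; provides ns()

set.seed(1)
n <- 5000

# Simulated stand-in data (all values hypothetical)
d <- data.frame(
  age    = sample(5:15, n, replace = TRUE),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
d$wt <- exp(rnorm(n, mean = 2.6 + 0.09 * d$age, sd = 0.18))

# Assumed truth, for the simulation only: weight matters, age per se does not
d$y <- 1 + 0.05 * d$wt + rnorm(n)

# Weight standardized within each whole-year age group
d$std_wt <- ave(d$wt, d$age, FUN = function(x) (x - mean(x)) / sd(x))

# First model: linear in age and in age-standardized weight
fit_std <- lm(y ~ gender + age + std_wt, data = d)

# Saturated-in-age model: separate intercept and weight slope at each age
fit_sat <- lm(y ~ gender + factor(age) * wt, data = d)

# Smoother alternatives: natural splines, without and with interaction
fit_add <- lm(y ~ gender + ns(age, df = 3) + ns(wt, df = 3), data = d)
fit_int <- lm(y ~ gender + ns(age, df = 3) * ns(wt, df = 3), data = d)

# fit_std is nested in fit_sat, so an F-test between them is valid
anova(fit_std, fit_sat)

# Does the effect of weight vary with age?
anova(fit_add, fit_int)

AIC(fit_std, fit_add, fit_int, fit_sat)
```

Because the standardized-weight model is a special case of the age-saturated one, its residual sum of squares can never be smaller; the F-test indicates whether the extra flexibility is buying anything on a given data set.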

Scortchi - Reinstate Monica
  • Thanks for your advice. Could you clarify, preferably using formula terms, what exactly you mean by additive model with non-linear terms & interaction? Is it something like: lm(y ~ wt * (age+I(age^2)) ) ? – rnso Jun 02 '15 at 15:04
  • This analysis may become very complex if other variables are also to be added. The model may also be more difficult to explain or interpret, and interpretation is my primary aim here rather than prediction of future data. Also, can you provide some information or a good link on using natural splines here? – rnso Jun 03 '15 at 01:03
  • Your ability to interpret the model depends on the process you are trying to model, which still hasn't been elucidated. Also, people can't be expected to contribute to your question if the goal posts are going to be moved (e.g. other variables could be added). That said, the variance in total mass at different ages in human populations is large enough that you almost certainly don't need to worry about collinearity - so the relatively simple linear model given here is probably your best bet. – D L Dahly Jun 03 '15 at 08:39
  • @DLDahly: Very true. I'm not trying to advocate any particular model on such scanty information, just to show that standard empirical modelling procedures allow you to address concerns such as "what if the effect of weight varies with age?" without having to resort to shaky assumptions such as "the effect of weight is inversely proportional to the standard deviation of weight at each age group". – Scortchi - Reinstate Monica Jun 03 '15 at 08:49
  • @rnso: It could go either way: if, as you say is plausible, age per se has no effect, & the mean & variance of weight are very variable for different ages, then using age-standardized weights could necessitate a much more complex model, obfuscating a simple relationship between blood concentrations of the chemical & absolute weight. – Scortchi - Reinstate Monica Jun 03 '15 at 10:38
  • What is the role of the rcs() function of the rms package? Will the following work well: library(rms); ols(y ~ age + gender + rcs(wt), data=mydata) ? – rnso Jun 03 '15 at 12:00
  • @rnso: Useful information on regression splines can be found at Frank Harrell's [RMS site](http://biostat.mc.vanderbilt.edu/wiki/Main/RmS), as well as on how to decide how complex a model to fit overall, how to allocate degrees of freedom among predictors, & how to validate the model. – Scortchi - Reinstate Monica Jun 03 '15 at 12:32
  • Which one is better: lm(y ~ gender+ age + std_wt) or lm(y~ gender + age * wt) or ols(y ~ gender + rcs(age) + rcs(wt) )? The interaction of (age * wt) should tell me if wt is important after correction for age. – rnso Jun 03 '15 at 17:17