How can I deal with a covariate being defined only for a subset of my sample?

Question

My model looks like: ROI_size ~ diagnosis + medication_dose + sex + age. Specifically, I want to estimate the effect of disease (1 or 0) on brain region size, adjusted for current medication dose (measured in mg), while controlling for age and sex. Age and sex are available for the entire dataset. Medication dose, however, is defined only for the patients (diagnosis = 1).

I have thought of two solutions. One would be to assign medication_dose = 0 to all healthy controls. However, this creates (1) collinearity with diagnosis and (2) zero-inflated data, and I think it is not strictly correct, since the difference between 0 and 1 mg is not the same as the difference between 1 mg and 2 mg.

The other solution would be to adjust ROI_size for medication_dose within the patient cohort, i.e., to run the model ROI_size ~ medication_dose on the patients only, take its beta, and then add beta*medication_dose to the ROI_size of every patient.

My question in this case is whether I should include age and sex in the second model, since including them will yield a more accurate beta for medication_dose. If so, do I also need to correct for age and sex separately in the healthy controls, and then estimate the effect of disease with just ROI_size ~ diagnosis?

Histogram for current medication use in patients

Histogram for current medication use in all

Setting medication_dose = 0 does not create collinearity in your model (assuming that it is not constant for the patients who have a disease). — Misius, Jul 02 '21 at 12:19
Does it not create collinearity with diagnosis? Essentially, I would have two variables in my model, diagnosis and medication_dose, which describe the same "process". That is, there is no healthy control with medication > 0 and no patient with medication = 0. A simple Spearman correlation between the two variables gives rho > 0.6. Also, the zero-inflated data will throw off the accuracy of the t-values by a large margin, right? — nonsequitur, Jul 02 '21 at 12:23
Not necessarily. You are right to think hard about the meaning and coding of your variables before you get embroiled in any analytical or computational considerations: in brief, start with a model and statistical procedure you find reasonable and worry about issues like collinearity only if they arise. — whuber, Jul 02 '21 at 12:35
Thanks a lot for the feedback. This is actually the solution I have already used, since it is the simplest. However, comparing the results to the original model without medication_dose, I get the impression that the original effect size of disease is partitioned between the two variables in the second linear model (which could be true, but makes me fear distortions due to collinearity). Conceptually, the second solution is more contrived but looks statistically more accurate (it violates fewer assumptions of the linear model). Are there any drawbacks to it that I am missing? — nonsequitur, Jul 02 '21 at 12:43

score 5 · Answer 1 · answered Jul 02 '21 at 17:01

By (arbitrary) convention, people say you have 'a problem' with collinearity when the variance inflation factor ($VIF$) is greater than or equal to $10$. When considering a pair of variables, that means the Pearson correlation is $\gtrapprox .95$. You report that the Spearman correlation is $\approx .6$. If the Pearson correlation were the same (and the other variables were uncorrelated), that would make the $VIF = 1.56$. In other words, the variance of the estimated sampling distributions would be $\approx 1.6\times$ larger than what you could have ideally achieved, or the standard errors would be $1.25\times$ as wide. That's hardly any difference at all.
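The arithmetic above rests on the identity $VIF = 1/(1 - r^2)$, which holds when a predictor is correlated with only one other predictor. A minimal Python sketch of that calculation follows; the function name is made up for illustration and is not from any library.

```python
import math


def vif_from_pairwise_r(r: float) -> float:
    """VIF for a predictor whose only correlation is r with one other predictor."""
    if not -1.0 < r < 1.0:
        raise ValueError("r must lie strictly between -1 and 1")
    return 1.0 / (1.0 - r ** 2)


# 0.6 is the correlation discussed above; 0.95 is roughly the VIF = 10 threshold.
for r in (0.6, 0.95):
    vif = vif_from_pairwise_r(r)
    # The standard error is inflated by the square root of the VIF.
    print(f"r = {r:.2f}: VIF = {vif:.2f}, SE inflation = {math.sqrt(vif):.2f}x")
```

With more than two correlated predictors, the VIF has to come from regressing each predictor on all the others, so this shortcut no longer applies.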

Assuming there is sufficient variability in doses, and that the relationship is sufficiently linear from a dose of $0$ to your maximum dose, the coefficient on disease gives you an estimate of the difference in ROI size, relative to a healthy control of the same age and sex, for a patient with the disease who isn't taking any medication. The coefficient on the dose gives you the estimated change in size associated with a one-unit increase in dose. That is potentially useful information. I would start with that model (i.e., assigning ${\rm dose}=0$ for everyone not on the medication) and check whether the assumptions appear tenable (cf. EdM's answer). Also bear in mind that these data are observational and the dose is endogenous (physicians prescribed higher doses based on their assessments of the patients), so you cannot assume the coefficient is an unbiased estimate of the causal effect of the medication. But that's typical of biomedical research, and the results still have some value (properly understood).
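To make the interpretation of the two coefficients concrete, here is a small sketch on invented, noise-free data in which dose is coded 0 for controls. Everything in it is an assumption for illustration: the data rows, the "true" coefficients, and the hand-rolled `ols` helper (used only to avoid external dependencies). In practice one would fit the model with a standard regression routine.

```python
def ols(X, y):
    """Least-squares coefficients via the normal equations (X'X) b = X'y."""
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("design matrix is rank deficient")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    return [A[i][p] / A[i][i] for i in range(p)]


# Hypothetical rows: (diagnosis, dose_mg, sex, age). Controls get dose 0.
rows = [
    (0, 0, 0, 30), (0, 0, 1, 40), (0, 0, 0, 50), (0, 0, 1, 60),
    (1, 0, 0, 35), (1, 50, 1, 45), (1, 100, 0, 55), (1, 200, 1, 65),
    (1, 150, 0, 25),
]
# Invented data-generating coefficients, not estimates from any real study.
TRUE = {"intercept": 100.0, "diagnosis": -5.0, "dose": 0.02,
        "sex": 2.0, "age": -0.1}

X = [[1.0, d, dose, s, a] for d, dose, s, a in rows]
y = [TRUE["intercept"] + TRUE["diagnosis"] * d + TRUE["dose"] * dose
     + TRUE["sex"] * s + TRUE["age"] * a for d, dose, s, a in rows]

coef = dict(zip(TRUE, ols(X, y)))
for name, value in coef.items():
    print(f"{name:>9}: {value: .4f}")
```

Because the data are noise-free, the fit recovers the generating values: the diagnosis coefficient is the patient-versus-control difference at dose 0, and the dose coefficient is the slope per mg.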

I am aware of the limitation you mention; disease severity confounds the medication dose. My memory of the correlation was wrong, so I did the calculation again: Pearson is 0.52 (inaccurate because of skewed data), Spearman is 0.94 (p-value not exact with ties), and Kendall's tau is 0.87. What I fear is that the estimates will therefore not be accurate. I am proposing the within-patients analysis because I think it will yield the most accurate estimate for medication dose. I was then thinking of adding/subtracting the marginal effect*medication dose so as to have "unmedicated patients". What flaws does this have? — nonsequitur, Jul 02 '21 at 17:45
The Spearman correlation doesn't matter for collinearity; only the Pearson correlation does. Moreover, the main effect of collinearity is on the size of the SEs, not on the point estimate. I would not use what you call the "within-patients analysis". It won't handle the degrees of freedom correctly. — gung - Reinstate Monica, Jul 02 '21 at 19:14

score 3 · Answer 2 · answered Jul 02 '21 at 15:21

... there is no healthy control that has medication > 0 and no patient that has medication = 0.

That suggests a different two-step modeling approach from your second solution. Start without the medication.dose term in the model. That evaluates the critical issue of diagnosis directly, controlling for Sex and Age. Then add the medication.dose term and see whether it improves the fit; e.g., anova() comparison of those two nested models.

If the fit is not improved, then you have documented that adjusting for medication.dose doesn't matter (much) in terms of ROI_size when diagnosis, Sex and Age are taken into account.

If the fit is improved, then you have an estimate of the association between ROI_size and medication.dose. The model is then equivalent to what @whuber recommends in a different context for a predictor whose value is necessarily 0 for some cases. Here, the diagnosis predictor serves the role of the "loan indicator" in that question, with a corresponding interpretation of intercepts and regression coefficients.
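The nested-model comparison described above can be sketched as follows. This is a rough illustration on invented data: the rows, the ROI values, and the `fit_rss` helper are all hypothetical. It computes only the F statistic; R's anova() additionally reports the p-value from the F distribution, which the Python standard library does not provide.

```python
def fit_rss(X, y):
    """Residual sum of squares of the least-squares fit of y on the columns of X."""
    p = len(X[0])
    # Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination.
    A = [[sum(row[i] * row[j] for row in X) for j in range(p)]
         + [sum(row[i] * yi for row, yi in zip(X, y))] for i in range(p)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(p):
            if r != col:
                f = A[r][col] / A[col][col]
                A[r] = [a - f * b for a, b in zip(A[r], A[col])]
    beta = [A[i][p] / A[i][i] for i in range(p)]
    return sum((yi - sum(b * x for b, x in zip(beta, row))) ** 2
               for row, yi in zip(X, y))


# Hypothetical rows: (diagnosis, dose_mg, sex, age, ROI_size). Values invented.
data = [
    (0, 0, 0, 30, 97.3), (0, 0, 1, 40, 97.8), (0, 0, 0, 50, 95.1),
    (0, 0, 1, 60, 96.5), (1, 0, 0, 35, 91.8), (1, 50, 1, 45, 93.9),
    (1, 100, 0, 55, 92.2), (1, 200, 1, 65, 95.6), (1, 150, 0, 25, 96.9),
]
y = [roi for *_, roi in data]

# Reduced model: ROI_size ~ diagnosis + sex + age
X_reduced = [[1.0, d, s, a] for d, _, s, a, _ in data]
# Full model:    ROI_size ~ diagnosis + medication_dose + sex + age
X_full = [[1.0, d, dose, s, a] for d, dose, s, a, _ in data]

rss_reduced = fit_rss(X_reduced, y)
rss_full = fit_rss(X_full, y)

n = len(data)
df_extra = len(X_full[0]) - len(X_reduced[0])   # one added term (dose)
df_resid = n - len(X_full[0])                   # residual df of the full model
f_stat = ((rss_reduced - rss_full) / df_extra) / (rss_full / df_resid)
print(f"RSS reduced = {rss_reduced:.3f}, RSS full = {rss_full:.3f}, F = {f_stat:.3f}")
```

A large F relative to the F distribution with `df_extra` and `df_resid` degrees of freedom indicates that adding the dose term improves the fit.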

There's an important caution about how to handle medication.dose, however. If you treat it as an untransformed continuous variable then you are imposing a linear association between medication.dose and ROI_size. If there are only a few different values of medication.dose then you might be best off treating it as a multi-level categorical predictor, perhaps ordinal. If there's a wide range of values, you might need to transform appropriately.

You would have a complete collinearity problem if there were only one value of medication.dose for all those with a diagnosis; in that case medication.dose would be an exact multiple of diagnosis, and you would not be able to fit a model with both diagnosis and medication.dose uniquely. The partial collinearity that you fear with more than one medication.dose might tend to make the standard errors of the regression coefficients larger. If you are interested specifically in the association of medication.dose with ROI_size for those with diagnosis, you might be better off restricting analysis to those with diagnosis. The question (which I don't think can be answered except by looking at the data) is whether having the correction for Age and Sex from extra cases without diagnosis makes up for that.
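The difference between complete and partial collinearity can be checked directly from the rank of the design matrix. The sketch below uses invented toy designs with columns (intercept, diagnosis, dose); the `matrix_rank` helper is a hypothetical illustration that works with exact fractions to avoid rounding issues.

```python
from fractions import Fraction


def matrix_rank(rows):
    """Exact rank via Gaussian elimination on Fractions (no rounding error)."""
    m = [[Fraction(v) for v in row] for row in rows]
    rank = 0
    for col in range(len(m[0])):
        # Find a row at or below the current rank with a nonzero entry.
        pivot = next((r for r in range(rank, len(m)) if m[r][col] != 0), None)
        if pivot is None:
            continue
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, len(m)):
            factor = m[r][col] / m[rank][col]
            m[r] = [a - factor * b for a, b in zip(m[r], m[rank])]
        rank += 1
    return rank


# Columns: intercept, diagnosis, dose. Controls always have dose 0.
controls = [(1, 0, 0)] * 3

# Every patient on the same dose: dose = 50 * diagnosis exactly.
single_dose = controls + [(1, 1, 50)] * 3
# Patients on different doses (including an unmedicated patient).
varied_dose = controls + [(1, 1, 0), (1, 1, 50), (1, 1, 100)]

rank_single = matrix_rank(single_dose)
rank_varied = matrix_rank(varied_dose)
print("single dose:", rank_single, "of 3 columns")
print("varied dose:", rank_varied, "of 3 columns")
```

A rank below the number of columns means the coefficients are not uniquely identified; with varied doses the design has full column rank and both coefficients can be estimated.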

Thanks a lot for the useful feedback. What interests me the most is delineating the true effect size of disease, controlling for medication. It turns out I was wrong and I have 66 patients with medication 0. I am looking at 68 ROIs so medication improves the model fit on a good proportion of them. @variable: as can be seen from the histograms I added, the variable is continuous and skewed even without adding the 0s and becomes significantly zero-inflated afterwards. What kind of transformation would be appropriate here? Is it a case for GLM or is the GLM used in case the DV has these issues? — nonsequitur, Jul 02 '21 at 17:33
@nonsequitur with a range of continuous medication levels including patients with medication of 0, you're all set to go with the full model. A flexible fit of a continuous predictor with a restricted cubic spline or similar approach lets the data tell you its relationship with outcome. (GLMs deal with characteristics of different types of outcomes.) If you have 68 ROI outcome values per case, however, you probably should be taking the correlations among those outcomes into account with a true "multivariate" (multiple-outcome) model rather than analyzing them separately. — EdM, Jul 02 '21 at 17:48
