
Further to my prior question on multivariable adjustment in regression models using covariates that are available only for some cases, I have researched in some detail the main methods for limited dependent variables, including the Heckman correction and tobit models. However, I fear that they do not apply to my problem, which has more to do with limited independent variables.

In particular, below is an example of the dataset and a possible analysis in R (disregard the overfitting; this is only an illustrative example, and my actual dataset has at least 10,000 cases):

dep <- c(8, 9, 21, -3, 4, 6, 9, 10, 8, 9, 11, 39, 91, 51, 38, 28, 21)
cov1 <- c(68, 58, 42, 19, 39, 49, 29, 38, 25, 22, 19, 36, 39, 90, 105, 73, 25)
cov2 <- c(0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0)
cov3 <- c(0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1)
cov4 <- c(NA, NA, NA, NA, NA, NA, 56, 33, 45, 44, 56, 49, 36, 39, 40, 41, 59)
cov5 <- c(NA, NA, NA, NA, NA, NA, 1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0)
mydata <- data.frame(dep, cov1, cov2, cov3, cov4, cov5)
mydata

reg1 <- lm(dep ~ cov1 + cov2, data = mydata, na.action = na.omit)
anova(reg1)
summary(reg1)

reg2 <- lm(dep ~ cov1 + cov2 + cov3 + cov4 + cov5, data = mydata, na.action = na.omit)
anova(reg2)
summary(reg2)

What should I do to best adjust for the covariates cov1, cov2, cov3, cov4, and cov5, with dep as the dependent variable, given that cov4 and cov5 are available only for patients with cov3 = 1?

Should I discard all cases with cov3 = 0? Should I instead conduct two separate analyses and then pool the regression coefficients according to their standard errors? Or is there a more reasonable approach?
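
To make the second option concrete, this is roughly what I have in mind on the toy data above (a sketch only; I am not at all sure the subsequent pooling is statistically sound, hence the question):

# the two strata analysed separately: full model where cov4/cov5 are observed,
# reduced model where they are not
regFull    <- lm(dep ~ cov1 + cov2 + cov4 + cov5, data = subset(mydata, cov3 == 1))
regReduced <- lm(dep ~ cov1 + cov2,               data = subset(mydata, cov3 == 0))

# the shared coefficients (cov1, cov2) could then be pooled,
# e.g. weighting by the inverse of their squared standard errors
summary(regFull)
summary(regReduced)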

Unfortunately I did not find anything meaningful searching Google, Google Scholar, or PubMed:

https://www.google.it/search?q=limited+independent+variable&nirf=limited+dependent+variable

https://scholar.google.it/scholar?hl=en&q=limited+independent+variable

http://www.ncbi.nlm.nih.gov/pubmed/?term=limited+independent+variable*

To further clarify what is at stake, this is my real problem: I want to create a clinical prediction score (to predict prognosis and future quality of life) for patients undergoing myocardial perfusion imaging (a non-invasive cardiac test used in subjects with, or at risk for, coronary artery disease). The imaging test immediately follows an exercise stress test in fit patients, and a pharmacologic stress test in those who are not fit. The latter test is inferior to the former and does not provide several important prognostic features (e.g. maximum heart rate or workload), so I must include exercise test variables in the multivariable model. But if I do so, I lose more than 1,000 patients who only underwent a pharmacologic stress test.

Giuseppe Biondi-Zoccai
  • The $x_i$ can have any features, except that they cannot be constant or linear combinations of each other. If there is not much variation in $x_i$, the standard error will be larger than otherwise. In itself this is not a problem. – Repmat Apr 01 '16 at 09:19
  • I am not sure I follow you. If I use all the covariates in the model I lose several cases (those with NA). If I only use cov1, cov2, and cov3, I don't use the information in cov4 and cov5... – Giuseppe Biondi-Zoccai Apr 01 '16 at 09:28
  • You can make some arbitrary assumptions and do data imputation. But for the sample data posted I don't see the need; you do not lose an entire variable. But yeah, sure, you will lose data... – Repmat Apr 01 '16 at 09:35
  • The question is not far-fetched. Basically, I want to create a clinical prediction score for patients undergoing myocardial perfusion imaging. The imaging test follows an exercise stress test in fit patients, and a pharmacologic stress test in those who are not fit. The latter test is worse than the former, and does not provide several important prognostic features (e.g. maximum heart rate or workload), so I must include exercise test variables in the multivariable model. But if I do so, I lose more than 1,000 patients who only underwent a pharmacologic stress test. I added this to the question as well. – Giuseppe Biondi-Zoccai Apr 01 '16 at 09:42
  • I saw a lot of hits on Google Scholar for +"model selection" +"missing covariate", as well as +"model building" +"missing covariate". One possibility - if it is plausible that the covariates are simply missing at random - would be to impute them using multiple imputation, do whatever model building you do, and combine the results across imputations (a minimal sketch of that workflow is given just after these comments). I believe there are also models that implicitly impute them. However, if covariates will also be missing in practice when people try to use the prediction score, that would be an even harder problem. – Björn Apr 09 '16 at 05:33
  • (+1) Thanks @Björn. My problem is that eventually I might want to generate a clinical risk prediction score for those completing the exercise test (thus including all covariates and using them to predict risk), but also for those not doing the exercise test (so including only some variables). Thus, my problem is two-faceted: using a single model encompassing all patients and all variables (despite several variables being missing in some patients, not at random) to adjust for confounders; then creating separate models for risk prediction in the two main strata. – Giuseppe Biondi-Zoccai Apr 09 '16 at 06:24
  • @Björn, based on the OP's comment [here](http://stats.stackexchange.com/questions/200460/multiple-imputation-for-predictive-analysis-using-mice-package-in-r#comment380321_200460), the covariates are not missing at random, hence my original suggestion for using some kind of Heckman correction. Obviously, including the people who did not take the exercise stress test is clinically interesting and important here, because they are already different from those who took the exercise test. Of course, the OP can drop the 1,000 or so patients missing the exercise test, but the results would have limited generalizability. – Marquis de Carabas Apr 09 '16 at 06:45
  • @marquisdecarabas: I tried to look into Heckman type corrections, but I found they focus on limited dependent variables, and not limited independent variables (but possibly I am mistaken...): http://stats.stackexchange.com/questions/172508/two-stage-models-difference-between-heckman-models-to-deal-with-sample-selecti – Giuseppe Biondi-Zoccai Apr 09 '16 at 07:05
  • @GiuseppeBiondi-Zoccai I admit my ignorance of limited independent variables...is that just like limited dependent variable but for independent variables? :) That is, the independent variable is categorical, count, etc? Can you explain why the limited independent variable might be a problem? The original Heckman model was actually used for a continuous outcome, but nowadays, there are several flavors, including [probit](https://www.google.com/search?q=heckman+probit) and even [Poisson](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=1293108). – Marquis de Carabas Apr 09 '16 at 07:36
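
For illustration, a minimal sketch of the impute-then-pool workflow @Björn describes above, assuming the mice package and treating cov4 and cov5 as missing at random (which, as the later comments point out, may well not hold in this application):

library(mice)  # multiple imputation by chained equations

# create m = 5 completed datasets (m and the default imputation methods are arbitrary choices)
imp <- mice(mydata, m = 5, seed = 1, printFlag = FALSE)

# fit the full model in each completed dataset and pool the estimates with Rubin's rules
fit <- with(imp, lm(dep ~ cov1 + cov2 + cov3 + cov4 + cov5))
summary(pool(fit))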

1 Answer


I think in this case it is more appropriate to use Classification and Regression Tree (CART) models.

I found the ctree() function (from the party / partykit packages) very helpful; it fits conditional inference trees, an alternative to the CART implementation in rpart. Logistic models (or imputation) in this case do not make much sense to me.
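
A minimal sketch of what this suggestion looks like, assuming partykit::ctree() and the toy data from the question (all tree settings are left at their defaults; how the partially observed cov4 and cov5 are handled depends on ctree_control()):

library(partykit)  # ctree() fits conditional inference trees

# regression tree for dep; treatment of the covariates that are missing for cov3 == 0
# depends on ctree_control() settings (e.g. maxsurrogate for surrogate splits)
tree <- ctree(dep ~ cov1 + cov2 + cov3 + cov4 + cov5, data = mydata)
plot(tree)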

Silverfish
  • Note that recursive partitioning requires enormous sample sizes, perhaps 10x larger than regression if single trees are used. – Frank Harrell Apr 09 '16 at 12:36