
I have a surgical database with approximately 100,000 observations and 200 features. Each observation corresponds to a separate patient, while the features correspond to preoperative, perioperative, or postoperative variables (e.g. preoperative labs, length of operation, and days until death). As such, some of the features are factors, while others are continuous. The database is not "clean": many of the features contain NAs because that data was not collected. Every patient has an NA in at least one feature, so when I try to select only observations without NAs I am left with 0 observations.

That all being said, my end goal is to see whether any of these preoperative variables might be predictive of mortality. Initially I wanted to use L1 regularization (lasso) via glmnet to perform feature selection, but because of the NAs I can't run it. Are there any alternatives or techniques to bypass this problem? I assume it's more common than not to have a database with NAs, so I wanted to see if you guys could fill me in. Thanks!

oort
  • I asked a similar question about how to deal with missing data when you also need to perform some kind of variable selection. Here it is, in case it is helpful. http://stats.stackexchange.com/questions/46719/multiple-imputation-and-model-fitting – D L Dahly Apr 01 '13 at 18:27

3 Answers


First, you need to figure out whether the missingness is random or not in relation to mortality. Look up missing completely at random (MCAR) and missing at random (MAR); what matters is whether the variables are MAR or not. If they aren't MAR, you are in trouble: the missingness mechanism acts as a confounder, which may inflate your apparent predictive ability with associations that are really just spurious.

Second, if your data are reasonably MAR, you can impute the missing values from the non-missing data using a multitude of methods, anywhere from sampling observed values, to imputing the mean/median for continuous variables, to regression models that regress each variable on its neighbours and so predict what the missing value should be.

Third, if you have an independent dataset, you can check how well your model worked in an unbiased way.
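
For the imputation step, here is a minimal sketch (not part of the original answer) of one common workflow in R: multiple imputation with the mice package, followed by the lasso the OP had in mind. The data frame `surg` and the binary outcome column `mortality` are placeholder names.

```r
## Sketch only: impute with mice, then run the lasso on one completed dataset.
## `surg` and `mortality` are placeholder names, not from the question.
library(mice)
library(glmnet)

imp  <- mice(surg, m = 5, seed = 1)      # default imputation method per column type
comp <- complete(imp, 1)                 # take the first completed dataset

x <- model.matrix(mortality ~ ., data = comp)[, -1]   # dummy-code factors, drop intercept
y <- comp$mortality

fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)  # L1-penalized logistic regression
coef(fit, s = "lambda.min")                             # nonzero coefficients = selected features
```

In a real analysis you would repeat the selection on each of the m completed datasets and keep the features that are selected consistently, rather than relying on a single imputed copy.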

purple51
  • The problem with medical data (I'm having the same issue as the OP, but in the modeling phase) is that not every patient has had every available lab test (for example), which is reasonable because some tests are unrelated to the condition they have. You can't impute these values because that would not make much sense. Moreover, you often **don't know** whether a value is missing because a result was not collected or because the test was irrelevant to the patient's condition. – Corel Apr 04 '18 at 12:37

I'm working on something very similar to what you describe. This is what I did: in my feature space I had two kinds of variables, count variables and measure variables. Example of a count: "Number of times test x was taken in 2013". Example of a measure: "[Average/Max/Min/Std] of test x for 2013".

Now, if I had a missing value in a count variable I just transformed it to 0, because a missing count and a count of zero are the same thing, if you assume the hospital system works correctly. So yes, I assumed that if a patient takes a lab test, it will be recorded and not get lost because someone forgot to enter it into the computer.
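
A minimal sketch of that zero-fill in R (the column name is made up for illustration):

```r
## Missing count treated as zero; `n_test_x_2013` is a placeholder column name.
df$n_test_x_2013[is.na(df$n_test_x_2013)] <- 0
```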

For measure variables you can't transform a missing value to 0, because 0 is itself a value; e.g. max_test_x_2013 = 0 has a meaning on the scale. So what I did was transform each measure variable into a categorical variable, with the categories calculated from the variable's distribution (basically binning the variable), and, importantly, "missing" as a category of its own. In that way I converted the missingness into something you can feed to a model.
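
Here is a rough sketch of that binning in R (my own illustration, not the exact code; the variable name is made up):

```r
## Bin a measure variable on its quartiles and keep "missing" as an explicit level.
q <- quantile(df$avg_test_x_2013, probs = seq(0, 1, 0.25), na.rm = TRUE)
df$avg_test_x_2013_bin <- cut(df$avg_test_x_2013, breaks = unique(q), include.lowest = TRUE)
df$avg_test_x_2013_bin <- addNA(df$avg_test_x_2013_bin)   # NA becomes a real factor level
levels(df$avg_test_x_2013_bin)[is.na(levels(df$avg_test_x_2013_bin))] <- "missing"
```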

In glmnet, each of these categories was then converted to a dummy variable. This results in a higher-resolution feature selection in which particular ranges of a feature get selected rather than the entire feature. For me it was enough that at least some range of the variable was selected to justify including the whole variable in the model later on.
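
Continuing the sketch above, the dummy expansion that glmnet needs can be done with model.matrix (again, names are placeholders):

```r
## Expand the binned factor, including its "missing" level, into dummy columns.
library(glmnet)
x   <- model.matrix(~ avg_test_x_2013_bin, data = df)[, -1]   # one dummy per bin
fit <- cv.glmnet(x, df$mortality, family = "binomial", alpha = 1)
coef(fit, s = "lambda.min")   # shows which individual bins the lasso keeps
```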

On a side note, I now have the same problem when running the model itself (not for feature selection). My goal is really more inference than prediction, so it is important for me to identify the most influential factors/coefficients. Having only part of a variable's range flagged as influential is not going to work, because it is hard to explain ("avg_test_x_2013_[6.0-7.8] really increases the chance of the outcome occurring" is difficult to explain...)

Corel

I don't know much about glmnet and I may be on the wrong track, but do you know about the package glmulti? As far as I know it can handle missing observations, although I don't know exactly how it does it. It acts as a wrapper for lm, glm, lmer, etc., and gives you the most parsimonious candidate models based on your chosen model distribution and your information criterion. You can also select the "genetic algorithm" method, which reduces computing time.

http://cran.r-project.org/web/packages/glmulti/index.html
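
For reference, a typical glmulti call looks something like the sketch below (the formula and column names are placeholders, and I have not verified how it treats rows containing NAs):

```r
## Sketch of a glmulti call; column names are made up.
library(glmulti)
res <- glmulti(mortality ~ preop_lab1 + preop_lab2 + op_length, data = df,
               level = 1,            # main effects only
               method = "g",         # genetic algorithm search
               crit = "aicc",        # information criterion for ranking models
               fitfunction = "glm", family = binomial)
print(res)                           # summary of the best candidate models
```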

AICcmodavg and MuMIn are also good for model selection, but they would be very very slow given the size of your dataset.

Sorry this should be a comment, but I don't have commenting privileges yet!

atrichornis