Will the results of the feature selection be biased if I perform the feature selection before imputing missing data?
I have a large data set of 20000 samples and 130 variables. The data set consists of binary, numeric, and ordinal variables. The outcome variable is binary.
I want to do two things:
1) Feature selection to determine the most important variables
2) Build a predictive model with SVM, Random Forest, and Logistic Regression
The complete-case data set contains 70% of the original data (i.e., if I keep only samples with no missing values, I am left with 70% of the samples).
I am using MICE in R to impute the missing data. Following some guidelines I found in this paper, I plan to impute 30 data sets. (I approximate the Fraction of Missing Information by the percentage of incomplete cases, which is 30%; this is where the 30 comes from.)
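To make the arithmetic behind the 30 explicit, here is a minimal sketch of the rule of thumb as I read it (number of imputations roughly equal to the percentage of incomplete cases). It is written in plain Python rather than R purely for illustration; the function names and the toy rows are hypothetical, and `None` stands in for a missing value.

```python
def count_incomplete(rows):
    """Count rows that contain at least one missing (None) value."""
    return sum(1 for row in rows if any(value is None for value in row))


def imputations_rule_of_thumb(n_incomplete, n_total):
    """Number of imputations ~ percentage of incomplete cases, rounded up.

    Assumption: this mirrors the guideline described in the post;
    it is not a formula taken from the mice package itself.
    """
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    # Integer ceiling of 100 * n_incomplete / n_total, with at least 1 imputation.
    return max(1, (100 * n_incomplete + n_total - 1) // n_total)


# Hypothetical toy data: 10 rows, 3 of which have a missing value.
toy_rows = [
    (1.0, 0, 3), (2.5, 1, None), (0.3, 0, 1), (None, 1, 2), (4.1, 0, 2),
    (3.3, 1, 1), (0.9, None, 3), (2.2, 0, 1), (1.7, 1, 2), (5.0, 0, 3),
]

n_incomplete = count_incomplete(toy_rows)
m = imputations_rule_of_thumb(n_incomplete, len(toy_rows))
print(f"{n_incomplete} of {len(toy_rows)} rows incomplete -> m = {m}")
```

With my actual numbers (about 6000 incomplete rows out of 20000) the same function gives 30, which is the value I would pass as the `m` argument to `mice()`.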
Imputing all 130 variables 30 times is computationally intensive and will take too long. If I first select only the top 10 predictors and impute this smaller data set, I will be able to impute my 30 data sets as desired in a reasonable amount of time.
I cannot assume the data are Missing Completely at Random (MCAR). Most variables appear to be Missing at Random (MAR), meaning the missingness can be modeled from the observed data.
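To illustrate why the MCAR/MAR distinction worries me, here is a toy simulation, again in plain Python and purely illustrative. Everything in it is an assumption made up for the sketch: a single predictor `x`, a binary outcome `y`, and missingness probabilities of 0.5 and 0.1 chosen so that roughly 70% of rows are complete. It says nothing about my real data; it only shows that under MAR the complete cases can stop being representative of the full sample.

```python
import random


def simulate(n=20000, seed=0):
    """Generate (y, x_observed) pairs where x is MAR given the outcome y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        y = 1 if rng.random() < 0.5 else 0
        x = rng.gauss(0.0, 1.0)
        # MAR: the chance that x is missing depends only on the observed y,
        # not on the value of x itself.
        p_missing = 0.5 if y == 1 else 0.1
        x_observed = None if rng.random() < p_missing else x
        rows.append((y, x_observed))
    return rows


def positive_rate(rows):
    """Share of rows with y == 1."""
    return sum(y for y, _ in rows) / len(rows)


rows = simulate()
complete_cases = [row for row in rows if row[1] is not None]

complete_share = len(complete_cases) / len(rows)
full_rate = positive_rate(rows)
complete_rate = positive_rate(complete_cases)

print(f"share of complete cases:     {complete_share:.3f}")
print(f"P(y = 1) in all rows:        {full_rate:.3f}")
print(f"P(y = 1) in complete cases:  {complete_rate:.3f}")
```

In this setup the complete cases under-represent `y = 1`, so anything computed on them alone (including a feature ranking) is computed on a shifted sample.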
Will the results of the feature selection be biased because of the missingness in the data?