I have a dataset (50000x50) of prospective customer information and I am trying to understand what features are most predictive for determining whether a prospective customer will become a purchasing customer. The dataset includes rows for both prospective customers and actual customers, and includes a y variable column titled 'did.customer.purchase' that is binary (1,0) indicating whether the row is a purchasing customer or not.

I'd like to fit a variety of machine learning models in R (SVM and logistic regression to start) and use R's built-in variable importance tools and plots to understand how well each variable predicts the 'did.customer.purchase' column. However, my dataset is filled with NA values: no row has fewer than 3 NAs, and the dataset as a whole is ~50% NA. I've already scrubbed the data quite a bit (it started as 100000x200), and I believe there's value in keeping each of the 50 remaining columns.

Most R ML packages have an optional na.action parameter with a variety of options, including na.omit, na.pass, etc. If I set the parameter to na.pass, I get an error because of the NAs in my data. If I set it to na.omit, every observation is thrown away and no model is fit.
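For reference, a minimal sketch of what I'm running (`df` is a placeholder for my real data frame; the SVM comes from the e1071 package):

    # Minimal sketch of my current attempt; df is a placeholder for my data,
    # with did.customer.purchase as the binary (0/1) response.
    library(e1071)  # svm()
    library(caret)  # varImp()

    # na.omit drops every row (each row has >= 3 NAs), so nothing is fit:
    fit.glm <- glm(did.customer.purchase ~ ., data = df,
                   family = binomial, na.action = na.omit)

    # na.pass hands the NAs straight to the fitting routine, which errors:
    fit.svm <- svm(factor(did.customer.purchase) ~ ., data = df,
                   na.action = na.pass)

    # The kind of output I'm ultimately after, per model:
    varImp(fit.glm)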

How can I do variable importance with a ton of NAs? Is there a better way?

Thanks!


1 Answer

There are a few approaches you can take here:

  • You can treat NA as an explicit categorical level, so that missingness itself becomes a feature (first sketch below).
  • You can apply a naive Bayes classifier, which estimates each variable's distribution separately and can therefore use, for each variable, only the rows where that variable is observed (second sketch below).
  • If you go with missing-data imputation, you should assess the importance of features after the imputation, on the completed dataset (third sketch below).
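For the first option, base R's addNA() turns missingness into an explicit factor level; for numeric columns you can add a missing-indicator flag instead. A minimal sketch (the column names `region` and `income` are made up):

    # Treat missingness itself as information (hypothetical columns).
    # Categorical predictor: make NA an explicit level.
    df$region <- addNA(df$region)
    levels(df$region)[is.na(levels(df$region))] <- "Missing"

    # Numeric predictor: keep a flag, then fill the value so models can run.
    df$income.missing <- as.integer(is.na(df$income))
    df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)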
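For the naive Bayes route, the model estimates each P(x_j | y) independently, so each variable's conditional table only needs the rows where that variable is observed. A sketch with e1071's naiveBayes, which accepts na.pass; as far as I recall it skips NAs per variable, but verify this against the package documentation:

    library(e1071)

    df$did.customer.purchase <- factor(df$did.customer.purchase)
    nb <- naiveBayes(did.customer.purchase ~ ., data = df, na.action = na.pass)

    # Each element of nb$tables is a per-class conditional distribution,
    # estimated from the non-missing values of that one variable:
    nb$tables[[1]]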
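For the imputation route, one option (my choice of package, not the only one) is mice; the key point is that importance is computed on the completed data, after imputation:

    library(mice)
    library(caret)

    imp  <- mice(df, m = 5, seed = 123)   # multiple imputation
    done <- complete(imp, 1)              # one completed dataset

    fit <- glm(did.customer.purchase ~ ., data = done, family = binomial)
    varImp(fit)                           # importance assessed post-imputation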
