I have a dataset (50000x50) of prospective customer information and I am trying to understand what features are most predictive for determining whether a prospective customer will become a purchasing customer. The dataset includes rows for both prospective customers and actual customers, and includes a y variable column titled 'did.customer.purchase' that is binary (1,0) indicating whether the row is a purchasing customer or not.

I'd like to fit a variety of machine learning models in R (SVM and logistic regression to start) and use R's built-in variable importance tools and plots to understand how well each variable predicts the 'did.customer.purchase' column. However, my dataset is filled with NA values: no row has fewer than 3 NAs, and the dataset as a whole is ~50% NA. I've already scrubbed the data quite a bit (it started as 100000x200), and I believe there's value in keeping each of the 50 remaining columns.

Most R ML packages have an optional na.action parameter with a variety of options, including na.omit, na.pass, etc. If I set the parameter to na.pass, I get an error because of the NAs in my data. If I set it to na.omit, every observation is thrown away and no model is fit.
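For reference, a minimal sketch of what I'm running (`df` is a placeholder for my real data frame; the SVM comes from the e1071 package):

    # Minimal sketch of my current attempt; df is a placeholder for my data,
    # with did.customer.purchase as the binary (0/1) response.
    library(e1071)  # svm()
    library(caret)  # varImp()

    # na.omit drops every row (each row has >= 3 NAs), so nothing is fit:
    fit.glm <- glm(did.customer.purchase ~ ., data = df,
                   family = binomial, na.action = na.omit)

    # na.pass hands the NAs straight to the fitting routine, which errors:
    fit.svm <- svm(factor(did.customer.purchase) ~ ., data = df,
                   na.action = na.pass)

    # The kind of output I'm ultimately after, per model:
    varImp(fit.glm)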

How can I do variable importance with a ton of NAs? Is there a better way?

Thanks!


1 Answer

There are a few approaches you can take here:

  • You can treat NA as an explicit categorical level, so that missingness itself becomes a feature (first sketch below).
  • You can apply a naive Bayes classifier, which estimates each variable's distribution separately and can therefore use, for each variable, only the rows where that variable is observed (second sketch below).
  • If you go with missing-data imputation, you should assess the importance of features after the imputation, on the completed dataset (third sketch below).
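For the first option, base R's addNA() turns missingness into an explicit factor level; for numeric columns you can add a missing-indicator flag instead. A minimal sketch (the column names `region` and `income` are made up):

    # Treat missingness itself as information (hypothetical columns).
    # Categorical predictor: make NA an explicit level.
    df$region <- addNA(df$region)
    levels(df$region)[is.na(levels(df$region))] <- "Missing"

    # Numeric predictor: keep a flag, then fill the value so models can run.
    df$income.missing <- as.integer(is.na(df$income))
    df$income[is.na(df$income)] <- median(df$income, na.rm = TRUE)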
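For the naive Bayes route, the model estimates each P(x_j | y) independently, so each variable's conditional table only needs the rows where that variable is observed. A sketch with e1071's naiveBayes, which accepts na.pass; as far as I recall it skips NAs per variable, but verify this against the package documentation:

    library(e1071)

    df$did.customer.purchase <- factor(df$did.customer.purchase)
    nb <- naiveBayes(did.customer.purchase ~ ., data = df, na.action = na.pass)

    # Each element of nb$tables is a per-class conditional distribution,
    # estimated from the non-missing values of that one variable:
    nb$tables[[1]]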
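For the imputation route, one option (my choice of package, not the only one) is mice; the key point is that importance is computed on the completed data, after imputation:

    library(mice)
    library(caret)

    imp  <- mice(df, m = 5, seed = 123)   # multiple imputation
    done <- complete(imp, 1)              # one completed dataset

    fit <- glm(did.customer.purchase ~ ., data = done, family = binomial)
    varImp(fit)                           # importance assessed post-imputation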
