This is a question for those of you working in data scientist roles within your organizations. How many variables is it acceptable to use in models that are going to be deployed in production for marketing or other purposes?

The reason I ask: we have four analysts, two of whom have been with the company for 10 years, and two of us who recently graduated with our Master's in Predictive Analytics. Our senior analysts primarily build linear and logistic regression models and think that using the fewest variables possible (regardless of technique) is always best, usually aiming for around 10-15 variables.

We newer analysts work primarily with random forest and XGBoost and are comfortable using 100-800 variables in our models. I haven't encountered anything saying that using this many variables in random forest or XGBoost should cause any concern, but we cannot come to an agreement. Even when holdout results are better with 100+ variables, we are still encouraged to use fewer.

Can anyone provide any information on this topic that might help shape our decision-making process?

Thank you,

Nate
