I have a dataset with more than 70 columns and a binary output column.

So far I have explored the dataset by plotting bar and line graphs of the input variables against the output column.

Certain variables show a clear distinction between the two classes (customer churn or not), but I would like to know whether the input variables have a statistically significant association with the outcome.

How can I do this without using random forest feature importance or other ML feature importance methods?

Is there a method or approach, such as chi-square or ANOVA, that can help me do this?

I don't know whether chi-square or ANOVA is appropriate here, so I thought I would ask for your help.

The Great

1 Answer

Since your output variable is binary, you should investigate logistic regression. This lets you include multiple input variables and estimate the effect of each one while controlling for the others.

If you are only interested in one input variable at a time, then you could do a chi-square test on each input variable.
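The one-variable-at-a-time approach could be sketched as follows, assuming `pandas` and `scipy` are available. The toy data and the column names (`contract`, `gender`, `churn`) are hypothetical. Note that the chi-square test applies to categorical inputs; a continuous input would need to be binned or tested another way.

```python
import pandas as pd
from scipy.stats import chi2_contingency

# Hypothetical toy data: two categorical inputs and a binary outcome.
df = pd.DataFrame({
    "contract": ["monthly"] * 40 + ["yearly"] * 40,
    "gender": ["F", "M"] * 40,
    "churn": [1] * 30 + [0] * 10 + [1] * 10 + [0] * 30,
})

def chi_square_per_column(data, target):
    """Run one chi-square test of independence per input column."""
    rows = []
    for col in data.columns.drop(target):
        # Contingency table of input categories vs. outcome classes.
        table = pd.crosstab(data[col], data[target])
        chi2, p_value, dof, _expected = chi2_contingency(table)
        rows.append({"variable": col, "chi2": chi2, "dof": dof, "p_value": p_value})
    # Smallest p-values (strongest evidence of association) first.
    return pd.DataFrame(rows).sort_values("p_value").reset_index(drop=True)

results = chi_square_per_column(df, "churn")
print(results)
```

Running around 70 separate tests inflates the chance of false positives, so some correction for multiple comparisons (for example Bonferroni) is worth considering when reading the p-values.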

Whether any of these tests will allow you to conclude that the input variables are influencing the output variable depends on your design. Was it experimental or observational? Were people randomly assigned to groups?

Peter Flom
  • Thanks for your time. Upvoted. This is observational: the data points come from a hospital EHR system, so I guess there is no random assignment here. – The Great Dec 08 '19 at 12:30
  • Hi, when you say a chi-square test on each variable, do you mean applying chi-square to test the relationship between the variable I am interested in and the output variable? So two variables in total for each chi-square test, right? – The Great Dec 11 '19 at 10:24
  • It would be 70 chi-squares, if I read your question correctly. – Peter Flom Dec 11 '19 at 11:57