A friend of mine was asked the following question in an interview:

There are 35000 independent variables and 7 million observations of those variables. The response variable is binary, with a success rate of 1%. What would your approach be?

Artiga
  • Just wanted to clarify: are you asking how to best model the data? – Kontorus Jul 12 '16 at 15:41
  • Yes, plus the reasoning behind it. – Artiga Jul 12 '16 at 15:43
  • Not an expert so I won't attempt a formal answer. Overall idea would be to reduce the number of independent variables, then model with logistic regression. – Kontorus Jul 12 '16 at 15:51
  • The point of the interview question seems to be how to deal with computationally complex models; maybe you could add tags dealing with that? – Kontorus Jul 12 '16 at 15:53
  • That was the only information provided by the interviewer. – Artiga Jul 12 '16 at 16:10
  • If that was the only information without any scientific context I hope your friend has found a job in a company with a better grasp of statistics. – mdewey Jul 12 '16 at 17:00
  • The interviewers probably want to see how you approach the problem, which questions you ask them, and so on; there is very little to say without any context. A start: a 1% event rate means 70000 events, or two events per variable. The usual rules of thumb say 10-20 events per variable are needed (see https://stats.stackexchange.com/questions/167551/logistic-regression-10-events-per-predictor-rule/167563#167563 ), so some options are using principal components as regressors, or the lasso. – kjetil b halvorsen Sep 07 '17 at 17:15
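
The arithmetic in the last comment can be checked with a short sketch. The figures come from the question; the variable names are made up for illustration, and the 10 and 20 thresholds are the rule-of-thumb values mentioned in the comment, not hard limits.

```python
# Figures taken from the question.
n_obs = 7_000_000      # observations
n_vars = 35_000        # independent variables
event_rate = 0.01      # 1% success rate

# Expected number of events (successes); round() guards against float noise.
n_events = round(n_obs * event_rate)

# Events per variable (EPV) if every predictor is kept.
epv = n_events / n_vars

# Rough number of predictors the rule of thumb would support
# at 10 and at 20 events per variable.
max_vars_at_10_epv = n_events // 10
max_vars_at_20_epv = n_events // 20

print(f"events: {n_events}")
print(f"events per variable: {epv}")
print(f"predictors supported at 10 EPV: {max_vars_at_10_epv}")
print(f"predictors supported at 20 EPV: {max_vars_at_20_epv}")
```

Under that rule of thumb, the data would support roughly 3500 to 7000 predictors rather than 35000, which is why the comments point toward dimension reduction (principal components) or a penalized fit (the lasso) before or instead of a plain logistic regression.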

0 Answers