I have been asked this question in interviews, but I could not figure out the logic behind it.

user148513
  • Can you provide more information? Or maybe phrase the question the way the interviewer asked it. I'm sure some in the community may find this interesting. – Jon Feb 08 '17 at 23:10
  • Maybe of interest: https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels – kjetil b halvorsen Jul 11 '19 at 15:51

1 Answer

I think logistic regression and other linear classification models are widely used on "large feature" data sets. For example, in natural language processing and computer vision, we treat word counts and pixels, respectively, as features, and use linear models to do the classification.

The only difference is that, when using logistic regression on large-feature data sets, we do not emphasize the assumptions or the interpretation of the coefficients as much; we care more about classification accuracy.
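
To make this concrete, below is a minimal sketch (using scikit-learn; the tiny corpus and its labels are invented purely for illustration) of the bag-of-words setup: each distinct word becomes a feature, and the regularized model is judged on held-out accuracy rather than on its assumptions:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Toy corpus; in real NLP problems the vocabulary (and hence the
    # number of features) easily reaches tens or hundreds of thousands.
    texts = [
        "cheap pills buy now", "limited offer buy cheap", "win money now",
        "meeting agenda attached", "lunch tomorrow?", "quarterly report attached",
    ]
    labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

    # Each word count becomes a feature; the result is a sparse matrix,
    # which linear models handle efficiently.
    X = CountVectorizer().fit_transform(texts)

    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.33, random_state=0, stratify=labels
    )

    # L2 regularization (scikit-learn's default) keeps the many
    # coefficients in check; we only report classification accuracy.
    clf = LogisticRegression().fit(X_train, y_train)
    print("held-out accuracy:", clf.score(X_test, y_test))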

Haitao Du
  • 1
    This seems to be more of an opinion-answer. Would you instead be able to provide some evidence/examples where logistic regression does not perform well when the number of features is large? – Jon Feb 08 '17 at 23:12
  • 2
    For context, I work at FICO we have hundreds of features to throw into a model. We choose one over the other because we've measured the performance of one type of model vs another on the size of the data set and by the number of features and types of features. – Jon Feb 08 '17 at 23:13
  • @Jon why opinion-based? I was trying to state the facts: in NLP and vision problems, logistic regression is widely used. – Haitao Du Feb 08 '17 at 23:16
  • Your answer starts with "I think" and does not provide empirical substance on the performance of logistic regression when the number of features is high. – Jon Feb 08 '17 at 23:46
  • 2
    @jon. Are you saying that you chose a type of model to used based on performance of that type of model on a different dataset whose only resemblence to the current problem is the number of datapoints and features? Glms work fine on large datasets and feature speces as long as they are not used carelessly, ie regularization and cross validation. – Matthew Drury Feb 09 '17 at 01:39
  • @MatthewDrury, I can't get into specifics, but a GLM was not ruled out simply because of the size of the data. There are numerous (in-house) metrics used to evaluate the performance of a model in test and production. It just happened that logistic regression was not "high performance" enough for what is needed based on the measures of interest. – Jon Feb 09 '17 at 02:44
  • I understand not getting into specifics (I used to work for *insert large American car insurance company*), and I'm sure that a more non-parametric approach would win in terms of predictive power no matter how skilled the scientist implementing the regression is. I just want to make the point that regression is still useful in many large-data / high-feature-count situations. – Matthew Drury Feb 09 '17 at 03:08
  • @MatthewDrury, I agree that GLMs may still be useful with large, high-dimensional data sets. But the question remains: at what point does logistic regression begin to, for lack of a better word, degrade in its performance? Does that occur at 200 features and 500 GB of data, or at 400 features and 1 TB of data? What models perform better at this scale than logistic regression? This is certainly an interesting question, but I am not about to generate that amount of data just to test the theory. Too much work. – Jon Feb 09 '17 at 16:21
  • 1
    @Jon I think that absolutely depends on the data set, it's more complicated than just being a function of the number of features and amount of data. It depends on the signal to noise ratio in the data itself, and how easy the learning really is for the given problem / dataset. – Matthew Drury Feb 09 '17 at 16:26
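
Regarding the regularization-and-cross-validation point raised in the comments, here is a minimal sketch (again using scikit-learn; the synthetic data simply stands in for a wide, mostly noisy feature matrix) in which cross-validation chooses the L1 regularization strength:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegressionCV

    # Synthetic wide data: many features, few of them informative.
    X, y = make_classification(
        n_samples=1000, n_features=500, n_informative=20, random_state=0
    )

    # Cross-validation picks the regularization strength C; the L1
    # penalty shrinks uninformative coefficients exactly to zero.
    clf = LogisticRegressionCV(
        Cs=10, cv=5, penalty="l1", solver="liblinear", random_state=0
    ).fit(X, y)

    print("chosen C:", clf.C_[0])
    print("nonzero coefficients:", np.count_nonzero(clf.coef_))

Used this way, with the penalty tuned by cross-validation, logistic regression remains usable on large feature spaces; whether a non-parametric model beats it is, as noted above, a property of the particular data set.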