
I am building a logistic regression model to predict binary sport events and have found some variables that may have an impact. Some features, however, do not improve the model's accuracy score, but they do not decrease it either. My question is: should I keep these features or throw them out?

Mike
  • If they're really useless, why waste computational effort on them? And do you only care about accuracy? I hope you don't have an imbalanced dataset. – gunes Sep 19 '18 at 18:42
  • @gunes It's a balanced dataset, but a small sample size. I mean, they're not completely useless; they make sense logically speaking. For example, I have a 'rest time difference' variable, and it correlates with fatigue, which makes sense to keep, right? It doesn't have much impact on the accuracy score, but what if I have many small, individually insignificant, heterogeneous features like that and combine them all together — would that make a difference? – Mike Sep 19 '18 at 18:54
  • It depends on what use you expect to put the model to. For example, if the model is being used for interpolating results, it probably does not matter, that is, if the over-fitting is not problematic. However, if the model is being used for extrapolation it may matter, and whether it matters or not depends on how physical the model is. For example, if the model is quadratic but the true answer is linear, then the quadratic term would be inappropriate. What your model should be can only be determined by testing modelling assumptions for their desired properties. – Carl Sep 19 '18 at 21:34

1 Answer


Particularly with logistic regression, it's good to include any features that might reasonably contribute to the outcome, even if they aren't correlated with the other features. As this page shows, omitting potentially informative variables from a logistic regression can introduce substantial bias, in situations where omitting them would not bias a standard linear regression.

Remember that lack of statistical significance doesn't necessarily mean that a variable is unrelated to the outcome; it's possible that there is a real relation but you just didn't have enough power in your data and analysis to demonstrate it yet.

In general, for predictive models it's best not to exclude features; just be careful that you're not overfitting the model, and use a penalized method like ridge regression or LASSO if you are in danger of overfitting. The rule of thumb for logistic regression is that if you have more than about 15-20 cases of the lower-prevalence class per feature in the model, you probably won't have problems with overfitting.
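As a minimal sketch of the penalized approach (using scikit-learn on simulated data, with arbitrary dimensions chosen to mimic the "small sample, many weak features" setting), ridge-penalized logistic regression keeps every feature in the model but shrinks the coefficients, with the penalty strength chosen by cross-validation:

```python
import numpy as np
from sklearn.linear_model import LogisticRegressionCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n, p = 200, 30  # small sample relative to the number of features

# Only the first few features carry signal; the rest are noise
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = 1.0
y = rng.binomial(1, 1 / (1 + np.exp(-X @ beta)))

# Ridge (L2) keeps all features but shrinks coefficients to control overfitting;
# swap penalty="l1", solver="liblinear" for LASSO-style sparsity instead
model = make_pipeline(
    StandardScaler(),
    LogisticRegressionCV(Cs=10, cv=5, penalty="l2", max_iter=5000),
)
model.fit(X, y)
```

Standardizing before penalizing matters: the penalty treats all coefficients on the same scale, so unscaled features would be shrunk unevenly.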

EdM