Possible Duplicate:
Supervised learning with “rare” events, when rarity is due to the large number of counter-factual events

I am trying to predict diabetes from the BRFSS dataset using a supervised classification model. However, the target variable (whether or not a respondent has diabetes) is skewed: 90% of the records are non-diabetic and only 10% are diabetic. How do I handle this skewness in the target variable?

user3897
  • Why do you perceive "skewness" as a problem that needs correction? – whuber Apr 20 '11 at 18:50
  • This question sounds rather similar (http://stats.stackexchange.com/questions/9398/supervised-learning-with-rare-events-when-rarity-is-due-to-the-large-number-of), and Dikran gave a good answer to it. – mlwida Apr 21 '11 at 05:57

1 Answer

When the classes are imbalanced, plain accuracy is misleading (always predicting "non-diabetic" would already be 90% accurate on your data), so you may:

  • use error metrics that focus on the rare class, such as precision, recall and F-score
  • trade off precision against recall according to your goal:
    • want to predict diabetes with confidence => adjust for higher precision, lower recall
    • want to avoid missing too many diabetes cases => adjust for lower precision, higher recall
    • (for example, in logistic regression, by adjusting the decision threshold on the predicted probability)
  • use the F-score (the harmonic mean of precision and recall) to find a threshold that balances the two
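
As a rough illustration of the threshold adjustment described above, here is a minimal pure-Python sketch. The labels and scores are made-up toy values (not BRFSS data), and the helper names `precision_recall_f1` and `best_f1_threshold` are hypothetical; in practice the scores would be the predicted probabilities from whatever model you fit.

```python
def precision_recall_f1(y_true, scores, threshold):
    """Precision, recall and F1 for the positive class at a given threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and y == 1:
            tp += 1          # diabetic, flagged as diabetic
        elif pred == 1 and y == 0:
            fp += 1          # non-diabetic, flagged as diabetic
        elif pred == 0 and y == 1:
            fn += 1          # diabetic, missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, scores, thresholds):
    """Return the candidate threshold with the highest F1."""
    return max(thresholds, key=lambda t: precision_recall_f1(y_true, scores, t)[2])


# Toy data (assumption): 18 non-diabetic and 2 diabetic records, i.e. 90% / 10%.
y_true = [0] * 18 + [1] * 2
scores = [0.01, 0.02, 0.03, 0.05, 0.08, 0.10, 0.12, 0.15, 0.18,
          0.20, 0.22, 0.25, 0.28, 0.30, 0.35, 0.40, 0.45, 0.58,
          0.55, 0.90]

candidates = [0.3, 0.5, 0.8]
for t in candidates:
    p, r, f = precision_recall_f1(y_true, scores, t)
    print(f"threshold={t:.1f}  precision={p:.2f}  recall={r:.2f}  F1={f:.2f}")

print("best F1 threshold:", best_f1_threshold(y_true, scores, candidates))
```

Raising the threshold moves you toward higher precision and lower recall, and lowering it does the opposite. Ideally the threshold is chosen on a validation set rather than on the data used to fit the model.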
clyfe