How to handle skewed binary target variables?

Question

Possible Duplicate:
Supervised learning with “rare” events, when rarity is due to the large number of counter-factual events

I am trying to predict diabetes from the BRFSS dataset with a supervised classification model. However, the target variable (whether or not the person has diabetes) is skewed: 90% of the records are non-diabetic and only 10% are diabetic. How do I handle this skewness in the target variable?

Why do you perceive "skewness" as a problem that needs correction? — whuber, Apr 20 '11 at 18:50
This question sounds rather similar (http://stats.stackexchange.com/questions/9398/supervised-learning-with-rare-events-when-rarity-is-due-to-the-large-number-of) and Dikran gave a good answer to it. — mlwida, Apr 21 '11 at 05:57

clyfe · Answer 1 · 2011-12-20T20:46:54.203

When your data is skewed you may:

- use error metrics suited to skewed classes, such as precision, recall and F-score
- trade off precision against recall according to your goal:
  - want to predict diabetes with confidence => adjust for higher precision, lower recall
  - want to avoid missing too many diabetes cases => adjust for lower precision, higher recall
  - (for example, in logistic regression, by adjusting the decision threshold)
- use the F-score to find a good balance between precision and recall, keeping both as high as possible
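The points above can be sketched in a few lines of plain Python. This is only an illustration, not a recipe from the answer: the labels and predicted probabilities are made up, and the helper names `precision_recall_f1` and `best_f1_threshold` are hypothetical. In practice the probabilities would come from a fitted classifier such as a logistic regression.

```python
def precision_recall_f1(y_true, y_prob, threshold):
    """Precision, recall and F1 for the positive class at a given threshold."""
    # Predict positive (1) when the estimated probability reaches the threshold.
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def best_f1_threshold(y_true, y_prob, thresholds):
    """Return the candidate threshold with the highest F1 score."""
    return max(thresholds, key=lambda t: precision_recall_f1(y_true, y_prob, t)[2])


# Made-up example: 2 positives among 10 records (assumed data, not BRFSS).
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_prob = [0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.60, 0.45, 0.90]

for threshold in (0.3, 0.4, 0.5, 0.7):
    p, r, f1 = precision_recall_f1(y_true, y_prob, threshold)
    print(f"threshold={threshold:.1f} precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

print("best F1 threshold:", best_f1_threshold(y_true, y_prob, [0.3, 0.4, 0.5, 0.7]))
```

On this toy data, a higher threshold gives higher precision and lower recall, a lower threshold does the opposite, and the F1 score picks a compromise between the two.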
