I have worked on many classification problems. One common metric of classifier performance is the AUROC/AUC, the area under the curve traced out by the TPR and FPR values at different cutoffs of the forecasted probabilities. A very easy way to understand it is here:
What does AUC stand for and what is it?
My question concerns the impact of unbalanced data on the AUC value. If 90% of my data are positive instances, I think there is a better chance of a high TPR (the proportion of positive data points that are correctly classified as positive), and therefore a higher AUC value, than with a balanced (50-50) population.
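One way to probe this intuition in isolation (a quick sketch of my own, not from any reference; the prevalence values and the use of purely random scores are arbitrary choices) is to hold the classifier's skill fixed at zero and vary only the class balance:

```r
library(pROC)

set.seed(1)
n <- 10000
scores <- runif(n)  # uninformative "predictions": the model has no skill

labels_balanced   <- rbinom(n, 1, 0.5)  # ~50-50 classes
labels_imbalanced <- rbinom(n, 1, 0.9)  # ~90% positives

auc(labels_balanced, scores)    # random scores should hover near 0.5
auc(labels_imbalanced, scores)  # does prevalence alone push this up?
```

If imbalance by itself inflated AUC, the second call should come out much higher than the first, even though the scores carry no information in either case.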
I've also run a test on the Titanic data to validate my understanding of AUC:
library(pROC)

# Titanic is a 4-way contingency table; data.frame() flattens it into
# columns Class, Sex, Age, Survived, Freq
d <- data.frame(Titanic)
d$Survived <- ifelse(d$Survived == "No", 0, 1)

# logistic regression on the roughly balanced outcome
m <- glm(Survived ~ Class + Sex + Age + Freq, data = d,
         family = binomial(link = "logit"))
fitted.results <- predict(m, newdata = subset(d, select = c(1, 2, 3, 5)),
                          type = "response")
auc(d$Survived, fitted.results)

# unbalance the data by setting Survived to 1 wherever Freq is even
# (just an arbitrary condition)
d$Survived_unbalanced <- ifelse(d$Freq %% 2 == 0, 1, d$Survived)
m_ub <- glm(Survived_unbalanced ~ Class + Sex + Age + Freq, data = d,
            family = binomial(link = "logit"))
fitted.results_ub <- predict(m_ub, newdata = subset(d, select = c(1, 2, 3, 5)),
                             type = "response")
auc(d$Survived_unbalanced, fitted.results_ub)
With the 50-50 classes the AUC was 0.4297, and with the unbalanced data (80-90% 1s) I got 0.8974. Is my argument correct, or does unbalanced data have nothing to do with the AUC value?