I have a binary classification model. The target variable is my_test_data$target_variable
and takes the values 'y' or 'n'; my_test_data$target_variable_numeric is the same variable
converted to numeric, with 'y' coded as 1 and 'n' as 0.
I plotted the ROC curve and computed the AUC in two ways.
Case 1: I predicted with type = "prob" and compared the resulting class probabilities
against the true 0/1 values in my_test_data$target_variable_numeric. I got an AUC of 91
(reported as a percentage because of percent = TRUE).
library(pROC)

pred = predict(my_model, my_test_data, type = "prob")   # per-class probabilities
roc_obj = plot.roc(my_test_data$target_variable_numeric, pred$y,
                   main = "ROC curve",
                   percent = TRUE,
                   ci = TRUE,
                   print.auc = TRUE)
Case 2: With the same model (no re-training), I let the model predict the categorical
target ('y' or 'n'), converted those predictions to 1/0, and compared them with the true
0/1 values in my_test_data$target_variable_numeric. This time the AUC was only 75.
pred = predict(my_model, my_test_data)                  # hard class predictions ('y'/'n')
y = as.data.frame(pred)
colnames(y) = 'my_prediction_categorical'
my_test_data = cbind(my_test_data, y)
my_test_data$my_prediction_categorical =
  ifelse(my_test_data$my_prediction_categorical == 'y', 1, 0)
roc_obj = plot.roc(my_test_data$target_variable_numeric,
                   my_test_data$my_prediction_categorical,
                   main = "ROC curve",
                   percent = TRUE,
                   ci = TRUE,
                   print.auc = TRUE)
Why is there such a large difference in AUC between the two approaches, even though I used
the same model (without re-training) and the same test data? And is it possible to get an
AUC close to that of Case 1 after the final categorical prediction?
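To show the kind of gap I mean outside my own model, here is a minimal toy example (in Python rather than R, with made-up data and a hand-rolled AUC helper; the numbers are illustrative only, not from my model). It computes AUC once from continuous scores and once from the same scores thresholded to hard 0/1 labels, using the rank definition of AUC (probability that a random positive outranks a random negative, ties counting one half):

```python
def auc(y_true, scores):
    """AUC via the Mann-Whitney rank definition: fraction of
    (positive, negative) pairs where the positive scores higher;
    ties count as 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: true labels and predicted probabilities of the positive class.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
probs  = [0.10, 0.30, 0.45, 0.60, 0.55, 0.70, 0.80, 0.90]

auc_prob = auc(y_true, probs)                     # full ranking of probabilities
hard = [1 if p >= 0.5 else 0 for p in probs]      # collapse to 0/1 at a 0.5 cutoff
auc_hard = auc(y_true, hard)                      # only two distinct score values left

print(auc_prob, auc_hard)  # 0.9375 0.875
```

Thresholding collapses the whole ranking to two values, so the AUC from hard predictions is lower here even though both are computed from the same underlying scores.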