I am using logistic regression to benchmark the performance of students across different years. I set up the scenario below:
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
benchmark.data <- mydata[1:300,] # students from years 1990-1995 as benchmark
compare.data <- mydata[301:400,] # students from year 1996
# logistic regression model created using benchmark student result
temp.glm <- glm(admit~gre+gpa+rank, data=benchmark.data, family="binomial")
# using the regression model to predict how students in 1996 perform
compare.data[,"predict"] <- predict(temp.glm, newdata=compare.data, type="response")
# making a threshold such that if the predicted chance of admit > 0.5,
# then it is assumed that the student will get admitted
compare.data[,"predict_admit"] <- ifelse(compare.data[,"predict"]>0.5, 1, 0)
table(compare.data[,c("admit", "predict_admit")])
#      predict_admit
# admit  0  1
#     0 59  6
#     1 26  9
From the table, 15 students (6 + 9) were predicted to be admitted, while 35 students (26 + 9) were actually admitted, so the observed/expected ratio is 35/15 = 2.33. Since this is larger than 1, I would say that students in 1996 are performing better than the benchmark.
- Can I draw my conclusion using the method mentioned above?
- Also, how should I set the threshold? Or should I instead compute
sum(compare.data[,"predict"])
and treat it as the expected value?
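To make the two candidate definitions of "expected" concrete, here is a minimal sketch that computes the observed/expected ratio both ways: once from the 0.5 cutoff used above, and once from the sum of predicted probabilities. It refits the same model so that it runs on its own; the variable names `p`, `observed`, `expected_cut`, `expected_prob` and `oe` are introduced here for illustration only.

```r
# Data and model as in the question (downloading the CSV needs network access).
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
benchmark.data <- mydata[1:300, ]
compare.data <- mydata[301:400, ]
temp.glm <- glm(admit ~ gre + gpa + rank, data = benchmark.data, family = "binomial")

# Predicted admission probability for each 1996 student under the benchmark model.
p <- predict(temp.glm, newdata = compare.data, type = "response")

observed <- sum(compare.data$admit)  # actual number of admitted students
expected_cut <- sum(p > 0.5)         # expected count using the 0.5 cutoff
expected_prob <- sum(p)              # expected count as the sum of probabilities

# Observed/expected ratio under each definition of "expected".
oe <- c(threshold_0.5 = observed / expected_cut,
        sum_of_probabilities = observed / expected_prob)
print(oe)
```

The first ratio reproduces the 35/15 figure from the table; the second does not depend on any choice of threshold, which is why I am wondering whether it is the more appropriate one.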
Update 1
I tried using a ROC curve to determine the threshold:
library(ROCR)
benchmark.data[,"predict"] <- predict(temp.glm, newdata=benchmark.data, type="response")
preds <- prediction(benchmark.data[,"predict"], as.numeric(benchmark.data[,"admit"]))
plot(performance(preds,"tpr","fpr"), print.cutoffs.at=seq(0,1,by=0.05))
The chart suggests that a threshold of about 0.35 seems to give the best trade-off between sensitivity and specificity.
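Rather than reading the cutoff off the plot, the same idea can be sketched numerically. The block below assumes that "maximize sensitivity and specificity" means maximizing their sum (Youden's J = sensitivity + specificity - 1), which is only one possible criterion. It uses base R on the same grid of cutoffs as the plot; the names `cutoffs`, `sens`, `spec`, `youden` and `best` are mine.

```r
# Data and model as in the question (downloading the CSV needs network access).
mydata <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv")
benchmark.data <- mydata[1:300, ]
temp.glm <- glm(admit ~ gre + gpa + rank, data = benchmark.data, family = "binomial")

# Fitted probabilities and true labels on the benchmark data.
p <- predict(temp.glm, newdata = benchmark.data, type = "response")
y <- benchmark.data$admit

# Same grid of cutoffs as printed on the ROC plot.
cutoffs <- seq(0, 1, by = 0.05)

# Sensitivity: share of admitted students predicted as admitted (p > cutoff).
sens <- sapply(cutoffs, function(k) mean(p[y == 1] > k))
# Specificity: share of non-admitted students predicted as not admitted.
spec <- sapply(cutoffs, function(k) mean(p[y == 0] <= k))

# Youden's J for each cutoff; the "best" cutoff is the one maximizing it.
youden <- sens + spec - 1
best <- cutoffs[which.max(youden)]

print(data.frame(cutoff = cutoffs, sensitivity = sens,
                 specificity = spec, youden = youden))
print(best)
```

Note that the cutoff is chosen on the benchmark data only, so it can then be applied to the 1996 students without having used their outcomes.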