I was recently advised to use a penalized logistic regression model to better understand which drivers influence my outcome (i.e. the eradication success/failure of an invasive plant species after a particular control method was applied). This choice seemed all the more appropriate after I realized that one of my predictors completely separates my response variable. (This is what I was expecting, yet I don't think this variable is the only important predictor of eradication success/failure.)
After some research, I thus decided to fit an elastic-net logistic regression to my data using the code below. (You may not get exactly the same results as me, as I'm not sure how to properly set the seed.)
library(RCurl)
library(dplyr)
x <- RCurl::getURL("https://raw.githubusercontent.com/mrelnoob/jk.dusz.tarping/master/mydata/erad.csv")
read.csv(text = x) %>%
  mutate(eff_eradication = as.factor(eff_eradication),
         fully_tarped = as.factor(fully_tarped)) %>%
  mutate(distance = log2(distance + 1)) %>%
  mutate(stand_surface = log2(stand_surface)) -> eff
mydata <- eff[ , c("eff_eradication", "distance", "fully_tarped", "followups", "obstacles",
                   "slope", "stand_surface", "tarping_duration")]
# I log-transform some predictors to respect the linearity assumption of their relationship with the logit of my response
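# (Illustrative sketch only: one rough way to eyeball this linearity assumption is to bin a
# continuous predictor, compute the empirical logit of the outcome within each bin, and plot
# it against the bin means; stand_surface is used here purely as an example.)
bins <- cut(mydata$stand_surface, breaks = 4)
k <- tapply(as.numeric(as.character(mydata$eff_eradication)), bins, sum)
n <- tapply(as.numeric(as.character(mydata$eff_eradication)), bins, length)
plot(tapply(mydata$stand_surface, bins, mean), log((k + 0.5) / (n - k + 0.5)),
     xlab = "stand_surface (log2 scale)", ylab = "empirical logit", pch = 19)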
# To prepare for a 10-fold cross validation with 5 repeats:
train.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5)
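# (Sketch of one way to make the repeated CV itself reproducible, assuming single-threaded
# execution: pre-generate one integer vector per resample (10 folds x 5 repeats = 50) plus a
# final seed, and pass them through the "seeds" argument of trainControl(); 500 is simply a
# generous upper bound on the number of hyperparameter combinations tried per resample.
# I have kept my original train.control object below.)
set.seed(26)
cv.seeds <- c(lapply(1:50, function(i) sample.int(10000, 500)), list(sample.int(10000, 1)))
seeded.control <- caret::trainControl(method = "repeatedcv", number = 10, repeats = 5,
                                      seeds = cv.seeds)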
# To convert the data to a matrix format accepted by glmnet:
x <- stats::model.matrix(eff_eradication~., mydata)[,-1] # Creates a matrix of the potential predictors
y <- as.numeric(as.character(mydata$eff_eradication)) %>% as.matrix() # Same for Y
### Train a model using the repeated 10-fold cross validation to find the best hyperparameters:
set.seed(26)
trained.mod <- caret::train(eff_eradication ~ ., data = mydata, method = "glmnet", family = "binomial",
                            trControl = train.control,
                            tuneLength = 15)
# To extract the best hyperparameters:
get_best_result <- function(caret_fit) {
  best <- which(rownames(caret_fit$results) == rownames(caret_fit$bestTune))
  best_result <- caret_fit$results[best, ]
  rownames(best_result) <- NULL
  best_result
}
best_param <- get_best_result(caret_fit = trained.mod)
best_param
# alpha lambda Accuracy Kappa AccuracySD KappaSD
#1 0.9357143 0.0008972655 0.7420556 0.2757236 0.1260169 0.3631528 # Some other tuning gave slightly better
# results but I'm having trouble with reproducibility here...
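# (Side note: if I understand caret correctly, the selected hyperparameters are also stored
# directly in the fitted object, so the following should give the same alpha and lambda
# without the helper function above.)
trained.mod$bestTune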
# Final elastic-net model (to observe what variables are kept and their associated coefficients):
elasticnet.model <- glmnet::glmnet(x = x, y = y, alpha = best_param$alpha,
                                   lambda = best_param$lambda,
                                   family = "binomial")
glmnet::coef.glmnet(object = elasticnet.model, s = best_param$lambda)
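# (Alternative sketch, following the glmnet documentation's warning against supplying a single
# lambda value: fit the whole regularisation path and extract the coefficients at the chosen
# penalty instead.)
path.model <- glmnet::glmnet(x = x, y = y, alpha = best_param$alpha, family = "binomial")
coef(path.model, s = best_param$lambda)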
### Make predictions using the final elastic net model (glmnet):
predic.glmnet <- predict(object = elasticnet.model, newx = x,
                         s = best_param$lambda, type = "class") %>% as.factor()
mean(predic.glmnet == mydata$eff_eradication) # 0.776 ≈ Accuracy?
pred.obs <- data.frame(predic.glmnet = predic.glmnet, eff_eradication = mydata$eff_eradication)
sum(pred.obs$predic.glmnet == pred.obs$eff_eradication) # 66 out of 85 observations
sum(pred.obs$predic.glmnet == 1 & pred.obs$eff_eradication == 1) # Correctly predicted eradication 10 times
# out of 23 eradication events (not very good)!
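For what it's worth, I suppose something like caret::confusionMatrix() would summarise this class agreement (sensitivity, specificity, etc.) more cleanly than my manual counts. A minimal sketch using the same objects as above, assuming the predicted and observed factors share the same levels and that "1" codes a successful eradication:
caret::confusionMatrix(data = predic.glmnet, reference = mydata$eff_eradication, positive = "1")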
As I am new to predictive modelling (inference was my initial goal), cross-validation, and penalized methods, I would like to know:
1. What kind of diagnostic checks could I run for this type of model? Despite my research, I have found no mention of assumptions, diagnostics, residuals, etc.
2. Given the low class agreement of my model, what could I do to further improve its predictive power?