I have to predict when the soil dries out, so the dependent variable is binary (the soil is wet or dry). I have many predictor variables, which I have grouped into three main clusters:
- Weather
- Vegetation
- Soil
When I run a penalised ridge logistic regression (glmnet) on all variables, I get an AUC of around 0.81. I then run it on each cluster separately. Weather and vegetation each give an AUC of 0.5, while the soil cluster gives an AUC of 0.84.
- How can a single cluster of variables predict when the soil dries better than all variables together?
- Do the 'non-predictive' variables in the weather and vegetation clusters "drag down" the AUC of the full model, and is that why the soil-only model scores higher?
Here is the script:
library(readr)
library(caret)
library(tidyverse)
library(glmnet)
library(ROCR)
library(pROC)
library(doParallel)
registerDoParallel(cores = 4)
set.seed(123)
data <- read_csv("path/soildryness.csv")
df <- data %>% select(V1, V2, ... V25)
df.W <- data %>% select(V1, V2, ... V7)
df.V <- data %>% select(V8, V9, ... V18)
df.S <- data %>% select(V19, V20, ... V25)
training.samples <- df$V1 %>% createDataPartition(p = 0.8, list = FALSE)
train <- df[training.samples, ]
test <- df[-training.samples, ]
x.train <- data.frame(train[, names(train) != "V1"])
x.train <- data.matrix(x.train)
y.train <- train$V1
x.test <- data.frame(test[, names(test) != "V1"])
x.test <- data.matrix(x.test)
y.test <- test$V1
foldid <- sample(rep(seq(10), length.out = nrow(train)))
fits <- cv.glmnet(x.train, y.train, type.measure = "dev", alpha = 0, family = "binomial", nfolds = 10, foldid = foldid, parallel = TRUE, standardize = TRUE)
predicted <- predict(fits, s = fits$lambda.1se, newx = x.test, type = 'response')
pred <- prediction(predicted, y.test)
perf <- performance(pred, "tpr", "fpr")
plot(perf, col = "black")
abline(a = 0, b = 1, lty = 2, col = "red")
auc_ROCR <- performance(pred, measure = "auc")
auc_ROCR <- auc_ROCR@y.values[[1]]
auc_ROCR
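The script above shows only the fit on all variables. The per-cluster runs described earlier could be wrapped in a small helper so that every model is fitted and scored on the same train/test split. This is a minimal sketch on simulated data, not the real data set: the column grouping (V2 to V7 weather, V8 to V18 vegetation, V19 to V25 soil), the simulated signal, and the helper name `cluster_auc` are all assumptions made for illustration, and pROC is used for the AUC instead of ROCR.

```r
library(glmnet)
library(pROC)

set.seed(123)

# Simulated stand-in for the real data (assumption): y plays the role of V1,
# and the predictor columns are named V2..V25.
n <- 500
x <- matrix(rnorm(n * 24), nrow = n,
            dimnames = list(NULL, paste0("V", 2:25)))
# In this toy example only two soil columns carry any signal.
y <- rbinom(n, 1, plogis(x[, "V19"] + 0.8 * x[, "V20"]))

# Assumed grouping of columns into clusters.
clusters <- list(
  weather    = paste0("V", 2:7),
  vegetation = paste0("V", 8:18),
  soil       = paste0("V", 19:25),
  all        = paste0("V", 2:25)
)

# One shared 80/20 split so all models are evaluated on the same test rows.
train_idx <- sample(n, size = 0.8 * n)

# Hypothetical helper: fit a ridge logistic regression on the given columns
# and return the AUC on the held-out rows.
cluster_auc <- function(cols) {
  fit <- cv.glmnet(x[train_idx, cols], y[train_idx],
                   family = "binomial", alpha = 0,
                   type.measure = "deviance", nfolds = 10)
  p <- predict(fit, newx = x[-train_idx, cols],
               s = "lambda.1se", type = "response")
  # Fix levels and direction so pROC does not flip the curve automatically.
  r <- roc(y[-train_idx], as.numeric(p),
           levels = c(0, 1), direction = "<", quiet = TRUE)
  as.numeric(auc(r))
}

aucs <- sapply(clusters, cluster_auc)
round(aucs, 2)
```

Because all four AUC values come from a single test split, small gaps between them may simply reflect sampling noise; repeating the split with different seeds would show how much they vary.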
To sum up, the AUC values are:
Weather: 0.5
Vegetation: 0.5
Soil: 0.84
All: 0.81