I want to understand which variables lead to an infection by parasites in a tree. Hence, I want to use stepwise logistic regression based on AIC. First, I describe what I would do, and then my code will follow.
Dependent variable: Infection (Y/N or 1/0). This is approximately imbalanced by 1:10000.
Independent variables: Stand age, leaf density, avg_temperature, soil moisture, precipitation, species (1/2/3/4), carbon and nitrogen content (C & N %).
infected age LAI avg_temp avg_moist precip species C% N%
<fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
0 15 2.46 25.0 6.8 989 4 4.66 0.10
0 13 1.50 18.3 10.7 631 3 11.12 0.8
0 21 3.80 10.5 25.8 1207 2 14.73 1.9
0 56 5.21 24.2 9.2 434 1 3.21 0.12
0 57 4.31 20.6 10.4 499 1 4.63 0.17
0 2 0.58 25.3 2.1 801 4 2.58 0.09
- I perform a stepwise logistic regression on my dataset to see which variables predict infections. The accuracy is 99.5 %, only predicting NO.
- I split my dataset into training and testing (80 : 20 split).
- I use SMOTE on the train.data to balance the dataset.
- I perform a stepwise logistic regression again on train.data and get different coefficients but also a lower accuracy. The model now predicts not only NO but also YES.
My questions are then:
- Is this the right approach and order of steps? If not what should I change?
- Which regression model should I trust (step 1 or step 4) and report? In my opinion step 1 with all data is the 'correct' one and the coefficients are the ones to report and interpret.
Code:
library(readr)
trees <- read_csv("trees_simple.csv")
View(trees)
library(tidyverse)
library(caret)
library(MASS)
library(DMwR)
full.model <- glm(infected ~ age + LAI + avg_temp + avg_moist + precip + species + C% + N%, data = trees, family = binomial)
step.model <- stepAIC(full.model, direction = "both", trace = FALSE)
set.seed(123)
training.samples <- trees$infected %>% createDataPartition(p = 0.8, list = FALSE)
train <- trees[training.samples, ]
test <- trees[-training.samples, ]
trees$infected <- as.factor(trees$infected)
newData <- SMOTE(infected ~ ~ age + LAI + avg_temp + avg_moist + precip + species + C% + N%, data = data.frame(train), perc.over = 100, perc.under = 200)
full.model.smote <- glm(infected ~ age + LAI + avg_temp + avg_moist + precip + species + C% + N%, data = newData, family = binomial)
step.model.smote <- stepAIC(full.model.smote, direction = "both", trace = FALSE)
probabilities.smote <- step.model.smote %>% predict(newData, type = "response")
predicted.classes.smote <- ifelse(probabilities.smote > 0.5, "1", "0")