
I keep running into warnings in RStudio when I use subsets where p > n. ISLR 6.4.3 mentions that forward stepwise selection can be useful for high-dimensional data, which I'm trying to test out for learning purposes. All the examples I have found fit a full model first, but in all of them n > p. Could someone point me in the right direction or tell me what I'm missing? Sample code below.

library(caret) # provides createDataPartition(), trainControl(), train()

df <- read.csv('my_sample_data.csv')
df1 <- df[, 1:200] # using a subset of the features to explore stepwise
set.seed(123)
# x <- df1[, -1]
# y <- df1[, 1]

# train/test split (80% of rows go to training)
trainIndex <- createDataPartition(df1$Age, p = .8,
                                  list = FALSE,
                                  times = 1)
training <- df1[trainIndex, ]
testing <- df1[-trainIndex, ]

dim(training) # 130 200: 130 samples, 200 columns

# parameter tuning: 5-fold cross-validation
fitControl <- trainControl(
  method = "cv",
  number = 5)

set.seed(42)
step.fit <- train(Age ~ ., data = training,
                  method = "glmStepAIC",
                  trControl = fitControl,
                  trace = FALSE)
sumthymes
  • [Maybe do not do step-wise model building.](https://stats.stackexchange.com/a/558101/44269) – Alexis Jan 14 '22 at 04:29
  • I appreciate the link, but as I said, this is for learning purposes. ISLR mentions that forward stepwise can be used for high-dimensional data, so I'm trying to see for myself. Would you happen to know what I am missing to perform stepwise selection when p > n? – sumthymes Jan 15 '22 at 03:18
  • You cannot fit a glm where p > n: there are infinitely many solutions that achieve the same loss in these scenarios, hence the warnings. Not sure of the exact wording in ISLR, but keep in mind that high dimensional in this specific context still means n > p (and ideally, n >>> p). Regularization or Bayesian methods are common ways to get around these issues. – aranglol Jan 15 '22 at 04:42
  • Because `glmStepAIC` is not part of base `R`, your code is not reproducible. How are you specifying that it use forward instead of backward stepwise regression? BTW, for learning purposes, applying algorithms to datasets for which they were not intended is usually not productive. – whuber Jan 15 '22 at 18:53
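
On the forward-versus-backward point raised in the comments: a forward search can start from the intercept-only model, so a model with all p predictors never has to be fitted. Below is a minimal sketch of that idea using base R's `step()`, not the `caret` workflow from the question. The simulated data frame `sim`, its column names `X1` to `X199`, the coefficients used to generate `Age`, and the cap of 10 steps are all assumptions made for illustration; none of them come from the original data.

```r
# Hypothetical stand-in for the real data: 130 rows, 199 predictors (p > n).
set.seed(123)
n <- 130
p <- 199
X <- matrix(rnorm(n * p), nrow = n,
            dimnames = list(NULL, paste0("X", seq_len(p))))
sim <- data.frame(Age = 40 + 3 * X[, "X1"] - 2 * X[, "X2"] + rnorm(n), X)

# Start from the intercept-only model, which is estimable for any p.
null.fit <- lm(Age ~ 1, data = sim)

# Give the upper scope as a formula so the full model is never fitted.
upper <- reformulate(setdiff(names(sim), "Age"), response = "Age")

# Forward search by AIC; `steps` caps the number of added terms well below n
# (an arbitrary choice here, to keep the search away from a saturated fit).
fwd.fit <- step(null.fit,
                scope = list(lower = ~ 1, upper = upper),
                direction = "forward",
                steps = 10,
                trace = 0)

names(coef(fwd.fit)) # intercept plus the selected predictors
```

`MASS::stepAIC()` accepts the same `scope`, `direction`, and `steps` arguments, so the same pattern should carry over to it. Whether the stopping point should be chosen by AIC, a fixed number of steps, or cross-validation is a separate question that this sketch does not settle.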
