Stepwise regression in R - what's my alternate?

Question

I'm building what is called a direct demand model for predicting boardings at rail transit stations. The most available example is Transit Cooperative Research Program Report 16 (TCRP 16). I wish to do so for Amtrak boardings. My outcome variable is annual station boardings. I have collected 2017 count data for all 528 existing Amtrak stations, not all of which are currently in service or were in service in 2017, so my 'sample' consists of ~500 stations. I wish to predict boarding counts for a planned state-sponsored Amtrak service between two cities in the southeastern US. I collected data for the variables that theory suggested were significant.

I then used the step() function (from the stats package) to build a linear regression model.

#=========regression=======================
## Full model
fullm <- lm(logABA ~ ., data=regress)
## Null model
nullm <- lm(logABA ~ 1, data=regress)
step(nullm, scope=formula(fullm))
#------------------------------------------

After so many hours spent on trial-and-error model building, using corrplots with residuals, it seems almost appallingly easy. What am I missing? How do I know when I've buggered my model?

Yes, I read the thread. (The comments on 'Why we hate stepwise' are actually far more informative.)

Is overfitting still relevant when my sample is my universe? I've only got 36 variables to test, many of which are highly correlated.
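One way to probe the overfitting question is to repeat the whole stepwise search inside each fold of a cross-validation and compare the held-out error with the in-sample error. The following is only a minimal sketch, not a definitive procedure. It assumes a simulated stand-in for the `regress` data frame (500 rows, 36 pure-noise predictors, and a noise outcome named `logABA`), so any apparent fit comes from the selection step alone. The fold count `k` and the names `cv_rmse` and `in_sample_rmse` are illustrative choices.

```r
# Hypothetical stand-in for `regress`: 36 noise predictors and a noise outcome.
set.seed(1)
n <- 500
p <- 36
regress <- as.data.frame(matrix(rnorm(n * p), n, p))
regress$logABA <- rnorm(n)

# Upper scope: the expanded formula of the full model, as in the original call.
upper <- formula(lm(logABA ~ ., data = regress))

# k-fold cross-validation with the stepwise search rerun on each training set.
k <- 5
fold <- sample(rep(seq_len(k), length.out = n))
cv_sq_err <- numeric(n)
for (i in seq_len(k)) {
  train <- regress[fold != i, ]
  test  <- regress[fold == i, ]
  fit <- step(lm(logABA ~ 1, data = train), scope = upper, trace = 0)
  cv_sq_err[fold == i] <- (test$logABA - predict(fit, newdata = test))^2
}

# Same search on all rows, scored on the rows it was fitted to.
full_fit <- step(lm(logABA ~ 1, data = regress), scope = upper, trace = 0)
in_sample_rmse <- sqrt(mean(residuals(full_fit)^2))
cv_rmse <- sqrt(mean(cv_sq_err))

c(in_sample = in_sample_rmse, cross_validated = cv_rmse)
```

Running the same loop on the real data would show how far the in-sample error understates the error to expect at stations the model has not seen, which is the situation for a planned service even when the fitted stations are the whole current universe.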

Could you clarify a bit? I don't understand what the question is. You've used stepwise regression and now what is the problem you are experiencing? — Demetri Pananos, Apr 06 '20 at 19:25
The thread is pretty clear I ought not use stepwise. So I'm curious regarding what might go wrong. The gist of the answers is 'overfitting'. I'm not sure if that matters in my case. — Mox, Apr 06 '20 at 19:29
You have to expand. What's the purpose of the analysis? What's the $n$? What's the relevance of any significance based procedure when your sample is, as you say, the universe? You refer to US based passenger rail (Amtrak) as a data source. A common pitfall for analysts is having access to an entire historical data source, and making overly confident decisions about the future when forecasting intervals need to be used for the intended application. — AdamO, Apr 06 '20 at 19:34
I'd suggest comparing the AIC of your steps. Furthermore, if your variables are highly correlated, I have personally had good experiences running PCA on the variable set first and regressing on the PCs. — Erin Sprünken, Apr 06 '20 at 21:20
The candidate variables are highly correlated. The variables that the stepwise procedure selects are not. I'm trying to build a (simple) predictive model, so PCA isn't really that helpful. The AIC improves dramatically as variables are added. — Mox, Apr 06 '20 at 21:28
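For reference, the PCA-then-regress idea raised in the comments above could be sketched as follows. This is a rough illustration under stated assumptions, not anyone's actual workflow: the predictors are simulated (six correlated columns `x1` to `x6` driven by one latent factor), the 90% variance cutoff is arbitrary, and names such as `pcr_fit` and `n_pc` are hypothetical.

```r
# Simulated, highly correlated predictors (assumption: not the real station data).
set.seed(1)
n <- 500
z <- rnorm(n)                                        # shared latent factor
X <- sapply(1:6, function(j) z + rnorm(n, sd = 0.3))
colnames(X) <- paste0("x", 1:6)
logABA <- 2 + 0.5 * z + rnorm(n, sd = 0.5)

# Principal components of the standardized predictors.
pca <- prcomp(X, center = TRUE, scale. = TRUE)
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Keep the fewest components reaching 90% of the variance (arbitrary cutoff).
n_pc <- which(cumsum(var_explained) >= 0.90)[1]

# Regress the outcome on the retained component scores.
scores <- as.data.frame(pca$x[, seq_len(n_pc), drop = FALSE])
scores$logABA <- logABA
pcr_fit <- lm(logABA ~ ., data = scores)

# Predict for new rows: project onto the same components first.
new_X <- X[1:3, , drop = FALSE]
new_scores <- as.data.frame(
  predict(pca, newdata = new_X)[, seq_len(n_pc), drop = FALSE]
)
new_pred <- predict(pcr_fit, newdata = new_scores)
new_pred
```

The trade-off noted in the thread still applies: the components are linear combinations of every original variable, so the resulting model is harder to explain than one built from a handful of named predictors.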
