I'm building what is called a direct demand model for predicting boardings at rail transit stations. The most accessible example is Transit Cooperative Research Program Report 16 (TCRP 16). I wish to do the same for Amtrak boardings. My outcome variable is annual station boardings. I have collected 2017 count data for all 528 existing Amtrak stations, not all of which are currently in service or were in service in 2017, so my 'sample' consists of ~500 stations. I wish to predict boarding counts for a planned state-sponsored Amtrak service between two cities in the southeastern US. I collected data for the variables that theory suggested would be significant.
I've then used the step() function (from the stats package) to build a linear regression model.
#=========regression=======================
## Full model
fullm <- lm(logABA ~ ., data=regress)
## Null model
nullm <- lm(logABA ~ 1, data=regress)
## Stepwise search from the null model up to the full model (AIC-based)
stepm <- step(nullm, scope=formula(fullm), direction="both")
#------------------------------------------
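To make the question concrete, here is a minimal sketch of one way I could check whether the stepwise result holds up out of sample: repeat the whole selection inside each fold of a k-fold cross-validation and look at the held-out error. The data here are simulated stand-ins (the names `regress` and `logABA` mirror my code above, but the predictors `x1`..`x10`, the fold count and the coefficients are all made up for illustration), so this is a sketch under those assumptions, not my actual model.

```r
# Simulated stand-in data; the real 'regress' would have ~500 rows and 36 predictors.
set.seed(1)
n <- 200
p <- 10
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)
# Only x1 and x2 truly matter in this made-up example.
regress <- cbind(logABA = 1 + 0.8 * X$x1 - 0.5 * X$x2 + rnorm(n), X)

k <- 5                                     # number of folds (arbitrary choice)
fold <- sample(rep(1:k, length.out = n))   # random fold assignment per row
cv_rmse <- numeric(k)

for (i in 1:k) {
  train <- regress[fold != i, ]
  test  <- regress[fold == i, ]
  # Redo the entire stepwise selection using the training rows only,
  # so the held-out rows play no part in choosing variables.
  nullm <- lm(logABA ~ 1, data = train)
  fullm <- lm(logABA ~ ., data = train)
  sel <- step(nullm, scope = formula(fullm), direction = "both", trace = 0)
  pred <- predict(sel, newdata = test)
  cv_rmse[i] <- sqrt(mean((test$logABA - pred)^2))
}

mean(cv_rmse)   # average held-out RMSE across folds
```

If the held-out RMSE is much worse than the residual standard error reported for the model selected on all the data, that would suggest the selection is fitting noise.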
After so many hours spent on trial-and-error model building, using corrplots with residuals, it seems almost appallingly easy. What am I missing? How do I know when I've buggered my model?
Yes, I read the thread. (The comments on 'Why we hate stepwise' are actually far more informative.)
Is overfitting still relevant when my sample is my universe? I only have 36 variables to test, many of which are highly correlated.
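For what it's worth, a rough way to put a number on "highly correlated" is the variance inflation factor, which can be read off the diagonal of the inverse of the predictors' correlation matrix. The sketch below uses three simulated predictors (`x1`, `x2`, `x3` are hypothetical names, with `x2` built to be nearly collinear with `x1`); it is only meant to illustrate the calculation, not to describe my data.

```r
# Simulated predictors; x2 is deliberately almost a copy of x1.
set.seed(2)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)
x3 <- rnorm(n)
preds <- data.frame(x1, x2, x3)

# VIF for predictor j is the j-th diagonal element of the inverse
# of the correlation matrix of the predictors.
vif <- diag(solve(cor(preds)))
round(vif, 1)
```

Large values for a pair of predictors would mean stepwise selection is choosing more or less arbitrarily between them.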