
Given the dataset cars.txt, we want to formulate a good regression model for the Midrange Price using the variables Horsepower, Length, Luggage, Uturn, Wheelbase, and Width. We want to do this in two ways:

  1. using all possible subsets selection, and
  2. using an automatic selection technique.

For the first part, we run the following in R:

cars <- read.table(file=file.choose(), header=TRUE)
names(cars)

# all-subsets regression; leaps() is provided by the leaps package
library(leaps)
# rank the 3 best subsets of each size by R-squared
leap <- leaps(x=cbind(cars$Horsepower, cars$Length, cars$Luggage, cars$Uturn, cars$Wheelbase, cars$Width),
              y=cars$MidrangePrice, method="r2", nbest=3)
combine <- cbind(leap$which, leap$size, leap$r2)
n <- length(leap$size)
dimnames(combine) <- list(1:n, c("horsep","length","Luggage","Uturn","Wheelbase","Width","size","r2"))
round(combine, digits=3)

# same search, ranked by Mallows' Cp (the default method of leaps())
leap.cp <- leaps(x=cbind(cars$Horsepower, cars$Length, cars$Luggage, cars$Uturn, cars$Wheelbase, cars$Width),
                 y=cars$MidrangePrice, nbest=3)
combine.cp <- cbind(leap.cp$which, leap.cp$size, leap.cp$Cp)
dimnames(combine.cp) <- list(1:n, c("horsep","length","Luggage","Uturn","Wheelbase","Width","size","cp"))
round(combine.cp, digits=3)
# Cp against the number of parameters, with the reference line Cp = p
plot(leap.cp$size, leap.cp$Cp, ylim=c(1,7))
abline(a=0, b=1)

Am I correct in my interpretation that the most adequate model is one with 4 parameters (the three variables Horsepower, Wheelbase and Width) because it has the lowest Mallows' Cp value?
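
For readers who want to see where the Cp values come from, here is a minimal sketch that computes Mallows' Cp by hand for every subset, using only base R. It assumes the usual definition Cp = RSS_p / s² − n + 2p, with s² taken from the full model and p counting the intercept. Since cars.txt is not available here, the built-in mtcars data and the six predictors below are stand-ins chosen purely for illustration.

```r
# Stand-in data: mtcars ships with R (the post's cars.txt is not available here).
dat <- mtcars
response <- "mpg"
predictors <- c("hp", "wt", "disp", "drat", "qsec", "cyl")  # six candidates, as in the question

n <- nrow(dat)

# Residual variance estimate from the full model (all six predictors)
full_fit <- lm(reformulate(predictors, response), data = dat)
s2_full <- sum(resid(full_fit)^2) / df.residual(full_fit)

# Every non-empty subset of the predictors (2^6 - 1 = 63 of them)
subsets <- unlist(
  lapply(seq_along(predictors), function(k) combn(predictors, k, simplify = FALSE)),
  recursive = FALSE
)

# Fit each subset and compute Cp = RSS_p / s2_full - n + 2p
results <- do.call(rbind, lapply(subsets, function(vars) {
  fit <- lm(reformulate(vars, response), data = dat)
  p <- length(coef(fit))            # number of parameters, intercept included
  rss <- sum(resid(fit)^2)
  data.frame(model = paste(vars, collapse = " + "),
             p = p,
             Cp = rss / s2_full - n + 2 * p)
}))

# Smallest Cp first; candidates with Cp close to p are the ones usually considered adequate
results <- results[order(results$Cp), ]
head(results, 5)
```

By construction the full model always has Cp equal to its own p, so it lies exactly on the reference line in the plot; the comparison is informative only for the smaller models.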

For the second part, we can choose between forward, backward or stepwise selection:

#stepwise selection methods
#forward: start from the intercept-only model and add terms from the scope
slm.forward <- step(lm(cars$MidrangePrice ~ 1, data=cars), scope=~cars$Horsepower + cars$Length + cars$Luggage + cars$Uturn + cars$Wheelbase + cars$Width, direction="forward")

#backward: start from the full model and drop terms
reg.lm1 <- lm(cars$MidrangePrice ~ cars$Horsepower + cars$Length + cars$Luggage + cars$Uturn + cars$Wheelbase + cars$Width)
slm.backward <- step(reg.lm1, direction="backward")


#stepwise: start from the full model; terms may be dropped and re-added
slm.stepwise <- step(reg.lm1, direction="both")

How do I interpret the results I get from this R code?
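
One way to make the result of step() easier to read is to suppress the per-step printout and inspect the returned object instead. The sketch below assumes the built-in mtcars data as a stand-in for cars.txt; the response and predictor names are illustrative only. It also uses plain column names together with data=, which gives shorter term labels than the cars$ prefix.

```r
# Stand-in data: mtcars ships with R (the post's cars.txt is not available here).
dat <- mtcars

# Intercept-only starting model and the largest model step() may reach
null_fit <- lm(mpg ~ 1, data = dat)
upper <- ~ hp + wt + disp + drat + qsec + cyl

# trace = 0 suppresses the step-by-step printout
fwd <- step(null_fit, scope = upper, direction = "forward", trace = 0)

# One row per step: which term was added and the AIC after adding it
fwd$anova

# The model that forward selection ended with
formula(fwd)
summary(fwd)$coefficients
```

The anova component is a compact record of the path: each row is a step, and the AIC column shows how much each addition improved the criterion.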

Rob Hyndman
BioGeek
  • You have 6 independent variables; is there a particular reason you need only a subset of them in your model? Why not include all of them? – mpiktas Jan 30 '11 at 15:40
  • Isn't this a homework assignment? – mpiktas Jan 30 '11 at 20:14
  • @mpiktas It is not a homework assignment, but the R code does indeed come from a course I'm following. The course text is very short on how to interpret the derived results and I posted here trying to get a better grasp on the material while studying. – BioGeek Jan 30 '11 at 23:49
  • I'd suggest giving the "bestglm" package a try. It uses leaps internally but gives you the possibility to use AIC, BIC, ... or even cross-validation as a basis for model selection. – AlefSin May 29 '11 at 20:43
  • I agree w/ mpiktas, there is no reason to need a subset of your variables, & further w/ Frank Harrell below that stepwise selection methods are certain to lead you astray. If that doesn't make sense / you want to understand why, you may want to read my answer here: [algorithms-for-automatic-model-selection](http://stats.stackexchange.com/questions/20836//20856#20856). – gung - Reinstate Monica Nov 27 '12 at 22:47

2 Answers


Stepwise regression in the absence of penalization is fraught with so many difficulties that I'm surprised people are still using it. The web has long lists of problems, starting with the extremely low probability of finding the "right" model.
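
A small simulation can make one of these problems concrete. The sketch below is only an illustration under assumed settings (50 observations, 10 candidate predictors, 30 replications, all chosen arbitrarily): the response is generated independently of every predictor, so the "right" model is the intercept-only one, yet forward selection by AIC will often pick up some noise variables.

```r
set.seed(1)

n_obs   <- 50   # observations per simulated data set (arbitrary choice)
n_noise <- 10   # candidate predictors, all pure noise (arbitrary choice)
n_rep   <- 30   # number of simulated data sets (arbitrary choice)

n_selected <- integer(n_rep)

for (r in seq_len(n_rep)) {
  # Predictors V1..V10 and a response that is unrelated to all of them
  d <- as.data.frame(matrix(rnorm(n_obs * n_noise), n_obs, n_noise))
  d$y <- rnorm(n_obs)

  null_fit <- lm(y ~ 1, data = d)
  upper <- reformulate(paste0("V", seq_len(n_noise)))  # ~ V1 + ... + V10

  fit <- step(null_fit, scope = upper, direction = "forward", trace = 0)

  # Number of noise variables that ended up in the selected model
  n_selected[r] <- length(coef(fit)) - 1
}

# Average number of irrelevant variables selected per data set
mean(n_selected)

# How often each model size was chosen; size 0 is the true model
table(n_selected)
```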

Frank Harrell

For the second part, read the output as the sequence of steps that leads to your final model.

For example, in the forward case you begin with:

Start:  AIC=377.95
cars$MidrangePrice ~ 1

                  Df Sum of Sq    RSS    AIC
+ cars$Horsepower  1    4979.3 3054.9 300.66
+ cars$Wheelbase   1    3172.3 4862.0 338.76
+ cars$Length      1    2448.8 5585.4 350.14
+ cars$Width       1    1969.2 6065.0 356.89
+ cars$Uturn       1    1450.2 6584.0 363.63
+ cars$Luggage     1    1079.6 6954.7 368.12
<none>                         8034.2 377.95

Your current model contains only the intercept: cars$MidrangePrice ~ 1.

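Each row in the table shows what you would get by adding that variable (for example, Horsepower) to the current model: the reduction in the sum of squares (Sum of Sq), the resulting RSS (Residual Sum of Squares) and the resulting AIC (Akaike Information Criterion). The <none> row is the current model without any change. step() then adds the variable that gives the lowest AIC (here Horsepower, with 300.66) and repeats the procedure until no addition lowers the AIC further.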

In the other cases you must read the results the same way.
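
To see how the AIC column relates to the RSS column, here is a minimal sketch of the formula that step() uses for linear models (via extractAIC()): n·log(RSS/n) + 2·(number of coefficients). The sample size n = 82 is an assumption on my part; it is not stated anywhere in the post, but it is the value that reproduces the printed numbers. The helper name aic_step is made up for this illustration.

```r
# Assumption: n = 82 is inferred from the printed table, not stated in the post.
n <- 82

# AIC as reported by step() for a linear model:
# n * log(RSS / n) + 2 * (number of estimated coefficients)
# (aic_step is a hypothetical helper, not a function from R itself)
aic_step <- function(rss, n_coef, n) {
  n * log(rss / n) + 2 * n_coef
}

# Intercept-only model: RSS = 8034.2, 1 coefficient
aic_null <- aic_step(8034.2, n_coef = 1, n = n)
aic_null   # close to the printed 377.95

# After adding Horsepower: RSS = 3054.9, 2 coefficients
aic_hp <- aic_step(3054.9, n_coef = 2, n = n)
aic_hp     # close to the printed 300.66
```

Small differences in the last digit come from the RSS values being rounded in the printout. Note that this AIC differs from the value returned by AIC() by an additive constant, which does not affect the comparison between models fitted to the same data.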

Hope this helps :)

deps_stats