I try to figure out the relationship between bike rental and some explanatory factor. The model I'm using is log negative binomial model.
Here is the data:
'data.frame': 731 obs. of 16 variables:
$ instant : int 1 2 3 4 5 6 7 8 9 10 ...
$ dteday : Factor w/ 731 levels "2011/1/1","2011/1/10",..: 1 12 23 26 27 28 29 30 31 2 ...
$ season : Factor w/ 4 levels "spring","summer",..: 1 1 1 1 1 1 1 1 1 1 ...
$ yr : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ mnth : Factor w/ 12 levels "1","2","3","4",..: 1 1 1 1 1 1 1 1 1 1 ...
$ holiday : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ weekday : Factor w/ 7 levels "0","1","2","3",..: 7 1 2 3 4 5 6 7 1 2 ...
$ workingday: Factor w/ 2 levels "0","1": 1 1 2 2 2 2 2 1 1 2 ...
$ weathersit: Factor w/ 3 levels "1","2","3": 2 2 1 1 1 1 2 2 1 1 ...
$ temp : Factor w/ 3 levels "low","moderate",..: 1 1 1 1 1 1 1 1 1 1 ...
$ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
$ hum : Factor w/ 3 levels "dry","agreeable",..: 3 3 2 2 2 2 2 2 2 2 ...
$ windspeed : Factor w/ 5 levels "B","C","D","E",..: 2 3 3 2 3 2 3 3 4 3 ...
$ casual : int 331 131 120 108 82 88 148 68 54 41 ...
$ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
$ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
The initial model I used as start point.
log_negativebinomial_fit <- glm(cnt~hum+atemp+weathersit+workingday+yr+season+I(atemp^2), data= bike, family = negative.binomial(theta=1,link="log"))
summary(log_negativebinomial_fit)
Call: glm(formula = cnt ~ hum + atemp + weathersit + workingday + yr +
season + I(atemp^2), family = negative.binomial(theta = 1,
link = "log"), data = bike)
Coefficients:
(Intercept) humagreeable humhumid atemp weathersit2
6.29407 0.02012 -0.07577 5.79034 -0.11419
weathersit3 workingday1 yr1 seasonsummer seasonfall
-0.77882 0.05267 0.46632 0.27571 0.33942
seasonwinter I(atemp^2)
0.39563 -4.63977
Degrees of Freedom: 729 Total (i.e. Null); 718 Residual
(1 observation deleted due to missingness)
Null Deviance: 186.9
Residual Deviance: 40.38 AIC: 13620
I used step
in R to conduct variable selection for me, however, the result seems weird to me.
> step(log_negativebinomial_fit,
+ scope =list(upper=~windspeed*hum*atemp*weathersit*workingday*yr*season+I(atemp^2),lower=~1),
+ direction="both",
+ steps=200,trace=1
+ )
Start: AIC=13622.04
cnt ~ hum + atemp + weathersit + workingday + yr + season + I(atemp^2)
Df Deviance AIC
+ atemp:season 3 38.719 13586
+ hum:atemp 2 38.872 13588
+ yr:season 3 39.111 13596
+ windspeed 4 39.140 13599
+ workingday:season 3 39.567 13608
+ atemp:yr 1 39.771 13609
+ atemp:workingday 1 39.771 13609
+ hum:season 6 39.580 13614
+ hum:weathersit 1 40.019 13615
+ atemp:weathersit 2 40.061 13618
<none> 40.385 13622
+ hum:workingday 2 40.267 13623
+ weathersit:season 6 39.951 13623
+ workingday:yr 1 40.383 13624
+ weathersit:yr 2 40.346 13625
+ hum:yr 2 40.362 13625
+ weathersit:workingday 2 40.371 13626
- workingday 1 40.816 13631
- hum 2 41.426 13644
- I(atemp^2) 1 47.792 13806
- weathersit 2 49.329 13843
- season 3 50.991 13883
- atemp 1 53.819 13958
- yr 1 78.298 14574
Step: AIC=13626.37
cnt ~ hum + atemp + weathersit + workingday + yr + season + I(atemp^2) +
atemp:season
Call: glm(formula = cnt ~ hum + atemp + weathersit + workingday + yr +
season + I(atemp^2) + atemp:season, family = negative.binomial(theta = 1,
link = "log"), data = bike)
Coefficients:
(Intercept) humagreeable humhumid atemp
6.478458 0.008492 -0.093304 4.600862
weathersit2 weathersit3 workingday1 yr1
-0.110714 -0.791679 0.050971 0.466812
seasonsummer seasonfall seasonwinter I(atemp^2)
0.292224 1.458707 0.505776 -2.800671
atemp:seasonsummer atemp:seasonfall atemp:seasonwinter
-0.169793 -2.000234 -0.288588
Degrees of Freedom: 729 Total (i.e. Null); 715 Residual
(1 observation deleted due to missingness)
Null Deviance: 186.9
Residual Deviance: 38.72 AIC: 13630
The selection function based on AIC gives me a model with larger AIC? What am I missing here?