I am using following data and self-explanatory code to create a model for prediction of 'low' (low birth weight) from modified birthwt dataset. I am using 80% for training and 20% for testing:
> library(MASS)
> bwdf = birthwt[-10]
> rownames(bwdf) = 1:nrow(bwdf)
> bwdf$low = factor(bwdf$low)
> bwdf$race = factor(bwdf$race)
> bwdf$smoke = factor(bwdf$smoke)
> bwdf$ui = factor(bwdf$ui)
> bwdf$ht = factor(bwdf$ht)
> str(bwdf)
'data.frame': 189 obs. of 9 variables:
$ low : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ age : int 19 33 20 21 18 21 22 17 29 26 ...
$ lwt : int 182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
$ ptl : int 0 0 0 0 0 0 0 0 0 0 ...
$ ht : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ui : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
$ ftv : int 0 3 1 2 0 0 1 1 1 0 ...
> testrows = sample(1:nrow(bwdf), 0.2*nrow(bwdf))
# USING TRAINING SET:
> mod = glm(low~., data = bwdf[-testrows,], family='binomial')
The model is as follows:
> summary(mod)
Call:
glm(formula = low ~ ., family = "binomial", data = bwdf[-testrows,
])
Deviance Residuals:
Min 1Q Median 3Q Max
-1.9134 -0.7675 -0.5021 0.8468 2.2537
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.474339 1.332309 -0.356 0.72182
age -0.004613 0.042796 -0.108 0.91416
lwt -0.013568 0.007503 -1.808 0.07053 .
race2 1.419952 0.604968 2.347 0.01892 *
race3 0.847140 0.489473 1.731 0.08350 .
smoke1 1.054312 0.440728 2.392 0.01675 *
ptl 1.076627 0.416799 2.583 0.00979 **
ht1 1.979656 0.758125 2.611 0.00902 **
ui1 1.018945 0.520150 1.959 0.05012 .
ftv 0.111426 0.191566 0.582 0.56080
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 195.30 on 151 degrees of freedom
Residual deviance: 159.76 on 142 degrees of freedom
AIC: 179.76
Number of Fisher Scoring iterations: 4
Using the above model for prediction:
# USING TEST SET:
> pred = predict(mod, bwdf[testrows,])
> summary(pred)
Min. 1st Qu. Median Mean 3rd Qu. Max.
-3.416000 -1.777000 -1.351000 -0.651800 -0.003372 4.884000
> table(ifelse(pred<0,0,1), bwdf[testrows,]$low)
0 1
0 19 9
1 6 3
> length(testrows)
[1] 37
> 15/37
[1] 0.4054054
Hence, 40% are incorrect classifications. How can this be improved?
Edit: After putting in various combinations as suggested by @tmangin, prediction is much better, even though model has hardly anything significant (!?!):
> summary(mod)
Call:
glm(formula = low ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5) +
I(log(1 + age)) + I(age^(1/2)) + I(age^(1/3)) + I(age^(1/4)) +
I(age^(1/5)) + lwt + I(lwt^2) + I(lwt^3) + I(lwt^4) + I(lwt^5) +
I(log(1 + lwt)) + I(lwt^(1/3)) + I(lwt^(1/4)) + I(lwt^(1/5)) +
race + smoke + ptl + I(ptl^2) + I(ptl^3) + ht + I(ftv^2),
family = binomial, data = mydf)
Deviance Residuals:
Min 1Q Median 3Q Max
-2.0284 -0.7650 -0.4202 0.6472 2.3372
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) 1.434e+10 2.538e+10 0.565 0.5722
age -1.534e+08 3.009e+08 -0.510 0.6101
I(age^2) 5.205e+05 9.937e+05 0.524 0.6004
I(age^3) -4.101e+03 7.611e+03 -0.539 0.5900
I(age^4) 2.748e+01 4.958e+01 0.554 0.5795
I(age^5) -9.672e-02 1.699e-01 -0.569 0.5691
I(log(1 + age)) 1.059e+09 2.182e+09 0.485 0.6274
I(age^(1/2)) 8.817e+09 1.747e+10 0.505 0.6139
I(age^(1/3)) -3.899e+10 7.731e+10 -0.504 0.6140
I(age^(1/4)) 3.270e+10 6.464e+10 0.506 0.6129
I(age^(1/5)) NA NA NA NA
lwt -1.012e+07 1.379e+07 -0.734 0.4631
I(lwt^2) 9.023e+03 1.217e+04 0.741 0.4585
I(lwt^3) -1.368e+01 1.819e+01 -0.752 0.4520
I(lwt^4) 1.654e-02 2.159e-02 0.766 0.4438
I(lwt^5) -1.024e-05 1.310e-05 -0.782 0.4340
I(log(1 + lwt)) 4.112e+09 5.658e+09 0.727 0.4674
I(lwt^(1/3)) 7.347e+09 1.005e+10 0.731 0.4649
I(lwt^(1/4)) -4.163e+09 7.635e+09 -0.545 0.5856
I(lwt^(1/5)) -2.246e+10 3.175e+10 -0.707 0.4794
race2 1.425e+00 7.298e-01 1.953 0.0508 .
race3 9.400e-01 5.664e-01 1.660 0.0970 .
smoke1 1.214e+00 5.237e-01 2.318 0.0204 *
ptl -2.067e+00 1.204e+05 0.000 1.0000
I(ptl^2) 7.273e+00 1.807e+05 0.000 1.0000
I(ptl^3) -3.196e+00 6.022e+04 0.000 1.0000
ht1 2.521e+00 1.329e+00 1.897 0.0578 .
I(ftv^2) 1.119e-01 8.275e-02 1.353 0.1761
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 192.56 on 151 degrees of freedom
Residual deviance: 138.14 on 125 degrees of freedom
AIC: 192.14
Number of Fisher Scoring iterations: 25
> pred = predict(mod, bwdf[testrows,])
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type == :
prediction from a rank-deficient fit may be misleading
> pred = ifelse(pred<0, 0, 1)
>
> table(pred, bwdf[testrows,]$low)
pred 0 1
0 27 6
1 1 3
>
> length(testrows)
[1] 37
>
> 7/37
[1] 0.1891892