1

I am using following data and self-explanatory code to create a model for prediction of 'low' (low birth weight) from modified birthwt dataset. I am using 80% for training and 20% for testing:

> library(MASS)
> bwdf = birthwt[-10]
> rownames(bwdf) = 1:nrow(bwdf)
> bwdf$low = factor(bwdf$low)
> bwdf$race = factor(bwdf$race)
> bwdf$smoke = factor(bwdf$smoke)
> bwdf$ui = factor(bwdf$ui)
> bwdf$ht = factor(bwdf$ht)
> str(bwdf)
'data.frame':   189 obs. of  9 variables:
$ low  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ age  : int  19 33 20 21 18 21 22 17 29 26 ...
$ lwt  : int  182 155 105 108 107 124 118 103 123 113 ...
$ race : Factor w/ 3 levels "1","2","3": 2 3 1 1 1 3 1 3 1 1 ...
$ smoke: Factor w/ 2 levels "0","1": 1 1 2 2 2 1 1 1 2 2 ...
$ ptl  : int  0 0 0 0 0 0 0 0 0 0 ...
$ ht   : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
$ ui   : Factor w/ 2 levels "0","1": 2 1 1 2 2 1 1 1 1 1 ...
$ ftv  : int  0 3 1 2 0 0 1 1 1 0 ...
> testrows = sample(1:nrow(bwdf), 0.2*nrow(bwdf))
# USING TRAINING SET:
> mod = glm(low~., data = bwdf[-testrows,], family='binomial')

The model is as follows:

> summary(mod)
Call:
glm(formula = low ~ ., family = "binomial", data = bwdf[-testrows, 
    ])

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.9134  -0.7675  -0.5021   0.8468   2.2537  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)   
(Intercept) -0.474339   1.332309  -0.356  0.72182   
age         -0.004613   0.042796  -0.108  0.91416   
lwt         -0.013568   0.007503  -1.808  0.07053 . 
race2        1.419952   0.604968   2.347  0.01892 * 
race3        0.847140   0.489473   1.731  0.08350 . 
smoke1       1.054312   0.440728   2.392  0.01675 * 
ptl          1.076627   0.416799   2.583  0.00979 **
ht1          1.979656   0.758125   2.611  0.00902 **
ui1          1.018945   0.520150   1.959  0.05012 . 
ftv          0.111426   0.191566   0.582  0.56080   
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 195.30  on 151  degrees of freedom
Residual deviance: 159.76  on 142  degrees of freedom
AIC: 179.76

Number of Fisher Scoring iterations: 4

Using the above model for prediction:

# USING TEST SET: 
> pred = predict(mod, bwdf[testrows,])
> summary(pred)
     Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
-3.416000 -1.777000 -1.351000 -0.651800 -0.003372  4.884000 

> table(ifelse(pred<0,0,1), bwdf[testrows,]$low)
     0  1
  0 19  9
  1  6  3

> length(testrows)
[1] 37
> 15/37
[1] 0.4054054

Hence, 40% are incorrect classifications. How can this be improved?

Edit: After putting in various combinations as suggested by @tmangin, prediction is much better, even though model has hardly anything significant (!?!):

> summary(mod)

Call:
glm(formula = low ~ age + I(age^2) + I(age^3) + I(age^4) + I(age^5) + 
    I(log(1 + age)) + I(age^(1/2)) + I(age^(1/3)) + I(age^(1/4)) + 
    I(age^(1/5)) + lwt + I(lwt^2) + I(lwt^3) + I(lwt^4) + I(lwt^5) + 
    I(log(1 + lwt)) + I(lwt^(1/3)) + I(lwt^(1/4)) + I(lwt^(1/5)) + 
    race + smoke + ptl + I(ptl^2) + I(ptl^3) + ht + I(ftv^2), 
    family = binomial, data = mydf)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-2.0284  -0.7650  -0.4202   0.6472   2.3372  

Coefficients: (1 not defined because of singularities)
                  Estimate Std. Error z value Pr(>|z|)  
(Intercept)      1.434e+10  2.538e+10   0.565   0.5722  
age             -1.534e+08  3.009e+08  -0.510   0.6101  
I(age^2)         5.205e+05  9.937e+05   0.524   0.6004  
I(age^3)        -4.101e+03  7.611e+03  -0.539   0.5900  
I(age^4)         2.748e+01  4.958e+01   0.554   0.5795  
I(age^5)        -9.672e-02  1.699e-01  -0.569   0.5691  
I(log(1 + age))  1.059e+09  2.182e+09   0.485   0.6274  
I(age^(1/2))     8.817e+09  1.747e+10   0.505   0.6139  
I(age^(1/3))    -3.899e+10  7.731e+10  -0.504   0.6140  
I(age^(1/4))     3.270e+10  6.464e+10   0.506   0.6129  
I(age^(1/5))            NA         NA      NA       NA  
lwt             -1.012e+07  1.379e+07  -0.734   0.4631  
I(lwt^2)         9.023e+03  1.217e+04   0.741   0.4585  
I(lwt^3)        -1.368e+01  1.819e+01  -0.752   0.4520  
I(lwt^4)         1.654e-02  2.159e-02   0.766   0.4438  
I(lwt^5)        -1.024e-05  1.310e-05  -0.782   0.4340  
I(log(1 + lwt))  4.112e+09  5.658e+09   0.727   0.4674  
I(lwt^(1/3))     7.347e+09  1.005e+10   0.731   0.4649  
I(lwt^(1/4))    -4.163e+09  7.635e+09  -0.545   0.5856  
I(lwt^(1/5))    -2.246e+10  3.175e+10  -0.707   0.4794  
race2            1.425e+00  7.298e-01   1.953   0.0508 .
race3            9.400e-01  5.664e-01   1.660   0.0970 .
smoke1           1.214e+00  5.237e-01   2.318   0.0204 *
ptl             -2.067e+00  1.204e+05   0.000   1.0000  
I(ptl^2)         7.273e+00  1.807e+05   0.000   1.0000  
I(ptl^3)        -3.196e+00  6.022e+04   0.000   1.0000  
ht1              2.521e+00  1.329e+00   1.897   0.0578 .
I(ftv^2)         1.119e-01  8.275e-02   1.353   0.1761  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 192.56  on 151  degrees of freedom
Residual deviance: 138.14  on 125  degrees of freedom
AIC: 192.14

Number of Fisher Scoring iterations: 25

> pred = predict(mod, bwdf[testrows,])
Warning message:
In predict.lm(object, newdata, se.fit, scale = 1, type = ifelse(type ==  :
  prediction from a rank-deficient fit may be misleading

> pred = ifelse(pred<0, 0, 1)
> 
> table(pred, bwdf[testrows,]$low)

pred  0  1
   0 27  6
   1  1  3
> 
> length(testrows)
[1] 37
> 
> 7/37
[1] 0.1891892
rnso
  • 8,893
  • 14
  • 50
  • 94
  • 1
    (1) the 2nd model must be grossly over-fitting with so many predictor terms for so few observations (note the AIC has increased from the first model), (2) the sample is far too small for hold-out validation to give an accurate estimate of predictive performance (try repeating the analysis with a different test set & see what happens), & (3) you're using an improper scoring rule as a performance metric & thus discarding most of the information in the predictions. – Scortchi - Reinstate Monica May 06 '15 at 19:03
  • What is the best way to use predicitons here? – rnso May 07 '15 at 01:20
  • 2
    Some links on over-fitting [here](http://stats.stackexchange.com/questions/22566), [here](http://stats.stackexchange.com/questions/20295), & [here](http://stats.stackexchange.com/questions/17047); on validation; [here](http://stats.stackexchange.com/questions/1826), [here](http://stats.stackexchange.com/questions/104713), & [here](http://stats.stackexchange.com/questions/49692); & on scoring rules [here](http://stats.stackexchange.com/questions/73537), [here](http://stats.stackexchange.com/questions/91088), & [here](http://stats.stackexchange.com/questions/58756). – Scortchi - Reinstate Monica May 07 '15 at 11:14

1 Answers1

2

You can try adding "polynomial features" (cf. https://class.coursera.org/ml-005/lecture/23):

  • take your 9 features:

f_1 = low

f_2 = age

f_3 = lwt

f_4 = race

f_5 = smoke

f_6 = ptl

f_7 = ht

f_8 = ui

f_9 = ftv

  • for each of these 9 features, create a new feature which is equal to feature^2

f_10 = f_1^2

f_11 = f_2^2

...

f_18 = f_9^2

Run your regression.

If it's not good enough, keep the ^2 features and add ^3 features.

f_19 = f_1^3

...

f_27 = f_ 9^3

You can also add ^(1/2).

f_28 = f_1^(1/2)

...

However, sometimes, the poor quality of the estimation is not linked with your model but with the data: the data you have may not "contain" enough information for you to estimate the "low" birth-weight! Think of an exercise where you're ask to estimate a share price with weather data. Whatever be the model you chose, you'll never have a good estimate of your share price!

I hope this was useful.

tmangin
  • 493
  • 3
  • 10
  • No, this process of "adding polynomial features" doesn't have a specific name. It has to be done manually, as most of the features tunings. Through with two For loops running over your feature for each polynomial degree, you can implement it quite quickly. I really encourage you to look at the coursera video. It explain you why adding those "polynomial features" can improve the accuracy of your model. – tmangin May 04 '15 at 16:57
  • Actualy, it's called "Polynomial regression". I don't think there is a package for this but here you can find an implementation example in R: http://www.r-bloggers.com/polynomial-regression-techniques/ – tmangin May 04 '15 at 18:26
  • Thanks. I was able to reduce the errors to 30%. Obviously, it cannot be applied to factor variables. Also, is this related to nls function? How can I use nls here? – rnso May 05 '15 at 00:55
  • Polynomial regression is a linear regression. I've never used nls. Regarding factor variables, you can use them in your model: for each of your factors just add a feature and the value 1 if present and 0 if not. For your example, low doesnt need to be divided, smoke should be reformated 0 or 1, race should be divided in three (feature "race 1": yes or no, feature "race 2": yes or no, etc.),ht and ui should be reformated in 0 or 1. – tmangin May 05 '15 at 10:34
  • To what power should one go to? Is using only ^2 and ^3 reasonable. Also ^(1/2) and ^(1/3) to be used along with ^2 etc? That makes many combinations. Is it to correct distribution problem? Can one use Box-Cox transformation instead? – rnso May 05 '15 at 11:00
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/23461/discussion-between-tmangin-and-rnso). – tmangin May 05 '15 at 14:12
  • You should look at this: http://stackoverflow.com/questions/3822535/fitting-polynomial-model-to-data-in-r – tmangin May 06 '15 at 13:38
  • You can try: log(x), log(x^2) => logarithmic regression; and exp(x), exp(x^2), etc. => exponential regression for sure. 1/(x^2), 1/(x^3) too => it's still called polynomial regression. – tmangin May 06 '15 at 14:19
  • It's because x shouldnt take the value "0". So instead of log(x), use log (1+x). – tmangin May 06 '15 at 14:25
  • I don't know why it doesn't work for the exp and 1/x^2... Sorry. – tmangin May 06 '15 at 14:37
  • How's your algorithm doing know? Still at 30% errors? – tmangin May 06 '15 at 14:37
  • It is down to 19%, thanks to your efforts (see my edit above). – rnso May 06 '15 at 14:49
  • :) not too bad. – tmangin May 06 '15 at 14:57