I have a disease dataset. In this dataset, desease_rate is the dependent variable and the rest are independent variables.
df1 <- read.csv("H:/uni/MS_DS/disease.csv")
> df1
         radius      texture perimeter   area smoothness desease_rate
1  -0.018743998  0.002521470 -0.005025 0.0710 0.00000000         0.07
2  -0.027940652  0.003164681 -0.004625 0.0706 0.06476967         0.02
3   0.002615946  0.001328688 -0.005525 0.0726 0.06268457         0.07
4   0.041963329  0.002769471 -0.004325 0.0699 0.06013138         0.06
5   0.030261380  0.005725780 -0.003525 0.0695 0.05942403         0.04
6  -0.030559594  0.001576348 -0.002525 0.0695 0.06110087         0.05
7   0.002698690 -0.003028856 -0.006025 0.0706 0.06207810         0.07
8  -0.044996901  0.000617110 -0.009525 0.0691 0.05940039         0.05
9   0.022993350 -0.000637109 -0.015425 0.0695 0.05870643         0.03
10  0.001398530 -0.000470057 -0.017125 0.0705 0.05540871         0.01
11  0.026827990  0.000509490 -0.014025 0.0681 0.05588225         0.06
12 -0.076220726  0.001018820 -0.010225 0.0631 0.05515852         0.01
13 -0.021917789  0.000822517 -0.003925 0.0576 0.05584590         0.03
14  0.012491060 -0.007363090  0.005175 0.0569 0.05120000         0.03
15  0.038281834 -0.008005798  0.014975 0.0576 0.04940000         0.06
16 -0.033198384  0.000350052  0.022875 0.0564 0.04930000         0.01
17 -0.002358179  0.003846831  0.022675 0.0572 0.05050000         0.07
18  0.020808766  0.000536629  0.024575 0.0656 0.04820000         0.04
19  0.091888897 -0.002393641  0.009775 0.0761 0.04740000         0.07
20 -0.036293550 -0.002889337  0.001775 0.0828 0.04770000         0.01
PART 1: MANUAL VARIABLE SELECTION METHOD:
#Multiple Linear Model - fitting the model.
multilinearmodel = lm(desease_rate ~ radius + texture + perimeter + area +
smoothness, data = df1)
summary(multilinearmodel)
Call:
lm(formula = desease_rate ~ radius + texture + perimeter + area +
smoothness, data = df1)
Residuals:
      Min        1Q    Median        3Q       Max
-0.032172 -0.013960 -0.004256  0.013622  0.033051

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.06616    0.06155   1.075   0.3006
radius       0.33809    0.14270   2.369   0.0327 *
texture      1.16524    1.54157   0.756   0.4623
perimeter   -0.02464    0.46819  -0.053   0.9588
area        -0.06218    0.82411  -0.075   0.9409
smoothness  -0.36014    0.38102  -0.945   0.3606
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.0219 on 14 degrees of freedom
Multiple R-squared: 0.3298, Adjusted R-squared: 0.09049
F-statistic: 1.378 on 5 and 14 DF, p-value: 0.2909
> #Anova test.
> anova(multilinearmodel)
Analysis of Variance Table

Response: desease_rate
           Df    Sum Sq    Mean Sq F value  Pr(>F)
radius      1 0.0026031 0.00260313  5.4272 0.03531 *
texture     1 0.0002587 0.00025868  0.5393 0.47484
perimeter   1 0.0000134 0.00001340  0.0279 0.86964
area        1 0.0000012 0.00000118  0.0025 0.96109
smoothness  1 0.0004285 0.00042853  0.8934 0.36058
Residuals  14 0.0067151 0.00047965
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
> # AIC
> AIC(multilinearmodel)
[1] -89.2251
> # BIC
> BIC(multilinearmodel)
[1] -82.25498
Here only radius has a p-value at or below 0.05 (p = 0.0327); every other variable has a p-value well above 0.05.
Is there any way to do variable selection in such a situation, given that all the other variables have large p-values?
If there is anything we can do for variable selection, please suggest it. Also, please help me extract the Mallows' Cp value for this model.
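For the Mallows' Cp part, here is a minimal base-R sketch of how it could be computed by hand, assuming the usual definition Cp = RSS_p / s2 - n + 2p, where s2 is the residual mean square of the full model and p counts the intercept. The helper name mallows_cp is hypothetical (it is not from any package), and the data are simply retyped from the printout above so the snippet runs on its own.

```r
# Data retyped from the printout above (column name kept as spelled there).
df1 <- read.table(header = TRUE, text = "
radius texture perimeter area smoothness desease_rate
-0.018743998 0.002521470 -0.005025 0.0710 0.00000000 0.07
-0.027940652 0.003164681 -0.004625 0.0706 0.06476967 0.02
0.002615946 0.001328688 -0.005525 0.0726 0.06268457 0.07
0.041963329 0.002769471 -0.004325 0.0699 0.06013138 0.06
0.030261380 0.005725780 -0.003525 0.0695 0.05942403 0.04
-0.030559594 0.001576348 -0.002525 0.0695 0.06110087 0.05
0.002698690 -0.003028856 -0.006025 0.0706 0.06207810 0.07
-0.044996901 0.000617110 -0.009525 0.0691 0.05940039 0.05
0.022993350 -0.000637109 -0.015425 0.0695 0.05870643 0.03
0.001398530 -0.000470057 -0.017125 0.0705 0.05540871 0.01
0.026827990 0.000509490 -0.014025 0.0681 0.05588225 0.06
-0.076220726 0.001018820 -0.010225 0.0631 0.05515852 0.01
-0.021917789 0.000822517 -0.003925 0.0576 0.05584590 0.03
0.012491060 -0.007363090 0.005175 0.0569 0.05120000 0.03
0.038281834 -0.008005798 0.014975 0.0576 0.04940000 0.06
-0.033198384 0.000350052 0.022875 0.0564 0.04930000 0.01
-0.002358179 0.003846831 0.022675 0.0572 0.05050000 0.07
0.020808766 0.000536629 0.024575 0.0656 0.04820000 0.04
0.091888897 -0.002393641 0.009775 0.0761 0.04740000 0.07
-0.036293550 -0.002889337 0.001775 0.0828 0.04770000 0.01
")

# Hypothetical helper: Mallows' Cp of a candidate model, judged against
# the full model. Cp = RSS_p / s2_full - n + 2 * p (p includes the intercept).
mallows_cp <- function(candidate, full) {
  n  <- nobs(full)
  s2 <- sum(residuals(full)^2) / df.residual(full)  # full-model residual mean square
  p  <- candidate$rank                              # number of estimated coefficients
  sum(residuals(candidate)^2) / s2 - n + 2 * p
}

full <- lm(desease_rate ~ radius + texture + perimeter + area + smoothness,
           data = df1)
radius_only <- lm(desease_rate ~ radius, data = df1)

cp_full   <- mallows_cp(full, full)         # equals p for the full model by construction
cp_radius <- mallows_cp(radius_only, full)  # Cp of the radius-only candidate
print(c(full = cp_full, radius_only = cp_radius))
```

Under this definition, a candidate whose Cp is small and close to its own p is usually taken as a reasonable choice, so comparing cp_radius with 2 (intercept plus radius) is one way to judge the radius-only model.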
PART 2: VARIABLE SELECTION USING AUTOMATIC METHODS:
library(leaps)
library(MASS)
model <- regsubsets(desease_rate ~ radius + texture + perimeter + area + smoothness, data = df1, nbest = 1, method = "forward",
nvmax =4 )
summary(model)
Subset selection object
Call: regsubsets.formula(desease_rate ~ radius + texture + perimeter +
area + smoothness, data = df1, nbest = 1, method = "forward",
nvmax = 4)
5 Variables (and intercept)
           Forced in Forced out
radius         FALSE      FALSE
texture        FALSE      FALSE
perimeter      FALSE      FALSE
area           FALSE      FALSE
smoothness     FALSE      FALSE
1 subsets of each size up to 4
Selection Algorithm: forward
         radius texture perimeter area smoothness
1  ( 1 ) "*"    " "     " "       " "  " "
2  ( 1 ) "*"    " "     " "       " "  "*"
3  ( 1 ) "*"    "*"     " "       " "  "*"
4  ( 1 ) "*"    "*"     " "       "*"  "*"
I am not sure what should be done after this code. How can the variable selection process be done automatically? Please help.
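One possible way to let R finish the selection automatically is forward selection with step() from base R, which adds one term at a time and stops when the criterion no longer improves. This is only a sketch: the data frame below is a simulated stand-in with the same column names (an assumption, so that the snippet runs on its own) and should be replaced by the real df1.

```r
# Stand-in data (simulated, NOT the real dataset); replace with the real df1.
set.seed(1)
n <- 20
df1 <- data.frame(radius = rnorm(n), texture = rnorm(n), perimeter = rnorm(n),
                  area = rnorm(n), smoothness = rnorm(n))
df1$desease_rate <- 0.5 * df1$radius + rnorm(n, sd = 0.5)

# Start from the intercept-only model; the scope formula is the largest
# model that step() is allowed to reach.
null_model <- lm(desease_rate ~ 1, data = df1)
full_scope <- desease_rate ~ radius + texture + perimeter + area + smoothness

# Forward selection by AIC: at each step add the term that lowers AIC the
# most, and stop as soon as no addition lowers it.
selected <- step(null_model, scope = full_scope,
                 direction = "forward", trace = 0)

print(formula(selected))  # the model that was kept
print(AIC(selected))
```

Passing k = log(nrow(df1)) to step() would use a BIC-style penalty instead of AIC. For the regsubsets object above, summary(model) also returns the vectors cp, bic and adjr2 with one entry per row of the table, so something like which.min(summary(model)$cp) should give the subset size with the lowest Cp, and coef(model, id) the coefficients of that subset.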