I am a beginner in R. I am fitting a logistic regression with around 80 independent variables using the `glm` function. The dependent variable is `churn`, which indicates whether a customer churned or not. I want to know how to identify the right combination of variables to get a good predictive logistic regression model in R. I would also like to know how to do the same for building a good decision tree in R (I am using the `ctree` function from the `party` package).
So far, I have used the `drop1` function and `anova(LogMdl, test="Chisq")`, where `LogMdl` is my logistic regression model, to drop unwanted variables from the predictive model. However, the maximum accuracy I was able to achieve was only 60%.
Also, I am not sure whether I am using the `drop1` and `anova` functions correctly. With `drop1`, I dropped the variables with the lowest AIC. With `anova`, I dropped the variables with a p-value > 0.05.
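To make the elimination step concrete, here is a minimal sketch of backward elimination with `drop1` and `step`. It runs on simulated data, not on the Cell2Cell file: the data frame `toy` and its columns `x1`, `x2`, and `noise` are made up for illustration, and only `churn` reuses a name from the real data.

```r
# Simulated stand-in data (an assumption, not the Cell2Cell data set):
# x1 and x2 drive the outcome, noise does not.
set.seed(1)
n <- 500
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$churn <- rbinom(n, 1, plogis(-0.5 + 1.2 * toy$x1 - 0.8 * toy$x2))

fit <- glm(churn ~ x1 + x2 + noise, data = toy,
           family = binomial(link = "logit"))

# drop1() refits the model once per term, leaving that term out.
# The AIC column is the AIC of the model *without* that term, so a term
# is a candidate for removal when its row has an AIC below the
# "<none>" row (the current model).
d1 <- drop1(fit, test = "Chisq")
print(d1)

# step() repeats this one term at a time until no removal lowers the AIC.
reduced <- step(fit, direction = "backward", trace = 0)
print(formula(reduced))
```

Note that `drop1` should be re-run after each removal, since the AIC and p-values of the remaining terms change once a variable leaves the model; `step` does this automatically.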
Please help me identify the right set of variables for both the logistic regression and the decision tree models, so that I can increase predictive accuracy to close to 90%, or higher if possible.
library(party)
setwd("D:/CIS/Project work")
CellData <- read.csv("Cell2Cell_SPSS_Data - Orig.csv")
trainData <- subset(CellData, calibrat == "1")  # training data set
testData <- subset(CellData, calibrat == "0")   # validation or test data set
# Logistic regression of churn on all candidate predictors
LogMdl <- glm(churn ~ revenue + mou + recchrge + directas +
                overage + roam + changem +
                changer + dropvce + blckvce + unansvce +
                custcare + threeway + mourec +
                outcalls + incalls + peakvce + opeakvce +
                dropblk + callfwdv + callwait +
                months + uniqsubs + actvsubs + phones + models +
                eqpdays + customer + age1 + age2 +
                children + credita + creditaa +
                creditb + creditc + creditde + creditgy + creditz +
                prizmrur + prizmub +
                prizmtwn + refurb + webcap + truck +
                rv + occprof + occcler +
                occcrft + occstud + occhmkr + occret +
                occself + ownrent + marryun +
                marryyes + marryno + mailord + mailres +
                mailflag + travel + pcown +
                creditcd + retcalls + retaccpt + newcelly + newcelln +
                refer + incmiss +
                income + mcycle + creditad + setprcm + setprc + retcall,
              data = trainData, family = binomial(link = "logit"),
              control = list(maxit = 50))
ProbMdl <- predict(LogMdl, testData, type = "response")  # predicted churn probabilities
testData$churndep <- rep(0, nrow(testData))  # start with every prediction set to 0
testData$churndep[ProbMdl > 0.5] <- 1        # mark records with prob > 0.5 as churned
table(testData$churndep, testData$churn)     # compare predicted and actual churn
mean(testData$churndep != testData$churn)    # misclassification rate (as a proportion)
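When reading the accuracy figure, it may help to compare it with the accuracy of always predicting the more common class. The sketch below does that with a hypothetical helper, `evaluate_predictions` (not part of any package), on made-up probabilities and outcomes; with the real data, `prob` would be `ProbMdl` and `actual` would be `testData$churn`.

```r
# Hypothetical helper: confusion matrix, accuracy, and majority-class
# baseline for 0/1 outcomes at a given probability cutoff.
evaluate_predictions <- function(prob, actual, cutoff = 0.5) {
  predicted <- as.integer(prob > cutoff)
  confusion <- table(
    predicted = factor(predicted, levels = c(0, 1)),
    actual = factor(actual, levels = c(0, 1))
  )
  accuracy <- sum(diag(confusion)) / sum(confusion)
  # Accuracy obtained by always predicting the more common class
  baseline <- max(table(actual)) / length(actual)
  list(confusion = confusion, accuracy = accuracy, baseline = baseline)
}

# Made-up values, for illustration only
prob <- c(0.9, 0.2, 0.7, 0.4, 0.1, 0.6)
actual <- c(1, 0, 0, 1, 0, 1)
res <- evaluate_predictions(prob, actual)
print(res$confusion)
print(res$accuracy)
print(res$baseline)
```

If the classes are heavily imbalanced, a model can score a high accuracy simply by matching the baseline, so the two numbers are worth reporting together.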
Link for documentation of variables: https://drive.google.com/file/d/0B9y78DHd3U-DZS05VndFV3A4Ylk/
Link for Dataset (.csv file) : https://drive.google.com/file/d/0B9y78DHd3U-DYm9FOV9zYW15bHM/
I could not post the output of `dput` because the data is larger than 5 MB, so I have zipped the file and placed it at the link above.
Description of important variables:
* `churn` is the variable that says whether a customer churned or not.
* `churndep` is the variable that needs to be predicted in the test data (validation data) and compared with the `churn` variable, which is already populated with the actual churn.
For both `churn` and `churndep`, a value of 1 means churned and 0 means not churned.