I am writing my thesis about binge drinking at my university. I have run a survey and collected 957 instances with 29 variables that could be interesting, plus a binary target for binge drinking. The class balance between binge drinkers and non-binge drinkers in the dataset is:
- 23% yes (220)
- 77% no (737)
What should perc.over and perc.under be in R if I apply SMOTE to my dataset?
newData <- SMOTE(Riskdrinker ~ ., bob, perc.over = 200)
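To make the question concrete, here is a sketch of the arithmetic as DMwR::SMOTE defines its parameters (I am assuming the DMwR package here, since the call above matches its signature; `bob` and `Riskdrinker` are the object names from the question). perc.over = 200 creates 2 synthetic cases per minority case, so 220 grows to 220 * (1 + 200/100) = 660; perc.under is taken relative to the 440 synthetic cases, so perc.under = 150 samples 1.5 * 440 = 660 majority cases, giving a 50/50 balance:

```r
library(DMwR)

# SMOTE in DMwR requires a factor target
bob$Riskdrinker <- as.factor(bob$Riskdrinker)

set.seed(42)  # synthetic cases are generated randomly; fix the seed
newData <- SMOTE(Riskdrinker ~ ., bob,
                 perc.over  = 200,   # 220 minority -> 220 + 2*220 = 660
                 perc.under = 150)   # 1.5 * 440 synthetic = 660 majority

table(newData$Riskdrinker)  # should be roughly balanced
```

With perc.under = 200 instead, you would keep 880 majority cases against 660 minority, i.e. still mildly unbalanced, so the exact values depend on the balance you are after.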
The aim of the thesis is to identify which attributes contribute most to binge drinking. I have experimented with decision tree, random forest and logistic regression models. However, I am curious and slightly confused about how exactly I should present my results in a professional manner.
My initial plan has been to structure the results like so and present the following items:
- Decision trees
- Display the various decision trees over 1 or 2 iterations, e.g. pruned/unpruned and with cross-validation.
- Display the decision tree's variable importance table.
- Random Forest
- Display the Random Forest variable importance table, with and without cross-validation.
- Logistic regression
- Run the coefficient table through stargazer and compare the p-values for significance.
- Evaluation
Compare all models on precision, recall and accuracy.
- Concerns
Since my dataset is rather unbalanced, the results depend heavily on how the training and test sets are sampled, for example on how many positive versus negative instances of the dependent variable each set contains. I have noticed this over many iterations and samples from the dataset. Is it important that I state the number of positive versus negative instances for each decision tree that I present, and likewise for the rest of the models?
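One way I could reduce that sensitivity is a stratified split, so every split keeps the original 23%/77% ratio. A minimal sketch with caret (object and column names taken from the question, split proportion chosen arbitrarily):

```r
library(caret)

set.seed(123)  # fix the seed so the split, and hence the results, are reproducible
idx   <- createDataPartition(bob$Riskdrinker, p = 0.7, list = FALSE)
train <- bob[idx, ]
test  <- bob[-idx, ]

# these are the positive/negative counts I could report next to each model
table(train$Riskdrinker)
table(test$Riskdrinker)
```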
What are your suggestions for displaying and comparing the results, and for testing, given how sensitive the results are to the sample? One idea is oversampling, that is, replicating the samples of the minority class.
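For the comparison itself, my current thought is to compute all three metrics through one function so the table is consistent across models. A sketch with caret's confusionMatrix(), where `fit` stands in for any of the fitted models and `test` for a held-out test set (both placeholder names):

```r
library(caret)

pred <- predict(fit, newdata = test, type = "class")  # rpart / randomForest style
cm   <- confusionMatrix(pred, test$Riskdrinker, positive = "yes")

cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall")]  # available in recent caret versions
```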
Also, I am working in R, so I have mainly been using caret, rpart, randomForest and glm for my experiments.
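Since those packages each report importance in their own format, one option I am considering is caret::varImp(), which has methods for all three model types and would let me build a single side-by-side importance table (`train` and the fit names below are placeholders):

```r
library(caret)
library(rpart)
library(randomForest)

tree_fit <- rpart(Riskdrinker ~ ., data = train, method = "class")
rf_fit   <- randomForest(Riskdrinker ~ ., data = train)
glm_fit  <- glm(Riskdrinker ~ ., data = train, family = binomial)

varImp(tree_fit)  # rpart's variable.importance, rescaled
varImp(rf_fit)    # mean decrease in Gini
varImp(glm_fit)   # absolute z statistics of the coefficients
```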