I am writing my thesis about binge drinking at my university. I have run a survey and collected 957 instances with 29 variables that could be interesting, plus a binary target for binge drinking. The class balance between binge drinkers and non-binge drinkers in the dataset is:
- 23% yes (220)
- 77% no (737)
What should perc.over and perc.under be in R if I apply SMOTE to my dataset?
newData <- SMOTE(Riskdrinker ~ ., bob, perc.over = 200)
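To make the question concrete, here is a sketch of the arithmetic as DMwR::SMOTE defines its parameters (I am assuming the DMwR package here, since the call above matches its signature; `bob` and `Riskdrinker` are the object names from the question). perc.over = 200 creates 2 synthetic cases per minority case, so 220 grows to 220 * (1 + 200/100) = 660; perc.under is taken relative to the 440 synthetic cases, so perc.under = 150 samples 1.5 * 440 = 660 majority cases, giving a 50/50 balance:

```r
library(DMwR)

# SMOTE in DMwR requires a factor target
bob$Riskdrinker <- as.factor(bob$Riskdrinker)

set.seed(42)  # synthetic cases are generated randomly; fix the seed
newData <- SMOTE(Riskdrinker ~ ., bob,
                 perc.over  = 200,   # 220 minority -> 220 + 2*220 = 660
                 perc.under = 150)   # 1.5 * 440 synthetic = 660 majority

table(newData$Riskdrinker)  # should be roughly balanced
```

With perc.under = 200 instead, you would keep 880 majority cases against 660 minority, i.e. still mildly unbalanced, so the exact values depend on the balance you are after.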
The aim of the thesis is to identify which attributes contribute most to binge drinking. I have experimented with decision tree, random forest and logistic regression models. However, I am curious and slightly confused about how exactly I should present my results in a professional manner.
My initial plan has been to structure the results like so and present the following items:
- Decision trees
- Display the various decision trees over 1 or 2 iterations, e.g. pruned/unpruned and with cross-validation.
- Display the decision tree's variable importance table.
- Random Forest
- Display the Random Forest variable importance table, with and without cross-validation.
- Logistic regression
- Run the coefficient table through stargazer and compare the p-values for significance.
- Evaluation
Compare all models on precision, recall and accuracy.
- Concerns
Since my dataset is rather unbalanced, the results depend heavily on how the training and test sets are sampled, for example on how many positive versus negative instances of the dependent variable each set contains. I have noticed this over many iterations and samples from the dataset. Is it important that I state the number of positive versus negative instances for each decision tree that I present, and likewise for the rest of the models?
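One way I could reduce that sensitivity is a stratified split, so every split keeps the original 23%/77% ratio. A minimal sketch with caret (object and column names taken from the question, split proportion chosen arbitrarily):

```r
library(caret)

set.seed(123)  # fix the seed so the split, and hence the results, are reproducible
idx   <- createDataPartition(bob$Riskdrinker, p = 0.7, list = FALSE)
train <- bob[idx, ]
test  <- bob[-idx, ]

# these are the positive/negative counts I could report next to each model
table(train$Riskdrinker)
table(test$Riskdrinker)
```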
What are your suggestions for displaying and comparing the results, and for testing, given how sensitive the results are to the sample? One idea is oversampling, that is, replicating the samples of the minority class.
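For the comparison itself, my current thought is to compute all three metrics through one function so the table is consistent across models. A sketch with caret's confusionMatrix(), where `fit` stands in for any of the fitted models and `test` for a held-out test set (both placeholder names):

```r
library(caret)

pred <- predict(fit, newdata = test, type = "class")  # rpart / randomForest style
cm   <- confusionMatrix(pred, test$Riskdrinker, positive = "yes")

cm$overall["Accuracy"]
cm$byClass[c("Precision", "Recall")]  # available in recent caret versions
```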
Also, I am working in R, so I have mainly been using caret, rpart, randomForest and glm for my experiments.
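Since those packages each report importance in their own format, one option I am considering is caret::varImp(), which has methods for all three model types and would let me build a single side-by-side importance table (`train` and the fit names below are placeholders):

```r
library(caret)
library(rpart)
library(randomForest)

tree_fit <- rpart(Riskdrinker ~ ., data = train, method = "class")
rf_fit   <- randomForest(Riskdrinker ~ ., data = train)
glm_fit  <- glm(Riskdrinker ~ ., data = train, family = binomial)

varImp(tree_fit)  # rpart's variable.importance, rescaled
varImp(rf_fit)    # mean decrease in Gini
varImp(glm_fit)   # absolute z statistics of the coefficients
```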