At the moment I'm trying to predict adverse events in the next 8 hours for hospital patients receiving a certain type of treatment, using Python and pandas. Every row in my dataset represents one treatment and contains features like blood values and settings of the machines being used. The outcome of the model is a ratio of two blood values, which can either be regressed directly or classified in a binary way (the event is adverse if this ratio exceeds 2.5).
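For concreteness, the target is constructed roughly like this (the column names and file name are placeholders, not my real ones):

```python
import pandas as pd

# Each row is one treatment; features are blood values and machine settings.
df = pd.read_csv("treatments.csv")  # placeholder file name

# Outcome: the ratio of two blood values, optionally binarised at 2.5.
df["ratio"] = df["blood_value_a"] / df["blood_value_b"]
df["adverse"] = (df["ratio"] > 2.5).astype(int)  # 1 = adverse event
```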
The problem I'm having is that only 437 rows (treatments) are available, and patients are unevenly represented: one patient alone accounts for 54 of them.
I'd prefer to validate my models on unseen data and to keep patients separate in every split, but I'm not sure whether that's feasible with so little data.
At the moment I'm splitting my data 70/30 and making sure that no patient appears in both the train and test set. This approach feels rather weak, and I'd prefer to cross-validate my models.
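For reference, my current split looks roughly like this (using scikit-learn's GroupShuffleSplit; `patient_id` is a placeholder for my patient identifier column):

```python
from sklearn.model_selection import GroupShuffleSplit

# One 70/30 split that keeps every patient's treatments entirely
# on one side of the split.
gss = GroupShuffleSplit(n_splits=1, test_size=0.3, random_state=42)
train_idx, test_idx = next(gss.split(df, groups=df["patient_id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]
```

For cross-validation I would presumably swap this for GroupKFold (or StratifiedGroupKFold, to also balance the adverse class across folds), but with only 437 rows I'm not sure that's sound. Therefore my questions are: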
1. How can I improve this validation approach in general, given the small amount of data and the imbalance in patient representation?
2. Would it hurt to drop the 'keep patients separate in every split' constraint in this kind of setting?