
Hey there,

I have a question about a topic that's been discussed many times before, but for which I could not find a satisfying answer.

I'm working with a self-generated dataset that comprises only 340 data points. The dataset is not balanced, for experimental reasons: it consists of eleven classes, ranging from 60 events down to 7 events per class. Because of the data's origin I cannot use common augmentation algorithms, so we developed our own. I trained a model on these data; it performs quite well and generalizes the problem satisfactorily. I also tested different amounts of augmentation, judged by the resulting performance of the model.

My question now: Is it a good idea to use data augmentation to balance out the dataset, even though the unbalanced dataset already produces a well-performing model? Or do I actually not need a balanced dataset as long as my model performs to my satisfaction?

My concern is the integrity of my model. It is to be published as part of a larger project and I just want to make sure that it stands up to the review process.

I welcome any ideas and feedback on this topic.

TheoBoveri
    Does this help? [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Oct 20 '21 at 16:42
  • As my mathematical understanding of this problem is very limited, it only helps to a certain point. I get your argument that an unbalanced dataset is not a problem in itself, but only becomes one if one uses *accuracy* as a metric. After reading some of your posts I also understand that you are not a big fan of accuracy. I feel a bit lost in the jungle of different metrics. Could you perhaps suggest a more suitable metric for such a multiclass classification model that also works with an unbalanced dataset? – TheoBoveri Oct 21 '21 at 06:15
  • I would usually recommend going with a probabilistic classifier, i.e., one that for each instance gives a predicted *probability* of it belonging to classes A, B, C or D (with predicted probabilities summing to 1). I would argue that the *decision* on what to do with this classification is a separate issue and should be informed by the costs of possibly wrong actions - even if there is a *small* possibility of a malignant cancer, we would want to run additional tests, rather than treat the patient as "healthy", simply because P(healthy)=0.60. https://stats.stackexchange.com/a/312124/1352 – Stephan Kolassa Oct 21 '21 at 07:39
  • You can assess the quality of probabilistic classifications using *proper scoring rules*. [The tag wiki](https://stats.stackexchange.com/tags/scoring-rules/info) contains information and references. Note that many scoring rules are only formulated for binary classifications, but many work just as well for multi-class situations. [This thread](https://stats.stackexchange.com/q/274088/1352) compares the log and the Brier score with a little specific emphasis on multi-class classifications. Good luck! – Stephan Kolassa Oct 21 '21 at 07:42
  • My model actually uses `categorical_crossentropy` as its loss function. As I read elsewhere, the *log loss score* is also referred to as *cross entropy*, right? Also, the output of my network runs through a *softmax* function, resulting in scores in $[0,1]$. So do these circumstances fit your recommendation above to use a probabilistic classifier, or am I still missing something? – TheoBoveri Oct 21 '21 at 08:55
  • That does sound promising! Then my recommendation would simply be to work directly with the scoring in $[0,1]$ and not check them against some threshold. – Stephan Kolassa Oct 21 '21 at 09:02
  • I am already working directly with the $[0,1]$ scoring. So I assume it's fine for me to work with an unbalanced dataset :) That's fantastic news. Thanks a lot for your help. – TheoBoveri Oct 21 '21 at 09:40
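
To make the proper-scoring-rule suggestion from the comments concrete, here is a minimal sketch (not part of the original exchange) of the multiclass log loss and Brier score computed directly from softmax probabilities; the arrays `y_true` and `probs` are hypothetical stand-ins for the one-hot labels and the model's predicted probabilities.

```python
import numpy as np

def multiclass_log_loss(y_true, probs, eps=1e-15):
    """Mean categorical cross-entropy (log loss) over all instances.

    y_true: (n_samples, n_classes) one-hot labels
    probs:  (n_samples, n_classes) predicted probabilities, rows summing to 1
    """
    probs = np.clip(probs, eps, 1.0)                      # avoid log(0)
    return -np.mean(np.sum(y_true * np.log(probs), axis=1))

def multiclass_brier_score(y_true, probs):
    """Mean squared distance between the probability vector and the one-hot label."""
    return np.mean(np.sum((probs - y_true) ** 2, axis=1))

# Hypothetical example: 4 instances, 3 classes.
y_true = np.array([[1, 0, 0],
                   [0, 1, 0],
                   [0, 0, 1],
                   [1, 0, 0]])
probs = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1],
                  [0.2, 0.3, 0.5],
                  [0.6, 0.3, 0.1]])

print("log loss:   ", multiclass_log_loss(y_true, probs))     # lower is better
print("Brier score:", multiclass_brier_score(y_true, probs))  # lower is better
```

Both scores are averaged per instance and are computed straight from the predicted probabilities, so nothing is checked against a threshold and the class imbalance does not need to be "fixed" before evaluating.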
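
The point made in the comments about keeping the *decision* separate from the predicted probabilities can be sketched the same way; the cost matrix below is entirely made up for illustration and would have to come from the actual application (the consequences of each kind of right or wrong action).

```python
import numpy as np

# Hypothetical cost matrix: cost[true_class, chosen_action].
# All numbers are invented for illustration only.
cost = np.array([[0.0, 1.0, 1.0],
                 [5.0, 0.0, 2.0],
                 [1.0, 2.0, 0.0]])

# Predicted class probabilities for a single instance (softmax output).
probs = np.array([0.60, 0.25, 0.15])

# Expected cost of each possible action under the predicted probabilities.
expected_cost = probs @ cost

cost_aware_action = int(np.argmin(expected_cost))  # action with lowest expected cost
plain_argmax = int(np.argmax(probs))               # most likely class

print("expected costs:   ", expected_cost)       # [1.4, 0.9, 1.1]
print("cost-aware action:", cost_aware_action)   # 1
print("argmax class:     ", plain_argmax)        # 0
```

With these hypothetical numbers the cost-aware action differs from the plain argmax of the probabilities, which mirrors the cancer example from the comments: the most probable class is not necessarily the class to act on.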

0 Answers