
Currently I have a dataset with a class distribution of 75:25.

My first thought was to frame this as binary classification using logistic regression. Of course, I could upsample the minority class using SMOTE etc. to make more reliable predictions for it.

However, I had another question.

Why can't this problem be formulated as anomaly detection? That is, since I have plenty of positives to learn from, is it okay to consider anything that does not fall in line with the positive examples to be negative? Would anomaly detection work here?

Q1) Are there any disadvantages to this?

Q2) Can you help me understand why classification would be the best approach for an imbalanced problem? Or, if not, why anomaly detection would be the better approach?

Q3) When should one choose binary classification, and when anomaly detection?

The Great
  • I would encourage you instead to consider probabilistic class membership predictions. [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) [Classification probability threshold](https://stats.stackexchange.com/q/312119/1352) – Stephan Kolassa Feb 01 '22 at 10:13
  • @StephanKolassa - Thanks for the link, very useful. I have increased the threshold from 0.5 to 0.7, but this feels like something I am doing purely for my own convenience, to predict more negative samples correctly. The threshold is just an arbitrary decision. How can I justify choosing 0.7 over 0.5? Simply by saying it helped identify more negatives as negatives? Hence, I was looking at anomaly detection algorithms. – The Great Feb 01 '22 at 10:19
  • I understand that an imbalanced dataset isn't a problem when using arguments like `class_weight` or `scale_pos_weight`, boosting models, etc., but how does one justify the threshold choice? Any insight from your experience? – The Great Feb 01 '22 at 10:21
  • Per the first link in my comment above, I don't think weighting addresses the imbalance "problem". Rather, it "addresses" problems caused by misleading evaluation metrics like accuracy, precision, recall etc., see [Why is accuracy not the best measure for assessing classification models?](https://stats.stackexchange.com/q/312780/1352) For thresholds, I recommend setting them not to optimize some arbitrary (and, again, misleading) KPI, but based on the costs of subsequent decisions. See the second link above, and https://stats.stackexchange.com/search?tab=votes&q=user%3a1352%20threshold – Stephan Kolassa Feb 01 '22 at 10:32
  • @StephanKolassa - one quick question on this topic: should we optimize the threshold for both train and test data? Currently it is 0.5 by default for both. Should I modify the threshold on both train and test data to get the result I would like to see? – The Great Feb 21 '22 at 08:28
  • TheGreat (the @-functionality seems to be broken?): [per this thread, also linked above](https://stats.stackexchange.com/a/312124/1352), if you have a notion of the costs of different actions $\times$ outcomes, then yes, it makes perfect sense to optimize one or multiple thresholds for mapping probabilistic classifications to actions or decisions. And you can assess the quality of the thresholds on your holdout data. – Stephan Kolassa Feb 21 '22 at 08:35
  • @StephanKolassa - understood. I read the other link that you shared. Since threshold optimization is mainly a decision component (and not part of the statistics), does that mean the model will still be trained with the default `0.5` threshold and output probabilities, and we later apply a separate decision function on top of it to change the predictions the way we want? Am I right? – The Great Feb 21 '22 at 08:50
  • Most models do not use a threshold in training, but maximize, e.g., the log-likelihood, which is exactly the log loss, a proper scoring rule. Which is as it should be, since (IMO) thresholds only come in when we make decisions, while models are exclusively concerned with the statistical aspect. And as you write, the trained models (rather: their predicted class membership probabilities) can then be combined with thresholds to make decisions; a sketch of this workflow follows these comments. – Stephan Kolassa Feb 21 '22 at 09:04
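
A minimal sketch of the workflow described in the comments above, assuming scikit-learn and simulated data: the classifier is fitted by maximizing the log-likelihood (no threshold appears anywhere in training), and a decision threshold is then chosen on holdout data by minimizing expected cost. The cost values `cost_fp` and `cost_fn` are hypothetical placeholders; substitute the actual costs of your decisions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Simulated 75:25 imbalanced data, mirroring the question's setup.
X, y = make_classification(n_samples=5000, weights=[0.75, 0.25],
                           random_state=0)
X_train, X_hold, y_train, y_hold = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Training maximizes the log-likelihood (log loss); no 0.5 threshold here.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_hold = model.predict_proba(X_hold)[:, 1]

# Hypothetical misclassification costs -- replace with your real ones.
cost_fp, cost_fn = 1.0, 5.0

def expected_cost(threshold):
    pred = (p_hold >= threshold).astype(int)
    fp = np.sum((pred == 1) & (y_hold == 0))
    fn = np.sum((pred == 0) & (y_hold == 1))
    return cost_fp * fp + cost_fn * fn

# Pick the threshold that minimizes total cost on the holdout set.
thresholds = np.linspace(0.01, 0.99, 99)
best = thresholds[np.argmin([expected_cost(t) for t in thresholds])]
print(f"Cost-minimizing threshold: {best:.2f}")
```

With these made-up costs, a false negative is five times as expensive as a false positive, so the cost-minimizing threshold lands well below 0.5; the point is that the number falls out of the decision costs rather than being picked for convenience.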

1 Answer


There are various methods of anomaly detection, so the details depend on which one you use. The general idea, however, is that there is one reference class, and everything that deviates clearly from what goes on in the reference class is classified as an anomaly. This is essentially asymmetric.

If you in fact have two classes, there may (depending on the problem) be regions where the classes overlap; in some applications the classes overlap more or less strongly everywhere. With anomaly detection, no region of data space containing a good number of observations from the reference class will be classified as anomalous. In your two-class problem this means the minority class cannot be found anywhere the majority class is strongly present. Conversely, places with atypical observations from the majority class may be flagged as anomalies even if there is nothing of the minority class around.
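
To make this asymmetry concrete, here is a rough sketch, assuming scikit-learn and simulated Gaussian data (the choice of `IsolationForest` and all parameters are purely illustrative). The detector is fitted on the majority (reference) class only, so minority-class points that sit inside dense majority regions are not flagged, while atypical majority points may be:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
# Reference (majority) class, and a minority class that overlaps it.
majority = rng.normal(loc=0.0, scale=1.0, size=(750, 2))
minority = rng.normal(loc=0.5, scale=1.0, size=(250, 2))

# The anomaly detector sees only the reference class during fitting.
detector = IsolationForest(contamination=0.05, random_state=0).fit(majority)

# predict() returns +1 for "looks like the reference class", -1 for "anomaly".
flags_minority = detector.predict(minority)
print("Fraction of minority points flagged as anomalies:",
      np.mean(flags_minority == -1))  # most overlapping points are missed
```

Because the minority distribution largely overlaps the majority one, most minority points are scored as "normal" here, which is exactly the failure mode described above.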

So in a classification problem in which you want to treat both classes in the same manner, anomaly detection does not look like the right approach, even if there is imbalance, because it focuses all analysis on one reference class. Intuitively, this introduces even more asymmetry than the imbalance already implies; the latter is better dealt with using appropriate loss functions, as discussed in the links given by @StephanKolassa in the comments to the question.

PS: "When to choose anomaly detection?" When detecting anomalies relative to a reference class is really what you want, such as finding outliers, stray observations from unknown heterogeneous sources, or erroneous observations. Anomaly detection is of particular interest if you have a more or less clean training sample from the reference class, but suspect that when collecting new, unlabeled data, some observations will occur that do not belong to the reference class and come from sources not represented in the training sample.

Christian Hennig