
(Not an expert in anomaly detection.)

I'd like to experiment with per-class anomaly detection.

That is, we have a feature vector $x$, and a classifier that predicts its class $\hat{y}$. I'd like to see if the combination $(x, \hat{y})$ is an anomaly, given some training set of non-anomalous $(x, y)$ pairs.

It seems that I can either train one joint anomaly detector on $P(x,y)$, or train multiple independent detectors, one per class, on $P(x|y)$.

I think the latter is easier and sufficient. Are there any downsides? Also, is there a name for this technique?
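For concreteness, here is a minimal sketch of the per-class variant I have in mind. The choice of scikit-learn's `IsolationForest` as the detector, and the helper names `fit_per_class_detectors` / `is_anomaly`, are just my own illustration, not a prescribed method:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

def fit_per_class_detectors(X, y):
    """Fit one detector per class on non-anomalous (x, y) pairs, i.e. a model of P(x|y)."""
    detectors = {}
    for label in np.unique(y):
        det = IsolationForest(random_state=0)
        det.fit(X[y == label])      # only the training rows of this class
        detectors[label] = det
    return detectors

def is_anomaly(detectors, x, y_hat):
    """Route (x, y_hat) to the detector for the predicted class; -1 means outlier."""
    return detectors[y_hat].predict(x.reshape(1, -1))[0] == -1
```

The joint alternative would instead fit a single detector on features that include an encoding of $y$ (e.g. a one-hot of the class concatenated to $x$).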

kennysong
  • It is not clear: do you observe the class $y$, or do you predict it using a classifier? – Konstantin Dec 12 '20 at 17:09
  • The true class is observed at training time, and predicted when using the anomaly detector. It doesn't matter for your answer, though. – kennysong Dec 13 '20 at 05:44
  • I'd say that the predicted label $\hat y$ then simply aggregates information from $x$, adding nothing new, so there is no point in using it for anomaly detection. – Konstantin Dec 13 '20 at 09:35

1 Answer


By the product rule of probability, the joint probability of a sample observation $(x,y)$ decomposes into the familiar product:

$$ P(x,y) = P(x|y)P(y).$$

In other words, we can always partition the set of features, $(x,y)$, into two subsets, $x$ and $y$, and then decompose the probability of the sample into the marginal probability of one subset $P(y)$ and the conditional probability of the other $P(x|y)$.

Note that the joint probability of the sample observation might fall below the anomaly threshold for two reasons:

  • the probability of the observed values for a subset of features, $P(y)$, is low, or
  • the probability of the observed values for the complementary subset, conditional on the values of the first subset, $P(x|y)$, is low.

When you only use the $P(x|y)$ model, you forgo the information contained in the realization of $y$. In other words, you implicitly accept the observed value of $y$, whether it is the modal value (which is fine) or an extremely rare one (which is not: the rarity of $y$ alone may already be sufficient reason to flag the sample).
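A rough toy sketch of that point (my own example; the class names, priors, and Gaussian fits for $P(x|y)$ are invented purely for illustration):

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy training summary: empirical class priors P(y) and per-class Gaussians for P(x|y).
priors = {"common": 0.99, "rare": 0.01}
cond = {
    "common": multivariate_normal(mean=[0.0], cov=[[1.0]]),
    "rare":   multivariate_normal(mean=[5.0], cov=[[1.0]]),
}

def conditional_log_score(x, y):
    return cond[y].logpdf(x)                      # log P(x|y)

def joint_log_score(x, y):
    return cond[y].logpdf(x) + np.log(priors[y])  # log P(x,y) = log P(x|y) + log P(y)

# A point that is perfectly typical *for the rare class*:
x, y = 5.0, "rare"
print(conditional_log_score(x, y))  # ~ -0.92: P(x|y) alone sees nothing unusual
print(joint_log_score(x, y))        # ~ -5.52: the log P(y) term penalizes the rare class
```

Whether that lower joint score actually crosses the anomaly threshold depends on how the threshold is set, but it makes explicit the $\log P(y)$ term that a conditional-only detector discards.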

Konstantin
  • Thanks, I agree with your analysis. There's also an orthogonal consideration of whether P(x,y) or P(x|y) is easier to learn. I think usually, P(x|y) is simpler since it's not a mixture, but P(x,y) may be easier if the classes are imbalanced & there is reusable structure between the classes. – kennysong Dec 13 '20 at 05:54
  • Yes, I like the point on the balance between classes. Depending on the size of your data, you can also face a trade-off: smaller models may be easier to train, but they may perform worse due to insufficient training data. – Konstantin Dec 13 '20 at 09:31