I will use the example of comparing two different antifraud systems to better illustrate the statistical problem.
(An antifraud is an algorithm that analyzes whether a transaction is legitimate or fraudulent. If the algorithm sees too much risk in a transaction, it classifies it as fraud and does not allow the transaction to continue. This is most common in e-commerce, which can be targeted by hackers, stolen credit cards and so on.)
Suppose you want to determine which of two different antifrauds is better.
The ideal way to do this is to run a randomized experiment: route half of the transactions to each algorithm, build a confusion matrix for each, and then decide whether precision or recall matters more for your specific problem.
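Just to make the idea concrete, here is a minimal sketch of such an experiment in Python; the 2% fraud rate and the `predict_a` / `predict_b` stand-ins are made-up assumptions, not real antifraud models.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

rng = np.random.default_rng(0)
n = 100_000
y_true = (rng.random(n) < 0.02).astype(int)   # 1 = fraud, 2% base rate (made up)

use_a = rng.random(n) < 0.5                   # random 50/50 routing of transactions

def predict_a(y):                             # stand-in for antifraud A, ~90% accurate
    return np.where(rng.random(y.size) < 0.90, y, 1 - y)

def predict_b(y):                             # stand-in for antifraud B, ~85% accurate
    return np.where(rng.random(y.size) < 0.85, y, 1 - y)

for name, mask, predict in [("A", use_a, predict_a), ("B", ~use_a, predict_b)]:
    y, y_hat = y_true[mask], predict(y_true[mask])
    print(name)
    print(confusion_matrix(y, y_hat))
    print("precision:", round(precision_score(y, y_hat), 3),
          "recall:", round(recall_score(y, y_hat), 3))
```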
In reality, however, few companies run a controlled experiment; the most common case is that we have a single antifraud and later attach a second one to serve as a backup, and that is where the statistical problems of censored and skewed data begin.
In this real case, we have an antifraud 'A' through which all transactions pass, and an antifraud 'B' that runs the same analyses on the same transactions but whose decision we only use when 'A' does not work.
Thus we construct the confusion matrix of A as follows:
- Normal transaction predicted as normal (Normal / Normal)
- Normal transaction predicted as fraud (Normal / Fraud)
- Fraudulent transaction predicted as normal (Fraud / Normal)
- Fraudulent transaction predicted as fraud (Fraud / Fraud)
The first problem is that when algorithm A says a transaction is fraud, the transaction does not move forward, so we cannot know whether it really was fraud or not. We only ever identify the frauds among the transactions that A passed as normal.
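To make the bookkeeping concrete, here is a sketch of how A's matrix ends up being tabulated from logged data; the log structure (a `decision_a` field plus a `true_label` that is only learned for transactions that completed, e.g. via a later chargeback) is a hypothetical assumption.

```python
from collections import Counter

# Hypothetical log of transactions scored by antifraud A.  true_label is only
# known for transactions that were allowed to complete; for transactions that
# A blocked it stays None (censored).
log = [
    {"decision_a": "normal", "true_label": "normal"},
    {"decision_a": "normal", "true_label": "fraud"},
    {"decision_a": "fraud",  "true_label": None},   # blocked: outcome never observed
    {"decision_a": "normal", "true_label": "normal"},
    {"decision_a": "fraud",  "true_label": None},   # blocked: outcome never observed
]

cells = Counter()
for t in log:
    if t["decision_a"] == "normal":
        # Approved transactions complete, so we eventually learn their label:
        # the (Normal/Normal) and (Fraud/Normal) cells are observable.
        cells[(t["true_label"], "predicted normal")] += 1
    else:
        # Blocked transactions never complete: (Normal/Fraud) and (Fraud/Fraud)
        # cannot be told apart and collapse into a single censored bucket.
        cells[("unknown", "predicted fraud")] += 1

print(cells)
```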
The second big problem is that when we build the same matrix for algorithm B, the data is heavily skewed: although we can fill in all four cells of the matrix, we can only evaluate B on the transactions that the first algorithm let through (everything algorithm A flagged as fraud did not go forward and can no longer be analyzed). This makes algorithm B look artificially better than A, since B gets credit for identifying A's failures while its own failures are never exposed in the same way.
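A small simulation illustrates this inflation. Everything in it is made up for illustration: the fraud rate, the operating points, and the assumption that A's and B's errors are independent. Even with two algorithms of identical true quality, B looks strictly better on the frauds we can actually observe:

```python
import numpy as np

rng = np.random.default_rng(42)
n = 200_000
is_fraud = rng.random(n) < 0.02                    # hypothetical 2% fraud rate

def noisy_flag(is_fraud, tpr, fpr):
    """Flag as fraud with rate tpr on real frauds and fpr on normal transactions."""
    return rng.random(is_fraud.size) < np.where(is_fraud, tpr, fpr)

flag_a = noisy_flag(is_fraud, tpr=0.70, fpr=0.01)  # A's made-up operating point
flag_b = noisy_flag(is_fraud, tpr=0.70, fpr=0.01)  # B with the *same* true quality

# Only transactions A approved go forward, so the true label is only ever
# observed on that subset; frauds that A blocked vanish from the data.
observable_frauds = is_fraud & ~flag_a

recall_a = (flag_a & observable_frauds).sum() / observable_frauds.sum()  # 0 by construction
recall_b = (flag_b & observable_frauds).sum() / observable_frauds.sum()  # ~ B's true rate

print(f"apparent recall of A on the observable frauds: {recall_a:.2f}")
print(f"apparent recall of B on the observable frauds: {recall_b:.2f}")
```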
How can we get around the bias and data censoring in such a problem?
Can we use Bayesian inference to get around this?