2

Suppose we want to build a binary classifier with a weighted loss, i.e., one that penalizes different types of errors (false positives and false negatives) differently. At the same time, the software we are using does not support a weighted loss.

Can I hack around this by manipulating my data?

For example, suppose we are working on a fraud detection problem (let's assume the prior is 50% fraud vs. 50% normal here, although most fraud detection problems are extremely imbalanced), where we can afford some false positives (false alerts on normal transactions) but really want to avoid false negatives (missed detections of fraud transactions).

Let's say we want the loss ratio to be 1:5 (false positive : false negative). Can we make 5 copies of each fraud transaction?

Intuitively, by duplicating these rows we change the prior distribution, and the model becomes more likely to label a transaction as fraud, so false negatives will be reduced.
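(To make the intuition concrete, here is a minimal sketch with a made-up toy dataset, duplicating the fraud rows with numpy and checking that the empirical prior shifts from 1/2 to 5/6:)

```python
import numpy as np

# Toy data: 4 transactions, half fraud (label 1), half normal (label 0)
X = np.array([[1.0, 0.2], [0.9, 0.1], [0.1, 0.8], [0.2, 0.9]])
y = np.array([1, 1, 0, 0])

# Make 5 total copies of every fraud row (for the 1:5 FP:FN cost ratio)
fraud = y == 1
X_up = np.vstack([X, np.repeat(X[fraud], 4, axis=0)])  # 4 extra copies each
y_up = np.concatenate([y, np.repeat(y[fraud], 4)])

print(y.mean())     # prior before duplication: 0.5
print(y_up.mean())  # prior after duplication: 10/12 = 5/6
```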

My guess is that if we are truly minimizing 0-1 loss, this can do the trick, but if we are minimizing a proxy loss such as logistic or hinge loss (see this post), then this hack will not work well.

Any formal/mathematical explanations?

Haitao Du
    This question seems to confuse the concepts of "positive/negative" *responses* with "false positive" and "false negative" *errors*. Because the two are completely different, it is not reasonable to expect your approach to have any useful properties in general. But maybe you have some very special types of data and classification procedures in mind? If so, what are they? – whuber Jul 08 '16 at 14:41
  • @whuber Thanks, I will revise and clarify my question with some use case. – Haitao Du Jul 08 '16 at 14:48
  • @whuber could you help me to check if the revision make the question more clear? – Haitao Du Jul 08 '16 at 14:56
  • @hxd1011 Non-mathematical side note on using copies of samples in training: yes, this is possible and used sometimes. See e.g. `caret::upSample` in R: "up-sampling: randomly sample (with replacement) the minority class to be the same size as the majority class." As you concluded, the core idea is to change how the model fits your data (otherwise models would just be equal with and without it). – geekoverdose Jul 08 '16 at 16:13

1 Answer

4

Yes, you can (as long as your weights are integers, or rationals to be pedantic, since rational weights can be rescaled to integers), though it's obviously not very efficient.

To see this, note that most loss functions can be written as $$\text{loss}(y, p) = \sum_{i=1}^n l(y_i, p_i)$$ where $p_i$ is the predicted value of $y_i$ for a suitable function $l$.

We can easily transform this to a weighted loss function by introducing weights: $$\text{weighted loss}(y, p) = \sum_{i=1}^n w_i l(y_i, p_i)$$

Now we see that if we duplicate each observation $i$ exactly $w_i$ times and minimize the (unweighted) loss, this is equivalent to minimizing the weighted loss with weights $w_i$. Of course, duplicating something $\pi$ times is difficult, so make sure your weights are integers.
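(A short numeric check of this equivalence, using logistic loss as the $l$ above and some arbitrary fixed predictions: the weighted loss with integer weights $w_i$ equals the unweighted loss on a dataset where observation $i$ appears $w_i$ times.)

```python
import numpy as np

def log_loss(y, p):
    """Per-observation logistic loss l(y_i, p_i)."""
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 1, 0, 0])
p = np.array([0.7, 0.4, 0.2, 0.6])   # some fixed predicted probabilities
w = np.array([5, 5, 1, 1])           # integer weights, 5:1 for the positive class

weighted = np.sum(w * log_loss(y, p))

# Duplicate observation i exactly w_i times, then use the *unweighted* loss
y_dup = np.repeat(y, w)
p_dup = np.repeat(p, w)
unweighted_on_duplicated = np.sum(log_loss(y_dup, p_dup))

print(np.isclose(weighted, unweighted_on_duplicated))  # True
```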

Note that adding a regularization penalty to the loss function does not affect this reasoning: the penalty term is the same in both objectives, so the minimizers coincide.

Sven
  • Have you seen the comment by @whuber? – Richard Hardy Jul 26 '16 at 19:49
  • hi @RichardHardy I didn't quite follow whuber's comment, could you explain a bit what the problem with this answer is? thanks! – dontloo Jul 27 '16 at 08:03
  • @dontloo, the OP is interested in a different kind of loss, let me cite: "...binary classifier with weighted loss, i.e., it penalize different types of errors (false positive and false negative) differently". Meanwhile, Sven considers weighting instances, i.e. making one data point more important than another. The two are not the same, and the latter does not facilitate the former. – Richard Hardy Jul 27 '16 at 08:07
  • @RichardHardy thank you, I see the difference. But I'm still not sure why the latter does not facilitate the former. If we increase the weights of the entire class and find a solution that minimizes the weighted-sample loss, doesn't that solution also minimize the weighted-class loss? – dontloo Jul 27 '16 at 08:25
  • I think it does (if I have the right definition of weighted class loss in my head), but it is not what the OP is after. Weighing the observations will not make e.g. negative losses be penalized harsher than positive losses, or the like. It would be nice to come up with a simple counterexample, but I do not have the time now. But feel free to disagree, I could be wrong. – Richard Hardy Jul 27 '16 at 08:36
  • @RichardHardy I think that the answer above answers the question if you weight the observations with y_i = 0 differently than the observations with y_i = 1. It might not be true that putting the weights at 5:1 gives a false positive rate : false negative rate of 5:1 as well, but you can definitely adjust the weights to achieve that ratio. Also, the question asks whether we can hack a weighted loss by duplicating observations, which is true. – Sven Jul 27 '16 at 15:23
  • Sven and @dontloo, hmm, probably you are right. Frankly, I have never worked on classification problems, so I might have some misunderstandings. Weighting like this would not work for a regression, but maybe it does work for classification. Could be my bad, after all. – Richard Hardy Jul 27 '16 at 17:04
  • yeah, as the OP, I also do not quite understand whuber's comment... – Haitao Du Aug 01 '16 at 14:03