I recently interviewed for a machine learning job, and the interview involved some mathematically rigorous questions. This is one of them, and I'm still quite confused about it.
Question: Given a data generating distribution $\mathbb{P}(x,y)$ and a classifier $f(x)$, where $x$ is the observation and $y$ is the label, prove that the inaccuracy given by:
$$\mathbb{E}[\mathbb{1}(f(x) \neq y)]$$
is minimized by the classifier: $$f^*(x) = \text{sign}\big(\mathbb{P}(y = 1 \mid x) - \tfrac{1}{2}\big)$$
where $\mathbb{1}(a \neq b)$ is $1$ when $a \neq b$ and $0$ otherwise, and $\text{sign}(a)$ gives the arithmetic sign of $a$, that is $+1$ or $-1$ (so the labels $y$ are taken to be $\pm 1$).
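To make the statement concrete for myself, I tried a minimal numerical sketch. The distribution here is just a toy I made up ($x \sim \text{Uniform}(0,1)$ with $\mathbb{P}(y=1 \mid x) = x$), and it only checks the claim empirically; it is not a proof:

```python
import numpy as np

# Toy setup (my own construction, not part of the interview question):
# x ~ Uniform(0, 1) and P(y = +1 | x) = x, so the claimed optimal rule
# f*(x) = sign(P(y=1|x) - 1/2) predicts +1 exactly when x > 1/2.
rng = np.random.default_rng(0)
n = 200_000
x = rng.uniform(0.0, 1.0, size=n)
y = np.where(rng.uniform(size=n) < x, 1, -1)   # label is +1 with probability x

def error_rate(threshold):
    """Empirical 0-1 loss of the rule f(x) = sign(x - threshold)."""
    preds = np.where(x > threshold, 1, -1)
    return np.mean(preds != y)

for t in [0.3, 0.4, 0.5, 0.6, 0.7]:
    print(f"threshold {t:.1f}: error ~= {error_rate(t):.4f}")
# The 0.5 threshold gives the smallest error (about 0.25 in this toy case),
# which at least matches the claim about f*(x) numerically.
```

In this toy setting the $0.5$ threshold does come out best, so I can see the claim is plausible, but I still don't understand why it holds in general or what it buys us.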
Now I have several questions about this statement that still confuse me:
- What is the meaning of this statement? What does it say on an intuitive level?
- What information does it give us about the classifier? Can it be used to design a "practically good" classifier for this problem, one that can actually be learned from data?
- What is the use of this result? How does it relate to, say, designing a logistic regression classifier? I ask about logistic regression because there, too, we threshold the predicted probability at $0.5$ (see the sketch after this list).
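Here is what I mean about logistic regression. This is a minimal sketch on data I simulated myself; `LogisticRegression`, `predict_proba`, and `predict` are real scikit-learn methods, everything else is my own toy setup:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated data: two features, labels drawn from a logistic model I chose myself.
rng = np.random.default_rng(1)
X = rng.normal(size=(1000, 2))
true_p = 1.0 / (1.0 + np.exp(-(1.5 * X[:, 0] - 0.5 * X[:, 1])))
y = (rng.uniform(size=1000) < true_p).astype(int)

clf = LogisticRegression().fit(X, y)
p_hat = clf.predict_proba(X)[:, 1]          # model's estimate of P(y=1|x)
thresholded = (p_hat > 0.5).astype(int)     # threshold that estimate at 1/2

# predict() classifies by the sign of the decision function, which is the
# same as thresholding the estimated P(y=1|x) at 0.5, so the two rules agree.
print("agreement:", np.mean(thresholded == clf.predict(X)))  # 1.0
```

So logistic regression's default decision rule looks to me like a "plug-in" version of $f^*$, with the model's estimate of $\mathbb{P}(y=1 \mid x)$ substituted for the true conditional probability. Is that the right way to think about the connection to this result?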