
Suppose we are given a data set $D_{in}$ with $N=|D_{in}|$ samples and a hypothesis set $H=\{h_1,h_2,\dots,h_M\}$.

For a fixed hypothesis $h$, for example $h_1$, we can derive

$$P[|E_{in}(h_1)-E_{out}(h_1)|>\epsilon] \leq 2e^{-2\epsilon^{2}N}$$

from Hoeffding's inequality, since $\mathbb{E}_{D_{in}}[E_{in}(h_1)]=E_{out}(h_1)$.
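For intuition, here is a quick numerical sanity check of this single-hypothesis bound. The setup is my own toy example (not part of the argument above): a fixed hypothesis whose errors on i.i.d. samples are Bernoulli with rate $E_{out}=0.3$, so $E_{in}$ is just the mean of $N$ such draws.

```python
import numpy as np

# Toy check of Hoeffding's bound for one fixed hypothesis (illustrative
# assumption: the hypothesis errs on each sample independently with
# probability E_out = 0.3, so E_in is the mean of N Bernoulli(0.3) draws).
rng = np.random.default_rng(0)
N, eps, trials = 100, 0.1, 100_000
E_out = 0.3

errors = rng.random((trials, N)) < E_out   # 1 where the hypothesis errs
E_in = errors.mean(axis=1)                 # in-sample error for each trial

empirical = np.mean(np.abs(E_in - E_out) > eps)
bound = 2 * np.exp(-2 * eps**2 * N)

print(f"empirical P[|E_in - E_out| > {eps}]: {empirical:.4f}")
print(f"Hoeffding bound 2*exp(-2*eps^2*N):   {bound:.4f}")
# The empirical frequency stays well below the bound, as expected.
```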

If $g$ is the hypothesis we learn from the data $D_{in}$, we can't apply Hoeffding's inequality directly, since in general $\mathbb{E}_{D_{in}}[E_{in}(g)] \neq E_{out}(g)$.

But my question is, why can't we bound $P[|E_{in}(g)-E_{out}(g)|>\epsilon]$ as follows:

$$\begin{aligned}
P[|E_{in}(g)-E_{out}(g)|>\epsilon]
&= \sum_{h \in H} P[|E_{in}(g)-E_{out}(g)|>\epsilon \;\land\; g=h] \\
&= \sum_{h \in H} P[|E_{in}(g)-E_{out}(g)|>\epsilon \;\lvert\; g=h]\;P[g=h] \\
&= \sum_{h \in H} P[|E_{in}(h)-E_{out}(h)|>\epsilon]\;P[g=h] \\
&\leq \sum_{h \in H} \left(2e^{-2\epsilon^{2}N}\right)P[g=h] \\
&= 2e^{-2\epsilon^{2}N}
\end{aligned}$$

The inequality $P[|E_{in}(g)-E_{out}(g)|>\epsilon] \leq 2e^{-2\epsilon^{2}N}$ would mean that, with probability at least $1-\delta$, the hypothesis $g$ learned from $D_{in}$ satisfies
$$E_{out}(g) \leq E_{in}(g) + \sqrt{{1 \over 2N}\ln{2\over\delta}},$$

and thus the size of the hypothesis set $|H|$ would not affect the generalization bound.
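For completeness, that last step is just a change of variables: setting $\delta = 2e^{-2\epsilon^{2}N}$ and solving for $\epsilon$ gives

$$e^{2\epsilon^{2}N} = \frac{2}{\delta} \;\Longrightarrow\; 2\epsilon^{2}N = \ln\frac{2}{\delta} \;\Longrightarrow\; \epsilon = \sqrt{\frac{1}{2N}\ln\frac{2}{\delta}}.$$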

Is this wrong? Where did I go wrong?

If this inequality is right, why do we use a uniform convergence bound like $$ \forall g \in H, \;P[|E_{in}(g)-E_{out}(g)|>\epsilon] \leq 2|H|e^{-2\epsilon^{2}N} $$?
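For reference, the $|H|$ factor in that bound comes from the union bound over the hypothesis set:

$$P\big[\exists h \in H:\; |E_{in}(h)-E_{out}(h)|>\epsilon\big] \leq \sum_{h \in H} P[|E_{in}(h)-E_{out}(h)|>\epsilon] \leq 2|H|e^{-2\epsilon^{2}N}.$$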

asqdf

1 Answer


I believe the error is in the following equality

$$\sum_{h \in H} P[|E_{in}(g)-E_{out}(g)|>\epsilon \;\lvert \;g=h]\;P[g=h] = \sum_{h \in H} P[|E_{in}(h)-E_{out}(h)|>\epsilon]\;P[g=h]$$

but it's hidden by the notation.

Consider your equality

$$ P [|E_{in}(g)-E_{out}(g)|>\epsilon \;\lvert \;g=h] = P [|E_{in}(h)-E_{out}(h)|>\epsilon] $$

It is only specific training sets $X, Y$ that produce $g = h$, and, unfortunately, it is exactly these data sets that bias the out-of-sample error to be higher than the training error. To produce an equality like this, you would have to change the probability space to include only those training data sets for which $g = h$, so $P$ just doesn't mean the same thing on the left- and right-hand sides.
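To see this effect numerically, here is a small illustrative simulation (my own toy setup, not anything from your question): $M$ hypotheses that all have $E_{out}=0.5$ and make independent errors, with "learning" defined as picking the hypothesis with the smallest in-sample error. That data-dependent choice is exactly what breaks the fixed-hypothesis argument.

```python
import numpy as np

# Illustrative setup (my own, for the sake of example): M hypotheses, each
# with true error E_out = 0.5 and independent mistakes. "Learning" picks the
# hypothesis with the smallest in-sample error, i.e. the choice of g depends
# on the very data used to compute E_in(g).
rng = np.random.default_rng(0)
M, N, eps, trials = 50, 100, 0.1, 20_000
E_out = 0.5

violations = 0
for _ in range(trials):
    errors = rng.random((M, N)) < E_out    # errors[m, i]: hypothesis m errs on sample i
    E_in = errors.mean(axis=1)             # in-sample error of each hypothesis
    g = np.argmin(E_in)                    # data-dependent selection of g
    if abs(E_in[g] - E_out) > eps:
        violations += 1

empirical = violations / trials
single = 2 * np.exp(-2 * eps**2 * N)       # bound the question hoped would apply to g
union = 2 * M * np.exp(-2 * eps**2 * N)    # uniform convergence (union) bound

print(f"empirical P[|E_in(g) - E_out(g)| > {eps}]: {empirical:.3f}")
print(f"single-hypothesis Hoeffding bound:          {single:.3f}")
print(f"union bound over M = {M} hypotheses:        {union:.3f}")
```

With these numbers the selected hypothesis violates the $\epsilon$-band far more often than the single-hypothesis bound $2e^{-2\epsilon^{2}N}\approx 0.27$ would allow, while the union bound (loose here, since it exceeds 1) is of course never violated.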

Matthew Drury
  • Thank you for your answer. I recognized the error you mentioned. Thank you. I have a few more questions. Is there any way to bound $P [|E_{in}(g)-E_{out}(g)|>\epsilon \;\lvert \;g=h]$ value which finally leads to $P[|E_{in}(g)-E_{out}(g)|>\epsilon] \leq 2|H|e^{-2\epsilon^{2}N}$? – asqdf Jun 20 '15 at 17:33
  • Also, I'm having a trouble with 'How to connect inequality $\forall h \in H, \;P[|E_{in}(h)-E_{out}(h)|>\epsilon] \leq 2|H|e^{-2\epsilon^{2}N}$ to inequality $P[|E_{in}(g)-E_{out}(g)|>\epsilon] \leq 2|H|e^{-2\epsilon^{2}N}$ if $g$ is our learned hypothesis'. I understood it intuitively, but I'm having trouble with formulating rigorous mathematical equations. – asqdf Jun 20 '15 at 17:34
  • For your second question, you simply must observe that $g \in H$, so it is being quantified over in the $\forall$. For the first, I can't say for sure, but since you don't have much control over the conditional probability space, I'd be surprised if it was easy to do so. – Matthew Drury Jun 20 '15 at 20:28
  • Thanks! I finally understood. Thank you. However, I have one more last question. Could you please give me an explanation about my another question? http://stats.stackexchange.com/questions/157967/in-statistical-learning-theory-isnt-there-a-problem-of-overfitting-on-a-test-s – asqdf Jun 21 '15 at 07:21