This question is not the same as the one I asked previously. There I asked for a proof that the sigmoid and the softmax are equivalent. I found a solution here, but I think it is incorrect. Here is the exercise that I'm trying to do:

enter image description here

And here is how they prove it:

enter image description here

However, it seems incorrect: I don't think we can set $H_t(b)\leftarrow H_t(b)-H_t(b)=0$ and then set $H_t(a)\leftarrow H_t(a)-H_t(b)=H_t(a)$, because that changes the probability. For example, let $H_t(a)=2$ and $H_t(b)=1$, and then set $H_t(b)\leftarrow 0$. Under the softmax, $P(A_t=a)=\frac{e^2}{e^2+e^1}\neq\frac{e^2}{e^2+e^0}$.

So what is actually the correct way to do it?

Slim Shady
  • The question is nearly the same as your previous one, so I'm marking it as a duplicate. – Tim Jan 20 '22 at 11:31

1 Answer

After slightly simplifying the notation, by the properties of the exponential function:

$$ \frac{e^{a-b}}{e^{a-b} + e^{b-b}} = \frac{e^a e^{-b}}{e^a e^{-b} + e^b e^{-b}} = \frac{e^a}{e^a + e^b} $$
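
The same cancellation works for any constant subtracted from both preferences, which may make it clearer why the shift is harmless. As a short sketch, with $c$ and $\sigma$ (the logistic sigmoid) introduced here as notation that is not in the original exercise:

$$ \frac{e^{a-c}}{e^{a-c} + e^{b-c}} = \frac{e^{-c}\, e^{a}}{e^{-c}\left(e^{a} + e^{b}\right)} = \frac{e^a}{e^a + e^b}, \qquad \text{and for } c = b: \quad \frac{e^{a-b}}{e^{a-b} + 1} = \frac{1}{1 + e^{-(a-b)}} = \sigma(a-b) $$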

You can double-check this numerically.

Using your example,

> exp(2)/(exp(2) + exp(1))
[1] 0.7310586
> exp(2-1)/(exp(2-1) + exp(1-1))
[1] 0.7310586

Notice that you forgot to update the value of $H_t(a)$.
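
The difference between the two readings of the update can be sketched in R. This is only an illustration under assumptions: the helper names `softmax` and `sigmoid` and the variable names are hypothetical and not part of the exercise, and the values $H_t(a)=2$, $H_t(b)=1$ are taken from the question.

```r
# Softmax over a named vector of preferences (hypothetical helper).
softmax <- function(H) exp(H) / sum(exp(H))

# Logistic sigmoid (hypothetical helper).
sigmoid <- function(x) 1 / (1 + exp(-x))

# Example values from the question.
H <- c(a = 2, b = 1)

# Simultaneous shift: both new values are computed from the OLD H["b"].
H_shifted <- H - H[["b"]]                      # a = 1, b = 0

# Sequential misreading: b is zeroed first, then the NEW b (= 0)
# is subtracted from a, so a is left unchanged.
H_sequential <- H
H_sequential[["b"]] <- H_sequential[["b"]] - H_sequential[["b"]]  # b = 0
H_sequential[["a"]] <- H_sequential[["a"]] - H_sequential[["b"]]  # a stays 2

p_original   <- softmax(H)
p_shifted    <- softmax(H_shifted)
p_sequential <- softmax(H_sequential)

# Two-action softmax written as a sigmoid of the preference difference.
p_sigmoid <- sigmoid(H[["a"]] - H[["b"]])

print(rbind(p_original, p_shifted, p_sequential))
print(p_sigmoid)
```

The simultaneous shift reproduces the original probabilities and agrees with the sigmoid of $H_t(a)-H_t(b)$, while the sequential misreading gives a different probability for $a$.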

Tim
  • Thank you. But I didn't forget, I think they forgot, and that was the point of my question - to ask whether their solution is incorrect – Slim Shady Jan 20 '22 at 11:47
  • They didn't forget anything. – Tim Jan 20 '22 at 11:47
  • I think they forgot to update $H_t(a)$. They write $H_t(a)\leftarrow H_t(a)-0=H_t(a)$. Your example shows the updated version where $H_t(a)-1 \neq H_t(a)$ – Slim Shady Jan 20 '22 at 11:50
  • Is my last comment incorrect? If it is correct, could you please reopen the question or edit your answer? – Slim Shady Jan 20 '22 at 11:59
  • $H_t(a) - H_t(b)$ not $H_t(a) - 0$ – Tim Jan 20 '22 at 12:01
  • But they write $H_t(a) \leftarrow H_t(a)-H_t(b)=H_t(a)$? Which clearly implies $H_t(b)=0$? And they do that because before they show that $H_t(b)\leftarrow H_t(b)-H_t(b)=0$. – Slim Shady Jan 20 '22 at 12:06
  • @SlimShady but those are not operations that are done sequentially. It also wouldn't make any sense to subtract zero. – Tim Jan 20 '22 at 12:51
  • They are not done sequentially, I know - that's why I'm confused about how exactly $H_t(a)\leftarrow H_t(a)$ comes about. – Slim Shady Jan 20 '22 at 13:23
  • $H_t(b)' \leftarrow H_t(b) - H_t(b)$ and $H_t(a)' \leftarrow H_t(a) - H_t(b)$ – Tim Jan 20 '22 at 13:31
  • Okay, but then why do they write $P(A=a) = \frac{e^{H_t(a)}}{e^{H_t(a)}+1}$ in the end instead of $\frac{e^{H_t(a)-H_t(b)}}{e^{H_t(a)-H_t(b)}+1}$? – Slim Shady Jan 20 '22 at 13:50
  • @SlimShady because they don't distinguish in the notation the old vs new values and assume the reader would understand this from the context. – Tim Jan 20 '22 at 14:03
  • You are correct, I'm blind. I was fixated on the idea that they did it sequentially. Thank you very much for explaining. – Slim Shady Jan 20 '22 at 14:08
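
The notational point settled in the comments can be written out with primes marking the new values. The primed symbols follow Tim's comment above; this is a sketch of how the final formula in the solution is presumably meant to be read, not a quotation of it:

$$ H_t'(a) = H_t(a) - H_t(b), \qquad H_t'(b) = H_t(b) - H_t(b) = 0 $$

$$ P(A_t = a) = \frac{e^{H_t'(a)}}{e^{H_t'(a)} + e^{H_t'(b)}} = \frac{e^{H_t'(a)}}{e^{H_t'(a)} + 1} = \frac{e^{H_t(a)-H_t(b)}}{e^{H_t(a)-H_t(b)} + 1} = \frac{e^{H_t(a)}}{e^{H_t(a)} + e^{H_t(b)}} $$

Read this way, the $H_t(a)$ in the solution's last line is the shifted value $H_t'(a)$, written without the prime.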