
Recall the setup of logistic regression: We assume that the posterior probability is of the form

$p(Y=1|x) = \frac{1}{1+e^{-\beta^Tx}}$

This assumes that $Y|X$ is a Bernoulli random variable. We now turn to the case where $Y|X$ is a multinomial random variable over $K$ outcomes. This is called softmax regression, because the posterior probability is of the form

$ p(Y=k|x) = \mu_k(x) = \frac{e^{\beta_k^Tx}}{\sum_{j=1}^Ke^{\beta_j^Tx}} $

which is called the softmax function. Assume we have observed data $D=\{x_i, y_i\}_{i=1}^N$. Our goal is to learn the weight vectors $\beta_1, ..., \beta_K$.
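For concreteness, here is a minimal NumPy sketch of the softmax function defined above (the subtraction of the maximum score is a standard numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(scores):
    """Map class scores beta_k^T x to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Example: three class scores
probs = softmax(np.array([2.0, 1.0, 0.1]))
```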

The question is given as above. My attempt: I first write the likelihood as a product over the observations, $\prod P(Y|X)=\prod_{i}P(y_i|x_i)$, and then take the negative logarithm of both sides:

$-\log \prod P(Y|X) = -\log\left(\prod_{i}P(y_i|x_i)\right)$, which is equal to:

$-\sum_{i=1}^N\log P(y_i|x_i) = -\sum_{i=1}^N\left[\beta_{y_i}^Tx_i - \log\left(\sum_{j=1}^{K}e^{\beta_j^Tx_i}\right)\right]$

In the solution, however, it was given as:

$-\log \prod P(Y|X) = -\log\left(\prod_{i}P(y_i|x_i)\right) = -\log{\prod_{i=1}^N\prod_{k=1}^K}\left(\frac{e^{\beta_k^Tx_i}}{\sum_{j=1}^Ke^{\beta_j^Tx_i}}\right)^{1\{y_i=k\}}$

Here, I couldn't understand the meaning of the term $1\{y_i=k\}$. What does it mean? Why does it appear there? I've solved many likelihood questions and I've never encountered such a term before. Also, this is the first time I'm asking a question here, so sorry for both my LaTeX and my English. I hope both are clear. Thanks for your reply.

Richard Hardy
kursat

1 Answer


The $\mathbf{1}\{y_i = k\}$ is an indicator function. It is equal to 1 when $y_i = k$ and zero otherwise. What this means is that in this product you consider only the probabilities of the correct classes. Recall that $x^0 = 1$ for any $x$, so for all the other classes you would be multiplying by 1 (doing nothing), while for the correct class of each observation you multiply by the actual probability, since $x^1 = x$.

Also keep in mind that this is a notational trick: if you implemented it, you would probably skip calculating the probabilities for the classes you don't consider, to save computation.
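To illustrate both points, here is a small NumPy sketch (the function names are mine, for illustration only) showing that the indicator form of the negative log-likelihood equals the version that simply picks out each observation's true class:

```python
import numpy as np

def nll_indicator(B, X, y, K):
    """Negative log-likelihood using the indicator 1{y_i = k}: every class
    term enters the sum, but terms with indicator 0 contribute nothing."""
    scores = X @ B.T  # (N, K) matrix of beta_k^T x_i
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    onehot = np.eye(K)[y]  # 1{y_i = k} written as a one-hot matrix
    return -np.sum(onehot * log_probs)

def nll_direct(B, X, y, K):
    """Same quantity, skipping the indicator: index the true class directly."""
    scores = X @ B.T
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return -np.sum(log_probs[np.arange(len(y)), y])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # N=5 samples, 3 features
B = rng.normal(size=(4, 3))        # K=4 weight vectors beta_k
y = rng.integers(0, 4, size=5)     # true class labels
assert np.isclose(nll_indicator(B, X, y, 4), nll_direct(B, X, y, 4))
```

The second function is what one would actually implement; the indicator version exists to make the product over $k$ well-defined in the derivation.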

Tim