
Recall the setup of logistic regression: We assume that the posterior probability is of the form

$p(Y=1|x) = \frac{1}{1+e^{-\beta^Tx}}$

This assumes that $Y|X$ is a Bernoulli random variable. We now turn to the case where $Y|X$ is a multinomial random variable over $K$ outcomes. This is called softmax regression, because the posterior probability is of the form

$ p(Y=k|x) = \mu_k(x) = \frac{e^{\beta_k^Tx}}{\sum_{j=1}^Ke^{\beta_j^Tx}} $

which is called the softmax function. Assume we have observed data $D=\{x_i, y_i\}_{i=1}^N$. Our goal is to learn the weight vectors $\beta_1, ..., \beta_K$.
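For concreteness, here is a minimal NumPy sketch of the softmax function defined above (the subtraction of the maximum score is a standard numerical-stability trick, not part of the mathematical definition):

```python
import numpy as np

def softmax(scores):
    """Map class scores beta_k^T x to probabilities that sum to 1."""
    shifted = scores - np.max(scores)  # subtract max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

# Example: three class scores
probs = softmax(np.array([2.0, 1.0, 0.1]))
```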

The question is given as above. My attempt: I first write the likelihood as a product over the observations, $\prod P(Y|X)=\prod_{i}P(y_i|x_i)$, and then take the negative logarithm of both sides:

$-\log \prod P(Y|X) = -\log\left(\prod_{i}P(y_i|x_i)\right)$, which is equal to:

$-\sum_{i=1}^N\log P(y_i|x_i) = -\sum_{i=1}^N\left[\beta_{y_i}^Tx_i - \log\left(\sum_{j=1}^{K}e^{\beta_j^Tx_i}\right)\right]$

In the solution, however, it was given as:

$-\log \prod P(Y|X) = -\log\left(\prod_{i}P(y_i|x_i)\right) = -\log{\prod_{i=1}^N\prod_{k=1}^K}\left(\frac{e^{\beta_k^Tx_i}}{\sum_{j=1}^Ke^{\beta_j^Tx_i}}\right)^{1\{y_i=k\}}$

Here, I couldn't understand the meaning of the term $1\{y_i=k\}$. What does it mean? Why does it appear there? I've solved many likelihood questions and I've never encountered such a term before. Also, this is the first time I'm asking a question here, so sorry for both my LaTeX and my English. I hope both are clear. Thanks for your reply.

Richard Hardy
kursat

1 Answer


The $\mathbf{1}\{y_i = k\}$ is an indicator function. It is equal to 1 when $y_i = k$ and zero otherwise. What this means is that in this product you consider only the probabilities of the correct classes. Recall that $x^0 = 1$ for any $x$, so for all the other classes you would be multiplying by 1 (doing nothing), while for the correct class of each observation you multiply by the actual probability, since $x^1 = x$.

Also keep in mind that this is a notational trick: if you implemented it, you would probably skip calculating the probabilities for the classes you don't consider, to save computation.
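To illustrate both points, here is a small NumPy sketch (the function names are mine, for illustration only) showing that the indicator form of the negative log-likelihood equals the version that simply picks out each observation's true class:

```python
import numpy as np

def nll_indicator(B, X, y, K):
    """Negative log-likelihood using the indicator 1{y_i = k}: every class
    term enters the sum, but terms with indicator 0 contribute nothing."""
    scores = X @ B.T  # (N, K) matrix of beta_k^T x_i
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    onehot = np.eye(K)[y]  # 1{y_i = k} written as a one-hot matrix
    return -np.sum(onehot * log_probs)

def nll_direct(B, X, y, K):
    """Same quantity, skipping the indicator: index the true class directly."""
    scores = X @ B.T
    log_probs = scores - np.log(np.sum(np.exp(scores), axis=1, keepdims=True))
    return -np.sum(log_probs[np.arange(len(y)), y])

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))        # N=5 samples, 3 features
B = rng.normal(size=(4, 3))        # K=4 weight vectors beta_k
y = rng.integers(0, 4, size=5)     # true class labels
assert np.isclose(nll_indicator(B, X, y, 4), nll_direct(B, X, y, 4))
```

The second function is what one would actually implement; the indicator version exists to make the product over $k$ well-defined in the derivation.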

Tim