My question is about how to actually do this both rigorously and practically. Allow me to elaborate.
Suppose that we have data $(x_1,y_1),...,(x_N,y_N) \in \mathbb{R}^p \times \{0,...,k-1 \}$. I'd like to do a softmax regression. More precisely, I'd like to find vectors $w_0,...,w_{k-1} \in \mathbb{R}^p$ and scalars $b_0,...,b_{k-1} \in \mathbb{R}$ such that my model predicts $$\mathbb{P}(Y = j | X = x) = \frac{\exp( w_j \cdot x + b_j)}{\sum_{i=0}^{k-1} \exp (w_i \cdot x + b_i)}$$
where the $w_j$ and $b_j$ possibly satisfy some constraint. This is all fine and dandy, since we can find the parameters by maximum likelihood, which is easy to do with a variety of software (sklearn, PyTorch, R, etc.). There is an issue regarding identifiability, since we can shift all the $w_j$ by a common vector (and all the $b_j$ by a common scalar) without changing the predicted probabilities, but this shouldn't matter in practice, since every such shift yields the same predictions. I've seen people zero out the coefficients of the last class to deal with this. However, if possible, I'd like to avoid that, since it essentially picks one class as a reference and the choice is a bit arbitrary (and also because I haven't figured out how to do it in the software I'm using, which is the real concern, haha).
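For concreteness, this is roughly the non-Bayesian fit I have in mind (a minimal sklearn sketch; the toy data and the large `C`, which I'm using to approximate the unpenalized maximum likelihood fit, are just my assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-ins for (x_1, y_1), ..., (x_N, y_N); dimensions are arbitrary
rng = np.random.default_rng(0)
N, p, k = 500, 4, 3
X = rng.normal(size=(N, p))
y = rng.integers(0, k, size=N)

# softmax (multinomial logistic) regression fit by maximum likelihood;
# sklearn applies an L2 penalty by default, so a large C approximates the unpenalized fit
clf = LogisticRegression(multi_class="multinomial", C=1e6).fit(X, y)
W, b = clf.coef_, clf.intercept_   # shapes (k, p) and (k,)
probs = clf.predict_proba(X)       # row i is the softmax applied to W @ x_i + b
```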
Suppose now I want to take a Bayesian approach to this, i.e. put a prior on $(w,b)$, compute the posterior $\mathbb{P}(w,b| X = x, Y = y)$, and find the MAP estimate, $$\text{argmax}_{w,b}\, \mathbb{P}(w,b | Y =y, X = x) = \text{argmax}_{w,b}\, \mathbb{P} (Y = y | w,b, X = x)\,\mathbb{P}(w,b).$$ My first question is:
1. What is the distribution of the first factor on the right in the formula above?
I understand that it's given explicitly by the softmax function, but I was wondering if there is a standard name for it. I've been calling it the softmax distribution, but I don't want to sound like an idiot when talking with other people.
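To clarify what I mean by the MAP estimate above, here is the kind of computation I have in mind, written as a sketch in PyTorch. The standard Gaussian prior is only a placeholder, since what it should actually be is exactly question 2 below:

```python
import torch

# toy stand-ins for the data: X is (N, p), y takes values in {0, ..., k-1}
torch.manual_seed(0)
N, p, k = 500, 4, 3
X = torch.randn(N, p)
y = torch.randint(0, k, (N,))

W = torch.zeros(k, p, requires_grad=True)
b = torch.zeros(k, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    logits = X @ W.T + b
    # negative log-likelihood of the softmax model, -log P(Y = y | w, b, X = x)
    nll = torch.nn.functional.cross_entropy(logits, y, reduction="sum")
    # negative log-density of a placeholder standard Gaussian prior on (w, b)
    neg_log_prior = 0.5 * (W.pow(2).sum() + b.pow(2).sum())
    # minimizing nll + neg_log_prior maximizes P(Y = y | w, b, X = x) P(w, b)
    loss = nll + neg_log_prior
    loss.backward()
    opt.step()
```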
To state my next question, let me compare with the linear regression model. To do Bayesian linear regression, you assume $Y \sim \mathcal{N}(Wx + b,\sigma^2)$. This allows you to put a Gaussian prior on $W$ and still recover the posterior distribution $\mathbb{P}(W | X=x,Y=y)$ analytically.
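For reference (and assuming I have the standard conjugate result right), with a prior $W \sim \mathcal{N}(0, \tau^2 I)$ and the intercept absorbed into $W$ by appending a constant feature, the posterior is again Gaussian:
$$W \mid X = x, Y = y \sim \mathcal{N}\left(\tfrac{1}{\sigma^2}\Sigma X^\top y,\; \Sigma\right), \qquad \Sigma = \left(\tfrac{1}{\sigma^2} X^\top X + \tfrac{1}{\tau^2} I\right)^{-1},$$
where $X$ here denotes the design matrix with rows $x_i$ and $y$ the vector of responses. Nothing like this seems to be available in closed form for the softmax likelihood, which is part of why I'm asking.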
In my case, the relevant question is
2. What should the priors be for $W_j$ and $b_j$?
I guess my main concern is: how much should I be worrying about the failure of identifiability? As I said before, due to practical (read: implementation) issues, I'd like to avoid subtracting the coefficients, or messing with them in any way that's not just endowing them with a distribution. My worry is that this will make the Gaussian prior not a great choice.
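To make the identifiability worry concrete, this is the invariance I have in mind (a small numerical check; the dimensions are arbitrary):

```python
import numpy as np

def softmax_probs(W, b, x):
    z = W @ x + b
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)

c = rng.normal(size=4)       # shift every w_j by the same vector c
d = rng.normal()             # shift every b_j by the same scalar d

# the predicted probabilities are unchanged, so (w, b) is not identifiable
print(np.allclose(softmax_probs(W, b, x), softmax_probs(W + c, b + d, x)))  # True
```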
My next two questions concern the actual conditioning on the data. As noted above, in the linear regression case, we use the fact that we have data on $Y$, and that we know its distribution $Y \sim \mathcal{N}(Wx+b,\sigma^2)$, to compute the posterior $\mathbb{P}(W,b | X=x,Y=y)$.
In the softmax regression model, we also have data on $Y$, but what the model specifies is $\mathbb{P}(Y | w, b, X)$. This seems to suggest that we should put a distribution on the space of probabilities, i.e. make $\mathbb{P}(Y = \cdot | w,b)$ a random variable and then condition it on the data. However, as stated earlier, we are given $Y$, not the probabilities. This raises the following two questions:
3. What distribution do I put on the probabilities?
4. Given 3, how do I condition the probabilities on $Y$?
More specifically, should I just use one-hot encoding and condition on that? That is, turn $Y = j$ into the vector $(0,...,0,1,0,...,0)$ that is zero everywhere except at position $j$, where it is one, and then condition the distribution of the probabilities on that?
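To spell out what I mean by conditioning on the one-hot vectors, here is the rewriting of the log-likelihood I have in mind; as far as I can tell, the one-hot form is just notation for picking out the probability of the observed class (toy numbers below, each row of `probs` would come from the softmax formula above):

```python
import numpy as np

# toy labels and model probabilities
y = np.array([2, 0, 1])
k = 3
probs = np.array([[0.2, 0.3, 0.5],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])

# one-hot encoding of y: row i is zero except for a one in column y_i
Y_onehot = np.eye(k)[y]

# categorical log-likelihood written with the one-hot vectors:
# sum_i sum_j Y_onehot[i, j] * log probs[i, j]  ==  sum_i log probs[i, y_i]
loglik_onehot = np.sum(Y_onehot * np.log(probs))
loglik_direct = np.sum(np.log(probs[np.arange(len(y)), y]))
print(np.isclose(loglik_onehot, loglik_direct))  # True
```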
My final question is a humble request. It's quite possible that the past four questions are completely obvious to anyone who knows Bayesian inference, and that by asking them I've revealed that I don't really know any Bayesian inference. Should that be the case, the following question is a natural follow-up:
5. What is a good text on Bayesian inference?
You can impose as high a math requirement as you'd like. In truth, the mathier the better, so long as it's not just abstraction for the sake of abstraction.
Sorry for the long wall of text and thanks in advance!