My question is about how to actually do this both rigorously and practically. Allow me to elaborate.
Suppose that we have data $(x_1,y_1),...,(x_N,y_N) \in \mathbb{R}^p \times \{0,...,k-1 \}$. I'd like to do a softmax regression. More precisely, I'd like to find vectors $w_0,...,w_{k-1} \in \mathbb{R}^p$ and scalars $b_0,...,b_{k-1} \in \mathbb{R}$ such that my model predicts $$\mathbb{P}(Y = j | X = x) = \frac{\exp( w_j \cdot x + b_j)}{\sum_{i=0}^{k-1} \exp (w_i \cdot x + b_i)}$$
where the $w_j$ and $b_j$ possibly satisfy some constraint. This is all fine and dandy, since we can find the parameters by maximum likelihood, which is easy to do with a variety of software (sklearn, PyTorch, R, etc.). There is an issue regarding identifiability, since we can shift all the $w_j$ by a common vector (and all the $b_j$ by a common scalar) without changing the predicted probabilities, but this shouldn't matter in practice, since every such shift yields the same predictions. I've seen people zero out the coefficients of the last class to deal with this. However, if possible, I'd like to avoid that, since it essentially picks one class as a reference and the choice is a bit arbitrary (and also because I haven't figured out how to do it in the software I'm using, which is the real concern, haha).
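For concreteness, this is roughly the non-Bayesian fit I have in mind (a minimal sklearn sketch; the toy data and the large `C`, which I'm using to approximate the unpenalized maximum likelihood fit, are just my assumptions):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# toy stand-ins for (x_1, y_1), ..., (x_N, y_N); dimensions are arbitrary
rng = np.random.default_rng(0)
N, p, k = 500, 4, 3
X = rng.normal(size=(N, p))
y = rng.integers(0, k, size=N)

# softmax (multinomial logistic) regression fit by maximum likelihood;
# sklearn applies an L2 penalty by default, so a large C approximates the unpenalized fit
clf = LogisticRegression(multi_class="multinomial", C=1e6).fit(X, y)
W, b = clf.coef_, clf.intercept_   # shapes (k, p) and (k,)
probs = clf.predict_proba(X)       # row i is the softmax applied to W @ x_i + b
```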
Suppose now I want to take a Bayesian approach to this, i.e. put a prior on $(w,b)$, compute the posterior $\mathbb{P}(w,b| X = x, Y = y)$, and find the MAP estimate, $$\text{argmax}_{w,b}\, \mathbb{P}(w,b | Y =y, X = x) = \text{argmax}_{w,b}\, \mathbb{P} (Y = y | w,b, X = x)\,\mathbb{P}(w,b).$$ My first question is:
1. What is the distribution of the first factor on the right in the formula above?
I understand that it's given explicitly by the softmax function, but I was wondering if there is a standard name for it. I've been calling it the softmax distribution, but I don't want to sound like an idiot when talking with other people.
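To clarify what I mean by the MAP estimate above, here is the kind of computation I have in mind, written as a sketch in PyTorch. The standard Gaussian prior is only a placeholder, since what it should actually be is exactly question 2 below:

```python
import torch

# toy stand-ins for the data: X is (N, p), y takes values in {0, ..., k-1}
torch.manual_seed(0)
N, p, k = 500, 4, 3
X = torch.randn(N, p)
y = torch.randint(0, k, (N,))

W = torch.zeros(k, p, requires_grad=True)
b = torch.zeros(k, requires_grad=True)
opt = torch.optim.Adam([W, b], lr=0.1)

for _ in range(500):
    opt.zero_grad()
    logits = X @ W.T + b
    # negative log-likelihood of the softmax model, -log P(Y = y | w, b, X = x)
    nll = torch.nn.functional.cross_entropy(logits, y, reduction="sum")
    # negative log-density of a placeholder standard Gaussian prior on (w, b)
    neg_log_prior = 0.5 * (W.pow(2).sum() + b.pow(2).sum())
    # minimizing nll + neg_log_prior maximizes P(Y = y | w, b, X = x) P(w, b)
    loss = nll + neg_log_prior
    loss.backward()
    opt.step()
```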
To state my next question, let me compare with the linear regression model. To do Bayesian linear regression, you assume $Y \sim \mathcal{N}(Wx + b,\sigma^2)$. This allows you to put a Gaussian prior on $W$ and still recover the posterior distribution $\mathbb{P}(W | X=x,Y=y)$ analytically.
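For reference (and assuming I have the standard conjugate result right), with a prior $W \sim \mathcal{N}(0, \tau^2 I)$ and the intercept absorbed into $W$ by appending a constant feature, the posterior is again Gaussian:
$$W \mid X = x, Y = y \sim \mathcal{N}\left(\tfrac{1}{\sigma^2}\Sigma X^\top y,\; \Sigma\right), \qquad \Sigma = \left(\tfrac{1}{\sigma^2} X^\top X + \tfrac{1}{\tau^2} I\right)^{-1},$$
where $X$ here denotes the design matrix with rows $x_i$ and $y$ the vector of responses. Nothing like this seems to be available in closed form for the softmax likelihood, which is part of why I'm asking.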
In my case, the relevant question is
2. What should the priors be for $W_j$ and $b_j$?
I guess my main concern is: how much should I be worrying about the failure of identifiability? As I said before, due to practical (read: implementation) issues, I'd like to avoid subtracting the coefficients, or messing with them in any way that's not just endowing them with a distribution. My worry is that this will make the Gaussian prior not a great choice.
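To make the identifiability worry concrete, this is the invariance I have in mind (a small numerical check; the dimensions are arbitrary):

```python
import numpy as np

def softmax_probs(W, b, x):
    z = W @ x + b
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
x = rng.normal(size=4)

c = rng.normal(size=4)       # shift every w_j by the same vector c
d = rng.normal()             # shift every b_j by the same scalar d

# the predicted probabilities are unchanged, so (w, b) is not identifiable
print(np.allclose(softmax_probs(W, b, x), softmax_probs(W + c, b + d, x)))  # True
```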
My next two questions concern the actual conditioning on the data. As noted above, in the linear regression case, we use the fact that we have data on $Y$, and that we know its distribution $Y \sim \mathcal{N}(Wx+b,\sigma^2)$, to compute the posterior $\mathbb{P}(W,b | X=x,Y=y)$.
In the softmax regression model, we also have data on $Y$, but what the model specifies is $\mathbb{P}(Y | w, b, X)$. This seems to suggest that we should put a distribution on the space of probabilities, i.e. make $\mathbb{P}(Y = \cdot | w,b)$ a random variable and then condition it on the data. However, as stated earlier, we are given $Y$, not the probabilities. This raises the following two questions:
3. What distribution do I put on the probabilities?
4. Given 3, how do I condition the probabilities on $Y$?
More specifically, should I just use one-hot encoding and condition on that? That is, turn $Y = j$ into the vector $(0,...,0,1,0,...,0)$ that is zero everywhere except at position $j$, where it is one, and then condition the distribution of the probabilities on that?
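To spell out what I mean by conditioning on the one-hot vectors, here is the rewriting of the log-likelihood I have in mind; as far as I can tell, the one-hot form is just notation for picking out the probability of the observed class (toy numbers below, each row of `probs` would come from the softmax formula above):

```python
import numpy as np

# toy labels and model probabilities
y = np.array([2, 0, 1])
k = 3
probs = np.array([[0.2, 0.3, 0.5],
                  [0.7, 0.2, 0.1],
                  [0.1, 0.6, 0.3]])

# one-hot encoding of y: row i is zero except for a one in column y_i
Y_onehot = np.eye(k)[y]

# categorical log-likelihood written with the one-hot vectors:
# sum_i sum_j Y_onehot[i, j] * log probs[i, j]  ==  sum_i log probs[i, y_i]
loglik_onehot = np.sum(Y_onehot * np.log(probs))
loglik_direct = np.sum(np.log(probs[np.arange(len(y)), y]))
print(np.isclose(loglik_onehot, loglik_direct))  # True
```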
My final question is a humble request. It's quite possible that the past four questions are completely obvious to anyone who knows Bayesian inference, and that by asking them I've revealed that I don't really know any Bayesian inference. Should that be the case, the following question is a natural follow-up:
5. What is a good text on Bayesian inference?
You can impose as high a math requirement as you'd like. In truth, the mathier the better, so long as it's not just abstraction for the sake of abstraction.
Sorry for the long wall of text and thanks in advance!