
I am going through a lecture that is explaining how supervised learning can be thought of from a Bayesian perspective, where we are trying to maximize log p(theta | data). I am confused as to what the statement p(theta | data) means, where theta is the model parameters.

I do understand what p(theta), the prior, and p(data|theta) represent. p(data|theta) represents the probability that we encounter the given data points given the model parameters theta. But what exactly does "p(theta|data)" mean in words? More specifically, I'm confused about why we are trying to maximize "p(theta|data)" rather than "p(data|theta)", since the latter seems to be the actual measure of model performance.

3 Answers


It all comes back to Bayes' Theorem: $P(A \mid B)=\frac{P(B \mid A) \cdot P(A)}{P(B)}$

The first thing we need to make sure we are on the same page about is this: in the Bayesian philosophy, parameters have a distribution. That mental shift alone might answer your question, as the distribution of the parameter is the main pursuit in Bayesian inference. What is this equation saying? Given a supposed distribution of the parameter $A$ (the prior) and some data $B$ that we observe, how does that data change our knowledge of the parameter? If the data are unlikely under the assumed prior $P(A)$, i.e. $P(B \mid A)$ is small for the values of $A$ that the prior favors, then the posterior on the left-hand side will have a mean that looks more like the mean of the data than the mean of the prior. If the data are perfectly reasonable and likely under the prior $P(A)$, you will find that your posterior agrees with your prior's mean and $P(A \mid B) \approx P(A)$. This is why people talk about Bayesian statistics as "updating" beliefs. PS: we usually ignore the denominator, since it is just a normalizing constant.

The magic moment for me on this matter came when I first derived the Bayesian estimate for a mean a few years ago. Suppose your data are normal and you have a normal prior. Let the subscript $0$ denote a parameter from the prior. I'll skip the derivation, but the mean of the posterior turns out to be: $\bar{y} \cdot \frac{\frac{n}{\sigma^{2}}}{\frac{n}{\sigma^{2}}+\frac{1}{\tau_{0}^{2}}}+\mu_{0} \cdot \frac{\frac{1}{\tau_{0}^{2}}}{\frac{n}{\sigma^{2}}+\frac{1}{\tau_{0}^{2}}}$

What does this look like? A weighted average of the prior's mean and the estimated mean of the data! If your prior is really strong, your posterior mean will end up looking like your prior's mean. If your data carry more information than your prior, the weight lands on the $\bar{y}$ term and your posterior's mean will look like your data's mean. What would drive the weight one way or the other? If $n$ is huge, the prior mean's term shrinks away to nothing. And isn't that just what we want? If we have a lot of new evidence, we want it to overshadow our prior.

If the prior's variance $\tau_0^2$ is small, that means we feel we know a lot about the parameter. Notice that this will make the weight on the second term large compared to the data's influence.
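If it helps to see this tug of war numerically, here is a minimal Python sketch of the weighted-average formula above; the prior settings, the noise level, the "true" mean of 3.0, and the sample sizes are all made-up numbers for illustration, not anything taken from the derivation.

```python
import numpy as np

# Toy illustration of the posterior-mean formula above (normal data, known sigma,
# normal prior on the mean). All numbers are made up for demonstration.
mu_0, tau_0 = 0.0, 1.0   # prior mean and prior standard deviation
sigma = 2.0              # known data standard deviation

rng = np.random.default_rng(0)

for n in (5, 50, 5000):  # watch the prior's weight shrink as n grows
    y = rng.normal(loc=3.0, scale=sigma, size=n)  # data whose true mean differs from the prior
    y_bar = y.mean()

    w_data = (n / sigma**2) / (n / sigma**2 + 1 / tau_0**2)
    w_prior = (1 / tau_0**2) / (n / sigma**2 + 1 / tau_0**2)
    posterior_mean = y_bar * w_data + mu_0 * w_prior

    print(f"n={n:5d}  data mean={y_bar:.3f}  prior weight={w_prior:.3f}  posterior mean={posterior_mean:.3f}")
```

With $n=5$ the posterior mean sits noticeably below the data mean, pulled toward the prior at 0; by $n=5000$ the prior's weight is essentially zero and the posterior mean matches the data mean.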

All in all, Bayesian statistics is always a tug of war between the data and the assumed prior. How much data we have, and how strong our prior is, will influence our posterior. And remember, in regular old frequentist statistics we spend a lot of time working out the sampling distributions of parameter estimates, so hopefully the emphasis on finding the parameter's posterior does not feel out of place to you.

  • Thanks for the answer -- got it, the parameters have a distribution. So then I can rephrase my question about the term "p(theta|data)". Why would being given some data influence the parameters "theta" at all? This means the assumption is that the model parameters' distribution isn't just any distribution, but a distribution over what the "correct" parameters are (i.e. the parameters that capture the actual relation between X and Y in the data). Would this be an adequate interpretation? – Michael Turner Dec 10 '20 at 06:48
  • See if my new addition helps. –  Dec 11 '20 at 00:38
  • (I'm not sure I understand your question, feel free to elaborate) –  Dec 11 '20 at 00:51

I think the best way to understand $p(\theta|Data)$ is to understand what maximizing the product of the likelihood and the prior means. We know from Bayes' theorem that

$$p(\theta|Data)= \frac{p(Data|\theta)p(\theta)}{p(Data)}\propto p(Data|\theta)p(\theta)$$

Then maximizing $p(\theta|Data)$ is equivalent to maximizing $p(Data|\theta)p(\theta)$, in the following sense: among the values of $\theta$ sampled from $p(\theta)$, keep the ones that maximize $p(Data|\theta)$, i.e. keep the sampled parameters that make your data most probable.
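As a rough illustration of this (not part of the original argument), here is a minimal grid-search sketch in Python, assuming a normal likelihood with known scale and a normal prior on $\theta$; all of the constants are arbitrary choices for the demo.

```python
import numpy as np
from scipy import stats

# Illustrative MAP estimate on a grid: maximize p(Data|theta) * p(theta),
# which is proportional to the posterior p(theta|Data).
rng = np.random.default_rng(1)
data = rng.normal(loc=2.0, scale=1.0, size=20)  # observed data (true mean 2.0)

theta_grid = np.linspace(-5, 5, 1001)           # candidate parameter values
log_prior = stats.norm.logpdf(theta_grid, loc=0.0, scale=1.0)
log_lik = np.array([stats.norm.logpdf(data, loc=t, scale=1.0).sum() for t in theta_grid])

log_post_unnorm = log_lik + log_prior           # log p(Data|theta) + log p(theta)
theta_map = theta_grid[np.argmax(log_post_unnorm)]
print("MAP estimate:", theta_map)               # lands between the prior mean (0) and the data mean
```

Working with log-densities just keeps the products numerically stable; the argmax is unchanged.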

Fiodor1234

Another point - this is related but not necessarily the only explanation:

Supervised learning algorithms may be either generative or discriminative.

For example,

Generative classifiers:

  • Assume a functional form of $P(Y)$, $P(X|Y)$
  • Estimate both $P(Y)$, $P(X|Y)$ from data (training), which leads to the joint probability $P(Y, X)$ (this makes it a generative model), then use Bayes rule to get $P(Y|X)$

Discriminative classifiers:

  • Assume a functional form of $P(Y|X)$
  • Estimate $P(Y|X)$ directly from data (training)

Most of the Bayesian classifiers, such as naive Bayes, Markov models (MRFs, HMMs), and similar Bayesian graphical models are generative. Some examples of discriminative models are logistic regression, CARTs, SVMs, and CRFs (as opposed to MRFs).
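To make the generative recipe above concrete, here is a minimal hand-rolled sketch in Python: it estimates $P(Y)$ from class frequencies, models $P(X|Y)$ with class-conditional Gaussians (an assumption made purely for this demo), and applies Bayes' rule to get $P(Y|X)$ for a new point. The 1-D feature and all numbers are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Two classes with different means in a 1-D feature space.
X0 = rng.normal(loc=-1.0, scale=1.0, size=100)  # class 0
X1 = rng.normal(loc=2.0, scale=1.0, size=50)    # class 1

# Estimate P(Y) from class frequencies and P(X|Y) as class-conditional Gaussians.
p_y = np.array([len(X0), len(X1)], dtype=float)
p_y /= p_y.sum()
means = np.array([X0.mean(), X1.mean()])
stds = np.array([X0.std(ddof=1), X1.std(ddof=1)])

def gaussian_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2 * np.pi))

def posterior(x_new):
    # Bayes' rule: P(Y=k | X=x) is proportional to P(X=x | Y=k) * P(Y=k)
    joint = gaussian_pdf(x_new, means, stds) * p_y
    return joint / joint.sum()

print(posterior(0.5))  # class probabilities for a new observation
```

A discriminative model such as logistic regression would instead fit $P(Y|X)$ directly, without ever estimating $P(X|Y)$.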

The Bayesian framework can be used to see how different supervised learning algorithms behave and how they compare to one another: some estimate $P(Y|X)$ directly, while others model the joint probability and make predictions based on that.

As for kNN, I don't want to get embroiled in the discussion going on here: Is KNN a discriminative learning algorithm?

Have fun.

qxzsilver