7

I am new to Bayesian estimation. The assumption that the parameters are random variables seems a little unsettling to me. For example, when considering a model for data, what physical interpretation can I give to the equation

$$ \begin{eqnarray*} P(Data) & = & \sum_{\theta} P(Data,\theta) \\ & = & \sum_{\theta} P(Data\mid\theta)\,P(\theta) \end{eqnarray*} $$

This $P(\theta)$, i.e. a probability over the parameters, seems a bit awkward: after all, how do I know what the relative probability is of the generating process being a Gaussian MM with this particular parameter combination instead of, say, a neural network with that parameter configuration?

Further, it is intuitive to think of one process generating the data, whose parameters we are guessing. But instead here we have multiple processes generating the data in tandem, i.e. the sense of a true model is lost.
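To make the sum concrete, here is a toy sketch (values are assumed purely for illustration): take $\theta$ to be a coin's bias restricted to three candidate values, and Data to be "2 heads in 3 flips". Then $P(Data)$ is just the likelihood averaged over the prior.

```python
# Toy illustration of P(Data) = sum_theta P(Data|theta) * P(theta).
# theta = a coin's bias, restricted to three hypothetical values;
# Data = "2 heads in 3 flips" under a binomial model.
from math import comb

thetas = [0.25, 0.5, 0.75]   # candidate parameter values (assumed)
prior  = [1/3, 1/3, 1/3]     # P(theta): uniform prior belief

def likelihood(theta, heads=2, flips=3):
    # P(Data | theta) for the binomial model
    return comb(flips, heads) * theta**heads * (1 - theta)**(flips - heads)

# P(Data) = sum over theta of P(Data | theta) * P(theta)
p_data = sum(likelihood(t) * p for t, p in zip(thetas, prior))
print(round(p_data, 4))  # -> 0.3125
```

Nothing here requires the coin to "use" all three biases at once; the sum just averages the likelihood over my uncertainty about which bias is the true one.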

gung - Reinstate Monica
user3246971
  • You should think about the probability as a measure of certainty: 1 - you are completely sure about the given event, 0 - you have no idea what's going on. Even though the data was generated by some true model, you generally do not know the value of the parameters. But from the data you may get some insight, and that information, given by the data sample and expressed by the posterior distribution, now gives you more certainty about which values of the parameters might be more probable as compared to other values. – Tomas Sep 10 '14 at 18:11
  • All in all, the probability distribution of a parameter expresses our knowledge (as obtained from data alone, or by other means) about that parameter. The sense of a true model is not lost - you simply do not know with complete certainty which one of the possible models is the true one. Therefore we employ a measure (a probability distribution) to describe the available knowledge about the true model. A good source of information to clarify differences between frequentist and Bayesian approaches might be this link http://stats.stackexchange.com/questions/31867/bayesian-vs-frequentist-interpretations-of-probability – Tomas Sep 10 '14 at 18:12
  • Looking at $P(\theta)$ as a prior belief about what the true $\theta$ might be makes sense. $P(Data)$, however, is more intuitive when thinking in terms of frequencies: if I collect enough data (infinite) I can have $P(Data)$ by calculating frequencies. $P(Data|\theta)$ is similarly understood. When I use frequencies I am getting knowledge of a particular physical process, so using the Bayes rule like $P(Data)=\sum_{\theta}{P(\theta)*P(Data|\theta)}$ is confusing, as it seems like combining different physical processes to lead to the Data distribution. – user3246971 Sep 11 '14 at 02:47

2 Answers

3

how do I know what the relative probability is of the generating process being a Gaussian MM with this particular parameter combination instead of, say, a neural network with that parameter configuration?

Your $\theta$ is the set of parameters in your model. So for a Gaussian mixture model they are the means, covariances, and mixing parameters. In a Neural Network they are the weights and biases. These are totally different sets of quantities, so there's no reason to think that the $P(\theta)$ in either case will be related, either a priori or after seeing $D$.

$P(D \mid \theta)$ is the part of the formula that will be realised as a mixture model or a network, or whatever. But you have to decide which one you are working with, otherwise your prior is for the wrong quantities, which makes no sense.

Further, it is intuitive to think of one process generating the data, whose parameters we are guessing. But instead here we have multiple processes generating the data in tandem, i.e. the sense of a true model is lost.

You already think of the data as being potentially generated by different values of $\theta$ before any Bayesian questions arise. After all, the likelihood tells you how likely the data would have been generated under different sets of values. But your 'in tandem' idea suggests you think they all do it 'all at once' in the Bayesian case, so there is no sense of 'one true model'. That's a mistake. Maybe think of it like this:

Call the 'true model parameters' $\theta_0$. Bayesians and everybody else can agree that these are the things we want to know about. Then $D$ is actually a sample from $P(D \mid \theta_0)$. We just don't happen to know what $\theta_0$ is.

Our $P(D \mid \theta)$, where $\theta$ is any setting of parameters, just specifies the mechanism by which $D$ is assumed to be generated if we knew what the parameters were - a 'forward model' if you like. Often it's straightforwardly physical: think of $\theta$ as settings on a control panel. Bayesian methods start with $P(\theta)$ - your opinions or knowledge about what $\theta_0$ might be before seeing $D$ - and then condition on $D$ to get $P(\theta \mid D)$ - your new opinions or knowledge about what $\theta_0$ is after seeing $D$.

The sum you present above is actually mostly useful just as a normalising constant on the way to getting $P(\theta \mid D)$ which actually is useful. It's our updated beliefs about $\theta_0$. It has some other roles, as 'evidence', but for the purposes of your question these aren't relevant.
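The update described above can be sketched numerically. This is a toy example with assumed values, not anything from the question: three candidate values of $\theta$, a uniform prior, and likelihoods for some observed $D$ (here, 2 heads in 3 flips of a coin whose bias is $\theta$). The sum from the question appears only as the normalising constant.

```python
# Sketch of the Bayesian update: P(theta | D) = P(D | theta) P(theta) / P(D).
# Toy values, assumed for illustration.
thetas = [0.25, 0.5, 0.75]             # candidate parameter values
prior  = [1/3, 1/3, 1/3]               # P(theta) before seeing D
lik    = [0.140625, 0.375, 0.421875]   # P(D | theta): 2 heads in 3 flips

# The sum from the question, used here purely as a normalising constant:
p_d = sum(l * p for l, p in zip(lik, prior))

# The useful output: updated beliefs about theta_0.
posterior = [l * p / p_d for l, p in zip(lik, prior)]
print([round(x, 3) for x in posterior])  # -> [0.15, 0.4, 0.45]
```

Note that the posterior shifts probability towards the $\theta$ values that explain $D$ best, without ever "choosing" one of them.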

conjugateprior
  • Actually, I used a "universal model" when talking about $\theta$; with suitable values of its parameters, it can perform like a GMM or a Neural Net of any arbitrary connectivity. With this idealization, I am relieved of the limitations of being bound to a model type, and now with $P(\theta)$ running over a large set of $\theta$, I can consider whether I want the universal model to behave like a GMM or NN or whatever. So in one model I consider parametrizations over GMMs, NNs, etc. together. – user3246971 Sep 11 '14 at 02:18
  • I see your point that the usefulness of this rule might be in getting the posterior over $\theta$. However,for understanding's sake, couldn't we scrutinize the statement in the sense above, i.e. for arriving at the $P(Data)$? – user3246971 Sep 11 '14 at 02:53
  • Sure, there's not much to say: $P(D)$ is the probability of a data set $D$ taking into account (in Bayesian terms: integrating out) all your prior uncertainty about what $\theta_0$ is, as represented by $P(\theta)$. You could also think of it as a distribution over data sets you might see *next*, i.e. a predictive distribution. Or as your expectations about how data will look before you actually see any. It's often called a 'prior predictive distribution' for that reason. – conjugateprior Sep 11 '14 at 10:46
  • 'Prior Predictive Distribution' is a helpful name, I see how it is an expectation about what the data distribution might be, before seeing any. I used this interpretation in an answer above, but I had a doubt. I equated calculating $P(Data|S)$ to actually choosing a model from the gamut of $Params$, i.e. what actually happens is that $P(Params)$ changes to $P(Params|S)$. But suppose I calculate $P(Data|S)$ as a KDE. Then I could interpret this in 2 ways: a. I have chosen the $Params$ configuration for which $P(Data|Params)$ looks like the calculated $P(Data|S)$ – user3246971 Sep 12 '14 at 02:23
  • b. or that I chose a **set** of $Params$ (i.e. $P(Params|S)$ is not simply an indicator), and then aggregated their $P(Data|Params)$ to form $P(Data|S)$. So is there a degeneracy in this, and is our intuition in danger of being artificial? Or is $P(Params|S)$ actually decided by the family of models I was considering? I.e. suppose I was looking at KDE models, then $P(Params|S)$ would be an indicator. But if I was considering simple frequency-based models, then $P(Params|S)$ would have a larger span, and $P(Data|S)$ would be seen as an aggregation over their different $P(Data|Params)$. – user3246971 Sep 12 '14 at 02:34
  • But if I consider 'universal models', then again I have a confusion between what has been chosen, i.e. I could have chosen a $Params$ set which has a $P(Data|Params)$ looking exactly like $P(Data|S)$, or I could have aggregated over many $Params$. In all, how problematic to interpretation is this degeneracy? – user3246971 Sep 12 '14 at 02:39
2

This was too long for the comments, so posting it here. From what the others have pointed out about thinking of the prior as a belief, I think a road-block to understanding had been combining the prior and the conditional.

The prior $P(\theta)$ is understood as a belief about what the true $\theta$ might be. The conditional $P(Data|\theta)$ is better thought of in frequentist terms: take a model with this $\theta$, generate many samples from it, and just count the frequencies for each sample. Their combination $\sum_{\theta} P(\theta)\times P(Data|\theta)$ no longer corresponds to a concrete process with a well-defined $\theta$. So the problem is to understand that.

Suppose that initially I didn't have any concrete data; I just had a belief about what the background generating process could be, i.e. a $P(\theta)$. Also, for each process I could tell what the frequencies $P(Data|\theta)$ would be. Because I wasn't really sure about the process, $P(Data)$ was a belief: with all my uncertainty about $\theta$, I'd on average expect data, if I ever collected any, to have a distribution like this $P(Data)$.

But now I actually collect some samples, call this set $S$, and I calculate the frequencies of the samples. What I have now is $P(Data|S)$.

But I could write $P(Data|S)=\sum_{\theta} P(Data|\theta)P(\theta|S)$. Thinking in this way, my counting probability $P(Data|S)$ has been arrived at by first updating my belief about $\theta$ to $P(\theta|S)$, which becomes more spiked towards a particular $\theta$, so that the data distribution now looks more like $P(Data|\theta)$ for that $\theta$. So, was the crux the difference between $P(Data)$ and $P(Data|S)$?
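A small numeric sketch of this predictive sum, with all values assumed purely for illustration: take $\theta$ to be a coin's bias, and suppose the posterior $P(\theta|S)$ after seeing the sample $S$ has become spiked towards the larger biases. The predictive probability for the next flip is then the posterior-weighted average of the per-$\theta$ predictions.

```python
# Sketch of P(Data | S) = sum_theta P(Data | theta) P(theta | S).
# Toy values, assumed for illustration: theta = a coin's bias.
thetas    = [0.25, 0.5, 0.75]    # hypothetical coin biases
posterior = [0.15, 0.40, 0.45]   # P(theta | S) after seeing sample S (assumed)

# Predictive probability that the next flip is heads:
# here P(heads | theta) = theta, so the sum is a posterior-weighted average.
p_heads_next = sum(t * p for t, p in zip(thetas, posterior))
print(round(p_heads_next, 4))  # -> 0.575
```

As the posterior spikes harder on one $\theta$, this prediction converges to that single model's $P(Data|\theta)$, which is the intuition in the paragraph above.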

user3246971
  • I'm unclear what you intend S to denote. – conjugateprior Sep 12 '14 at 08:20
  • OK now I see. Let's distinguish a random variable $D$ from one of its realisations $d$ and say that $d$ and $d'$ are two sets of observed data. Now $P(D)$ represents your beliefs about the probability of seeing each possible set of observations. $P(d)$ gives the probability of seeing exactly $d$ (keep $D$ discrete-valued so that's not always zero). Then $P(D | d) = \sum_\theta P(D | \theta)P(\theta | d)$ (with usual conditional independence assumptions). We've moved from expectations about what $d$ might be to expectations in the light of $d$ about what a new set of data $d'$ might be. – conjugateprior Sep 12 '14 at 08:33
  • This makes my tentative understanding a bit more sure-footed. I made some comments to your answer about doubts concerning model selection, i.e. $P(\theta|d)$ as in your argument above, I'd appreciate your comments on those. – user3246971 Sep 12 '14 at 15:03
  • I'm pretty sure I don't follow your comments on my answer, tbh. But all the talk of indicators suggests that you've missed the crux of Bayesian inference: you never actually *choose* parameters. In particular, you're not searching for the best ones, or the best range of them, according to some pre-specified criteria. All you do is *move probability around over unknowns* (here that means over the possible values of the parameters) as new observations come in. The customary notation elides this (confusing $D$ and $d$, for example), which does not make things easier. – conjugateprior Sep 12 '14 at 15:42