Before addressing the specifics of conditioning on a model or parameter, the first thing to note here is that all probability statements are conditional on implicit information. As I have noted in some other answers (e.g., here), many theories of probability regard conditional probabilities as the "primitive" in probability theory, and derive "marginal" probabilities only as a consequence of removing certain explicit conditioning events. This viewpoint is most famously associated with the axiomatic approach of the mathematician Alfréd Rényi (see e.g., Kaminski 1984). Rényi argued that every probability measure must be interpreted as being conditional on some underlying information, and that reference to marginal probabilities was merely a reference to probability where the underlying conditions are implicit.
In practical applications of probabilistic/statistical modelling, any conditioning event that holds for the entire analysis is usually removed as an explicit condition in the notation --- it is simply not useful to condition every probability statement in the analysis on the same conditioning event. Consequently, if we are working with only a single model, we would not bother to mention conditioning on the model form at all; the assumptions of the model would instead form implicit conditions for the whole analysis.
Thus, explicit conditioning on the model is only useful in applications where we are considering more than one possible model form, and even then, only when the various model forms cannot fruitfully be stated as different parameter values within one overall model.
This occurs in some practical modelling applications, and it also occurs when examining statistical properties of models in a meta-analytical perspective, where we remove the underlying modelling assumptions and look at statistical behaviours in their absence.
So, as a practical matter, explicit conditioning on "models" (i.e., sets of assumptions about probabilistic behaviour of observable values) is only useful when:
- (1) There is more than one model under consideration in the analysis; and
- (2) The models in the analysis cannot be stated more simply as different parameter ranges in a single more general model (i.e., they are not just "nested" models).
Of note here is that, under these conditions, different models will have different parameters that mean different things in the context of those models. This leads to a problem if you want to refer to a probability like $p(M|\theta)$, which is the probability of a specific model conditional on a specific set of parameter values --- does the parameter $\theta$ even exist (and mean the same thing) under model $M$ and under the alternative models in the analysis?
In this modelling context, in order for conditioning probabilities of this kind to make sense, you have to ensure that all parameters are well-defined regardless of which model is used. (Otherwise you may end up conditioning on parameters that don't exist.) This means that you will need to stipulate a framework in which all parameters in all models exist, and you have a prior over all of these parameters. Moreover, parameters that don't appear in a model don't affect that model, so it stands to reason that your prior should treat parameters and models independently (unless we are talking about groups of parameters that may jointly exist under a single model, in which case we may allow prior dependence). As we will see below, these assumptions render conditional probabilities of models given parameters trivial --- the price of making the question well-posed for non-nested models is that the answer becomes trivial and unhelpful.
An example: To illustrate this issue, consider an analysis in which you are modelling a survival time $X \geqslant 0$ with one of two non-nested models and their corresponding parameters:
$$\begin{matrix}
\text{Model } M_1 & & & X \sim \text{Ga}(\text{Shape} = 2, \text{Scale} = \theta), \quad \quad \\[6pt]
\text{Model } M_2 & & & X \sim \text{Weibull}(\text{Shape} = 2, \text{Scale} = \lambda). \\[6pt]
\end{matrix}$$
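As a concrete sketch, the two candidate models above can be simulated directly; the particular parameter values below are hypothetical, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical parameter values, purely for illustration.
theta, lam = 1.5, 3.0

# Model M1: Gamma with shape 2 and scale theta (mean = 2 * theta).
x_m1 = rng.gamma(shape=2.0, scale=theta, size=n)

# Model M2: Weibull with shape 2 and scale lambda.
# NumPy's weibull() draws with scale 1, so multiply by lam;
# the mean is lam * Gamma(1 + 1/2) = lam * sqrt(pi) / 2.
x_m2 = lam * rng.weibull(2.0, size=n)

print(x_m1.mean(), x_m2.mean())
```

Note that neither model nests the other: no value of $\theta$ makes the gamma density coincide with a Weibull density, and vice versa.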
If you start to look at the conditional probabilities of models given parameters, you see an immediate problem --- the parameters for the two models are different. In order to ensure that the probabilities are well-defined (so that you can condition on the parameters regardless of which model is used), you can stipulate that all the parameters exist under each model, with a parameter having no effect in any model in which it does not appear. In this case, we could stipulate that the parameter vector $(\theta, \lambda)$ always exists, with model $M_1$ using only the first parameter and model $M_2$ using only the second. If we do this, then it also makes sense for our prior distribution over the models and parameters to treat them as independent --- i.e., we have:
$$\pi(M, \theta, \lambda) = \pi(M) \pi(\theta) \pi(\lambda).$$
Using a prior of this form and applying Bayes' rule gives the trivial results:
$$\begin{align}
p(M | \theta )
&= \frac{p(M, \theta)}{p(\theta) }
= \frac{\pi(M) \cdot \pi(\theta)}{\pi(\theta)}
= \pi(M), \\[12pt]
p(M | \lambda)
&= \frac{p(M, \lambda)}{p(\lambda) }
= \frac{\pi(M) \cdot \pi(\lambda)}{\pi(\lambda)}
= \pi(M). \\[6pt]
\end{align}$$
As you can see, the prior assumptions in this case lead to the parameters giving no information on which model is used. (Note that conditioning on the observed value of $X$ will usually give information on which model is used, but that is a different question.) Consequently, the inquiry into the conditional probability of models given the true parameter values is trivial and unhelpful.
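This triviality is easy to confirm by simulation. The sketch below (with a hypothetical prior $\pi(M_1) = 0.3$ and an arbitrary exponential prior on $\theta$) draws models and parameters independently and then conditions on $\theta$ falling in a small bin; the conditional frequency of $M_1$ simply recovers the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical prior: pi(M1) = 0.3, with theta drawn independently
# of the model indicator (an Exponential(1) prior, purely for illustration).
prior_m1 = 0.3
M = rng.random(n) < prior_m1          # True -> model M1, False -> model M2
theta = rng.exponential(1.0, n)       # theta independent of M

# Condition on theta lying in a small bin around 0.5 and estimate p(M1 | theta).
in_bin = np.abs(theta - 0.5) < 0.05
p_m1_given_theta = M[in_bin].mean()

print(p_m1_given_theta)  # approximately equal to prior_m1
```

Because $M$ and $\theta$ are independent under the prior, the same result holds for any bin location: the parameter carries no information about the model.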
The astute reader will no doubt have noticed that this trivial outcome comes directly from prior assumptions that stipulate independence between models and parameters. Consequently, it is natural to wonder whether we might get non-trivial results if we adopt a prior that treats these as dependent. Of course, this is possible, but it doesn't seem very sensible. If the parameter $\theta$ is meaningless under model $M_2$ (and the parameter $\lambda$ is meaningless under model $M_1$) then there is no value in stipulating different prior distributions for these parameters under the two models.