Before addressing the specifics of conditioning on a model or parameter, the first thing to note here is that all probability statements are conditional on implicit information. As I have noted in some other answers (e.g., here), many theories of probability regard conditional probabilities as the "primitive" in probability theory, and derive "marginal" probabilities only as a consequence of removing certain explicit conditioning events. This viewpoint is most famously associated with the axiomatic approach of the mathematician Alfréd Rényi (see e.g., Kaminski 1984). Rényi argued that every probability measure must be interpreted as being conditional on some underlying information, and that reference to marginal probabilities was merely a reference to probability where the underlying conditions are implicit.
In practical applications of probabilistic/statistical modelling, any conditioning event that holds for the entire analysis is usually removed as an explicit condition in the notation --- it is simply not useful to condition every probability statement in the analysis on the same conditioning event. Consequently, if we are working with only a single model, we would not bother to mention conditioning on the model form at all; the assumptions of the model would instead form implicit conditions for the whole analysis.
Thus, explicit conditioning on the model is only useful in applications where we are considering more than one possible model form, and even then, only when the various model forms cannot fruitfully be stated as different parameter values within one overall model.
This occurs in some practical modelling applications, and it also occurs when examining statistical properties of models in a meta-analytical perspective, where we remove the underlying modelling assumptions and look at statistical behaviours in their absence.
So, as a practical matter, explicit conditioning on "models" (i.e., sets of assumptions about probabilistic behaviour of observable values) is only useful when:
- (1) There is more than one model under consideration in the analysis; and
- (2) The models in the analysis cannot be stated more simply as different parameter ranges in a single more general model (i.e., they are not just "nested" models).
Of note here is that, under these conditions, different models will have different parameters that mean different things in the context of those models. This leads to a problem if you want to refer to a probability like $p(M|\theta)$, which is the probability of a specific model conditional on a specific set of parameter values --- does the parameter $\theta$ even exist (and mean the same thing) under model $M$ and under the alternative models in the analysis?
In this modelling context, in order for conditioning probabilities of this kind to make sense, you have to ensure that all parameters are well-defined regardless of which model is used. (Otherwise you may end up conditioning on parameters that don't exist.) This means that you will need to stipulate a framework in which all parameters in all models exist, and you have a prior over all of these parameters. Moreover, parameters that don't appear in a model don't affect that model, so it stands to reason that your prior should treat parameters and models independently (unless we are talking about groups of parameters that may jointly exist under a single model, in which case we may allow prior dependence). As we will see below, these assumptions render conditional probabilities of models given parameters trivial --- the price of making the question well-posed for non-nested models is that the answer becomes trivial and unhelpful.
An example: To illustrate this issue, consider an analysis in which you are modelling a survival time $X \geqslant 0$ with one of two non-nested models and their corresponding parameters:
$$\begin{matrix}
\text{Model } M_1 & & & X \sim \text{Ga}(\text{Shape} = 2, \text{Scale} = \theta), \quad \quad \\[6pt]
\text{Model } M_2 & & & X \sim \text{Weibull}(\text{Shape} = 2, \text{Scale} = \lambda). \\[6pt]
\end{matrix}$$
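As a concrete sketch, the two candidate models above can be simulated directly; the particular parameter values below are hypothetical, chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000

# Hypothetical parameter values, purely for illustration.
theta, lam = 1.5, 3.0

# Model M1: Gamma with shape 2 and scale theta (mean = 2 * theta).
x_m1 = rng.gamma(shape=2.0, scale=theta, size=n)

# Model M2: Weibull with shape 2 and scale lambda.
# NumPy's weibull() draws with scale 1, so multiply by lam;
# the mean is lam * Gamma(1 + 1/2) = lam * sqrt(pi) / 2.
x_m2 = lam * rng.weibull(2.0, size=n)

print(x_m1.mean(), x_m2.mean())
```

Note that neither model nests the other: no value of $\theta$ makes the gamma density coincide with a Weibull density, and vice versa.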
If you start to look at the conditional probabilities of models given parameters, you see an immediate problem --- the parameters for the two models are different. In order to ensure that the probabilities are well-defined (so that you can condition on the parameters regardless of which model is used), you can stipulate that all the parameters exist under each model, with a parameter having no effect in any model in which it does not appear. In this case, we could stipulate that the parameter vector $(\theta, \lambda)$ always exists, with model $M_1$ using only the first parameter and model $M_2$ using only the second. If we do this, then it also makes sense for our prior distribution over the models and parameters to treat them as independent --- i.e., we have:
$$\pi(M, \theta, \lambda) = \pi(M) \pi(\theta) \pi(\lambda).$$
Using a prior of this form and applying Bayes' rule gives the trivial results:
$$\begin{align}
p(M | \theta )
&= \frac{p(M, \theta)}{p(\theta) }
= \frac{\pi(M) \cdot \pi(\theta)}{\pi(\theta)}
= \pi(M), \\[12pt]
p(M | \lambda)
&= \frac{p(M, \lambda)}{p(\lambda) }
= \frac{\pi(M) \cdot \pi(\lambda)}{\pi(\lambda)}
= \pi(M). \\[6pt]
\end{align}$$
As you can see, the prior assumptions in this case lead to the parameters giving no information on which model is used. (Note that conditioning on the observed value of $X$ will usually give information on which model is used, but that is a different question.) Consequently, the inquiry into the conditional probability of models given the true parameter values is trivial and unhelpful.
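This triviality is easy to confirm by simulation. The sketch below (with a hypothetical prior $\pi(M_1) = 0.3$ and an arbitrary exponential prior on $\theta$) draws models and parameters independently and then conditions on $\theta$ falling in a small bin; the conditional frequency of $M_1$ simply recovers the prior:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# Hypothetical prior: pi(M1) = 0.3, with theta drawn independently
# of the model indicator (an Exponential(1) prior, purely for illustration).
prior_m1 = 0.3
M = rng.random(n) < prior_m1          # True -> model M1, False -> model M2
theta = rng.exponential(1.0, n)       # theta independent of M

# Condition on theta lying in a small bin around 0.5 and estimate p(M1 | theta).
in_bin = np.abs(theta - 0.5) < 0.05
p_m1_given_theta = M[in_bin].mean()

print(p_m1_given_theta)  # approximately equal to prior_m1
```

Because $M$ and $\theta$ are independent under the prior, the same result holds for any bin location: the parameter carries no information about the model.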
The astute reader will no doubt have noticed that this trivial outcome comes directly from prior assumptions that stipulate independence between models and parameters. Consequently, it is natural to wonder whether we might get non-trivial results if we adopt a prior that treats these as dependent. Of course, this is possible, but it doesn't seem very sensible. If the parameter $\theta$ is meaningless under model $M_2$ (and the parameter $\lambda$ is meaningless under model $M_1$) then there is no value in stipulating different prior distributions for these parameters under the two models.