
Suppose I have a parametric nonlinear model, say $$ y_i |\theta \sim N(f_{\theta}(x_i), \sigma^2) $$ with known form of $f_\theta$. We get data $d=(y_i,x_i)_{i=1,\ldots,n}$ and obtain posterior samples so we can make inference on $\theta|d$. This data is collected in such a way that $x$ is not random, but rather fixed by experimental design.

Now suppose I get new data $y^{(new)}$ but no $x^{(new)}$. Is there some way I can utilize my previous Bayesian model to make posterior inference about $x^{(new)}$?

EDIT: Apparently, in the frequentist world this is called inverse prediction, and the initial dataset is called a calibration dataset.

EDIT2: What would the consequences be if I were to fit the inverse model $$ x_i|\theta \sim N(f^{-1}_\theta(y_i),\sigma_x^2)? $$ Note that the $x_i$ in the calibration set are fixed and measured without error, so this seems strange to me and probably problematic in some way. I could then get my answer by using the posterior predictive $p(x^{(new)}|x)$. In retrospect this approach seems nonsensical.

EDIT3: My particular model can also be formulated as $y_i = f_\theta(x_i) + \epsilon_i$ where $\epsilon_i \sim N(0,\sigma^2)$. I could use my posterior samples to draw from this residual, $e_j \sim \epsilon|d$, and then $f^{-1}_\theta(y^{(new)} - e_j)$ might yield samples from the distribution I am interested in. Thoughts?
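For concreteness, the EDIT3 recipe can be sketched as follows. Everything here is a hypothetical stand-in: a monotone curve playing the role of $f_\theta$ (so the inverse exists in closed form), and synthetic normal draws standing in for the posterior of $(\theta, \sigma)$ given the calibration data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical monotone calibration curve and its inverse; the real
# f_theta would replace these.
def f(x, theta):
    return theta[0] * (1.0 - np.exp(-theta[1] * x))

def f_inv(y, theta):
    return -np.log(1.0 - y / theta[0]) / theta[1]

# Stand-ins for posterior draws of (theta, sigma) | calibration data.
n_draws = 4000
theta_draws = rng.normal([2.0, 0.5], 0.02, size=(n_draws, 2))
sigma_draws = np.abs(rng.normal(0.05, 0.005, size=n_draws))

y_new = 1.2

# EDIT3 recipe: draw a residual e_j per posterior draw, then push the
# perturbed observation back through the inverse curve.
e = rng.normal(0.0, sigma_draws)
x_samples = f_inv(y_new - e, theta_draws.T)

print(x_samples.mean(), np.quantile(x_samples, [0.025, 0.975]))
```

Note that this propagates posterior uncertainty in $(\theta, \sigma)$ through the inverse curve, but it places no prior on $x^{(new)}$ itself.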

bdeonovic
    In your topline model you are conditioning on $x_i$ rather than modeling it. So that's a non-starter in terms of your stated objective. Your inverse model could work if it matches the context, but it seems like you're setting up your model solely to get a desired output. Does your context warrant you specifying what you called the inverse model? Alternatively, it would seem to me that you need to posit and fit a joint model to $x$ and $y$ and then to make your predictions given an observed $y$ you derive the appropriate conditional model for $x|y$ – psboonstra Aug 24 '21 at 21:54
    You would need a prior distribution on $x_i$ if you want to infer about it. – Xi'an Aug 25 '21 at 11:24
  • @psboonstra I agree a joint model makes the most sense, but the data collected are $y|x$, ie I don't have a random sample from the joint to model. I think the proposed inverse method is nonsensical, especially since the $x$'s are not random. – bdeonovic Aug 25 '21 at 14:06
  • @Xi'an I would be willing to assume a prior distribution on $x$ in the population – bdeonovic Aug 25 '21 at 14:06
  • If the $x_i$ are not random, then why isn't there a process to derive the $x^{(new)}$ for the given $y^{(new)}$? What exactly is the process that generates the $x_i$, and, correspondingly, for the $y^{(new)}$? – Dave Aug 25 '21 at 17:11
  • it may be expensive and/or impossible for a variety of reasons to generate $x^{(new)}$ for $y^{(new)}$ – bdeonovic Aug 25 '21 at 17:45
  • @bdeonovic wouldn't the $x_i$ (and $x^{(new)}$) then be "effectively random", in the same sense that a deterministic RNG is "effectively random"? If so then the fact that they are not "truly" random (whatever that means) is a red herring in this question? – Dave Aug 25 '21 at 18:16
  • @Dave Let's say $y$ is some response to some treatment which can be measured on some continuous level (ie $x$). I gather data on the relationship between $y$ and $x$ by selecting 10 levels of $x$ to administer and then collecting data on the response $y$. I now have some new $y^{(new)}$ with unknown treatment level. I want to make inference on what this unknown treatment is. Does that make sense? – bdeonovic Aug 25 '21 at 18:23
  • @bdeonovic why isn't regular bayesian inference assuming a uniform prior on $x$ sufficient for your problem? (I'm assuming the 10 levels you've run effectively span the space for the values of $x$). The picture I have is that "out in the field" the technicians are sloppy so they might have actually administered some arbitrary amount of treatment; all that is getting back to HQ for analysis is the outcome (not the actual applied amount). This might be non-random in that specific technicians are precise but not accurate. (even there once you pick that model there might be things to do...) – Dave Aug 25 '21 at 18:27
  • "Now suppose I get new data $y^{(new)}$ but no $x^{(new)}$. Is there someway I can utilize my previous Bayesian model to make posterior inference about $x^{(new)}$?" This sentence is not very clear. I have the idea that you wish to predict $x_i$ based on observed $y_i$, and you do this with a previous data set of matched pairs of $x_j, y_j$. You mention 'my previous Bayesian model' but what is this model more precisely? You say " This data is collected in such a way that x is not random" but what about the second data set? – Sextus Empiricus Aug 28 '21 at 17:14

4 Answers


Ok, I've edited my response taking into account feedback from the OP. Below is a DAG that captures the assumptions provided. So, for example, $x^{(\mathrm{new})}$ need not be equal in distribution to $x$ (as the OP required), and $y^{(\mathrm{new})}$ is conditionally independent of the training data $\mathbf{Y}$ given $\theta$.

DAG

We require draws from the distribution $p(x^{(\mathrm{new})}|y^{(\mathrm{new})}, \mathbf{Y})$. We have that

\begin{align} p(x^{(\mathrm{new})}|y^{(\mathrm{new})}, \mathbf{Y}) \propto p(y^{(\mathrm{new})}|x^{(\mathrm{new})}, \mathbf{Y}) p(x^{(\mathrm{new})}|\mathbf{Y}) \end{align}

where we've dropped terms that do not depend on $x^{(\mathrm{new})}$. Focusing on one term at a time:

\begin{align} p(y^{(\mathrm{new})}|x^{(\mathrm{new})}, \mathbf{Y}) &= \int_\theta p(y^{(\mathrm{new})}|x^{(\mathrm{new})}, \mathbf{Y}, \theta) p(\theta|x^{(\mathrm{new})}, \mathbf{Y})d\theta\\ &= \int_\theta p(y^{(\mathrm{new})}|x^{(\mathrm{new})},\theta) p(\theta|\mathbf{Y})d\theta\\ &= E_{\theta|\mathbf{Y}} \left[p(y^{(\mathrm{new})}|x^{(\mathrm{new})},\theta) \right] \end{align}

The implied conditional independences from the DAG justify dropping the terms in the second line. Next

\begin{align} p(x^{(\mathrm{new})}|\mathbf{Y}) \propto p(x^{(\mathrm{new})},\mathbf{Y})\propto p(x^{(\mathrm{new})}) \end{align}

again using the DAG as justification that $x^{(\mathrm{new})}$ and $\mathbf{Y}$ are independent.

Thus, \begin{align} p(x^{(\mathrm{new})}|y^{(\mathrm{new})}, \mathbf{Y}) \propto p(x^{(\mathrm{new})}) E_{\theta|\mathbf{Y}} \left[p(y^{(\mathrm{new})}|x^{(\mathrm{new})},\theta) \right] \end{align}

Then use an importance sampling approach to draw from this distribution. Specifically, first create $M$ draws from the prior distribution for $x$, as in $x^{(\mathrm{new})}_\ell \sim p(x^{(\mathrm{new})})$, $\ell=1,\ldots,M$, where $M$ is some very large integer.

Then calculate $w_\ell\equiv E_{\theta|\mathbf{Y}} \left[p(y^{(\mathrm{new})}|x^{(\mathrm{new})}_\ell,\theta) \right]$ and resample with replacement from your set of $M$ values, with sampling probability proportional to $w_\ell$. I do not know how large your resampled set should be, but I think it should be smaller than $M$.

Your resampled set can be taken as a draw from the posterior predictive distribution of interest.
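A minimal sketch of this sampling-importance-resampling scheme; the saturating curve standing in for $f_\theta$, the synthetic draws standing in for $\theta, \sigma \mid \mathbf{Y}$, and the uniform prior on $x^{(\mathrm{new})}$ are all assumptions for illustration.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# Hypothetical saturating curve in place of f_theta.
def f(x, a, b):
    return a * x / (b + x)

# Stand-ins for posterior draws of (theta, sigma) | Y.
n_post = 500
a_draws = rng.normal(3.0, 0.05, size=n_post)
b_draws = rng.normal(1.5, 0.05, size=n_post)
sigma_draws = np.abs(rng.normal(0.1, 0.01, size=n_post))

y_new = 1.8
M = 5000
x_prior = rng.uniform(0.0, 10.0, size=M)          # x_l ~ p(x_new)

# w_l = E_{theta|Y}[ p(y_new | x_l, theta) ], approximated by averaging
# the Gaussian likelihood over the posterior draws.
mu = f(x_prior[:, None], a_draws, b_draws)        # shape (M, n_post)
w = norm.pdf(y_new, loc=mu, scale=sigma_draws).mean(axis=1)

# Resample with replacement, probability proportional to w_l.
idx = rng.choice(M, size=M // 4, replace=True, p=w / w.sum())
x_post = x_prior[idx]
```

The resampled `x_post` plays the role of draws from $p(x^{(\mathrm{new})}|y^{(\mathrm{new})}, \mathbf{Y})$.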

EDIT: Or do a Metropolis–Hastings algorithm to do your sampling. Or others. Xi'an (who commented on your question) is an expert on sampling algorithms.

psboonstra
  • Is $p(Y|x^{(new)},\theta)$ really the likelihood of data as if the covariate were identical to $x^{(new)}$? How would I get samples from this posterior predictive distribution? – bdeonovic Aug 25 '21 at 16:49
  • Also I don't see how this utilize the fact that I observe $y^{(new)}$ – bdeonovic Aug 25 '21 at 16:58
  • Yes I overlooked that you wanted to condition on $y^{(new)}$ and have incorporated that now. I also added how to sample from the posterior distribution – psboonstra Aug 26 '21 at 00:07

I'm going to take a Bayesian approach to this.

As far as I can tell, the existence of the training set is irrelevant for this problem -- it doesn't matter how we obtained the model, we can just take it as given and fixed.

So, the actual inference problem is to obtain posterior distribution for $\vec{x}' = [x_1', x_2' \ldots x_N']$ corresponding to an ensemble $\vec{y}' = [y_1', y_2' \ldots y_N']$ (these are the $y^{(new)}$).

Apply Bayes' theorem to get the conditional distribution

$$ P(\vec{x}' \vert \vec{y}'; \theta) = \frac{P(\vec{y}' \vert \vec{x}'; \theta)\,P(\vec{x}'; \theta) }{ \int P(\vec{y}' \vert \vec{x}'; \theta)\, P(\vec{x}'; \theta)\, d\vec{x}' } $$

$$ P(\vec{x}' \vert \vec{y}'; \theta) = \frac{P(\vec{x}'; \theta) \prod_i P(y_i' \vert x_i'; \theta) }{ \int P(\vec{y}' \vert \vec{x}'; \theta)\, P(\vec{x}'; \theta)\, d\vec{x}' } $$

Now comes the modeling part. Given the statement "x is not random, but rather fixed by experimental design", we're going to have to make some kind(s) of assumptions about the process that generated the $\vec{y}'$.

The place I'd start is just to assume that the $x_i'$ are independently drawn from some prior distribution $P(\vec{x}') = \prod_i P(x_i')$, suitably specified by domain knowledge or by using the relevant maximum entropy prior. Then the probability distribution fully factorizes, and one can compute the posterior distribution for each factor corresponding to a particular $x_i'$.

If there are domain reasons to assume that the sequence of $x_i$ are correlated with one another, then you'd need to incorporate that into the model. Note that even in this case, if desired, one can compute $P(x_i' \vert \vec{y}'; \theta)$ by marginalizing out all of the other $x'$ variables.

From here, the main problem is practical: because the factors $P(y_i' \vert x_i'; \theta) = N(y_i' ; f_\theta (x_i') , \sigma^2)$ are not simple functions of $x_i'$ (due to the nonlinear $f_\theta$), you'll probably need to resort to some sort of numerical approach (Monte Carlo, perhaps) or approximation (the saddlepoint method, perhaps) to construct a representation of the posterior.
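As a sketch of the simplest numerical route for a single $x_i'$: tabulate the unnormalised posterior on a grid under a uniform prior. The monotone curve and the synthetic draws standing in for the fitted model's $\theta, \sigma$ posterior are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical monotone curve in place of f_theta.
def f(x, a, b):
    return a * np.sqrt(x) + b

# Stand-ins for posterior draws of (theta, sigma) | data.
n_post = 500
a_draws = rng.normal(1.5, 0.03, size=n_post)
b_draws = rng.normal(0.3, 0.03, size=n_post)
sigma_draws = np.abs(rng.normal(0.08, 0.008, size=n_post))

y_new = 2.4
grid = np.linspace(0.01, 5.0, 400)                # uniform-prior support

# Unnormalised posterior on the grid: average the Gaussian likelihood
# N(y_new; f(x), sigma^2) over the posterior draws.
mu = f(grid[:, None], a_draws, b_draws)           # shape (400, n_post)
like = np.exp(-0.5 * ((y_new - mu) / sigma_draws) ** 2) / sigma_draws
post = like.mean(axis=1)
weights = post / post.sum()                       # normalise on the grid

x_mean = (grid * weights).sum()                   # posterior mean of x'
```

The grid of `weights` is the discretised posterior for $x'$; summaries (mean, intervals) follow directly from it.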

Dave
  • I think I was overthinking it; for some reason in my head I didn't think I could just refit the model with the $(x^{(new)}, y^{(new)})$ included in the model – bdeonovic Aug 25 '21 at 19:29
  • @bdeonovic you can't use this to update the model (that is what I'm taking from your "refit"). Your question was " Is there someway I can utilize my previous Bayesian model to make posterior inference about x(new)?" this is a way to get the posterior distributions for xnew; that is all. – Dave Aug 25 '21 at 19:32
  • yes I think we are on the same page. After some thought I think I've come to the conclusion it is simpler to just refit the model with the $y^{(new)}$ observed and $x^{(new)}$ a parameter with a prior. – bdeonovic Aug 25 '21 at 20:13

Denoting all the conditioning explicitly (which you should make a habit of doing in Bayesian analysis), your nonlinear regression model is actually specifying:

$$p(y_i | x_i, \theta, \sigma) = \text{N}(y_i | f_\theta(x_i), \sigma^2).$$

Now, if you want to make a Bayesian inference about any of the values in the conditional part, you are going to need to specify a prior for them. Fundamentally this is no different from any situation in Bayesian analysis; if you want a posterior for the regressors then your model must specify an appropriate prior. I'm going to assume that you will want to model the regressors using a parametric model with an additional parameter vector $\lambda$. In this case, it is useful to decompose the prior for these three conditioning variables in a hierarchical manner as:

$$\begin{align} \text{Prior for model parameters} & & & \pi(\theta, \sigma, \lambda) \\[6pt] \text{Sampling distribution for regressors} & & & \phi(x_i | \theta, \sigma, \lambda) \end{align}$$

I'm also going to assume that the regressors are IID conditional on the model parameters, so that $p(\mathbf{x}| \theta, \sigma, \lambda) = \prod \phi(x_i | \theta, \sigma, \lambda)$. If you specify this sampling distribution for the regressors then you will get the posterior distribution:

$$\begin{align} \phi(\mathbf{x} | \mathbf{y}) &\overset{\mathbf{x}}{\propto} p(\mathbf{x}, \mathbf{y}, \theta, \sigma, \lambda) \\[12pt] &= \pi(\theta, \sigma, \lambda) \prod_{i=1}^n p(y_i | x_i, \theta, \sigma) \cdot \phi(x_i | \theta, \sigma, \lambda) \\[6pt] &= \pi(\theta, \sigma, \lambda) \prod_{i=1}^n \text{N}(y_i | f_\theta(x_i), \sigma^2) \cdot \phi(x_i | \theta, \sigma, \lambda). \\[6pt] \end{align}$$

Computing the last line of this formula will give you the posterior kernel, and then you can get the posterior distribution by computing the constant for the density directly, or by using MCMC simulation.
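A toy random-walk Metropolis sketch of evaluating this kernel: the curve, the priors $\pi$ and $\phi$, the proposal scales, and the synthetic data are all illustrative assumptions, with $\lambda$ held fixed for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic calibration data from a hypothetical curve f_theta.
def f(x, theta):
    return theta[0] * np.tanh(theta[1] * x)

x_obs = np.linspace(0.5, 3.0, 20)
y_obs = f(x_obs, np.array([2.0, 0.7])) + rng.normal(0, 0.1, 20)

def log_kernel(theta, log_sigma, x):
    """log of pi(theta, sigma) * prod_i N(y_i | f_theta(x_i), sigma^2) * phi(x_i),
    with lambda fixed so phi(x_i) = N(x_i | 0, 2^2), and vague normal priors."""
    sigma = np.exp(log_sigma)
    lp = -0.5 * np.sum(theta ** 2) / 100.0               # pi(theta)
    lp += -0.5 * log_sigma ** 2 / 100.0                  # pi(sigma), on log scale
    lp += np.sum(-0.5 * ((y_obs - f(x, theta)) / sigma) ** 2 - np.log(sigma))
    lp += np.sum(-0.5 * (x / 2.0) ** 2)                  # phi(x_i)
    return lp

# Plain random-walk Metropolis over (theta, log sigma, x).
theta, log_sigma, x = np.array([1.0, 1.0]), 0.0, x_obs.copy()
cur = log_kernel(theta, log_sigma, x)
draws = np.empty((5000, 2))
for it in range(5000):
    t_prop = theta + rng.normal(0, 0.05, 2)
    s_prop = log_sigma + rng.normal(0, 0.05)
    x_prop = x + rng.normal(0, 0.05, x.size)
    new = log_kernel(t_prop, s_prop, x_prop)
    if np.log(rng.uniform()) < new - cur:                # symmetric proposal
        theta, log_sigma, x, cur = t_prop, s_prop, x_prop, new
    draws[it] = theta
```

In practice a tuned sampler (e.g. Stan or another HMC implementation) would replace this hand-rolled chain; the point is only that the kernel in the last display is all the sampler needs.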

Ben

This is an observation rather than an answer, and if $f_{\theta}$ has some heinous form and/or your sample size is small, it may not have any practical relevance for your particular problem.

You know the form of $f_{\theta}$, and that must help! In particular all plausible potential values of $x_{new}|y_{new}$ must be consistent with what you know about the form of $f_{\theta}$.

Now if $f_{\theta}$ belongs to a family of well-behaved functions then that might tell you a great deal about what are the plausible values of $x_{new}$ once you have observed $y_{new}$. For example, if $f_{\theta}$ is continuous in $x$ or better yet (in order of increasing niceness) Lipschitz continuous, or differentiable, or monotonic, or convex, or linear.

In any case it certainly makes sense to look at observed $(x,y)$ pairs in the data, with similar $y$-values to $y_{new}$. As a very simple toy example, suppose that for all observed $y$-values in the data within an interval $I:= y_{new} \pm \epsilon$ it is true that the corresponding observed $x$-values are tightly clustered in a small region $A$ (e.g. a ball or an interval) of the domain of $X$. Clearly in this situation if $f_{\theta}$ is continuous in $x$ we should reason that $P(x_{new}\in A|y_{new}\in I) \gg P(x_{new}\notin A|y_{new}\in I)$.
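The toy example reads, in code, roughly as follows; the monotone curve and all numbers are synthetic stand-ins for the observed calibration pairs.

```python
import numpy as np

rng = np.random.default_rng(4)

# Synthetic calibration pairs from a smooth monotone curve (illustrative).
x = np.linspace(0.0, 4.0, 200)
y = np.tanh(x) + rng.normal(0, 0.02, x.size)

y_new, eps = 0.8, 0.05
# Observed x whose y lands in the interval I = y_new +/- eps; under a
# continuous monotone f these bracket the plausible region A for x_new.
in_I = np.abs(y - y_new) < eps
region = x[in_I]
lo, hi = region.min(), region.max()
```

Here `[lo, hi]` is the region $A$ of the toy example: the interval of observed $x$-values whose responses fall in $I$.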

Bob Durrant