
In the comments on this answer, user Scortchi asks:

So iff there's a sufficient statistic of constant dimension, there's a conjugate prior?

As far as I know this didn't get a complete answer, so I'm asking it as a new question in the hope of finding out whether it's true. My question is the quote above; I give more details below.

This question can be seen as asking for a generalisation of the Pitman-Koopman-Darmois theorem, which states that if a family of distributions is such that the support does not depend on the parameters, and if the family has a sufficient statistic whose dimensionality doesn't change as the number of samples increases, then the family must be an exponential family.
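
For concreteness, an exponential family has densities of the form $$f(x|\theta) = h(x)\exp\left(\eta(\theta)^\top T(x) - A(\theta)\right)$$ so that, for an iid sample, the likelihood depends on the data only through $$t_n(x_{1:n}) = \sum_{i=1}^n T(x_i),$$ whose dimension is that of $T$ and does not grow with $n$. For instance, the $\mathcal N(\mu,\sigma^2)$ family has $T(x) = (x, x^2)$, so $\left(\sum_i x_i, \sum_i x_i^2\right)$ is a two-dimensional sufficient statistic for every sample size.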

We also have that if a family of distributions is such that the support does not depend on the parameters and the family admits a conjugate prior, then it must be an exponential family, which is a similar but different result.
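
In that setting the conjugate prior can be written down explicitly: with the notation above, the family $$\pi(\theta|\tau, n_0) \propto \exp\left(\eta(\theta)^\top \tau - n_0 A(\theta)\right)$$ is conjugate, since observing $x_{1:n}$ simply updates the hyperparameters to $\left(\tau + \sum_{i=1}^n T(x_i),\ n_0 + n\right)$.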

However, as the example in the linked answer shows, if we relax the assumption that the support doesn't depend on the parameters, then it's possible for a distribution to have a conjugate prior without being an exponential family. The question is whether something similar happens if we relax the corresponding assumption in the Pitman-Koopman-Darmois theorem, and specifically, whether we end up with the same set of families of distributions in both cases.
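
A standard example of this phenomenon is the uniform family: $f(x|\theta) = \theta^{-1}\,\mathbf 1_{0<x<\theta}$ is not an exponential family, since its support depends on $\theta$, yet the Pareto family $\pi(\theta|\alpha,\beta) \propto \theta^{-\alpha-1}\,\mathbf 1_{\theta>\beta}$ is a conjugate prior for it, the posterior given $x_{1:n}$ being Pareto with updated hyperparameters $\left(\alpha + n,\ \max(\beta, \max_i x_i)\right)$.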

In other words, the conjecture is that the Pitman-Koopman-Darmois theorem can be generalised into the following statement: "For an arbitrary family of distributions, whose support might depend on its parameters, the family has a conjugate prior if and only if it has a sufficient statistic whose dimensionality doesn't change as the number of samples increases." Is this statement true or false?

N. Virgo

1 Answer


If there exists a finite dimensional conjugate family, $$\mathfrak F=\{\pi(\cdot|\alpha)\,;\ \alpha\in A\}$$ with $\dim(A)=d$, this means that, for any $\alpha\in A$, there exists a mapping $\tilde\alpha_n\,:\mathfrak X^n \to A$ such that $$\pi(\theta|x_{1:n},\alpha)\propto f_n(x_{1:n}|\theta)\pi(\theta|\alpha) \propto \pi(\theta|\tilde\alpha_n(x_{1:n}))$$ Hence, for an arbitrary $\alpha$, $$f_n(x_{1:n}|\theta) = \pi(\theta|\tilde\alpha_n(x_{1:n}))\, m_\alpha(x_{1:n})\, \pi(\theta|\alpha)^{-1}$$ factorises into a function of $\theta$ and $\tilde\alpha_n(x_{1:n})$ and a function of $x_{1:n}$ that does not depend on $\theta$. This implies that $\tilde\alpha_n(x_{1:n})$ is a sufficient statistic when $\Theta$ is restricted to the support of $\pi(\cdot|\alpha)$. Furthermore, assuming $\alpha$ and $\alpha^\prime$ are such that the supports of $\pi(\cdot|\alpha)$ and of $\pi(\cdot|\alpha^\prime)$ intersect, on that intersection each posterior is a function of both its own summary statistic (the one attached to $\alpha$) and of the other (attached to $\alpha^\prime$), which means the two statistics must be functions of one another (i.e., in bijection), I believe, leading to a potential conclusion that does not depend on the support.
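
To illustrate on the uniform–Pareto example from the question: there $\tilde\alpha_n(x_{1:n}) = \left(\alpha + n,\ \max(\beta, \max_i x_i)\right)$, which has fixed dimension and is indeed sufficient once $\Theta$ is restricted to the support $\{\theta > \beta\}$ of the prior, while the factorisation above fails for $\theta\le\beta$ (the point raised in the comments below).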

Considering the converse implication, and assuming iid data, if there exists a sufficient statistic of fixed dimension for all (large enough) $n$'s, $t_n\,:\mathfrak X^n \to \mathbb R^d$, then by the factorisation theorem, $$f(x_{1:n}|\theta) = \tilde {f_n}(t_n(x_{1:n})|\theta)\times m_n(x_{1:n})$$ which implies that, for any prior $\pi(\cdot)$, the posterior satisfies $$\pi(\theta|x_{1:n})\propto \tilde {f_n}(t_n(x_{1:n})|\theta)\times \pi(\theta)$$ For a given distribution density $\pi_0(\cdot)$ over $\Theta$, the family of priors $$\mathfrak F=\{ \tilde \pi(\theta)\propto \tilde {f_n}(t_n(x_{1:n})|\theta) \pi_0(\theta)\,;\ n\in \mathbb N, x_{1:n}\in\mathfrak X^n\}$$ where the $x_{1:n}$'s are pseudo-observations of size $n$ indexing the prior distributions, is conjugate since, if the prior is of the form $$\tilde \pi(\theta)\propto \tilde {f_n}(t_n(x_{1:n})|\theta) \pi_0(\theta)$$ then the posterior associated with the observations $y_{1:m}$ is $$\tilde \pi(\theta|y_{1:m})\propto \tilde {f_m}(t_m(y_{1:m})|\theta)\,\tilde {f_n}(t_n(x_{1:n})|\theta)\,\pi_0(\theta)\propto \tilde {f}_{n+m}(t_{n+m}(z_{1:(n+m)})|\theta)\,\pi_0(\theta) $$ where $z_{1:(n+m)}=(x_{1:n},y_{1:m})$, i.e., the posterior again belongs to $\mathfrak F$.
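
As a quick sanity check of this construction, here is a minimal numerical sketch (the uniform $\mathrm U(0,\theta)$ family is just a convenient concrete instance of my choosing, with sufficient statistic $t_n(x_{1:n}) = (n, \max_i x_i)$ and an arbitrary base prior $\pi_0$): the posterior of a member of $\mathfrak F$ indexed by pseudo-observations $x_{1:n}$ is again a member of $\mathfrak F$, indexed by the concatenated sample.

```python
import numpy as np

# Numerical sketch of the pseudo-observation construction, using the uniform
# U(0, theta) family as a concrete instance: the fixed-dimension sufficient
# statistic is t_n(x_{1:n}) = (n, max_i x_i).

rng = np.random.default_rng(0)
theta = np.linspace(0.01, 10.0, 2000)        # grid over the parameter space

def lik(t_max, n, th):
    """Likelihood as a function of the sufficient statistic: th^(-n) on {th > t_max}."""
    return np.where(th > t_max, th ** (-float(n)), 0.0)

pi0 = np.exp(-theta)                          # arbitrary base prior pi_0 (here Exp(1))

x = rng.uniform(0, 3.0, size=5)               # pseudo-observations indexing the prior
y = rng.uniform(0, 3.0, size=7)               # actual observations

prior = lik(x.max(), len(x), theta) * pi0         # member of the conjugate family F
posterior = lik(y.max(), len(y), theta) * prior   # posterior given y

# Member of F indexed by the concatenated sample z = (x, y)
z = np.concatenate([x, y])
candidate = lik(z.max(), len(z), theta) * pi0

# After normalisation the two coincide, i.e. the posterior stays in the family.
posterior /= posterior.sum()
candidate /= candidate.sum()
print(np.allclose(posterior, candidate))      # True
```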

Xi'an
    Thank you, this is great. There is maybe a slightly tricky point in your third equation though: it seems like we have to assume that $\pi(\theta|\alpha)>0$ for every $\theta$ and the chosen value of $\alpha$. This isn't the case for any arbitrary $\alpha$ in the example I linked to, where the conjugate prior is the Pareto family. However, it seems like that might not be a problem, because you only really need there to exist some value of $\alpha$ for which it's the case, and that seems like it should always be true. Do you have any thoughts about that? – N. Virgo Mar 14 '21 at 04:19
  • You are correct, I missed that point. – Xi'an Mar 14 '21 at 10:13
  • The first implication would work were there a conjugate prior whose support included the support of the likelihood function. Or a finite collection of conjugate priors whose support union contained the support of the likelihood function. – Xi'an Mar 14 '21 at 10:29
  • I think there could be a possible resolution by arguing that the summary statistics $α_n(x_{1:n},α)$ are functions of one another when the prior hyperparameter α varies, but cannot quickly produce a rigorous argument, because of the varying supports (in θ and in $x_{1:n}$). – Xi'an Mar 14 '21 at 13:08