To be clear, I doubt I am using the term "cross-validation" correctly here; what I am suggesting also seems similar to "bootstrapping" and "hyperparameter tuning". Terminology is not my strength.
Let's say we have a data set with $20$ observations, $D_1, \dots, D_{20}$. We don't know what prior to use for the data set, so we decide to use the maximum entropy prior given the population mean and variance, i.e. a normal prior. (This of course assumes the population distribution has finite second moment. I am not convinced that this assumption is innocuous, but it is common.)
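For reference, the maximum entropy fact being invoked here is that, among all densities on $\mathbb{R}$ with a given mean $\mu$ and variance $\sigma^2$, the normal density maximizes differential entropy: $$\mathscr{N}(\mu, \sigma^2) = \arg\max_{p \,:\, \mathbb{E}_p[X] = \mu,\; \mathrm{Var}_p(X) = \sigma^2} \Big( -\int_{\mathbb{R}} p(x) \log p(x) \, dx \Big) \,.$$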
But of course we don't know the population mean and population variance, so we estimate them. We can't use all of the data to estimate them, because then there would be no data left to do inference on. So let's say we use observations $D_1, \dots, D_{15}$ to get an estimate $\hat{\mu}$ of the population mean $\mu$ and an estimate $\hat{\sigma}^2$ of the population variance $\sigma^2$. We then take $\mathscr{N}(\hat{\mu}, \hat{\sigma}^2)$ as our prior and use the remaining $5$ observations $D_{16}, \dots, D_{20}$ to do inference with it.
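Here is a minimal sketch of that split in Python, just to pin down what I mean. The sample mean/variance estimators and the conjugate normal likelihood (known variance) are assumptions I am making for the sketch; the question above deliberately leaves both unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations

# "Hyperparameter" step: estimate the prior's mean and variance from D_1..D_15
# (here simply the sample mean and sample variance).
train, holdout = D[:15], D[15:]
mu_hat = train.mean()
sigma2_hat = train.var(ddof=1)

# Inference step: use N(mu_hat, sigma2_hat) as the prior for the mean of the
# remaining 5 observations, assuming a normal likelihood with known variance
# sigma2_hat so the posterior has a closed form (conjugate normal-normal update).
n = len(holdout)
post_var = 1.0 / (1.0 / sigma2_hat + n / sigma2_hat)
post_mean = post_var * (mu_hat / sigma2_hat + holdout.sum() / sigma2_hat)
print(post_mean, post_var)
```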
No one would be happy with this situation, because we are no longer using all of our data for inference. So:
Question: In this situation, would it make sense to:
1. Calculate four priors $\mathscr{N}(\hat{\mu}_1, \hat{\sigma}_1^2)$, $\mathscr{N}(\hat{\mu}_2, \hat{\sigma}_2^2)$, $\mathscr{N}(\hat{\mu}_3, \hat{\sigma}_3^2)$, $\mathscr{N}(\hat{\mu}_4, \hat{\sigma}_4^2)$: the first using exactly the procedure above (with $D_{16}, \dots, D_{20}$ as the "holdout set"), the second using the analogous procedure with $D_{11}, \dots, D_{15}$ as the holdout set, the third with $D_{6}, \dots, D_{10}$, and the fourth with $D_1, \dots, D_5$; and then
2. Take as our prior either (a) an equal-weight convex combination of the four priors above, which I suppose would be a Gaussian mixture, or (b) the single normal $\mathscr{N}(\tilde{\mu}, \tilde{\sigma}^2)$, where $$\tilde{\mu} := \frac{1}{4}(\hat{\mu}_1 + \hat{\mu}_2 + \hat{\mu}_3 + \hat{\mu}_4 ) \,, \quad \tilde{\sigma}^2 := \frac{1}{4}(\hat{\sigma}_1^2 + \hat{\sigma}_2^2 + \hat{\sigma}_3^2 + \hat{\sigma}_4^2) \,?$$ (A sketch of both options is given below.)
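Here is a rough sketch of both options, again assuming the sample mean and sample variance as the estimators (my choice, not something fixed above):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations

# The four "holdout" blocks, in the order listed above.
folds = [np.arange(15, 20), np.arange(10, 15), np.arange(5, 10), np.arange(0, 5)]

# For each block, estimate (mu_k, sigma2_k) from the other 15 observations.
params = []
for holdout_idx in folds:
    train = np.delete(D, holdout_idx)
    params.append((train.mean(), train.var(ddof=1)))

# Option (a): equal-weight mixture of the four normal priors.
def mixture_prior_pdf(theta):
    return np.mean([stats.norm.pdf(theta, m, np.sqrt(v)) for m, v in params], axis=0)

# Option (b): a single normal prior with averaged hyperparameters.
mu_tilde = np.mean([m for m, _ in params])
sigma2_tilde = np.mean([v for _, v in params])
averaged_prior = stats.norm(mu_tilde, np.sqrt(sigma2_tilde))
```

Note that (a) and (b) are not the same distribution: whenever the $\hat{\mu}_k$ differ, the mixture in (a) is non-normal and more spread out than the averaged-parameter normal in (b).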
The above example generalizes readily, of course (for example, I didn't even specify a particular method for getting the estimates $\hat{\mu}$ and $\hat{\sigma}^2$), but I thought a concrete example would be clearer than trying to explain myself in full generality.
(Actually I even doubt that the above concrete example is explained clearly.)
Additional questions: Does something similar to the above procedure already have an established name? And is there any literature either showing that it lacks optimality properties or otherwise analyzing it theoretically?
I think this question is different from this related question, because that question has both an internal and an external source of data. In my example, the estimation of the prior's hyperparameters and the inference itself are "competing" for the same data, so we use "cross-validation" or repeated sub-sampling ("bootstrapping") to accomplish the required "hyperparameter tuning" for the prior.
It is also different from the method suggested here, which Andrew Gelman argued (probably convincingly; to be honest, I don't understand the argument) does not work well. But that method suggests an "M-estimation" approach with cross-validation to get the prior, i.e. selecting the "best-performing" prior from $\mathscr{N}(\hat{\mu}_1, \hat{\sigma}_1^2)$, $\mathscr{N}(\hat{\mu}_2, \hat{\sigma}_2^2)$, $\mathscr{N}(\hat{\mu}_3, \hat{\sigma}_3^2)$, $\mathscr{N}(\hat{\mu}_4, \hat{\sigma}_4^2)$, whereas I am suggesting some combination of them. Combining makes more sense to me than selecting, because it should both (a) avoid "overfitting" and (b) use more of the data to inform the choice of prior.
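To make the contrast concrete, here is a rough sketch of the difference as I understand it. The cross-validation score used below (log density of each holdout block under its fold's prior) is just a placeholder of my own; neither linked post pins down a particular score.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
D = rng.normal(loc=2.0, scale=1.5, size=20)   # placeholder data, 20 observations
folds = [np.arange(15, 20), np.arange(10, 15), np.arange(5, 10), np.arange(0, 5)]
params = [(np.delete(D, f).mean(), np.delete(D, f).var(ddof=1)) for f in folds]

# Placeholder score: log density each fold's prior assigns to its own holdout block.
scores = [stats.norm.logpdf(D[f], m, np.sqrt(v)).sum()
          for f, (m, v) in zip(folds, params)]

# "Select the best-performing prior" (how I read the other suggestion):
best_mu, best_sigma2 = params[int(np.argmax(scores))]

# "Combine all four" (what I am suggesting, here option (b) from above):
mu_tilde = np.mean([m for m, _ in params])
sigma2_tilde = np.mean([v for _, v in params])
```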