
Train a graphical model by fitting it to some data generated by process A.

You get some new data, perhaps one record, perhaps more. You want to know if these data items were also generated by process A. Ideally you'd like a distribution: the probability that the data were generated by A.

I imagine you can use the posterior predictive. If the probability of the new data given the old data is low (compared to what?), then you have a strong hint that the new data come from a different process.

Alternatively, you could fit the original model to the new data (if you have enough data) and then compare posteriors (how? KL divergence?).
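To make that second idea concrete, here is a minimal sketch under a strong simplifying assumption: a one-dimensional Gaussian model with known variance, so the posterior over the mean is itself Gaussian and the KL divergence has a closed form. The prior parameters, the simulated data, and the decision of what counts as a "large" divergence are all made up for illustration.

```python
import numpy as np

def posterior_over_mean(y, sigma=1.0, mu0=0.0, tau0=10.0):
    """Conjugate posterior over the mean of a N(mu, sigma^2) model with
    known sigma and a N(mu0, tau0^2) prior; returns (mean, sd)."""
    precision = 1.0 / tau0**2 + len(y) / sigma**2
    mean = (mu0 / tau0**2 + y.sum() / sigma**2) / precision
    return mean, np.sqrt(1.0 / precision)

def kl_gaussian(m1, s1, m2, s2):
    """Closed-form KL( N(m1, s1^2) || N(m2, s2^2) )."""
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

rng = np.random.default_rng(0)
y_old = rng.normal(0.0, 1.0, size=200)  # data originally generated by process A
y_new = rng.normal(0.5, 1.0, size=50)   # new data, possibly from another process

kl = kl_gaussian(*posterior_over_mean(y_new), *posterior_over_mean(y_old))
print(kl)  # a large value hints that the new data did not come from process A
```

The same "compared to what?" problem applies here: the KL number has no built-in scale, so you still need some reference (e.g., KL values computed between held-out chunks of the original data) to decide what counts as large.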

yalis
  • Do you still have the original dataset? Or can you re-run process A to generate new data? Or can you use your graphical model to generate simulated data from your estimate of process A? If so, then you just want a test of equality of distributions, e.g., [Kolmogorov-Smirnov](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test) (see the sketch after these comments). – Dan Hicks Sep 13 '17 at 00:52
  • Yes to all three questions. Does KS only work for 1-D distributions? (I know there's an extension, but ...) Also, what if you wanted a Bayesian approach? Intuition says the posterior predictive is the way to go; it measures the probability of new data. I am just not sure exactly how to make it work in practice. Also, as an aside, there are some interesting thoughts out there on KL divergence and the KS test, since they both measure the difference between distributions. – yalis Sep 13 '17 at 02:36
  • Based on the Wikipedia page, the multivariate extension of KS is messy. I don't know any more than that. – Dan Hicks Sep 13 '17 at 22:12
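To make the equality-of-distributions suggestion from the comments concrete: assuming the original data (or data simulated from the fitted model) are available, one simple workaround for the messy multivariate case is a univariate two-sample KS test per dimension with a multiple-comparisons correction. The arrays below are invented purely for illustration.

```python
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
y_ref = rng.normal(size=(1000, 3))          # original (or simulated) data from process A
y_new = rng.normal(0.3, 1.0, size=(50, 3))  # new data to be checked

# One univariate two-sample KS test per dimension, combined with a
# conservative Bonferroni correction instead of a multivariate extension.
pvals = [ks_2samp(y_ref[:, j], y_new[:, j]).pvalue for j in range(y_ref.shape[1])]
print(min(min(pvals) * len(pvals), 1.0))  # a small value suggests a different process
```

Note that this only looks at the marginal distributions, so it will miss processes that differ only in their dependence structure.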

1 Answer


For a Bayesian approach, you're asking for something like $$ p(A | y') $$ where $y'$ is the new data and $A$ is the hypothesis that the data are generated by process A. Bayes' theorem gives us $$ p(A | y') \propto p(y' | A) \cdot p(A), $$ the likelihood of $A$ times the prior on $A$. The likelihood can also be understood as the probability that process A would produce data (that look like) $y'$. So that suggests either (a) running process A repeatedly, generating 500-1000 data sets that can be compared to $y'$, or (b) assuming your fitted model is correct and using it to computationally generate 500-1000 simulated datasets that can then be compared to $y'$. Option (b) involves a substantial assumption, but is somewhat in the spirit of a posterior predictive check.
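As a minimal sketch of option (b), suppose (purely for illustration) that the fitted model reduces to $y \sim N(\mu, 1)$ with a Gaussian posterior over $\mu$: draw parameters from the posterior, simulate datasets the size of $y'$, and see where a summary statistic of $y'$ falls among the simulated summaries (a posterior predictive $p$-value). The posterior numbers and the choice of summary statistic are placeholders, not part of the original question.

```python
import numpy as np

rng = np.random.default_rng(2)
y_new = rng.normal(0.5, 1.0, size=50)  # stands in for the new data y'

# Assumed fit to the old data: y ~ N(mu, 1) with posterior mu ~ N(post_mean, post_sd^2).
post_mean, post_sd = 0.0, 0.05

def summary(y):
    return y.mean()  # pick a statistic sensitive to the kind of mismatch you care about

sims = np.empty(1000)
for i in range(len(sims)):
    mu = rng.normal(post_mean, post_sd)           # draw parameters from the posterior
    y_sim = rng.normal(mu, 1.0, size=len(y_new))  # simulate a dataset the size of y'
    sims[i] = summary(y_sim)

# Two-sided tail probability of the observed summary under the posterior predictive.
p = np.mean(sims >= summary(y_new))
print(min(p, 1 - p))  # a very small value hints that y' was not generated by process A
```

With a real graphical model, the simulation step would instead draw a full parameter vector from the fitted posterior and run the model forward, but the comparison logic is the same.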

Dan Hicks