I have a data set of 200 observations and around 10 continuous variables. I would like to build a graphical model to study dependencies between variables.
Unfortunately, my estimated graph is not very stable. For instance, if I build my graph on a subsample of 180 of my 200 observations, I obtain quite a different graph from the one I get when using the whole data set.
To solve this problem, I would like to use the bootstrap. My algorithm would be as follows (a minimal code sketch is given after the list):
- consider $B$ subsamples of my data and build $B$ graphical models;
- compute a probability for each edge $e_{ij}$: $p_{ij} = \#\{\text{models in which } e_{ij} \text{ is selected}\} / B$;
- define a threshold $t$ and keep the edge $e_{ij}$ in the final graph if and only if $p_{ij} > t$.
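Here is a minimal sketch of what I mean, assuming the graph is learned with scikit-learn's GraphicalLasso and that resampling is done with replacement; the estimator and the values of `B`, `alpha`, and `t` are only placeholders, and any graph learner that returns an edge set would do:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso  # placeholder estimator

def bootstrap_edge_frequencies(X, B=100, alpha=0.1, rng=None):
    """Fit a graphical model on B bootstrap samples of X and return the
    matrix p with edge selection frequencies p_ij = (# selections) / B."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    counts = np.zeros((d, d))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        model = GraphicalLasso(alpha=alpha).fit(X[idx])
        counts += np.abs(model.precision_) > 1e-8  # edge = nonzero precision entry
    p = counts / B
    np.fill_diagonal(p, 0.0)                       # no self-loops
    return p

def final_graph(p, t=0.5):
    """Keep edge e_ij iff p_ij > t; returns a boolean adjacency matrix."""
    return p > t
```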
If I run this algorithm several times, each run draws $B$ different samples and thus fits $B$ different models, so I get a different final graph each time (especially if $t$ is low). I would like to find optimal values of $B$ and $t$ so that I obtain the same final graph on every run (which is the reason for using bootstrapping in the first place).
1) Does that make sense? How can I find these optimal $B$ and $t$? For given $B$ and $t$, I could run the algorithm $K$ times, obtain $K$ final graphs, and compare them to see whether they are similar. Do you know of any measure of similarity between graphs?
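One candidate I could think of is the Jaccard index of the two edge sets (a normalized Hamming distance on the edge indicators would be similar), but I do not know whether it is standard. A sketch, reusing the boolean adjacency matrices returned by `final_graph` above:

```python
def jaccard_edge_similarity(A1, A2):
    """Jaccard index of two undirected graphs given as boolean adjacency
    matrices: |E1 & E2| / |E1 | E2|, computed on the upper triangle."""
    iu = np.triu_indices_from(A1, k=1)
    e1, e2 = A1[iu].astype(bool), A2[iu].astype(bool)
    union = np.logical_or(e1, e2).sum()
    return 1.0 if union == 0 else np.logical_and(e1, e2).sum() / union
```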
Also, I know that bootstrapping reduces variance but increases bias (in general, my final bootstrap graph will have fewer edges than the standard graph built without bootstrapping, but it should be more stable).
2) How could I show that the variance has indeed been reduced, i.e., that the graph is more stable (less influenced by outliers, for instance, and not overfitting)? Is there a general method to do so in an unsupervised context (no cross-validation)? Basically, I would like to be able to show that using the bootstrap is worth it.
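The only experiment I could come up with (my own ad hoc protocol, not something I have seen in the literature) is to refit both the plain graph and the bootstrap-aggregated graph on many random subsamples of the data and compare their average pairwise similarity; a sketch reusing the helpers above, where `K` and `frac` are arbitrary placeholders:

```python
def stability(fit_graph, X, K=20, frac=0.9, rng=None):
    """Average pairwise Jaccard similarity of the graphs fitted on K random
    subsamples of the rows of X; values closer to 1 mean more stable."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    graphs = [fit_graph(X[rng.choice(n, size=int(frac * n), replace=False)])
              for _ in range(K)]
    sims = [jaccard_edge_similarity(graphs[i], graphs[j])
            for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(sims))

# Plain estimator vs. the bootstrap-aggregated one (expensive: K * B fits):
plain  = lambda X: np.abs(GraphicalLasso(alpha=0.1).fit(X).precision_) > 1e-8
bagged = lambda X: final_graph(bootstrap_edge_frequencies(X, B=100), t=0.5)
# If stability(bagged, X) > stability(plain, X), the bootstrap paid off.
```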