I have a data set of 200 observations and around 10 continuous variables. I would like to build a graphical model to study dependencies between variables.
Unfortunately, my estimated graph is not very stable. For instance, if I build my graph on a subsample of 180 of my 200 observations, I obtain quite a different graph from the one I get when using the whole data set.
To solve this problem, I would like to use the bootstrap. My algorithm would be as follows (a minimal code sketch is given after the list):
- consider $B$ subsamples of my data and build $B$ graphical models;
- compute a probability for each edge $e_{ij}$: $p_{ij} = \#\{\text{models in which } e_{ij} \text{ is selected}\} / B$;
- define a threshold $t$ and keep the edge $e_{ij}$ in the final graph if and only if $p_{ij} > t$.
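Here is a minimal sketch of what I mean, assuming the graph is learned with scikit-learn's GraphicalLasso and that resampling is done with replacement; the estimator and the values of `B`, `alpha`, and `t` are only placeholders, and any graph learner that returns an edge set would do:

```python
import numpy as np
from sklearn.covariance import GraphicalLasso  # placeholder estimator

def bootstrap_edge_frequencies(X, B=100, alpha=0.1, rng=None):
    """Fit a graphical model on B bootstrap samples of X and return the
    matrix p with edge selection frequencies p_ij = (# selections) / B."""
    rng = np.random.default_rng(rng)
    n, d = X.shape
    counts = np.zeros((d, d))
    for _ in range(B):
        idx = rng.integers(0, n, size=n)           # resample rows with replacement
        model = GraphicalLasso(alpha=alpha).fit(X[idx])
        counts += np.abs(model.precision_) > 1e-8  # edge = nonzero precision entry
    p = counts / B
    np.fill_diagonal(p, 0.0)                       # no self-loops
    return p

def final_graph(p, t=0.5):
    """Keep edge e_ij iff p_ij > t; returns a boolean adjacency matrix."""
    return p > t
```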
If I run this algorithm several times, each run draws $B$ different samples and thus fits $B$ different models, so I get a different final graph each time (especially if $t$ is low). I would like to find optimal values of $B$ and $t$ so that I obtain the same final graph on every run (which is the reason for using bootstrapping in the first place).
1) Does that make sense? How can I find these optimal $B$ and $t$? For given $B$ and $t$, I could run the algorithm $K$ times, obtain $K$ final graphs, and compare them to see whether they are similar. Do you know of any measure of similarity between graphs?
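One candidate I could think of is the Jaccard index of the two edge sets (a normalized Hamming distance on the edge indicators would be similar), but I do not know whether it is standard. A sketch, reusing the boolean adjacency matrices returned by `final_graph` above:

```python
def jaccard_edge_similarity(A1, A2):
    """Jaccard index of two undirected graphs given as boolean adjacency
    matrices: |E1 & E2| / |E1 | E2|, computed on the upper triangle."""
    iu = np.triu_indices_from(A1, k=1)
    e1, e2 = A1[iu].astype(bool), A2[iu].astype(bool)
    union = np.logical_or(e1, e2).sum()
    return 1.0 if union == 0 else np.logical_and(e1, e2).sum() / union
```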
Also, I know that bootstrapping reduces variance but increases bias (in general, my final bootstrap graph will have fewer edges than the standard graph built without bootstrapping, but it should be more stable).
2) How could I show that the variance has indeed been reduced, i.e., that the graph is more stable (less influenced by outliers, for instance, and not overfitting)? Is there a general method to do so in an unsupervised context (no cross-validation)? Basically, I would like to be able to show that using the bootstrap is worth it.
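The only experiment I could come up with (my own ad hoc protocol, not something I have seen in the literature) is to refit both the plain graph and the bootstrap-aggregated graph on many random subsamples of the data and compare their average pairwise similarity; a sketch reusing the helpers above, where `K` and `frac` are arbitrary placeholders:

```python
def stability(fit_graph, X, K=20, frac=0.9, rng=None):
    """Average pairwise Jaccard similarity of the graphs fitted on K random
    subsamples of the rows of X; values closer to 1 mean more stable."""
    rng = np.random.default_rng(rng)
    n = X.shape[0]
    graphs = [fit_graph(X[rng.choice(n, size=int(frac * n), replace=False)])
              for _ in range(K)]
    sims = [jaccard_edge_similarity(graphs[i], graphs[j])
            for i in range(K) for j in range(i + 1, K)]
    return float(np.mean(sims))

# Plain estimator vs. the bootstrap-aggregated one (expensive: K * B fits):
plain  = lambda X: np.abs(GraphicalLasso(alpha=0.1).fit(X).precision_) > 1e-8
bagged = lambda X: final_graph(bootstrap_edge_frequencies(X, B=100), t=0.5)
# If stability(bagged, X) > stability(plain, X), the bootstrap paid off.
```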