Confidence intervals around a centroid with modified Gower similarity

Question

I would like to obtain 95% confidence intervals for centroids based on Gower similarity between some multivariate samples (community data from sediment cores). I have so far used the vegan{} package in R to obtain modified Gower similarity between cores (based on Anderson et al. 2006; now included in R as part of vegdist()). Does anyone know how I can calculate 95% confidence intervals for the centroids of, for example, sampling sites, based on modified Gower similarity?

Additionally, if possible, I would like to plot these 95% CIs on a PCO that shows the centroids, so it's evident if they're overlapping.

To get modified Gower similarity, I used:

dat.mgower <- vegdist(decostand(dat, "log"), "altGower")

But as far as I know, you don't get centroids from vegdist(). I need to get centroids, then 95% CIs, then plot them... in R. Help!

Anderson, M. J., K. E. Ellingsen, and B. H. McArdle. 2006. Multivariate dispersion as a measure of beta diversity. Ecology Letters 9:683–693.

If you are looking at clusters in k dimensions, aren't the centroids k-dimensional? In which case you should be looking for confidence regions and not intervals. Any confidence region for a variable like a cluster center would depend on all the components that make up the uncertainty in the estimate. I would think that could be pretty complex and that generating confidence regions would not be a simple matter. Couldn't you do simulation to approximate them? — Michael R. Chernick, Aug 09 '12 at 20:45
Thanks, Michael. Yes, I meant confidence regions, which would be in k-dimensional space where k is the number of taxa found in the community. I would do simulations instead of calculating them, but I don't know how to go about that. Approximate CIs would do. — Margaret, Aug 09 '12 at 20:52
I would not have a good idea how to approximate the regions but I do think simulation could be fairly straightforward. You have a data set of k-dimensional points that you feed into some algorithm that generates the clusters and their centroids. Randomly perturb the data to take account of sampling error and run the perturbed data through the algorithm. Repeat this many times. — Michael R. Chernick, Aug 09 '12 at 21:05
I see there has been some discussion whilst I was writing my Answer. I'm not sure that what you describe and what I illustrate can be expressed in terms of the $k$ species, as we've thrown that information away in computing the dissimilarities. We can, however, compute the centroids in some ordination space, in this case a PCO of the modified Gower dissimilarities. Do let me know if this is not what you wanted and I can try to help some more. — Gavin Simpson, Aug 09 '12 at 21:12
Another approach would be to bootstrap. For the n k-dimensional points, generate bootstrap samples by sampling n times with replacement from the data set. Run the bootstrap data set through the clustering algorithm. Repeat this many times. This will give you a distribution of chosen clusters and centroids. Then for each centroid (if you can match from one bootstrap sample to another) you would get a distribution of centroids for each cluster and from that construct bootstrap confidence regions for them. — Michael R. Chernick, Aug 09 '12 at 21:16
I think a problem with this whole idea is that the number of clusters can vary from one case to the next and hence there may not be a reasonable way to match clusters from one bootstrap sample (or simulation) with the clusters in the next. — Michael R. Chernick, Aug 09 '12 at 21:17
@MichaelChernick That may not be an issue too much if the groupings are defined a priori as per my example. That would be typical of the sort of data described in the paper Margaret cites. — Gavin Simpson, Aug 09 '12 at 21:23
@GavinSimpson Then if the groupings are fixed but the cluster centers can vary based on the data, wouldn't my suggestions work as ways to get confidence ellipsoids or regions for the centroids? — Michael R. Chernick, Aug 09 '12 at 21:40
@MichaelChernick Yes, if I follow your argument properly, and that was the point I was trying to make. I wouldn't bet my house on the theory used to form the intervals computed using `ordiellipse()`, so I would certainly want to base any inference on some other analyses; the multivariate ANOVA-like thing in `adonis()` is one option and your bootstrap suggestion is another. A complication with the bootstrap is the use of the dissimilarity instead of the original data. Would you need to redo the PCO for each bootstrap or apply the bootstrap *after* doing that PCO step? — Gavin Simpson, Aug 09 '12 at 22:09
@GavinSimpson It seems to me that the bootstrap should replicate the entire process on the bootstrap samples that is applied to the original data. So it sounds like that means repeating the PCO for each bootstrap sample. — Michael R. Chernick, Aug 09 '12 at 22:40
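
A minimal sketch of the bootstrap idea from the comments above, under some loudly stated assumptions: the groups are fixed a priori, vegan's `dune` data stand in for the real cores, and the resampling is done on the PCO scores *after* a single ordination. That is the simpler of the two variants Gavin asks about; repeating the PCO for each bootstrap sample, as Michael suggests, would also need the axes of each re-run aligned (for example by Procrustes rotation) before centroids can be compared. `boot_centroids()` is a hypothetical helper written for this illustration, not a vegan function.

```r
library(vegan)

data(dune)
data(dune.env)

# PCO of the modified Gower dissimilarities (base R classical scaling)
dij <- vegdist(decostand(dune, "log"), method = "altGower")
pco <- cmdscale(dij, k = 2)
grp <- dune.env$Management

# Hypothetical helper: resample sites with replacement *within* each group
# and return the group centroids (one row per group) on the fixed PCO axes
boot_centroids <- function(scores, groups) {
  t(sapply(split(seq_along(groups), groups), function(i) {
    b <- i[sample.int(length(i), replace = TRUE)]
    colMeans(scores[b, , drop = FALSE])
  }))
}

set.seed(42)
B <- 999
boots <- replicate(B, boot_centroids(pco, grp))  # array: group x axis x B

# Percentile 95% limits per group and axis (intervals, not a joint region)
ci <- apply(boots, c(1, 2), quantile, probs = c(0.025, 0.975))
ci

# The cloud of bootstrap centroids gives a rough picture of each region
plot(pco, type = "n", asp = 1, xlab = "PCO 1", ylab = "PCO 2")
for (g in seq_len(nlevels(grp))) {
  points(t(boots[g, , ]), pch = ".", col = g + 1)
}
```

With groups as small as some of those in `dune`, the bootstrap distribution is coarse, so I would treat the result as a sanity check on other intervals rather than as a definitive region.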

Gavin Simpson · Accepted Answer · 2012-08-09T21:33:43.200

I'm not immediately clear what centroid you want, but the centroid that comes to mind is the point in multivariate space at the centre of the mass of the points per group. About this you want a 95% confidence ellipse. Both aspects can be computed using the ordiellipse() function in vegan. Here is a modified example from ?ordiellipse but using a PCO as a means to embed the dissimilarities in a Euclidean space from which we can derive centroids and confidence ellipses for groups based on the Nature Management variable Management.

require(vegan)
data(dune)
dij <- vegdist(decostand(dune, "log"), method = "altGower")
ord <- capscale(dij ~ 1) ## This does PCO

data(dune.env) ## load the environmental data

Now we display the first 2 PCO axes and add a 95% confidence ellipse based on the standard errors of the average of the axis scores. We want standard errors so set kind="se" and use the conf argument to give the confidence interval required.

plot(ord, display = "sites", type = "n")
stats <- with(dune.env,
              ordiellipse(ord, Management, kind="se", conf=0.95, 
                          lwd=2, draw = "polygon", col="skyblue",
                          border = "blue"))
points(ord)
ordipointlabel(ord, add = TRUE)

Notice that I capture the output from ordiellipse(). This returns a list, one component per group, with details of the centroid and ellipse. You can extract the center component from each of these to get at the centroids

> t(sapply(stats, `[[`, "center"))
         MDS1       MDS2
BF -1.2222687  0.1569338
HF -0.6222935 -0.1839497
NM  0.8848758  1.2061265
SF  0.2448365 -1.1313020

Notice that the centroid is only for the 2d solution. A more general option is to compute the centroids yourself. The centroid is just the individual averages of the variables or in this case the PCO axes. As you are working with the dissimilarities, they need to be embedded in an ordination space so you have axes (variables) that you can compute averages of. Here the axis scores are in columns and the sites in rows. The centroid of a group is the vector of column averages for the group. There are several ways of splitting the data but here I use aggregate() to split the scores on the first 2 PCO axes into groups based on Management and compute their averages

scrs <- scores(ord, display = "sites")
cent <- aggregate(scrs ~ Management, data = dune.env, FUN = mean)
names(cent)[-1] <- colnames(scrs)

This gives:

> cent
  Management       MDS1       MDS2
1         BF -1.2222687  0.1569338
2         HF -0.6222935 -0.1839497
3         NM  0.8848758  1.2061265
4         SF  0.2448365 -1.1313020

which is the same as the values stored in stats as extracted above. The aggregate() approach generalises to any number of axes, e.g.:

> scrs2 <- scores(ord, choices = 1:4, display = "sites")
> cent2 <- aggregate(scrs2 ~ Management, data = dune.env, FUN = mean)
> names(cent2)[-1] <- colnames(scrs2)
> cent2
  Management       MDS1       MDS2       MDS3       MDS4
1         BF -1.2222687  0.1569338 -0.5300011 -0.1063031
2         HF -0.6222935 -0.1839497  0.3252891  1.1354676
3         NM  0.8848758  1.2061265 -0.1986570 -0.4012043
4         SF  0.2448365 -1.1313020  0.1925833 -0.4918671

Obviously, the centroids on the first two PCO axes don't change when we ask for more axes, so you could compute the centroids over all axes once, then use whatever dimension you want.

You can add the centroids to the above plot with

points(cent[, -1], pch = 22, col = "darkred", bg = "darkred", cex = 1.1)

The resulting plot will now look like this

[Figure: use of ordiellipse]

Finally, vegan contains the adonis() and betadisper() functions that are designed to look at differences in means and variances of multivariate data in ways very similar to Marti's papers/software. betadisper() is closely linked to the content of the paper you cite and can also return the centroids for you.
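
As a hedged illustration of that last point, here is a short sketch using the same `dune` example. Two things in it are my assumptions rather than part of the original example: `type = "centroid"` is set explicitly because `betadisper()` otherwise defaults to the spatial median, and `adonis2()` is used because it is the version of `adonis()` provided in recent vegan releases. The principal coordinate scores from `betadisper()` may be scaled differently from the `capscale()` scores, so I would not expect the numbers to match the centroid table above exactly.

```r
library(vegan)

data(dune)
data(dune.env)

dij <- vegdist(decostand(dune, "log"), method = "altGower")

# Multivariate dispersion about the group centroids (PERMDISP-like)
mod <- betadisper(dij, dune.env$Management, type = "centroid")

# Group centroids on the first two principal coordinate axes
mod$centroids[, 1:2]

# Permutation test for differences in dispersion among groups
permutest(mod, permutations = 999)

# Permutation test for differences in location (PERMANOVA-like)
adonis2(dij ~ Management, data = dune.env, permutations = 999)
```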

Do read the help `?ordiellipse` for details of what is being done here, esp in computing the confidence interval. Whether the theory matches the data is something you might want to look into with simulation or resampling or something rather than rely on "theory". — Gavin Simpson, Aug 09 '12 at 21:13
Further to the comment and the last paragraph of the Answer; `adonis()` can be used to test for similar means (centroids) of groups as one might use ANOVA in the univariate case. A permutation test is used to determine if the data are consistent with the null hypothesis of no difference of centroids. Note also that differences of centroids can be caused by different group dispersions (variances). `betadisper()` can help you test if that is the case, again using a permutation-based test of the average distances of the sample points to their centroid. — Gavin Simpson, Aug 09 '12 at 21:26
@Gavin-- thank you. I have done the test to measure differences between the centroids using PERMANOVA and PERMDISP in PRIMER (which perform the same task as `adonis()` and `betadisper()`, I believe); I was just looking for a good way to display the data. I have some site x season interaction for a repeated measures design so I wanted to be able to easily show which sites showed a seasonal effect. I think these ellipses are what I'm looking for; this example was very helpful. — Margaret, Aug 10 '12 at 00:52
also, yes-- the multivariate center of mass for each group is the type of centroid for which I was trying to calculate CIs. — Margaret, Aug 10 '12 at 00:55
One more thing-- If I wanted to fill the ellipses with different colors depending on my factors, is there a way I could do that in `ordiellipse()` without embedding a for loop? I have both seasons and sites in my data, and I wanted to show differences between sites in one plot and between seasons in another by color coding them. For whatever reason, using col=c(1,2,1,2) etc doesn't work, nor does col=as.numeric(cent["Site_TP"]). Is there an elegant way to do this? — Margaret, Aug 16 '12 at 21:53
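
One possible way to get per-group fills, sketched on the `dune` example rather than on Margaret's data: draw one ellipse per group with the `show.groups` argument of `ordiellipse()`. This does use a short loop, and the colour names in `cols` are arbitrary choices made for illustration.

```r
library(vegan)

data(dune)
data(dune.env)

dij <- vegdist(decostand(dune, "log"), method = "altGower")
ord <- capscale(dij ~ 1)  # PCO, as in the answer

grp  <- dune.env$Management
# One fill colour per group level (arbitrary choices)
cols <- c(BF = "skyblue", HF = "orange", NM = "forestgreen", SF = "orchid")

plot(ord, display = "sites", type = "n")
for (g in levels(grp)) {
  # Draw only group g on this pass, filled with its own colour
  ordiellipse(ord, grp, kind = "se", conf = 0.95, draw = "polygon",
              col = cols[[g]], border = cols[[g]], show.groups = g)
}
# Colour the site points to match their group's ellipse
points(ord, display = "sites", pch = 21, bg = cols[as.character(grp)])
```

As far as I can tell, the `?ordiellipse` help in more recent vegan releases says `col` may be a vector with one value per group, in which case a single call with `col = cols` should work without the loop; check the help page for the installed version before relying on that.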
