What criteria to use for separating variables into explanatory variables and responses for ordination methods in ecology?

Question

I have different variables that interact within a population. Basically I have been doing an inventory of millipedes and measuring some other values of the terrain, like:

The species and the amount of specimens collected
The different environments where the animals are
the pH
The percentage of organic material
the amount of P, K, Mg, Ca, Mn, Fe, Zn, Cu
Ca + Mg/K relation

Basically I would like to use PCA to determine which variables drive the variability of the samples and make the forest (environments) different; which variables should I use for "variables" and which ones for "individuals"?

I think you might be confused about PCA. All variables can (of course) be only "variables". You probably have made a number of measurements in different locations (or at different times); then these locations (or times) are your "individuals", or rather "samples". — amoeba, Dec 06 '14 at 13:54
Also, I cannot help asking: your profile says you are a startup founder; is it a startup working with millipedes? Wow! — amoeba, Dec 06 '14 at 13:55
in fact @amoeba is my wife the one working on that, I'm good at calculus but not that well develop in statistics. And she wanted me to ask. — Leonardo, Dec 06 '14 at 14:56
I see. But what about my first comment, does it make sense to you? Because I am not sure I correctly understood what you mean by separating variables into "variables" and "individuals". — amoeba, Dec 06 '14 at 15:11
Hpw far is this really a statistical question? Although it seems confused on statistical terminology, to the extent that it's difficult to decode, if that is sorted out the advice is to use scientific judgement. — Nick Cox, Dec 06 '14 at 15:42
This may well be a statistical question, just in a different context & with different terms than are more typical in statistics. I believe you are asking about ordination methods from ecology. [This website](http://ordination.okstate.edu/) may be helpful to you. Relatively few of our active members here are expert in this, but @GavinSimpson may be able to help you if we can get his attention. — gung - Reinstate Monica, Dec 06 '14 at 16:17

Gavin Simpson · Accepted Answer · 2014-12-10T02:39:11.253

As @amoeba mentioned in the comments, PCA will only look at one set of data and it will show you the major (linear) patterns of variation in those variables, the correlations or covariances between those variables, and the relationships between samples (the rows) in your data set.

What one normally does with a species data set and a suite of potential explanatory variables is to fit a constrained ordination. In PCA, the principal components, the axes on the PCA biplot, are derived as optimal linear combinations of all variables. If you ran this on a data set of soil chemistry with variables pH, $\mathrm{Ca^{2+}}$, TotalCarbon, you might find that the first component was

$$0.5 \times \mathrm{pH} + 1.4 \times \mathrm{Ca^{2+}} + 0.1 \times \mathrm{TotalCarbon} $$

and the second component

$$2.7 \times \mathrm{pH} + 0.3 \times \mathrm{Ca^{2+}} - 5.6 \times \mathrm{TotalCarbon} $$

These components are freely selectable from the variables measured, and which get chosen are those that explain sequentially the largest amount of variation in the dataset, and that each linear combination is orthogonal (uncorrelated with) the others.

In a constrained ordination, we have two datasets, but we are not free to select whatever linear combinations of the first data set (the soil chem data above) we want. Instead we have to select linear combinations of the variables in the second data set that best explain variation in the first. Also, in the case of PCA, the one data set is the response matrix and there are no predictors (you could think of the response as predicting itself). In the constrained case, we have a response data set which we wish to explain with a set of explanatory variables.

Although you haven't explained which variables are the response, normally one wishes to explain variation in the abundances or composition of those species (i.e. the responses) using the environmental explanatory variables.

The constrained version of PCA is a thing called Redundancy Analysis (RDA) in ecological circles. This assumes an underlying linear response model for the species, which is either not appropriate or only appropriate if you have short gradients along which the species respond.

An alternative to PCA is a thing called correspondence analysis (CA). This is unconstrained but it does have an underlying unimodal response model, which is somewhat more realistic in terms of how species respond along longer gradients. Note also that CA models relative abundances or composition, PCA models the raw abundances.

There is a constrained version of CA, known as constrained or canonical correspondence analysis (CCA) - not to be confused with a more formal statistical model known as canonical correlation analysis.

In both RDA and CCA the aim is to model the variation in species abundances or composition as a series of linear combinations of the explanatory variables.

From the description it sounds like your wife wants to explain variation in the millipede species composition (or abundance) in terms of the other variables measured.

Some words of warning; RDA and CCA are just multivariate regressions; CCA is just a weighted multivariate regression. Anything you've learned about regression applies, and there are a couple of other gotchas too:

as you increase the number of explanatory variable, the constraints actually become less and less and you aren't really extracting components/axes that explain the species composition optimally, and
with CCA, as you increase the number of explanatory factors, you risk inducing an artefact of a curve into the configuration of points in the CCA plot.
the theory underlying RDA and CCA are less well-developed than more formal statistical methods. We can only reasonably choose which explanatory variables to keep using step-wise selection (which is not ideal for all the reasons we don't like it as a selection method in regression) and we have to use permutation tests to do so.

so my advice is the same as with regression; think ahead of time what your hypotheses are and include variables that reflect those hypotheses. Don't just throw all explanatory variables into the mix.

Example

Unconstrained ordination

PCA

I'll show an example comparing PCA, CA and CCA using the vegan package for R which I help maintain and which is designed to fit these kinds of ordination methods:

library("vegan")                        # load the package
data(varespec)                          # load example data

## PCA
pcfit <- rda(varespec)
## could add `scale = TRUE` if variables in different units
pcfit

> pcfit
Call: rda(X = varespec)

              Inertia Rank
Total            1826     
Unconstrained    1826   23
Inertia is variance 

Eigenvalues for unconstrained axes:
  PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
983.0 464.3 132.3  73.9  48.4  37.0  25.7  19.7 
(Showed only 8 of all 23 unconstrained eigenvalues)

vegan doesn't standardise the Inertia, unlike Canoco, so the total variance is 1826 and the Eigenvalues are in those same units and sum to 1826

> cumsum(eigenvals(pcfit))
      PC1       PC2       PC3       PC4       PC5       PC6       PC7       PC8 
 982.9788 1447.2829 1579.5334 1653.4670 1701.8853 1738.8947 1764.6209 1784.3265 
      PC9      PC10      PC11      PC12      PC13      PC14      PC15      PC16 
1796.6007 1807.0361 1816.3869 1819.1853 1821.5128 1822.9045 1824.1103 1824.9250 
     PC17      PC18      PC19      PC20      PC21      PC22      PC23 
1825.2563 1825.4429 1825.5495 1825.6131 1825.6383 1825.6548 1825.6594

We also see that the first Eigenvalue is about half the variance and with the first two axes we have explained ~80% of the total variance

> head(cumsum(eigenvals(pcfit)) / pcfit$tot.chi)
      PC1       PC2       PC3       PC4       PC5       PC6 
0.5384240 0.7927453 0.8651851 0.9056821 0.9322031 0.9524749

A biplot can be drawn from the scores of the samples and species on the first two principal components

> plot(pcfit)

enter image description here

There are two issues here

The ordination is essentially dominated by three species — these species lie farthest from the origin — as these are the most abundant taxa in the data set
There is a strong arch of curve in the ordination, suggestive of a long or dominant single gradient that has been broken down into the two main principal components to maintain the metric properties of the ordination.

CA

A CA might assist with both these points as it handles long gradient better due to the unimodal response model, and it models relative composition of species not raw abundances.

The vegan / R code to do this is similar to the PCA code used above

cafit <- cca(varespec)
cafit

> cafit <- cca(varespec)
> cafit
Call: cca(X = varespec)

              Inertia Rank
Total           2.083     
Unconstrained   2.083   23
Inertia is mean squared contingency coefficient 

Eigenvalues for unconstrained axes:
   CA1    CA2    CA3    CA4    CA5    CA6    CA7    CA8 
0.5249 0.3568 0.2344 0.1955 0.1776 0.1216 0.1155 0.0889 
(Showed only 8 of all 23 unconstrained eigenvalues)

Here we explain about 40% of the variation among sites in their relative composition

> head(cumsum(eigenvals(cafit)) / cafit$tot.chi)
      CA1       CA2       CA3       CA4       CA5       CA6 
0.2519837 0.4232578 0.5357951 0.6296236 0.7148866 0.7732393

The joint plot of the species and site scores is now less dominated by a few species

> plot(cafit)

enter image description here

Which of PCA or CA you choose should be determined by the questions you wish to ask of the data. Usually with species data we are more often interested in difference in the suite of species so CA is a popular choice. If we have a data set of environmental variables, say water or soil chemistry, we wouldn't expect those to respond in a unimodal manner along gradients so CA would be inappropriate and PCA (of a correlation matrix, using scale = TRUE in the rda() call) would be more appropriate.

Constrained ordination; CCA

Now if we have second set of data which we wish to use to explain patterns in the first species data set, we must use a constrained ordination. Often the choice here is CCA, but RDA is an alternative, as is RDA after transformation of the data to allow it to handle species data better.

data(varechem)                          # load explanatory example data

We re-use the cca() function but we either supply two data frames (X for species, and Y for explanatory/predictor variables) or a model formula listing the form of the model we wish to fit.

To include all variables we could use varechem ~ ., data = varechem as the formula to include all variables — but as I said above, this isn't a good idea in general

ccafit <- cca(varespec ~ ., data = varechem)

> ccafit
Call: cca(formula = varespec ~ N + P + K + Ca + Mg + S + Al + Fe + Mn +
Zn + Mo + Baresoil + Humdepth + pH, data = varechem)

              Inertia Proportion Rank
Total          2.0832     1.0000     
Constrained    1.4415     0.6920   14
Unconstrained  0.6417     0.3080    9
Inertia is mean squared contingency coefficient 

Eigenvalues for constrained axes:
  CCA1   CCA2   CCA3   CCA4   CCA5   CCA6   CCA7   CCA8   CCA9  CCA10  CCA11 
0.4389 0.2918 0.1628 0.1421 0.1180 0.0890 0.0703 0.0584 0.0311 0.0133 0.0084 
 CCA12  CCA13  CCA14 
0.0065 0.0062 0.0047 

Eigenvalues for unconstrained axes:
    CA1     CA2     CA3     CA4     CA5     CA6     CA7     CA8     CA9 
0.19776 0.14193 0.10117 0.07079 0.05330 0.03330 0.01887 0.01510 0.00949

The triplot of the above ordination is produced using the plot() method

> plot(ccafit)

enter image description here

Of course, now the task is to work out which of those variables is actually important. Also note that we have explained about 2/3 of the species variance using just 13 variables. one of the problems of using all variables in this ordination is that we've created an arched configuration in sample and species scores, which is purely an artefact of using too-many correlated variables.

If you want to know more about this, check out the vegan documentation or a good book on multivariate ecological data analysis.

Relationship with regression

It is simplest to illustrate the link with RDA, but CCA is just the same except everything involves row and column two-way-table marginal sums as weights.

At it's heart, RDA is equivalent to the application of PCA to a matrix of fitted values from a multiple linear regression fitted to each species (response) values (abundances, say) with predictors given by the matrix of explanatory variables.

In R we can do this as

## centre the responses
spp <- scale(data.matrix(varespec), center = TRUE, scale = FALSE)
## ...and the predictors
env <- as.data.frame(scale(varechem, center = TRUE, scale = FALSE))

## fit a linear model to each column (species) in spp.
## Suppress intercept as we've centred everything
fit <- lm(spp ~ . - 1, data = env)

## Collect fitted values for each species and do a PCA of that
## matrix
pclmfit <- prcomp(fitted(fit))

The Eigenvalues for these two approaches are equal:

> (eig1 <- unclass(unname(eigenvals(pclmfit)[1:14])))
 [1] 820.1042107 399.2847431 102.5616781  47.6316940  26.8382218  24.0480875
 [7]  19.0643756  10.1669954   4.4287860   2.2720357   1.5353257   0.9255277
[13]   0.7155102   0.3118612
> (eig2 <- unclass(unname(eigenvals(rdafit, constrained = TRUE))))
 [1] 820.1042107 399.2847431 102.5616781  47.6316940  26.8382218  24.0480875
 [7]  19.0643756  10.1669954   4.4287860   2.2720357   1.5353257   0.9255277
[13]   0.7155102   0.3118612
> all.equal(eig1, eig2)
[1] TRUE

For some reason I can't get the axis scores (loadings) to match, but invariably these are scaled (or not) so I need to look into exactly how those are being done here.

We don't do the RDA via rda() as I showed with lm() etc, but we use a QR decomposition for the linear model part and then SVD for the PCA part. But the essential steps are the same.

+1 and looking for continuation! Several comments: (1) in your example PC1 is not orthogonal to PC2; you could e.g. change $+1.3$ to $-5.6$ to fix it. (2) It would probably make sense to edit the title of the OP to reflect the contents of your answer; current edition of the title is mine, but I had little idea what OP was talking about. (3) Is the "response" usually univariate or multivariate? Sounds like the latter, but how is e.g. millipede abundance multivariate? Abundance of several species? (4) How are these methods different from regression? Can you include some mathematical pointers? — amoeba, Dec 07 '14 at 18:55
Thanks for the suggestion and follow up - it didn't occur to me to make the linear combination examples orthogonal but I have updated them. Re 2), I made a presumption, but given that there are c. 12,000 species of millipedes, I suspect the response here is observations of the abundance of $m$ species at each of $n$ sampling locations. In that sense, the RDA or CCA would model a multivariate response matrix of dimension $n \times m$. I'll try to deal with 4 later after I put the kids to bed. — Gavin Simpson, Dec 07 '14 at 23:29
@amoeba Sorry for the delay but I've added a section to my answer to try to show the link with regression and how RDA can be viewed as a PCA of fitted values from a series of linear regressions, one per response variable. — Gavin Simpson, Dec 10 '14 at 02:40
Thanks, @Gavin! Given response matrix $Y$ (samples in rows) and predictor matrix $X$, multivariate regression finds matrix $\beta$ such that $\|Y-X\beta\|$ is minimized (in a more common case of univariate $Y$, $\beta$ is just one column of coefficients); you say that RDA then performs SVD of $\beta$ matrix. Are you familiar with something called "reduced-rank regression"? There the problem is to find $\beta_r$ of given rank $r$ minimizing the same cost function. One solves it by doing standard regression first and then doing SVD of $X\beta$, not just $\beta$. I wonder how this is all related. — amoeba, Dec 10 '14 at 15:26
@amoeba We do do SVD of $\mathbf{X}\beta$ (fitted values) not of the coefficients $\beta$, at least that's what `fitted()` gives: $\mathbf{X}\beta$. Hence RDA is often called reduced-rank regression. — Gavin Simpson, Dec 10 '14 at 15:33
Ah, @Gavin, yes indeed. Very interesting! I am working with reduced rank regression in an *entirely* different context, but I had no idea about even existence of RDA. Can you maybe recommend me any texts covering RDA/CCA, ideally for a mathematically oriented audience (rather than purely user-oriented discussions "how to interpret triplots" etc.)? — amoeba, Dec 10 '14 at 16:59
The origin of RDA is due to [Rao (1964)](http://www.jstor.org/stable/25049339) which is a statistical paper so should be appropriate. — Gavin Simpson, Dec 10 '14 at 17:18

What criteria to use for separating variables into explanatory variables and responses for ordination methods in ecology?

1 Answers1

Example

Unconstrained ordination

PCA

CA

Constrained ordination; CCA

Relationship with regression

Linked