(if my question should be cut up into sub-questions please let me know, since all those questions are related I decided to ask them here together as one long question)
Main question
As part of my research I try to perform multivariate analysis in order to establish which environmental factors are important in explaining the vegetation species composition over time in a newly constructed wetland (with 27 different basins, each sampled yearly for 4 years).
Apart from that, I aim to analyse whether those environmental factors change over time, and which of them influence biomass production 4 years after establishment of the wetland. For those last two questions I know how to perform the analyses, but for the multivariate analyses I am not sure which test to choose, how to interpret the results and which statistics to report in my manuscript.
Since I perform all my statistical analyses in R, I would like to also perform the multivariate analyses in R. I have read up on the vegan-package and think it contains all I need, but I cannot figure out some of the details.
Overview of my data
My species-data are Tansley-classes (abundance classes 0-9), with 77 species and 108 observations (27 basins * 4 years).
My environmental-data consist of two groups: sediment samples and water samples. The sediment samples were only taken in the last year (27 observations, 19 vector-variables, 2 factor-variables), the water samples were taken each year (108 observations, 40 vector-variables, 2 factor-variables, 1 column with 'year' as vector-variable).
- If I want to include the sediment samples in my multivariate analyses, can I simply use the same numbers for all years? If not, does this mean I cannot use them, since I don't have numbers for all rows in my species-data?
- I had some NA's for the water samples and got some problems when trying out some MVA-tests. Therefore, I filled them up with the average number per variable for the same basin. Are there any strong reasons why I should not do this?
- I am not sure whether or not to use 'year' as a variable for the environment. I measured everything over time, so it feels right to take it into account, but should I do this by taking it as a vector-variable? And should I then only use it in CCA, or also in CA with ENVFIT? I think I should only use 'year' in CCA, so I will still see which other factors influence species composition.
Tests which I tried with example data
To show you what I tried, I will first supply some example data (with species-data in spec (containing classes 0-9), and environment-data in env (containing numbers, factor, and year)):
data(dune)
spec = rbind(dune,dune)
spec = spec[1:36,]
v = paste0(rep(LETTERS[1:9], each=4), rep(1:4,9))
row.names(spec) = v
env = data.frame(replicate(40,sample(0.0:20.0,36,rep=TRUE)))
env = env/10
row.names(env) = v
env$fac1 = as.factor(rep(c("gr1","gr2","gr3"), each=12))
env$year = as.numeric(rep(1:4,9))
env_excl_year = subset(env, select = -year)
I think I should do the following with my data:
- Perform an unconstrained correspondence analysis (or RDA, how to choose?)
- Fit environmental vectors to the ordination
- Use significant environmental factors in constrained model
- Make useful plots
This is what I did in R (with questions in between):
library(vegan)
spec.ca = cca(spec)
spec.ca
What should I report from spec.ca? Since the inertia is mean squared contingency coefficient (instead of variance or correlation), I think it doesn't make sense to report the Eigenvalues. Or should I use: spec.rda = rda(spec, scale=TRUE)
? With inertia being correlation, where it does make sense to report Eigenvalues?
ef_ca = envfit(spec.ca, env_excl_year, permu=999) #without year
ef_ca
(scores_ef_ca = scores(ef_ca,"vectors"))
What should I report from ef_ca? Why are the scores different from the number I see when inspecting ef_ca
? Can I indeed use R2 as "the proportion of variation in species composition explained by this variable"? And can I use the p-value to establish whether the variable can explain species composition (with p<0.05: H0 of no explanation by this variable rejected)?
In my case, X13 and year had p<0.05 (since the data is made randomly, you may have a different outcome). I would also like to know more about 'fac1' (with my real data, this one also has p<0.05). If I would like to say more about those variables, specifically about how they can explain the development of the species composition in the different basins (see research question), should I add those to a constrained model (cca), inclusing year (which was not included in ca, was this a good decision?)?
spec.cca = cca(spec ~ X13 + year + fac1, env)
spec.cca
anova(spec.cca)
RsquareAdj(spec.cca)
anova(spec.cca, by="term", step=200) #Type-I: sequential test, dependent on order
anova(spec.cca, by="margin", step=500) #Type-III: marginal effects, independent of order
anova(spec.cca, by="axis", step=1000) #test of individual axes
And should I then report on the F- and p-value of the whole model, with R2, and F- and p-value for separate terms? For the terms: should I use "term" or "margin", and why (results differ)? How do you choose the amount for step? Should I perform VIF or correlation analysis to exclude certain variables from my analyses? What is a good rule of thumb to use here for exclusion?
I know you can also perform cca with all variables or using forward or backward selection.
spec.cca.all = cca(spec ~ ., env)
spec.cca.all
spec.cca.start = cca(spec ~ 1, env)
spec.cca.forward = step(spec.cca.start, scope = formula(spec.cca.all), test = "perm")
spec.cca.forward
spec.cca.backward = step(spec.cca.all, scope = list(lower=formula(spec.cca.start), upper = formula(spec.cca.all)), trace = 0)
spec.cca.backward
When I did this with forward selection, I got exactly the same model as when choosing variables by envfit p<0.05. Is this a coincidence, or logical? Which is preferred: using envfit or forward selection? I guess envfit, because I think the exact same results was just random, but I am not sure.
I would like to plot my data in a clear way. However, my plots tend to get a bit crowded. Also, I am not sure when to use which scaling (I think scaling = 1 means sites are weighted averages of species, scaling = 2 means species are weighted averages of sites, but then which one do you choose and why?). If I plot cca, the plot is immediately very crowded, if I plot ca I have a bit more freedom.
plot(spec.cca) #very crowded, seems not to be adaptable
plot(spec.ca, scaling=2) #less crowded, and adaptable
p1 = plot(spec.ca, scaling=2, type="n")
p1_sel = orditorp(spec.ca, display="species",labels = names(spec), pcol="green", pch="+")
p1_sel = orditorp(spec.ca, display="sites",labels = row.names(spec), pcol="blue", pch="o")
plot(ef_ca, p.max=0.05, col = "red")
with(env, ordispider(spec.ca, fac1, col=c("orange","purple","magenta"), label=T))
Can I adapt the cca-plot to be less crowded? Can I just also use scaling = 1 in the code above for ca? I am not sure if I can do this, since ef_ca keeps plotting the arrows in the same direction, even though the arrangement of species and sites differs. Probably I should just make multiple plots, each containing a part of the information. Any hints are very welcome!
My species-data can be grouped in three groups, based on another data.frame, say veg_groups = as.data.frame(t(rbind(names(spec),rep(c("sp1","sp2","sp3"), each=10))))
. Can I also use this in my plots, say coloring the species in three different colours depending on veg_groups? How do I do this?
Thank you all very much in advance for helping me out, even if it is only with a little part of my question.