
I have a dataset of 151 observations gathered during a sampling campaign on green roofs. It contains several response variables related to vegetation diversity and around 20 explanatory variables describing plot age and soil characteristics (depth, pH, etc.).

I am unsure how to proceed with this dataset. While general descriptive statistics have already delivered some insights, the main goal is to gain insight into which factors influence a given response variable. So far, I have considered two approaches:
1. Testing a specific hypothesis (e.g. age influences the number of species):
Advantages: straightforward and informative
Disadvantages: limited a priori information available from other studies, exploratory analyses (pairwise plots, ordination) do not indicate clear relationships. Furthermore, it's unclear to me how complex the models for testing a specific hypothesis can/should be.

2. Building a descriptive model
Advantages: no or limited a priori selection of explanatory variables. Some testing with this approach did yield results that make sense and deliver insight. However, backward stepwise selection or model/data dredging in a 'blind' manner is probably not a good idea (see this post).
Disadvantages: apart from (probably) not being a good idea, backward stepwise selection and model dredging with many explanatory variables can take a long time.

Questions:
1. What would be the right approach in my case? Are there things/options that I have overlooked?
2. If hypothesis testing is the best approach, how complex should the models be? For example, should random effects always be included, and should other fixed effects be controlled for in a model testing a specific fixed effect?

jvnstckm
  • My first reaction is that by your use of the word “influence,” you are looking for a causal relationship. Were these data generated in such a way that you can infer causality from either approach? I don’t know vegetation that well (read: at all), so I’m not sure what the limits of this analysis would be. But it seems to me that your question is about which will give you more convincing causal answers. Correct me if I’m wrong. – RickyB Nov 08 '17 at 14:20
  • The data were collected by visiting multiple green roofs and conducting a survey of the plants (e.g. number of species) and their environment (e.g. roof age, characteristics of the soil) on each one. So, for instance, when investigating the "influence" of age on the number of plant species, I want to know whether the number of species increases as roof age increases. Stated like this, I think it would be best to investigate it as a specific hypothesis, but then question 2 applies. – jvnstckm Nov 08 '17 at 14:48
  • I’m not sure if this is a “specific hypotheses versus not” issue. Both a bivariate analysis, like a correlation between the diversity of species and the roof age, and a regression analysis of diversity on age, size, etc. get at a “specific hypothesis” in that you are focusing on a particular predictor. The bigger issue is whether *either* of these methods will truly be able to tell you the causal relationship in the absence of a source of exogenous variation. The regression methods will get you closer, but if you can think of any alternative explanation, you need to do something else. – RickyB Nov 08 '17 at 14:55
  • Furthermore, as I have limited knowledge beforehand about which environmental factors should/could matter, it's hard to know which (and how many different) hypotheses I should test. I imagine it's also not a good idea to test 10-20 hypotheses, one for each explanatory variable separately. This is where building (and dredging or backward selecting) a descriptive model becomes tempting to see which variables could matter (although using it this way is probably not recommended). – jvnstckm Nov 08 '17 at 14:56
  • Again, it depends on your goals. Many analyses end up doing both - displaying a table with bivariate relationships and then also doing the regression analysis with all the predictors included. If you’re trying to do prediction, then there’s another set of things you could turn to, including some machine learning methods. – RickyB Nov 08 '17 at 15:01
  • Ok, I think I get it (but please correct me if I'm wrong): since my aim is mainly to gain insight into the predictors that matter and not to build an optimal predictive model, the bivariate relationships would be most interesting. Additionally, a regression analysis with everything included could deliver some additional insight into the relative impacts of all predictors. The problem of how far to go in the inclusion of predictors (interactions, etc.) in this model could then be solved mainly by relying on common (ecological) sense. – jvnstckm Nov 09 '17 at 08:23
  • However, should this regression model then be taken 'as is', i.e. as a full model with everything included (even though some predictors might not matter that much)? – jvnstckm Nov 09 '17 at 08:27
  • It depends on what you mean by “predictors that matter.” Some might argue that what matters is defined by variance captured, whereas others might talk about effect size. “What matters,” in my view, is a bit vague as a goal. I also wouldn’t talk about multiple regression in terms of “relative impact,” since regression by itself captures association rather than impact, and you cannot necessarily compare coefficients to get “relative” magnitudes of association, since those coefficients are often scale-dependent. – RickyB Nov 09 '17 at 15:12
  • More accurately, the multiple regression allows you to look at the association of your various factors *independent* of associations those factors may have with other factors. In other words, the “all else being equal” interpretation. – RickyB Nov 09 '17 at 15:14
  • Lastly, you seem to be asking about variable selection based on “what matters” in your data. This is frowned upon in some fields, see here for a discussion: http://www.lexjansen.com/pnwsug/2008/DavidCassell-StoppingStepwise.pdf – RickyB Nov 09 '17 at 15:17
  • The discussion you provide does indeed nicely summarize the problem I'm facing and which pitfalls I should avoid. The multiple regression including all variables (described as 'full(er) model' in the text) could be helpful (but would then also be considered more or less 'exploratory' in nature). For the "what matters" part: I think it would be best if I limit the number of variables in the full (which would also be the only) model based on (additional) literature review. This way, all variables in the model would give interesting results based on their (non-)significance. – jvnstckm Nov 09 '17 at 17:00
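
The two points made in the comments above (a multiple regression gives an "all else being equal" association, and raw coefficients are scale-dependent) can be illustrated with a minimal sketch. Everything in it is an assumption for illustration only: the data are simulated rather than taken from the green-roof survey, the variable names (`age`, `depth_cm`, `species`) and the data-generating process are hypothetical, and `ols` is a small hand-written least-squares helper, not a library function.

```python
import random
import statistics

def ols(X, y):
    """Return least-squares coefficients [intercept, b1, b2, ...]."""
    rows = [[1.0] + list(r) for r in X]  # prepend the intercept column
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y]
    A = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))]
         for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting
    for c in range(k):
        p = max(range(c, k), key=lambda idx: abs(A[idx][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                f = A[r][c]
                A[r] = [vr - f * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

def zscore(values):
    """Center to mean 0 and scale to standard deviation 1."""
    m, s = statistics.fmean(values), statistics.stdev(values)
    return [(v - m) / s for v in values]

# Simulated data (assumption, NOT the survey data): richness depends on
# soil depth only, and depth tends to be larger on older roofs.
random.seed(1)
n = 151
age = [random.uniform(1, 30) for _ in range(n)]                 # years
depth_cm = [4 + 0.3 * a + random.gauss(0, 1) for a in age]      # cm
species = [5 + 0.8 * d + random.gauss(0, 1) for d in depth_cm]  # richness, kept continuous for simplicity

# Bivariate slope: species ~ age
bivariate_age = ols([[a] for a in age], species)[1]
# Adjusted slope: species ~ age + depth ("all else being equal")
_, adjusted_age, depth_coef_cm = ols(list(zip(age, depth_cm)), species)

# Same model with depth in millimetres: only the unit changes
depth_mm = [10 * d for d in depth_cm]
_, _, depth_coef_mm = ols(list(zip(age, depth_mm)), species)

# Standardized coefficients: all variables z-scored before fitting
std_cm = ols(list(zip(zscore(age), zscore(depth_cm))), zscore(species))
std_mm = ols(list(zip(zscore(age), zscore(depth_mm))), zscore(species))

print(f"age slope, bivariate:       {bivariate_age:.3f}")
print(f"age slope, adjusted:        {adjusted_age:.3f}")
print(f"depth coefficient (per cm): {depth_coef_cm:.3f}")
print(f"depth coefficient (per mm): {depth_coef_mm:.3f}")
print(f"standardized age, depth:    {std_cm[1]:.3f}, {std_cm[2]:.3f}")
```

In this simulated setup the bivariate age slope is clearly positive while the adjusted one shrinks toward zero, because age is only related to richness through depth. The depth coefficient changes by a factor of ten when the unit changes, whereas the standardized coefficients stay the same. None of this makes the estimates causal; it only shows what "independent of the other predictors" and "scale-dependent" mean in practice.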
