I have a dataset of 151 observations gathered during a sampling campaign on green roofs. It contains several response variables related to vegetation diversity and around 20 explanatory variables describing plot age and soil characteristics (depth, pH, etc.).
I am unsure how to proceed with this dataset. While descriptive statistics have already delivered some insights, the main goal is to gain insight into which factors influence a given response variable. So far, I have considered two approaches:
1. Testing a specific hypothesis (e.g. age influences the number of species):
Advantages: straightforward and informative
Disadvantages: limited a priori information available from other studies, exploratory analyses (pairwise plots, ordination) do not indicate clear relationships. Furthermore, it's unclear to me how complex the models for testing a specific hypothesis can/should be.
2. Building a descriptive model:
Advantages: no or only limited a priori selection of explanatory variables is needed. However, backward stepwise model selection or model/data dredging in a 'blind' manner is probably not a good idea (see this post). Some testing with this approach did yield results that make sense and deliver insight.
Disadvantages: apart from (probably) not being a good idea, backward stepwise selection and model dredging with many explanatory variables in the model can take a long time.
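
To give a rough sense of scale for the dredging part, here is a minimal Python sketch that counts the candidate models an exhaustive all-subsets search would have to fit. It assumes exactly 20 explanatory variables, main effects only, and no interactions or random-effect variations; the function name `count_candidate_models` is made up for illustration and does not come from any package.

```python
from math import comb


def count_candidate_models(n_predictors, max_terms=None):
    """Count main-effects-only models, including the intercept-only model.

    If max_terms is given, only models with at most that many
    explanatory variables are counted.
    """
    if max_terms is None:
        max_terms = n_predictors
    return sum(comb(n_predictors, k) for k in range(max_terms + 1))


# Assumption: "around 20" explanatory variables taken as exactly 20.
n = 20
print(count_candidate_models(n))     # every subset: 1048576
print(count_candidate_models(n, 5))  # at most 5 variables per model: 21700
```

Capping the number of terms per model shrinks the search considerably, but it does not address the concern that blind selection may be a poor idea in the first place.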
Questions:
1. What would be the right approach in my case? Are there things/options that I have overlooked?
2. If hypothesis testing is the best approach, how complex should the models for this testing be? (e.g. should random effects always be included, and should other fixed effects be controlled for in a model testing a specific fixed effect?)