
I would like to do a regression analysis for my data, but I am not sure what kind of regression to apply. I want to fit weed_coverage [%] ~ soil_moisture [%] for each date. Below you can see a sample (n=10) of my data and some diagnostic plots of the residuals. I know it's best to post just one question, but I have a couple, and I hope you don't mind:

Since I am measuring soil moisture and weed coverage at the same kettleholes on every date, does that mean I have dependent (paired?) observations? In that case I can't use simple linear regression, as far as I understand. Also, the residuals are not normally distributed. Do I need generalized linear mixed models (GLMMs) then? What is the best approach here? Thanks a lot in advance!

         date kettlehole treatment distance soil_moisture yield weed_coverage
1  2021-04-27       1189         a        2         11.32  56.1           1.0
2  2021-03-17        119         b        5         36.17  68.3           0.6
3  2021-04-07        552         b        5         25.77  72.6           1.2
4  2021-04-07        119         a        2         26.15  48.5           0.2
5  2021-05-11        119         b        2          7.55  52.9           2.0
6  2021-04-27       1202         b        2         19.45  65.6           2.0
7  2021-04-07        119         a        2         23.10  63.6           1.0
8  2021-05-11       2484         b        5         10.28  43.4           4.5
9  2021-03-17        552         b        5         36.00  68.1           0.9
10 2021-04-07       2484         a        2         26.15  48.8           1.5

[Images: diagnostic plots of the residuals]

Effigy

1 Answer


Since you said you want to repeat this analysis for each date, pick just one of the dates and imagine for a minute that the index of each kettlehole gives you the average amount of pesticide of a kettlehole*. What would you do in that case? You'd certainly regress weed_coverage on soil_moisture AND kettlehole because the relationship between weed_coverage and soil_moisture should differ with the amount of pesticide in each kettlehole.

Now back to your reality, as I understand it, in which the kettlehole number is just an index devoid of any meaning. We can't include it in the regression as a numeric predictor, but you could create a dummy variable for every kettlehole except one (the reference level) and put those in your regression to effectively obtain the desired relationship within each kettlehole. In so doing, you've effectively run a Fixed Effects Regression.
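To make the dummy-variable idea concrete, here is a minimal sketch in plain Python. It relies on the standard result that, for a single predictor, a regression with one dummy per kettlehole gives the same slope as demeaning both variables within each kettlehole (the "within" estimator). The function name `within_slope` is my own, and the numbers are just the ten sample rows from the question, so the result is only an illustration, not an analysis of the full data.

```python
from collections import defaultdict


def within_slope(groups, x, y):
    """Slope of y on x after giving every group its own intercept."""
    # Collect the observations that belong to each group (each kettlehole).
    members = defaultdict(list)
    for g, xi, yi in zip(groups, x, y):
        members[g].append((xi, yi))

    sxy = 0.0  # sum of within-group cross products
    sxx = 0.0  # sum of within-group squared deviations of x
    for obs in members.values():
        x_mean = sum(xi for xi, _ in obs) / len(obs)
        y_mean = sum(yi for _, yi in obs) / len(obs)
        for xi, yi in obs:
            sxy += (xi - x_mean) * (yi - y_mean)
            sxx += (xi - x_mean) ** 2

    if sxx == 0:
        raise ValueError("no within-group variation in x")
    return sxy / sxx


# The ten sample rows posted in the question (not the full data set).
kettlehole = [1189, 119, 552, 119, 119, 1202, 119, 2484, 552, 2484]
soil_moisture = [11.32, 36.17, 25.77, 26.15, 7.55, 19.45, 23.10, 10.28, 36.00, 26.15]
weed_coverage = [1.0, 0.6, 1.2, 0.2, 2.0, 2.0, 1.0, 4.5, 0.9, 1.5]

slope = within_slope(kettlehole, soil_moisture, weed_coverage)
print(f"within-kettlehole slope: {slope:.4f}")
```

Kettleholes that appear only once (1189 and 1202 in the sample) contribute nothing to the slope, because a single observation has no variation around its own mean. That is exactly what happens to them in the dummy-variable regression too.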

Entity fixed effects can control for variables that are constant over time but differ across entities (i.e. kettleholes). Time fixed effects can control for variables that are constant across entities but change over time. You can use both in the same model or just one of them, and you want to be guided by theory. Ask yourself if you expect to have entity fixed effects. Ask yourself if you expect to have time fixed effects.
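If you use both kinds of fixed effects at once, the same demeaning trick extends to two dimensions, at least for a balanced panel (every entity observed at every time point). The sketch below is an assumption-laden illustration: the function name `two_way_within_slope` is mine, and the panel is synthetic, generated from a known slope plus made-up entity and time effects, purely to show that the transformation removes both sets of effects. Your real data look unbalanced, in which case you would need explicit dummies or a dedicated panel package instead.

```python
def two_way_within_slope(x, y):
    """Slope of y on x with entity and time fixed effects.

    x and y are balanced panels stored as lists of rows: x[i][t] is the
    value for entity i at time t. Valid only when no cell is missing.
    """
    n, n_times = len(x), len(x[0])

    def demean(m):
        # Subtract entity means and time means, then add back the grand mean.
        row = [sum(r) / n_times for r in m]
        col = [sum(m[i][t] for i in range(n)) / n for t in range(n_times)]
        grand = sum(row) / n
        return [[m[i][t] - row[i] - col[t] + grand for t in range(n_times)]
                for i in range(n)]

    xd, yd = demean(x), demean(y)
    sxy = sum(xd[i][t] * yd[i][t] for i in range(n) for t in range(n_times))
    sxx = sum(xd[i][t] ** 2 for i in range(n) for t in range(n_times))
    return sxy / sxx


# Synthetic, made-up panel: 3 entities x 3 time points (NOT the asker's data).
true_slope = -0.1
entity_effect = [0.5, -1.0, 2.0]   # constant over time, differs by entity
time_effect = [0.0, 3.0, -2.0]     # constant across entities, differs by time
x = [[1.0, 2.0, 4.0],
     [2.0, 5.0, 3.0],
     [7.0, 1.0, 2.0]]
y = [[true_slope * x[i][t] + entity_effect[i] + time_effect[t]
      for t in range(3)] for i in range(3)]

estimate = two_way_within_slope(x, y)
print(f"estimated slope: {estimate:.6f}")
```

Because the synthetic outcome has no noise, the estimate matches `true_slope` up to floating-point error, even though the entity and time effects are much larger than the slope itself.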

Think of entity fixed effects as conditioning. Instead of conditioning on a particular set of variables for which you have data, you are conditioning on the set of all variables that do not change over time, whether or not you have data on them. You are implicitly exploring the relationship of interest within each entity and aggregating the result.

Finally, ask yourself why you want to explore this relationship for each date separately. Why not explore this relationship across time?

*You haven't postulated a causal model here so we'll speak in terms of associations. See here for why this matters.

ColorStatistics
  • Hey, and thanks a lot. I want to do it for each date separately because I want to see what effect soil moisture has on weed coverage early in the year (e.g. 2021-03-17) compared to later on. I guess it makes sense to do a regression with entity and time fixed effects as well. (Did you by any chance mix up the terms? If they were swapped, it would make more sense to me.) However, since my residuals are not normally distributed and are dependent (= paired?), I would need GLMMs for that, right? – Effigy Feb 04 '22 at 13:14
  • For your question regarding the residuals, see here: https://stats.stackexchange.com/a/197589/198058 – ColorStatistics Feb 04 '22 at 13:22