I am new to CV and statistical data analysis, so I hope my question is precise enough to get some answers here.
I am working on some data from cherry trees. I have data from many different varieties over multiple years. My data contains some fruit parameters as well as weather data. Not every variety is present at each location or measured every year. Most of my data come from one location, and only a minority of varieties were sampled over many years. Other locations only provide data from one or two years per variety. I mostly have average values from around three to four trees per variety, and unfortunately the plots were not randomized.
I have now looked at some fruit parameters from the data and would like to run a model.
For example, the fruit diameter seems to correlate with the temperature around the flowering date. When plotting fruit diameter against temperature at flowering for each variety separately, it seems as if there are some differences between the varieties: for some varieties the regression line is steeper.
Here's how that looks:
So this could mean that certain varieties are not affected as much by the temperature during flowering as others (not necessarily a causal relationship). So I was thinking that I could either run an lm() model for each variety separately or run a linear mixed model. From what I have googled so far, lmer() would have the advantage that it copes better with missing data and can also deal with non-independent sampling. Most trees are used for several years, so the data from the same variety are correlated.
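To make the first option concrete, here is a minimal sketch of what I mean by one lm() per variety. The data are simulated and the numbers are invented purely for illustration; only the column names (fruit_diameter, T.flowering, variety) match my real data.

```r
# Simulated stand-in data: three made-up varieties, values are NOT real
set.seed(1)
data <- data.frame(
  variety = rep(c("A", "B", "C"), each = 8),
  T.flowering = runif(24, min = 8, max = 18)
)

# Assumed (invented) slope for each made-up variety
true_slope <- c(A = 0.2, B = 0.5, C = 0.8)
data$fruit_diameter <- 20 +
  unname(true_slope[as.character(data$variety)]) * data$T.flowering +
  rnorm(24, sd = 0.5)

# One separate regression per variety; nothing is shared between varieties
fits <- lapply(split(data, data$variety), function(d) {
  lm(fruit_diameter ~ T.flowering, data = d)
})

# Intercept and slope per variety, one row per variety
per_variety <- t(sapply(fits, coef))
print(per_variety)
```

With this approach each variety gets its own intercept, slope and residual variance, so a variety with only one or two years of data would be fitted from just those few points.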
My mixed model would look like this:
library(lme4)
# Random intercept and random T.flowering slope for each variety
model <- lmer(fruit_diameter ~ T.flowering +
              (T.flowering | variety), data = data)
This model visualized looks like this:
I also looked at further parameters that correlate with the diameter and came up with this more detailed model:
# "||" fits the random intercept and slope without a correlation between them
model2 <- lmer(fruit_diameter ~ T.flowering + yield + humidity
               + (T.flowering || variety), data = data)
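To see the variety-specific slopes that a model with random slopes estimates, I believe coef() can be used. Below is a minimal sketch on simulated data (ten made-up varieties, all numbers invented; only the column names match my data). It fits the simpler of my two models and may print a singular-fit or convergence message on such toy data.

```r
library(lme4)

# Simulated stand-in data: values are NOT real measurements
set.seed(42)
n_var <- 10   # number of made-up varieties
n_obs <- 8    # observations per variety
variety <- factor(rep(sprintf("V%02d", 1:n_var), each = n_obs))
slopes <- rnorm(n_var, mean = 0.5, sd = 0.2)  # invented variety-specific slopes
T.flowering <- runif(n_var * n_obs, min = 8, max = 18)
fruit_diameter <- 20 + slopes[as.integer(variety)] * T.flowering +
  rnorm(n_var * n_obs, sd = 0.5)
data <- data.frame(variety, T.flowering, fruit_diameter)

# Random intercept and random slope of T.flowering for each variety
model <- lmer(fruit_diameter ~ T.flowering + (T.flowering | variety),
              data = data)

# Population-level (average) intercept and slope
print(fixef(model))

# Variety-specific intercepts and slopes (fixed effect + random deviation)
variety_coefs <- coef(model)$variety
print(variety_coefs)

# Estimated between-variety variances and residual variance
print(VarCorr(model))
```

As far as I understand, these variety-specific estimates are pulled towards the overall average, which is where they would differ from separate lm() fits.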
What would be the advantages of running an lm model for each variety? Would it be advisable to use a mixed model for this kind of data? What are some disadvantages?
I only have averaged values from the three to four trees per year; does this affect the model choice?
I am also happy to get links to similar topics or questions that could help me.