
I am new to CV and statistical data analysis so I hope I can make my question precise enough to get some answers on here.

I am working on some data from cherry trees, covering many varieties over multiple years. The data contain fruit parameters as well as weather data. Not every variety is present at every location or was measured every year. Most of my data come from one location, and only a minority of varieties were sampled over many years; the other locations provide only one or two years of data per variety. Mostly I have average values from around three to four trees per variety, and unfortunately the plots were not randomized.

I have now looked at some fruit parameters from the data and would like to run a model.

For example, the fruit diameter seems to correlate with the temperature around the flowering date. When plotting the fruit diameter against the temperature at flowering for each variety separately, it seems that there are some differences between the varieties. For some varieties the regression line is steeper.

Here's how that looks:

[Figure: fruit diameter plotted against flowering temperature for each variety. The line is the linear regression.]

So this could mean that certain varieties are not affected as much by the temperature during flowering as others (not necessarily a causal relationship). So I was thinking that I could either run an lm() model for each variety separately or run a linear mixed model. From what I have googled so far, lmer() would have the advantage that it copes better with missing data and can also deal with non-independent sampling. Most trees are used for several years, so the data from the same variety are correlated.

My mixed model would look like this:

library(lme4)
# random intercept and random T.flowering slope for each variety
model <- lmer(fruit_diameter ~ T.flowering +
                (T.flowering | variety), data = data)

This model visualized looks like this:

[Figure: the lmer() model fit]

I also looked at further parameters that correlate with the diameter and came up with this more detailed model:

# "||" requests uncorrelated random intercepts and slopes per variety
model2 <- lmer(fruit_diameter ~ T.flowering + yield + humidity
               + (T.flowering || variety), data = data)

What would be the advantages of running a lm model for each variety? Would it be advisable to use a mixed model for this kind of data? What are some disadvantages?

I only have averaged values from the three to four trees per year, is this impacting the model choice?

I am also happy to get links to similar topics or questions that could help me.

Lisa

1 Answer


What would be the advantages of running a lm model for each variety?

I can't think of any. With only a few observations on most varieties you will have little ability to get precise coefficient estimates for individual varieties that way. A mixed model is a well-accepted way to pool information efficiently in a situation like this. See the discussion on this page, for example.
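To make the pooling argument concrete, here is a minimal sketch that fits both approaches to simulated data. Everything about the data is an assumption made for illustration (12 hypothetical varieties, half with 10 observations and half with only 3, made-up effect sizes); only the column names follow your question. `lmList()` from lme4 fits a separate `lm()` per variety.

```r
library(lme4)

# Simulated stand-in for the cherry data (all values are hypothetical).
set.seed(1)
n_var   <- 12
n_obs   <- rep(c(10, 3), each = n_var / 2)        # six well-sampled, six sparse varieties
variety <- factor(rep(paste0("V", 1:n_var), times = n_obs))
n       <- length(variety)
idx     <- as.integer(variety)

T.flowering <- runif(n, 8, 18)                     # assumed temperature range
a <- rnorm(n_var, 25, 1.5)                         # variety-specific intercepts
b <- rnorm(n_var, 0.4, 0.15)                       # variety-specific slopes
d <- data.frame(variety, T.flowering,
                fruit_diameter = a[idx] + b[idx] * T.flowering + rnorm(n, 0, 0.8))

# No pooling: a separate lm() for each variety.
sep <- lmList(fruit_diameter ~ T.flowering | variety, data = d)

# Partial pooling: one mixed model with random intercepts and slopes.
mm <- lmer(fruit_diameter ~ T.flowering + (T.flowering | variety), data = d)

# Per-variety slope estimates from both approaches, side by side.
ids <- rownames(coef(mm)$variety)
slopes <- data.frame(
  n_obs    = as.vector(table(d$variety)[ids]),
  separate = coef(sep)[ids, "T.flowering"],
  mixed    = coef(mm)$variety[ids, "T.flowering"],
  row.names = ids
)
print(slopes)
```

In a setup like this I would expect the separate slopes for the sparsely sampled varieties to scatter widely, while the mixed-model slopes are pulled toward the overall slope. The mixed model may report a singular fit or a convergence warning on data this small, which is itself a hint about how little information a few observations per variety carry.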

What are some disadvantages?

A tradeoff is that you are assuming Gaussian distributions of the random effects. You also have to make choices about correlations among random effects. For example, your first lmer model, with (T.flowering | variety), estimates a correlation between the random slopes and intercepts, while your second, with (T.flowering || variety), constrains that correlation to zero. See this answer.
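As a rough illustration of that choice, the sketch below fits both random-effect structures to simulated data and compares them. The data are hypothetical (15 varieties with 8 observations each, made-up effect sizes); only the variable names come from your question. `anova()` on two `lmer` fits refits them by maximum likelihood before running the likelihood-ratio test.

```r
library(lme4)

# Simulated stand-in for the cherry data (all values are hypothetical).
set.seed(2)
n_var   <- 15
variety <- factor(rep(paste0("V", 1:n_var), each = 8))
n       <- length(variety)
idx     <- as.integer(variety)

T.flowering <- runif(n, 8, 18)
a <- rnorm(n_var, 25, 1.5)                         # variety-specific intercepts
b <- rnorm(n_var, 0.4, 0.15)                       # variety-specific slopes
d <- data.frame(variety, T.flowering,
                fruit_diameter = a[idx] + b[idx] * T.flowering + rnorm(n, 0, 0.8))

# Correlated random intercept and slope (one extra covariance parameter).
m_corr   <- lmer(fruit_diameter ~ T.flowering + (T.flowering | variety), data = d)

# Uncorrelated random intercept and slope.
m_uncorr <- lmer(fruit_diameter ~ T.flowering + (T.flowering || variety), data = d)

print(VarCorr(m_corr))     # shows the estimated intercept-slope correlation
print(VarCorr(m_uncorr))   # no correlation term

cmp <- anova(m_uncorr, m_corr)   # refits with ML, then likelihood-ratio test
print(cmp)
```

The two models differ by exactly one parameter, the intercept-slope correlation. With an uncentered predictor such as temperature that correlation is often estimated close to -1 or 1, so centering `T.flowering` before fitting may be worth trying.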

I only have averaged values from the three to four trees per year, is this impacting the model choice?

Generally, the closer you get to raw measurements to start, the better. So if you had repeated measurements on the same individual trees of each variety, then using each observation individually with a random effect of tree within variety might be better. But if you don't have the raw data per tree all you can do is work with what you have. The mixed model would be superior in any event.
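If per-tree measurements were available, the nesting could be written as below. This is only a sketch on simulated data: the `tree` column, the balanced design (10 varieties, 4 trees each, 6 years) and all effect sizes are assumptions, and I use random intercepts only to keep it short.

```r
library(lme4)

# Hypothetical per-tree data: 10 varieties x 4 trees x 6 years.
set.seed(3)
d <- expand.grid(tree    = factor(1:4),
                 variety = factor(paste0("V", 1:10)),
                 year    = 1:6)

year_temp     <- runif(6, 8, 18)                   # one flowering temperature per year
d$T.flowering <- year_temp[d$year]

# Tree labels 1-4 repeat across varieties, so build a unique id per tree.
tree_id  <- interaction(d$variety, d$tree, drop = TRUE)
var_eff  <- rnorm(10, 0, 1.5)[as.integer(d$variety)]
tree_eff <- rnorm(nlevels(tree_id), 0, 0.7)[as.integer(tree_id)]

d$fruit_diameter <- 25 + 0.4 * d$T.flowering + var_eff + tree_eff +
  rnorm(nrow(d), 0, 0.8)

# (1 | variety/tree) expands to a variety intercept plus a
# tree-within-variety intercept.
m_tree <- lmer(fruit_diameter ~ T.flowering + (1 | variety/tree), data = d)
print(VarCorr(m_tree))
```

The `variety/tree` syntax matters when tree labels are reused across varieties, as they are here; with globally unique tree ids, `(1 | variety) + (1 | tree)` would describe the same structure.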

EdM