6

My understanding of a random effect is based on this paper, specifically this definition:

Random effects: factors whose levels are sampled from a larger population, or whose interest lies in the variation among them rather than the specific effects of each level. (Bolker et al., 2009)

In ecology, random effects seem to be mostly used to avoid (pseudo-)replication from repeated measures, for example sampling from the same location repeatedly, or to account for phylogeny, i.e. that closely related species are more likely to be similar due to shared evolutionary history.

This seems to me to be only a restricted application of a random effect, based on the above definition. The Bolker definition says to me that treating a sampling unit as a random effect will control for unmeasured differences between sampling units that may affect the variables I’m interested in. Is this correct?

Say I have a study where I’m interested in measuring X. My sampling design involves paired sampling at a number of different locations (not repeated), on different dates. Pairs would be random effects, to avoid repeated measures as discussed above. What about location and date? I’m not interested in the differences between locations or date, only X. In fact, I’d like to control for the differences between location and date to get a better understanding of the effect of X on my response. Would treating location and date as random effects accomplish this? I.e.:

Response ~ X + (1|location/pair) + (1|date)

But why not treat location and date as fixed variables?

Response ~ X + location + date + (1|pair) 

This will still separate the effect of location and date from the effect of X, so why have them as random variables? If I have them as fixed effects I'll be able to measure the effect they're having on X, so why use random effects?


While the answers from @Royce Yang and @Guille were helpful to get me thinking along the right lines, the best explanation I've found (and should have found before posting this question, not sure how I missed it) is here (thanks @mkt for the link) and leading on from that, the post here. I think my question was the problem - I should have phrased it more broadly.

Thomas
  • 1
    Relevant, probably a duplicate: https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode – mkt Jun 28 '20 at 13:28
  • 1
    Thanks @mkt-ReinstateMonica, answers in this link are what I was after – Thomas Jul 01 '20 at 07:13

3 Answers

4

For almost all variables you have the choice to model them with a fixed or a random effect. I personally find the term random effect quite confusing, since random effects are usually just grouping factors for which we are trying to control. They are always categorical: R treats the grouping variable of a random effect as a factor, so a continuous variable cannot act as one. A lot of the time we are not specifically interested in their impact on the response variable, but we know that they might be influencing the patterns we see.

I’m not interested in the differences between locations or date, only X. In fact, I’d like to control for the differences between location and date to get a better understanding of the effect of X on my response. Would treating location and date as random effects accomplish this?

Including date and location as grouping factors (= random effects) would achieve exactly what you have outlined in your question.

But why not treat location and date as fixed variables?

Response ~ X + location + date + (1|pair)

This will still separate the effect of location and date from the effect of X, so why have them as random variables? If I have them as fixed effects I'll be able to measure the effect they're having on X, so why use random effects?

This second part is not entirely true. You are not measuring the effect of location and date on X; you are estimating the effects of location, date and X individually, each while holding the other two constant. Treating location and date as fixed effects also uses up more degrees of freedom, which might be undesirable with a small sample size.
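To make the degrees-of-freedom point concrete, here is a minimal Python sketch that counts the extra parameters each specification has to estimate. The level counts (10 locations, 5 dates) are hypothetical and not taken from the question, and the count assumes treatment (dummy) coding in a model with an intercept, so a factor with k levels costs k - 1 coefficients as a fixed effect, while a random intercept costs a single variance parameter however many levels there are.

```python
def fixed_effect_params(n_levels: int) -> int:
    """Coefficients used by a factor entered as a fixed effect
    (treatment coding, model already has an intercept)."""
    return n_levels - 1


def random_intercept_params(n_levels: int) -> int:
    """A random intercept estimates one variance, whatever the number of levels."""
    return 1


# Hypothetical design sizes, chosen only for illustration.
levels = {"location": 10, "date": 5}

fixed_total = sum(fixed_effect_params(k) for k in levels.values())
random_total = sum(random_intercept_params(k) for k in levels.values())

print(f"as fixed effects:  {fixed_total} extra coefficients")
print(f"as random effects: {random_total} variance parameters")
```

With few observations per location or date, that gap in parameter count is one practical argument for the random-effects specification.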

Rootless17b
1

Yes, when you include the location and date as independent variables (as in your formula), you are separating their effects from X.

However, you do want to be sure that you are not missing variables in your formula that impact the dependent variable. If you are missing variables, the effect of X that you get may not be the pure effect from X alone.

By the way, the random effects that you describe sound a lot like a simpler version of Bootstrap Aggregation, used to reduce variance and overfitting.

Royce Yang
  • Thanks. See updated question - why not have `location` and `date` as fixed effects? How does this differ? – Thomas Jun 29 '20 at 12:11
  • The difference between fixed effects and random effects primarily lies in how they are calculated. Since you are using a partial pooling process (repeated measures), the model coefficients represent random effects. For more details, check out this [thread](https://stats.stackexchange.com/questions/4700/what-is-the-difference-between-fixed-effect-random-effect-and-mixed-effect-mode) – Royce Yang Jun 29 '20 at 21:06
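
Partial pooling, mentioned in the comment above, means that group-level estimates are pulled toward the overall mean, more strongly when a group has few or noisy observations. A minimal Python sketch, assuming a balanced design and variance components that are already known (in practice a mixed-model fitter estimates them from the data); the function name `partial_pool` and all numbers are hypothetical:

```python
def partial_pool(group_means, n_per_group, var_between, var_within):
    """Shrink each group mean toward the grand mean.

    Assumes a balanced design and known variance components, in which case
    the random-intercept prediction for a group is
        grand + w * (group_mean - grand)
    with w = var_between / (var_between + var_within / n_per_group).
    """
    grand = sum(group_means) / len(group_means)
    w = var_between / (var_between + var_within / n_per_group)
    return [grand + w * (m - grand) for m in group_means]


# Hypothetical location means, for illustration only.
means = [8.0, 10.0, 12.0]

# Two observations per location and noisy data: strong shrinkage.
few = partial_pool(means, n_per_group=2, var_between=1.0, var_within=4.0)
# Many observations per location: estimates stay close to the raw means.
many = partial_pool(means, n_per_group=50, var_between=1.0, var_within=4.0)

print("n = 2 per group: ", few)
print("n = 50 per group:", many)
```

A fixed-effects fit corresponds to no shrinkage at all (w = 1), which is one way to see how the two treatments of `location` differ.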
1

Yes, the proposed mixed model will separate the sources of variability, keeping the fixed effect (X) apart from the random effects of pair nested within location and of date.

Essentially, introducing random effects identifies sources of variability; by estimating them you can separate them from the error term, which is the one used for hypothesis testing on the fixed effects (potentially, the object of study). This is accomplished by modelling them with an expected value of zero but non-zero variance, possibly with some structure. Responses that share the same level of a random effect will be correlated.

For just one fixed effect and one random effect, $Y_{ij} = \mu + \alpha_i + \beta_j + \varepsilon_{ij}$, or in matrix form $Y = X\alpha + Z\beta + \varepsilon$ (with $\mu$ absorbed into $X\alpha$ as an intercept column), where $X$ is the design matrix of the fixed effects and $Z$ that of the random effects, assuming $\varepsilon_{ij} \sim N(0,\sigma_\varepsilon^2)$ and $\beta_j \sim N(0,\sigma_\beta^2)$, independent of each other. Then the expected value of $Y_{ij}$ is $E(Y_{ij}) = \mu + \alpha_i$, and you can run hypothesis tests on the estimators of these fixed effects; the variance is $Var(Y_{ij}) = \sigma_\varepsilon^2 + \sigma_\beta^2$, which splits the variability into two sources: the first due to noise and the second due to the random effect.
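
The variance decomposition can be checked with a small simulation. This is only a sketch in plain Python (standard library, to keep it dependency-free rather than using R); the values chosen for $\sigma_\beta$, $\sigma_\varepsilon$ and $\mu + \alpha_i$ are hypothetical. It draws two observations per level of the random effect and compares the empirical variance and within-level covariance with $\sigma_\varepsilon^2 + \sigma_\beta^2$ and $\sigma_\beta^2$.

```python
import random
import statistics

random.seed(1)

sigma_beta = 2.0      # SD of the random effect (hypothetical value)
sigma_eps = 1.0       # SD of the residual noise (hypothetical value)
mu_plus_alpha = 5.0   # fixed part mu + alpha_i, held at a single level
n_groups = 20000      # number of levels j of the random effect

first, second = [], []
for _ in range(n_groups):
    beta_j = random.gauss(0.0, sigma_beta)  # shared by both observations in level j
    first.append(mu_plus_alpha + beta_j + random.gauss(0.0, sigma_eps))
    second.append(mu_plus_alpha + beta_j + random.gauss(0.0, sigma_eps))

# Overall variance of Y, pooling both observations from every level.
var_y = statistics.pvariance(first + second)

# Covariance between the two observations that share a level.
mean_1, mean_2 = statistics.fmean(first), statistics.fmean(second)
cov_12 = sum((a - mean_1) * (b - mean_2) for a, b in zip(first, second)) / n_groups

print("Var(Y):          ", round(var_y, 3), " theory:", sigma_eps**2 + sigma_beta**2)
print("Cov within level:", round(cov_12, 3), " theory:", sigma_beta**2)
print("Correlation:     ", round(cov_12 / var_y, 3),
      " theory:", sigma_beta**2 / (sigma_eps**2 + sigma_beta**2))
```

The non-zero covariance is the sense in which responses sharing a random-effect level are correlated; with these values the theoretical correlation is 4 / 5 = 0.8.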

Guile
  • Thanks. Sorry but I can't follow the formula, could you try explaining in words? Also see above, I updated the question - why not have `location` and `date` as fixed effects? – Thomas Jun 29 '20 at 12:12
  • You can definitely put them as random or fixed effects, assuming you have enough observations, but the purpose is different. Putting location as fixed uses one degree of freedom per location, but lets you test whether results in different locations were different. However, all the variance goes to the error term. Putting them as random effects means that your locations are samples of a larger population, and you can't test whether the effects were statistically different; however, the variance used to run tests on the fixed effects is smaller, which enables you to detect smaller effects. – Guile Jun 29 '20 at 15:52