
Problem

There are two excellent CV posts on specifying crossed effects models (post 1, post 2).

The issue I'm trying to wrestle with pertains to part of the answer to post 2, in particular how to nest crossed random effects.

In my study, I have:

  • About 20 individuals per site
  • About 10 sites
  • Within each site, there were about 20 samples

The outcome in the example below is participants' "interest" (the study is about out-of-school programs).

Because there are dependencies by both participant and sample, I think there are two crossed random effects: one for observations associated with each individual, and one for observations associated with each sample. The hard part for me is that these crossed random effects are nested within one of the 10 sites (programs).

The samples were collected at the same time for all of the individuals within a site, but at different times across sites, so sample 1 in site A was not necessarily at the same time as sample 1 in site B in any sense (not the same date/time, nor at the same interval from the "start" of the site's activities). Therefore, to create the variable identifying the time of the sample, I combined the site identifier, the date the sample was collected, and another variable specifying whether the sample was the 1st, 2nd, 3rd, or 4th sample collected on that date. It's a factor.
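For concreteness, an identifier like this can be built by pasting the pieces together (a minimal sketch; the raw column names `sample_date` and `sample_num` are hypothetical):

    library(dplyr)

    # Combine the site identifier, collection date, and within-date sample number
    # into a single factor; `sample_date` and `sample_num` are placeholder names.
    df <- df %>%
      mutate(sample = factor(paste(site, sample_date, sample_num, sep = "-")))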

The data (in R) are as follows:

# A tibble: 2,970 × 4
    interest participant_ID  site           sample
   <dbl+lbl>          <dbl> <chr>         <fctr>
1          2           1001     1 1-2015-07-14-1
2          2           1001     1 1-2015-07-14-2
3          4           1001     1 1-2015-07-15-1
4          3           1001     1 1-2015-07-15-2
5          3           1001     1 1-2015-07-21-1
6          1           1001     1 1-2015-07-21-2
7          3           1001     1 1-2015-07-21-4
8          3           1001     1 1-2015-07-22-1
9          4           1001     1 1-2015-07-22-4
10         3           1001     1 1-2015-07-28-1
# ... with 2,960 more rows

Possible Solution

In post 2, the author of the selected answer wrote:

Because you do not have unique values of the tow variable (i.e. because as you say below tows are specified as 1, 2, 3 at every station), you do need to specify the nesting, as (1|station:tow:day). If you did have the tows specified uniquely, you could use either (1|tow:day) or (1|station:tow:day) (they should give equivalent answers).

Mapping this to my example: because I do have unique values of the sample variable (the analogue of the tow variable), I should not need to specify the nesting. Still, I'm having trouble specifying this model mathematically and, thus, in terms of model syntax (I am using lme4 in R).
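One way to check this against these data is to fit the sample term both ways and compare the variance components; a sketch, assuming `df` is the tibble shown above:

    library(lme4)

    # Because `sample` already embeds the site identifier, (1|sample) and
    # (1|site:sample) index the same groupings and should give equivalent fits.
    m_plain  <- lmer(interest ~ 1 + (1|participant_ID) + (1|sample),      data = df)
    m_nested <- lmer(interest ~ 1 + (1|participant_ID) + (1|site:sample), data = df)

    VarCorr(m_plain)
    VarCorr(m_nested)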

Here are the options as I see them:

  1. Not nesting the crossed random effects within the site because the sample variable includes a site identifier:

    lmer(interest ~ 1 + (1|participant_ID) + (1|sample), data = df)

  2. Creating the sample variable without a site identifier but in a way so that samples within each site were still identified uniquely and nesting the crossed random effects within the site:

    lmer(interest ~ 1 + (1|site/participant_ID) + (1|site/sample), data = df)

Other examples also interact the crossed random effects by adding a term such as (1|participant_ID:sample). A sketch of these candidates follows below.
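For comparison, here is a sketch (under the same assumptions as above) of option 1 with an explicit site intercept added, next to option 2 written in its expanded form; in lme4, (1|site/x) expands to (1|site) + (1|site:x), so the expansion is written out once to avoid duplicating (1|site). A term such as (1|participant_ID:sample) would only be separable from the residual if there were several observations per participant-sample combination.

    library(lme4)

    # Option 1 plus an explicit site intercept
    m1 <- lmer(interest ~ 1 + (1|participant_ID) + (1|sample) + (1|site),
               data = df)

    # Option 2 with the `/` shorthand expanded, writing (1|site) only once
    m2 <- lmer(interest ~ 1 + (1|site) + (1|site:participant_ID) + (1|site:sample),
               data = df)

    # With uniquely coded participant and sample IDs, the two fits should agree
    VarCorr(m1)
    VarCorr(m2)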

Does either of these seem like they would account for dependencies by both participant and sample? Or, are there other options or better ways to model this?

Joshua Rosenberg
    You are over-complicating this. You need a random effect of sample, of participant, and of site. So your option #1 is okay, but you should probably add `(1|site)` into there. Your option #2 is also fine, and you don't need to change anything about how you code your sample variable; `(1|site/sample)` is equivalent to `(1|site)+(1|site:sample)` and if your sample is coded like it is then this is further equivalent to `(1|site)+(1|sample)`. The same goes for the participant term. So option #2 will be equivalent to #1 if you add the site term to #1 as I suggested above. – amoeba Apr 13 '17 at 23:14
  • Whether you use `(1|participant_ID:sample)` is another issue. I would try fitting models with and without this term and see what comes out. – amoeba Apr 13 '17 at 23:15
  • And one more thing: what do you actually expect the interest to depend upon? You are studying how the interest is influenced -- by what? I am asking because you don't have any fixed effects in your models, only random effects. That's strange. – amoeba Apr 13 '17 at 23:23
  • Thank you. Regarding your advice on option #1, I tried adding site, and the random-effect error terms remain the same -- the random effect for each program comes out to `0` (and hence the standard deviation of the random effects comes out to `0`). And as you suggest, the random effects turn out to be identical for option #1 and option #2. This seems to suggest that there's very little (to no) variance explained by `program_ID` that is not already explained by `sample`. Does that sound about right? – Joshua Rosenberg Apr 14 '17 at 17:54
  • The `(1|participant_ID:sample)` interaction is interesting because the `participant_ID` and `sample` random effects remain the same, but this term does seem to explain some variance - its random effects have a standard deviation of about `.20` (relative to `.50` for `participant_ID` and `.20` for `sample`), and the residual variance decreases. This is confusing to me because `(1|participant_ID:sample)` seems to be redundant with the residual - both would seem to explain variance not explained by `participant_ID` or `sample`. – Joshua Rosenberg Apr 14 '17 at 18:00
  • Regarding the lack of fixed effects - I have fixed-effect predictors at each level (i.e., students' self-reported perceived competence at the level of the response (`interest`), features of the activity at the `sample` level, and pre-program intentions to study / pursue a career in STEM fields at the `participant` level). I just didn't include them because I was focused first on the variance components. – Joshua Rosenberg Apr 14 '17 at 18:01
  • Just as a note, there are random effects greater than `0` for `program_ID` with a different outcome (we have outcomes other than interest, and it seems for interest that there's not enough variation at the program level). – Joshua Rosenberg Apr 14 '17 at 19:03
  • Clarification: What is program_ID? You do not mention it in the post. – amoeba Apr 15 '17 at 10:41
  • In any case, `(1|participant_ID:sample)` is not redundant with the residual if you have multiple observations for each participant_ID/sample combination. Is this the case? From your question, it's not exactly clear to me. – amoeba Apr 15 '17 at 10:47
  • Sorry, `program_ID` is the same as `site` (I inadvertently changed the name to `program_ID` in the comment). – Joshua Rosenberg Apr 15 '17 at 12:13
  • There is only one observation for each participant_ID/sample combination. Each of about 20 participants responded to each of about 20 prompts (in one of 10 programs). So in another of the 10 programs, 20 different participants responded to 20 different prompts. – Joshua Rosenberg Apr 15 '17 at 12:15
  • It seems from running the models different ways that a unique sample identifier (for each prompt in each program) with a participant random effect and a site random effect is the most straightforward way to go - and is equivalent, as you suggest, to a non-unique sample identifier nested within a site random effect, plus a participant random effect. – Joshua Rosenberg Apr 15 '17 at 12:17
  • That's correct @Joshua (referring to your last comment). And an interaction term won't make sense if you have 1 observation per interaction combination. So the bottom line IMHO is that you should use `(1|site) + (1|participant_ID) + (1|sample)`, and if the `site` term comes out with zero variance then so be it; but you cannot know in advance, so it makes sense to put it in the model anyway. – amoeba Apr 15 '17 at 15:51

1 Answer


Here's my read on what experiment was done and how I would model it.

Each observation has three identifiers and one value.

You stated that these identifiers are participant, site, and sample and the value is interest.

You explained to me that there are many levels of each of these identifiers, and that you expect measurements sharing a value of any one of them (same participant, same site, or same sample) to be more similar than observations that have none in common.

This sounds like a perfect situation for an LMM with a random intercept for each of those factors. Thus, the model I would fit would be:

lmer(formula = interest ~ (1|participant_ID) + (1|site) + (1|sample),
     data = df)

EDIT: deleted misunderstandings.
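As a usage sketch, assuming `df` is the tibble shown in the question, the estimated variance components of this model can then be inspected with `VarCorr()` or `summary()`:

    library(lme4)

    fit <- lmer(formula = interest ~ (1|participant_ID) + (1|site) + (1|sample),
                data = df)

    # Standard deviations of the three random intercepts and the residual
    VarCorr(fit)

    # Full summary, including the intercept-only fixed-effect estimate
    summary(fit)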

rcorty
  • Thanks, there is a calendar-date-varying component of the observation (I wrote: `The samples were at the same time for all of the individuals within the site, but at different times at different sites, so that sample 1 in site A was not necessarily at the same time in any sense (not the same date / time nor at the same interval from the "start" of the site's activities)`). – Joshua Rosenberg Apr 14 '17 at 16:03
  • When I wrote "calendar-date-varying", what I meant was something like outdoor temperature, where what day of the year it is really matters (hotter in the summer, colder in the winter). I'm not hearing anything in your post to suggest that what day of the year it is matters to the measurement. That's why I dropped the date of the measurement in my analysis. – rcorty Apr 14 '17 at 16:05
  • Thank you, I am sorry that this was not clear; there is something important about the day of the year - the samples were collected during different activities in the after-school programs, and these activities exhibit a lot of variation (i.e., kids' interest is higher on certain calendar dates (and times) when doing hands-on group activities versus working individually). So, the activity at the moment / time point is expected to be a source of variation. – Joshua Rosenberg Apr 14 '17 at 17:50
  • Okay, I would make another random effect for that, then. See edit above. – rcorty Apr 14 '17 at 18:28
  • Thank you, I am definitely considering `season` (or activity) as a fixed effect - I want fixed effects for these different activities. In other words, I'd consider this as a predictor of the random intercept for the sample, not as a random effect. – Joshua Rosenberg Apr 14 '17 at 18:44
  • "predictor of the random intercept" is not the same as "fixed effect". How many different values are there of `season`? – rcorty Apr 14 '17 at 19:03
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/57129/discussion-between-rcorty-and-joshua-rosenberg). – rcorty Apr 14 '17 at 19:08
  • Why would `sample_num` be a fixed effect? (CC to @JoshuaRosenberg) – amoeba Apr 15 '17 at 10:43
  • Discussed in chat; I understand that three random effects, for participant, site, and sample, may be on the right track, i.e. `lmer(formula = interest ~ (1|participant_ID) + (1|site) + (1|sample), data = df)` – Joshua Rosenberg Apr 15 '17 at 14:17
  • @JoshuaRosenberg I agree that that's the best option, but I would suggest that rcorty edits their answer to make this explicit. At the moment I find the answer quite confusing. – amoeba Apr 15 '17 at 15:52