6

I would like to model data where the outcomes are produced jointly by a pair of indistinguishable subjects. As an example[*], consider the length of two-participant conversations. These data have the following properties:

  • Outcomes can only be measured at the pair level: for example, we only know the total length of the conversation $Y_{i,j}$, rather than how much $i$ and $j$ individually spoke during their interaction.
  • The outcomes are undirected/symmetric: $Y_{i,j} \triangleq Y_{j,i}$.
  • I have data, in equal amounts, for all possible pairs $\{i,j\}$ except when $i=j$, where the outcome can't be measured.
  • The pairs are sets: they're indistinguishable/unordered. This is somewhat different from many situations, where each member of the pair has a fixed role (e.g., buyer/seller) that could be modeled separately.
  • I'm predominantly interested in predictors that are also defined on the pairs (e.g., are $i$ and $j$ from the same organization? How far apart are they standing?), but would ideally like to include some subject-level and fixed effects.

When fitting a mixed-effects model, it seems wrong to treat the $Y_{i,j}$ as independent, because there could be subject-specific effects: if Uncle Bob tends to ramble, conversations that include him may tend to be longer than average, regardless of the other participant. If the subjects were distinguishable, I would be tempted to account for this by adding a random effect for each subject: `y_ij ~ X + (1 | i) + (1 | j)`. However, the $i$ and $j$ here are indistinguishable. Providing lmer() with a pair of subject-id vectors doesn't seem like it would account for the dependence between observations $\langle y_{i,j}, i, j\rangle$ and $\langle y_{j,k}, j, k\rangle$ due to the possible contribution of $j$ to both.

How can I account for subject-specific effects when the subjects are indistinguishable?

I had considered adding a large set of dummy variables (pair contains subject 1, pair contains subject 2, ...). However, this would add ~100 fixed effects that I don't actually care about, which doesn't seem great.
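For concreteness, that indicator design is easy to build (a toy sketch with a hypothetical N of 5; the real data would have ~100 subjects):

```r
# Toy sketch of the "pair contains subject k" indicator matrix.
# Row r is the r-th unordered pair; column k is 1 if subject k is a member.
N <- 5                          # toy size; the real data has ~100 subjects
ids <- t(combn(N, 2))           # all unordered pairs, one row each
Z <- matrix(0, nrow(ids), N)
Z[cbind(seq_len(nrow(ids)), ids[, 1])] <- 1
Z[cbind(seq_len(nrow(ids)), ids[, 2])] <- 1
all(rowSums(Z) == 2)            # each pair contains exactly two subjects
```

Note that this same matrix would serve as the random-effects design matrix if these columns were treated as one grouping factor rather than as ~100 fixed effects.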

I also had a quick look at the dyadic literature, but nothing seems like quite the right fit. The actor-partner models seem to require subject-by-subject responses, ditto for social relations models, which always seem to involve directed ratings ($i$ likes $j$ is obviously different from $j$ liking $i$).

Would a better way to address this be changing the model's correlation structure, so that the covariance of observations from pair $\{i,j\}$ is some combination of $\sigma_i$ and $\sigma_j$? This is a little beyond my experience with mixed models, so I would like to know if this is theoretically sound and if/how it's practically doable.
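To sketch what I mean (notation mine): with a single subject effect $b_k \sim N(0, \sigma_b^2)$ and $y_{\{i,j\}} = \mu + b_i + b_j + \varepsilon_{\{i,j\}}$, two pairs sharing one subject would have covariance $\sigma_b^2$ and disjoint pairs 0. A toy check of the implied marginal covariance, with hypothetical variance components:

```r
# Implied marginal covariance when pair {i, j} gets the shared effect b_i + b_j:
# V = sb2 * Z %*% t(Z) + s2 * I, where Z is the pair-by-subject membership matrix.
N <- 4
ids <- t(combn(N, 2))              # pairs: {1,2} {1,3} {1,4} {2,3} {2,4} {3,4}
Z <- matrix(0, nrow(ids), N)
Z[cbind(seq_len(nrow(ids)), ids[, 1])] <- 1
Z[cbind(seq_len(nrow(ids)), ids[, 2])] <- 1
sb2 <- 2; s2 <- 0.25               # hypothetical variance components
V <- sb2 * tcrossprod(Z) + s2 * diag(nrow(ids))
V[1, 4]                            # {1,2} vs {2,3} share subject 2 -> sb2 = 2
V[1, 6]                            # {1,2} vs {3,4} are disjoint    -> 0
```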


[*] The actual data is not conversation length--and is not even from humans--so don't worry too much about sociolinguistic-specific confounds. It seems like a very similar situation though, and the conversation example is much clearer.
Matt Krause

2 Answers

7

Perhaps something is not clear to me, but the following simulation of your setting seems to work with the random-effects structure you suggested, i.e.,

set.seed(2019)
N <- 50 # number of subjects

# id indicators for pairs
ids <- t(combn(N, 2))
id_i <- ids[, 1]
id_j <- ids[, 2]

# random effects
b_i <- rnorm(N, sd = 2)
b_j <- rnorm(N, sd = 4)

# simulate normal outcome data from the mixed model 
# that has two random effects for i and j
y <- 10 + b_i[id_i] + b_j[id_j] + rnorm(nrow(ids), sd = 0.5)

DF <- data.frame(y = y, id_i = id_i, id_j = id_j)

library("lme4")
#> Loading required package: Matrix

lmer(y ~ 1 + (1 | id_i) + (1 | id_j), data = DF)
#> Linear mixed model fit by REML ['lmerMod']
#> Formula: y ~ 1 + (1 | id_i) + (1 | id_j)
#>    Data: DF
#> REML criterion at convergence: 2403.071
#> Random effects:
#>  Groups   Name        Std.Dev.
#>  id_i     (Intercept) 2.1070  
#>  id_j     (Intercept) 3.0956  
#>  Residual             0.5053  
#> Number of obs: 1225, groups:  id_i, 49; id_j, 49
#> Fixed Effects:
#> (Intercept)  
#>       9.562

However, I think you will not be able to include covariates at the subject level because you only have data at the pair level.


EDIT: Based on the comments, the symmetric nature of the data has become clearer. As far as I know, the current implementation of lmer() does not allow for such data. The code below simulates and fits a model for such data using Stan.

set.seed(2019)
N <- 50 # number of subjects

# id indicators for pairs
ids <- t(combn(N, 2))  # each unordered pair {i, j} appears once (symmetric data)
id_i <- ids[, 1]
id_j <- ids[, 2]

# random effects
b <- rnorm(N, sd = 2)

# simulate normal outcome data from the mixed model 
# that has one random effect but accounts for the pairs i and j
y <- 10 + b[id_i] + b[id_j] + rnorm(nrow(ids), sd = 0.5)


library("rstan")

Data <- list(N = nrow(ids), n = N,
             id_i = id_i, id_j = id_j, y = y)
model <- "
data {
    int n;
    int N;
    int id_i[N];
    int id_j[N];
    vector[N] y;
}

parameters {
    vector[n] b;
    real beta;
    real<lower = 0> sigma_b;
    real<lower = 0> sigma;
}

transformed parameters {
    vector[N] eta;
    for (k in 1:N) {
        eta[k] = beta + b[id_i[k]] + b[id_j[k]];
    }
}

model {
    sigma_b ~ student_t(3, 0, 10);
    for (i in 1:n) {
        b[i] ~ normal(0.0, sigma_b);
    }
    beta ~ normal(0.0, 10);
    sigma ~ student_t(3, 0, 10);
    y ~ normal(eta, sigma);
}
"

fit <- stan(model_code = model, data = Data, pars = c("beta", "sigma_b", "sigma"))
summary(fit)
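(As an aside, not part of the original answer: the same one-effect-per-subject structure can be written more compactly with the brms package's multi-membership terms, which generate Stan code along the lines of the hand-written model above. A hedged sketch; note that by default `mm()` weights the two memberships equally so that they sum to one, i.e. it averages rather than sums the two intercepts, which rescales the random-effect SD but yields the same correlation structure — see `?mm` for the `weights` and `scale` arguments.)

```r
# Sketch: shared subject effect via brms multi-membership syntax.
# (1 | mm(id_i, id_j)) pools id_i and id_j into one grouping factor,
# so each subject gets a single random intercept wherever they appear.
library("brms")

DF2 <- data.frame(y = y, id_i = id_i, id_j = id_j)
fit_brm <- brm(y ~ 1 + (1 | mm(id_i, id_j)), data = DF2)
summary(fit_brm)
```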
Dimitris Rizopoulos
    This will definitely run, and I'm reasonably convinced it's the right solution for distinguishable subjects (e.g., $i$ indexes over buyers, $j$ over sellers). I'm worried that it's not quite right for indistinguishable subjects though: the observations where $i=1,j=2$, for example, are not independent of $i=2,j=3$, since both involve subject 2. In other words, your model has two effects of subjects (id_i, id_j), but I'm shooting for something where there's just one (I think). – Matt Krause Jun 03 '19 at 13:31
    It's certainly possible that I'm overthinking this and your solution is perfect, but ideally I'd like a slightly more rigorous argument than "well, it runs". (Not that I don't appreciate the effort!) – Matt Krause Jun 03 '19 at 13:33
  • The model above accounts for exactly that using the cross-random effects. Namely that pairs that have the same `id_i` are correlated because they share the same random effect $b_i$, and likewise pairs that have the same `id_j` are correlated because they share the same random effect $b_j$. – Dimitris Rizopoulos Jun 03 '19 at 13:36
  • Right, but what about a pair of examples where one's `id_i` equals the other's `id_j`? In other words, $\langle y_{1,2}, 1, 2\rangle$ is correlated with $\langle y_{2,3}, 2, 3\rangle$, even though neither the first ids nor the second match. – Matt Krause Jun 03 '19 at 13:41
    Thanks for the extra explanation. The nature of the data is clearer to me now. Indeed, I think `lmer()` cannot fit this model. I proposed an alternative solution using Stan. – Dimitris Rizopoulos Jun 04 '19 at 12:14
2

For people looking for a little more theory, Peter Hoff (Duke, formerly University of Washington) has worked on this.

His 2005 JASA paper "Bilinear Mixed-Effects Models for Dyadic Data" (pdf and code) describes an MCMC approach that is very similar to @Dimitris Rizopoulos' answer above.

That rough code was developed into a more polished R package called amen (Additive and Multiplicative Effects Models for Networks and Relational Data), and this 2017 preprint/tech report is effectively a tutorial with a few worked examples. He calls this the "Social Relations Regression Model."
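For symmetric outcomes like these, the entry point is `amen::ame()` with `symmetric = TRUE` (a rough, untested sketch; the data objects `Y` and `Xdyad` below are hypothetical placeholders following the package's documented conventions):

```r
# Sketch (untested): social relations regression via the amen package.
library("amen")

# Y:     N x N symmetric outcome matrix, NA on the diagonal
# Xdyad: N x N x p array of pair-level predictors
fit <- ame(Y, Xdyad = Xdyad, symmetric = TRUE)
summary(fit)
```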

Matt Krause