Singular random-effect covariance matrices
Obtaining a random-effect correlation estimate of +1 or -1 means that the optimization algorithm hit "a boundary": correlations cannot be higher than +1 or lower than -1. Even if there are no explicit convergence errors or warnings, this potentially indicates problems with convergence, because we do not expect true correlations to lie on the boundary. As you said, it usually means that there are not enough data to estimate all the parameters reliably. Matuschek et al. 2017 argue that statistical power can be compromised in this situation.
Another way to hit a boundary is to get a variance estimate of 0: Why do I get zero variance of a random effect in my mixed model, despite some variation in the data?
Both situations can be seen as obtaining a degenerate covariance matrix of random effects (in your example output the covariance matrix is $4\times 4$); a zero variance or a perfect correlation means that the covariance matrix is not full rank and [at least] one of its eigenvalues is zero. This observation immediately suggests that there are other, more complex ways to get a degenerate covariance matrix: one can have a $4\times 4$ covariance matrix without any zero variances or perfect correlations that is nevertheless rank-deficient (singular). Bates et al. 2015 Parsimonious Mixed Models (unpublished preprint) recommend using principal component analysis (PCA) to check whether the obtained covariance matrix is singular. If it is, they suggest treating this situation the same way as the explicit boundary cases above.
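In lme4 this check can be run directly on a fitted model. A minimal sketch, assuming a reasonably recent lme4 version and hypothetical data `d` with response `Y`:

```r
library(lme4)

## fit the maximal model from the example (Y and d are hypothetical names)
m <- lmer(Y ~ X * Cond + (X * Cond | subj), data = d)

## isSingular() flags boundary fits: zero variances, +/-1 correlations,
## or a more general rank deficiency of the random-effect covariance matrix
isSingular(m)

## rePCA() runs the PCA of the random-effect covariance structure described by
## Bates et al. 2015; components with (near-)zero standard deviation indicate singularity
summary(rePCA(m))
```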
So what to do?
If there are not enough data to estimate all the parameters of a model reliably, then we should consider simplifying the model. Taking your example model, `X*Cond + (X*Cond|subj)`, there are various possible ways to simplify it (the corresponding `lmer` calls are sketched after the list):
1. Remove one of the random effects, usually the highest-order interaction: `X*Cond + (X+Cond|subj)`
2. Get rid of all the correlation parameters: `X*Cond + (X*Cond||subj)`
   Update: as @Henrik notes, the `||` syntax will only remove correlations if all variables to the left of it are numerical. If categorical variables (such as `Cond`) are involved, one should rather use his convenient `afex` package (or cumbersome manual workarounds). See his answer for more details.
3. Get rid of some of the correlation parameters by breaking the random-effects term into several, e.g.: `X*Cond + (X+Cond|subj) + (0+X:Cond|subj)`
4. Constrain the covariance matrix in some specific way, e.g. by setting one specific correlation (the one that hit the boundary) to zero, as you suggest. There is no built-in way in `lme4` to achieve this. See @BenBolker's answer on SO for a demonstration of how to do it via some smart hacking.
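For concreteness, here is a minimal sketch of what options 1-3 look like as `lmer` calls, again assuming the hypothetical data `d` and response `Y` (for option 2, remember that `||` only drops the correlations when the predictors are numeric):

```r
library(lme4)

m1 <- lmer(Y ~ X * Cond + (X + Cond | subj), data = d)                        # option 1
m2 <- lmer(Y ~ X * Cond + (X * Cond || subj), data = d)                       # option 2
m3 <- lmer(Y ~ X * Cond + (X + Cond | subj) + (0 + X:Cond | subj), data = d)  # option 3

VarCorr(m1)  # inspect the estimated random-effect SDs and correlations of each fit
```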
Contrary to what you said, I don't think Matuschek et al. 2017 specifically recommend #4. The gist of Matuschek et al. 2017 and Bates et al. 2015 seems to be that one starts with the maximal model à la Barr et al. 2013 and then decreases the complexity until the covariance matrix is full rank. (Moreover, they would often recommend reducing the complexity even further, in order to increase the power.) Update: in contrast, Barr et al. recommend reducing complexity ONLY if the model did not converge; they are willing to tolerate singular covariance matrices. See @Henrik's answer.
If one agrees with Bates/Matuschek, then I think it is fine to try out different ways of decreasing the complexity in order to find the one that does the job while doing "the least damage". Looking at my list above, the original covariance matrix has 10 parameters; #1 has 6 parameters, #2 has 4 parameters, #3 has 7 parameters. Which model will get rid of the perfect correlations is impossible to say without fitting them.
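One way to verify these parameter counts (which assume numeric `X` and `Cond`) is to look at the length of the `theta` vector, i.e. the Cholesky parameters of the random-effect covariance structure, for each fitted model; a sketch, assuming the hypothetical fits `m`, `m1`, `m2`, `m3` from the snippets above:

```r
## theta holds the covariance parameters: n*(n+1)/2 per unstructured block of size n
length(getME(m,  "theta"))   # 10 for (X*Cond | subj)
length(getME(m1, "theta"))   #  6 for (X+Cond | subj)
length(getME(m2, "theta"))   #  4 for (X*Cond || subj), with numeric predictors
length(getME(m3, "theta"))   #  7 for (X+Cond | subj) + (0+X:Cond | subj)
```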
But what if you are interested in this parameter?
The above discussion treats the random-effect covariance matrix as a nuisance parameter. You raise an interesting question of what to do if you are specifically interested in a correlation parameter that you have to "give up" in order to get a meaningful full-rank solution.
Note that fixing a correlation parameter at zero will not necessarily yield BLUPs (`ranef`) that are uncorrelated; in fact, they might not even be affected much at all (see @Placidia's answer for a demonstration). So one option would be to look at the correlations of the BLUPs and report those.
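A minimal sketch of that, assuming the hypothetical zero-correlation fit `m2` from above:

```r
## extract the BLUPs (conditional modes) for the subject-level random effects
blups <- ranef(m2)$subj   # one row per subject, one column per random-effect term
round(cor(blups), 2)      # empirical correlations among the BLUPs,
                          # even though the model constrained the correlations to 0
```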
Another, perhaps less attractive, option would be to treat `subject` as a fixed effect, `Y ~ X*cond*subj`, get the estimates for each subject, and compute the correlations between them. This is equivalent to running a separate `Y ~ X*cond` regression for each subject and getting the correlation estimates from those.
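A minimal sketch of this per-subject approach, using `lmList()` from lme4 on the hypothetical data `d` (any equivalent loop of `lm()` fits would do):

```r
library(lme4)

## fit a separate Y ~ X*cond regression for each subject
fits <- lmList(Y ~ X * cond | subj, data = d)

est <- coef(fits)     # data frame: one row per subject, one column per coefficient
round(cor(est), 2)    # correlations between the per-subject coefficient estimates
```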
See also the section on singular models in Ben Bolker's mixed model FAQ:
> It is very common for overfitted mixed models to result in singular fits. Technically, singularity means that some of the $\theta$ (variance-covariance Cholesky decomposition) parameters corresponding to diagonal elements of the Cholesky factor are exactly zero, which is the edge of the feasible space, or equivalently that the variance-covariance matrix has some zero eigenvalues (i.e. is positive semidefinite rather than positive definite), or (almost equivalently) that some of the variances are estimated as zero or some of the correlations are estimated as +/-1.
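These checks can be done by hand on a fitted model; a sketch, again assuming the hypothetical fit `m` and grouping factor `subj` from above:

```r
## diagonal Cholesky elements are the theta entries with lower bound 0;
## (near-)zero values there mean the fit is on the boundary
theta <- getME(m, "theta")
any(theta[getME(m, "lower") == 0] < 1e-4)

## equivalently, look for (near-)zero eigenvalues of the estimated covariance matrix
eigen(VarCorr(m)$subj, only.values = TRUE)$values
```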