Why are emmeans package means different than regular means?

Question

I am analyzing a dataset with missing data, fitting mixed models with the lme4 package and computing fitted means from them with the emmeans package.

I have a feeling it relates to the missing data but why are the means that emmeans displays different than calculating the mean of a group directly and removing the NAs?

If the dataset is balanced with all points present, would these be identical?

This is important because doing usual pairwise t-tests uses the regular means/SDs while the emmeans utilizes the mixed model and I get different results.

Read the basics vignette, where this is explained: `vignette("basics", "emmeans")` – Russ Lenth, Sep 12 '19 at 23:00

Dimitris Rizopoulos · Answer 1 · 2019-09-15T04:33:51.067

You are right that the missing data explain this difference. In particular, when the data are missing at random (MAR), the observed data are not a representative sample of your target population. In this case, the simple sample means will be biased and should not be trusted.

The mixed model, on the contrary, will give you correct estimates and inferences in a missing at random setting, provided that your model is correctly/flexibly specified.

Hence, you are better off trusting what emmeans reports based on your fitted mixed model.
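To make the argument concrete, here is a minimal simulation sketch in Python rather than lme4/emmeans. Everything in it is an assumption chosen for illustration: two measurements per subject (`y1` always observed, `y2` sometimes missing), a dropout rule that depends only on the observed `y1`, and a simple least-squares regression of `y2` on `y1` standing in for a correctly specified model. It is not the mixed model from the question.

```python
import random
from statistics import mean

rng = random.Random(2019)  # fixed seed so the sketch is reproducible

n = 20000
y1 = [rng.gauss(0.0, 1.0) for _ in range(n)]        # baseline, always observed
y2 = [0.8 * a + rng.gauss(0.0, 0.6) for a in y1]    # follow-up, target outcome

# MAR dropout (assumed rule): y2 is always recorded when y1 >= 0,
# and recorded only 20% of the time when y1 < 0.
observed = [a >= 0.0 or rng.random() < 0.2 for a in y1]

# Mean of y2 we would report if nothing were missing.
full_mean = mean(y2)

# Naive sample mean: drop the subjects with missing y2.
completers = [(a, b) for a, b, seen in zip(y1, y2, observed) if seen]
naive_mean = mean([b for _, b in completers])

# Least-squares fit of y2 on y1 using the completers only.
x_bar = mean([a for a, _ in completers])
sxy = sum((a - x_bar) * (b - naive_mean) for a, b in completers)
sxx = sum((a - x_bar) ** 2 for a, _ in completers)
slope = sxy / sxx
intercept = naive_mean - slope * x_bar

# Model-based mean: average the predictions over every subject's y1,
# including the subjects whose y2 was never recorded.
model_mean = mean([intercept + slope * a for a in y1])

print("full-data mean  :", round(full_mean, 3))
print("naive mean      :", round(naive_mean, 3))
print("model-based mean:", round(model_mean, 3))
```

Under these assumptions the naive mean is pulled upward, because subjects with low `y1` (and hence low `y2`) are under-represented among the completers, while the model-based mean stays close to the full-data mean.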

edited Sep 15 '19 at 04:33

answered Sep 13 '19 at 07:15


Thanks, this helps. I find it interesting that you say that if the data are MAR then the simple means would be biased. Isn't this equivalent to list-wise deletion, which is OK as long as the data are MCAR (missing completely at random) or MAR? Intuitively, it seems that if data are missing at random, then it's like not having collected them to begin with. – Vattaka Sep 13 '19 at 08:54
When the data are missing at random it means that missingness depends on the observed outcomes. Hence, the sample you end up with is not a representative sample of your target population. – Dimitris Rizopoulos Sep 13 '19 at 10:00
I thought that was the definition of MNAR (Missing Not at Random): https://en.wikipedia.org/wiki/Missing_data#Missing_not_at_random. In this case, I ran a logistic regression on missingness, and it is completely explained by the observed factors (there was almost complete separation). So I think that means it's MAR, not MNAR. – Vattaka Sep 13 '19 at 16:23
You have missing not at random when the missingness process depends on **unobserved** outcomes. – Dimitris Rizopoulos Sep 13 '19 at 17:07
@rvl it is not clear what *"here"* in your comment refers to. – Dimitris Rizopoulos Sep 14 '19 at 01:51
+1 for your answer @Dimitris Rizopoulos. Is it always better to use emmeans instead of sample means? In my case, I am trying to describe the data as they are (the descriptive part of my paper). Then I have an inferential section where I report the models. The estimated marginal means from the models are quite different from the sample means due to the inclusion of random effects. – Cmagelssen Sep 30 '21 at 10:10

Russ Lenth · Accepted Answer · 2019-09-15T20:17:24.343

The fundamental difference between estimated marginal means (EMMs) and ordinary marginal means of data (OMMs) is that OMMs summarize the data, while EMMs summarize a model. Thus, if you fit a different model to the data, the EMMs are potentially different. EMMs are not just one thing.

To be a bit more precise, EMMs involve three entities:

A model for the data
A grid consisting of all combinations of reference values for the predictors. Typically, the reference values are the levels of each factor and, for numeric predictors, the means of those predictors.
A weighting scheme (usually equal weights)

Given these, EMMs are obtained by first using the given model to obtain predictions at each combination of reference values; and then obtaining marginal averages of those predictions according to the weighting scheme.

If equal weights are used, the model is fitted using lm() (or equivalent), all the predictors are factors, the design is balanced, and the model contains all interactions among these factors, then the predicted values are the cell means of the data, and the EMMs are the same as the OMMs. However, any deviation from these conditions (e.g., unequal weights, not using least squares, not having balanced data, having some numerical predictors, not having all interactions in the model) may lead to the EMMs being different from the OMMs.
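The recipe above can be sketched with plain arithmetic. The following Python snippet is only an illustration, not the emmeans API: the data are made up, the factor names `treatment` and `sex` are hypothetical, and the "model" is assumed to be the all-factors, all-interactions least-squares fit, whose predictions on the grid are just the cell means.

```python
from statistics import mean

# Hypothetical unbalanced 2x2 layout: responses keyed by (treatment, sex).
unbalanced = {
    ("A", "F"): [10.0, 12.0],
    ("A", "M"): [20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
    ("B", "F"): [14.0, 16.0, 18.0, 20.0, 22.0, 24.0],
    ("B", "M"): [30.0, 34.0],
}

# Balanced version: keep the first two observations in every cell.
balanced = {cell: ys[:2] for cell, ys in unbalanced.items()}


def omm(data, treatment):
    """Ordinary marginal mean: pool all raw observations for the treatment."""
    pooled = [y for (trt, _), ys in data.items() if trt == treatment for y in ys]
    return mean(pooled)


def emm_equal(data, treatment):
    """Equal-weight EMM: average the grid predictions (here, the cell means)."""
    predictions = [mean(ys) for (trt, _), ys in data.items() if trt == treatment]
    return mean(predictions)


for name, data in (("unbalanced", unbalanced), ("balanced", balanced)):
    for trt in ("A", "B"):
        print(name, trt, "OMM =", omm(data, trt), "EMM =", emm_equal(data, trt))
```

With these made-up numbers, treatment A has an OMM of 21.5 but an equal-weight EMM of 18.0 in the unbalanced layout, because the OMM lets the larger cell dominate while the EMM gives both cells the same weight. In the balanced layout the two coincide (16.0 for A, 23.5 for B).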

Some further notes specific to other answers or comments in this thread:

Regarding empty cells, a model with all interactions will usually be unable to estimate all the grid values, causing some or all of the EMMs to be non-estimable (but see an exception below). Fitting a different model where one or more of the interactions are excluded may lead to the grid values being estimable, and hence the EMMs being estimable.
The question of whether observations are missing at random, not at random, completely at random, etc. is a modeling issue (or, per some comments, whether you trust the model you used). If the model is [in]appropriate or [un]trustworthy, the resulting EMMs will also be [in]appropriate or [un]trustworthy. Some missingness assumptions allow for multiple imputation techniques, and those may (or may not) allow for grid means to be estimable, and will impact the EMMs accordingly.
Alternative weighting schemes (such as weighting proportionally to marginal frequencies) obviously affect the EMMs as well. A weighting scheme that gives zero weight to any grid combination that is non-estimable will provide estimable EMMs where otherwise they would be non-estimable. In particular, in an (all-factors, all-interactions, least-squares) situation, weighting according to cell frequencies will yield EMMs equal to OMMs.
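The effect of the weighting scheme, including the empty-cell case, can be sketched the same way. This Python snippet is an illustration under stated assumptions, not emmeans code: the data and factor names are hypothetical, the cell means stand in for the predictions of an all-factors, all-interactions least-squares model, and `weighted_emm` is a helper invented here that returns `None` to signal a non-estimable result.

```python
from statistics import mean

# Hypothetical 2x2 layout keyed by (treatment, sex); the ("B", "M") cell is empty.
cells = {
    ("A", "F"): [10.0, 12.0],
    ("A", "M"): [20.0, 22.0, 24.0, 26.0, 28.0, 30.0],
    ("B", "F"): [14.0, 16.0, 18.0, 20.0, 22.0, 24.0],
    ("B", "M"): [],
}


def omm(data, treatment):
    """Ordinary marginal mean: pool all raw observations for the treatment."""
    return mean([y for (trt, _), ys in data.items() if trt == treatment for y in ys])


def weighted_emm(data, treatment, weights):
    """Weighted average of cell predictions; None means non-estimable."""
    numerator = 0.0
    denominator = 0.0
    for (trt, sex), ys in data.items():
        if trt != treatment:
            continue
        w = weights[(trt, sex)]
        if w == 0:
            continue      # zero weight: this grid point is never needed
        if not ys:
            return None   # positive weight on a cell with no prediction
        numerator += w * mean(ys)
        denominator += w
    return numerator / denominator if denominator > 0 else None


equal_weights = {cell: 1 for cell in cells}
cell_frequencies = {cell: len(ys) for cell, ys in cells.items()}

for trt in ("A", "B"):
    print(trt,
          "equal:", weighted_emm(cells, trt, equal_weights),
          "cell-frequency:", weighted_emm(cells, trt, cell_frequencies),
          "OMM:", omm(cells, trt))
```

With equal weights, the EMM for treatment B is non-estimable because the empty cell carries positive weight. With cell-frequency weights, the empty cell gets weight zero, and the EMMs equal the OMMs (21.5 for A, 19.0 for B).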

I respectfully disagree with some points in your post. The missing data mechanism is *not* a modeling issue. Namely, it specifies the form of the conditional distribution $[R_i \mid Y_i^o, Y_i^m]$, where $R_i$ is the missingness indicator and $Y_i^o$ and $Y_i^m$ are the observed and missing parts of the longitudinal outcome vector for subject $i$. When this distribution depends only on $Y_i^o$ we have Missing At Random, and when it depends on $Y_i^m$ we have Missing Not At Random. This is *irrespective* of the model we have chosen for the outcome. – Dimitris Rizopoulos, Sep 15 '19 at 18:41
When we have Missing At Random, the observed data are not a representative sample of the target population. This is why, for example, the Generalized Estimating Equations (GEE) approach, which is based on sample moments, gives biased results in this case, and you need to use inverse-probability-weighted GEEs. A likelihood-based approach, however, will give you valid results in this case, provided that the model is correctly/flexibly specified. This is the so-called ignorability property. – Dimitris Rizopoulos, Sep 15 '19 at 18:41
Specifying the form of a conditional distribution is an aspect of modeling. – Russ Lenth, Sep 15 '19 at 18:44
You do not need to specify any model at all. You only need to say whether it depends in any way imaginable on $Y_i^o$ and/or $Y_i^m$. If it depends only on $Y_i^o$, it is MAR; if it depends on $Y_i^m$, it is MNAR. – Dimitris Rizopoulos, Sep 15 '19 at 18:46
But the assumptions you make regarding this affect how you use the data to make predictions and hence the model underlying the EMMs. – Russ Lenth, Sep 15 '19 at 18:52
The assumptions you make for the missing data mechanism affect whether the model that you are going to use to get the predictions will give valid predictions or not. – Dimitris Rizopoulos, Sep 15 '19 at 18:55
I edited the answer in a way that I hope addresses this fine point. – Russ Lenth, Sep 15 '19 at 20:18
