
I know the formula for a linear mixed model (LMM) is often (always?) written with a tilde, rather than an equals sign, between the LHS and RHS. For example, one would write

outcome ~ 1 + var1 + (1|var2)

to denote that the outcome variable is modeled by an intercept plus var1 plus a random intercept for each level of var2 (a random intercept model).

I am checking the proofs of a paper with LMM equations in the methods, and the journal for some reason has trouble rendering the tilde symbol properly (it appears as a dash instead). So my question is: is it also acceptable to write the formula with an equals sign instead of a tilde? Is there a difference, or would this be an incorrect way of writing the formula for an LMM?

Richard Hardy
Joel
  • 1
    I would like to revolt against the weird journal not being able to render a tilde ~, but I have to say that it is no problem to use an equal sign. See for instance the [use on Wikipedia](https://en.m.wikipedia.org/wiki/Mixed_model#Definition) which is very standard. The tilde sign is a notation used in R to write regression equations but elsewhere it has a different meaning as mentioned in Alexia's answer. – Sextus Empiricus Nov 23 '21 at 18:13
  • For more about the tilde, see: https://stats.stackexchange.com/questions/531826/reference-who-introduced-the-tilde-notation-to-mean-has-probability-distri – Sextus Empiricus Nov 23 '21 at 18:14
  • 1
    If this is actual programming code, then you may need the tilde for the code to run – Henry Nov 23 '21 at 18:22
  • There's certainly a difference. "$=$" means that the two things on either side are identical (e.g. if you're talking about numbers, they have identical values), which is not the case here. What you could do is (a) simply report the RHS of the model and write your discussion in a way that explains that it's a model for $y$, or (b) give up and use "$=$" - or perhaps some other symbol like "$\approx$" - but on first use specify that you *don't* mean mathematical equality (or whatever the symbol you use conventionally means) and then explain what you're using it to mean instead. – Glen_b Nov 23 '21 at 22:21
  • 2
    The formula you included in the OP is R code used for supplying an LMM to the `lme4::lmer()` function; it is not universal notation (e.g., it wouldn't work for `nlme::lme()`) and therefore should not be how you denote your model in a paper. Use the model in @AdamO's answer, which is formal statistical notation. – Noah Nov 24 '21 at 19:27

2 Answers


The tilde as a relational operator is frequently used in a statistical context to indicate "is distributed as," and read "the term(s) on the left are distributed as indicated by the terms on the right." You will frequently see this indicating a simple distribution model, such as the "$\varepsilon \sim \mathcal{N}(0,\sigma^{2})$" in a regression context, indicating the errors are modeled with a normal distribution centered on zero and with variance $\sigma^{2}$. In a regression context you may also see an expectation of the dependent variable conditional on independent variables, or complex error structures expressed this way. (Aside: What you have written looks more like R code than mathematical notation in a textual context, and @AdamO makes a good point about deficiencies in this notation.)

By contrast the equals sign as a definitional operator is used to indicate just that: mathematical equality in a definitional sense; for example, in a regression context, you may define the way every observed value of a dependent variable is a function of both a fixed, or deterministic model, and a random, or stochastic model (the latter of which is often subsequently expressed with tilde notation to indicate distribution model, as in the snippet in the above paragraph). Sometimes a link function is left implicit in the mathematical formalism (e.g., where the text indicates something like "where the term to the left of the equals sign is the anti-link function of the dependent variable"), and sometimes it will be explicitly articulated in the mathematical formalism on either the left or right side of the equals sign.
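
As a rough sketch of how the two symbols divide the work, a random intercept model might be written with an equals sign for the structural part and a tilde for each distributional statement. The symbols $y_{ij}$, $x_{ij}$, $u_j$, and $\sigma_u^{2}$ are illustrative choices, not notation taken from the question, and the two random terms are assumed independent of each other:

$$ y_{ij} = \beta_0 + \beta_1 x_{ij} + u_j + \varepsilon_{ij}, \qquad u_j \sim \mathcal{N}(0, \sigma_u^{2}), \qquad \varepsilon_{ij} \sim \mathcal{N}(0, \sigma^{2}) $$

Here the equality is exact for every observation because the random terms appear explicitly on the right-hand side; the tilde is reserved for saying how those terms are distributed.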

Alexis
  • 26,219
  • 5
  • 78
  • 131
  • 4
    +1 it's important to note that R's formula language is endemic to R. It's idiosyncratic and has some deficiencies. The R formula expression does not suffice to explain a regression model in a scientific journal. For instance, `y ~ x*w` implies that $E[Y|X,W] = \gamma X W$ meaning there's no intercept or the 1st order effects of $X$ or $W$, but R adds those in for you. – AdamO Nov 23 '21 at 18:28
  • 1
    You mean that the distribution has SD $\sigma$, I guess -- although in my experience the notation $N(\mu, \sigma^2)$ is as or more common than $N(\mu, \sigma)$. – Nick Cox Nov 23 '21 at 18:38
  • @NickCox Changed. – Alexis Nov 23 '21 at 20:40
  • 2
    @AdamO the notation is not unique to R (so I wouldn't call it endemic). It's based on Wilkinson and Rogers notation and is used by programs like GLIM and Genstat; R makes some implementation changes for various reasons (e.g. it has to replace "$.$" with something else, so it uses "$:$") and because it broadens the use of the notation, there's a few oddities that creep in to avoid clashes with other bits of R. – Glen_b Nov 23 '21 at 22:29
  • 1
    @Glen_b good point. However, I daresay we both agree: computer code does not suffice to express a linear model any more than unformatted computer output suffices to report the results of a regression. – AdamO Nov 24 '21 at 16:41

If the journal is typesetting in LaTeX, the formula needs to be in math mode and use `\sim`, which renders as $\sim$ and is different from ~ in text mode. You can see the difference, no? But that's not your job, and I'm sure the journal charges a steep publication fee, so why can't it get this right?
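
A minimal LaTeX sketch of the distinction, assuming the journal's pipeline really is LaTeX (the question does not confirm this). The formula shown is the one from the question; which form the journal actually needs is a guess:

```latex
% Statistical notation: the relation symbol, typeset in math mode
$\varepsilon \sim \mathcal{N}(0, \sigma^{2})$

% R formula quoted as code: \verb keeps the literal tilde
% (the delimiter is ! because the formula itself contains |)
\verb!outcome ~ 1 + var1 + (1|var2)!

% Literal tilde in running text; a bare ~ in LaTeX text mode is a
% non-breaking space, so it prints no tilde at all
outcome \textasciitilde{} 1 + var1
```

If the tilde is lost somewhere between the manuscript and the proofs, pointing the typesetter at one of these forms may be quicker than changing the notation.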

Anyway, the correct way to express an LMM is not using the $\sim$ notation because it means "is distributed as". If I were a reviewer or editor, I would insist that the variables be defined: so say "var1" is obesity/overweight. You have also defined a random intercept in var2. Is this actually a covariate or a subject identifier? You can already see how this is becoming confusing for the audience. Assuming var2 is a subject identifier, this is a simple example of a random intercepts model, or repeated measures ANOVA. Lastly, suppose Y is creatinine level, and define the three as $\text{Obese}, \text{Subject}$, and $\text{Creat}$ respectively.

My preferred way to express your mixed model is WAY more formal, such as the one below:

$$ \text{Creat}_{ij} = \beta_0 + \beta_1 \text{Obese} + \epsilon_j+ \epsilon_{i,j}$$

Then interpret the model by saying,

The repeated measures ANOVA models creatinine level for the $j$-th subject at time $i$ with a fixed effect of obesity at baseline, a random intercept for subject $\epsilon_j$, and a random error term $\epsilon_{i,j}$.
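
To make the role of the equals sign concrete, here is a small simulation sketch of the data-generating process in the equation above, using only the Python standard library. The function name, coefficient values, standard deviations, the alternating obesity indicator, and the normal distributions for both error terms are assumptions made for illustration; none of them come from the question or the answer.

```python
import random


def simulate_random_intercept(n_subjects=4, n_times=3, beta0=1.0, beta1=0.3,
                              sd_subject=0.2, sd_resid=0.1, seed=0):
    """Simulate Creat = beta0 + beta1 * Obese + eps_j + eps_ij (hypothetical values)."""
    rng = random.Random(seed)
    rows = []
    for subject in range(n_subjects):
        obese = subject % 2                       # made-up baseline obesity indicator
        eps_j = rng.gauss(0.0, sd_subject)        # random intercept: one draw per subject
        for time in range(n_times):
            eps_ij = rng.gauss(0.0, sd_resid)     # residual: one draw per measurement
            creat = beta0 + beta1 * obese + eps_j + eps_ij
            rows.append({"subject": subject, "time": time, "obese": obese,
                         "eps_j": eps_j, "eps_ij": eps_ij, "creat": creat})
    return rows


rows = simulate_random_intercept()
for row in rows[:3]:
    # All measurements on one subject share the same eps_j.
    print(row["subject"], row["time"], round(row["creat"], 3))
```

Because both random terms are written out, the left-hand side equals the right-hand side exactly for every row, which is why "=" is appropriate in this form of the model.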

AdamO
  • 52,330
  • 5
  • 104
  • 209
  • I am not seeing a mixed model here: e.g., there's no $j$-level random term like $\mu_{0j}$? – Alexis Nov 23 '21 at 20:39
  • @Alexis that's the $\beta_{2,j}$. This is similar to the Diggle, Heagerty, Liang, Zeger, notation. – AdamO Nov 23 '21 at 21:01
  • Got it... I am not accustomed to seeing mixed models written sans explicit notation for random effects (random intercepts, and random slopes), but know that there are a bunch of different ways of presenting these models. – Alexis Nov 23 '21 at 21:06
  • 2
    I would probably say $\beta_0 + \beta_1 \textrm{Obese} + \epsilon_{i,T} T_i + \epsilon_{r,ij}; \epsilon_T \sim N(0,\sigma^2_T); \epsilon_{r} \sim N(0, \sigma^2_\epsilon)$ to complete the specification, and note that $T_i$ is an indicator variable. – Ben Bolker Nov 23 '21 at 22:48
  • @BenBolker No covariance between $\epsilon_{T}$ and $\epsilon_{r}$? – Alexis Nov 24 '21 at 07:06
  • *A la* $\left[\begin{array}{c}{\epsilon}_{T}\\ {\epsilon}_{r}\end{array}\right] \sim \pmb{\Omega} \left(0, \begin{array}{ll}{\sigma}^{2}_{\epsilon_T} & \\ {\sigma}_{\epsilon_{T r}} & {\sigma}^{2}_{\epsilon _r}\end{array}\right)$? – Alexis Nov 24 '21 at 07:27
  • No, there's no covariance between the group-level random effect and the residual variance term. – Ben Bolker Nov 24 '21 at 14:46
  • @BenBolker Got it, yes, that makes sense. (I am so used to seeing the same Greek letter used to denote the same level of variance… too many different styles of representing mixed models. :). – Alexis Nov 24 '21 at 16:21
  • My notation above isn't quite right either; I don't need the indicator variable $T_i$ in this form. – Ben Bolker Nov 24 '21 at 16:22
  • It seems like both $\beta_{2,j}$ and $\mathit{T2DM}_j$ are not needed, but either one suffices. Additionally, the Obese variable should have subscripts $i$ and $j$. – user551504 Nov 24 '21 at 17:00
  • @user551504 my contrived example is thinking of diabetes status as a time varying covariate *and* that the effect varies with respect to time. The overall point is to illustrate the type of presentation that I would expect in a scholarly journal. – AdamO Nov 24 '21 at 17:03
  • @BenBolker I significantly edited the question because I doubt that OP's mentioned "var2" was really a variable rather than a subject identifier - a distinction I care to make. Anyway, I framed the model in terms of a much more expected presentation. Are the distributional assumptions for the error term needed if there are reasonable asymptotics? – AdamO Nov 29 '21 at 17:21
  • 1
    `(1|var2)` explicitly denotes variation in intercepts by levels of a (categorical) grouping variable (your "subject identifier") `var2`. I agree with your notation above (although I would generally augment it with the distributional information about $\epsilon_j$ and $\epsilon_{i,j}$ - you could argue that the distributional information is relatively unimportant if asymptotics etc., but I would rather be complete). (Also, I agree that the OP shouldn't use R syntax in a formal setting, but this is fairly widely used: https://bbolker.github.io/mixedmodels-misc/glmmFAQ.html#model-specification ) – Ben Bolker Nov 29 '21 at 17:43