Model formulae in R such as

y ~ x + a*b + c:d

are based on the so-called Wilkinson notation: Wilkinson and Rogers (1973), "Symbolic Description of Factorial Models for Analysis of Variance".

This paper did not discuss notations for mixed models (which might not have existed back then). So where did the mixed model formulae used in lme4 and related packages in R such as

y ~ x + a*b + c:d + (1|school) + (a*b||town)

come from? Who introduced them for the first time, and when? Is there any agreed-upon term such as "Wilkinson notation" for them? I am specifically referring to terms like

(model formula |  grouping variable)
(model formula || grouping variable)
amoeba

1 Answer

The `|` notation has been in the nlme documentation since version 3.1-1, which probably dates to late 1999; this is easy to check in the nlme code archive on CRAN. nlme does use this notation: for example, try `library(nlme); formula(Orthodont)` and the `|` comes up. So an origin in the 2000s is ruled out. Let's dig further.

"Graphical Methods for Data with Multiple Levels of Nesting" by Pinheiro & Bates (1997) is where the `groupedData` constructor is introduced. They say: "The formula in a grouped data object has the same pattern as the formula used in a call to a Trellis graphics function in S-PLUS, such as xyplot". This makes sense, as P&B were working at Bell Labs (RIP), which developed the Trellis graphics system, and Trellis already used the `|` operator to indicate groups. That probably means "The Visual Design and Control of Trellis Display" by Becker et al. (1996) has something to do with this. The notation is not introduced in that paper, but it is the earliest electronic Trellis display reference I can find.
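The `formula(Orthodont)` check mentioned above can be run directly. This is only a small sketch, assuming the nlme package is installed (it normally ships with R); the variable names `f`, `rhs`, `op` and `group` are mine, for illustration.

```r
library(nlme)  # recommended package, normally bundled with R

# Orthodont is a groupedData object; its stored formula uses `|`
f <- formula(Orthodont)
print(f)

# A formula is a call to `~`; element 3 is its right-hand side
rhs <- f[[3]]

# The top-level operator of the right-hand side, and the grouping variable
op    <- as.character(rhs[[1]])
group <- as.character(rhs[[3]])
cat("operator:", op, " grouping variable:", group, "\n")
```

Here the `|` separates the primary covariate from the grouping factor, the same pattern that Trellis formulas use for conditioning variables.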

Essentially, we need to dig up the visualisation literature at this point. I would probably check Cleveland's book Visualizing Data (1993) and the early works of Deepayan Sarkar (who developed lattice). Note that `|` (and `||`) are true primitive operators, being the logical OR operators, so it was just a matter of time until someone overloaded them. While this is not a full answer, I strongly suspect P&B looked at their colleagues' cool visualisation system (the plots in that 1996 paper are quite good even by late-2010s standards), realised that someone (Becker, Cleveland and Shyu) had already done some work on this (maybe even discussed it with them at the time), and just followed what was already there. That is, the `|` operator originates in graphics notation. Trellis almost certainly used it; potential predecessors of Trellis may have done so too, but their e-footprint is very hard to track.
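To illustrate the point that `|` and `||` are ordinary operators that a formula simply carries along unevaluated, here is a rough base-R sketch. It is not how lme4 itself processes formulas; `split_terms` and `bar_op` are hypothetical helpers written only for this illustration.

```r
# `~` quotes its arguments, so `|` and `||` are parsed but never evaluated
f <- y ~ x + a*b + c:d + (1 | school) + (a*b || town)

# Split the right-hand side into its top-level additive terms
split_terms <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("+")) && length(e) == 3) {
    c(split_terms(e[[2]]), split_terms(e[[3]]))
  } else {
    list(e)
  }
}

# Return "|" or "||" if a term is a parenthesised bar expression, else NA
bar_op <- function(e) {
  if (is.call(e) && identical(e[[1]], as.name("("))) {
    inner <- e[[2]]
    if (is.call(inner)) {
      op <- as.character(inner[[1]])
      if (op %in% c("|", "||")) return(op)
    }
  }
  NA_character_
}

terms_list <- split_terms(f[[3]])
ops <- vapply(terms_list, bar_op, character(1))
print(ops)

# Without parentheses, `|` binds more loosely than `+` and captures
# everything to its left, so it becomes the top-level operator
g <- y ~ x + 1 | school
top_op <- as.character(g[[3]][[1]])
print(top_op)
```

The last two lines show why the parentheses matter once fixed and random terms share one formula: they keep each bar expression a self-contained term.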

In general, I think you might want to look at the Bell Labs page "NLME: Software for mixed-effects models" for more historical information on nlme.

usεr11852
  • Thanks a lot! It's true that `nlme` uses `|` but I don't think it uses `()` to denote random effects, right? Random effects are listed as a separate argument to the function call. Was it `lme4` that introduced `(x|id)` as part of the *same* formula? – amoeba Jun 13 '17 at 05:27
  • I think you are reading a bit too much into the presence of the parentheses; I strongly suspect they exist for parsing purposes given `lme4` uses a unified syntax for all terms. For example `fm1 – usεr11852 Jun 13 '17 at 07:55
  • Oh. Indeed. Never thought about it this way :) – amoeba Jun 13 '17 at 08:00
  • Bates just confirmed that the random part was introduced by the nlme authors (of which he is one): https://twitter.com/BatesDmbates/status/1111283948615802881 – Jonas Lindeløv Apr 14 '19 at 22:08
  • @JonasLindeløv: Cool! Thanks for sharing, I will make a link to the answer tomorrow night. – usεr11852 Apr 14 '19 at 22:21