16

"Dummy variable" and "indicator variable" are frequently used terms for describing membership in a category with 0/1 coding; usually 0 means not a member of the category and 1 means a member of the category.

A quick search on scholar.google.com on 11/26/2014 (with enclosing quotes) shows that "dummy variable" is used in about 318,000 articles, and "indicator variable" in about 112,000 articles. The term "dummy variable" also has a meaning in non-statistical mathematics, that of "bound variable", which likely contributes to the greater use of "dummy variable" in indexed articles.

My topically-linked questions:

  1. Are these terms always synonymous (within statistics)?
  2. Is either of these terms ever acceptably applied to other forms of categorical coding (e.g. effect coding, Helmert coding, etc.)?
  3. What statistical or disciplinary reasons are there to prefer one term over the other?
mdewey
Alexis
  • 4
    I tend to use "indicator variable" for binary conditions, e.g. sex might be coded as `male` with values `1` or `0`. If there is a categorical variable with more than 2 categories that is then expanded into indicator variables for membership in each level, I would use "dummy variables" to describe that set of indicator variables. – Gregor Thomas Nov 26 '14 at 18:23
  • 2
    I think you mean *sex* might be encoded as 1 or 0, *gender* is a far more complicated construct. (for that matter sex can be more complicated too) ;) – Alexis Nov 26 '14 at 18:25
  • 2
    point well-taken, edited to `sex`. – Gregor Thomas Nov 26 '14 at 18:28
  • 2
    I tend to call such an indicator variable `male`, where 1 means true (in this case male) and 0 means false (in this case female). If I use the variable name `sex` I will have to look up how I coded that variable every time I return to that dataset. – Maarten Buis Nov 27 '14 at 08:39
  • 4
    I've heard various stories of "dummy variable" being wildly and unfortunately misinterpreted by non-technical audiences as implying disdain or disparagement. They were embarrassing and convincing enough to turn me against the term. "indicator" is to me clear and straightforward. – Nick Cox Nov 27 '14 at 13:45
  • 1
    @NickCox: Would you also use "indicator variable" for the dummy variables in {-1,0,1} used for sum-to-zero coding? – Scortchi - Reinstate Monica Nov 27 '14 at 14:01
  • 1
    @Scortchi No. I've never had to write about such variables, but I'd seek some other way of reporting that. – Nick Cox Nov 27 '14 at 15:33
  • @Scortchi Do you have a go-to reference for understanding sum-to-zero coding? – Alexis Jun 05 '17 at 19:33
  • 1
    @Alexis: See https://stats.idre.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/#DEVIATION (where it's called "deviation coding", but I think "sum-to-zero coding" is clearer). – Scortchi - Reinstate Monica Jun 06 '17 at 08:02

2 Answers

13

I'd say "dummy variable" is a more general way to refer to (one of) the numerical variable(s) that represents (together represent) a categorical predictor; therefore the term applies also to those used in Helmert & effect coding†. That's mainly owing to the general use of "dummy" to mean "stand-in". "Indicator variable" I relate to indicator functions‡, so those can only be one or zero to indicate having or not having some property; therefore the term applies only to those used in reference-level coding※. Of course some people use "dummy coding" to mean "reference-level coding"; they presumably have a more restricted definition of "dummy variables", or at any rate ought to have.

† And if you don't call those "dummies", what do you call them?

‡ So e.g. the dummy $x_i$ is an indicator variable for when the $i$th person $u_i$ is male (a member of set $M$): $$ x_i=\boldsymbol{1}_M(u_i)=\begin{cases} 1 & \text{when } u_i \in M\\ 0 & \text{when } u_i \notin M \end{cases}$$

where $\boldsymbol{1}_M(\cdot)$ is the indicator function for membership of $M$.

※ Or, as @gung has pointed out, level-means coding.
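
To make the distinction concrete, here is a minimal sketch in plain Python (not part of the original answer). The function names `reference_coding` and `effect_coding`, the example levels, and the choice of which level serves as the reference or as the -1 level are all illustrative assumptions; real software may pick a different level.

```python
def reference_coding(values, levels):
    """k-1 columns of 0/1 indicators; the first level is the reference (all zeros)."""
    return [[int(v == lev) for lev in levels[1:]] for v in values]


def effect_coding(values, levels):
    """k-1 columns with entries in {-1, 0, 1}; the last level is -1 in every column."""
    last = levels[-1]
    return [
        [-1 if v == last else int(v == lev) for lev in levels[:-1]]
        for v in values
    ]


# Hypothetical three-level categorical predictor.
levels = ["a", "b", "c"]
values = ["a", "b", "c", "b"]

ref_rows = reference_coding(values, levels)  # every entry is 0 or 1
eff_rows = effect_coding(values, levels)     # entries may also be -1

for v, r, e in zip(values, ref_rows, eff_rows):
    print(v, r, e)
```

Under reference-level coding every column is a true indicator of membership in one level, whereas the effect-coded columns take the value -1 for the last level and so are stand-ins ("dummies") rather than indicators in the strict sense.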

Scortchi - Reinstate Monica
  • 2
    Huh... can you provide links to some resources motivating that? In my experience "dummy variable" gets used for 0/1 coding a great deal. Not sure I have seen dummy used as you suggest, and know others use it in an opposite sense. For example, Alkharusi, H. (2012) "Categorical Variables in Regression Analysis: A Comparison of Dummy and Effect Coding" *International Journal of Education* 4(2):202–210. – Alexis Nov 26 '14 at 18:34
  • 2
    I didn't say "dummy variable" isn't used for 0/1 coding, just that it may be used in a more general sense. – Scortchi - Reinstate Monica Nov 26 '14 at 18:36
  • 1
    Indeed the very paper you cite says that, using effect coding, "the dummy variables take on the values 1, 0, and -1". (Of course I think they should have called "dummy coding" something else if they're going to say that.) – Scortchi - Reinstate Monica Nov 26 '14 at 18:45
  • 1
    Got ya... as to the question from your daggered superscript, I tend to call them "categorical variables using XXX coding". – Alexis Nov 26 '14 at 19:12
  • Notation such as $x_i = (u_i \in M)$ allows one to dispense with the apparatus of an indicator function. The convention that true or false evaluates as 1 or 0 matches many programs, naturally. – Nick Cox Nov 27 '14 at 17:00
  • @NickCox Can you amplify both on the notation and the point you are making with it... I am afraid I am not following. :( – Alexis Nov 27 '14 at 17:45
  • 2
    The point is best made by Knuth (http://arxiv.org/abs/math/9205211). He attributes the idea to K.E. Iverson. In short, we don't need to invent or invoke an indicator function but can follow in formal discussion what our software does for us. – Nick Cox Nov 27 '14 at 18:39
7

@Scortchi has provided a good answer here. Let me add one small point. Even using the stricter definition of indicator variable, this can still be associated with (at least) two different coding schemes for categorical data in a regression-type model: viz. reference level coding and level means coding. With level means coding, you have a categorical variable with $k$ levels that are represented with $k$ indicator variables, but you do not include a vector of $1$s for the intercept (i.e., the intercept is suppressed). (For a fuller explication, with example model matrices, see my answer here: How can logistic regression have a factorial predictor and no intercept?) When there is only a single categorical variable, this yields model output in a way that is simple and may be preferred by some people. (For an example where using this scheme facilitates comparisons of interest, see my answer here: Why do the estimated values from a Best Linear Unbiased Predictor (BLUP) differ from a Best Linear Unbiased Estimator (BLUE)?)
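
As a rough illustration of the two schemes (a sketch only, with hypothetical function names and data, and with the first level arbitrarily chosen as the reference), the model matrices for a single categorical variable with $k = 3$ levels could be built like this:

```python
def reference_level_matrix(values, levels):
    """Column of 1s for the intercept plus k-1 indicators (first level is the reference)."""
    return [[1] + [int(v == lev) for lev in levels[1:]] for v in values]


def level_means_matrix(values, levels):
    """k indicators, one per level, with no intercept column."""
    return [[int(v == lev) for lev in levels] for v in values]


# Hypothetical data: four observations on a three-level factor.
levels = ["a", "b", "c"]
values = ["a", "b", "c", "b"]

X_ref = reference_level_matrix(values, levels)
X_means = level_means_matrix(values, levels)

for v, r, m in zip(values, X_ref, X_means):
    print(v, r, m)
```

Both matrices have $k$ columns. In the level-means matrix each row contains exactly one 1, so each coefficient corresponds directly to one level rather than to a difference from the reference level.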

gung - Reinstate Monica