In machine learning, why are superscripts used instead of subscripts?

Question

I'm taking Andrew Ng's course on Machine Learning through Coursera. For equations, superscripts are used instead of subscripts. For example, in the following equation $x^{(i)}$ is used instead of $x_i$:

$J(\theta_0, \theta_1) = \frac{1}{2m} \sum\limits_{i=1}^{m}{(h_\theta(x^{(i)}) - y^{(i)})^2}$

Apparently, this is common practice. My question is: why use superscripts instead of subscripts? Superscripts are already used for exponentiation. Granted, I seem to be able to tell the two uses apart by checking whether or not parentheses are present, but it still seems confusing.
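
To make the role of the index concrete, here is a minimal sketch of the cost function above in code. It assumes the univariate hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$; the function name `cost` and the toy data are made up for illustration. The superscript $(i)$ becomes the loop index over training examples, while the trailing $^2$ stays a real exponent:

```python
def cost(theta0, theta1, xs, ys):
    """Squared-error cost J(theta0, theta1) over m training examples."""
    m = len(xs)
    total = 0.0
    for i in range(m):  # i plays the role of the superscript (i); 0-based here
        prediction = theta0 + theta1 * xs[i]  # h_theta(x^(i)), assumed linear
        total += (prediction - ys[i]) ** 2    # here ** 2 is a genuine exponent
    return total / (2 * m)


# Toy data lying exactly on y = 1 + 2x, so the cost at (1, 2) is zero.
xs = [0.0, 1.0, 2.0]
ys = [1.0, 3.0, 5.0]
print(cost(1.0, 2.0, xs, ys))
print(cost(0.0, 0.0, xs, ys))
```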

I suspect it's perhaps because some computer science people are not versed in standard mathematical notation, and therefore make up their own notation. Actuaries do this sometimes too, and it's frustrating when you get to more complicated concepts. — rocinante, Feb 03 '16 at 21:36
Is `i` indexing over the data set size, or over the elements of the vector `x`? If the former, that's totally standard. If the latter, that's totally non-standard. And the reason why the superscript is used is because sometimes you want to refer to the element of the vector using the subscript. — Rex Kerr, Feb 04 '16 at 00:06
@RexKerr: I strongly believe this is the correct answer (in this case). I was typing my answer when your comment appeared. — amoeba, Feb 04 '16 at 00:13
@rocinante lol no, it's because subscripts are already taken for indexing vectors. — Neil G, Feb 04 '16 at 04:48
@rocinante That's rather presumptuous. What about contravariant vectors/[Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation)? — Will Vousden, Feb 04 '16 at 10:21
@rocinante I have to echo others in underlining that your wording is unfortunate. We all have a tendency to regard what is local and familiar as standard. — Nick Cox, Feb 04 '16 at 19:08
Hello @Jonathan, I wonder if you find that the existing answers have resolved your question. If so, consider accepting one of them. If no, feel free to clarify your remaining doubts. — amoeba, Feb 09 '16 at 22:02
Good point @amoeba. I was waiting for things to calm down and it definitely has. Answer accepted. — entpnerd, Feb 10 '16 at 02:32

amoeba · Accepted Answer · 2016-02-04T10:12:03.810

31

If $x$ denotes a vector $x \in \mathbb R^m$ then $x_i$ is a standard notation for the $i$-th coordinate of $x$, i.e. $$x = (x_1, x_2, \ldots, x_m)\in\mathbb R^m.$$

If you have a collection of $n$ such vectors, how would you denote the $i$-th vector? You cannot write $x_i$, because that already has a standard meaning. So people sometimes write $x^{(i)}$ instead, and I believe that is why Andrew Ng does it.

I.e.

\begin{equation}
\begin{aligned}
x^{(1)} &= (x_1^{(1)}, x_2^{(1)}, \ldots, x_m^{(1)}) \in \mathbb R^m\\
x^{(2)} &= (x_1^{(2)}, x_2^{(2)}, \ldots, x_m^{(2)}) \in \mathbb R^m\\
&\;\;\vdots \\
x^{(n)} &= (x_1^{(n)}, x_2^{(n)}, \ldots, x_m^{(n)}) \in \mathbb R^m.
\end{aligned}
\end{equation}
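
The same distinction shows up when the data are stored as a list of rows. A minimal sketch, assuming a hypothetical toy data set `X` of $n = 3$ vectors in $\mathbb R^2$ (Python indices start at 0, whereas the formulas above start at 1):

```python
# Hypothetical data set: n = 3 vectors, each with m = 2 coordinates.
X = [
    [1.0, 2.0],  # x^(1)
    [3.0, 4.0],  # x^(2)
    [5.0, 6.0],  # x^(3)
]

i, j = 2, 1  # 1-based indices, as in the formulas

x_i = X[i - 1]            # x^(i): the whole i-th vector
x_i_j = X[i - 1][j - 1]   # x_j^(i): the j-th coordinate of the i-th vector

print(x_i)
print(x_i_j)
```

The superscript selects a row (an example) and the subscript selects a coordinate within it, which is the same information that $x_{ij}$ packs into two subscripts.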

edited Feb 04 '16 at 10:12

answered Feb 04 '16 at 00:11


I'm not disagreeing, but often $x_{ij}$ is used, i.e. for repeated measurements. – Cliff AB Feb 04 '16 at 00:13
Yes, but $x_{ij}$ is equivalent to my $x^{(i)}_j$; what would be the equivalent of $x^{(i)}$? – amoeba Feb 04 '16 at 00:14
Yes, that's an advantage. I think $x_{i.}$ is used sometimes, but this could be confused with $\sum_{j=1}^m x_{ij}/m$. – Cliff AB Feb 04 '16 at 00:16
@CliffAB If you have repeated measurements, wouldn't you specify that first by clarifying whether $x$ is a list or a set? (A list allows repeated numbers, whereas a set does not), so you don't need two subscripts to describe an element. – rocinante Feb 04 '16 at 01:28
@rocinante: I believe you may be misinterpreting what I mean by "repeated measurements". In this case, $x_{ij}$ refers to the $j^{th}$ measurement of the $i^{th}$ subject. As an example, if $x$ represents mile run times where each subject has 5 runs, $x_{2,3}$ is the mile time for the 2nd subject's 3rd attempt. – Cliff AB Feb 04 '16 at 02:01
If you wish to iterate over matrices then the $x_{mn}^{(i)}$ seems the most intuitive way to do so. Therefore the notation stays consistent when moving from vectors to matrices. – josh Feb 04 '16 at 10:04
How would $x_i$ not denote the $i$-th vector if $x$ is a collection of vectors? Or is it just that you don't necessarily know whether $x$ is supposed to be a collection of points or a collection of vectors in isolation, so the $x^{(i)}$ syntax is just for type hinting? – JAB Feb 04 '16 at 13:54
@JAB Yes, it's to make the notation more explicit ("type hinting" as you say). Of course one can agree to use $x_i$ for the $i$-th vector and $x_{ij}$ for the $j$-th element of the $i$-th vector. There are various conventions possible, this is just one of them. I am not even saying it is the best one, just explaining the rationale behind it. – amoeba Feb 04 '16 at 14:44
Isn't $x_{i\bullet}$ a common notation for $x^{(i)}$? – Francis Feb 05 '16 at 14:16
@Francis, I think I have *never* seen this notation (if you really mean `x_{i\bullet}` and not `x_{i\cdot}`). In what fields is it common? – amoeba Feb 05 '16 at 14:52
@amoeba: they are the same thing that I am referring to. Some authors use $\bullet$ over $\cdot$ to make the dot more noticeable. – Francis Feb 05 '16 at 14:57

Cliff AB · Answer 2 · 2016-02-04T18:54:34.887

11

The use of superscripts as you have described is, I believe, not very common in the machine learning literature. I'd have to review Ng's course notes to confirm, but if he uses them this way, he may well be the origin of the proliferation of this notation. Either way, not to be too unkind, but I don't think many students of online courses are publishing machine learning literature, so this notation is not very common in the actual literature. After all, these are introductory courses in machine learning, not PhD-level courses.

What is very common is to use superscripts to denote the iteration of an algorithm. For example, you could write an iteration of Newton's method as

$ \theta^{(t+1)} = \theta^{(t)} - H(\theta^{(t)})^{-1} \nabla f(\theta^{(t)})$

where $f$ is the objective function, $H(\theta^{(t)})$ is its Hessian and $\nabla f(\theta^{(t)})$ is its gradient, both evaluated at $\theta^{(t)}$.

(...yes this is not quite the best way to implement Newton's method due to the inversion of the Hessian matrix...)

Here, $\theta^{(t)}$ represents the value of $\theta$ in the $t^{th}$ iteration. This is the most common (but certainly not the only) use of superscripts that I am aware of.
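
As a rough sketch of how the iteration superscript maps to code, here is Newton's method in one dimension, where the Hessian reduces to the second derivative and its inverse to a division. The objective $e^\theta - 2\theta$ (minimized at $\theta = \log 2$) and the helper name `newton` are assumptions chosen purely for illustration. In several dimensions one would typically solve the linear system rather than invert the Hessian explicitly.

```python
import math


def newton(grad, hess, theta0, steps):
    """Return the iterates theta^(0), ..., theta^(steps) of Newton's method."""
    thetas = [theta0]
    for t in range(steps):
        theta_t = thetas[t]  # theta^(t)
        # In one dimension the Hessian is a scalar, so "inverting" is a division.
        thetas.append(theta_t - grad(theta_t) / hess(theta_t))  # theta^(t+1)
    return thetas


# Illustrative objective exp(theta) - 2*theta, minimized at log(2).
thetas = newton(
    grad=lambda th: math.exp(th) - 2.0,
    hess=lambda th: math.exp(th),
    theta0=0.0,
    steps=8,
)
print(thetas[-1], math.log(2.0))
```

The list index `t` in `thetas[t]` is exactly the parenthesized superscript in $\theta^{(t)}$.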

EDIT: To clarify, the original question appeared to suggest that, in ML notation, $x^{(i)}$ was equivalent to statistics' $x_i$ notation. In my answer, I state that this is not truly prevalent in ML literature. This is true. However, as pointed out by @amoeba, there is plenty of superscript notation in ML literature for data, but in these cases $x^{(i)}$ does not typically mean the $i^{th}$ observation of a single vector $x$.

edited Feb 04 '16 at 18:54

answered Feb 03 '16 at 21:52


The clash with the use of parenthesized/bracketed superscripts for iteration counts (a notation that is in common use across a wide range of areas) is a really important thing to raise. – Glen_b Feb 03 '16 at 22:02
It is also commonly used to indicate the index of the sample in the training set, which is like the iteration but not exactly the same because you usually end up iterating through your training set many times. – Rex Kerr Feb 04 '16 at 00:07
I've also seen iteration counts noted using subscripts ($a_{n+1} = a_n + 1$) as well as in line ($a(n+1) = a(n) + 1$). Which is why, when using some specific notation, I'll usually put something at the start to disambiguate (e.g. saying "in the following series, blah blah blah" and then putting the math). Thus, whatever notation is in use, readers can (hopefully) intuit the meaning for potentially ambiguous cases rather than having to guess based on the conventions they know. – JAB Feb 04 '16 at 14:00
I agree with @JAB. More generally, I don't think it's heinous for people who will be writing and using code to borrow notation from software in mathematical treatments. For example, and contentiously, computing people are way ahead of many mathematical groups in using clean notation such as $(x > 0)$, to be evaluated as 1 if true and 0 if false, instead of unnecessary formalisms such as $I(x > 0)$; here I am merely following behind Donald Knuth. – Nick Cox Feb 04 '16 at 19:04
@NickCox I generally only see the $I(x > 0)$ form when it comes to probability; otherwise, $x > 0$ is just an inequality constraint. When it comes to mathematical equations, they're either broken up into piecewise representations or they just represent the equation itself as an inequality as doing otherwise would induce ambiguity. (It's similar to how $=$ in math is more subtle than either `=` or `==` in most programming languages; it introduces a constraint or definition rather than an actual assignment or equality check.) – JAB Feb 04 '16 at 20:58
I often see true or false statements in mathematics to be evaluated as 1 or 0. Thus a definition of $\text{sign}(x)$ is $[x > 0] - [x < 0]$. (Correcting my earlier comment, square brackets are slightly better here than parentheses.) The notations $:=$ and $=:$ used by some mathematicians and statisticians are another superb example of how a notation from programming (in this case Algol in the 1950s) can help to make nuanced distinctions, namely the difference between a definition and an equivalence. – Nick Cox Feb 04 '16 at 21:14
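
In Python, comparisons return `bool`, which is a subclass of `int`, so the bracket definition of the sign function in the comment above can be written almost verbatim. A small sketch (the function name `sign` is just for illustration):

```python
def sign(x):
    """Sign of x via Iverson-style brackets: [x > 0] - [x < 0]."""
    return (x > 0) - (x < 0)  # True and False behave as 1 and 0 in arithmetic


print(sign(3.5), sign(0), sign(-2))
```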

Aksakal · Answer 3 · 2016-02-05T13:34:32.083

> Superscripts are already used for exponentiation.

In mathematics superscripts are used left and right depending on the field. The choice is always historical legacy, nothing more. Whoever first got into the field set the convention of using sub- or superscripts.

Two examples. Superscripts are used to denote derivatives: $f^{(n)}(x)$

In tensor algebra both superscripts and subscripts are used heavily for the same thing; for example, in $R^i_j$ the index $i$ could run over rows and $j$ over columns. It's quite expressive: $T_i^k=R_i^jC_j^k$
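
Under the Einstein summation convention, the repeated index $j$ in $R_i^jC_j^k$ is summed over, so the expression is an ordinary matrix product. A minimal sketch using plain nested lists, with a hypothetical helper named `contract` and made-up matrices:

```python
def contract(R, C):
    """Compute T[i][k] = sum over j of R[i][j] * C[j][k]."""
    n_i, n_j, n_k = len(R), len(C), len(C[0])
    return [
        [sum(R[i][j] * C[j][k] for j in range(n_j)) for k in range(n_k)]
        for i in range(n_i)
    ]


R = [[1, 2], [3, 4]]
C = [[0, 1], [1, 0]]  # swaps the two columns of R
T = contract(R, C)
print(T)
```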

Also I remember using scripts before letters (prescripts) in Physics, e.g. $^i_jB_k^l$. I think it was with tensors.

Hence, the choice of superscripts by Ng is purely historical too. There's no real reason to use or not use them, or prefer them to subscripts. Actually, I believe that here ML people are using tensor notation. They definitely are well versed in the subject, e.g. see this paper.

Another example for your point: [Einstein notation](https://en.wikipedia.org/wiki/Einstein_notation) — Neil G, Feb 04 '16 at 19:50
