Usage of tensor notation in statistics

Question

A friend of mine (mathematician) basically told me I shouldn't bother with matrix algebra and should focus on tensor analysis/manipulation. He said it's much more general and intuitive. I've been trying to grasp most of the matrix literature (Harville, Magnus, Gentle) due to its relevance to linear models.

My question is this: is there a part of mathematical statistics/prob. theory where tensor notation is employed regularly? If so, why isn't it popular elsewhere since it (supposedly) makes matrix manipulations easier?

Have you read http://stats.stackexchange.com/questions/198061/why-the-sudden-fascination-with-tensors ? — Tim, Dec 21 '16 at 12:28
Intuitive supposedly means easy to learn. In my experience when used as a sales pitch it usually means easy to remember and use once you understand it, which is quite different, as in "the syntax of this software is highly intuitive". It usually isn't! I distrust all such claims generically. Your friend is very comfortable with tensors; that's great and you could be too if you worked hard enough with them. — Nick Cox, Dec 21 '16 at 13:18
@Tim I believe that thread might represent exactly the opposite of what the mathematician friend is advocating. The contrast is between *manipulation of objects with multiple indices* versus manipulation of more abstract mathematical objects using more abstract algebraic rules. Another way to understand the distinction is that people who need to *think* about linear algebraic objects tend to use basis-free methods (without indices) whereas those who need to *compute* with them (statisticians, computer programmers, physicists, engineers) eventually are forced to use indices. — whuber, Dec 21 '16 at 15:34

Mustafa Eisa · Answer 1 · 2017-02-15T09:52:51.640

The most obvious and straightforward application of tensors (that I know of) in statistics is computing high-order moments of a multivariate distribution. For example, consider a random vector $x\sim F$, where $F$ is some $p$-dimensional distribution. Given a data matrix $X \in \mathbb{R}^{n\times p}$, where $n$ is the number of observations, each drawn i.i.d. from $F$, the second moment $\mathbb{E}(xx^\top) = \mathbb{E}(x\otimes x)$ can be estimated from the sample $X$ as follows: $$\hat{\mathbb{E}}(x\otimes x) = \frac1n \sum_{i=1}^n X_{i\cdot} \otimes X_{i\cdot} = \frac1n X^\top X \in \mathbb{R}^{p\times p}$$ where $X_{i\cdot}$ is the $i$-th row of $X$. This matrix is only a few operations away from the covariance matrix: subtracting the outer product of the sample mean with itself gives the ($1/n$-normalized) sample covariance. Continuing on to the third moment, which is likewise related to "co-skewness," we see we are dealing with an order-3 tensor $$\hat{\mathbb{E}}(x\otimes x \otimes x) = \frac1n \sum_{i=1}^n X_{i\cdot} \otimes X_{i\cdot}\otimes X_{i\cdot} \in \mathbb{R}^{p\times p\times p}$$ The "co-kurtosis" tensor is order 4, and so on for higher-order moments.
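
As a rough illustration of the two estimators above, here is a minimal NumPy sketch. The function name `raw_moment_tensors` and the simulated standard-normal data are assumptions made for the example, not part of the original answer; the estimates are raw (uncentered) moments, exactly as in the formulas.

```python
import numpy as np

def raw_moment_tensors(X):
    """Estimate the raw second- and third-order moment tensors from the rows of X."""
    n = X.shape[0]
    # (1/n) * sum_i X_i (x) X_i, which equals X^T X / n
    m2 = np.einsum("ni,nj->ij", X, X) / n
    # (1/n) * sum_i X_i (x) X_i (x) X_i, an order-3 tensor of shape (p, p, p)
    m3 = np.einsum("ni,nj,nk->ijk", X, X, X) / n
    return m2, m3

# Hypothetical data: n = 200 observations of a p = 3 dimensional vector.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 3))

m2, m3 = raw_moment_tensors(X)
print(m2.shape, m3.shape)
```

Centering the columns of `X` before calling the function would give central moments instead, which is presumably what the co-skewness tensor refers to in practice.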

These moment tensors have been applied in financial portfolio optimization decades ago, in multivariate data standardization (standardizing by skew, not just mean and variance), and obviously in deep learning (e.g., TensorFlow), where the gradients of the loss function with respect to model parameters contain tensors that are used in back-propagation. I believe there are additional applications in natural language processing, multivariate time series, and stochastic block models.

I agree with @whuber: When the indices are not pivotal to the work, tensor notation certainly is an intuitive and flexible generalization that sheds light on the lower-dimensional cases. However, it tends to make things difficult for statisticians and engineers who have to worry about three or more indices and about complicated generalizations of what seemed like ergonomic rules (e.g., what is the trace of a high-order tensor? what does symmetry mean?). That's probably why many of the statisticians and applied mathematicians I know avoid tensors and simply stack/flatten the 2D cross-sections of each tensor into a tall matrix.
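
The stacking trick mentioned above can be sketched as follows. This is only one possible convention (slices taken along the first index, NumPy's default row-major order), and the tensor `T` is built from hypothetical simulated data for the sake of the example.

```python
import numpy as np

# Hypothetical order-3 moment tensor of shape (p, p, p) with p = 3.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
T = np.einsum("ni,nj,nk->ijk", X, X, X) / X.shape[0]
p = T.shape[0]

# Stack the p cross-sections T[0], T[1], ..., T[p-1] (each p x p)
# on top of one another, giving a tall (p*p) x p matrix.
tall = T.reshape(p * p, p)

# Rows i*p ... (i+1)*p - 1 of the tall matrix hold the slice T[i].
second_slice = tall[p:2 * p]

# The flattening loses nothing: reshaping back recovers the tensor.
restored = tall.reshape(p, p, p)
print(tall.shape)
```

Ordinary matrix tools can then be applied to `tall`, at the cost of having to remember which block of rows corresponds to which value of the first index.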

+1 for the first part of this answer. Covariance matrix is a bona fide tensor, as well as these higher order moment generalizations (of which I was not aware until now). However, I am not sure I agree with your paragraph beginning with "These moment tensors have been applied..." For example, in what sense are neural network gradients "tensors"? — amoeba, Feb 15 '17 at 09:52
Yea, sorry, I changed it to "contains tensors" rather than are tensors. It happens because in back-propagation on deep networks, the gradient of a parameter vector/matrix with respect to another parameter vector/matrix is a tensor. Those terms however are summed and reduced before turning into the gradient. — Mustafa Eisa, Feb 15 '17 at 09:55
@MustafaSEisa In portfolio optimization, can quadratic programming be used to maximize portfolio skewness based on the co-skewness matrix? If not, can portfolio skewness (cubic objective function) be converted to a quadratic objective function somehow? https://quant.stackexchange.com/questions/58786/is-quadratic-programming-used-to-maximize-portfolio-skewness-and-kurtosis — develarist, Oct 30 '20 at 08:28
Maximization of an even-order moment is non-convex, so no, I don’t think it would be possible to convert to a (convex) quadratic program that’s solvable with IPM. Skewness is furthermore cubic, which makes it even harder. — Mustafa Eisa, Oct 31 '20 at 00:50
