From the asset returns of $N$ stocks, I construct the symmetric $N\times N$ covariance matrix, treating each stock's return series as a variable.

  • When the number of variables $N$ is fairly low, say $N=5$ or $N=12$, the condition number is relatively small, around $\text{cond}=1$–$5$.
  • As I increase the number of variables in the covariance matrix, though, to $N = 30$ or $N=50$, it already explodes to the $\text{cond}=500^+$ range.

This discussion explains why the condition number worsens when the features/variables have different scales, but that doesn't apply to my case, because all of the variables are in the same units: returns.

What my case does have in common with theirs, though, is that the standard deviations of the variables differ from one another (some stocks are riskier than others), but I wouldn't call that a difference in scale.

Why is the covariance matrix condition number so sensitive to an increase in the number of variables $N$?
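
For concreteness, here is a minimal sketch of the kind of computation described above (assuming numpy; the `returns` array below is simulated placeholder data, not my actual returns, and the dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
returns = rng.standard_normal((1000, 50))   # placeholder: T = 1000 observations of N = 50 stocks

cov = np.cov(returns, rowvar=False)         # N x N sample covariance matrix of the returns
print(np.linalg.cond(cov))                  # 2-norm condition number of the covariance matrix
```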

develarist
  • I believe this is not a mathematical phenomenon: it describes a property of the stock returns. But please explain what you might mean by a rectangular covariance matrix! – whuber Sep 17 '20 at 18:01
  • the covariance matrix is square, $M=N$, and symmetric – develarist Sep 17 '20 at 18:07
  • the linked page shows it's not a phenomenon restricted to financial data. – develarist Sep 17 '20 at 18:08
  • I don't see that anywhere on the linked page. In fact, when one generates random matrices with independent standard normal components, it is clear no such "explosion" takes place. I am not asserting the problem is *only* with financial phenomena, but such an experiment demonstrates that it is not a universal phenomenon and (therefore) the untoward increase in CN must be a function of the dependence structure within your sequence of variables. In fact, the only "explosion" occurs as $N$ grows from $1$ to $10,$ after which the CN enjoys a nice power law relationship with $N.$ – whuber Sep 17 '20 at 18:17
  • There *is* a dependence structure between the variables: stocks tend to be highly correlated, and their return series are rarely independent. Is this what is causing the high condition number? – develarist Sep 17 '20 at 18:22
  • Yes, your covariance matrix is (approximately) low rank, i.e. its singular values have a high range. This implies that there is a low-rank matrix which well-approximates your covariance matrix, or put another way: your data approximately lies on a subspace of dimension < N. – proof_by_accident Sep 17 '20 at 18:23
  • @proof_by_accident not sure what to do with explanations that bring in a subspace. do you mean there is a *lower* rank matrix that approximates the true one better? – develarist Sep 17 '20 at 18:24
  • Yes, there is a matrix of rank < N which approximates the covariance matrix very well. Your data only varies meaningfully in D < N dimensions. If you call the full covariance matrix M, and the covariance matrix of the data restricted to those $D$ dimensions $M_D$, then $M_D$ is a very good approximation for $M$. When this happens, your condition number goes up – proof_by_accident Sep 17 '20 at 18:32
  • The covariance matrix of highly correlated data is closer to singular and thus has a higher condition number. – javierazcoiti Sep 17 '20 at 18:33
  • The matrix becoming singular happens when off-diagonal elements reach levels found along the diagonal of the matrix, right? – develarist Sep 17 '20 at 18:48
  • A matrix is singular when a column can be obtained as a linear combination of the other columns (the same holds for rows). In this concrete application, I can't say which mechanisms cause the singularity. – javierazcoiti Sep 17 '20 at 18:57
  • I lowered the dependence between the variables on an artificial Gaussian dataset so that there is almost no correlation between the variables, and the condition number is still at $500$ in the large-variable case. The current best answer makes no mention of the dependence structure affecting the condition number, so it might be indicating that high correlation between variables is not a factor. – develarist Sep 17 '20 at 19:11
  • @develarist although I didn't mention it "the columns of $X$ only varying meaningfully in $D$ dimensions" is the same as saying "the columns of $X$ have dependence on each other". This is because if the columns of $X$ only vary in $D$ dimensions, then any one column can be (almost) represented as a sum of $D$ of the others, ie. they are dependent. – proof_by_accident Sep 17 '20 at 19:27
  • Usually the stock returns are seen as a linear combination of the returns of diverse underlying factors (macro, style, etc.) and some part of the return specific to each stock. With two stocks, the specific or idiosyncratic parts of their returns are usually close to uncorrelated, but with many stocks, there is a point where the idiosyncratic part of the return of a stock can be approximated by linear combinations of the others, causing the covariance matrix to be closer to singular. – javierazcoiti Sep 17 '20 at 19:31
  • @javierazcoiti Apart from being linear functions of underlying factors, do inter-asset returns have non-linear dependence that correlation can't capture? – develarist Sep 17 '20 at 19:38

1 Answer

Explaining this in the comments was a little limiting, apologies:

Assuming a centered data matrix $X$, your covariance matrix is $M = X^T X$ (up to the constant factor $1/(n-1)$, which does not affect the condition number). This will have a high condition number if the range of singular values of $M$ is large, because the condition number is defined as $\kappa(M) = \frac{s_{\text{max}}}{s_{\text{min}}}$, where $s_{\text{max}}$ and $s_{\text{min}}$ are the largest and smallest singular values of $M$.
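
A quick numerical check of this definition on made-up data (a sketch, assuming numpy; with the default 2-norm, `np.linalg.cond` returns exactly this ratio of extreme singular values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 10))
X -= X.mean(axis=0)                       # center the data matrix
M = X.T @ X                               # covariance matrix, up to a constant factor

s = np.linalg.svd(M, compute_uv=False)    # singular values, sorted in descending order
print(s[0] / s[-1])                       # s_max / s_min
print(np.linalg.cond(M))                  # the same number, by definition
```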

Let's look at what features of $X$ will produce a high range in the singular values. In general, the singular values of $M$ satisfy $$ M = \sum_{i=1}^N s_i v_i v_i^T = V \Sigma V^T, $$ where the $v_i$ (the columns of $V$) are orthogonal vectors, and $\Sigma$ is a diagonal matrix whose diagonal elements are the singular values $s_i$ and whose other entries are 0. Since $V^{-1} = V^T$ (because $V$ is orthogonal), we can see that $$ \Sigma = V^T M V = V^T X^T X V = (XV)^T(XV). $$ Letting $(XV)_i$ denote the $i^{\text{th}}$ column of $XV$, matrix multiplication is set up so that $$ s_i = (XV)_i^T (XV)_i = \| (XV)_i \|^2. $$ Thus, if some columns of $XV$ are very big and others are very small, then some $s_i$ will be very big and others will be very small. When this happens, your condition number will be large (by the definition of the condition number).
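
As a sanity check, here is a small sketch verifying the identity $s_i = \|(XV)_i\|^2$ on made-up data (assumes numpy; since $M$ is symmetric positive semi-definite, its SVD coincides with its eigendecomposition, so `np.linalg.eigh` is used to obtain $V$ and the $s_i$):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((200, 6))
X -= X.mean(axis=0)                       # center the data
M = X.T @ X

s, V = np.linalg.eigh(M)                  # eigenvalues (ascending) and orthonormal columns of V
XV = X @ V                                # rotate the data into the directions given by V
col_norms_sq = (XV ** 2).sum(axis=0)      # squared norm of each column of XV

print(np.allclose(col_norms_sq, s))       # True: each singular value is a squared column norm
```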

Recall from linear algebra that, since $V$ is an orthogonal matrix, the columns of $XV$ are just rotations of the columns of $X$. In effect, what multiplication by $V$ does is rotate your data matrix so that the directions along which it varies the most are aligned with the cardinal directions of the data space. The large columns of $XV$ correspond to the directions along which the data varies a lot, and the small columns correspond to the directions where the data varies only a little. For your data, it sounds like only $D \ll N$ columns of $XV$ have any appreciable magnitude, and the rest are very, very small. This number $D$ doesn't grow much, but $N$ does. As $N$ grows, the data varies less and less along each new dimension, bringing $s_{\text{min}}$ lower and lower and causing $\kappa(M)$ to explode.
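
To make this concrete, here is an illustrative sketch (assumes numpy; the factor-structure setup and parameter values are hypothetical, not your data) in which the data varies meaningfully in only $D = 3$ of $N = 50$ dimensions: only a few columns of $XV$ are large, so $s_{\text{min}}$ is tiny and $\kappa(M)$ is huge.

```python
import numpy as np

rng = np.random.default_rng(3)
T, N, D = 1000, 50, 3                     # many assets, but only a few real directions of variation

factors = rng.standard_normal((T, D))     # D common factors
loadings = rng.standard_normal((D, N))    # each asset's exposure to the factors
X = factors @ loadings + 0.1 * rng.standard_normal((T, N))   # factor part + small idiosyncratic noise
X -= X.mean(axis=0)

M = X.T @ X
s, V = np.linalg.eigh(M)                  # symmetric PSD, so this is also the SVD
XV = X @ V

col_norms = np.linalg.norm(XV, axis=0)    # lengths of the columns of XV
print(np.sort(col_norms)[::-1][:6])       # only ~D columns are large; the rest are tiny
print(np.linalg.cond(M))                  # s_min is tiny, so the condition number is huge
```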

  • $V$ is an orthogonal matrix, but where did it come from? – develarist Sep 17 '20 at 18:56
  • It's part of the definition of the singular value decomposition of $M$. I believe if the singular values of $M$ are all distinct (which is almost certainly true of any set of data obtained in nature), then the orthogonal matrix $V$ is uniquely determined by $M$. – Matthew Drury Sep 17 '20 at 19:11
  • Any matrix $A$ (of any dimension, regardless of whether it's square or not) can be decomposed as $A = U \Sigma V^T$, where $U$ and $V$ are orthogonal matrices and $\Sigma$ is diagonal. This is called the [singular value decomposition](https://en.wikipedia.org/wiki/Singular_value_decomposition). If $A$ is symmetric (as the covariance matrix $M$ is), then it's not hard to convince yourself that $U=V$. – proof_by_accident Sep 17 '20 at 19:24