30

I have a correlation matrix of security returns whose determinant is zero. (This is a bit surprising since the sample correlation matrix and the corresponding covariance matrix should theoretically be positive definite.)

My hypothesis is that at least one security is linearly dependent on other securities. Is there a function in R that sequentially tests each column of a matrix for linear dependence?

For example, one approach would be to build up the correlation matrix one security at a time and calculate the determinant at each step. When the determinant equals 0, stop: you have identified a security that is a linear combination of the other securities.
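For concreteness, a rough sketch of that sequential check in R might look like the following (assuming the correlation matrix is stored in `R.mat`; the tolerance is arbitrary):

for (k in 2:ncol(R.mat)) {
  if (abs(det(R.mat[1:k, 1:k])) < 1e-12) {   # determinant numerically zero
    cat("Security", k, "appears to be a linear combination of securities 1 to", k - 1, "\n")
    break
  }
}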

Any other techniques to identify linear dependence in such a matrix are appreciated.

amoeba
Ram Ahluwalia
  • Your matrix is positive semi-definite, it's not positive definite though, for it is singular. – ttnphns Oct 01 '11 at 17:53
  • What are the dimensions (no. variables; no. samples)? – Karl Oct 01 '11 at 17:59
  • Number of columns = 480. # of rows for each time series = 502. In general you find that the longer the time series, the more likely the sample covariance matrix is to be positive definite. However, there are many cases where you'd like to use a substantially smaller value of T (or exponentially weight) to reflect recent market conditions. – Ram Ahluwalia Oct 01 '11 at 19:24
  • The question is ill-posed. If your data matrix is 480 by 502 then saying that the matrix has rank $q < 480$ (the column space of the matrix has dimension $q < 480$) is mathematically equivalent to saying that some column is a linear combination of the others, but you can't pick out one column and say that this is the column that is linearly dependent. So there is no procedure for doing this, and the suggested procedure will pick a quite arbitrary security depending on the order they are included. – NRH Oct 01 '11 at 21:34
  • The covariance matrix is symmetric. It is generated by transpose(A) * A. The matrix A has dimensions 480x502. However the covariance matrix is 480x480 – Ram Ahluwalia Oct 01 '11 at 23:26
  • may be useful: http://stats.stackexchange.com/questions/50537/should-one-remove-highly-correlated-variables-before-doing-pca?rq=1 – Ladislav Naďo Nov 01 '13 at 14:01
  • ^ i have looked through that, and understand the effect of removing the correlated variables or not. due to the often heavy correlation in this dataset (often 0.95 to 1.00 is present) i decided that removing these prior to PCA would be beneficial – Samuel Nov 01 '13 at 14:04
  • Is what i am doing correct? ie even if the variables are not directly correlated but there is a link (like in A = B +C but no correlation between B and A) should i be leaving only one of these in for PCA? Would i expect high VIFs if A, B and C were left in? or more accurately if A and B were left in at the modelling stage? (where C has been removed by PCA identifying C and A as correlated) – Samuel Nov 01 '13 at 14:07
  • Try to give me more detail. Would it be possible to add a little bit of raw data and explain what your variables really are (instead of A,B,C...)? – Ladislav Naďo Nov 01 '13 at 14:11
  • Ill explain as much as i legally can. i have approx 100 - 200 variables, and 1200 subjects. There are a few output Y variables which i want to look at, within the X variables there are relationships such as 'internal water IW' 'external water EW' 'total water TW'. TW = IW + EW. in the case i am looking at IW is often 0.1 or somewhere between 0.5-1.5 and EW is from 0.5 to 2.5 and reasonably normally distributed. In this case PCA and the correlation matrix identify a link between EW and TW but not between IW and TW. However i know there is a link in reality – Samuel Nov 01 '13 at 14:14
  • If its some "topsecretarea51project", than you should use allegory :) – Ladislav Naďo Nov 01 '13 at 14:17
  • its not that secret. :) and i feel the info to come out of this wont be that exciting anyway..i should add that this is an obvious relationship, many of them are not with no clues in the name of the relationship as to what is linked – Samuel Nov 01 '13 at 14:26
  • Here is a very similar question https://stats.stackexchange.com/q/296248/3277 – ttnphns Aug 06 '17 at 10:26

7 Answers

31

Here's a straightforward approach: compute the rank of the matrix that results from removing each of the columns. The columns which, when removed, result in the highest rank are the linearly dependent ones (since removing those does not decrease rank, while removing a linearly independent column does).

In R:

# Rank of the matrix after dropping each column in turn
rankifremoved <- sapply(1:ncol(your.matrix), function (x) qr(your.matrix[,-x])$rank)
# Columns whose removal leaves the rank unchanged take part in a linear dependency
which(rankifremoved == max(rankifremoved))
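As a quick illustration (a toy example, not part of the original answer), suppose the fourth column duplicates the first:

set.seed(1)
your.matrix <- cbind(rnorm(10), rnorm(10), rnorm(10))
your.matrix <- cbind(your.matrix, your.matrix[, 1])   # column 4 duplicates column 1
rankifremoved <- sapply(1:ncol(your.matrix), function(x) qr(your.matrix[, -x])$rank)
which(rankifremoved == max(rankifremoved))            # returns 1 and 4: the dependent group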
James
  • Hugely useful answer in determining the offending column in a regression matrix where I received the error `system is exactly singular: U[5,5] = 0`, which I now know means column 5 was the issue (seems obvious with hindsight as it's a column of zeros!) – Matt Weller Nov 27 '14 at 23:28
  • In the comment of James, he posted the script: rankifremoved –  Sep 05 '16 at 15:12
  • @EltonAraújo: The output will be a vector giving the indices of the linearly dependent columns: so (2,4,5) for the example in ttnphns's answer. But I wonder how issues of numerical precision are going to affect this method. – Scortchi - Reinstate Monica Sep 05 '16 at 16:03
  • rankifremoved contains all columns that are linearly dependent among them or between them. In some application, we may want to retain a column or a few column and not dropping all – MasterJedi Aug 01 '17 at 02:56
  • Shouldn't this return an empty set for `your.matrix = matrix(1:4, 2)`? – Holger Brandl Dec 15 '17 at 15:16
20

The question asks about "identifying underlying [linear] relationships" among variables.

The quick and easy way to detect relationships is to regress any other variable (use a constant, even) against those variables using your favorite software: any good regression procedure will detect and diagnose collinearity. (You will not even bother to look at the regression results: we're just relying on a useful side-effect of setting up and analyzing the regression matrix.)
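For instance, a minimal sketch of this in R, assuming the variables sit in a numeric matrix `X` (the constant response and the call to `alias()` are just one convenient way to surface what the regression routine detects):

fit <- lm(rep(1, nrow(X)) ~ ., data = as.data.frame(X))  # regress a constant on all variables
names(which(is.na(coef(fit))))  # columns dropped by lm() as exactly collinear with earlier ones
alias(fit)                      # shows how each dropped column depends on the retained ones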

Assuming collinearity is detected, though, what next? Principal Components Analysis (PCA) is exactly what is needed: its smallest components correspond to near-linear relations. These relations can be read directly off the "loadings," which are linear combinations of the original variables. Small loadings (that is, those associated with small eigenvalues) correspond to near-collinearities. An eigenvalue of $0$ would correspond to a perfect linear relation. Slightly larger eigenvalues that are still much smaller than the largest would correspond to approximate linear relations.

(There is an art and quite a lot of literature associated with identifying what a "small" loading is. For modeling a dependent variable, I would suggest including it within the independent variables in the PCA in order to identify the components--regardless of their sizes--in which the dependent variable plays an important role. From this point of view, "small" means much smaller than any such component.)


Let's look at some examples. (These use R for the calculations and plotting.) Begin with a function to perform PCA, look for small components, plot them, and return the linear relations among them.

pca <- function(x, threshold, ...) {
  fit <- princomp(x)
  #
  # Compute the relations among "small" components.
  #
  if(missing(threshold)) threshold <- max(fit$sdev) / ncol(x)
  i <- which(fit$sdev < threshold)
  relations <- fit$loadings[, i, drop=FALSE]
  relations <- round(t(t(relations) / apply(relations, 2, max)), digits=2)
  #
  # Plot the loadings, highlighting those for the small components.
  #
  matplot(x, pch=1, cex=.8, col="Gray", xlab="Observation", ylab="Value", ...)
  suppressWarnings(matplot(x %*% relations, pch=19, col="#e0404080", add=TRUE))

  return(t(relations))
}

Let's apply this to some random data. These are built on four variables (the $B,C,D,$ and $E$ of the question). Here is a little function to compute $A$ as a given linear combination of the others. It then adds i.i.d. Normally-distributed values to all five variables (to see how well the procedure performs when multicollinearity is only approximate and not exact).

process <- function(z, beta, sd, ...) {
  x <- z %*% beta; colnames(x) <- "A"
  pca(cbind(x, z + rnorm(length(x), sd=sd)), ...)
}

We're all set to go: it remains only to generate $B, \ldots, E$ and apply these procedures. I use the two scenarios described in the question: $A=B+C+D+E$ (plus some error in each) and $A=B+(C+D)/2+E$ (plus some error in each). First, however, note that PCA is almost always applied to centered data, so these simulated data are centered (but not otherwise rescaled) using sweep.

n.obs <- 80 # Number of cases
n.vars <- 4 # Number of independent variables
set.seed(17)
z <- matrix(rnorm(n.obs*(n.vars)), ncol=n.vars)
z.mean <- apply(z, 2, mean)
z <- sweep(z, 2, z.mean)
colnames(z) <- c("B","C","D","E") # Optional; modify to match `n.vars` in length

Here we go with two scenarios and three levels of error applied to each. The original variables $B, \ldots, E$ are retained throughout without change: only $A$ and the error terms vary.

Results

(The figure is a 2 × 3 grid of panels: the top row corresponds to $A=B+C+D+E$ and the bottom row to $A=B+(C+D)/2+E$, each with no, small, and large error; the code to reproduce it appears at the end of this answer.)

The output associated with the upper left panel was

       A  B  C  D  E
Comp.5 1 -1 -1 -1 -1

This says that the row of red dots--which is constantly at $0$, demonstrating a perfect multicollinearity--consists of the combination $0 \approx A -B-C-D-E$: exactly what was specified.

The output for the upper middle panel was

       A     B     C     D     E
Comp.5 1 -0.95 -1.03 -0.98 -1.02

The coefficients are still close to what we expected, but they are not quite the same due to the error introduced. It thickened the four-dimensional hyperplane within the five-dimensional space implied by $(A,B,C,D,E)$ and that tilted the estimated direction just a little. With more error, the thickening becomes comparable to the original spread of the points, making the hyperplane almost impossible to estimate. Now (in the upper right panel) the coefficients are

       A     B     C     D     E
Comp.5 1 -1.33 -0.77 -0.74 -1.07

They have changed quite a bit but still reflect the basic underlying relationship $A' = B' + C' + D' + E'$ where the primes denote the values with the (unknown) error removed.

The bottom row is interpreted the same way and its output similarly reflects the coefficients $1, 1/2, 1/2, 1$.

In practice, it is often not the case that one variable is singled out as an obvious combination of the others: all coefficients may be of comparable sizes and of varying signs. Moreover, when there is more than one dimension of relations, there is no unique way to specify them: further analysis (such as row reduction) is needed to identify a useful basis for those relations. That's how the world works: all you can say is that these particular combinations that are output by PCA correspond to almost no variation in the data. To cope with this, some people use the largest ("principal") components directly as the independent variables in the regression or the subsequent analysis, whatever form it might take. If you do this, do not forget first to remove the dependent variable from the set of variables and redo the PCA!


Here is the code to reproduce this figure:

par(mfrow=c(2,3))
beta <- c(1,1,1,1) # Also can be a matrix with `n.obs` rows: try it!
process(z, beta, sd=0, main="A=B+C+D+E; No error")
process(z, beta, sd=1/10, main="A=B+C+D+E; Small error")
process(z, beta, sd=1/3, threshold=2/3, main="A=B+C+D+E; Large error")

beta <- c(1,1/2,1/2,1)
process(z, beta, sd=0, main="A=B+(C+D)/2+E; No error")
process(z, beta, sd=1/10, main="A=B+(C+D)/2+E; Small error")
process(z, beta, sd=1/3, threshold=2/3, main="A=B+(C+D)/2+E; Large error")

(I had to fiddle with the threshold in the large-error cases in order to display just a single component: that's the reason for supplying this value as a parameter to process.)


User ttnphns has kindly directed our attention to a closely related thread. One of its answers (by J.M.) suggests the approach described here.

whuber
  • Wow, heres what i understand from your response..regress my variables against any other variable. Use the VIF to then ID related variables..this works. Is it best to do this with chunks of the data at a time? Also do you remove anything if you detect colinearity using the regression prior?.. From what i understand about PCA generally is that you use the largest PCs (explaining most variance) based on the eigenvalues as these explain most variance, these are loaded to varying degrees using the original variables. Im unsure as to what small loadings and what they are colinear with – Samuel Nov 01 '13 at 17:22
  • This answer explains how to interpret the small components: they exhibit the collinearities. Yes, you can use subgroups of variables if you like. The regression method is just to *detect* the presence of collinearity, not to identify the collinear relations: that's what the PCA does. – whuber Nov 01 '13 at 18:37
  • `"loadings," which are linear combinations of the original variables` This statement looks not precise. Loadings $\bf A$ are the coefficients of linear combinations of _components_ in prediction _variables_. So, did you mean $\bf A^{-1}$? – ttnphns Nov 01 '13 at 19:18
  • Also, may I ask you leave your opinion about possible using of sweep operation (http://stats.stackexchange.com/a/16391/3277) in the task of tracking down of linearly dependent subsets of variables? – ttnphns Nov 01 '13 at 19:23
  • @ttnphns I verified the statement about loadings before posting this answer. The verification consisted of performing an SVD of a design matrix $X$, writing $X=UWV^{'}$. The matrix $V$ comprises the linear relations (the "loadings" as reported by `princomp`) and columns of $XV=UW$ are the (weighted) principal components. Limiting $W$ to the columns with the smallest eigenvalues makes the norm of $UW$ small, which is exactly what we want. Indeed, where eigenvalues are $0$, the corresponding columns of $XV$--which are explicitly linear combinations of columns of $X$, as stated--are all zero. – whuber Nov 01 '13 at 19:39
  • @ttnphns I haven't used SPSS since 1977 and cannot remember what its "sweep" operation is. It sounds like it might be a projection orthogonal to a given vector. As such your proposal appears to be manually carrying out some of the steps that SVD will take care of automatically. Note that a full implementation of your approach requires $O(p^2)$ sweeps for $p$ variables, which is unsatisfactory for large problems. My opinion, then, is it looks like it could provide some insight--especially if you would describe what "sweeping" really does--but might be less helpful in practice. – whuber Nov 01 '13 at 19:46
  • sweep operation is a gimmick useful to invert a matrix and get some regressional statistics (such as coefficients) in parallel. SPSS uses it as the main tool in linear regression algorithm. – ttnphns Nov 01 '13 at 20:28
  • @whuber, I can confirm that $XV=UW$ is matrix of loadings: it is _n_ X _p_ (the size of $X$) matrix of loadings of the PCA of _rows_ (not columns) of $X$ . Was that what you indended in your answer? But in any case "loadings" are coefficients modeling data from components (or factors), not vice versa. – ttnphns Nov 01 '13 at 20:33
  • P.S. Ah, I see: you say `The matrix V comprises ... the "loadings" as reported by princomp` Yes, the right eigenvectors V _when normalized_ by the eigenvalues indeed become loadings $A$ for variables. But eigenvectors themselves should not be called "loadings": it is misusage of the word. – ttnphns Nov 01 '13 at 20:53
  • @ttnphns Then please take that up with the `R` developers, whose language I have adopted here. – whuber Nov 01 '13 at 21:04
  • In this case, I use established terminology (see e.g. classic book _Harman. Modern factor analysis_). Eigenvectors are the direction cosines of the transformation. When they are supplied by the information about variation (normalized by eigenvalues) they become loadings. Loadings are the scalar products, e.g. correlations or covariances - and are thus directly comparable to those between the variables. Both eigenvectors and loadings may serve as regressional coefficients in relations b/w the components and the variables. – ttnphns Nov 01 '13 at 21:16
6

You seem to ask a really provoking question: how to detect, given a singular correlation (or covariance, or sum-of-squares-and-cross-products) matrix, which column is linearly dependent on which. I tentatively suppose that the sweep operation could help. Here is my probe in SPSS (not R) to illustrate.

Let's generate some data:

        v1        v2        v3         v4          v5
    -1.64454    .35119   -.06384    -1.05188     .25192
    -1.78520   -.21598   1.20315      .40267    1.14790
     1.36357   -.96107   -.46651      .92889   -1.38072
     -.31455   -.74937   1.17505     1.27623   -1.04640
     -.31795    .85860    .10061      .00145     .39644
     -.97010    .19129   2.43890     -.83642    -.13250
     -.66439    .29267   1.20405      .90068   -1.78066
      .87025   -.89018   -.99386    -1.80001     .42768
    -1.96219   -.27535    .58754      .34556     .12587
    -1.03638   -.24645   -.11083      .07013    -.84446

Let's create some linear dependency between V2, V4 and V5:

compute V4 = .4*V2+1.2*V5.
execute.

So, we modified our column V4.

matrix.
get X. /*take the data*/
compute M = sscp(X). /*SSCP matrix, X'X; it is singular*/
print rank(M). /*with rank 5-1=4, because there's 1 group of interdependent columns*/
loop i= 1 to 5. /*Start iterative sweep operation on M from column 1 to column 5*/
-compute M = sweep(M,i).
-print M. /*That's printout we want to trace*/
end loop.
end matrix.

The printouts of M in 5 iterations:

M
     .06660028    -.12645565    -.54275426    -.19692972    -.12195621
     .12645565    3.20350385    -.08946808    2.84946215    1.30671718
     .54275426    -.08946808    7.38023317   -3.51467361   -2.89907198
     .19692972    2.84946215   -3.51467361   13.88671851   10.62244471
     .12195621    1.30671718   -2.89907198   10.62244471    8.41646486

M
     .07159201     .03947417    -.54628594    -.08444957    -.07037464
     .03947417     .31215820    -.02792819     .88948298     .40790248
     .54628594     .02792819    7.37773449   -3.43509328   -2.86257773
     .08444957    -.88948298   -3.43509328   11.35217042    9.46014202
     .07037464    -.40790248   -2.86257773    9.46014202    7.88345168

M
    .112041875    .041542117    .074045215   -.338801789   -.282334825
    .041542117    .312263922    .003785470    .876479537    .397066281
    .074045215    .003785470    .135542964   -.465602725   -.388002270
    .338801789   -.876479537    .465602725   9.752781632   8.127318027
    .282334825   -.397066281    .388002270   8.127318027   6.772765022

M
   .1238115070   .0110941027   .0902197842   .0347389906   .0000000000
   .0110941027   .3910328733  -.0380581058  -.0898696977  -.3333333333
   .0902197842  -.0380581058   .1577710733   .0477405054   .0000000000
   .0347389906  -.0898696977   .0477405054   .1025348498   .8333333333
   .0000000000   .3333333333   .0000000000  -.8333333333   .0000000000

M
   .1238115070   .0110941027   .0902197842   .0347389906   .0000000000
   .0110941027   .3910328733  -.0380581058  -.0898696977   .0000000000
   .0902197842  -.0380581058   .1577710733   .0477405054   .0000000000
   .0347389906  -.0898696977   .0477405054   .1025348498   .0000000000
   .0000000000   .0000000000   .0000000000   .0000000000   .0000000000

Notice that column 5 eventually became full of zeros. This means (as I understand it) that V5 is linearly tied with some of the preceding columns. Which columns? Look at the last iteration in which column 5 is not yet full of zeros: iteration 4. We see there that V5 is tied with V2 and V4 with coefficients -.3333 and .8333: V5 = -.3333*V2 + .8333*V4, which corresponds to what we did to the data: V4 = .4*V2 + 1.2*V5.

That's how we knew which column is linearly tied with which other. I didn't check how helpful the above approach is in the more general case with many groups of interdependencies in the data, but in the above example it appeared helpful.
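(For R users: the sweep operator itself is not built into base R, but a related check, reading the coefficients off the near-null space of the data matrix via an SVD, recovers the same relationship. A sketch, assuming `X` holds the ten rows above with V4 recomputed as .4*V2 + 1.2*V5:)

s <- svd(X)
nullvec <- s$v[, s$d < max(s$d) * 1e-8]   # right singular vector(s) with (near-)zero singular value
round(nullvec / -nullvec[4], 2)           # ~ (0, 0.4, 0, -1, 1.2), i.e. V4 = .4*V2 + 1.2*V5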

ttnphns
5

What I'd try to do here for diagnostic purposes is to take the $502\times 480$ matrix (that is, the transpose) and determine the singular values of the matrix (for diagnostic purposes, you don't need the full singular value decomposition... yet). Once you have the 480 singular values, check how many of those are "small" (a usual criterion is that a singular value is "small" if it is less than the largest singular value times the machine precision). If there are any "small" singular values, then yes, you have linear dependence.
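(A sketch of this check in R, assuming the 502 × 480 data matrix is stored in `X`:)

d <- svd(X, nu = 0, nv = 0)$d        # singular values only; no need for U or V yet
tol <- max(d) * .Machine$double.eps  # the "small" criterion described above
sum(d < tol)                         # number of near-zero singular values, i.e. of linear dependencies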

3

I ran into this issue roughly two weeks ago and decided that I needed to revisit it because when dealing with massive data sets, it is impossible to do these things manually.

I created a for() loop that calculates the rank of the matrix one column at a time. On the first iteration the rank will be 1, on the second 2, and so on; this continues until the rank becomes LESS than the column number you are using.

Very straightforward:

for (i in 1:47) {                      # 47 = number of columns in my data set
  print(qr(data.frame[1:i])$rank)      # rank of the first i columns
  print(i)                             # iteration number
  print(colnames(data.frame)[i])       # column name for reference
  print("###")                         # divider for easy scrolling
}

for() loop breakdown

  1. calculates the rank for the ith column
  2. prints the iteration number
  3. prints the column name for reference
  4. divides the console with "###" so that you can easily scroll through

I am sure that you can add an if statement (a sketch of that is below); I don't need one yet because I am only dealing with 50ish columns.
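A sketch of that if-statement version (assuming the data frame is called `df` and all its columns are numeric; a column is flagged whenever adding it fails to raise the rank):

r.prev <- 0
for (i in 1:ncol(df)) {
  r <- qr(df[1:i])$rank
  if (r == r.prev) {   # adding column i did not increase the rank
    print(paste0("Column ", i, " (", colnames(df)[i], ") is linearly dependent on earlier columns"))
  }
  r.prev <- r
}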

Hope this helps!

Nick P
  • Although there is nothing wrong with this theoretically, it is a numerically unstable and inefficient algorithm. Especially with large numbers of columns it can fail to detect near-collinearity and falsely detect collinearity where none exists. – whuber Aug 29 '16 at 14:09
2

The rank $r$ of a matrix equals the number of linearly independent columns (or rows) of the matrix. For an $n \times n$ matrix $A$, $\text{rank}(A) = n$ implies that all columns (or rows) are linearly independent.
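In R, for example, this can be checked directly (a small illustrative sketch):

A <- matrix(rnorm(25), 5, 5)
qr(A)$rank == ncol(A)   # TRUE: a random square matrix is almost surely full rank
A[, 5] <- A[, 1] + A[, 2]
qr(A)$rank == ncol(A)   # FALSE: column 5 is now a linear combination of columns 1 and 2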

Arun
2

Not that the answer @whuber gave really needs to be expanded on, but I thought I'd provide a brief description of the math.

If $\mathbf{X'Xv}=\mathbf{0}$ for some $\mathbf{v}\neq\mathbf{0}$, then $\mathbf{v}$ is an eigenvector of $\mathbf{X'X}$ associated with the eigenvalue $\lambda=0$, and then $\mathbf{Xv}=\mathbf{0}$ as well, so $\mathbf{v}$ gives the coefficients of an exact linear relationship among the columns of $\mathbf{X}$. (The eigenvectors of $\mathbf{X'X}$ are the right singular vectors of $\mathbf{X}$, and its eigenvalues are the squared singular values of $\mathbf{X}$.) Consequently the eigenvectors of $\mathbf{X'X}$ associated with eigenvalues near $\lambda=0$ represent the coefficients of approximate linear relationships among the regressors. Principal Component Analysis works with the eigenvectors and eigenvalues of $\mathbf{X'X}$ (for centered $\mathbf{X}$), so you can use the eigenvectors $\mathbf{v}$ associated with small $\lambda$ to determine whether linear relationships exist among some of your regressors.

One method of determining whether an eigenvalue is small enough to indicate collinearity is to use the condition indices: $$ \kappa_j=\frac{\lambda_{max}}{\lambda_j} $$ which measure the size of each eigenvalue relative to the largest. A general rule of thumb is that moderate multicollinearity is associated with a condition index between 100 and 1,000, while severe multicollinearity is associated with a condition index above 1,000 (Montgomery, 2009).

It's important to use an appropriate method for determining whether an eigenvalue is small, because it is not the absolute size of the eigenvalues but the relative size of the condition index that matters, as the following example shows. Consider the matrix $$ \mathbf{X'X}=\left[\begin{array}{rrr} 0.001 & 0 & 0 \\ 0 & 0.001 & 0 \\ 0 & 0 & 0.001 \\ \end{array} \right]. $$ The eigenvalues of this matrix are $\lambda_1=\lambda_2=\lambda_3=0.001$. Although these eigenvalues appear small, the condition index is $$ \kappa=\frac{\lambda_{max}}{\lambda_{min}}=1, $$ indicating an absence of multicollinearity; in fact, the columns of this matrix are linearly independent.
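A quick sketch of this computation in R (assuming the regressors are in a numeric matrix `X`; the names are illustrative):

ev <- eigen(crossprod(X), symmetric = TRUE, only.values = TRUE)$values  # eigenvalues of X'X
kappa_j <- max(ev) / ev   # condition indices as defined above
max(kappa_j)              # > 1000 suggests severe multicollinearity by the rule of thumb
# (if X'X is numerically singular, the smallest eigenvalue may come out as roughly zero or slightly negative)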

Citations

Montgomery, D. (2012). Introduction to Linear Regression Analysis, 5th Edition. John Wiley & Sons Inc.

tjnel
  • Use of the condition indices is a good idea (+1). I would like only to point out two things. First, it would be more numerically stable and more directly relevant to compute their reciprocals: divide each eigenvalue by the largest of them all and see how close to zero it is. Second (referring to your initial discussion), unless $X$ is square, it cannot have eigenvalues or eigenvectors: the concept makes no sense for non-square matrices. – whuber Aug 29 '16 at 14:06
  • Why dont you perform a $QR$-Decomposition of $X$ (which may be $n$ by $k$, $n >> k$)? Any rank deficiencies of $X$ are also in $R$ on which you could perform the aforementioned eigenvalue-decomposition to detect linear dependent columns (which are easy identified even under pivoting) - this is the reference: page 179 of Wood Generalized Additive Models an Introduction with R. – Druss2k Sep 09 '19 at 20:59