
I'm fitting a logistic regression model to my data with R's caret package. I aim to predict whether Hillary or Trump will win a given county.

The relevant code:

logisticSettings <- trainControl(method = "cv", number = 10, returnResamp = "all",
                                 classProbs = TRUE, summaryFunction = twoClassSummary)
logisticModel <- train(electTrain[, 2:length(electTrain)],  # features
                       make.names(electTrain[, 1]),         # class labels, made syntactically valid
                       method = "plr", metric = "ROC", trControl = logisticSettings)

electTrain is my training dataset; the first column holds the class labels and the rest are features. When I run this, I get the following error:

Error in solve.default(ddf) : 
system is computationally singular: reciprocal condition number = 9.55304e-17 

I think this stems, at least in part, from the data being highly correlated. For example, one column is the 2010 population and another is a 2010 population estimate. To remedy this, I removed columns from my training set until no two features were correlated above 0.92 (an arbitrary cutoff).
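caret's findCorrelation() can automate this pruning; a minimal sketch of that approach, assuming the features in columns 2 onward are all numeric:

library(caret)

features <- electTrain[, 2:length(electTrain)]
highCor  <- findCorrelation(cor(features), cutoff = 0.92)  # indices of columns to drop
if (length(highCor) > 0) features <- features[, -highCor]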

But the error persists. What's wrong? Some ideas:

  • The correlation cutoff is still too high.

  • One column is approximately a linear combination of two or more others (a check for this is sketched after this list).

  • I've made a mistake in the code.
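The second possibility can at least be checked mechanically: caret's findLinearCombos() QR-decomposes the feature matrix and reports any columns that are exact linear combinations of others (near-exact combinations can still cause trouble, but this rules out the worst case). A sketch, again assuming numeric features:

library(caret)

combos <- findLinearCombos(as.matrix(electTrain[, 2:length(electTrain)]))
combos$linearCombos  # groups of linearly dependent columns, if any
combos$remove        # columns caret suggests dropping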

Noah Walton

2 Answers


Removing highly correlated (or identical) variables by hand can work, but:

  • It becomes infeasible as the number of variables grows.
  • Selecting which variables to remove by hand is purely arbitrary.
  • With factor variables, it is slightly harder to detect correlated variables (unless you look at the dummy-coded predictors).
  • Singularity can also arise because a variable is a linear combination of other variables, which needs further preprocessing to detect.
I would recommend ridge regression / Tikhonov regularization instead:

  • It makes the matrix invertible by introducing a penalty.
  • If some of the variables are identical, they receive the same weight.
  • It is easy to use (and fast) via the R package glmnet (a minimal sketch follows this list).
  • The penalization parameter can be selected by cross-validation.
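A minimal sketch of that route, assuming the same data layout as in the question (class labels in column 1, numeric features in the rest); alpha = 0 selects the pure ridge penalty:

library(glmnet)

x <- as.matrix(electTrain[, 2:length(electTrain)])
y <- factor(electTrain[, 1])

fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # cross-validates the penalty lambda
probs <- predict(fit, newx = x, s = "lambda.min", type = "response")  # class probabilities

This also slots into the caret workflow from the question via method = "glmnet" in train(), which tunes alpha and lambda as hyperparameters.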
RUser4512

RUser4512 has a good answer (+1). I just want to add some comments on the matrix condition number, which we can use to diagnose numerical stability issues. In R, the relevant function is kappa.

Here is an example in R. We create data with two highly correlated columns, x1 and x2. Note that they are not identical, just very close.

In experiment 1, the system has a large condition number, but R can still solve it.

> set.seed(0)
> x1=runif(1e3)*1e3
> x2=x1+runif(1e3)*1e-3
> x=cbind(x1,x2)
> y=runif(1e3)

> kappa(t(x) %*% x)
[1] 8.855766e+12

> solve(t(x) %*% x, t(x) %*%y)
        [,1]
x1 -399.9371
x2  399.9375

In experiment 2, we rescale the second column by a factor of 1e-3, so the two columns now differ in scale by a factor of a thousand while remaining almost perfectly collinear. The condition number grows by several more orders of magnitude, and R produces the error you described.

> x[,2]=x[,2]*1e-3
> kappa(t(x) %*% x)
[1] 2.220277e+18
> solve(t(x) %*% x, t(x) %*%y)
Error in solve.default(t(x) %*% x, t(x) %*% y) : 
  system is computationally singular: reciprocal condition number = 4.49945e-19
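
To tie this back to the ridge recommendation above: adding a penalty term lambda * I to t(x) %*% x bounds its smallest eigenvalue away from zero, so the same solve goes through. A sketch continuing the session above (lambda = 1 is an arbitrary illustrative value; in practice it would be chosen by cross-validation):

lambda <- 1
kappa(t(x) %*% x + lambda * diag(2))             # condition number drops dramatically
solve(t(x) %*% x + lambda * diag(2), t(x) %*% y) # ridge solution; no error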
Haitao Du