
I'm fitting a logistic regression model to my data with R's caret package. I aim to predict whether Hillary or Trump will win a given county.

The relevant code:

logisticSettings <- trainControl(method = "cv", number = 10, returnResamp = "all",
                                 classProbs = TRUE, summaryFunction = twoClassSummary)
logisticModel <- train(electTrain[, 2:length(electTrain)],  # features
                       make.names(electTrain[, 1]),         # class labels, made syntactically valid
                       method = "plr", metric = "ROC", trControl = logisticSettings)

electTrain is my training dataset; the first column holds the class labels and the rest are features. When I run this, I get the following error:

Error in solve.default(ddf) : 
system is computationally singular: reciprocal condition number = 9.55304e-17 

I think this stems, at least in part, from the data being highly correlated. For example, one column is the 2010 population and another is a 2010 population estimate. To remedy this, I removed columns from my training set until no two features were correlated above 0.92 (an arbitrary cutoff).
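caret's findCorrelation() can automate this pruning; a minimal sketch of that approach, assuming the features in columns 2 onward are all numeric:

library(caret)

features <- electTrain[, 2:length(electTrain)]
highCor  <- findCorrelation(cor(features), cutoff = 0.92)  # indices of columns to drop
if (length(highCor) > 0) features <- features[, -highCor]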

But the error persists. What's wrong? Some ideas:

  • The correlation cutoff is still too high.

  • One column is approximately a linear combination of two or more others (a check for this is sketched after this list).

  • I've made a mistake in the code.
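The second possibility can at least be checked mechanically: caret's findLinearCombos() QR-decomposes the feature matrix and reports any columns that are exact linear combinations of others (near-exact combinations can still cause trouble, but this rules out the worst case). A sketch, again assuming numeric features:

library(caret)

combos <- findLinearCombos(as.matrix(electTrain[, 2:length(electTrain)]))
combos$linearCombos  # groups of linearly dependent columns, if any
combos$remove        # columns caret suggests dropping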

Noah Walton

2 Answers


Removing highly correlated (or identical) variables by hand can work, but:

  • It becomes infeasible as the number of variables grows.
  • Selecting which variables to remove by hand is purely arbitrary.
  • With factor variables, it is slightly harder to detect correlated variables (unless you look at the dummy-coded predictors).
  • Singularity can also arise because a variable is a linear combination of other variables, which needs further preprocessing to detect.
I would recommend ridge regression / Tikhonov regularization instead:

  • It makes the matrix invertible by introducing a penalty.
  • If some of the variables are identical, they receive the same weight.
  • It is easy to use (and fast) via the R package glmnet (a minimal sketch follows this list).
  • The penalization parameter can be selected by cross-validation.
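A minimal sketch of that route, assuming the same data layout as in the question (class labels in column 1, numeric features in the rest); alpha = 0 selects the pure ridge penalty:

library(glmnet)

x <- as.matrix(electTrain[, 2:length(electTrain)])
y <- factor(electTrain[, 1])

fit <- cv.glmnet(x, y, family = "binomial", alpha = 0)  # cross-validates the penalty lambda
probs <- predict(fit, newx = x, s = "lambda.min", type = "response")  # class probabilities

This also slots into the caret workflow from the question via method = "glmnet" in train(), which tunes alpha and lambda as hyperparameters.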
RUser4512

RUser4512 has a good answer (+1). I just want to add some comments on the matrix condition number, which we can use to diagnose numerical stability issues. In R, the relevant function is kappa.

Here is an example in R. We create data with two highly correlated columns, x1 and x2. Note that they are not identical, just very close.

In experiment 1, the system has a large condition number, but R can still solve it.

> set.seed(0)
> x1=runif(1e3)*1e3
> x2=x1+runif(1e3)*1e-3
> x=cbind(x1,x2)
> y=runif(1e3)

> kappa(t(x) %*% x)
[1] 8.855766e+12

> solve(t(x) %*% x, t(x) %*%y)
        [,1]
x1 -399.9371
x2  399.9375

In experiment 2, we rescale the second column by a factor of 1e-3, so the two columns now differ in scale by a factor of a thousand while remaining almost perfectly collinear. The condition number grows by several more orders of magnitude, and R produces the error you described.

> x[,2]=x[,2]*1e-3
> kappa(t(x) %*% x)
[1] 2.220277e+18
> solve(t(x) %*% x, t(x) %*%y)
Error in solve.default(t(x) %*% x, t(x) %*% y) : 
  system is computationally singular: reciprocal condition number = 4.49945e-19
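
To tie this back to the ridge recommendation above: adding a penalty term lambda * I to t(x) %*% x bounds its smallest eigenvalue away from zero, so the same solve goes through. A sketch continuing the session above (lambda = 1 is an arbitrary illustrative value; in practice it would be chosen by cross-validation):

lambda <- 1
kappa(t(x) %*% x + lambda * diag(2))             # condition number drops dramatically
solve(t(x) %*% x + lambda * diag(2), t(x) %*% y) # ridge solution; no error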
Haitao Du