
For learning purposes I have been using a subset of the MNIST data: just the training set and just the digits 3 and 8. I want to try linear regression for this classification. (Yes, I realise that logistic regression is more appropriate.) In R syntax (this is not an R-specific question), I do:

b <- qr.solve(cbind(1, x), y)

  • qr.solve() solves systems of equations via the QR decomposition
  • x is the data matrix, one row per handwritten digit, one column per pixel (28x28 = 784 columns)
  • the cbind() adds a column of 1s onto x so that I can fit an intercept
  • y is the true identity of each digit, 0 = 3, 1 = 8
  • b is the vector of fitted coefficients

The problem is that qr.solve() fails with an error saying that x is singular. I guess this means the columns of x are linearly dependent, which can happen in multiple ways. I have removed duplicate columns, zero columns, and columns that contain only a single value. After doing this, qr.solve() still chokes and says x is singular.
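For concreteness, the cleanup I mean is something like this (a rough sketch, not the exact code I ran):

# Sketch of the cleanup described above (illustrative only)
constant <- apply(x, 2, function(col) length(unique(col)) <= 1)  # zero or single-value columns
x <- x[, !constant]
x <- x[, !duplicated(t(x))]                                      # duplicate columns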

How can I determine which columns are causing trouble? How can I modify x so that the linear dependence is gone? (I realise that lasso and ridge regression would be good options here to regularise x, i.e. down-weight linearly dependent columns.) One thing that works is to add noise to x, but that seems fishy to me.
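For reference, a sketch of one possible approach using base R's pivoted QR (qr() with the default routine does column pivoting and returns a numerical rank and pivot order; I am not claiming this is the best way):

# Sketch: use the pivoted QR to find a full-rank subset of columns
X <- cbind(1, x)
qX <- qr(X)                               # default qr() pivots columns and estimates the rank
keep <- qX$pivot[seq_len(qX$rank)]        # a full-rank subset of columns of X
drop <- qX$pivot[-seq_len(qX$rank)]       # columns numerically dependent on the kept ones
# (column 1 of X is the intercept, so pixel j corresponds to column j + 1)
b <- qr.solve(X[, keep], y)               # fit on the reduced design matrix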

Bill Simpson
  • It is possible that *all* columns are "causing trouble," because linear independence is not a relationship among any two of the columns: it's a property of all of them at once. One way to get around this is to find a subset of linearly dependent columns with the property that removing any one of those will leave a subset of independent columns. You can repeat the procedure to produce a nonsingular matrix. Finding such a subset is easy in principle: keep adjoining columns to a linearly independent set until the set first becomes linearly dependent. – whuber Jun 21 '17 at 14:09
  • thanks @whuber. If I understand you, the suggestion is (in R notation) to do `qr(x[,-i])$rank` for i = 1 to ncol(x): find the rank of x when column 1 is missing, when column 2 is missing, etc. If the rank is not affected by deleting a column, then that column is linearly dependent on the others. Unfortunately this is incredibly slow for a machine-learning set-up like mine, where x has 784 cols and 8414 rows. (I tried it.) – Bill Simpson Jun 21 '17 at 14:28
  • You don't drop columns one at a time. At step `i` you check whether `x[, 1:i]` is singular, proceeding from `i=1` onward. When `x[, 1:i]` first becomes singular, you may drop any single one of the columns indexed between `1` and `i`. Although `R` doesn't seem to have a built-in procedure to sequentially develop a QR decomposition, efficient methods to do so exist: see https://stats.stackexchange.com/questions/6920/ for a discussion of efficient addition of rows; there are comparable algorithms to add columns. I processed a dataset your size with brute force in 100 seconds. – whuber Jun 21 '17 at 16:10
  • I did as suggested (see the sketch after these comments) and found that the problematic columns were those with only about 3 nonzero entries. Of course the issue is not that the values are zero: if those entries in a column were any other constant it would cause the same problem. This is a big deal for pattern recognition, because the image background is constant (zero for these MNIST digits) everywhere except the character being recognised. – Bill Simpson Jun 22 '17 at 14:34
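
A brute-force sketch of the incremental procedure described in the comments above (assuming x and y as in the question; a proper column-updating QR would be faster, but this illustrates the idea):

# Grow a set of columns, skipping any column that does not raise the rank
X <- cbind(1, x)                          # intercept first, so constant columns get flagged
keep <- 1L                                # always keep the intercept column
for (j in 2:ncol(X)) {
  cand <- X[, c(keep, j), drop = FALSE]
  if (qr(cand)$rank > length(keep)) keep <- c(keep, j)   # column j adds new information
}
# X[, keep] now has full column rank; the omitted columns are dependent on earlier ones
b <- qr.solve(X[, keep], y)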

0 Answers