
I have a large, sparse dgCMatrix matrix in R:

  • ~200,000 rows
  • ~150,000 columns
  • ~1,000,000,000 non-zero entries

R code to generate the matrix:

library(Matrix)

nrows <- 2e5
ncols <- 1.5e5
nnz <- 1e9
set.seed(42)
i <- sample(1:nrows, nnz, replace = TRUE)
j <- sample(1:ncols, nnz, replace = TRUE)
x <- sparseMatrix(i = i, j = j, x = 1, dims = c(nrows, ncols))

I also have a vector y of length 200,000:

cf <- rnorm(ncols)
cf[sample(1:ncols, ncols/2)] <- 0
y <- (x %*% cf)[, 1] + rnorm(nrows) * 100

I'd like to find a set of weights w that minimizes the mean absolute error (MAE): mean(abs(y - x %*% w)).

I can find weights that minimize RMSE (root mean squared error) using glmnet:

model <- cv.glmnet(x, y, family = 'gaussian', nfolds=5)

But so far I can't find a similar package for minimizing MAE/LAD/L1 error. The closest thing I've found is the flare package, but that doesn't support sparse input.

Does anyone know of any R package (or have good tricks) for trying to solve a large-scale, sparse MAE/LAD/L1 regression problem?

Richard Hardy
Zach
  • As soon as you start generating a fit and computing residuals, the matrix will no longer be sparse, so sparseness appears to be of little consequence or use. A quick and dirty way to approximate a fit like the one you seek is Tukey's *median polish*: see http://stats.stackexchange.com/questions/8251, for instance. – whuber Dec 06 '16 at 23:16
  • It looks like the flare package is using a sparse matrix internally, but I'm not 100% sure. Also, what makes MAE regression different than RMSE regression? It seems possible to fit a good RMSE model while maintaining sparsity, why is that harder than with MAE? Thanks a lot. – Zach Dec 07 '16 at 00:51
  • You created two new tags without Wikis. I think it would be a cleaner approach to avoid that. Also, what about quantile regression, doesn't it fit here? – Richard Hardy Dec 07 '16 at 06:27
  • @RichardHardy Oops. How do I delete the new tags/wikis? Also, I think quantile regression would work here, but I can't find a package for quantile regression that works with such a large, sparse matrix. – Zach Dec 07 '16 at 14:24
  • I think they get deleted automatically if there are no questions tagged. So once I removed the tags from your post, they should disappear from the tag list as well (after a while). About sparse quantile stuff -- unfortunately, I don't have any suggestions. – Richard Hardy Dec 07 '16 at 15:27
  • Have you by any chance come across an answer to your question? It is quite interesting (+1). – usεr11852 Jun 28 '17 at 21:33
  • @usεr11852 I answered my own question. Not perfect, but it works: https://stats.stackexchange.com/a/288223/2817 – Zach Jun 30 '17 at 18:10
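Following up on the quantile-regression suggestion in the comments: median regression (quantile regression at $\tau = 0.5$) minimizes exactly the MAE objective above, and the quantreg package includes a sparse interior-point solver, rq.fit.sfn, that takes a SparseM::matrix.csr design matrix. A minimal sketch (the interface details here are my assumption; a small dense matrix is converted purely for illustration):

```r
# Sketch: LAD fit via sparse quantile regression at tau = 0.5.
# Assumes quantreg's rq.fit.sfn accepts a SparseM::matrix.csr design matrix.
library(SparseM)
library(quantreg)

set.seed(1)
n <- 1000; p <- 10
X <- matrix(rnorm(n * p), n, p)
b <- rnorm(p)
y <- as.numeric(X %*% b) + rnorm(n)

Xs <- as.matrix.csr(X)              # compressed sparse row format
fit <- rq.fit.sfn(Xs, y, tau = 0.5) # median regression = LAD
print(fit$coef)
```

For a truly large dgCMatrix you would need to build the matrix.csr representation directly rather than densifying first.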

2 Answers


It's not perfect, but I ended up rolling my own solution with the optim function. There's probably a lot of room for improvement, but it works:

library(Matrix)
set.seed(42)

# Smaller simulated problem: 1e6 x 100 sparse design at 1% density
X <- rsparsematrix(1e6, 1e2, density = 0.01)
CF <- rnorm(ncol(X))
Y <- as.numeric(X %*% CF) + rnorm(nrow(X))

# Objective: mean absolute error between predictions and targets
mae <- function(p, a) mean(abs(p - a))
optim_fun <- function(p) mae(as.numeric(X %*% p), Y)

# Start from zero weights; BFGS approximates gradients numerically here
init <- rep(0, ncol(X))
model <- optim(init, optim_fun, method = "BFGS")
print(model$value)
plot(model$par ~ CF)  # estimated vs. true coefficients
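One likely speed-up, sketched here assuming X, Y, init, and optim_fun from the block above are in scope: pass optim an analytic subgradient so BFGS does not need ncol(X) finite-difference evaluations per iteration. The MAE is non-differentiable where a residual is exactly zero; sign() yields a valid subgradient there, which BFGS usually tolerates in practice.

```r
# Subgradient of mean(abs(X %*% p - Y)) with respect to p:
# (1/n) * t(X) %*% sign(X %*% p - Y)
mae_subgrad <- function(p) {
  resid <- as.numeric(X %*% p) - Y
  as.numeric(Matrix::crossprod(X, sign(resid))) / length(Y)
}

model <- optim(init, optim_fun, gr = mae_subgrad, method = "BFGS")
```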
Zach

Let $X$ be an $n\times p$ real matrix. The problem is

$$\min_{b} \Vert y - Xb\Vert_1 + \lambda\Vert b\Vert_1$$

This can be reformulated as a linear program (the Rglpk package accepts sparse constraint matrices):

$$\min_{b, r, c} \vec 1\cdot r + \lambda\vec 1\cdot c$$ s.t.

$$Xb - r \le y \le Xb + r$$

$$b - c \le 0 \le b + c$$

$$r, c \ge 0$$

You can implement a cross-validation routine for finding $\lambda$.
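The LP above can be sketched with Rglpk as follows, on a small dense example for clarity (the decision vector is ordered $(b, r, c)$; the problem sizes and $\lambda$ here are illustrative):

```r
# Sketch: L1 (LAD) regression with an L1 penalty as a linear program.
library(Rglpk)

set.seed(1)
n <- 50; p <- 5
X <- matrix(rnorm(n * p), n, p)
b_true <- c(2, -1, 0, 0, 1)
y <- as.numeric(X %*% b_true) + rnorm(n) * 0.1
lambda <- 0.1

# Objective: sum(r) + lambda * sum(c); b itself carries no cost.
obj <- c(rep(0, p), rep(1, n), rep(lambda, p))

I_n <- diag(n); I_p <- diag(p)
Z_np <- matrix(0, n, p); Z_pn <- matrix(0, p, n)

mat <- rbind(
  cbind( X,  -I_n, Z_np),   #  Xb - r <= y
  cbind(-X,  -I_n, Z_np),   # -Xb - r <= -y
  cbind( I_p, Z_pn, -I_p),  #  b - c <= 0
  cbind(-I_p, Z_pn, -I_p)   # -b - c <= 0
)
dir <- rep("<=", nrow(mat))
rhs <- c(y, -y, rep(0, 2 * p))

# b is free; r and c keep Rglpk's default lower bound of 0.
bounds <- list(lower = list(ind = 1:p, val = rep(-Inf, p)))

sol <- Rglpk_solve_LP(obj, mat, dir, rhs, bounds = bounds)
b_hat <- sol$solution[1:p]
print(round(b_hat, 2))
```

For the sparse case, `mat` would be assembled as a sparse matrix (e.g. via slam's simple_triplet_matrix) instead of with dense rbind/cbind.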

user231919