LASSO or other regularized regression with censored (missing) data

Question

Here is my problem. I am looking at various time series. Call them total spend, aggregated over all customers, on various products versus time. At any given time, I want to predict the spend on any product for which I don't have data, based on the spend on the products for which I do have data. The number of products is much larger than the number of time points, so this requires a heavily regularized solution. I am starting with LASSO.

The trouble is that each product has a finite lifetime. Outside of this lifetime, a product shows zero spend or is simply missing. Products sometimes also show zero spend within their lifetime, but in that case the zeros really mean zero. Outside of the lifetime, the zeros just indicate that the product was not available, so the customer spend likely went to another available product with similar functionality.

So the question is: what kind of algorithm can fit such a regularized linear model while treating the periods outside of the lifetime as missing data? Assume that I know where to draw the lifetime boundaries.

I'd like implementations in either R or Python. Perhaps the flare package in R is the place to start. In there it is mentioned that the Dantzig selector and CLIME are both tolerant of missing values in the design matrix. Can anyone make a suggestion?

Note that I am trying to build a predictor for all variables, so all the better if I can do this all at once by directly estimating the precision matrix of a Gaussian graphical model.

I'm not sure if this would be truly helpful, since this solution is a bit like using a Howitzer to kill a fly, but I know that you could program this in STAN using only examples in the manual. The hangup here is that programming it *efficiently* requires a bit of knowledge of how STAN does its work. — Sycorax, Aug 03 '14 at 14:08
STAN is probably not the solution here but thanks for mentioning it. It looks interesting. — Dave31415, Aug 03 '14 at 18:28

score 1 · Answer 1 · answered Aug 03 '14 at 19:29

Here is one thing that might be a good solution. I looked at the clime R package but couldn't get it to run. Then I tried the fastclime package for R. It works and is apparently faster than clime (hence the name). This should solve my problem because it accepts either the data matrix X(n,p) or the covariance matrix Cov(p,p).

Because it takes the covariance matrix directly, it can work with missing values: you can always compute the covariance matrix while ignoring missing values. In R you do cov.mat=cov(X,use='pairwise') using the built-in cov function. Per the docs, use='pairwise' (short for 'pairwise.complete.obs') computes each entry of the covariance matrix from only those rows where both variables are observed, ignoring any pair that contains an NA. Any method that takes the sample covariance as input will likewise work. I'd guess that it would also generalize to include any weight on the p-parameters (not just 0,1).
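For anyone who prefers Python, here is a rough sketch of the same idea in plain Python, with no packages required. It is only an illustration under stated assumptions: the helper names mask_outside_lifetime and pairwise_cov are made up for this example, the lifetimes are assumed to be given as half-open row ranges (start, end) per product, and the covariance is meant to mirror the pairwise-complete behaviour of R's cov rather than reproduce its implementation.

```python
import math

def mask_outside_lifetime(X, lifetimes):
    """Return a copy of X (rows = time points, columns = products) with
    entries outside each product's lifetime replaced by NaN.
    lifetimes[j] = (start, end) is an assumed half-open row range."""
    masked = []
    for t, row in enumerate(X):
        new_row = []
        for j, value in enumerate(row):
            start, end = lifetimes[j]
            # Zeros inside the lifetime are real zeros and are kept as-is.
            new_row.append(float(value) if start <= t < end else math.nan)
        masked.append(new_row)
    return masked

def pairwise_cov(X):
    """Sample covariance where each entry (j, k) uses only the rows in
    which both column j and column k are observed (not NaN)."""
    p = len(X[0])
    cov = [[math.nan] * p for _ in range(p)]
    for j in range(p):
        for k in range(j, p):
            pairs = [(row[j], row[k]) for row in X
                     if not math.isnan(row[j]) and not math.isnan(row[k])]
            m = len(pairs)
            if m < 2:
                continue  # too little overlap: leave the entry as NaN
            mean_j = sum(a for a, _ in pairs) / m
            mean_k = sum(b for _, b in pairs) / m
            c = sum((a - mean_j) * (b - mean_k) for a, b in pairs) / (m - 1)
            cov[j][k] = cov[k][j] = c
    return cov

# Toy example: 4 time points, 3 products with different lifetimes.
X = [[1.0, 2.0, 0.0],
     [2.0, 4.0, 1.0],
     [3.0, 6.0, 2.0],
     [0.0, 8.0, 4.0]]
lifetimes = [(0, 3), (0, 4), (1, 4)]  # hypothetical lifetime boundaries
cov_mat = pairwise_cov(mask_outside_lifetime(X, lifetimes))
```

One caveat: a covariance matrix assembled pair by pair like this is not guaranteed to be positive semidefinite, and products whose lifetimes barely overlap give noisy (or undefined) entries, so it is worth checking the matrix before handing it to a solver.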

Still open to other suggestions.
