Based on kjetil b halvorsen's suggestion, I rephrased my problem:

My problem is analogous to the following: I am supposed to predict whether a high school student will go to university (Yes/No).

I have some data (high school scores etc.) and I can build a model with a modest AUC < 0.7. In addition, I have event data (student id, date, event id), and domain experts say it includes very useful information. However, I cannot find a way to improve my AUC with that data.

I think the problems with including the event data are:

  1. the data is very sparse: most students do not have any events, and those who do have only a few event bits set
  2. some of the events are strongly associated, so that e.g. events E1 and E221 are practically the same

So my question is: what is a typical and statistically sound way to combine the sparse binary events E1 - E500 into, e.g., five event clusters C1 - C5 to be used as features in a classification problem?
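One standard option is to cluster the event columns on a binary similarity measure such as Jaccard distance and then collapse each cluster into a single "has any event in Ck" indicator. Below is a minimal sketch with SciPy; the data is simulated (1000 students, 20 events with near-duplicate pairs standing in for things like E1/E221), so all sizes and probabilities here are made-up assumptions, not a description of the real data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)

# Toy stand-in for the real data: 1000 students x 20 sparse binary events.
# Columns 0-9 and 10-19 form near-duplicate pairs (like E1 and E221).
base = (rng.random((1000, 10)) < 0.05).astype(int)
noisy_copy = base ^ (rng.random((1000, 10)) < 0.01).astype(int)
X = np.hstack([base, noisy_copy])

# Jaccard distance between event columns: near 0 when two events
# almost always co-occur, near 1 for unrelated sparse events.
d = pdist(X.T.astype(bool), metric="jaccard")

# Average-linkage hierarchical clustering, cut into 5 clusters C1..C5.
Z = linkage(d, method="average")
labels = fcluster(Z, t=5, criterion="maxclust")

# Collapse each cluster into a single feature: "student has any event in Ck".
C = np.column_stack([X[:, labels == k].any(axis=1).astype(int)
                     for k in range(1, 6)])
print(C.shape)  # (1000, 5)
```

The five collapsed columns can then be fed to glmnet in place of the original events. Jaccard (rather than Euclidean or Pearson) is a deliberate choice for sparse binaries: it ignores the overwhelming number of shared zeros.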

My approaches so far:

glmnet/lasso

First I included all the events as binary features in glmnet / lasso, and I got different events included in or dropped from the model each time. I guess this is due to sparsity and cross-correlation.

I did a hack in which I ran glmnet / lasso 1000 times, summed up the coefficients of each event across runs, and finally combined the events into five clusters C1 - C5 based on the coefficient sums. Using C1 - C5 in my model gave a better AUC than using the original E1 - E500, which makes me think that some kind of grouping/clustering would be useful.
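For reference, this hack is essentially a home-grown version of stability selection (Meinshausen & Bühlmann, 2010): fit the lasso on many random subsamples and rank features by how often they are selected, rather than by one unstable fit. A sketch with scikit-learn on simulated data follows; the sample sizes, the five "informative" events, and the penalty `C=0.5` are arbitrary assumptions for illustration, not tuned values.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy data: 1000 students, 50 sparse binary events; outcome driven by E0-E4.
X = (rng.random((1000, 50)) < 0.05).astype(float)
logit = -1.5 + X[:, :5].sum(axis=1) * 2.0
y = (rng.random(1000) < 1 / (1 + np.exp(-logit))).astype(int)

# Stability selection: fit L1-penalised logistic regression on many
# half-sized subsamples and count how often each event gets a nonzero
# coefficient.
n_runs, counts = 200, np.zeros(X.shape[1])
for _ in range(n_runs):
    idx = rng.choice(len(y), size=len(y) // 2, replace=False)
    clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
    clf.fit(X[idx], y[idx])
    counts += (clf.coef_.ravel() != 0)

selection_freq = counts / n_runs
print(np.round(selection_freq[:5], 2))  # frequencies of the informative events
```

Selection frequency is a more stable ranking than the coefficient sum, because a feature that swaps in and out with a correlated twin still accumulates a meaningful frequency.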

MCA

Somebody suggested MCA. I tried it, but the primary dimension explained only 4.9% of the variance, so maybe it is not very useful. I guess the low explained variance is due to data sparsity.
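For what it's worth, with all-binary variables MCA is closely related to an SVD of the centred indicator matrix, so a quick way to check for low-dimensional structure is to inspect the singular value spectrum. The sketch below uses purely independent simulated sparse columns (an assumption, chosen to show the baseline): unstructured sparse data gives a flat spectrum with no dominant dimension, which is one plausible reading of a 4.9% leading dimension.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sparse binary matrix: 1000 students x 100 independent events, ~5% fill.
X = (rng.random((1000, 100)) < 0.05).astype(float)

# Spectrum of the centred matrix: the share of variance per dimension.
Xc = X - X.mean(axis=0)
s = np.linalg.svd(Xc, compute_uv=False)
explained = s**2 / (s**2).sum()
print(np.round(explained[:3], 3))  # flat spectrum => no dominant dimension
```

If the real data showed a few dimensions well above this flat baseline, those dimensions (not just the first) could still be useful regression inputs, as kjetil suggests in the comments.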

logic regression

Logic regression (not logistic regression) is an algorithm for combining binary variables into one binary predictor: http://kooperberg.fhcrc.org/logic/documents/logic-regression.pdf

I tried it, but the results are not stable: it selected totally different variables for the predictor in every run.

Tikke
  • Possible duplicate of [Principled way of collapsing categorical variables with many levels?](https://stats.stackexchange.com/questions/146907/principled-way-of-collapsing-categorical-variables-with-many-levels) – kjetil b halvorsen Jul 11 '19 at 08:48
  • @kjetil, I don't think that is a duplicate. The OP's data are many binary variables, not categorical variable(s) with many levels. – ttnphns Jul 11 '19 at 14:41
  • I edited your question and tags because it appeared to me that your 500 correlated sparse features are binary (i.e. "events"). If that is true, please search for similar questions by tags "binary data" and "sparse". – ttnphns Jul 11 '19 at 14:46
  • @Tikke: It would help if you could tell us what are those 500 different events, and how they are represented, 500 binary variables? What is the response variable? Maybe some clustering on the events, or correspondence analysis on the 500 indicator variables, or ... There is also something called [logic regression](http://kooperberg.fhcrc.org/logic/). – kjetil b halvorsen Jul 11 '19 at 15:16
  • By combining them you mean clustering? – Tim Jul 11 '19 at 18:02
  • The problem is analogous to the following: I am supposed to predict if a high school student will go to university (Yes/No). I have some data (high school scores etc.) and I can make a model with a modest AUC = 0.7. In addition to that, I have event data (student id, date, event id) which I know holds useful information, but I cannot find a way to improve my AUC with it. I think the problem is that the data is sparse (most students do not have any events), some of the events correlate a lot, and there are so many event types that I cannot e.g. cluster them manually. – Tikke Jul 11 '19 at 18:11
  • @Tikke: Can you please add this new information as an edit to the original question? Then more people will see it! (and we want all necessary information to be in the post itself.) – kjetil b halvorsen Jul 11 '19 at 18:25
  • Are you interested in inference, or just prediction (i.e. do you want to draw conclusions about how your response variable is related to the events, or only to predict the response as well as possible)? – user20160 Jul 11 '19 at 23:38
  • @user20160: very good point. My paying customer is interested in prediction, whereas other key stakeholders in the project are interested in inference. – Tikke Jul 15 '19 at 09:05
  • @kjetil b halvorsen: trying logic regression next. It sounds like something I am looking for. – Tikke Jul 15 '19 at 09:12
  • OK. With so much instability: What is your sample size? Do you have any missing data? About correspondence analysis: Maybe not only use the first dimension, maybe the useful information for the regression problem is elsewhere. Try a regression model with many of the factors. Also look at https://arxiv.org/pdf/1508.06885.pdf – kjetil b halvorsen Jul 15 '19 at 12:02
  • If your features are correlated, L2 penalisation (ridge) is better than L1 (lasso), i.e. setting alpha = 0 in glmnet. I would have advised manual 'clustering' of the data into hierarchies of features, and adding them: e.g. if you have the feature 'in after-school baseball club', add the additional features 'in after-school club', 'sports', ... The automated 'unsupervised' way of doing this is by using embeddings, but I doubt you have enough data. – seanv507 Aug 31 '19 at 08:01

1 Answer


I conclude this thread by explaining how I finally (partly) solved this problem. I tried all the proposed approaches without much success. Finally, I asked a domain expert to manually cluster the events E1...E221 into four groups C1..C4 and used those groups in my model. For various reasons, the manual clustering was a significant amount of work, but my model improved to a satisfactory AUC = 0.80 and I was able to proceed with the project.

I want to thank all of you who contributed to this thread. I have learned a lot about cool new statistical tools, and I will study them further, as a new project is lurking around the corner in which the number of events is beyond manual clustering.

So, I am still looking for an algorithm to automatically cluster hundreds of sparse binary variables into about a dozen clusters :-)
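One simple automatic candidate, in the spirit of the manual grouping above: link every pair of events whose correlation exceeds a threshold and take the connected components of the resulting graph as clusters. The sketch below runs on simulated data (30 events, of which ten are near-duplicated, mimicking the E1/E221 situation); the 0.5 threshold is an arbitrary assumption that would need tuning on real data.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

rng = np.random.default_rng(1)

# Toy data: 2000 students x 30 sparse binary events.
# Columns 0-9 are near-duplicated as columns 10-19; 20-29 are independent.
base = (rng.random((2000, 10)) < 0.05).astype(int)
X = np.hstack([base,
               base ^ (rng.random((2000, 10)) < 0.005).astype(int),
               (rng.random((2000, 10)) < 0.05).astype(int)])

# Link events whose (phi) correlation exceeds a threshold, then take
# connected components of the graph as clusters.
R = np.corrcoef(X.T)
adj = csr_matrix((np.abs(R) > 0.5).astype(int))
n_clusters, labels = connected_components(adj, directed=False)
print(n_clusters)  # duplicated pairs merge; independent events stay singletons
```

Compared to cutting a dendrogram at a fixed number of clusters, this does not force a target cluster count, which may or may not be what you want; lowering the threshold merges more aggressively.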

Tikke