2

I have 100 datasets. Each of them has a different number of features, and each contains around 20,000 samples. The $i$-th sample has the same label ($0/1$) in every one of the 100 datasets. The data is highly imbalanced, so there are far fewer positive labels than 0 labels.

I want to assign a score/weight to each dataset that describes how well it represents the labels. How can I do this?

user34790
  • I don't get the "every $i$-th sample" sentence. What is $i$? Are you just trying to say that every sample is labelled with either 0 or 1, irrespective of which data set it is from? – Erik Jan 07 '14 at 15:12
  • Simple: make a classifier per data set and report its performance (accuracy, AUC, ...). – Marc Claesen Jan 07 '14 at 15:48
  • If I understand correctly: the nature of the observations (e.g. patient details) and the binary outcomes in all datasets are the same (e.g. Dead/Alive), but they can have different numbers of features (e.g. one dataset may have age, the other not). This is unusual. Can you clarify what you mean by the "every $i$-th sample" sentence, as @Erik has asked. – Zhubarb Jun 23 '14 at 10:38
  • Perhaps you could find [this nice answer](http://stats.stackexchange.com/a/63549/31372) and [my recent answer](http://datascience.stackexchange.com/a/4833/2452) relevant and helpful. – Aleksandr Blekh Jan 09 '15 at 09:13

2 Answers

1

Build a model on each dataset and report your resampled evaluation metric.

Build it properly, using nested resampling, with an inner tuning loop and an outer testing loop.

Given the class imbalance, you have to make some choices about the resampling strategy's parameters and stratification (i.e. every resampled test set must contain at least one positive, $1$, sample), the evaluation metric (e.g. $AUC$ is more informative than accuracy here), and so on.
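A minimal sketch of this per-dataset scoring, assuming scikit-learn, logistic regression as an illustrative model, and AUC as the resampled metric; `datasets` in the final commented line is a hypothetical list holding the 100 $(X_i, y_i)$ pairs:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score


def nested_cv_auc(X, y, seed=0):
    """Nested CV: the inner loop tunes C, the outer loop estimates AUC."""
    # Stratified folds so every resample contains at least one positive;
    # the number of splits must not exceed the number of positive samples.
    inner = StratifiedKFold(n_splits=3, shuffle=True, random_state=seed)
    outer = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    tuned_model = GridSearchCV(
        LogisticRegression(max_iter=1000, class_weight="balanced"),
        param_grid={"C": [0.01, 0.1, 1.0, 10.0]},
        scoring="roc_auc",
        cv=inner,
    )
    scores = cross_val_score(tuned_model, X, y, scoring="roc_auc", cv=outer)
    return scores.mean()


# Demo on one synthetic, imbalanced dataset (a stand-in for one of the 100):
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=15, weights=[0.99], random_state=0
)
print(nested_cv_auc(X_demo, y_demo))

# dataset_scores = [nested_cv_auc(X_i, y_i) for X_i, y_i in datasets]
```

The resulting per-dataset AUCs can then be used directly as the scores/weights the question asks for.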

Firebug
0

In your case, I would very much like to try a method like the group lasso for logistic regression: collate your 100 data sets into one big data set $X$ of size $20\,000 \times p$, where $p$ is the total number of variables. Then apply group-lasso logistic regression to $(X, y)$, $y$ being the vector of classes (0s and 1s), and the groups being the 100 groups of variables (one per original data set).
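A hedged sketch of that construction, on synthetic stand-in data; the fit uses the third-party `group-lasso` package, and the `LogisticGroupLasso` estimator and the parameters shown are assumptions to verify against that package's documentation:

```python
import numpy as np
# Third-party package (pip install group-lasso); the estimator name and
# parameters below are assumptions -- check that package's docs.
from group_lasso import LogisticGroupLasso

# Synthetic stand-ins for the datasets (5 small ones here for illustration);
# in practice X_list would hold the 100 real feature matrices, same row order.
rng = np.random.default_rng(0)
X_list = [rng.normal(size=(200, rng.integers(3, 8))) for _ in range(5)]
y = (rng.random(200) < 0.05).astype(int)    # highly imbalanced 0/1 labels

X = np.hstack(X_list)                       # n x p, p = sum of the p_i
groups = np.concatenate(
    [np.full(X_i.shape[1], g) for g, X_i in enumerate(X_list)]
)                                           # dataset index for every column

model = LogisticGroupLasso(
    groups=groups,      # one group per original data set
    group_reg=0.05,     # strength of the group penalty (should be tuned)
    l1_reg=0.0,         # no additional within-group sparsity
)
model.fit(X, y)
```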

A posteriori, you could sort the groups according to their importance in the model by considering the $\ell_2$-norm of their coefficients (normalized if the groups have very different sizes).
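And a small follow-up sketch of that scoring step, assuming a flat coefficient vector (one entry per column of $X$) can be extracted from whichever group-lasso implementation is used; dividing the $\ell_2$-norm by the square root of the group size is one reasonable normalization, not the only one:

```python
import numpy as np


def group_scores(coef, groups):
    """L2 norm of each group's coefficients, normalized by sqrt(group size)."""
    coef = np.asarray(coef).ravel()
    groups = np.asarray(groups)
    return {
        int(g): float(np.linalg.norm(coef[groups == g]) / np.sqrt((groups == g).sum()))
        for g in np.unique(groups)
    }


# With the fit above (coefficient array shapes vary between implementations,
# so the column slice below is an assumption):
# scores = group_scores(np.asarray(model.coef_)[:, 0], groups)
# ranking = sorted(scores, key=scores.get, reverse=True)  # most informative first
```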

  • In this approach you build a model based on the concatenation of all data sets. It does not answer the OP's question, which was to assign weights to each data set as a measure of informativeness. – Marc Claesen Jan 07 '14 at 15:50
  • Guys, remember my dataset is highly unbalanced. I have very few 1s, not even 10, and around 19,990 0s. – user34790 Jan 07 '14 at 16:31
  • The importance of each group of variables can be measured through the $\ell_2$-norm of its coefficients. As for the fact that the classes are very unbalanced (99% zeros!), I have to admit that logistic regression may well not be a wise choice... – Vincent Guillemot Jan 07 '14 at 16:34