
I have a multi-label dataset whose label distribution looks something like this, with the label on the x-axis and the number of rows it occurs in on the y-axis.

```python
## imports
import numpy as np
import pandas as pd
%matplotlib inline
from sklearn.datasets import make_multilabel_classification

## creating dummy data
X, y = make_multilabel_classification(n_samples=100_000, n_features=2,
                                      n_classes=100, n_labels=10, random_state=42)
X.shape, y.shape
# ((100000, 2), (100000, 100))

## making it a dataframe
final_df = pd.merge(left=pd.DataFrame(X), right=pd.DataFrame(y), left_index=True, right_index=True).copy()
final_df.rename(columns={'0_x':'input_1', '1_x':'input_2', '0_y':0, '1_y':1}, inplace=True)
final_df.columns = final_df.columns.astype(str)

## plotting the counts of each label
labels = [str(i) for i in range(100)]
value_counts = final_df.loc[:, labels].sum(axis=0)
value_counts.plot(kind='line')
```

[line plot: number of rows each label occurs in, ranging from a couple hundred to 19K+]
So, there are labels that occur in only a couple hundred rows, while there are also labels that occur in 19K+ rows.
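
For reference, the extremes can be read straight off the `value_counts` Series computed above (a trivial check, shown only to make the spread concrete):

```python
## extremes of the per-label counts plotted above
value_counts.min(), value_counts.max()
```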

I would now like to undersample it, so that the number of rows each label appears in looks something like this:
[line plot: desired label counts after undersampling, with every label capped at roughly 2,000 rows]
So a label should occur in at most around 2,000 rows (+100 is acceptable), while all the under-represented labels should be left as is.
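
For what it's worth, below is a rough sketch of the kind of greedy rule I have in mind, using the `final_df`, `labels` and `np` already defined above. The function name, the cap of 2000 and the random-drop heuristic are all mine, not from any library: it repeatedly drops random rows whose positive labels are all above the cap, so the rarer labels are never touched.

```python
## rough sketch of a greedy, cap-based undersampler (names and heuristic are mine)
def undersample_multilabel(df, label_cols, cap=2000, random_state=42):
    rng = np.random.default_rng(random_state)
    df = df.copy()
    while True:
        counts = df[label_cols].sum(axis=0)
        over = counts[counts > cap].index           # labels still above the cap
        if len(over) == 0:
            break                                   # every label is at or below the cap
        positives = df[label_cols].astype(bool)
        # rows whose positive labels are ALL over-represented ones
        droppable = positives[over].any(axis=1) & ~positives.drop(columns=over).any(axis=1)
        candidates = df.index[droppable]
        if len(candidates) == 0:
            break                                   # can't reduce further without hurting rare labels
        n_drop = min(len(candidates), int(counts.max() - cap))
        df = df.drop(index=rng.choice(candidates, size=n_drop, replace=False))
    return df

# resampled_df = undersample_multilabel(final_df, labels, cap=2000)
# resampled_df[labels].sum(axis=0).plot(kind='line')
```

This only reaches the cap if enough rows carry exclusively over-represented labels, and otherwise it stops early, which is part of why I'm not sure it is the right way to do this.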

I have gone through various under-sampling methods that imbalanced-learn provides, but none of them seemed to support multi-label datasets.
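
For illustration, this is the kind of call I was hoping to make; as far as I can tell, `RandomUnderSampler` (like the other samplers) expects a 1-D target, not the (100000, 100) indicator matrix `y`:

```python
## the call I would have liked to make, for illustration only
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler(random_state=42)
# X_res, y_res = rus.fit_resample(X, y)   # y is multi-label, so this is not supported
```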

How do I do this?

  • Unbalanced classes are almost certainly not a problem, and oversampling will not solve a non-problem: [Are unbalanced datasets problematic, and (how) does oversampling (purport to) help?](https://stats.stackexchange.com/q/357466/1352) – Stephan Kolassa Oct 26 '21 at 10:19
  • I don't have a class imbalance, I have an imbalance in labels, as depicted in the first plot in the question. – Naveen Reddy Marthala Oct 26 '21 at 10:32
  • @StephanKolassa, is it completely fine to have label-imbalance too? – Naveen Reddy Marthala Oct 26 '21 at 10:44
  • Ah. So "labels" are not your target variable? No, imbalance in predictors ("labels", if I understand you correctly) is not a problem, either, except in some very specific circumstances, which probably do not apply to your situation (e.g., ANOVA with highly different variances between groups). – Stephan Kolassa Oct 26 '21 at 10:46
  • I presume you think "labels" is the name of my target variable. What I have is one input text column, and all my outputs are binary, indicating whether the current row belongs to a label or not. So each row can belong to 1 or more labels. – Naveen Reddy Marthala Oct 26 '21 at 10:49
  • What do you consider the difference between a label imbalance and a class imbalance? – Dave Oct 26 '21 at 11:05
  • @Dave, multi-class is something like iris, where each sample can belong to ONLY one of an arbitrary number of classes, while multi-label means each sample can belong to an arbitrary number of labels. An example would be questions on this forum, where a question can carry ONLY 1 label or ANY 5 labels from the thousands of labels available. – Naveen Reddy Marthala Oct 26 '21 at 12:21
  • But what is the difference between class imbalance and label imbalance? – Dave Oct 26 '21 at 12:44
  • The notion of the ratio differing between classes or labels is the same in both cases. However, the way multi-class and multi-label datasets are over- or under-sampled is what differs, which I believe is also emphasised by the fact that a package like imbalanced-learn does not support multi-label datasets in the `y` argument of the `fit_resample` method of any under-sampling technique. – Naveen Reddy Marthala Oct 26 '21 at 12:49
  • Under-sampling a multi-class dataset can be quite straightforward: you find the samples that need to be dropped, by random sampling or Tomek links, and just drop them. Doing the same with a multi-label dataset has a lot of side effects, like the possibility of under-occurring labels getting removed, etc. – Naveen Reddy Marthala Oct 26 '21 at 12:52
  • Maybe you should say how you plan to do the multilabel modeling: if as is often the case you will build essentially disjoint binary models for each label, then this reduces to (binary) class imbalance. – Ben Reiniger Oct 26 '21 at 17:46
  • Slightly related: https://datascience.stackexchange.com/q/54450/55122 – Ben Reiniger Oct 26 '21 at 17:46
  • cross-posted at https://datascience.stackexchange.com/q/103512/55122 – Ben Reiniger Oct 26 '21 at 19:47

0 Answers