
I would like to train / test split a dataset in such a way that all categories of categorical variables are in both train and test split.

I tried (using scikit-learn):

df_moto_train , df_moto_test = train_test_split( df_moto , test_size = 0.15 , stratify = df_moto[ cols_obj ] )

( where cols_obj is a list of categorical variables from the dataframe df_moto )

but I got the message :

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of groups for any class cannot be less than 2.

Thanks.

1 Answer

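The error message says it all: at least one subgroup (a combination of the variables you stratify on) contains only one record, so it cannot be represented in both the train and the test split. You need to gather more data or split it differently.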
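To see which subgroups are too small, you can count the rows per combination of the stratification columns. This is only a minimal sketch: the toy `df_moto` and the column names `brand` and `fuel` below are made up for illustration, and only `cols_obj` mirrors the name used in the question.

```python
import pandas as pd

# Hypothetical toy data standing in for the real df_moto (assumption).
df_moto = pd.DataFrame({
    "brand": ["a", "a", "a", "a", "b", "b", "b", "c"],
    "fuel":  ["x", "x", "y", "y", "x", "x", "y", "y"],
})
cols_obj = ["brand", "fuel"]  # categorical columns used for stratification

# Number of rows for each observed combination of categories.
combo_counts = df_moto.groupby(cols_obj).size()

# Combinations with fewer than 2 rows are the ones that trigger the ValueError.
rare_combos = combo_counts[combo_counts < 2]
print(rare_combos)
```

Any combination listed in `rare_combos` has to be merged with another group, dropped, or supplemented with more data before stratifying on all of `cols_obj` can work.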

Tim
Right: one combination of categories has only one record. The more categorical variables you stratify on, the more likely you are to run into that case. I wonder if there is a way to collapse categories so as to ensure stratification is possible. – Fabrice BOUCHAREL Nov 15 '21 at 12:02
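One possible way to collapse categories, offered as a sketch rather than a definitive answer: build a single stratification key from the combined columns and pool every combination with fewer than `min_count` rows into one catch-all stratum. The helper `make_strata`, the label `__other__`, and the toy data are all hypothetical names introduced here. Note the limits: the pooled stratum must itself contain at least 2 rows, the test set must have at least as many rows as there are strata, and stratification only preserves proportions approximately, so a very rare category can still end up in only one of the two splits.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def make_strata(df, cols, min_count=2, other_label="__other__"):
    """Hypothetical helper: one stratification key per row, rare combos pooled."""
    # Join the categorical columns into a single string key per row.
    key = df[cols].astype(str).agg("|".join, axis=1)
    # For each row, how many rows share its key.
    counts = key.map(key.value_counts())
    # Keep frequent keys, replace rare ones with the catch-all label.
    return key.where(counts >= min_count, other_label)


# Hypothetical toy data standing in for the real df_moto (assumption).
df_moto = pd.DataFrame({
    "brand": ["a"] * 20 + ["b"] * 11 + ["c"],
    "fuel":  ["x"] * 10 + ["y"] * 10 + ["x"] * 10 + ["y"] + ["y"],
})
cols_obj = ["brand", "fuel"]

strata = make_strata(df_moto, cols_obj)

# Stratify on the collapsed key instead of the raw columns.
df_moto_train, df_moto_test = train_test_split(
    df_moto, test_size=0.15, stratify=strata, random_state=0
)
print(strata.value_counts())
```

In this toy example the two singleton combinations (`b|y` and `c|y`) are pooled, so the split runs without the error, but brand `c` still has a single row and therefore cannot appear in both splits.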