
I have two datasets that I need to split into train, dev, and test sets. The first is a regular dataset where each line contains a sentence along with a label. I found the following code in a Towards Data Science post, which works perfectly for this dataset:

import pandas as pd

df = pd.read_csv('/kaggle/input/bluebook-for-bulldozers/TrainAndValid.csv', parse_dates=['saledate'], low_memory=False)


from fast_ml.model_development import train_valid_test_split

X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(df, target='SalePrice',
                                                                            train_size=0.8, valid_size=0.1, test_size=0.1)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)

However, my second dataset follows the CoNLL format: the sentences are split into tokens, and each line contains a token with its label (BIO tagging), with blank lines separating the sentences.

E.g.

token   label
Also    O
,   O
outdoor B-claim
activities  I-claim
enable  I-claim
me  I-claim
to  I-claim
socialize   I-claim
with    I-claim
other   I-claim
people  I-claim
and I-claim
enjoy   I-claim
natural I-claim
beauty  I-claim
.   O
                    
There   O
are O
strong  O
advantages  O
to  O
spend   O
leisure O
time    O
outdoors    O
.   O

I also want to split this dataset into train, dev, and test, ensuring that no sentence is split across the different sets. Is there any library, like scikit-learn or fast_ml (used above), that supports splitting datasets in this format?
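One workaround I have considered (not sure whether a library does this out of the box): group the lines into sentences first, then split the *list of sentences* with scikit-learn's train_test_split, applied twice to get three sets. A minimal sketch, using a toy stand-in for the real file and hypothetical names (group_sentences is my own helper, not a library function):

```python
from sklearn.model_selection import train_test_split

def group_sentences(lines):
    """Group CoNLL token/label lines into sentences;
    blank lines mark sentence boundaries."""
    sentences, current = [], []
    for line in lines:
        if line.strip():
            current.append(line.rstrip("\n"))
        elif current:
            sentences.append(current)
            current = []
    if current:  # file may not end with a blank line
        sentences.append(current)
    return sentences

# Toy stand-in for a real file: ten one-token "sentences".
# In practice: lines = open("my_dataset.conll", encoding="utf-8")
lines = []
for i in range(10):
    lines.append(f"token{i} O")
    lines.append("")

sentences = group_sentences(lines)

# 80/10/10: carve off train first, then halve the remainder into dev/test.
# Splitting whole sentences guarantees none is cut across the sets.
train, rest = train_test_split(sentences, train_size=0.8, random_state=42)
dev, test = train_test_split(rest, train_size=0.5, random_state=42)
print(len(train), len(dev), len(test))
```

Each split set can then be written back out in the same CoNLL format (tokens line by line, blank line between sentences), so the downstream tagger sees the original structure.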
