I have two datasets that I need to split into train, dev, and test sets. The first is a regular dataset where each line contains a sentence along with its label. I found the following code in a Towards Data Science post, which works perfectly for this dataset:
import pandas as pd
from fast_ml.model_development import train_valid_test_split

df = pd.read_csv('/kaggle/input/bluebook-for-bulldozers/TrainAndValid.csv',
                 parse_dates=['saledate'], low_memory=False)

# 80/10/10 split on the target column
X_train, y_train, X_valid, y_valid, X_test, y_test = train_valid_test_split(
    df, target='SalePrice', train_size=0.8, valid_size=0.1, test_size=0.1)

print(X_train.shape, y_train.shape)
print(X_valid.shape, y_valid.shape)
print(X_test.shape, y_test.shape)
However, my second dataset follows the CoNLL format: the sentences are split into tokens, and each line contains a token together with its label (BIO tagging), with blank lines separating the sentences. For example:
token label
Also O
, O
outdoor B-claim
activities I-claim
enable I-claim
me I-claim
to I-claim
socialize I-claim
with I-claim
other I-claim
people I-claim
and I-claim
enjoy I-claim
natural I-claim
beauty I-claim
. O
There O
are O
strong O
advantages O
to O
spend O
leisure O
time O
outdoors O
. O
I also want to split this dataset into train, dev, and test sets, ensuring that no sentence is split across the different sets. Is there any library, like scikit-learn or fast_ml (used above), that supports splitting datasets in this format?
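To show what I have in mind, here is a minimal sketch of the sentence-level split I am after. It groups lines into sentences at the blank-line boundaries and then applies scikit-learn's train_test_split twice, so whole sentences are assigned to each set. The file name dataset.conll and the 80/10/10 ratios are just placeholders:

from sklearn.model_selection import train_test_split

# Group the token lines into sentences; a blank line marks a sentence boundary.
# 'dataset.conll' is a placeholder name for the CoNLL-format file.
sentences, current = [], []
with open('dataset.conll', encoding='utf-8') as f:
    for line in f:
        if line.strip():
            current.append(line.rstrip('\n'))
        elif current:
            sentences.append(current)
            current = []
if current:
    sentences.append(current)

# First split off 10% for test, then carve 10% of the total out of the
# remaining 90% for dev (1/9 of the remainder), leaving 80% for train.
train_sents, test_sents = train_test_split(sentences, test_size=0.1, random_state=42)
train_sents, dev_sents = train_test_split(train_sents, test_size=1/9, random_state=42)

print(len(train_sents), len(dev_sents), len(test_sents))

This works, but it feels like something a library should already handle directly, the way fast_ml handles the sentence-per-line case above.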