I need to develop a machine learning model that matches arbitrary text to predefined categories. The texts I need to classify are web page texts.
I know there are many machine learning libraries that can help me train such models, e.g. scikit-learn, TensorFlow, etc. There are also many tutorials that explain how to train text categorization models that predict movie review sentiment, toxicity, etc.
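For reference, this is a minimal sketch of the kind of single-label pipeline those tutorials show, using scikit-learn; the texts and category names below are just made-up placeholders:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up page texts and categories, just to show the tutorial-style setup.
texts = [
    "Buy the latest smartphone with a great camera",
    "Top ten recipes for a quick weeknight dinner",
    "How to file your taxes online this year",
]
labels = ["electronics", "food", "finance"]

# TF-IDF features + a linear classifier, exactly one category per text.
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(texts, labels)

print(clf.predict(["Best budget phones with long battery life"]))
```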
Before I start training/fine-tuning models I need to gather a dataset for each category. But the problem with texts is that a single text can belong to more than one category, which will eventually decrease the final model's accuracy.
What is the recommended, scientifically sound way to split the dataset so that each category contains only texts that belong to that one category and not to several categories at the same time?
If that's not possible, are there any methods that can be used to mitigate this issue?
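To make the overlap concrete, here is a small sketch (again with made-up texts and categories) of what my data looks like if I keep all labels per text and treat it as a multi-label problem in scikit-learn. I'm not sure whether this is the right direction, hence the question:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Made-up examples: the first text plausibly belongs to two categories at once.
texts = [
    "Smartwatch that tracks your running pace and heart rate",
    "Review of the newest gaming laptop",
    "Stretching routine to do before a morning run",
]
raw_labels = [["electronics", "sports"], ["electronics"], ["sports"]]

# One binary column per category instead of forcing a single label per text.
mlb = MultiLabelBinarizer()
y = mlb.fit_transform(raw_labels)

clf = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
clf.fit(texts, y)

pred = clf.predict(["Fitness tracker with GPS for runners"])
print(mlb.inverse_transform(pred))  # can return one, several, or no categories
```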