Scientific way to construct dataset for text classification

Question

BTOG.

I need to develop a machine learning algorithm that matches arbitrary text to predefined categories. The texts I need to classify are web page texts.

I know there are many machine learning libraries that can help me train such algorithms, e.g. scikit-learn, TensorFlow etc. There are also many tutorials that explain how to train a text categorization model that classifies movie reviews, toxicity etc.

Before I start training or fine-tuning models, I need to gather a dataset for each category. The problem with texts is that one text can belong to more than one category, which will eventually decrease the final model's accuracy.

What is the recommended scientific way to linearly split the dataset so that each category contains only texts that belong to that category and not to other categories at the same time?

If that is not possible, is there a method that can be used to handle this issue?

Can you give an example of a text that belongs to more than one category? Does this happen if a website has a wide range of topics (shopping on one page, news on another page)? Or something else? — Sycorax, Dec 21 '21 at 13:48
Sycorax, basically a text should have one main category with many subcategories (I don't think we need examples for this, because we can observe it on Wikipedia etc.). Sometimes a text also has more than one main category, and not only in the shopping and news case. For example, the "Linux" Wikipedia entry contains many computer science terms. Another example is a food article that also talks about things other than food, such as culture. — Ben Goz, Dec 21 '21 at 14:00
Ok, so why not split the dataset based on the main category? — Sycorax, Dec 21 '21 at 14:00
It sounds like some documents have more than one topic. You can just model this directly using a multi-label model. https://stats.stackexchange.com/questions/107768/what-is-the-difference-between-a-multi-label-and-a-multi-class-classification The [tag:multilabel] tag has a number of questions about this. — Sycorax, Dec 21 '21 at 14:23
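To make the multi-label suggestion concrete, here is a minimal sketch in plain Python of how the dataset could be laid out when a text is allowed to carry several categories at once. The sample texts, the category names, and the helper functions `build_indicator_matrix` and `per_category_datasets` are all hypothetical and only serve as illustration; this is one possible layout under those assumptions, not a prescribed method.

```python
# Hypothetical toy data: each text carries a *set* of categories,
# instead of being forced into exactly one category.
samples = [
    ("Linux is an open-source operating system kernel", {"computing"}),
    ("Traditional recipes reflect regional culture", {"food", "culture"}),
    ("New laptop deals and tech news this week", {"shopping", "news", "computing"}),
]


def build_indicator_matrix(samples):
    """Return the sorted category list and one 0/1 row per text."""
    categories = sorted({label for _, labels in samples for label in labels})
    rows = [
        [1 if category in labels else 0 for category in categories]
        for _, labels in samples
    ]
    return categories, rows


def per_category_datasets(samples):
    """Build one binary dataset (text, is_member) per category.

    A text with several categories is a positive example in each of them,
    so nothing has to be thrown away or assigned arbitrarily.
    """
    categories, rows = build_indicator_matrix(samples)
    texts = [text for text, _ in samples]
    return {
        category: [(text, row[j]) for text, row in zip(texts, rows)]
        for j, category in enumerate(categories)
    }


categories, matrix = build_indicator_matrix(samples)
datasets = per_category_datasets(samples)

for text, row in zip((t for t, _ in samples), matrix):
    print(row, text)
```

With this layout, each column of the matrix can be treated as its own yes/no target, so overlapping categories stop being a labeling conflict. scikit-learn provides `MultiLabelBinarizer` for building the same kind of indicator matrix and `OneVsRestClassifier` for fitting one binary classifier per column, which may be a more convenient starting point than hand-written helpers.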

0 Answers