
I'm looking to classify subsections of "full" documents based on their similarity to a set of subsections that have been manually curated and assigned labels (let's call these short documents). There are about 50 categories with 5-10 short documents in each. The short documents range from about 100 to 500 words. The full documents are much larger, containing 10,000+ words. Each full document can contain many (non-overlapping) subsections, and perhaps some subsections that don't match any of our curated ones (which is fine).

I've been able to discriminate between the short documents reasonably well using TF-IDF and doc2vec methods. For the TF-IDF method, I modified it slightly to work on categories instead of individual documents: any other short document to be labelled can be scanned, and each of its words assigned a TF-IDF score for each category. The scores for each category are then aggregated over words (e.g. averaged), and the document is assigned the label with the highest aggregated score. I've also tried cosine similarity with TF-IDF vectors, and that works pretty well too. For doc2vec, I trained a model using gensim and then used KNN classification with cosine similarity to assign labels.
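For concreteness, here is a minimal sketch of the TF-IDF/cosine-similarity variant described above, using scikit-learn (the `categories` dict, function names, and stop-word choice are placeholders for illustration, not my exact code):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# categories: dict mapping label -> list of curated short documents (strings)
def build_category_vectors(categories):
    # Concatenate each category's short documents into one "category document"
    # so the TF-IDF vocabulary and weights are computed per category
    labels = list(categories)
    category_docs = [" ".join(categories[label]) for label in labels]
    vectorizer = TfidfVectorizer(stop_words="english")
    category_matrix = vectorizer.fit_transform(category_docs)
    return labels, vectorizer, category_matrix

def classify(text, labels, vectorizer, category_matrix):
    # Project the new text into the same TF-IDF space and pick the
    # category with the highest cosine similarity
    vec = vectorizer.transform([text])
    sims = cosine_similarity(vec, category_matrix).ravel()
    best = int(np.argmax(sims))
    return labels[best], float(sims[best])
```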

I am now wondering what the best method is to proceed with for the larger documents. I do not have to stick to the above methods. Given the relatively large number of categories and the small number of samples in each category, I don't think more complex supervised methods would be appropriate, but I may be wrong here. I've thought about chunking the document into smaller subsections and then performing the above analysis, but finding the places to split is not obvious, and it could become computationally intensive. I've also thought about creating sentence embeddings and looking for groups of nearby sentences that are all similar to those in a category (see the sketch below). I guess the same approach could be taken with word embeddings: look for groupings of nearby words that are highly similar to those in a category. I'm not an NLP expert, so any suggestions would be much appreciated.
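To make the sentence-embedding idea concrete, something along these lines is what I have in mind: embed every sentence once, score a sliding window of consecutive sentences against each category's centroid, and keep windows above a similarity threshold. The encoder (sentence-transformers), model name, window size, and threshold below are illustrative assumptions, not a settled choice:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed generic sentence encoder

def category_centroids(categories):
    # One centroid embedding per category, averaged over its short documents
    return {label: model.encode(docs).mean(axis=0)
            for label, docs in categories.items()}

def find_matching_windows(sentences, centroids, window=10, threshold=0.5):
    # Embed each sentence once, then score every window of consecutive
    # sentences against every category centroid via cosine similarity
    emb = model.encode(sentences)
    matches = []
    for start in range(len(sentences) - window + 1):
        window_vec = emb[start:start + window].mean(axis=0)
        for label, centroid in centroids.items():
            sim = np.dot(window_vec, centroid) / (
                np.linalg.norm(window_vec) * np.linalg.norm(centroid) + 1e-9)
            if sim >= threshold:
                matches.append((start, start + window, label, float(sim)))
    return matches
```

The window size and threshold would presumably need tuning against the curated subsections, and overlapping windows assigned to the same category could be merged into a single span afterwards.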
