
What are the best practices for applying NER to large texts (e.g. 20+ pages)?

One common piece of advice is to split the text before passing it to the model. However, this can require significant manual work to establish splitting rules, especially when there are many different document templates.

What are other alternatives or complementary practices when building a NER solution for large documents?

Are there common tricks for keeping the data-splitting logic generalizable?

mobupu

1 Answer


Most (if not all) existing NER datasets work at the sentence level, so models trained on those datasets also expect their input to be a single sentence.

Sentence splitting is part of most NLP toolkits and works remarkably well. In most cases, rule-based procedures such as the one in NLTK are sufficient. More advanced tools use machine-learned tokenization, such as UDPipe (best used via the spacy-udpipe wrapper) or Stanza, both of which support many languages.
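A minimal sketch of this idea, assuming NLTK's `sent_tokenize` for the rule-based splitting and spaCy's `en_core_web_sm` model as the sentence-level NER component (both are illustrative choices, not something prescribed in the answer):

```python
# Split a long document into sentences, then run NER sentence by sentence.
# Assumes: nltk and spacy installed, and the "en_core_web_sm" model downloaded.
import nltk
import spacy

nltk.download("punkt", quiet=True)       # data for NLTK's rule-based splitter
nlp = spacy.load("en_core_web_sm")       # any sentence-level NER model would do

def extract_entities(long_text: str) -> list[tuple[str, str]]:
    """Return (entity text, entity label) pairs found across the whole document."""
    entities = []
    for sentence in nltk.sent_tokenize(long_text):
        doc = nlp(sentence)
        entities.extend((ent.text, ent.label_) for ent in doc.ents)
    return entities

print(extract_entities("Barack Obama was born in Hawaii. He served two terms."))
```

Because the splitter produces self-contained sentences, the same loop works regardless of the document's length or template, which sidesteps the need for hand-written splitting rules per document type.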

Jindřich