
What are the best practices for applying NER to large texts (e.g. 20+ pages)?

One common piece of advice is to split the text before passing it to the model. However, this can require significant manual work to establish splitting rules, especially when there are many different document templates.

What are other alternatives or complementary practices when building a NER solution for large documents?

Are there common tricks for keeping the data-splitting logic generalizable?

mobupu

1 Answer


Most (if not all) existing NER datasets work at the sentence level, so models trained on those datasets also expect their input to be a single sentence.

Sentence splitting is part of most NLP toolkits and works remarkably well. In most cases, rule-based procedures such as the one in NLTK are sufficient. More advanced tools use machine-learned tokenization, such as UDPipe (best used via the spacy-udpipe wrapper) or Stanza, both of which support many languages.
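A minimal sketch of this idea, assuming NLTK's `sent_tokenize` for the rule-based splitting and spaCy's `en_core_web_sm` model as the sentence-level NER component (both are illustrative choices, not something prescribed in the answer):

```python
# Split a long document into sentences, then run NER sentence by sentence.
# Assumes: nltk and spacy installed, and the "en_core_web_sm" model downloaded.
import nltk
import spacy

nltk.download("punkt", quiet=True)       # data for NLTK's rule-based splitter
nlp = spacy.load("en_core_web_sm")       # any sentence-level NER model would do

def extract_entities(long_text: str) -> list[tuple[str, str]]:
    """Return (entity text, entity label) pairs found across the whole document."""
    entities = []
    for sentence in nltk.sent_tokenize(long_text):
        doc = nlp(sentence)
        entities.extend((ent.text, ent.label_) for ent in doc.ents)
    return entities

print(extract_entities("Barack Obama was born in Hawaii. He served two terms."))
```

Because the splitter produces self-contained sentences, the same loop works regardless of the document's length or template, which sidesteps the need for hand-written splitting rules per document type.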

Jindřich