Identifying locations in a difficult OCR-read English text with Python

Question

My goal is to identify the city and (US) state of both the inventor and the assignee of US patents from the 1910s and 1920s. These patents are provided by Google and look like so, like so or like so. The inventor's location is stored in the header and, for the very old patents, in the first paragraph. The assignee's location is stored in the header (if there is an assignee; if not, the inventor is a company). Sometimes there are multiple inventors or assignees.

All of this needs to be written in Python 3.x and should work reasonably well for about 1 million patent texts.

The biggest challenge here is that the OCR is incomplete: sometimes the header is missing or split seemingly at random, and often word or sentence structure is broken. That is, there are superfluous full stops, and letters are not correctly identified. The first paragraph is usually still better read than the header.

My current idea is this, and my question is whether there is a better way to go about it:

1. Download the items
2. Iterate through the text line by line
3. Identify the first paragraph
4. Exploit regularity in the first paragraph ("a resident of Indianapolis, county of Marion, and State of Indiana", as here) and find the inventor's location by a series of string splits
5. Identify the header, match word by word for something that looks like "Assignee", and exploit regularity in this sentence ("assignors to , a corporation of Delaware", as here) by a series of string splits and replacements for common OCR mistakes
6. Write a file listing the inventor's city, inventor's state, assignee's city and assignee's state
7. Later go through the list of cities and difflib-match each against the list of cities in that state
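
To make the idea concrete, the first-paragraph extraction and the later difflib matching from the steps above could be sketched roughly as follows. This is only a minimal sketch and has not been run against real patent texts. The names `SEP`, `RESIDENT_RE`, `clean`, `closest`, `closest_state` and `inventor_location`, the cutoff of 0.75 and the tiny `CITIES_BY_STATE` table are all assumptions for illustration; a real run would need a complete list of states and cities. A tolerant regular expression is used here in place of a series of string splits.

```python
import difflib
import re

# Tiny illustrative lookup table (assumption); a real run needs a full gazetteer.
CITIES_BY_STATE = {
    "Indiana": ["Indianapolis", "Fort Wayne", "Evansville"],
    "Delaware": ["Wilmington", "Dover"],
    "New York": ["New York", "Buffalo", "Rochester"],
}
STATES = list(CITIES_BY_STATE)

# Whitespace mixed with the superfluous punctuation the OCR tends to insert.
SEP = r"[\s.,;:]+"

# "a resident of <city>, county of <county>, and State of <state>"
RESIDENT_RE = re.compile(
    r"resid\w*" + SEP + r"(?:of|at)" + SEP
    + r"(?P<city>[A-Za-z][A-Za-z .]*?)" + SEP
    + r"(?:in" + SEP + r"the" + SEP + r")?count\w*" + SEP + r"of" + SEP
    + r"(?P<county>[A-Za-z][A-Za-z .]*?)" + SEP
    + r"(?:and" + SEP + r")?state" + SEP + r"of" + SEP
    + r"(?P<state>[A-Za-z]+(?:\s+[A-Za-z]+)?)",
    re.IGNORECASE,
)


def clean(fragment):
    """Replace every run of non-letters with one space and trim the ends."""
    return " ".join(re.sub(r"[^A-Za-z]+", " ", fragment).split())


def closest(name, candidates, cutoff=0.75):
    """Return the candidate most similar to name (case-insensitive), or None."""
    lookup = {c.lower(): c for c in candidates}
    hits = difflib.get_close_matches(
        clean(name).lower(), list(lookup), n=1, cutoff=cutoff
    )
    return lookup[hits[0]] if hits else None


def closest_state(raw):
    """Try the first two captured words, then only the first, against STATES."""
    words = clean(raw).split()
    for n in (2, 1):
        hit = closest(" ".join(words[:n]), STATES)
        if hit:
            return hit
    return None


def inventor_location(text):
    """Return {'city': ..., 'state': ...} or None if the phrase is not found."""
    m = RESIDENT_RE.search(text)
    if m is None:
        return None
    state = closest_state(m.group("state"))
    raw_city = clean(m.group("city"))
    city = closest(raw_city, CITIES_BY_STATE.get(state, []))
    # Fall back to the raw OCR string when no known city is close enough.
    return {"city": city or raw_city, "state": state}
```

Calling `inventor_location(paragraph)` on a first paragraph returns the corrected city and state when the phrase is found and `None` otherwise, so patents that fall through can be collected and handled separately (for example via the header).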

Now the question is: can I improve on this idea? I've seen that there are some Named Entity Recognition packages, but either they are made for correctly written words (such as PyNER) or I can't install or use them (such as geograpy). Before I invest in learning how to train a word recognition algorithm, I'd like to know whether it's worth it or whether I'm better off hard-coding the algorithm.

I think you would get better results by doing the OCR yourself, instead of trying to fix Google's output. — Kodiologist, Feb 01 '18 at 17:38
