Avoiding OCR performance coupling to upstream Bounding Box model

Question

I have a model pipeline where I first use an object detection deep learning model to locate text regions in images of natural scenery (i.e. outdoor images), and then send the cropped region to a deep learning OCR model to read the text.

Both models are trained with human-annotated data.

While trying to improve the object detection model, I have found that changes to it (even good changes, i.e. tighter, more consistent boxes around the text) can degrade the OCR model's performance by around 4-5%.

I have tried to reduce the OCR model's coupling to the detection model by randomly expanding each side of the bounding boxes in the training data (expanding only, never shrinking, so that all of the text stays visible), but the coupling is still affecting me. I have also tried other image augmentation methods (e.g. rotate, shift, shear, elastic deformation, contrast, brightness) with no real change to the coupling.
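To make the box expansion concrete, here is a minimal sketch of that kind of per-side augmentation. It assumes boxes are (x0, y0, x1, y1) pixel tuples; the function name `expand_box` and the `max_frac` parameter are illustrative assumptions rather than the exact implementation used.

```python
import random


def expand_box(box, image_size, max_frac=0.15, rng=random):
    """Randomly grow each side of a box outward, clipped to the image.

    box:        (x0, y0, x1, y1) in pixels (assumed format).
    image_size: (width, height) of the source image.
    max_frac:   hypothetical upper bound on growth per side, as a
                fraction of the box width (left/right) or height (top/bottom).
    """
    x0, y0, x1, y1 = box
    img_w, img_h = image_size
    w, h = x1 - x0, y1 - y0

    # Each side gets its own independent, non-negative offset, so the
    # crop can only grow and the annotated text always stays inside it.
    left = rng.uniform(0, max_frac) * w
    right = rng.uniform(0, max_frac) * w
    top = rng.uniform(0, max_frac) * h
    bottom = rng.uniform(0, max_frac) * h

    # Clip to the image so the crop never reaches outside it.
    return (
        max(0.0, x0 - left),
        max(0.0, y0 - top),
        min(float(img_w), x1 + right),
        min(float(img_h), y1 + bottom),
    )
```

Calling `expand_box((10, 20, 110, 60), (200, 100))` once per training sample gives a slightly different, always-larger crop each epoch.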

Two main questions:

1. Are there any other methods I can try to reduce this correlation of performance?
2. Would it just be better to couple further and train the OCR model on images cropped by whatever detection model I intend to use, as opposed to using human-annotated crops?
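
For the second option, detector crops would need transcriptions, which could be borrowed from the human annotations by matching boxes on overlap. The following is only a sketch of that idea under stated assumptions: boxes are (x0, y0, x1, y1) tuples, annotations are (box, text) pairs, and the names `iou`, `label_detections`, and the `min_iou` threshold are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    union = (ax1 - ax0) * (ay1 - ay0) + (bx1 - bx0) * (by1 - by0) - inter
    return inter / union if union > 0 else 0.0


def label_detections(detected_boxes, annotations, min_iou=0.5):
    """Give each detected box the text of its best-overlapping annotation.

    annotations: list of (box, text) pairs from the human-annotated data.
    Detections whose best overlap is below min_iou are dropped, since
    there is no trustworthy transcription for them.
    """
    labeled = []
    for det in detected_boxes:
        best = max(annotations, key=lambda ann: iou(det, ann[0]), default=None)
        if best is not None and iou(det, best[0]) >= min_iou:
            labeled.append((det, best[1]))
    return labeled
```

The resulting (box, text) pairs could then be cropped and used as OCR training samples, alone or mixed with the human-annotated crops. Note that a detector box that cuts off part of the text would still inherit the full transcription here, so the threshold would need some care.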
