I am trying to solve the following pattern recognition / information extraction problem.
Assume I have a text in which each token has been annotated with a single class out of $K$ available classes (by a conditional random field, for example). The results are quite good, but some noise remains.
Here is an example (each token's class is represented by a letter of the alphabet):
D E I A G K D A K G K I Z A
I U A I D O G O D O A
D A G O D I A O D G A I E F G A I
Q S D F A Z E R
Q A D F A Z E Z
Q S D F A E E R
Q S D A E R E R
Q S D F A Z E R
Q D D F A Z E Q
S S D F A Z E R
S F I A S D E G
A D I G H A Z D F G H
And this is the output I want from the model I am looking for:
D E I A G K D A K G K I Z A
I U A I D O G O D O A
D A G O D I A O D G A I E F G A I
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
S F I A S D E G
A D I G H A Z D F G H
Explanation: for the first 3 lines and the last 2 lines, I cannot do much. But we can "recognize" a table in the rest of the data, and we want the classes of the tokens to be consistent with the table structure.
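To make the goal concrete, here is a naive baseline sketch, not the principled method I am after. It assumes each line is already split into tokens and that table rows all have the same number of tokens. It chains consecutive rows of equal length that agree on enough positions, then replaces every column of a long enough block by its majority label. The names `agreement`, `smooth_tables`, `min_rows` and `min_agreement` are mine, and the thresholds are arbitrary.

```python
from collections import Counter


def agreement(a, b):
    """Fraction of positions where two equal-length rows carry the same label."""
    if not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def smooth_tables(rows, min_rows=3, min_agreement=0.5):
    """Majority-vote the labels column by column inside table-like blocks.

    Assumption: a "table" is a run of consecutive rows with the same token
    count in which each row agrees with the previous one on at least
    `min_agreement` of its positions. Shorter runs are left untouched.
    """
    out = [list(r) for r in rows]
    start = 0
    for i in range(1, len(rows) + 1):
        continues_block = (
            i < len(rows)
            and len(rows[i]) == len(rows[i - 1])
            and agreement(rows[i], rows[i - 1]) >= min_agreement
        )
        if continues_block:
            continue
        # The block rows[start:i] has just ended: smooth it if it is long enough.
        block = rows[start:i]
        if len(block) >= min_rows:
            consensus = [Counter(col).most_common(1)[0][0] for col in zip(*block)]
            for j in range(start, i):
                out[j] = list(consensus)
        start = i
    return out
```

With `min_rows=3` and `min_agreement=0.5`, this reproduces the desired output on the example above. The chaining criterion is ad hoc, though, and it cannot handle rows with missing or extra tokens.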
I am looking for a principled way to do that. I thought about 4-connected grid Markov random fields, and about dynamic programming algorithms that optimize a mixed cost: the cost of changing a label plus the quality of the alignment. I am not sure this is a good way to start.
N.B. The problem presented here is highly simplified, but I think the example captures its crux.
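As a rough sketch of the kind of mixed cost I mean (the notation is mine, and it assumes the tokens of a candidate table are already aligned into rows $i$ and columns $j$, which is itself part of the problem): let $x_{i,j}$ be the label given by the first model and $y_{i,j}$ the corrected label. Then one candidate energy is

$$E(y) = \lambda \sum_{i,j} [\, y_{i,j} \neq x_{i,j} \,] + \mu \sum_{i,j} [\, y_{i,j} \neq y_{i+1,j} \,]$$

where $[\cdot]$ is 1 when the condition holds and 0 otherwise, and $\lambda, \mu \geq 0$ are weights. The first term penalizes changing a label, the second penalizes disagreement between vertically adjacent cells. The horizontal edges of a 4-connected grid would need a different kind of term, since neighbouring columns are not expected to share a label.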