Which model for this information extraction problem?

Question

I am trying to solve the following pattern recognition / information extraction problem.

Assume I have a text in which each token has been annotated with a single class out of $K$ available classes (by a conditional random field, for example). The results are quite good, but some noise remains.

Here is an example (each token of the text has a class, represented here by a letter of the alphabet):

D E I A G K D A K G K I Z A
I U A I D O G O D O A
D A G O D I A O D G A I E F G A I
Q S D F A Z E R
Q A D F A Z E Z
Q S D F A E E R
Q S D A E R E R
Q S D F A Z E R
Q D D F A Z E Q
S S D F A Z E R
S F I A S D E G
A D I G H A Z D F G H

And this is the output I want from the model I'm looking for:

D E I A G K D A K G K I Z A
I U A I D O G O D O A
D A G O D I A O D G A I E F G A I
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
Q S D F A Z E R
S F I A S D E G
A D I G H A Z D F G H

Explanation: for the first 3 lines and the last 2 lines, I cannot do much. But we can "recognize" a table in the rest of the data, and we want the token classes to be consistent with the table structure.

I am looking for a principled way to do that. I thought about 4-connected grid Markov random fields, and about dynamic programming algorithms that optimize a mixed cost combining the cost of changing labels with the quality of the alignment. I am not sure this is a good way to start.
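To make the objective concrete, here is a minimal brute-force baseline in Python. It is not the MRF or dynamic programming model mentioned above, only a sketch under strong assumptions: all rows of a table have the same number of tokens (no inserted or dropped tokens), the common pattern is the column-wise majority label, and a row belongs to the table if it differs from that pattern in at most a fraction `max_change_rate` of its tokens. The names `consensus`, `find_table`, `max_change_rate` and `min_rows`, as well as the threshold value 0.4, are hypothetical choices made for illustration.

```python
from collections import Counter

def consensus(rows):
    """Column-wise majority label for rows of equal length."""
    return [Counter(col).most_common(1)[0][0] for col in zip(*rows)]

def find_table(lines, max_change_rate=0.4, min_rows=3):
    """Return (start, end, pattern) for the longest run lines[start:end] of
    equal-length rows in which every row differs from the column-wise
    consensus in at most max_change_rate of its tokens, or None.
    The threshold and min_rows are arbitrary assumptions."""
    best = None
    n = len(lines)
    for start in range(n):
        for end in range(start + min_rows, n + 1):
            block = lines[start:end]
            if len({len(row) for row in block}) != 1:
                break  # any longer block from this start also mixes lengths
            pattern = consensus(block)
            limit = max_change_rate * len(pattern)
            fits = all(
                sum(a != b for a, b in zip(row, pattern)) <= limit
                for row in block
            )
            if fits and (best is None or end - start > best[1] - best[0]):
                best = (start, end, pattern)
    return best

# The example from the question, one list of labels per line.
text = """
D E I A G K D A K G K I Z A
I U A I D O G O D O A
D A G O D I A O D G A I E F G A I
Q S D F A Z E R
Q A D F A Z E Z
Q S D F A E E R
Q S D A E R E R
Q S D F A Z E R
Q D D F A Z E Q
S S D F A Z E R
S F I A S D E G
A D I G H A Z D F G H
"""
example = [row.split() for row in text.strip().splitlines()]

found = find_table(example)
if found is not None:
    start, end, pattern = found
    # Overwrite every row of the detected table with the common pattern.
    repaired = example[:start] + [list(pattern) for _ in range(start, end)] + example[end:]
    for row in repaired:
        print(" ".join(row))
```

This only illustrates the trade-off between the number of label changes and the consistency of the table. It ignores the CRF's per-token confidences and cannot handle misaligned rows, which is where an alignment-based or MRF formulation would be needed.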

N.B. The problem presented here is highly simplified, but I think the example captures its crux.

Is it a typo that the first line of the expected output is a truncated version of the input? — Nowhere man, Dec 01 '15 at 01:51
Yes, fixed. The problem is to locate the beginning and end of the table pattern (if any), and enforce a common pattern across the lines (with a (roughly?) minimal number of label changes). — mic, Dec 01 '15 at 09:25
