I have an index composed of thousands of documents. Slightly modified copies of those documents are sent to my application in small chunks, and I need to determine, from those chunks, which document is being transmitted. All documents are sent to my application; however, the order is unknown.
Each chunk generates a candidate match containing the id of the document and the search score. I need some way of learning a heuristic that tells me when a given document is really a match. The features I can think of are statistics of the scores and the number of candidates belonging to a given document.
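To make the features concrete, here is a minimal sketch of the kind of per-document aggregation I have in mind. It assumes each candidate is a `(doc_id, score)` pair; the function name `document_features` and the particular statistics (count, mean, max, standard deviation) are only illustrative choices, not something my search engine provides.

```python
from collections import defaultdict
from statistics import mean, pstdev


def document_features(candidates):
    """Group (doc_id, score) candidates by document and summarize them.

    `candidates` is assumed to be an iterable of (doc_id, score) pairs,
    one per chunk that produced a match.
    """
    scores_by_doc = defaultdict(list)
    for doc_id, score in candidates:
        scores_by_doc[doc_id].append(score)

    return {
        doc_id: {
            "count": len(scores),    # number of chunks pointing at this document
            "mean": mean(scores),    # average search score
            "max": max(scores),      # best single score
            "std": pstdev(scores),   # population std dev (0.0 for a single score)
        }
        for doc_id, scores in scores_by_doc.items()
    }
```

Each resulting feature dictionary, paired with a true/false label from the original documents, would then be one row of the labeled dataset.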
I have access to the original versions of the documents, so I can generate a labeled dataset. However, when the documents are sent through the streaming channel, they get slightly modified, as mentioned above.
For instance, if I have 10 segments (with their corresponding scores) pointing at document 1, 8 segments pointing at document 2, and 9 segments pointing at document 3, how can I devise a heuristic to tell whether all three documents have actually been streamed, or whether one or more are just false positives? How can the same heuristic deal with false negatives?
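The simplest baseline I can imagine is a single cut-off on one feature (for example the number of segments per document), chosen on the labeled dataset so that false positives and false negatives are balanced. The sketch below picks the cut-off that maximizes F1; the helper names `f1_at_threshold` and `best_threshold` are hypothetical, and F1 is just one possible way to trade the two error types against each other.

```python
def f1_at_threshold(values, labels, threshold):
    """F1 score when every document with value >= threshold is accepted."""
    pairs = list(zip(values, labels))
    tp = sum(1 for v, y in pairs if v >= threshold and y)      # true matches accepted
    fp = sum(1 for v, y in pairs if v >= threshold and not y)  # false positives
    fn = sum(1 for v, y in pairs if v < threshold and y)       # false negatives
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def best_threshold(values, labels):
    """Return the observed value that gives the highest F1 as a cut-off.

    `values` holds one feature value per document and `labels` holds the
    matching booleans (True if the document was really streamed).
    Ties are resolved in favor of the lowest threshold.
    """
    candidates = sorted(set(values))
    return max(candidates, key=lambda t: f1_at_threshold(values, labels, t))
```

A document whose feature value falls below the learned cut-off would be treated as a false positive. What I am unsure about is whether such a single-feature rule is enough, or whether the count and score statistics should be combined in a proper classifier.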