Heuristics streaming data matching

Question

I have an index composed of thousands of documents. Slightly modified copies of those documents are sent to my application in small chunks, and I need to determine, from those chunks, which document is being transmitted. All documents are sent to my application, but the order is unknown.

Each chunk generates a candidate match, containing the id of the document and the search score. I need to figure out some way of learning a heuristic to tell me when a given document is really a match. The features I can think of to use are statistics related to the scores as well as to the number of candidates belonging to a given document.
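A minimal sketch of how such per-document features might be computed, assuming each candidate arrives as a `(doc_id, score)` pair. The function name `aggregate_candidates` and the particular statistics chosen (count, mean, max, standard deviation) are illustrative assumptions, not something prescribed by the search engine in use.

```python
from collections import defaultdict
from statistics import mean, pstdev


def aggregate_candidates(candidates):
    """Group (doc_id, score) candidates and compute per-document features.

    `candidates` is an iterable of (doc_id, score) pairs, one per chunk.
    The feature set below is a hypothetical starting point.
    """
    scores_by_doc = defaultdict(list)
    for doc_id, score in candidates:
        scores_by_doc[doc_id].append(score)

    features = {}
    for doc_id, scores in scores_by_doc.items():
        features[doc_id] = {
            "count": len(scores),          # number of chunks pointing at this document
            "mean_score": mean(scores),    # average search score
            "max_score": max(scores),      # best single-chunk score
            "score_std": pstdev(scores),   # spread of scores (0.0 for a single chunk)
        }
    return features
```

Each resulting feature dictionary could then serve as one row of the labeled dataset, with the label indicating whether that document was actually being streamed.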

I have access to the original versions of the documents, so I can generate a labeled dataset. However, when the documents are sent through the streaming channel, they get slightly modified, as mentioned above.

For instance, if I have 10 segments (with their corresponding scores) pointing at document 1, 8 segments pointing at document 2, and 9 segments pointing at document 3, how can I devise a heuristic to tell whether all three documents have actually been streamed, or whether one or more are just false positives? How can the same heuristic handle false negatives?
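One simple baseline, sketched here under stated assumptions, is a single-feature threshold (a decision stump) fitted on the labeled dataset: accept a document when its candidate count reaches a threshold, and choose the threshold that minimizes false positives plus false negatives. The function name `fit_threshold` and the labels in the usage example are made up for illustration; only the counts 10, 8 and 9 come from the example above.

```python
def fit_threshold(values, labels):
    """Pick the threshold on one feature that minimizes total errors.

    A document is accepted when its feature value >= threshold.
    `values` are feature values (e.g. candidate counts) and `labels` are
    True for documents that really were streamed. Assumes a non-empty input.
    Returns (threshold, false_positives, false_negatives).
    """
    thresholds = sorted(set(values))
    thresholds.append(thresholds[-1] + 1)  # extra option: reject everything

    best = None
    for t in thresholds:
        # Accepted although not streamed.
        fp = sum(1 for v, y in zip(values, labels) if v >= t and not y)
        # Rejected although streamed.
        fn = sum(1 for v, y in zip(values, labels) if v < t and y)
        if best is None or fp + fn < best[1] + best[2]:
            best = (t, fp, fn)
    return best


# Hypothetical labeled data: the counts 10, 8, 9 from the example are
# assumed to be true matches, and two low-count documents are assumed spurious.
counts = [10, 8, 9, 3, 2]
streamed = [True, True, True, False, False]
threshold, false_pos, false_neg = fit_threshold(counts, streamed)
```

Reporting false positives and false negatives separately makes the trade-off explicit: a lower threshold misses fewer streamed documents at the cost of more spurious matches. The same search could be run on a score statistic instead of the count, or replaced by a classifier over several features.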

You'll probably want to learn a noise model based on your dataset, then use that to get confidence scores. What do the documents and the modifications look like? — Danica, Apr 22 '15 at 22:02
They look like a set of long values, since I hash the tokens. Which model would come to your mind? — Felipe Martins Melo, Apr 23 '15 at 14:26
I guess the question is what kind of modifications happen? Are some words randomly replaced with other words? Are sequences from other documents thrown in? If you have an idea (maybe based on the process that's causing the errors) of what kind of errors are introduced, you can tackle the different types of errors with different models. — Danica, Apr 24 '15 at 19:06
@Dougal, the documents are actually audio files (I didn't mention that before for the sake of clarity), and the changes performed on the segments are due to codecs applied by a streaming service (e.g. bitrate and sample rate changes). Any ideas? Thanks. — Felipe Martins Melo, Apr 24 '15 at 21:48
Are you familiar with perceptual hashing? The audio ones are pretty good, I think. — Danica, Apr 24 '15 at 21:52
Yes, I am. However, the fingerprinting scheme I'm using is quite good; the only problem is figuring out a heuristic to tell whether a given audio file has been streamed. — Felipe Martins Melo, Apr 24 '15 at 21:56
So, given an audio file, the question is whether it's been saved from a stream or is an "original"? — Danica, Apr 24 '15 at 22:15
Rephrasing the question, every audio file will be fully streamed. I need to find the streaming order. — Felipe Martins Melo, Apr 24 '15 at 22:24

0 Answers