3

I have a gazillion documents (your tax returns) that I need to check for correctness, but I have neither the manpower nor the willpower to read through all of them. Even if I did, I could not guarantee the quality and consistency of the proofreading process.

The only thing I can do is pick a sample of documents to proofread and mark each one as accept or reject. From that I want to determine a confidence interval for the overall error rate at a certain confidence level... I am clueless about what I should do next, or whether I am even using the right approach.

I have no experience with this problem domain; perhaps someone with more QA experience can point me in the right direction, like what questions to ask...

Thank you for reading :)

nobody
  • This question might be more appropriate for stats.stackexchange.com. –  Mar 08 '11 at 21:10
  • hmm, how do I move it? –  Mar 08 '11 at 21:17
  • Did you check http://en.wikipedia.org/wiki/Confidence_interval ? –  Mar 08 '11 at 21:19
  • it will be moved automatically soon :) –  Mar 08 '11 at 21:26
  • Do you have a set of "rules" by which you define a return as correct or not? – Ralph Winters Mar 09 '11 at 00:24
  • It will be read by humans; they will simply mark the document as accept (no error) or reject (there is an error). – nobody Mar 09 '11 at 15:03
  • Are you trying to determine if your proofreading process is correctly identifying the erroneous returns? Do you have any baseline measures, for example how many returns caught by the IRS contain errors? – Ralph Winters Mar 09 '11 at 16:20
  • One question is, what are these documents? Scanned images, PDFs, Excel spreadsheets? And a second question is, do you know which documents will be marked as accept and which ones as reject? Depending on this, the problem may be either solvable or extremely difficult. E.g. if 'reject' means blank form and accept means actual filled out form, this could be doable. If 'accept' means the return is actually valid, this could be extremely difficult. – SheldonCooper Mar 30 '11 at 21:47
  • This [answer](http://stats.stackexchange.com/questions/7467/quality-assurance-and-quality-control-qa-qc-guidelines-for-a-database/7472) might be of interest. – mpiktas Mar 31 '11 at 20:17

1 Answer

2

As a first step you could take a sample from the documents. This can be random sampling, but if you know that certain characteristics of the documents are particularly relevant you could use stratified sampling.
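As an illustration, here is a minimal sketch of both sampling approaches in Python with pandas; the `docs` data frame and its `doc_type` column are invented for the example, not something your data must look like:

```python
import pandas as pd

# Hypothetical document list; 'doc_type' is an invented column to stratify on.
docs = pd.DataFrame({
    "doc_id": range(1000),
    "doc_type": ["short", "long", "amended"] * 333 + ["short"],
})

# Simple random sample of 100 documents.
random_sample = docs.sample(n=100, random_state=42)

# Stratified sample: the same sampling fraction within each document type.
stratified_sample = (
    docs.groupby("doc_type", group_keys=False)
        .apply(lambda g: g.sample(frac=0.1, random_state=42))
)
```

Stratified sampling guarantees that rarer document types still show up in the proofreading sample.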

The second step can be feature extraction. Define characteristics of the documents that may help predict the accept or reject labels. Give clear definitions; you can use numerical, ordinal, and nominal (including binary) variables. Determine the features and labels for each document in the sample.
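For example, a feature table might look like the sketch below; all the column names and codings here are hypothetical:

```python
import pandas as pd

# Hypothetical features extracted from each sampled document; numerical,
# ordinal, and nominal/binary variables all fit in one table.
features = pd.DataFrame({
    "n_pages":        [2, 5, 3],           # numerical
    "income_bracket": [1, 3, 2],           # ordinal, coded 1..5
    "form_type":      ["A", "B", "A"],     # nominal
    "has_signature":  [True, False, True], # binary
})

# Labels from the human proofreaders: 1 = reject (error found), 0 = accept.
labels = pd.Series([0, 1, 0], name="reject")
```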

The third step is developing a prediction model. There are many kinds of methods available for predicting a binary response. If you have some knowledge of how certain features affect acceptance, you may want to build it into a model and use regression techniques, e.g. logistic regression, probit regression, and model selection (e.g. stepwise variable selection). If you can't or don't want to build pre-existing knowledge into the automatic assessment, you can use machine learning techniques like random forests, logistic regression trees, or support vector machines.
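A hedged sketch of both routes in scikit-learn, on an invented feature table (logistic regression for the regression route, a random forest for the machine-learning route; probit, stepwise selection, or SVMs would slot in the same way):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Hypothetical feature table and proofreader labels (1 = reject, 0 = accept).
features = pd.DataFrame({
    "n_pages":       [2, 5, 3, 4, 1, 6],
    "has_signature": [1, 0, 1, 1, 0, 1],
    "form_type":     ["A", "B", "A", "B", "A", "B"],
})
y = [0, 1, 0, 0, 1, 1]

# One-hot encode the nominal column so both models can use it.
X = pd.get_dummies(features, columns=["form_type"])

# Regression approach: interpretable coefficients, easy to encode prior knowledge.
logit = LogisticRegression(max_iter=1000).fit(X, y)

# Machine-learning approach: no functional form assumed up front.
forest = RandomForestClassifier(n_estimators=200, random_state=42).fit(X, y)

# Predicted probability that each document should be rejected.
print(forest.predict_proba(X)[:, 1])
```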

The fourth step is validating your model on a second sample. One way to do this is to randomly partition your original sample into a modelling subset and a testing subset before fitting the model.
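A sketch of that split-and-validate step, again on invented data; `train_test_split` does the random partitioning:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical labelled sample: features X, proofreader labels y.
X = pd.DataFrame({"n_pages":       [2, 5, 3, 4, 1, 6, 2, 7],
                  "has_signature": [1, 0, 1, 1, 0, 1, 1, 0]})
y = [0, 1, 0, 0, 1, 1, 0, 1]

# Partition into a modelling subset and a testing subset *before* fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = RandomForestClassifier(n_estimators=200, random_state=42)
model.fit(X_train, y_train)

# Performance on documents the model never saw estimates real-world accuracy.
print("held-out accuracy:", accuracy_score(y_test, model.predict(X_test)))
```

With a small labelled sample, cross-validation would use the proofread documents more efficiently than a single split.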

When you know how well your automatic classification system performs, you will be able to judge which documents need human review.

GaBorgulya