How many samples do I need for OCR problems?

Question

I am thinking about collecting samples of handwritten digits (0 to 9) from people. I'll try to test different algorithms for optical character recognition, maybe some form of neural network and a random forest. I plan to collect 20 entries from each person (each digit written twice) so that I can make a training set and a test set.

Is my idea correct? How should I statistically decide how many samples will suffice?

Haitao Du · Answer 1 · 2017-01-12T16:53:50.927

Overall suggestion:

There are many open data sets for digit OCR. Have you looked at MNIST? Algorithm comparisons like the one you describe have also been published at the same link. Please check it first.

If you still want to collect your own data, why? Is there anything MNIST doesn't offer? If there is, I am sure you can find other open data online, such as this one.

Collecting data yourself is very costly, so it is better to have a good reason to do it.

To your question:

It is hard to say how much data is needed. It depends on your model, the complexity of the "task", and the quality of the data.

For example, if you want to use a complex model (a neural network), it is better to have hundreds of thousands of data points. On the other hand, if it is a simpler model (say, logistic regression), less data is required.
For example, if you want to build a classifier to distinguish 0 from 1, that is a relatively simple task (compared to distinguishing 0 from 6), and less data will be needed.
In addition, if the quality of the data is low, say all the digits are blurry or have low resolution, then more data is needed.
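
One practical way to see how these factors interact is a learning curve: train on progressively larger subsets and watch where test accuracy levels off. The sketch below is only an illustration under stated assumptions. The "digits" are synthetic vectors (a fixed random prototype per class plus Gaussian noise), the classifier is a simple nearest-centroid rule rather than a neural network or random forest, and every name in it (`make_prototypes`, `sample`, `learning_curve`, the `noise` parameter) is made up for this example. Raising `noise` stands in for lower-quality data.

```python
import random


def make_prototypes(n_classes, dim, rng):
    """One fixed random vector per class (a stand-in for a 'clean' digit)."""
    return [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_classes)]


def sample(prototypes, n_per_class, noise, rng):
    """Draw n_per_class noisy copies of every prototype as (vector, label)."""
    data = []
    for label, proto in enumerate(prototypes):
        for _ in range(n_per_class):
            data.append(([p + rng.gauss(0, noise) for p in proto], label))
    return data


def fit_centroids(train, n_classes, dim):
    """Nearest-centroid 'training': average the vectors of each class."""
    sums = [[0.0] * dim for _ in range(n_classes)]
    counts = [0] * n_classes
    for x, y in train:
        counts[y] += 1
        for i, v in enumerate(x):
            sums[y][i] += v
    return [[s / counts[c] for s in sums[c]] for c in range(n_classes)]


def predict(centroids, x):
    """Return the class whose centroid is closest in squared distance."""
    return min(
        range(len(centroids)),
        key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])),
    )


def learning_curve(sizes, noise, n_classes=10, dim=16, n_test=50, seed=0):
    """Test accuracy for each training size (samples per class)."""
    rng = random.Random(seed)
    protos = make_prototypes(n_classes, dim, rng)
    test = sample(protos, n_test, noise, rng)  # held-out set, reused for all sizes
    curve = []
    for n in sizes:
        train = sample(protos, n, noise, rng)
        centroids = fit_centroids(train, n_classes, dim)
        correct = sum(predict(centroids, x) == y for x, y in test)
        curve.append((n, correct / len(test)))
    return curve


if __name__ == "__main__":
    for noise in (1.0, 3.0):  # higher noise imitates blurrier data
        print("noise =", noise)
        for n, acc in learning_curve([1, 2, 5, 20, 100], noise):
            print(f"  {n:>3} samples per class -> test accuracy {acc:.3f}")
```

If the curve is still rising at your largest training size, more data is likely to help; if it has flattened, extra samples will probably buy little. On real data the same idea applies, for example with scikit-learn's `learning_curve` function on MNIST or on a small pilot collection, before committing to a large collection effort.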
