
I am thinking about collecting samples of handwritten digits (0 to 9) from people. I'll try to test different algorithms for optimal character recognition, maybe some form of neural network and a random forest. I plan to collect 20 entries from each person (each digit written twice) so that I can make a training set and a test set.

Is my idea correct? How should I statistically decide how many samples will suffice?

kjetil b halvorsen
Blain Waan

1 Answer

1

Overall suggestion:

There are many open data sources for digit OCR available. Have you looked at MNIST? There is also existing work comparing algorithms in the way you describe, available at the same link I provided. Please check it first.

If you want to do the same thing, why? Is there anything that MNIST doesn't offer? If there is, I am sure you can find other open data online, such as this one.

Collecting data yourself is very costly, so it is better to have a good reason for doing it.


To your question:

It is hard to say how much data is needed. It depends on your model, on the complexity of the "task", and finally on the quality of the data.

  • For example, if you want to use a complex model (a neural network), it is better to have hundreds of thousands of data points. On the other hand, if it is a simpler model (say, logistic regression), less data is required.

  • For example, if you want to build a classifier to separate 0 from 1, that is a relatively simple task (compared with separating 0 from 6), and less data will be needed.

  • In addition, if the quality of the data is low, say all the digits are blurry or have low resolution, then more data is needed.
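One common empirical way to judge whether you have enough data is a learning curve: train on growing subsets and watch the held-out accuracy level off. The sketch below is only an illustration under stated assumptions. The data are synthetic 2-D points standing in for digit features, the model is a simple nearest-centroid classifier, and the names `make_data`, `fit_centroids`, and `learning_curve` are hypothetical helpers, not part of any library.

```python
import random

def make_data(n, noise, rng):
    """Synthetic stand-in for digit features (assumption): two classes of 2-D points."""
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        cx, cy = (0.0, 0.0) if label == 0 else (2.0, 2.0)
        point = (rng.gauss(cx, noise), rng.gauss(cy, noise))
        data.append((point, label))
    return data

def fit_centroids(train):
    """Nearest-centroid classifier: store the mean point of each class."""
    sums = {}
    for (x, y), label in train:
        sx, sy, count = sums.get(label, (0.0, 0.0, 0))
        sums[label] = (sx + x, sy + y, count + 1)
    return {label: (sx / c, sy / c) for label, (sx, sy, c) in sums.items()}

def predict(centroids, point):
    """Return the label whose centroid is closest to the point."""
    def sq_dist(label):
        cx, cy = centroids[label]
        return (point[0] - cx) ** 2 + (point[1] - cy) ** 2
    return min(centroids, key=sq_dist)

def accuracy(centroids, test):
    """Fraction of test points classified correctly."""
    hits = sum(predict(centroids, point) == label for point, label in test)
    return hits / len(test)

def learning_curve(train, test, sizes):
    """Held-out accuracy after training on the first n samples, for each n."""
    return [(n, accuracy(fit_centroids(train[:n]), test)) for n in sizes]

rng = random.Random(0)  # fixed seed so the run is repeatable
train = make_data(2000, noise=1.5, rng=rng)
test = make_data(1000, noise=1.5, rng=rng)
sizes = [10, 50, 200, 1000, 2000]
curve = learning_curve(train, test, sizes)

for n, acc in curve:
    print(f"{n:5d} training samples -> test accuracy {acc:.3f}")
```

If the accuracy is still climbing at the largest size, more data would probably help; if it has flattened, extra samples are unlikely to change much for that model. Raising `noise` mimics lower-quality data, and swapping in a more flexible model will usually move the point at which the curve flattens. With real digits you would replace `make_data` with your own collected samples.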

Haitao Du