The important takeaway is:
The bigger your test set, the less noisy the accuracy measure will be.
The intuition is that flipping the label on a single sample can look like a big improvement on a small test set, but could just as easily have been chance.
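As a rough illustration (not from the original answer), here is a minimal Python sketch assuming a model with a true accuracy of 90%: the binomial standard error of the measured accuracy shrinks like 1/sqrt(n), while one flipped label only moves the measurement by 1/n, so on a small test set a single flip sits well inside the noise band.

```python
import math

true_accuracy = 0.90  # hypothetical true accuracy (an assumption for illustration)

for n in [100, 1_000, 10_000, 1_000_000]:
    # Binomial standard error of the accuracy estimate on a test set of size n:
    # sqrt(p * (1 - p) / n)
    se = math.sqrt(true_accuracy * (1 - true_accuracy) / n)
    # Flipping one label changes the measured accuracy by exactly 1/n
    print(f"n={n:>9,}: std. error ~ {se:.4f}, one flipped label moves accuracy by {1/n:.6f}")
```

For n = 100, one flip changes measured accuracy by 1% while the noise is roughly ±3%, so the "improvement" is indistinguishable from chance; at n = 1,000,000 the noise band is about ±0.03%.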
Don't place much importance on their "rule of thumb" -- in fact, they tell statisticians to cover their ears precisely because they are not telling the whole story. If I have a humongous test set with 1 billion samples and I see 30 more correct samples after changing my model, that could very easily have happened by chance, not because my new model is actually better. Note, though, that having 1 billion samples in your test set would still be nice, because even a 0.1% improvement would mean flipping labels on a huge number of samples.
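To make the 1-billion-sample example concrete, here is a back-of-the-envelope sketch in Python (the 90% accuracy figure is again an assumption, and a paired test such as McNemar's would be the rigorous way to compare two models on the same test set; the point here is only the order of magnitude):

```python
import math

n = 1_000_000_000   # test set size
p = 0.90            # assumed accuracy of both models
gain = 30           # extra correct predictions after the model change

# Standard deviation of the number of correct predictions under a binomial model
sd_correct = math.sqrt(n * p * (1 - p))
print(f"noise in #correct: ~ +/- {sd_correct:,.0f} samples")              # ~ +/- 9,487
print(f"observed gain: {gain} samples = {gain / sd_correct:.4f} std devs")  # ~ 0.003 sd

# By contrast, a 0.1% improvement flips labels on n * 0.001 samples:
print(f"0.1% improvement = {int(n * 0.001):,} samples")                   # 1,000,000
```

A gain of 30 is about 0.003 standard deviations of pure sampling noise (pure chance), while a 0.1% improvement corresponds to a million samples, far outside the noise.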
Of course, the main downside of holding out data for a large test set is that your training set will be smaller. Ideally you have plenty of data for both!