
There's a deep learning course on Udacity made by Google Brain. One of the course's videos discusses the size of validation and test sets. The video states that a change affecting 30 samples is usually significant, but no demonstration is provided.

I want to know why they say that a change in 30 samples is significant and where this statement comes from.


1 Answer


The important takeaway is:

The bigger your test set, the less noisy the accuracy measure will be.

The intuition is that flipping labels on 1 sample will look like a big improvement in a small test set, but could have just been chance.

Don't place much importance on their "rule of thumb" -- in fact, they tell statisticians to cover their ears because they are not telling the whole story. If I have a humongous test set with 1 billion samples, and I see 30 more correct samples after changing my model, that could very easily have happened by chance, not because my new model is actually better. Note, though, that having 1 billion samples in your test set would still be nice, because even a 0.1% improvement would mean flipping labels on many samples.

Of course, the main downside of holding data for a large test set is that your training set will be smaller. Ideally you have large data sets for both!
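To make the noise argument concrete, here is a minimal sketch (Python; the 90% accuracy and the test-set sizes are made-up numbers, not from the video) comparing the accuracy change produced by 30 flipped samples against the binomial sampling noise of the accuracy estimate:

```python
import numpy as np

# Rough illustration: how large is a 30-sample change relative to the
# sampling noise of an accuracy estimate, for different test-set sizes?
# Assumes each prediction is an independent Bernoulli trial with success
# probability p (the model's true accuracy) -- purely illustrative numbers.
p = 0.90
for n in [1_000, 3_000, 30_000, 1_000_000_000]:
    se = np.sqrt(p * (1 - p) / n)   # binomial standard error of the accuracy estimate
    delta = 30 / n                  # accuracy change if 30 samples flip from wrong to right
    print(f"n = {n:>13,}   SE = {se:.2e}   30-sample change = {delta:.2e}   "
          f"change/SE = {delta / se:.2f}")
```

The same 30-sample change can sit at or above the noise level on a test set of a few thousand samples, yet be buried far below it on a very large one, which is the point of the 1-billion-sample example above.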

user20061
  • But what I want to know is the whole story they're not telling – susomena Jul 05 '17 at 15:09
  • I attempted to give you that in my response. Are there questions that you have about what I've said? – user20061 Jul 05 '17 at 15:12
  • My question is: how can I know how many samples have to change before that change is considered significant? In the case of the video, why is a change of 30 samples significant? – susomena Jul 05 '17 at 16:00
  • A difference of 30 samples may or may not be significant, depending on the size of your test set, the total proportion of the test set you are getting correct, and some assumptions. In order to test such a difference, you would use [Fisher's exact test](http://en.wikipedia.org/wiki/Fisher's_exact_test), which you can read about [here](https://stats.stackexchange.com/questions/123609/exact-two-sample-proportions-binomial-test-in-r-and-some-strange-p-values) and [here](https://stats.stackexchange.com/questions/113602/test-if-two-binomial-distributions-are-statistically-different-from-each-other). – user20061 Jul 05 '17 at 17:31
  • You may also be interested in model comparison or [model selection](https://en.wikipedia.org/wiki/Model_selection). – user20061 Jul 05 '17 at 17:34
  • And what about McNemar's test to check whether the change in the samples induced by the training step is significant? A few days ago, searching for this, I came across that test, but I couldn't reproduce the results of the video. – susomena Jul 06 '17 at 09:14
  • It depends on how you've constructed the test set. If you're randomly choosing the samples in the test set each time, then Fisher's may be appropriate. But if you've fixed the test set to be the same for both models, then yes, McNemar's test may be appropriate. The video was underspecified, so it would be very easy to get a result where a difference of >30 was not significant (like in my original answer), or where a difference of <30 was significant; a sketch of both tests follows below. – user20061 Jul 06 '17 at 11:20
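Following up on the last two comments, here is a minimal sketch (Python, requiring scipy and statsmodels; all counts are made up for illustration, not taken from the video) of the two designs mentioned above: McNemar's test for a test set held fixed across both models, and Fisher's exact test for two independently drawn test sets.

```python
import numpy as np
from scipy.stats import fisher_exact
from statsmodels.stats.contingency_tables import mcnemar

# Hypothetical counts on test sets of 3,000 samples -- made up for illustration.

# McNemar's test: the SAME test set is scored by both models (paired design).
# Rows = old model, columns = new model:
#                 new correct   new wrong
#   old correct        2630          20
#   old wrong            50         300
# The new model fixes 50 samples and breaks 20: a net gain of 30.
paired = np.array([[2630, 20],
                   [50, 300]])
print("McNemar p-value:", mcnemar(paired, exact=True).pvalue)

# Fisher's exact test: each model is evaluated on its OWN independently drawn
# test set (unpaired design).  Rows = model, columns = correct / wrong:
#                 correct   wrong
#   old model       2650     350
#   new model       2680     320
unpaired = np.array([[2650, 350],
                     [2680, 320]])
odds_ratio, p_value = fisher_exact(unpaired)
print("Fisher p-value:", p_value)
```

With these made-up counts, the paired (McNemar) analysis yields a much smaller p-value than the unpaired (Fisher) one, even though the net gain is 30 samples in both cases, which illustrates why a fixed "30 samples" rule of thumb cannot hold in general.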