I have two sets of 40,000 independent datasets (each composed of 10,000 elements) that I am comparing using an Anderson-Darling test. More precisely, I compare corresponding pairs of datasets across the two sets, so that at the end I have 40,000 p-values (or AD statistics, to be compared with critical values). To give a little more detail: I am comparing two models that I ran over 40,000 input test cases. These models produce statistical outputs, and I am trying to assess whether a change in the model significantly impacts the output distributions.
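For concreteness, here is a minimal sketch of the testing loop I have in mind, using `scipy.stats.anderson_ksamp`. The arrays `outputs_a` and `outputs_b` are placeholders standing in for my two sets of model outputs, and note that SciPy caps the returned significance level to the range [0.001, 0.25]:

```python
import numpy as np
from scipy.stats import anderson_ksamp

# Placeholder data: 40,000 test cases, 10,000 output samples each.
# In practice these would be the outputs of the two model versions.
rng = np.random.default_rng(0)
n_cases, n_samples = 40_000, 10_000
outputs_a = rng.normal(size=(n_cases, n_samples))
outputs_b = rng.normal(size=(n_cases, n_samples))

# One two-sample Anderson-Darling test per test case
# (slow at full size; shown here for structure).
p_values = np.empty(n_cases)
for i in range(n_cases):
    res = anderson_ksamp([outputs_a[i], outputs_b[i]])
    p_values[i] = res.significance_level  # capped to [0.001, 0.25] by SciPy
```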
My question is the following: when searching for a significant result, should I correct my p-values to account for the fact that I am running 40,000 tests (for instance, with a Bonferroni correction)? As far as I understand, if I select a significance level $\alpha = 0.05$, I consider a result significant when its p-value falls below 5%, because under the null hypothesis it has no more than a 5% chance of arising by chance.
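If I understand the Bonferroni correction correctly, it would amount to something like the following sketch, continuing from the `p_values` array above (the variable names are mine):

```python
alpha = 0.05
n_tests = len(p_values)

# Bonferroni: test each p-value against alpha / n_tests
# (equivalently, multiply each p-value by n_tests and compare to alpha).
# Caveat: with 40,000 tests the threshold is 1.25e-6, below SciPy's
# reported floor of 0.001, so exact p-values would be needed in practice.
reject = p_values < alpha / n_tests
print(f"{reject.sum()} of {n_tests} tests significant after Bonferroni")
```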
However, when I run 40,000 tests, I am much more likely to obtain small p-values purely by chance, am I not? Under the null hypothesis the p-values are (approximately) uniform, so I would expect about $0.05 \times 40\,000 = 2\,000$ of them to fall below 0.05 even if the two models were identical.
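A quick simulation of this intuition, purely illustrative, with uniform draws standing in for p-values under the null:

```python
import numpy as np

rng = np.random.default_rng(1)
null_p = rng.uniform(size=40_000)  # p-values under H0 are ~Uniform(0, 1)
print((null_p < 0.05).sum())       # about 2,000 "significant" results by chance
```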
It seems to me that I could instead consider my models to be different if more than 5% of the computed p-values fall below 0.05, rather than requiring just one to do so (as sketched below). Would this be correct?
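In code, my proposal would look something like this: treat each test as a Bernoulli trial with success probability 0.05 under the null, and check whether the observed fraction of significant results is larger than expected. This sketch assumes the 40,000 tests are independent and again reuses the `p_values` array from above:

```python
from scipy.stats import binomtest

n_below = int((p_values < 0.05).sum())
n_tests = len(p_values)

# Under H0 (models identical), each test falls below 0.05 with probability 0.05.
# Test whether the observed count exceeds what chance alone would produce.
result = binomtest(n_below, n_tests, p=0.05, alternative="greater")
print(f"{n_below}/{n_tests} p-values below 0.05; binomial test p = {result.pvalue:.3g}")
```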