I would put it simply: null hypothesis testing is really only about the null hypothesis. And generally the null hypothesis isn't what is of interest, and may not even be "the status quo" - especially in regression-type hypothesis testing. Often in social science there is no status quo, so the null hypothesis can be quite arbitrary. This makes a huge difference to the analysis, because the starting point is undefined: different researchers start with different null hypotheses, most likely based on whatever data they have available. Compare this to something like Newton's laws of motion - it makes sense to take these as the null hypothesis and try to find better theories from that starting point.
Additionally, p-values don't calculate the right probability: we don't care about tail probabilities unless the alternative hypothesis becomes more likely as you move further into the tails. What you really want to know is how well each theory predicts what was actually seen. For example, suppose I predict a 50% chance of a "light shower" and my competitor predicts a 75% chance, and a light shower is what we observe. When deciding which weather-person did better, you shouldn't give my prediction extra credit for also assigning a 40% chance to a "thunderstorm", or take credit away from my competitor for assigning "thunderstorm" a 0% chance.
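To make that comparison concrete, here is a minimal sketch in Python. The outcome categories and probabilities are the ones given above; labelling the leftover probability mass "something else" is my own assumption, made only so that each forecast is a proper distribution.

```python
# Toy sketch of the forecaster example above.
my_forecast         = {"light shower": 0.50, "thunderstorm": 0.40, "something else": 0.10}
competitor_forecast = {"light shower": 0.75, "thunderstorm": 0.00, "something else": 0.25}

observed = "light shower"

# Score each forecaster only on the probability assigned to what actually happened.
my_score = my_forecast[observed]                  # 0.50
competitor_score = competitor_forecast[observed]  # 0.75

print(competitor_score / my_score)  # likelihood ratio of 1.5 in favour of the competitor
```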
A bit of thought shows that what matters is not so much how well a given theory fits the data, but how poorly every alternative explanation fits the data. Working in terms of Bayes factors, with prior information $I$, data $D$, and some hypothesis $H$, the Bayes factor is given by:
$$BF=\frac{P(D|HI)}{P(D|\overline{H}I)}$$
If the data are impossible given that $H$ is false, then $BF=\infty$ and we become certain of $H$. The p-value typically gives you the numerator (or some approximation/transformation thereof). But note also that a small p-value only constitutes evidence against the null if there is an alternative hypothesis that actually fits the data. You could invent situations where a p-value of $0.001$ actually provides support for the null hypothesis - it all depends on what the alternative is.
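To spell out why this ratio is the relevant quantity: it is exactly the factor that converts prior odds into posterior odds,

$$\frac{P(H|DI)}{P(\overline{H}|DI)}=BF\times\frac{P(H|I)}{P(\overline{H}|I)}$$

so the data can only shift your belief in $H$ through how much better (or worse) $H$ predicts the data than $\overline{H}$ does.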
There is a well known and easily misunderstood empirical example of this, in which a coin is tossed $n=104,490,000$ times and comes up heads $y=52,263,471$ times - slightly more than half. The null model is $y\sim Bin(n,0.5)$ and the alternative is $y|\theta\sim Bin(n,\theta)$ with $\theta\sim U(0,1)$, giving the marginal model $y\sim BetaBin(n,1,1)$, which is $DU(0,\dots,n)$ (DU = discrete uniform). The p-value for the null hypothesis is very small, $p=0.00015$, so reject the null and publish, right? But look at the Bayes factor, given by:
$$BF=\frac{{n\choose y}2^{-n}}{\frac{1}{n+1}}=\frac{(n+1)!}{2^ny!(n-y)!}=11.90$$
How can this be? The Bayes factor supports the null hypothesis in spite of the small p-value? Well, look at the alternative: it assigns the observed count a probability of only $\frac{1}{n+1}\approx 0.0000000096$ - the alternative does not provide a good explanation of the facts, so the null is more likely, but only relative to that alternative. Note that the null itself only does marginally better, assigning the observed count a probability of about $0.00000011$ - tiny in absolute terms, but still roughly twelve times larger than the alternative's.
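If you want to check these numbers yourself, here is a minimal sketch using scipy; the one-sided tail probability $P(Y\ge y)$ under the null is what roughly reproduces the quoted $p=0.00015$.

```python
from scipy import stats

n, y = 104_490_000, 52_263_471

# Tail probability under the null theta = 0.5: P(Y >= y), roughly the quoted 0.00015.
p_tail = stats.binom.sf(y - 1, n, 0.5)

# Probability each model assigns to the count actually observed.
null_prob = stats.binom.pmf(y, n, 0.5)  # ~ 0.00000011
alt_prob = 1.0 / (n + 1)                # BetaBin(n, 1, 1) is uniform on 0..n, ~ 0.0000000096

print(p_tail, null_prob, alt_prob, null_prob / alt_prob)  # Bayes factor ~ 11.9
```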
This is especially true for the example that Gelman criticises - there was only ever really one hypothesis tested, and not much thought went into a) what the alternative explanations are (particularly confounding and effects not controlled for), b) how much the alternatives are supported by previous research, and, most importantly, c) what predictions they make (if any) which are substantively different from the null.
But note that $\overline{H}$ is undefined - it basically represents every other hypothesis consistent with the prior information. The only way you can really do hypothesis testing properly is to specify a range of alternatives that you are going to compare. And even if you do that, say with $H_1,\dots,H_K$, you can only report that the data support $H_k$ relative to the set you specified. If you leave important hypotheses out of that set, you can expect nonsensical results. Additionally, a given alternative may fit much better than the others and still not be likely. If you have one test where the p-value is $0.01$ but one hundred other tests where the p-value is $0.1$, it is much more likely that the "best hypothesis" ("best" has better connotations than "true") actually comes from the group of "almost significant" results.
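One rough way to put numbers on that last claim (not part of the original argument, so treat it as a sketch only): convert each p-value into the Sellke-Bayarri-Berger upper bound on the evidence against the null, $1/(-e\,p\ln p)$, give every candidate hypothesis equal prior weight, and see where the posterior mass ends up.

```python
import math

# Purely illustrative: turn each p-value into an upper bound on the evidence
# against the null via the Sellke-Bayarri-Berger calibration 1 / (-e * p * ln p).
# These are upper bounds, so the real support could be much weaker.
def max_bayes_factor(p):
    return 1.0 / (-math.e * p * math.log(p))

single = max_bayes_factor(0.01)        # ~ 8.0, the one "significant" result
group = [max_bayes_factor(0.1)] * 100  # ~ 1.6 each, the "almost significant" ones

# With equal prior weight on all 101 candidates, posterior mass is proportional
# to these weights (a simplifying assumption for illustration only).
total = single + sum(group)
print(single / total)      # ~ 0.05
print(sum(group) / total)  # ~ 0.95
```

Even though the single hypothesis has the strongest individual support, under these (very rough) assumptions about 95% of the mass sits in the "almost significant" group.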
The major point to stress is that a hypothesis can never exist in isolation from the alternatives. For, after specifying $K$ theories/models, you can always add a new hypothesis
$$H_{K+1}=\text{Something else not yet thought of}$$
In effect, this type of hypothesis is what progresses science - someone has a new idea/explanation for some effect and then tests the new theory against the current set of alternatives. It's $H_{K+1}$ vs $H_1,\dots,H_K$, and not simply $H_0$ vs $H_A$. The simplified version only applies when there is one very strongly supported hypothesis among $H_1,\dots,H_K$ - i.e., of all the ideas and explanations we currently have, there is one dominant theory that stands out. This is definitely not true for most areas of social/political science, economics, and psychology.