The reproducibility crisis has given many people pause over the value of $p$-values as a measure of the relevance of statistical findings. Given the interpretation of a $p$-value and a little probability, it is not surprising that many confirmatory studies fail to show $p<0.05$ when the originating study had $p<0.05$: unless the replication has power close to 1, this is guaranteed to happen at a rate much higher than $0.05$. The part I struggle with is whether such a failure actually confirms or disproves the originating study.
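To make that rate concrete, here is a quick simulation sketch (my own illustration, with an assumed design: a two-sample $t$-test, true standardized effect $d = 0.4$, $n = 50$ per arm). It counts how often an exact replication of a real effect still fails to reach $p<0.05$.

```python
# Minimal sketch under assumed parameters (d = 0.4, n = 50 per arm):
# replications of a real effect "fail" far more often than 5% because
# power is well below 1.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
d, n, reps = 0.4, 50, 10_000          # assumed effect size, per-arm n, simulations

fail = 0
for _ in range(reps):
    a = rng.normal(0.0, 1.0, n)       # control arm
    b = rng.normal(d, 1.0, n)         # treatment arm with true effect d
    _, p = stats.ttest_ind(b, a)
    fail += p >= 0.05                 # replication fails to reach significance

print(f"share of replications with p >= 0.05: {fail / reps:.2f}")
# Power is roughly 0.5 under these assumptions, so about half of the
# replications come out "non-significant" even though the effect is real.
```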
One thought: why aren't these studies being compared in terms of their confidence intervals instead? If the originating study is declared statistically significant because its 95% CI excludes the null-hypothesized value (equivalent to $p$-value based inference), it seems much more plausible that a confirmatory study would produce an estimate lying within that 95% CI, even when the confirmatory study is not statistically significant itself.
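Here is a hedged sketch of that comparison under the same assumed design as above. For original studies that came out significant, it contrasts two replication criteria: "the replication is itself significant at $p<0.05$" versus "the replication's estimate falls inside the original 95% CI".

```python
# Sketch under the same assumptions (d = 0.4, n = 50 per arm): compare
# "replication significant" vs "replication estimate inside original 95% CI".
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
d, n, reps = 0.4, 50, 20_000

sig_repl, inside_ci, originals = 0, 0, 0
for _ in range(reps):
    a0, b0 = rng.normal(0, 1, n), rng.normal(d, 1, n)   # "original" study
    _, p0 = stats.ttest_ind(b0, a0)
    if p0 >= 0.05:
        continue                        # keep only significant originals
    originals += 1
    diff0 = b0.mean() - a0.mean()
    se0 = np.sqrt(a0.var(ddof=1) / n + b0.var(ddof=1) / n)
    lo, hi = diff0 - 1.96 * se0, diff0 + 1.96 * se0     # approximate 95% CI

    a1, b1 = rng.normal(0, 1, n), rng.normal(d, 1, n)   # replication study
    diff1 = b1.mean() - a1.mean()
    _, p1 = stats.ttest_ind(b1, a1)
    sig_repl += p1 < 0.05
    inside_ci += lo <= diff1 <= hi

print(f"replication significant:        {sig_repl / originals:.2f}")
print(f"replication inside original CI: {inside_ci / originals:.2f}")
# Under these assumptions the CI criterion is met in roughly 8 of 10
# replications, while repeated significance holds in only about half.
```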
Does this imply that the usual basis for evaluating the reproducibility of studies (re-checking statistical significance, rather than comparing the new estimate against the original interval) is wrong?