I recently joined a company running a lot of A/B tests with many bad practices. Although I can explain why most of them are flawed, there is one that I can't really wrap my head around.
Let's use an example. I want to measure the difference in Average Revenue Per User (ARPU) between two versions. Instead of calculating a sample size beforehand to know when to stop my test, I monitor the difference in ARPU between the two groups every morning. During the first few days this value oscillates quite a lot before converging steadily toward one value after N weeks. If this value is positive, I argue that my test group has a higher ARPU than the control group by $X (provided the test is statistically significant).
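To make the problem concrete, here is a small simulation sketch (the parameters and function name are my own, purely for illustration). It runs an A/A test, i.e. both groups are drawn from the same distribution so there is no true difference, and applies a naive two-sided z-test at the 5% level every "morning". Because we peek repeatedly and stop at the first significant result, the chance of ever declaring a winner ends up far above the nominal 5%:

```python
import numpy as np

rng = np.random.default_rng(0)

def peeking_false_positive_rate(n_sims=2000, n_days=30, users_per_day=100):
    """Simulate an A/A test (no real difference) with daily peeking.

    Returns the fraction of simulations where a naive z-test at the
    nominal 5% level was significant on AT LEAST ONE day.
    """
    false_positives = 0
    for _ in range(n_sims):
        # Both arms drawn from the same distribution: any "win" is spurious.
        a = rng.normal(0.0, 1.0, size=(n_days, users_per_day))
        b = rng.normal(0.0, 1.0, size=(n_days, users_per_day))
        for day in range(1, n_days + 1):
            xa, xb = a[:day].ravel(), b[:day].ravel()
            n = xa.size
            se = np.sqrt(xa.var(ddof=1) / n + xb.var(ddof=1) / n)
            z = (xa.mean() - xb.mean()) / se
            if abs(z) > 1.96:  # two-sided test at the nominal 5% level
                false_positives += 1
                break  # stop the test as soon as it "looks significant"
    return false_positives / n_sims

print(peeking_false_positive_rate())  # well above the nominal 0.05
```

With 30 daily peeks, this typically reports a false-positive rate several times the advertised 5%, which is exactly the issue with the monitor-every-morning procedure described above.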
This does not seem rigorous at all, but I have a hard time explaining why in layman's terms, because the procedure sounds quite intuitive.