I have been reading about the work at Bletchley Park to crack Enigma during World War II. Part of what Turing and his colleagues did was compare pairs of encrypted messages to decide if the Germans had created them from the same Enigma start setting, or from different settings. In other words, we are dealing with the following situation:
A) One German unit has sent a message using one Enigma in some start setting.
B) Another German unit has sent another message using another Enigma in some start setting.
C) The British intercept both messages.
D) Turing and his colleagues try to guess if the start settings on the two Enigmas are identical.
Obviously, these were highly educated guesses. They were based on sequential analysis and the weight of evidence in favor of a hypothesis, $h$, given the evidence, $e$:
$woe(h:e) = \log \left(\frac{p(e \mid h)}{p(e \mid \bar{h})}\right)$
What Turing and his colleagues did, as far as I understand, was exploit the fact that pairs of messages created from the same start setting retained an essential property of two natural-language texts compared letter by letter: when the messages are written one above the other, so to speak, the letters in the same position match at a rate of about 0.076. In contrast, when the messages behave like random sequences of letters, the match rate is only about 0.037; crucially, intercepted messages created from different start settings fell into the latter category, even though they concealed natural language.
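As a sanity check on the numbers that come next: if the weight of evidence is measured in decibans (ten times the base-10 logarithm of the likelihood ratio; I understand this was Turing's unit, but take the scale as my assumption for this check), then the two rates translate into per-letter contributions of

$10 \log_{10} \frac{0.076}{0.037} \approx +3.1 \qquad \text{and} \qquad 10 \log_{10} \frac{1 - 0.076}{1 - 0.037} \approx -0.18$

per match and per non-match, respectively.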
Now, Turing and his colleagues moved along the letters of the two messages sequentially and, for each match, incremented the weight of evidence for “h: same setting” by 3.1, while changing the total by -0.18 for each non-match. Then, once the accumulated evidence reached one of two predetermined thresholds, they either guessed that the start settings were the same (i.e., if the weight of evidence broke the upper threshold) or concluded that the start settings were different (i.e., if the weight of evidence broke the lower threshold).
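To make sure I understand the procedure, here is a minimal sketch in Python of how I picture the sequential scoring. The function name `sequential_test`, the threshold values, and the toy ciphertexts are my own inventions for illustration; only the two match rates and the resulting per-letter weights come from the description above.

```python
import math

# Per-letter match rates from the text: ~0.076 for messages produced from the
# same start setting, ~0.037 for messages that behave like random letter streams.
P_MATCH_SAME = 0.076
P_MATCH_DIFF = 0.037

# Weight of evidence per letter, in decibans (10 * log10 of the likelihood ratio).
W_MATCH = 10 * math.log10(P_MATCH_SAME / P_MATCH_DIFF)                 # about +3.1
W_MISMATCH = 10 * math.log10((1 - P_MATCH_SAME) / (1 - P_MATCH_DIFF))  # about -0.18


def sequential_test(msg_a, msg_b, upper=20.0, lower=-10.0):
    """Scan two ciphertexts letter by letter, accumulating the weight of evidence
    for h: "same start setting", and stop as soon as a threshold is crossed.

    The threshold values are arbitrary, purely for illustration; they are not
    the ones actually used at Bletchley Park.
    """
    total = 0.0
    for a, b in zip(msg_a, msg_b):
        total += W_MATCH if a == b else W_MISMATCH
        if total >= upper:
            return "same setting", total
        if total <= lower:
            return "different settings", total
    return "undecided", total


# Toy usage with made-up ciphertexts of equal length.
print(sequential_test("QWERTZUIOPASDFGH", "QWERTZUIOPASDFGH"))  # crosses the upper threshold
print(sequential_test("QWERTZUIOPASDFGH", "LKJHGFDSAMNBVCXY"))  # drifts slowly downward, stays undecided
```

Nothing in the sketch depends on the exact thresholds; the point is only the accumulate-and-stop logic.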
As far as I understand, the thresholds used corresponded to an exact probability of $h$ (i.e., a personal probability of “same setting”), but if the intercepted messages were long enough, and had been sent from the same initial setting, the accumulated weight of evidence would, in principle, grow without bound toward positive infinity, corresponding to a personal probability of 1. This makes a great deal of sense, since there was nothing actually probabilistic about the start settings: they either were, or were not, the same. After all, the German units had already sent the messages!
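For concreteness, the way I picture the translation from an accumulated weight of evidence $W$ (again in decibans, my assumption about the unit) to a personal probability is through the odds form of Bayes' theorem, with odds $O = p/(1-p)$:

$O(h \mid e) = O(h) \cdot 10^{W/10}, \qquad p(h \mid e) = \frac{O(h \mid e)}{1 + O(h \mid e)}$

So, given some prior odds $O(h)$, a finite threshold on $W$ corresponds to a definite personal probability short of 1, while $W \to +\infty$ pushes $p(h \mid e)$ to 1.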
But this is where things get interesting for me. Suppose we now switch scenes and face the problem of guessing what state a gold coin WILL be in AFTER it is flipped. And suppose, furthermore, that, before flipping the gold coin, we have the benefit of collecting evidence about it by flipping as many equivalent silver coins as we want. As we do this, the total weight of evidence for “h: the gold coin will land on heads” should climb, if the gold coin tends to land on heads more often than chance.
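To pin down my own confusion, here is a small simulation of what I have in mind. I have to invent likelihoods to compute any weight of evidence at all, so purely for illustration I treat h as “the coins land heads with probability 0.6” and $\bar{h}$ as “the coins are fair”; these specific hypotheses, the function `accumulated_woe`, and the true heads rate in the simulation are all my own assumptions, not part of the original setup.

```python
import math
import random

# Purely illustrative simple hypotheses (my own invention, not part of the setup):
# h:    the coins land heads with probability 0.6
# hbar: the coins are fair (heads with probability 0.5)
P_HEADS_H = 0.6
P_HEADS_HBAR = 0.5

# Per-flip weight of evidence in decibans.
W_HEADS = 10 * math.log10(P_HEADS_H / P_HEADS_HBAR)              # about +0.79 per head
W_TAILS = 10 * math.log10((1 - P_HEADS_H) / (1 - P_HEADS_HBAR))  # about -0.97 per tail


def accumulated_woe(n_flips, true_p_heads=0.6, seed=0):
    """Flip n_flips silver coins with the given true heads probability and
    return the total accumulated weight of evidence for h."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_flips):
        heads = rng.random() < true_p_heads
        total += W_HEADS if heads else W_TAILS
    return total


for n in (10, 100, 1000, 10000):
    print(n, round(accumulated_woe(n), 1))
```

Under these made-up numbers the expected increment per flip is positive (roughly 0.09 decibans), so the total should grow roughly in proportion to the number of flips rather than settle down, which is exactly what makes me lean toward A below.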
What I don’t get is this:
A: Does the weight of evidence grow without bound toward positive infinity after many observations of silver-coin flips, and thereby completely overshoot the probability (in the frequentist sense) of the gold coin landing on heads?
OR,
B: Does the weight of evidence collected from flipping silver coins converge to a finite value which, when translated into a personal probability, equals the probability (in the frequentist sense) of the gold coin landing on heads?
I suspect the answer is A, which would suggest one can go terribly awry using the weight of evidence for predicting! Have I gotten this all wrong?