I also bumped into this passage recently! I'm not very familiar with probability/information theory, so I hope this makes sense and my notation is understandable; I tried for precision at the expense of brevity, but there's some notation in the book that I just don't know how to use precisely.
As far as I can tell, the "data-processing inequality" for KL divergence (aka relative entropy) is proved "in the same way as the data-processing inequality for mutual information" in the sense that both proofs expand a certain quantity in two ways using a chain rule and then bound parts of the expansion. This is despite the fact that the chain rule for mutual information (Theorem 2.5.2) and the chain rule for relative entropy (Theorem 2.5.3) don't seem analogous to me, except in some intuitive sense.
In its most general form, the data-processing inequality for relative entropy is probably something like this:
Theorem. Let $X_1$ and $X_2$ be two random variables with the same set of possible values $\mathcal{X}$ and probability distributions $P_1$ and $P_2$. Let $Y_1$ and $Y_2$ be two random variables with the same set of possible values $\mathcal{Y}$ that depend on $X_1$ and $X_2$, respectively, in the same way; that is, $Y_1$ is drawn from a distribution depending only on $X_1$, $Y_2$ is drawn from a distribution depending only on $X_2$, and for every $x \in \mathcal{X}$ the conditional distribution of $Y_1$ given $X_1 = x$ is the same as the conditional distribution of $Y_2$ given $X_2 = x$. If the probability distributions of $Y_1$ and $Y_2$ are $\hat{P_1}$ and $\hat{P_2}$, then
$$
D(P_1\| P_2) \geq D(\hat{P_1}\| \hat{P_2}).
$$
Note: We can in particular pick a function $f : \mathcal{X} \to \mathcal{Y}$ and just take $Y_1$ and $Y_2$ to be $f(X_1)$ and $f(X_2)$. This is how this inequality is applied in the proof of Theorem 11.6; $f$ is taken to be the indicator function of the set
$$
A := \{x \in \mathcal{X} : P_1(x) > P_2(x)\},
$$
which reduces the theorem for general random variables to the case of binary random variables.
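Concretely, for this choice of $f$ the distributions $\hat{P_1}$ and $\hat{P_2}$ are Bernoulli, with $\hat{P_i}(1) = P_i(A)$ and $\hat{P_i}(0) = 1 - P_i(A)$, so the inequality specializes to
$$
D(P_1\| P_2) \geq P_1(A) \log \frac{P_1(A)}{P_2(A)} + (1 - P_1(A)) \log \frac{1 - P_1(A)}{1 - P_2(A)},
$$
which, if I'm reading it right, is the binary bound that the proof of Theorem 11.6 needs.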
Proof: By applying the chain rule for relative entropy (Theorem 2.5.3), we can expand the KL divergence $D(P_1, \hat{P_1}\| P_2, \hat{P_2})$ between the joint distribution of $(X_1, Y_1)$ and the joint distribution of $(X_2, Y_2)$ (both over the values $\mathcal{X} \times \mathcal{Y}$) in two ways:
$$
\begin{align*}
D(P_1, \hat{P_1}\| P_2, \hat{P_2})
&= D(\hat{P_1}\|\hat{P_2}) + D(P_1(X_1|Y_1)\| P_2(X_2|Y_2))\\
&= D(P_1\| P_2) + D(P_1(Y_1|X_1)\| P_2(Y_2|X_2)).
\end{align*}
$$
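For reference, the chain rule (Theorem 2.5.3) says that for joint distributions $p(x,y)$ and $q(x,y)$,
$$
D(p(x,y)\| q(x,y)) = D(p(x)\| q(x)) + D(p(y|x)\| q(y|x));
$$
the second line above applies this as stated (factoring each joint as the marginal of $X$ times the conditional of $Y$ given $X$), and the first line applies it with the roles of $X$ and $Y$ swapped (factoring each joint as the marginal of $Y$ times the conditional of $X$ given $Y$).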
The last term in each line is a conditional relative entropy, defined in (2.66) as
$$
D(p(y|x)\| q(y|x)) := \mathbb{E}_{p(x,y)} \log \frac{p(Y|X)}{q(Y|X)}.
$$
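Unpacking the expectation as a double sum over the (discrete) alphabets (this is just rewriting the definition, nothing new):
$$
D(p(y|x)\| q(y|x)) = \sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p(x,y) \log \frac{p(y|x)}{q(y|x)} = \sum_{x \in \mathcal{X}} p(x) \sum_{y \in \mathcal{Y}} p(y|x) \log \frac{p(y|x)}{q(y|x)}.
$$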
Since the conditional distribution of $Y_1$ given $X_1$ is identical to the conditional distribution of $Y_2$ given $X_2$, the conditional relative entropy between them is 0 (equation 2.91), i.e.,
$$
D(P_1(Y_1|X_1)\|P_2(Y_2|X_2)) = 0.
$$
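Spelled out, using the assumption that $P_1(y|x) = P_2(y|x)$ wherever $P_1(x,y) > 0$:
$$
D(P_1(Y_1|X_1)\| P_2(Y_2|X_2)) = \mathbb{E}_{P_1(x,y)} \log \frac{P_1(y|x)}{P_2(y|x)} = \mathbb{E}_{P_1(x,y)} \log 1 = 0.
$$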
Since conditional relative entropies are nonnegative (equation 2.91),
$$
D(P_1(X_1|Y_1)\| P_2(X_2|Y_2)) \geq 0.
$$
Therefore,
$$
\begin{align*}
D(P_1\| P_2)
&= D(\hat{P_1}\|\hat{P_2}) + D(P_1(X_1|Y_1)\| P_2(X_2|Y_2))
\\ &\geq D(\hat{P_1}\|\hat{P_2}),
\end{align*}
$$
as desired.
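As a quick sanity check (my own code, not from the book), here is a small numerical experiment: draw random distributions $P_1, P_2$ on a 5-letter alphabet, push both through one random channel $W(y|x)$ to get $\hat{P_1}, \hat{P_2}$, and verify that $D(P_1\|P_2) \geq D(\hat{P_1}\|\hat{P_2})$ in every trial.
```python
import numpy as np

rng = np.random.default_rng(0)

def kl(p, q):
    """Relative entropy D(p || q) in nats; assumes q > 0 wherever p > 0."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

nx, ny = 5, 3
for _ in range(1000):
    # Two input distributions P1, P2 on the same alphabet of size nx.
    p1 = rng.dirichlet(np.ones(nx))
    p2 = rng.dirichlet(np.ones(nx))
    # One channel applied to both: row x of w is the distribution of Y given X = x.
    w = rng.dirichlet(np.ones(ny), size=nx)
    # Output (pushed-forward) distributions of Y1 and Y2.
    p1_hat = p1 @ w
    p2_hat = p2 @ w
    assert kl(p1, p2) >= kl(p1_hat, p2_hat) - 1e-12
print("D(P1 || P2) >= D(P1_hat || P2_hat) held in all 1000 trials")
```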
Note 2: These lecture notes seem to have a proof of a more general version of this theorem, which also applies to more general divergences, although I can't say I understand it.