I feel like I'm really close but just need to bridge a final gap.
The challenge with having a 3:1 split (i.e. $P(A) = 0.75$, $P(B) = 0.25$) is that the entropy comes out to about 0.811 bits per symbol, and we can't ask a fractional number of yes/no questions, which is why the yes/no framework falls apart.
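For concreteness, here's a quick check of that number (a minimal sketch in Python, taking the split as $P(A) = 0.75$, $P(B) = 0.25$):

```python
from math import log2

p_a, p_b = 0.75, 0.25  # the 3:1 split
entropy = -p_a * log2(p_a) - p_b * log2(p_b)
print(entropy)  # ~0.8113 bits per symbol
```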
Consider a sequence of As and Bs.
...AABAAABABBAAABAAAA
If the ratio were 1:1 I would just have to tell you what each symbol is, A or B, and the entropy would be 1 (1 bit of information per symbol). But now I can take advantage of the fact that there are more A's than B's by providing a different piece of information: "How many A's are there before the next B appears?". For the sequence above I would say: "4, 3, 0, 1, 3, ..." (parsing the sequence from right to left, since it trails off on the left).
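Here's a little sketch of that parsing (assuming we simply scan right to left and stop where the sequence runs out):

```python
def runs_of_a(seq: str) -> list[int]:
    """Count how many A's precede each B, scanning right to left."""
    counts, run = [], 0
    for ch in reversed(seq):
        if ch == 'A':
            run += 1
        else:  # a B closes off the current run of A's
            counts.append(run)
            run = 0
    return counts

print(runs_of_a("AABAAABABBAAABAAAA"))  # [4, 3, 0, 1, 3] -- the leading AA is an unfinished run
```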
Of course, what I'm saying is not in bits, it's in base 10, and I say a different number each time. So I need to find the expected number of bits required (all logs here are base 2). Let's do that in steps. First, we write down the probability of seeing $n$ A's followed by a single B:
$$
P(n) = (0.75)^n\cdot 0.25
$$
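As a sanity check, simulating the source and tallying run lengths should reproduce $P(n)$ (a sketch; the seed and trial count are arbitrary choices):

```python
import random

random.seed(0)
trials = 100_000
tally = {}
for _ in range(trials):
    n = 0
    while random.random() < 0.75:  # emit another A with probability 0.75
        n += 1
    tally[n] = tally.get(n, 0) + 1  # the run ended with a B

for n in range(5):
    print(n, tally.get(n, 0) / trials, (0.75 ** n) * 0.25)  # empirical vs. P(n)
```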
Then the expectation value is:
$$
E_{n \sim P}[n] = 0.25\sum_{i=1}^{\infty} i(0.75)^i
$$
where we've taken the 0.25 out of the sum. Now we apply the sum of a geometric series identity (with a slight modification)
$$
\sum_{i=1}^{\infty} ir^i = \frac{r}{(1-r)^2} \quad \text{for } |r| < 1
$$
(I figured that out by modifying the proof for the standard sum of a geometric series, so you might want to check it for yourself; the numerical sketch below does exactly that). So in applying that we get
$$
\begin{aligned}
E_{n \sim P}[n] &= \frac{0.25 \cdot 0.75}{(1-0.75)^2} \\
&= 3
\end{aligned}
$$
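Both the identity and the expectation check out numerically (a sketch; the sum is truncated at 1000 terms, which is plenty given the geometric decay):

```python
r = 0.75
partial = sum(i * r**i for i in range(1, 1000))
print(partial)           # ~12.0
print(r / (1 - r) ** 2)  # 12.0, matching the identity
print(0.25 * partial)    # E[n] ~ 3.0
```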
A run of $n$ A's plus its terminating B spans $n+1$ symbols, so the expected run length is $E[n] + 1 = 4 = 1/0.25$. Hand-waving slightly, each answer then singles out one of roughly 4 typical outcomes, and so the expected number of bits required is $\log(4) = -\log(0.25)$. Now the other important aspect is how often I need to provide this information. Because B occurs 25% of the time, I need to provide an answer for 1 in every 4 symbols on average. So the number of bits required per symbol is $-0.25\log(0.25)$.
But we know the full entropy is $-0.25\log(0.25) - 0.75\log(0.75)$. What about the second part? Well, we could have gone back and applied the same logic to "How many B's are there before the next A appears?". This is also an interesting question, because although I have to provide information more often on average (for 75% of symbols instead of 25%), the expected number of bits required to encode each answer is just $-\log(0.75)$ (which is smaller than $-\log(0.25)$).
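The same machinery applies to method 2 with the roles swapped, i.e. $P(m) = (0.25)^m \cdot 0.75$ (a quick numeric sketch):

```python
r = 0.25  # now B is the repeated symbol
partial = sum(i * r**i for i in range(1, 1000))
print(0.75 * partial)  # E[m] = 1/3: runs of B's are short
# each answer spans m + 1 symbols, so the expected run length is
# 1/3 + 1 = 4/3 = 1/0.75, i.e. one answer per A on average
```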
Cool, so to summarise... If I use method 1:
"How many A's are there before the next B appears?"
the average number of bits required per symbol is $-0.25\log(0.25) = 0.5$
And if I use method 2:
"How many B's are there before the next A appears?"
the average number of bits required per symbol is $-0.75\log(0.75) \approx 0.311$
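Putting the two methods side by side against the full entropy (the same arithmetic as above, in code):

```python
from math import log2

method_1 = -0.25 * log2(0.25)  # runs of A's: 0.5 bits/symbol
method_2 = -0.75 * log2(0.75)  # runs of B's: ~0.311 bits/symbol
print(method_1, method_2, method_1 + method_2)  # the sum is the entropy, ~0.811
```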
Now I don't know how to bridge the gap and explain why summing these two quantities is the right answer. Shouldn't I just choose the cheaper method, i.e. method 2?