Here's an idea that's less statistically powerful, but more intuitive, than a chi-squared test.
If there's bias, then the most common answer should have high frequency. So you use the number of times the most common answer was observed as a test statistic. Call that $t$ for "top". The $p$-value is the probability, if the answers were uniformly random, that the most common answer would appear $t$ times or more. This probability is:
$$1 - P(A \le t-1, B \le t-1, C \le t-1, D \le t-1)$$
This takes a little unpacking. We see a probability that they're all below $t$, but the $1-$ means we're calculating the probability that this condition is violated. What would it take to violate the condition? A category having $t$ or more counts. So, it's precisely that the maximum is $t$ or more.
This multinomial probability can be calculated using my R package pmultinom as follows:
    1 - pmultinom(upper=rep.int(t-1, 4), size=number.of.questions,
                  probs=rep.int(1/4, 4), method="exact")
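As a rough cross-check that does not need the package, the same tail probability can be approximated by simulation in base R. This is only a sketch: `simulate_max_pvalue`, `t_obs`, `n_sims`, and the example numbers are names and values made up for illustration, and the result is a Monte Carlo estimate, not the exact value that `pmultinom` computes.

```r
# Monte Carlo estimate of P(max count >= t_obs) under uniformly random answers.
# Hypothetical helper, not part of pmultinom.
simulate_max_pvalue <- function(t_obs, n_questions, n_sims = 10000, k = 4) {
  # Each column is one simulated test: counts of the k answers over n_questions.
  sims <- rmultinom(n_sims, size = n_questions, prob = rep(1 / k, k))
  # Count of the most common answer in each simulated test.
  maxes <- apply(sims, 2, max)
  # Fraction of simulated tests whose top count is at least the observed one.
  mean(maxes >= t_obs)
}

set.seed(42)  # arbitrary seed, only for reproducibility
simulate_max_pvalue(t_obs = 35, n_questions = 100)
```

With enough simulations the estimate should land close to the exact answer, which makes it a handy way to catch an off-by-one in the `upper` argument.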
If there's too much uniformity then the least common answer should have high frequency (relative to what you'd expect from random chance). Let's call the number of observations of the least common answer $b$, for "bottom". Treating $b$ as our test statistic, the p-value is
$$P(A > b-1, B > b-1, C > b-1, D > b-1)$$
Unpacking this again: the condition is that all the counts are above $b-1$, which means the lowest count has to be $b$ or more. So it's the probability that the minimum is $b$ or more, which is what we want.
Using `pmultinom` again:
    pmultinom(lower=rep.int(b-1, 4), size=number.of.questions,
              probs=rep.int(1/4, 4), method="exact")
(It's a little confusing that I'm using $\le$ with the `upper` argument and $>$ with the `lower` argument. I was just trying to mimic the behavior of `pbinom` with `lower.tail=TRUE` and `lower.tail=FALSE`, respectively.)
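For a small number of questions, the minimum-based probability $P(A > b-1, B > b-1, C > b-1, D > b-1)$ can also be checked by brute force in base R, summing the multinomial probability of every outcome whose smallest count is at least $b$. The sketch below assumes four equally likely answers; `exact_min_pvalue` is a hypothetical helper written for illustration, not a `pmultinom` function, and it only scales to short tests because it enumerates on the order of $(n+1)^3$ outcomes.

```r
# Brute-force P(min count >= b) for four equally likely answers.
# Hypothetical helper for small tests only.
exact_min_pvalue <- function(b, n_questions) {
  # Enumerate counts for A, B, C; D is whatever is left over.
  counts <- expand.grid(A = 0:n_questions, B = 0:n_questions, C = 0:n_questions)
  counts$D <- n_questions - counts$A - counts$B - counts$C
  counts <- counts[counts$D >= 0, ]  # drop impossible outcomes

  # Log multinomial probability of each outcome with all probs equal to 1/4.
  log_pmf <- lfactorial(n_questions) -
    lfactorial(counts$A) - lfactorial(counts$B) -
    lfactorial(counts$C) - lfactorial(counts$D) +
    n_questions * log(1 / 4)

  # Keep outcomes where every count is strictly above b - 1.
  keep <- pmin(counts$A, counts$B, counts$C, counts$D) >= b
  sum(exp(log_pmf[keep]))
}

exact_min_pvalue(b = 3, n_questions = 20)
```

Passing `b = 0` should give 1, since every outcome qualifies, which is a quick way to confirm the enumeration covers the whole sample space.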
I think that a runs test is better for looking for uniformity, though. I tried out this minimum-based method on a test with 400 questions. To reject randomness at the 0.05 level, you need the minimum to be 98, 99, or 100 (out of an expected 100 per answer), which would be really extreme uniformity.
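The rejection threshold for a 400-question test can be estimated by simulating the null distribution of the minimum in base R. This is a sketch under the same assumptions (four equally likely answers); the variable names, the seed, and the number of simulations are arbitrary choices for illustration. Because it is a Monte Carlo estimate, the cutoff it reports can differ slightly from one obtained with an exact calculation.

```r
set.seed(1)  # arbitrary seed, only for reproducibility
n_questions <- 400
n_sims <- 20000

# Smallest answer count in each simulated test of uniformly random answers.
sims <- rmultinom(n_sims, size = n_questions, prob = rep(1 / 4, 4))
mins <- apply(sims, 2, min)

# Estimated P(min >= cutoff) for every feasible cutoff (the minimum is at most n/4).
candidates <- 0:(n_questions %/% 4)
tail_prob <- vapply(candidates, function(cutoff) mean(mins >= cutoff), numeric(1))

# Smallest cutoff whose estimated tail probability is at most 0.05.
critical <- candidates[which(tail_prob <= 0.05)[1]]
critical
```

Any observed minimum at or above `critical` would be flagged as suspiciously uniform at the 0.05 level.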