This unusual situation reverses the usual role of model and sample. Typically, we think of this situation as if the reply to any given question $q$ is a Bernoulli variable $X_q$ (like flipping an unbalanced coin) with parameter $p$, the model estimates $p$, and we have multiple independent realizations of $X_q$ to compare to $p$. Here, instead, the model does not estimate $p$, but produces exactly one of the two possible outcomes of $X_q$ (for each $q$).
To evaluate such a model, we need to quantify how good each prediction is compared to any other model. Because there are only two possible predictions per question, the problem becomes exceptionally simple: for each question there is a "good" prediction--presumably, the one that agrees with the majority of the responses--or a "bad" prediction. Making the bad prediction incurs a cost. In full generality, the cost can vary from question to question (because some questions might be considered more important or useful than others). However, let's suppose the goodness of a model depends separately on how good it is for each question. Whence, independently of the models, the investigator must specify the costs $(c_q)$, one for each question. To compare the models, sum the costs for the "bad" predictions in each model. The better model has the lower sum of costs.
If all questions are of equal probative value for model selection, then all costs will be the same (and can be taken to be $1$ without any loss of generality). In this case, the cost of a model is the number of "bad" predictions it makes. The better model makes fewer bad predictions.
This leads to another question: if these participants form a representative sample of a population, as they usually do, and the purpose is to make inferences about this population, then the cost of each model is random. How certain can we be of our comparison?
To answer this, note that the aggregate response $k_q$ to question $q$, which is answered (say) by $n_q$ of the participants, has a Binomial$(n_q, p_q)$ distribution for an unknown parameter $p_q$. If the prediction made by model $m$ is $0$ (using some 0/1 indicator coding for red/blue), then the cost is
$$\text{Cost}_q(m,0) = c_q \text{ if } 2 k_q \gt n_q, \quad 0 \text{ otherwise.}$$
If the prediction made by $m$ is $1$, the cost is
$$\text{Cost}_q(m,1) = c_q \text{ if } 2 k_q \lt n_q, \quad 0 \text{ otherwise.}$$
(For simplicity I'm overlooking the possibility of ties, which can occur whenever $n_q$ is even: costs have to be stipulated in these cases, too. But that doesn't change the nature of the analysis; it's just a complication in the details.)
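As a minimal sketch of these cost rules, the function below sums the costs of one model's "bad" predictions. The names `model_cost` and `tie_cost_fraction` are my own for illustration, and charging half of $c_q$ for a tie is only one possible stipulation, not something the analysis requires.

```python
def model_cost(predictions, k, n, costs, tie_cost_fraction=0.5):
    """Total cost of one model's 0/1 predictions.

    predictions[q] is the model's prediction for question q, k[q] is the
    number of responses coded 1 among the n[q] responses, and costs[q] is
    the investigator's c_q. tie_cost_fraction is a hypothetical rule for
    ties (2*k == n): that fraction of c_q is charged whatever the prediction.
    """
    total = 0.0
    for pred, k_q, n_q, c_q in zip(predictions, k, n, costs):
        if 2 * k_q == n_q:
            # Tie: no majority exists, so apply the stipulated cost.
            total += tie_cost_fraction * c_q
        else:
            majority = 1 if 2 * k_q > n_q else 0
            if pred != majority:
                # "Bad" prediction: it disagrees with the majority response.
                total += c_q
    return total


# Made-up data: four questions, ten respondents each, equal costs.
k = [7, 2, 5, 9]
n = [10, 10, 10, 10]
costs = [1.0, 1.0, 1.0, 1.0]
model_a = [1, 0, 1, 0]
model_b = [0, 1, 1, 1]

cost_a = model_cost(model_a, k, n, costs)
cost_b = model_cost(model_b, k, n, costs)
print(cost_a, cost_b)  # the model with the lower total is preferred
```

With equal costs and no ties, the total reduces to a count of bad predictions.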
Consider, now, the comparison of two models. Because their costs can differ only for questions where the model predictions differ, we can forget about all the other questions. Without any loss of generality, then, we may assume the two models differ on all the questions. By recoding the Bernoulli variables if necessary (and changing $p_q$ to $1-p_q$ in such cases), we can arrange it so that one model, call it $m_0$, always predicts $0$ and the other, $m_1$, always predicts $1$. The observed difference between the cost of $m_1$ and the cost of $m_0$ is
$$Y = \sum_{q} I(2 k_q \lt n_q) c_q$$
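where $I(\text{true})=1$ and $I(\text{false})=-1$: a question whose majority response is $0$ costs $m_1$ (but not $m_0$) the amount $c_q$ and so contributes $+c_q$ to the difference, while any other question contributes $-c_q$. The true difference equals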
$$\eta = \sum_{q} I(p_q \lt 1/2) c_q.$$
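The two sums can be sketched in a few lines, assuming the data have already been recoded so that $m_0$ always predicts $0$ and $m_1$ always predicts $1$. The function names are mine, and ties are treated as "false" purely to mirror the simplification made above.

```python
def signed_indicator(condition):
    """The +1/-1 indicator used in the text: I(true) = 1, I(false) = -1."""
    return 1 if condition else -1


def observed_cost_difference(k, n, costs):
    """Y: observed cost of m_1 minus observed cost of m_0.

    Ties (2*k == n) fall into the 'false' branch here, which is a
    simplification, not a recommended way to handle them.
    """
    return sum(signed_indicator(2 * k_q < n_q) * c_q
               for k_q, n_q, c_q in zip(k, n, costs))


def true_cost_difference(p, costs):
    """eta: the same sum with the unknown p_q in place of the data."""
    return sum(signed_indicator(p_q < 0.5) * c_q
               for p_q, c_q in zip(p, costs))


# Made-up example with three questions on which the models disagree.
k = [3, 8, 4]
n = [10, 10, 10]
costs = [1.0, 2.0, 1.5]
Y = observed_cost_difference(k, n, costs)
print(Y)  # a positive value means m_1 has the higher observed cost
```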
We seek either a confidence interval for $\eta$ based on the observations (the questionnaire results) or a test of the hypothesis $H_0: \eta \ge 0$.
At this point many solutions are possible, including a Bayesian one (upon adopting priors for the parameters $p_q$), an exact Frequentist one (using the Binomial distributions of the $k_q$), and an approximate Frequentist one. I'll sketch the last of these as an illustration.
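For comparison, here is one way the Bayesian route might look. This is only a sketch under assumptions of my own: independent uniform Beta$(1,1)$ priors on the $p_q$ (so each posterior is Beta$(k_q+1,\, n_q-k_q+1)$), independence across questions, and plain Monte Carlo sampling. The function name and its defaults are hypothetical.

```python
import random


def posterior_prob_eta_negative(k, n, costs, draws=10_000, seed=0):
    """Monte Carlo estimate of the posterior probability that eta < 0.

    Assumes independent uniform priors on each p_q, giving posterior
    Beta(k_q + 1, n_q - k_q + 1) distributions. These priors are an
    illustrative choice, not a recommendation.
    """
    rng = random.Random(seed)
    negative = 0
    for _ in range(draws):
        eta = 0.0
        for k_q, n_q, c_q in zip(k, n, costs):
            p_q = rng.betavariate(k_q + 1, n_q - k_q + 1)
            # Same +1/-1 indicator as in the definition of eta.
            eta += c_q if p_q < 0.5 else -c_q
        if eta < 0:
            negative += 1
    return negative / draws


# Made-up data in which nearly every respondent answers 0.
prob = posterior_prob_eta_negative([1, 1, 1], [20, 20, 20], [1.0, 1.0, 1.0])
print(prob)  # a small value is little evidence against H_0: eta >= 0
```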
Estimate $p_q$ as $\hat{p}_q = k_q/n_q$. Use this to compute (via the theory of the Binomial distribution) $\Pr(2 k_q \lt n_q) = \hat{\pi}_q$. Because $I(2 k_q \lt n_q) c_q$ equals $c_q$ with probability $\hat{\pi}_q$ and $-c_q$ otherwise, we have thereby estimated that this random variable has mean $(2\hat{\pi}_q - 1) c_q$ and variance $4\hat{\pi}_q(1 - \hat{\pi}_q) c_q^2$. Consequently (treating the questions as independent), we are supposing $Y$ has mean
$$\hat{\eta} = \sum_q (2\hat{\pi}_q - 1) c_q$$
and variance
$$\hat{\sigma}^2 = \sum_q 4\hat{\pi}_q(1 - \hat{\pi}_q) c_q^2.$$
Construct the approximate $1-\alpha$ confidence interval for the cost difference
$$[\hat{\eta} - Z_{\alpha/2}\hat{\sigma}, \hat{\eta} + Z_{\alpha/2}\hat{\sigma}]$$
by taking $Z_{\alpha/2}$ to be the $1-\alpha/2$ quantile of the standard normal distribution. When that interval does not cover $0$ we will conclude (with $1-\alpha$ confidence) that one of the models $m_0$ or $m_1$ is superior (the one with the lower estimated cost, obviously).
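A rough sketch of this normal-approximation interval follows. It relies on the moments of a variable equal to $+c_q$ with probability $\hat{\pi}_q$ and $-c_q$ otherwise, namely mean $(2\hat{\pi}_q-1)c_q$ and variance $4\hat{\pi}_q(1-\hat{\pi}_q)c_q^2$, and it treats the questions as independent. The function names are mine, and ties are again ignored.

```python
from math import comb, sqrt
from statistics import NormalDist


def prob_minority(n_q, p_hat):
    """Pr(2K < n_q) for K ~ Binomial(n_q, p_hat), by direct summation."""
    return sum(comb(n_q, j) * p_hat**j * (1 - p_hat)**(n_q - j)
               for j in range(n_q + 1) if 2 * j < n_q)


def approx_cost_difference_interval(k, n, costs, alpha=0.05):
    """Approximate 1 - alpha interval for the cost of m_1 minus that of m_0."""
    eta_hat = 0.0
    var_hat = 0.0
    for k_q, n_q, c_q in zip(k, n, costs):
        pi_hat = prob_minority(n_q, k_q / n_q)
        # Moments of a variable that is +c_q w.p. pi_hat and -c_q otherwise.
        eta_hat += (2 * pi_hat - 1) * c_q
        var_hat += 4 * pi_hat * (1 - pi_hat) * c_q**2
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(var_hat)
    return eta_hat - half_width, eta_hat + half_width


# Made-up data: eight questions on which few respondents answer 1.
lo, hi = approx_cost_difference_interval([2] * 8, [25] * 8, [1.0] * 8)
print(lo, hi)  # an interval entirely above 0 favours m_0
```

When most $\hat{\pi}_q$ are near $0$ or $1$ the estimated variance is tiny and the interval can be misleadingly narrow, which is one reason to prefer the exact or Bayesian approaches in small problems.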
These approximations have some obvious problems, but they can serve us well when (i) the $c_q$ don't vary much; (ii) none of the $p_q$ is very close to $0$ or $1$; and (iii) the model predictions differ on a "substantial" number of questions (perhaps 5 or more, but it depends on the sizes of the $p_q$ and $c_q$).
The three-model case involves three dependent comparisons. This will require some protection against the multiple comparison problem, perhaps with a Bonferroni adjustment of $\alpha$.
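As a small illustration of the Bonferroni idea, the sketch below splits an overall $\alpha$ equally among all pairwise comparisons and reports the normal critical value each interval would then use. The function name and the labels "A", "B", "C" are hypothetical.

```python
from itertools import combinations
from statistics import NormalDist


def bonferroni_plan(model_names, alpha=0.05):
    """Per-comparison alpha and normal critical value for all model pairs."""
    pairs = list(combinations(model_names, 2))
    alpha_each = alpha / len(pairs)  # Bonferroni: split alpha equally
    z = NormalDist().inv_cdf(1 - alpha_each / 2)
    return {pair: (alpha_each, z) for pair in pairs}


plan = bonferroni_plan(["A", "B", "C"])
for pair, (alpha_each, z) in plan.items():
    print(pair, alpha_each, z)
```

Each pairwise interval would then be built with the larger critical value in place of $Z_{\alpha/2}$, making it wider than an unadjusted one.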