
I have data from 26 multiple-choice exam papers. Each question has four alternatives, one of which is correct. I know, for each paper, how many answers had A as the correct option (and B, C and D). Is it possible to determine whether:

  1. There is a tendency for one of the options to be correct more often than the others, and if so, given the sample size, with what certainty we can conclude that this tendency actually exists.
  2. The data is "too uniform", i.e., the options are not randomly sorted. If the order of the options (and thus which letter is correct) had been chosen by a human instead of randomly generated, they would probably try too hard to make them look "random". (For example, when asked what a series of 100 random coin flips looks like, people tend to respond with something like 1101010101100..., whereas truly random flips would likely produce more than two heads or tails in a row.)
Marcel
  • It sounds like you have no information about what the correct options are, so how could you hope to succeed in answering either of these questions? – whuber Jan 13 '17 at 20:05
  • I know, for each paper, how many times A was correct, how many times B was correct, and so on. – Marcel Jan 13 '17 at 20:08
  • Correctness reflects the examinees. What could it possibly have to do with randomness of correct options or, indeed, any kind of randomness, given that the examinees likely were trying their best *not* to give random answers? – whuber Jan 13 '17 at 20:09
  • @whuber Sorry, I may have been a bit unclear. I am in fact looking at whether the *examiners* sorted the answers randomly. For an extreme counterexample of the opposite, imagine that the correct answer to every question had been option A. – Marcel Jan 13 '17 at 20:12

2 Answers

  1. Under the assumption that the answer letters should be equiprobable, you could perform a chi-squared goodness-of-fit test. I assume each paper has the same letter as the correct answer for each question (e.g., A is the right answer to question 1 on all students' exams). As a result, your sample size is really only the number of questions on the test, $n_q$, not $26$ or $26\times n_q$. That will limit your statistical power (your ability to detect that, e.g., A occurs too frequently), but the test can still be conducted. Imagine your data look like this:

    A   B   C   D
    9   6   6   3
    
    chisq.test(matrix(c(9,6,6,3), nrow=1))
        Chi-squared test for given probabilities
    
    data:  matrix(c(9, 6, 6, 3), nrow = 1)
    X-squared = 3, df = 3, p-value = 0.3916
    

    If you have differing numbers of times A, etc., was correct over a set of 26 tests, the situation becomes more complicated. For instance, you might want to know if A is a correct option too often over the set of tests, in each individual test, in a particular test, or if the selection of A as the right answer differs by test. These are different statistical questions; you just need to decide which one you want to ask. For each of those cases, a chi-squared test can be used, but the way it is set up will differ depending on the hypothesis you are trying to test; a sketch of the last case follows.
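
    For example, here is a minimal sketch of that last case, under the assumption that the per-test counts are stacked into a $26\times 4$ table (illustrated with made-up counts for just three tests): chisq.test applied to such a table performs a test of homogeneity, asking whether the distribution of correct letters differs across tests.

    counts = matrix(c(9, 6, 6, 3,
                      5, 8, 6, 5,
                      7, 5, 7, 5), nrow=3, byrow=TRUE,
                    dimnames=list(paste0("test", 1:3), c("A","B","C","D")))
    chisq.test(counts)  # homogeneity: does the letter distribution vary by test?
    # with small expected counts R will warn; chisq.test(counts,
    # simulate.p.value=TRUE) is a safer alternative in that case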

  2. You are referring to a kind of runs test. A run in this context is a series of the same value (heads, heads, heads, tails, heads is three runs). What people don't realize is that truly random data ought to have runs, so when people try to generate random data out of their heads, they typically over-alternate. A runs test detects exactly that.

    The typical runs test is for binary data, however. You want a multinomial version, because you have four options (A-D), not just two. I don't know of one, but this is easy to simulate. Here I demonstrate a simple simulation, coded in R:

    set.seed(4964)  # this makes the example exactly reproducible
    # generate a fake sequence of 24 correct answer letters (1=A, ..., 4=D)
    A = sample.int(4, size=24, replace=TRUE, prob=c(.375, .25, .25, .125))
    LETTERS[A]
    #  [1] "B" "D" "A" "A" "B" "D" "C" "A" "B" "A" "A" "B" "A" "A" "C" "B" "B" "A" "B"
    # [20] "C" "C" "D" "A" "A"
    sum(A[1:23]==A[2:24])  # number of adjacent repeats in the observed sequence
    # [1] 6
    
    # simulate the sampling distribution of the number of adjacent repeats
    MC.dist = vector(length=10000)
    for(i in 1:10000){
      si = sample.int(4, size=24, replace=TRUE, prob=c(.375, .25, .25, .125))
      MC.dist[i] = sum(si[1:23]==si[2:24])
    }
    mean(MC.dist<=6)  # proportion of simulations with 6 or fewer repeats
    # [1] 0.5239
    

    Here we calculated the number of times the same answer letter occurred twice in a row (which should happen now and then in a truly random sequence). We then simulated the sampling distribution of this count with a simple Monte Carlo simulation. A one-tailed test of whether the sequence over-alternates (i.e., has too few adjacent repeats, and hence too many runs) is simply the proportion of simulations with the same number of repeats as your test or fewer.

    The runs test here only checks whether you get the same answer letter twice in a row too infrequently. This strikes me as the most relevant thing to assess. However, in theory a very large number of non-random patterns could exist (e.g., see if you can spot the pattern in this sequence: A, B, C, B, C, D, C, D, A, D, A, B, etc.). One way to search over a much larger set of possible patterns would be to fit a hidden Markov model. I have less expertise there, but that won't really be feasible with a sequence as short as the typical number of multiple-choice questions on a test.

gung - Reinstate Monica
  • But could a runs test be applied when one only has the total numbers for each test, not the correct answers for each question? – Marcel Jan 13 '17 at 20:04
  • It could help to explain how a runs test could be applied when there are four options rather than just two. You might also want to indicate there are many other tests of randomness of a sequence besides a runs test (which detects only one form of non-randomness). – whuber Jan 13 '17 at 20:04
  • Also, I seem to have been a bit unclear - the 26 papers are not 26 copies of the same test, they are tests from 26 different sessions. – Marcel Jan 13 '17 at 20:05
  • @Marcel, if you don't know what the correct answer letter is for each question, there is no way to determine if they have been ordered randomly, or in an over- or under-alternating fashion. – gung - Reinstate Monica Jan 13 '17 at 20:08
  • @gung See comment above - that is in fact all I know (the number of questions where a given answer letter is correct) – Marcel Jan 13 '17 at 20:10
  • @Marcel, #1 will still be doable, but #2 is not possible given the information you have. – gung - Reinstate Monica Jan 13 '17 at 20:17
  • Also see [Runs test for randomness for k elements](http://stats.stackexchange.com/questions/66228/runs-test-for-randomness-for-k-elements) – Glen_b Jan 14 '17 at 00:43

Here's an idea that's less statistically powerful, but more intuitive, than a chi-squared test.

If there's bias, then the most common answer should have a high frequency. So, use the number of times the most common answer was observed as a test statistic; call that $t$ for "top". The $p$-value is the probability of observing a count of $t$ or higher. This probability is:

$$1 - P(A \le t-1, B \le t-1, C \le t-1, D \le t-1)$$

This takes a little unpacking. We see the probability that all four counts are below $t$, but the leading $1-$ means we're calculating the probability that this condition is violated. What would it take to violate the condition? Some category having $t$ or more counts. So it's precisely the probability that the maximum is $t$ or more.

This multinomial probability can be calculated using my R package pmultinom as follows:

1 - pmultinom(upper=rep.int(t-1, 4), size=number.of.questions,
              probs=rep.int(1/4, 4), method="exact")
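
As a concrete (hypothetical) illustration, take the toy counts from the other answer, 9, 6, 6, 3 over 24 questions, so $t = 9$:

library(pmultinom)  # available on CRAN

t = 9                      # count for the most common answer letter
number.of.questions = 24   # total questions on the paper
1 - pmultinom(upper=rep.int(t-1, 4), size=number.of.questions,
              probs=rep.int(1/4, 4), method="exact")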

If there's too much uniformity, then the least common answer should have a high frequency (relative to what you'd expect from random chance). Let's call the number of observations of the least common answer $b$, for "bottom". Treating $b$ as our test statistic, the $p$-value is

$$P(A > b-1, B > b-1, C > b-1, D > b-1)$$

Unpacking this again: the condition is that all the counts are above $b-1$, which means the lowest count has to be $b$ or more. So it's the probability that the minimum is $b$ or more, which is what we want.

Using pmultinom again,

pmultinom(lower=rep.int(b-1, 4), size=number.of.questions,
          probs=rep.int(1/4, 4), method="exact")

(It may look confusing that I use $\le$ with the "upper" argument but $>$ with the "lower" argument; I was just trying to mimic the behavior of pbinom with lower.tail=TRUE and FALSE, respectively.)

I think that a runs test is better for looking for uniformity, though. I tried out this minimum-based method with a test of 400 questions. To reject randomness at the .05 level, you need the minimum to be 98, 99, or 100, which would be really extreme uniformity.
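
If you'd like to check that threshold yourself, here is a small sketch under the same assumptions (four equiprobable options, exact method); the candidate range 90 to 100 and the helper name p.for.min are just illustrative choices:

library(pmultinom)

# p-value of the minimum-based test for bottom count b on a 400-question test
p.for.min = function(b, n=400) {
  pmultinom(lower=rep.int(b-1, 4), size=n,
            probs=rep.int(1/4, 4), method="exact")
}
sapply(90:100, p.for.min)  # scan candidate minima for the .05 cutoff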

user54038