7

I give my students an exam that has 8 questions on it. Each question is about a particular topic. The exam is made up on the fly by randomly selecting 1 question for each topic from a pool of questions for that particular topic. Each topic pool has 20 questions in it. I am worried that there might be a few outlier questions (i.e., they are much easier or harder than the other questions) in each pool.

I want to find out if the questions in each pool are essentially equivalent or if there is a particular question in the pool which is significantly harder or easier than the others in the pool. I have the scores for about 300 students.

Can anyone suggest a method that will allow me to rank, for each pool, each question by how hard it is, using how the students did on the other questions in their instance of the exam?

As requested by a comment, here is my current naive approach:

Let's say an exam is made up of $n$ questions. Each question is drawn from a specific pool. So, an exam is a set of elements of the form $q_{pi}$, where $p$ is the pool the question was drawn from and $i$ is the instance of the question from that pool. For notational ease, let's assume each pool has $m$ instances. So each exam, $e$, is $\{q_{pi} \mid 0 \le p < n, 0 \le i < m\}$ and there are $s$ students, so we have a universe of $s$ exams, $\{e_1, ..., e_s\}$. I want to make sure that, for a fixed $p$, the hardness of all $q_{pi}$ with $0 \le i < m$ is roughly the same.

To determine the relative hardness of $q_{pj}$ I would look at all exams that include $q_{pj}$ and compare each student's score on $q_{pj}$ with their score on the rest of the exam they took, i.e., $r = \sum_{x \ne p} q_{x*}$, where $*$ represents the instance of pool $x$ that that particular student took. Then, I sum the differences $q_{pj} - r$ over all students who saw $q_{pj}$ to get $d_{pj}$. Finally, I compare all the $d_{pj}$ for a particular pool. If a particular $|d_{pj}|$ is significantly larger (more than 1 std dev away?) than the others, I will modify its weight.
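In code, that computation might look something like the following rough sketch (Python). The long-format data layout, the random placeholder scores, and the use of means rather than raw sums (so that items drawn different numbers of times stay on a comparable scale) are my assumptions, not part of the description above:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated stand-in data: 300 students, 8 pools, 20 items per pool, 0/1 scores.
n_students, n_pools, n_items = 300, 8, 20
rows = []
for s in range(n_students):
    for p in range(n_pools):
        i = rng.integers(n_items)                 # instance drawn for this student
        rows.append((s, p, i, rng.integers(2)))   # placeholder 0/1 score
scores = pd.DataFrame(rows, columns=["student", "pool", "item", "score"])

# r: each student's average score on the *other* pools of their exam
# (a mean rather than a raw sum, so it is on the same 0-1 scale as q_{pj})
total = scores.groupby("student")["score"].transform("sum")
count = scores.groupby("student")["score"].transform("size")
scores["rest"] = (total - scores["score"]) / (count - 1)

# d_{pj}: mean of (score on q_{pj} - rest-of-exam score) over the students who saw it
d = (scores["score"] - scores["rest"]).groupby([scores["pool"], scores["item"]]).mean()

# within each pool, flag items whose d_{pj} is more than 1 SD from the pool mean
for p, d_pool in d.groupby(level="pool"):
    z = (d_pool - d_pool.mean()) / d_pool.std()
    flagged = z[z.abs() > 1].index.get_level_values("item").tolist()
    print(f"pool {p}: flagged items {flagged}")
```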

Suggestions? Comments?

chl
seth
    Your title asks for "the best method". In order to clearly identify something as best we'd need some *criterion* on which it's possible to measure "better". Or are you really just seeking a suggestion of how to deal with the problem you describe in the body? (a substantially simpler question) – Glen_b Mar 05 '15 at 22:23
  • Sorry, I am seeking a suggestion. It doesn't have to be "best" – seth Mar 05 '15 at 22:28
  • Help us help you: what have you tried, or thought about trying? Where are you stuck? – rolando2 Mar 05 '15 at 23:48
  • You might also want to compute Cronbach's Alpha in each grouping to ensure high internal consistency: http://en.wikipedia.org/wiki/Cronbach%27s_alpha – StatsStudent Mar 06 '15 at 07:58
  • I'd make the title much more specific. Already the discussion has centred on the specifics of your question, which is fine, but a thread with this title will attract and disappoint many people with a much broader interest. In fact I was drawn in that way.... – Nick Cox Mar 06 '15 at 09:25
  • @NickCox: I would be happy to make it more specific, but I am not sure how to do that. A suggestion would be most welcome. – seth Mar 06 '15 at 17:22
  • 1
    add ... in exam question data? – Nick Cox Mar 06 '15 at 17:23
  • As suggested by @rolando2 I have added my current approach. – seth Mar 06 '15 at 17:58

2 Answers

7

Have you considered Item Response Theory (IRT)-based methods? IRT is designed for exactly this kind of purpose.

A simple example is the Rasch model, which lets you estimate both the student abilities and the question difficulties in a single generalized linear model (or generalized linear mixed model). With a binary answer format the Rasch model can be written as

$$P(X_{ij} = 1) = \frac{\exp(\theta_i - \beta_j)}{1+\exp(\theta_i - \beta_j)} $$

where a correct response ($X_{ij} = 1$) by the $i$-th student to the $j$-th test item is modeled as a function of the student's ability $\theta_i$ and the item's difficulty $\beta_j$. There are also models considering more than one item parameter (e.g. item discrimination, guessing), models for items with polytomous answer formats, and models that include additional explanatory or grouping variables. In measuring students' abilities the model "weights" test items by their difficulty, so you don't have to worry about the fact that some items are easier and some are harder. If you are interested in item difficulty you can check the $\beta_j$ values. There are also additional tools for measuring item- and person-fit that may be helpful for identifying item and person outliers.

This model is suited for "static" tests with a finite number of items, but a missing-data design is also possible, where you treat the non-answered questions as missing data and impute their answers using your model. Usually an EM algorithm is used for estimation, but for more complicated designs a Bayesian approach is more suitable.
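For illustration, here is a minimal self-contained sketch of fitting the Rasch model above by joint (unconditional) maximum likelihood in Python with scipy, on simulated data with the same incomplete design as in the question (each student answers one item per pool). The simulation, the small ridge penalty used to pin down the location indeterminacy, and the joint-ML estimation itself are assumptions of the sketch; dedicated IRT software typically uses conditional or marginal ML, which is preferable in practice:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # the logistic function

rng = np.random.default_rng(0)
n_students, n_pools, n_items = 300, 8, 20
n_total_items = n_pools * n_items

# Simulate the incomplete design: each student answers one randomly drawn item per pool.
true_theta = rng.normal(0.0, 1.0, n_students)      # student abilities
true_beta = rng.normal(0.0, 1.0, n_total_items)    # item difficulties
stu, itm, y = [], [], []
for s in range(n_students):
    for p in range(n_pools):
        j = p * n_items + rng.integers(n_items)
        stu.append(s)
        itm.append(j)
        y.append(rng.random() < expit(true_theta[s] - true_beta[j]))
stu, itm, y = np.array(stu), np.array(itm), np.array(y, dtype=float)

def neg_log_lik(params):
    theta, beta = params[:n_students], params[n_students:]
    eta = theta[stu] - beta[itm]
    # Bernoulli log-likelihood plus a tiny ridge penalty for identifiability
    return -(y * eta - np.logaddexp(0.0, eta)).sum() + 1e-3 * (params ** 2).sum()

def grad(params):
    theta, beta = params[:n_students], params[n_students:]
    resid = y - expit(theta[stu] - beta[itm])       # y - P(X_ij = 1)
    g_theta = -np.bincount(stu, weights=resid, minlength=n_students)
    g_beta = np.bincount(itm, weights=resid, minlength=n_total_items)
    return np.concatenate([g_theta, g_beta]) + 2e-3 * params

x0 = np.zeros(n_students + n_total_items)
res = minimize(neg_log_lik, x0, jac=grad, method="L-BFGS-B")
beta_hat = res.x[n_students:].reshape(n_pools, n_items)

# Within each pool, rank items from easiest (lowest beta) to hardest (highest beta).
for p in range(n_pools):
    print(f"pool {p}: easiest -> hardest:", np.argsort(beta_hat[p]).tolist())
```

The fitted $\beta_j$ within a pool can then be inspected for outliers, e.g. items more than one standard deviation away from the pool's mean difficulty.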

Those methods are really design-specific, so it is hard to give a single answer. There are multiple books available, e.g. the nice introduction *Item Response Theory for Psychologists* by Susan E. Embretson and Steven P. Reise.

Tim
  • 1
    Yes IRT is perfect for this. Alternatively, you could calculate difficulty by just examining the proportion of people who get items wrong, but IRT would be much more powerful by actually taking into account how smart test takers are. – Behacad Mar 06 '15 at 19:02
  • I wonder why my answer was downvoted? – Tim Mar 20 '15 at 12:39
0

You might try fitting a logistic model. The response is whether the answer was correct or not. You could add a (random?) effect for the student and a fixed effect for each question, and then develop a ranking based on the coefficients for each question.
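Here is a rough sketch of this idea in Python. The simulated placeholder data and the ridge-regularized logistic regression with dummy-coded (fixed) student effects are my assumptions, standing in for a true random student effect, which would need a mixed-model fit:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n_students, n_pools, n_items = 300, 8, 20

# Simulated placeholder data: one randomly drawn question per pool per student.
rows = []
for s in range(n_students):
    for p in range(n_pools):
        i = rng.integers(n_items)
        rows.append((f"s{s}", f"p{p}_q{i}", int(rng.random() < 0.7)))
df = pd.DataFrame(rows, columns=["student", "question", "correct"])

# Dummy-code student and question effects (drop_first avoids perfect collinearity;
# the dropped baseline levels have an implicit coefficient of 0).
X = pd.get_dummies(df[["student", "question"]], drop_first=True)
model = LogisticRegression(max_iter=1000).fit(X, df["correct"])
coefs = pd.Series(model.coef_[0], index=X.columns)

# Higher coefficient = higher log-odds of a correct answer = easier question.
question_effects = coefs[coefs.index.str.startswith("question_")]

# Rank the questions of, e.g., pool 0 from hardest to easiest.
pool0 = question_effects[question_effects.index.str.startswith("question_p0_")]
print(pool0.sort_values())
```

With only 8 answers per student, an unregularized fixed-effects fit can run into separation (students who get everything right or wrong), which is one argument for the random-effect or penalized version.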

et_al
  • Good point, but I would still recommend an IRT-based GLMM logistic model (e.g. https://ppw.kuleuven.be/okp/_pdf/DeBoeck2011TEOIR.pdf) rather than an ad-hoc one. – Tim Mar 06 '15 at 19:08