
Around 600 students have a score on an extensive piece of assessment, which can be assumed to have good reliability/validity. The assessment is scored out of 100, and it's a multiple-choice test marked by computer.

Those 600 students also have scores on a second, minor, piece of assessment. In this second piece of assessment they are separated into 11 cohorts with 11 different graders, and there is an undesirably large degree of variation between graders in terms of their 'generosity' in marking, or lack thereof. This second assessment is also scored out of 100.

Students were not assigned to cohorts randomly, and there are good reasons to expect differences in skill levels between cohorts.

I'm presented with the task of ensuring that differences between cohort markers on the second assignment don't materially advantage/disadvantage individual students.

My idea is to get the cohort scores on the second assessment to cohere with cohort scores on the first, while maintaining individual differences within the cohorts. We should assume that I have good reasons to believe that performance on the two tasks will be highly correlated, but that the markers differ considerably in their generosity.

Is this the best approach? If not, what is?

It'd be greatly appreciated if the answerer could give some practical tips about how to implement a good solution, say in R or SPSS or Excel.

  • I want to suggest hierarchical Bayesian modelling. Are each cohort the same size? Can you give a little more details on the process of cohort formation? – Arthur B. Nov 13 '14 at 02:33
  • Each cohort is not the same size. The cohorts range in size from 40 students to 85 students. To an extent older students were allocated to particular cohorts, and on average they are expected to score higher than younger students. Also, to a limited extent students could self-select to be in particular cohorts, and it's anticipated that some cohorts would have attracted students on average better than other cohorts. – user1205901 - Reinstate Monica Nov 13 '14 at 02:56
  • Not easy. A simple approach would be to normalize each cohort based on the average score of its students on the second test. Is that what you had in mind? – Arthur B. Nov 13 '14 at 03:03
  • Would you be able to explain a little further what you mean by "normalize each cohort based on the average score of its students on the second test"? – user1205901 - Reinstate Monica Nov 13 '14 at 03:29
  • Great question! Are the final scores for the multiple choice and the essay portions supposed to be comparable (i.e. the same numerical ranges)? – gung - Reinstate Monica Nov 13 '14 at 03:33
  • As I was writing this question I thought it might be up your alley! The final scores are broadly comparable, but a bit different. The mean on the multiple choice section is ~70 with a SD around 15. The mean on the other section is ~85 with a SD around 6. – user1205901 - Reinstate Monica Nov 13 '14 at 03:37
  • But they both range from, say, 0-100? – gung - Reinstate Monica Nov 13 '14 at 03:43
  • By that I mean: for each student, adjust the score of the 1st test by subtracting his cohort's mean score on the 1st test and adding the mean of that cohort's score on the 2nd test. Then, do a weighted average of the adjusted first score, and the 2nd score. So say a student scored 66 on the 1st test, and his cohort averaged 50 on the 1st test and 57 on the 2nd. Adjust his score on the first test to 66+(57-50) = 73 – Arthur B. Nov 13 '14 at 03:49
  • I would be suspicious of any effort to solve this problem based only on the data you have described, because it would have to rest on the strong (and untestable) assumption that there is no interaction between cohort and performance on the two separate test instruments. If you possibly can, consider the option of conducting a separate small experiment to calibrate the graders. – whuber Nov 13 '14 at 04:14
  • @gung Yes, on both, the scores could theoretically have ranged from 0 to 100. However, in practice nobody scored below ~30/100 on the MCQ test, and nobody scored below 50/100 on the second assessment. – user1205901 - Reinstate Monica Nov 13 '14 at 04:33
  • @whuber Unfortunately it is not possible to conduct the separate small experiment to calibrate the graders. The divergence in grading standards between cohorts seems quite high, and thus I'm reluctant to do nothing (in fact I have been instructed to do something!). I'm wondering if there is a way of proceeding that is better than doing nothing. – user1205901 - Reinstate Monica Nov 13 '14 at 04:39
  • To see better where the problem lies, suppose (hypothetically) that (1) the two forms of assessment are multiple choice and essay and (2) your older students tend to do relatively better on essay questions. When you use your data to make the scores "cohere" you will be confounding the grader effects with the age effects and, by making adjustments, thereby *systematically* disadvantage the older students compared to the younger. No matter how sophisticated an algorithm you choose, it can only paper over this basic problem. You need *some* additional data to resolve this confounding. – whuber Nov 13 '14 at 04:44
  • I understand. Is there some additional data that might resolve the problem that doesn't require obtaining extra data from the graders? Is there some principled way I can decide whether it's better to accept the possibility of systematic disadvantage from grader effects or systematic disadvantage from interaction between cohort and performance? Intuition is telling me the former is a bigger problem in my particular case. – user1205901 - Reinstate Monica Nov 13 '14 at 09:41
  • You write "In this second piece of assessment they are separated into 11 cohorts with 11 different graders". Does it mean that (a) each cohort consists of students which received a specific grade, or (b) students were divided by 11 cohorts, and a student belonging to each cohort might receive any of the 11 grades? – user31264 Nov 15 '14 at 14:06
  • Sorry for lack of clarity. Each cohort just refers to a different group of students with a different class time. The students in a cohort don't all receive the same grade, but they all have the same grader. – user1205901 - Reinstate Monica Nov 16 '14 at 00:20
  • @ArthurB. would you be willing to write out your idea as an answer to the question? – user1205901 - Reinstate Monica Nov 17 '14 at 00:41
  • The Bayesian modeling idea or the simple one? – Arthur B. Nov 17 '14 at 00:43
  • Whichever idea you think would be best and/or are willing to write out. Or both! – user1205901 - Reinstate Monica Nov 17 '14 at 00:47
  • One thing to consider is how comfortable you'd be explaining the adjustment procedure to students or other stakeholders: many might feel that, given a potential issue with the marking, putting *some* effort into a proper calibration of markers would not be too much to expect if the exam's an important one. – Scortchi - Reinstate Monica Nov 17 '14 at 13:40
  • My plan is to present a sensible adjustment, with some comments on the appropriateness of that adjustment; it won't necessarily be accepted. Also, the problematic assessment wasn't the exam - the exam was the first piece of assessment. Without going into what it was, the nature of the second assessment was such that it's inherently difficult to ensure good calibration of markers, and (pretty much) impossible to address this issue post-hoc. The second assessment was a novel assessment only worth a trivial proportion of the final total grade. – user1205901 - Reinstate Monica Nov 17 '14 at 22:12
  • There are so many comments that I might have missed this, but the first thing I would do is randomly give the second round results to other TAs to be regraded. If you see significant intra-TA differences, then worry. – JenSCDC Nov 17 '14 at 22:49
  • This would be a good approach under normal circumstances, but the second assessment wasn't of a format that can be handed around to TAs (we can imagine it was an oral recitation for which there were no recordings). – user1205901 - Reinstate Monica Nov 17 '14 at 22:55
  • You keep insisting that we can't check some examples or regrade because the work is transient. Maybe you just need to be creative with this. How about this: Talk to each of the graders and get them to tell you explicit examples of what they saw and how they graded. Any additional evidence you can get will help. – awcc Nov 19 '14 at 16:48
  • (continuing) If this is completely impossible, but you must take action, it's a tough question and will rely on your notions of fairness as well as your priors. If we really must do this, I'd suggest coming up with some "fairness" cost function (this should include complexity of the strategy and what was on the syllabus as well as some function of the deviation of grades from their "true" values), coming up with some models for your uncertainty in how well the transient part tracks the multiple choice part, and then testing different strategies with a view toward minimizing the fairness cost. – awcc Nov 19 '14 at 16:53
  • "Fairness" is interesting - awarding marks isn't solely about estimating or predicting something. Is it more fair to do nothing & risk allowing students in some cohorts to be penalized by their bad luck in getting a stricter examiner, or to take a positive action to penalize students in others because *in your opinion* the divergence between results from the multiple-choice exam & the second assessment was too large in those cohorts to be real? Does it matter that you might be penalizing mainly older/younger students? Or students of a particular race or sex? - what would it take to ... – Scortchi - Reinstate Monica Nov 19 '14 at 18:11
  • ... provoke re-evaluation of the idea that any calibration of markers is impractical? Is it fair to give this second assessment any weight at all? - there don't seem to be many signs that the marking procedure has been thought through. – Scortchi - Reinstate Monica Nov 19 '14 at 18:12
  • Giving the second assessment less weight makes a lot of sense for statistical reasons due to its decreased predictive power. On the other hand, giving it too little weight may disappoint those who worked hard on it, and may violate trust in the syllabus weightings. This is all quite subjective. – awcc Nov 19 '14 at 18:19
  • @awcc: Very true. (And those weren't rhetorical questions; I don't think they have pat answers.) The appearance of fairness is something else again: post-hoc adjustments on subjective grounds can excite suspicion. – Scortchi - Reinstate Monica Nov 19 '14 at 18:35
  • This reminds me of a quote attributed to Tukey: "The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted...." – Emil Friedman Nov 19 '14 at 18:30
  • Leaving aside all the important issues that have been raised (which mostly won't be addressed by this suggestion), I will say I'm a bit surprised nobody seems to have mentioned Bradley-Terry (and related) models as a possible way to tease out marker effects. – Glen_b Nov 21 '14 at 15:22

4 Answers


Knowing how graders differ is good, but still doesn't tell you what to compensate the grades to. For simplicity imagine just two graders. Even if we conclude grader 1 is consistently 5 marks more generous than grader 2, that doesn't tell you what to do with two students who were each graded 70, one by grader 1 and one by grader 2. Do we say that grader 2 was a harsh marker, and uprate that 70 to 75, while keeping the 70 marked by grader 1 unchanged? Or do we assume grader 1 was unduly lenient, knock his student down to 65 marks, and keep grader 2's 70 unchanged? Do we compromise half-way between - extending to your case, based on an average of the 11 graders? It's the absolute grades that matter, so knowing relative generosity is not enough.

Your conclusion may depend on how "objective" you feel the final absolute mark should be. One mental model would be to propose that each student has a "correct" grade - the one that would be awarded by the Lead Assessor if they had time to mark each paper individually - to which the observed grades are approximations. In this model, observed grades need to be compensated for their grader, in order to bring them as close as possible to their unobserved "true" grade. Another model might be that all grading is subjective, and we seek to transform each observed grade towards the mark we predict it would have been awarded if all graders had considered the same paper and reached some sort of compromise or average grade for it. I find the second model less convincing as a solution, even if its admission of subjectivity is more realistic. In an educational setting there is usually someone who bears ultimate responsibility for assessment, to ensure that students receive "the grade they deserve", but in this model that lead role has essentially been devolved to the very graders whom we already know disagree markedly. From here on I assume there is one "correct" grade that we aim to estimate, but this is a contestable proposition and may not fit your circumstances.

Suppose students A, B, C and D, all in the same cohort, "should" be graded as 75, 80, 85 and 90 respectively, but their generous grader consistently marks 5 marks too high. We observe 80, 85, 90 and 95 and should subtract 5, but finding the figure to subtract is problematic. It can't be done by comparing results between cohorts, since we expect cohorts to vary in average ability. One possibility is to use the multiple-choice test results to predict the correct scores on the second assessment, then use those predictions to assess how far each grader deviates from the correct grades. But making this prediction is non-trivial - if you expect a different mean and standard deviation between the two assessments, you can't just assume that the second assessment grades should match the first.
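
As a purely illustrative R sketch of that prediction step (the OP asked about R/SPSS/Excel): assume a data frame d with columns mcq, assess2 and cohort - all names hypothetical - and take the approximate mean and SD for the second assessment quoted in the comments as the target scale. Rescaling the multiple-choice scores and comparing each cohort's observed and predicted means gives a per-cohort gap, but that gap still confounds grader generosity with cohort-level task aptitude:

```r
## Illustrative only: rescale MCQ scores to the intended scale of assessment 2 and
## use the result as a rough "predicted" assessment-2 score for each student.
## Assumes a data frame d with (hypothetical) columns mcq, assess2, cohort.
target_mean <- 85   # approximate assessment-2 mean quoted in the comments
target_sd   <- 6    # approximate assessment-2 SD quoted in the comments
d$pred2 <- (d$mcq - mean(d$mcq)) / sd(d$mcq) * target_sd + target_mean

## Per-cohort gap between observed and predicted means; this gap mixes grader
## generosity with any cohort-specific aptitude for the second task.
aggregate(cbind(observed = assess2, predicted = pred2) ~ cohort, data = d, FUN = mean)
```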

Also, students differ in their relative aptitude for multiple-choice and written assessments. You could treat that as some kind of random effect, forming a component of the student's "observed" and "true" grades, but not captured by their "predicted" grade. If cohorts differ systematically and students in a cohort tend to be similar, then we shouldn't expect this effect to average out to zero within each cohort. If a cohort's observed grades average +5 versus their predicted ones, it is impossible to determine whether this is due to a generous grader, a cohort particularly better suited to written assessment than multiple-choice, or some combination of the two. In an extreme case, the cohort may even have lower aptitude at the second assessment but have had this more than compensated for by a very generous grader - or vice versa. You can't break this apart. It's confounded.

I also doubt the adequacy of such a simple additive model for your data. Graders may differ from the Lead Assessor not just by shift in location, but also spread - though since cohorts likely vary in homogeneity, you can't just check the spread of observed grades in each cohort to detect this. Moreover, the bulk of the distribution has high scores, fairly near the theoretical maximum of 100. I'd anticipate this introducing non-linearity due to compression near the maximum - a very generous grader may give A, B, C and D marks like 85, 90, 94, 97. This is harder to reverse than just subtracting a constant. Worse, you might see "clipping" - an extremely generous grader may grade them as 90, 95, 100, 100. This is impossible to reverse, and information about the relative performance of C and D is irrecoverably lost.

Your graders behave very differently. Are you sure they differ only in their overall generosity, rather than in their generosity on various components of the assessment? This might be worth checking, as it could introduce various complications - e.g. the observed grade for B may be worse than that of A, despite B being 5 points "better", even if the grader's allocated marks for each component are a monotonically increasing function of the Lead Assessor's! Suppose the assessment is split between Q1 (A should score 30/50, B 45/50) and Q2 (A should score 45/50, B 35/50). If the grader is very lenient on Q1 (observed grades: A 40/50, B 50/50) but harsh on Q2 (observed: A 42/50, B 30/50), then we observe totals of 82 for A and 80 for B. If you do have to consider component scores, note that clipping may be an issue - I suspect few papers get graded a perfect 100, but rather more papers will be awarded full marks on at least one component.

Arguably this is an extended comment rather than an answer, in the sense that it doesn't propose a particular solution within the original bounds of your problem. But if your graders are already handling about 55 papers each, is it so bad for them to have to look at five or ten more for calibration purposes? You already have a good idea of students' abilities, so you could pick a sample of papers from right across the range of grades. You could then assess whether you need to compensate for grader generosity across the whole test or in each component, and whether to do so just by adding/subtracting a constant or by something more sophisticated like interpolation (e.g. if you're worried about non-linearity near 100). But a word of warning on interpolation: suppose the Lead Assessor marks five sample papers as 70, 75, 80, 85 and 90, while a grader marks them as 80, 88, 84, 93 and 96, so there is some disagreement about order. You probably want to map observed grades from 96 to 100 onto the interval 90 to 100, and observed grades from 93 to 96 onto the interval 85 to 90. But some thought is required for marks below that. Perhaps observed grades from 84 to 93 should be mapped to the interval 75 to 85? An alternative would be a (possibly polynomial) regression to obtain a formula for "predicted true grade" from "observed grade".
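
By way of illustration only: if such a calibration sample did exist, the regression idea could be sketched in R along these lines, using the five hypothetical marks above. Whether a linear or polynomial fit is sensible, and whether any adjustment is defensible at all given the confounding discussed elsewhere, still requires judgement.

```r
## Hypothetical calibration sample: five papers marked by both the Lead Assessor
## and one grader (note the disagreement over the ordering of the middle papers).
lead   <- c(70, 75, 80, 85, 90)   # Lead Assessor's marks
grader <- c(80, 88, 84, 93, 96)   # the grader's marks on the same papers

## Regress "true" grade on observed grade; a quadratic term is one crude way
## to allow for compression of marks near the maximum of 100.
fit <- lm(lead ~ poly(grader, 2))

## Map observed marks from this grader to predicted "true" marks, capped to 0-100.
adjust <- function(x) pmin(100, pmax(0, predict(fit, newdata = data.frame(grader = x))))
adjust(c(80, 90, 100))
```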

Silverfish
  • Unfortunately the nature of assessment 2 makes it impossible for the graders to look at more for calibration purposes. You can think of it as being like an oral poetry recitation that was done once with no recording, and which was assessed immediately afterwards. It would be impractical to schedule new recitations purely for calibration purposes. To answer your other question, Assessment 2 didn't really have clear subcomponents, and we don't need to consider component scores. – user1205901 - Reinstate Monica Nov 18 '14 at 10:32
  • This is "not an answer" but in an ideal world I'd have suggested turning things around and using an example sample (possibly of artificial assignments deliberately designed to be on grade borderlines, rather than by real students) as a way of training the graders to have the same generosity, rather than to deduce and compensate for their generosities. If the assessments are already done, this is clearly no solution for you, though. – Silverfish Nov 18 '14 at 11:23
  • (+1) Very thorough "not an answer". Consistency in rather subjective tests can often be greatly improved by splitting the grading task into components - otherwise one grader might be giving more weight to rhythm, another to projection, &c. – Scortchi - Reinstate Monica Nov 18 '14 at 11:36
  • It is clear that in addition to submitting a possible adjustment to the person who will ultimately decide the issue, I will also need to submit some explanation of the pros and cons of adjustment. Your response provides a lot of helpful material regarding this. However, I wonder what criteria I can use to make a judgement on whether it's more beneficial to leave everything alone, or to make a change. I look at the cohort grades and my intuition says that the differences between markers are having a big impact. Intuition is unreliable, but I'm not sure what else I can go on in this case. – user1205901 - Reinstate Monica Nov 18 '14 at 11:47
  • One question is whether you have reasonable grounds to believe the "differential task aptitude" effect to be small, particularly when averaged over a cohort, compared to the "grader generosity" effect. If so, you might attempt to estimate the generosity effect for each cohort - but you risk being confounded. Moreover, there is a Catch 22. I would be *most* cautious of applying large "corrections" to the observed grades. But if suggested corrections are small, it is plausible they are due to systematic differences in differential task ability between cohorts, not grader generosity at all. – Silverfish Nov 18 '14 at 13:19
  • Confounding means there's nothing in the data by itself to help you, & you do have to use "intuition", which might be called prior knowledge, & shouldn't be limited to yours alone. Other information that could be brought to bear includes students' age & results from previous assessments. Obviously these aren't things you'd want to directly take into account in any adjustment of marks: but if there are large differences in the averages of these between cohorts, & especially if the magnitude of apparent grader generosity seems to be related to them, you'd be inclined to give more weight to ... – Scortchi - Reinstate Monica Nov 18 '14 at 16:22
  • ... differential task aptitude as an explanation; on the other hand, if the cohorts are well mixed (at least on things you know about) you'd be inclined to give it less weight. – Scortchi - Reinstate Monica Nov 18 '14 at 16:27
  • The intuitions of those involved indicate that the Assessment 2 results are aberrant. That is, some cohorts were expected to do disproportionately well on Assessment 1 and if anything to have a marginally bigger edge on Assessment 2, but despite doing as well as expected on Assessment 1 they did rather poorly on Assessment 2. These intuitions drove my specification in the OP that "We should assume that I have good reasons to believe that performance on the two tasks will be highly correlated, but that the markers differ considerably in their generosity." – user1205901 - Reinstate Monica Nov 19 '14 at 00:10
  • @Scortchi Point taken that large adjustments must be out of the question, especially as we can't be sure that our intuitions are correct about grader generosity driving the effect instead of an unexpected interaction between cohort and assessment type. – user1205901 - Reinstate Monica Nov 19 '14 at 01:04
  • Would working with relative generosity be more likely to lead to a reasonable outcome if we can assume that the process is meant to end with a certain overall mean? – user1205901 - Reinstate Monica Nov 19 '14 at 02:11
  • I think knowing what the final class average should be is helpful for the reasons suggested at the start of my answer, but doesn't by itself deal with the confounding issue. Having a good idea what the mean should be for each cohort would be ideal but I suspect unrealistic! – Silverfish Nov 21 '14 at 10:14
  • Ah yes, I had meant overall mean in the sense of overall mean across every cohort. We do have approximate guidelines on what that's meant to be. – user1205901 - Reinstate Monica Nov 21 '14 at 22:48

A very simple model:

Let $s_{1,i}$ be the score of student $i$ on test 1, and $s_{2,i}$ his score on test 2. Let $A_1, \ldots, A_p$ be the partition of the students in the original cohorts.

Each cohort is biased by the strength of its students and the easiness of its grader. Assuming this is an additive effect, we back it out in the following way: we subtract the cohort's average score on the first test and add its average score on the second test.

We compute an adjusted score $s'_1$ as follows:

$$\forall j \leq p,\ \forall i \in A_j,\quad s'_{1,i} = s_{1,i} - \frac{1}{|A_j|} \sum_{k \in A_j} \left( s_{1,k} - s_{2,k} \right)$$

Finally, form a final score $s$ with whichever weighting you find appropriate

$$\forall i, s_i = \alpha s'_{1,i} + (1-\alpha) s_{2,i}$$
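
A minimal R sketch of this adjustment, assuming numeric score vectors s1 and s2 and a factor cohort giving each student's cohort (all names hypothetical); note, per the comments below, that in the OP's setting the roles of the two assessments are swapped, so it is the hand-graded score that would be adjusted.

```r
## Per-cohort additive adjustment followed by a weighted final score.
alpha   <- 0.5                                  # illustrative weight only
bias    <- ave(s1 - s2, cohort, FUN = mean)     # cohort mean of (s1 - s2)
s1_adj  <- s1 - bias                            # adjusted score s'_1
s_final <- alpha * s1_adj + (1 - alpha) * s2    # weighted final score
```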

The downside is that an individual student might be penalized if the people in his cohort happened to get unlucky on the second test. But any statistical technique is going to carry this potentially unfair downside.

Arthur B.
  • As with any other proposal, this one will suffer from the inherent unfairness of being unable to distinguish the grader effect from the group effect. There simply is no way around that. At least your procedure is a little more transparent than some others that have been proposed, by making its arbitrary nature obvious (in the choice of $\alpha$). – whuber Nov 17 '14 at 16:47
  • Given sufficiently large cohorts, this does remove the group effect and grader effect. – Arthur B. Nov 17 '14 at 17:10
  • No - the cohorts aren't selected at random. – Scortchi - Reinstate Monica Nov 17 '14 at 17:11
  • Doesn't matter, you recover the group's true latent average "ability" – Arthur B. Nov 17 '14 at 17:15
  • Say $s_{1,i} = \mu_i + g_j + \epsilon_{1,i}$ where $i \in A_j$, and $s_{2,i} = \mu_i + \epsilon_{2,i}$ where $i \in A_j$. $\mu_i$ designates the student's true latent ability. My technique estimates $g_j$, the grader effect, by comparing the cohort's average score on test 2 vs test 1. – Arthur B. Nov 17 '14 at 17:20
  • ... which, as @whuber keeps saying, is confounded with any inherent tendency of the cohort (owing to age or whatever) to do relatively better on one type of test than another. – Scortchi - Reinstate Monica Nov 17 '14 at 17:28
  • You cannot eliminate confounding by taking larger cohorts! At best you can come up with ever more precise estimates of uninterpretable values. – whuber Nov 17 '14 at 17:31
  • Indeed I assume that differential success on each test is cohort independent, which strikes me as very reasonable. – Arthur B. Nov 17 '14 at 17:46
  • Reasonable, perhaps: but it's untestable given the information available to the OP. The validity of your answer relies on the truth of this implicit assumption. Even worse, its negation (which of course is also untestable) is eminently reasonable, too: because cohorts are self-selected, they may consist of people who perform in common ways on different assessment instruments, suggesting it may actually be *likely* that differential success will be due in part to the cohort and only partially due to variability among graders. – whuber Nov 17 '14 at 17:57
  • There is no indication in the question that the tests test for different aptitude. Nothing is testable in life, there are only models. – Arthur B. Nov 17 '14 at 18:07
  • In the OP I had mentioned that Assessment 2 was the one which had problems with the graders, and for which we were contemplating an adjustment. In your answer, did you flip that around that Assessment 1 was the one with these problems? – user1205901 - Reinstate Monica Nov 17 '14 at 22:51
  • Ah yes, sorry, I did. – Arthur B. Nov 17 '14 at 22:51
  • The $\alpha$ factor is a nice idea, but adjusting by the mean alone seems too crude - you'll be adjusting up and down the maximum possible score in each cohort. If you're going to do this sort of thing (bearing in mind all the caveats), I'd be more comfortable with some transformation $s_{1,i} \rightarrow s_{1,i}'$ that keeps 100 (or some more robust "very high score" estimator if 100 is very rare) as a fixed point. – awcc Nov 19 '14 at 18:30
  • Replace the mean with the median as needed, the basic idea remains the same. – Arthur B. Nov 19 '14 at 18:31

You can't. At least, not without collecting additional data. To see why, read @whuber's numerous upvoted comments throughout this thread.

Jake Westfall

Rephrasing the problem: how best to set the mark for a two-part exam when the second part is subject to greater uncertainty due to the range of the Delegated Markers' qualitative assessments.

Where:
Master Tester = the person accountable for the exam
Delegated Marker = a person (1 of 11) assigned to mark part #2 of the exam
Student = the one who gets the fun of sitting the exam

Goals include:
A) Students receive a mark that reflects their work
B) Manage the uncertainty of the second part to align with the intent of the Master Tester

Suggested approach (answer):
1. The Master Tester randomly selects a representative sample of exams, marks part #2, and develops the correlation with part #1.
2. Use this correlation to assess all of the Delegated Markers' data (part #1 vs part #2 scores).
3. Where a marker's correlation is significantly different from the Master Tester's (with the level of significance set by what is acceptable to the Master Tester), examine the exam as the Master Tester would and re-assign the result.

This approach ensures that the Master Tester is accountable for the correlation and the acceptable significance. The correlation could be as simple as the score for part #1 vs #2 or relative scores for questions of test #1 vs #2.

The Master Tester will also be able to set a quality of result for Part #2 based on the "rubbery-ness" of the correlation.
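
If such a calibration sample from the Master Tester were available, a rough R sketch of steps 2-3 might look like the following (all object and column names are hypothetical, and this only flags markers for review rather than producing adjusted grades):

```r
## Assumes a data frame d with columns part1, part2, marker for all students, and a
## calibration sample cal with columns part1, part2 as marked by the Master Tester.
ref <- lm(part2 ~ part1, data = cal)        # Master Tester's reference relationship

## Average deviation of each marker's part-2 marks from the reference prediction;
## large positive (negative) values suggest relatively generous (harsh) marking.
d$dev <- d$part2 - predict(ref, newdata = d)
sort(tapply(d$dev, d$marker, mean))
```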

MarkR