Hopefully this isn't overkill, but I think this is a case where fitting a linear model and looking at the estimated marginal means would make sense. Estimated marginal means are adjusted for the other terms in the model. That is, the estimated marginal mean score for Student a is adjusted for the effect of Evaluator 1: how much higher or lower Evaluator 1 tends to rate students relative to the other evaluators.
At a minimum, you can conduct a classic ANOVA to see if there is a significant effect of Evaluator.
I have an example of this below in R.
Edit: I suppose for this approach to make sense (that is, to be fair), each evaluator should have evaluated a reasonably large number of students, and a random or representative cross-section of them. There is always a chance that an evaluator just happens to evaluate, say, a group of poor students. In that case we would wrongly suspect the evaluator of being tough, when in fact their students were simply poor. Having multiple evaluators per student makes this less likely.
Create some toy data
Data = read.table(header=TRUE, text="
Student Evaluator Score
a 1 95
a 2 80
b 2 60
b 3 50
c 3 82
c 1 92
d 1 93
d 2 84
e 2 62
e 3 55
f 1 94
f 3 75
")
Data$Evaluator = factor(Data$Evaluator)
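Before fitting anything, the fairness caveat from the edit above can be checked with a quick cross-tabulation. This is only a sketch in base R: it rebuilds the same toy data with `data.frame()` so it runs on its own, and the object names (`students_per_evaluator`, `evaluators_per_student`, `design`) are my own labels, not anything from a package.

```r
# Rebuild the toy data so this block is self-contained
Data <- data.frame(
  Student   = factor(rep(c("a", "b", "c", "d", "e", "f"), each = 2)),
  Evaluator = factor(c(1, 2,  2, 3,  3, 1,  1, 2,  2, 3,  1, 3)),
  Score     = c(95, 80,  60, 50,  82, 92,  93, 84,  62, 55,  94, 75)
)

# How many students did each evaluator rate?
students_per_evaluator <- table(Data$Evaluator)

# How many evaluators rated each student?
evaluators_per_student <- table(Data$Student)

# Full Student x Evaluator layout (1 = rated, 0 = not rated)
design <- xtabs(~ Student + Evaluator, data = Data)

print(students_per_evaluator)
print(evaluators_per_student)
print(design)
```

In this toy data every evaluator rated four students and every student was rated by two evaluators, so the design is small but balanced.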
Create linear model
model = lm(Score ~ Student + Evaluator, data=Data)
Classic ANOVA to see if there is an effect of Evaluator
library(car)   # provides Anova() for Type II tests
Anova(model)
### Anova Table (Type II tests)
###
### Sum Sq Df F value Pr(>F)
### Student 1125.5 5 20.699 0.005834 **
### Evaluator 414.5 2 19.058 0.009021 **
### Residuals 43.5 4
So, it looks like there is a significant effect of Evaluator (p = 0.009).
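The ANOVA table says that evaluators differ, but not by how much or in which direction. As a rough sketch in base R, the evaluator coefficients of the same model give that. The data is rebuilt here so the block runs on its own, and `evaluator_coefs` and `evaluator_effects` are just names I made up for illustration.

```r
# Rebuild the toy data so this block is self-contained
Data <- data.frame(
  Student   = factor(rep(c("a", "b", "c", "d", "e", "f"), each = 2)),
  Evaluator = factor(c(1, 2,  2, 3,  3, 1,  1, 2,  2, 3,  1, 3)),
  Score     = c(95, 80,  60, 50,  82, 92,  93, 84,  62, 55,  94, 75)
)

model <- lm(Score ~ Student + Evaluator, data = Data)

# With R's default treatment contrasts, these are differences from Evaluator 1
# (on this data: Evaluator2 = -10.0, Evaluator3 = -16.5)
evaluator_coefs <- coef(model)[c("Evaluator2", "Evaluator3")]

# Re-center so each effect is relative to the average of the three evaluators
evaluator_effects <- c(Evaluator1 = 0, evaluator_coefs)
evaluator_effects <- evaluator_effects - mean(evaluator_effects)

print(round(evaluator_effects, 2))
```

On this data Evaluator 1 comes out as the most generous and Evaluator 3 as the toughest, which is what the adjustment of the student means corrects for.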
Look at simple arithmetic means
library(FSA)   # provides Summarize()
Summarize(Score ~ Student, data=Data)
### Student n mean sd min Q1 median Q3 max
### 1 a 2 87.5 10.606602 80 83.75 87.5 91.25 95
### 2 b 2 55.0 7.071068 50 52.50 55.0 57.50 60
### 3 c 2 87.0 7.071068 82 84.50 87.0 89.50 92
### 4 d 2 88.5 6.363961 84 86.25 88.5 90.75 93
### 5 e 2 58.5 4.949747 55 56.75 58.5 60.25 62
### 6 f 2 84.5 13.435029 75 79.75 84.5 89.25 94
Look at the adjusted means for Student
library(emmeans)
emmeans(model, ~ Student)
### Student emmean SE df lower.CL upper.CL
### a 83.7 2.46 4 76.8 90.5
### b 59.4 2.46 4 52.6 66.2
### c 86.4 2.46 4 79.6 93.2
### d 84.7 2.46 4 77.8 91.5
### e 62.9 2.46 4 56.1 69.7
### f 83.9 2.46 4 77.1 90.7
###
### Results are averaged over the levels of: Evaluator
### Confidence level used: 0.95
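Note that some students' adjusted means went up and others went down relative to their arithmetic means. For example, Student a dropped from 87.5 to 83.7, while Student b rose from 55.0 to 59.4.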
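To make it clearer what the adjustment is doing, the adjusted means can be reproduced without emmeans: predict each student's score under every evaluator, then average those predictions with equal weight. This is a minimal base R sketch, not how emmeans is implemented internally. The data is rebuilt so the block runs on its own, and `grid`, `adjusted`, `arithmetic` and `comparison` are names I chose for illustration.

```r
# Rebuild the toy data so this block is self-contained
Data <- data.frame(
  Student   = factor(rep(c("a", "b", "c", "d", "e", "f"), each = 2)),
  Evaluator = factor(c(1, 2,  2, 3,  3, 1,  1, 2,  2, 3,  1, 3)),
  Score     = c(95, 80,  60, 50,  82, 92,  93, 84,  62, 55,  94, 75)
)

model <- lm(Score ~ Student + Evaluator, data = Data)

# Every Student x Evaluator combination, including pairs that were never observed
grid <- expand.grid(Student   = levels(Data$Student),
                    Evaluator = levels(Data$Evaluator))
grid$fit <- predict(model, newdata = grid)

# Adjusted mean: average each student's predictions over all three evaluators
adjusted <- tapply(grid$fit, grid$Student, mean)

# Arithmetic mean: average of the scores the student actually received
arithmetic <- tapply(Data$Score, Data$Student, mean)

comparison <- data.frame(
  Student    = names(adjusted),
  arithmetic = as.numeric(arithmetic),
  adjusted   = round(as.numeric(adjusted), 1)
)
comparison$change <- comparison$adjusted - comparison$arithmetic

print(comparison)
```

The `adjusted` column should agree with the `emmean` column above. Students whose arithmetic mean came from the more generous evaluators are pulled down, and those rated by the tougher evaluators are pulled up.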