Performance metric based on five-star ratings

Question

Although I think my question might be a specific case of a generic problem, I'll include some background info that might be relevant. There is a sort of concierge/delivery service I know that uses a model fairly similar to Uber. As with Uber it uses an anonymous, five-star rating system (no half stars) for feedback. While Uber lets its drivers rate passengers this service does not currently allow any kind of feedback on the customers by its workers.

As you can imagine there is a lot of angst regarding bad ratings. I don't know how common it is but a rating of three or lower results in that worker no longer being allowed to pick up jobs for that customer. I don't believe that the customer is necessarily ever aware that selecting the midpoint response has that severe consequence. Another point of some frustration is that the customers are forced to leave feedback, if they hadn't already done so, at the point they want to use the service again. And finally another possible bias until recently was that customers couldn't leave any further feedback in tandem with a five-star rating; all others were prompted for more details.

Here's another point that's possibly significant: A customer can actually change a worker's rating at any time, so it appears to me that ratings are intended not as a measure of performance per job but rather as a measure of the worker overall. Though I'm very curious to see for myself the interface of the mobile app that collects this information, I just haven't had the opportunity yet.

These figures are used to calculate a rolling average and in this case that average drives a ranking used to determine the order that workers on shift get a notification, and thus opportunities, to snag new jobs. Workers can have multiple territories and as another twist I believe the ratings generally apply per territory, at least in the first stage of ranking. In many cases the sample could be small or non-existent. The ranking is never explicitly revealed and apparently the formula also involves a few other stats and all-time fallbacks/tie-breakers.

Let's mostly ignore that and focus on the big rating average. There is a lot of variance in the number of completed jobs across the workers. At a given time there are often only a handful of workers active in each territory. New workers initially get an artificial boost to their scores so they can get in the game.

Hopefully this is enough information to make some broad generalizations about it. I know that there are problems trying to use a Likert scale to generate and interpret averages and I've read this question in particular. (Is this in fact a Likert scale or just another kind of ordinal scale?) I already liked the idea of adjusting the customers' scores as suggested there. But ultimately I am very much a novice in this field and I want to know where this performance metric sits on a spectrum between perfectly reasonable and entirely unfair. Are there any factors here that make this whole system a bad idea?

Is Amazon's "average rating" misleading?

It seems that you've already spotted a few of the flaws in the system. For example, forcing people to give ratings before they can make a new booking. I suspect that if you force someone to give a Likert rating "against their will" they'll either pick an extreme or the midpoint. Less likely, but someone very keen to give feedback may give a 4 rather than a 5 — Ian_Fin, Jul 19 '16 at 14:15
The fact that a person's rating can be changed "on the fly" isn't ideal either. If I have four good experiences with a worker and then one terrible one you're expecting me to give that person an average score and not just give them 1. You're only as good as your last job, which isn't an entirely unusual notion but maybe not the best metric to hold someone's work against — Ian_Fin, Jul 19 '16 at 14:18
@Ian_Fin Thanks. Workers get a little hung up on not getting a five and seeing the average drop. Their feeling demoralized is one side effect and I wonder if just seeing the ranking would be better. But can it be argued that despite some flaws this method will produce a good enough approximation of an "ideal" ranking or is it hopelessly doomed to fail at that? — shawnt00, Jul 19 '16 at 14:46
Seeing a poor rating may be demoralising, or it may encourage a worker to work harder next time. I suspect that this question is really too broad and opinion based for StackExchange. The issues aren't really statistical either... That said, I can't see an argument for why only the most recent ranking would be a better approximation to a worker's true ability than the average of all rankings. — Ian_Fin, Jul 19 '16 at 14:54

Eduard Gelman · Answer 1 · 2016-07-22T17:18:46.413

Ultimately, you're trying to solve an economics problem that goes something like: "how do I make a single-scoring rating system that is fair while rewarding productive activity and entry into the ecosystem, and how do I sleep at night feeling like I've done a good job?".

There's no right answer. You've done a lot of great thinking about incentives already. I might add a suggestion to the system: any customer who consistently ranks at extremes should probably be disregarded, that is, screen out scores from customers whose ranking variance is low.

Maybe you can use a Hidden Markov Model to try to find the "true" ranking of a driver given only indirect measurements, but at this high level of abstraction, what is "truth"?

The gold standard is the randomized trial. If you have the resources, why not devise a couple of scoring methods, run some random trials (each zip code randomly gets a different system, or something similar), and compare performance or satisfaction metrics across treatments?
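As a rough illustration of the trial idea, here is a minimal sketch that randomly assigns units (zip codes, say) to scoring methods and then compares a mean outcome per method. The function names, the seed, and the outcome metric are all hypothetical; a real trial would also need a significance test and enough units per arm.

```python
import random
from statistics import mean

def assign_treatments(units, treatments, seed=0):
    """Shuffle the units, then deal them out round-robin so arms stay balanced."""
    rng = random.Random(seed)  # fixed seed only so the assignment is reproducible
    shuffled = list(units)
    rng.shuffle(shuffled)
    return {unit: treatments[i % len(treatments)] for i, unit in enumerate(shuffled)}

def mean_outcome_by_treatment(assignment, outcomes):
    """Average an outcome metric (hypothetical, e.g. repeat-booking rate) per arm."""
    grouped = {}
    for unit, treatment in assignment.items():
        if unit in outcomes:  # ignore units with no recorded outcome
            grouped.setdefault(treatment, []).append(outcomes[unit])
    return {treatment: mean(values) for treatment, values in grouped.items()}
```

Randomizing at the zip-code level rather than per customer keeps every worker in a territory under the same scoring rules, at the cost of fewer independent units.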

Edit: The commenter below is right in building on my logic! An individual's score should also depend on their other scores, not just for extremist raters. If an individual doesn't ever rate 5 stars, but does use the rest of the scale, the system should adjust the 4 stars to have roughly equal weight to 5 stars. Likewise, an overly generous rater who gives only 4 or 5 stars should have their inputs scaled down.
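One possible way to implement that kind of adjustment is to standardize each rating against the same customer's own rating history before averaging per worker. This is only a sketch under assumptions: the `(customer_id, worker_id, stars)` tuple layout and the function name are made up for illustration, and z-scoring treats the stars as interval data, which is itself debatable for an ordinal scale.

```python
from collections import defaultdict
from statistics import mean, pstdev

def adjusted_worker_scores(ratings):
    """ratings: iterable of (customer_id, worker_id, stars) tuples (assumed layout)."""
    ratings = list(ratings)

    # Collect every customer's own rating history.
    by_customer = defaultdict(list)
    for customer, _worker, stars in ratings:
        by_customer[customer].append(stars)

    by_worker = defaultdict(list)
    for customer, worker, stars in ratings:
        history = by_customer[customer]
        spread = pstdev(history)
        if spread == 0:
            # This customer always gives the same score, so the rating
            # says nothing about relative quality: screen it out.
            continue
        # Express the rating relative to this customer's usual behaviour.
        by_worker[worker].append((stars - mean(history)) / spread)

    return {worker: mean(z) for worker, z in by_worker.items()}
```

With this scheme, a 4 from someone who never gives 5 stars and a 5 from someone who gives only 4s and 5s both come out as above-average ratings for that customer. Workers whose only ratings come from screened-out customers get no score at all, so some fallback would still be needed.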

TLDR but if lots of customers give 3-star ratings then your model needs to adjust for that - ie don't penalise someone for getting 3 stars — Christian, Jul 22 '16 at 05:29
If it is "pass/fail" then why use 5 stars? Why not put a "I am satisfied with my service: yes/no". If no, give a textbox for them to detail their issue. — EngrStudent, Jul 25 '16 at 16:14
It was a tough decision to choose between the answers. I GREATLY appreciate that you spent time writing up your valuable insights for my question. — shawnt00, Jul 26 '16 at 04:43

Kodiologist · Accepted Answer · 2016-07-25T16:00:25.253

This is the sort of question considered in the field of psychometrics, the study of mental measurement. Measures of worker performance in particular are often considered in industrial-organizational psychology.

Many of the criticisms mentioned here, by you and others, seem on the money. A broader issue is that no studies seem to have been conducted to estimate the ratings' (or rankings') reliability and validity, which are two qualities that any good mental test should have. Reliability is the degree to which the test gives consistent results when the thing it is measuring is consistent, and various subtypes of reliability such as retest reliability and parallel-forms reliability can be assessed directly. An unreliable test is one that is too noisy to provide useful measurements. Validity is a quality of a test specific to an intended use of the test, which concerns how accurate it is for the intended use; for example, how well it predicts whether a customer will use the service again. The methods of assessing validity are as wide and varied as the uses of mental tests. But in any case, a test that lacks validity for a specific use is too inaccurate to be useful.
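To make the reliability check concrete, here is a minimal sketch of a retest-style estimate: compute each worker's mean rating in two separate time periods and correlate the two sets of means across workers. The dictionary layout and the function names are assumptions for illustration, and a real study would need far more care (sample sizes, the ordinal scale, workers whose performance genuinely changes).

```python
from math import sqrt
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; assumes both lists have non-zero variance."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

def retest_reliability(period_a, period_b):
    """period_a, period_b: dicts mapping worker id -> list of star ratings."""
    # Only workers rated in both periods can be compared.
    shared = sorted(set(period_a) & set(period_b))
    means_a = [mean(period_a[w]) for w in shared]
    means_b = [mean(period_b[w]) for w in shared]
    return pearson(means_a, means_b)
```

A value near 1 would suggest the averages are stable from one period to the next; a value near 0 would suggest they are mostly noise. A validity check could reuse `pearson` to relate ratings to an outcome of interest, such as whether the customer books again.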

Could these ratings and rankings be reliable and valid? They might be, but unless we check, we don't know, and any test put to a high-stakes purpose, such as controlling what jobs workers can take, doesn't deserve the benefit of the doubt.
