Let's look at an example before I get into my questions.


Let's say you're trying to predict the number of Oscars a movie will win.

With quantitative data we can clean and plug in a bunch of features (e.g. box office opening, Rotten Tomatoes score, days released before the Oscars, etc.), feed them to an algorithm (e.g. linear regression in Python's scikit-learn, my toolkit of preference), and start making predictions. (Yes, I know ML is more complicated than that, but I am trying to keep this simple.)
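For concreteness, a minimal sketch of that kind of pipeline; the file name and column names here are hypothetical, not from an actual dataset:

```python
# Minimal sketch (not a full pipeline): fit a linear regression on
# per-movie quantitative features. All names are made up for illustration.
import pandas as pd
from sklearn.linear_model import LinearRegression

movies = pd.read_csv("movies.csv")  # hypothetical file: one row per movie
features = movies[["box_office_opening", "rotten_tomatoes_score", "days_released_before_oscars"]]
target = movies["oscars_won"]

model = LinearRegression()
model.fit(features, target)
predictions = model.predict(features)  # in practice you'd predict on a held-out set
```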

Now let's say you have a data set of qualitative data (still all numerical). That is to say, for each movie we have reviews from a number of judges. Each judge rated the movie on a scale from 1 to 5 in 5 different categories (and we know which judge gave which movie which score). To make things more interesting, judges can also review a movie more than once. We do have the dates of the reviews, so we know the order they came in.

Example of a dataset (image omitted): one row per review, with columns for the movie, the judge, the review date, and the five category scores.

What are effective approaches to using this qualitative data for predictions (still wanting to predict the number of Oscars a movie will win)? I would probably merge this data with the quantitative data, so I imagine I want one data entry per movie; hence I am looking at how best to collapse all the qualitative data into a single input row.
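A minimal sketch of the simplest collapse-and-merge step, assuming a reviews table with one row per (movie, judge, date) and five score columns; all file and column names are hypothetical:

```python
import pandas as pd

reviews = pd.read_csv("reviews.csv")  # one row per individual review
movies = pd.read_csv("movies.csv")    # one row per movie (quantitative features)
score_cols = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]

# Simplest collapse: mean score per category across all of a movie's reviews.
per_movie = (
    reviews.groupby("movie_id")[score_cols]
    .mean()
    .add_prefix("mean_")
    .reset_index()
)

# Merge back so each movie becomes a single input row.
combined = movies.merge(per_movie, on="movie_id", how="left")
```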


I am fairly new to ML and prediction, so I am sure there are many techniques I don't know of that could address this problem. Solutions I have thought of are:

  • Averaging the score in each category across all judges' reviews for each movie. This treats old and new reviews the same, treats all judges as equal, and outliers could affect the input.

  • Using a time-weighted average so the newest reviews count for more (see the sketch after this list). Seems better, but judges are still all considered equal (even if they are not all equally good at judging). Could also consider taking only the latest review from each judge.

  • Dropping high/low outliers, either the judges with the highest/lowest scores across the board or just the highest/lowest score in each category/column.
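A sketch of the time-weighted idea from the second bullet, using an exponential decay; the half-life, file name, and column names are assumptions for illustration:

```python
# Time-weighted (exponentially decayed) average per movie and category:
# newer reviews get exponentially more weight than older ones.
import numpy as np
import pandas as pd

reviews = pd.read_csv("reviews.csv", parse_dates=["review_date"])
score_cols = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]
half_life_days = 90.0  # arbitrary choice; tune or cross-validate in practice


def time_weighted_mean(group: pd.DataFrame) -> pd.Series:
    # Age of each review relative to the movie's newest review.
    age_days = (group["review_date"].max() - group["review_date"]).dt.days
    weights = 0.5 ** (age_days / half_life_days)
    return pd.Series(
        np.average(group[score_cols], axis=0, weights=weights), index=score_cols
    )


per_movie = reviews.groupby("movie_id").apply(time_weighted_mean)
```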

Are there other or better approaches I am not thinking of?

  • What if we know some judges are better or worse at reviewing movies, but we don't know which? How can we account for judge quality? By dropping the highest and lowest reviews? Is there a way to weight individual reviewers' inputs (one possible approach is sketched below)?
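On that last point, one illustrative heuristic (not an established recipe, and all names are hypothetical): estimate each judge's reliability from how closely their scores track the per-movie consensus, then use that reliability as a weight when collapsing reviews.

```python
# Weight judges by agreement with the per-movie consensus, then take a
# weighted average of their scores for each movie.
import numpy as np
import pandas as pd

reviews = pd.read_csv("reviews.csv")
score_cols = ["cat_1", "cat_2", "cat_3", "cat_4", "cat_5"]

# Per-movie consensus: plain mean across all reviews of that movie.
consensus = reviews.groupby("movie_id")[score_cols].transform("mean")

# A judge's error = mean absolute deviation from consensus over all their reviews.
reviews["abs_dev"] = (reviews[score_cols] - consensus).abs().mean(axis=1)
judge_error = reviews.groupby("judge_id")["abs_dev"].mean()

# Convert error to a weight: reliable judges (small error) get larger weights.
judge_weight = 1.0 / (1.0 + judge_error)
reviews["w"] = reviews["judge_id"].map(judge_weight)


def weighted_scores(group: pd.DataFrame) -> pd.Series:
    return pd.Series(
        np.average(group[score_cols], axis=0, weights=group["w"]), index=score_cols
    )


per_movie = reviews.groupby("movie_id").apply(weighted_scores)
```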

Looking for ideas and approaches I can learn more about.

