
Let's say we have a supervised learning binary classification dataset. We are much more confident that some training examples are labelled accurately than others (e.g. some were labelled by several highly skilled humans who analyzed each example in depth, while others were labelled by a low-skilled person who had to label a large number of examples in a short amount of time, likely leading to many mistakes). Only about 15% of the examples were labelled by "experts". To make matters worse, the examples were not randomly assigned to experts or non-experts: the distributions of examples given to experts and non-experts are different (e.g. 70% of the expert-labelled examples have the positive class, versus 10% of the non-expert-labelled examples).

I want to use random forests or gradient boosted trees to learn a model on this data.

How should I incorporate this prior knowledge into the model or into the dataset? One approach I can think of is to fit a model to the original data, then take a closer look at the misclassified examples that I am less confident about, and potentially re-label them in the dataset if they turn out to have been labelled incorrectly. Is that the best way of doing it? Is there some way to, e.g., give more weight to the training examples I am more confident about? And would that not cause a problem, given that those examples were not randomly sampled from the universe?
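
In code, the first idea might look roughly like this (just a sketch in R with the ranger package; df, y and the expert flag are hypothetical names for my data, and any re-labelling would of course still be done by hand):

    library(ranger)

    # Hypothetical data frame `df` with a factor outcome `y` and a logical
    # column `expert` marking the ~15% of rows labelled by experts.
    fit0 <- ranger(y ~ ., data = df[, setdiff(names(df), "expert")], num.trees = 500)

    # Out-of-bag predictions for the training rows; using OOB predictions
    # avoids the near-zero training error a random forest would otherwise show.
    oob_pred <- fit0$predictions

    # Non-expert-labelled rows whose label disagrees with the model:
    # candidates for a closer look and possible re-labelling.
    suspect <- df[!df$expert & oob_pred != df$y, ]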

rinspy
  • My question differs slightly in that the subset of examples labelled by "experts" is not a random sample of all examples. I was wondering if introducing such weights would create a selection bias in the dataset. In an extreme example, if I set the weights of the "expert" examples to 1 and all other weights to 0, I end up with a very biased dataset and learn a model that doesn't generalize well to the universe. But I suppose choosing the weights carefully would mitigate this problem. – rinspy Jul 26 '17 at 14:36
  • @rinspy Yes, the choice of weights is not obvious here, but it should be possible to give some quantitative interpretation to 'expert' & 'non-expert' (or even some continuous scale of expertise). – mkt Jul 26 '17 at 15:48
  • The weighting can be treated as another hyperparameter to be tuned as well. – Firebug Jul 26 '17 at 18:18
  • @Firebug, that's a good point! The problem is that if the weighting is used to evaluate the model as well as to train it (e.g. giving more weight to correctly predicting examples we are more confident in), that can lead to problems as well (e.g. if classifying "expert" examples is easier, we will end up giving all other examples a weight of 0). – rinspy Jul 27 '17 at 10:00
  • I'm not so sure they would be ascribed 0 weight. Having poorly labeled (or even unlabelled) data can still help to better describe the labeled data (see the case of semi-supervised learning and even NN pre-training). As you said as well, the non-expert labels outnumber the expert ones by a large margin. – Firebug Jul 27 '17 at 11:42
  • @Firebug, yes, but the expert-labeled data is a biased subsample in my case. It may be possible (and likely easier) to learn a model that works well on that subsample / subuniverse than learn a model that generalizes to the whole universe. – rinspy Jul 27 '17 at 11:57

1 Answer


It is possible to incorporate this information by weighting observations differently in a random forest: in the bootstrapping step, each tree's sample is drawn with probability proportional to the observation weights. In R, this can be done with the ranger package (via its case.weights argument).
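
As a minimal sketch, reusing the hypothetical df / y / expert columns from the question (the weight of 2 for expert-labelled rows is arbitrary and would need to be chosen or tuned, e.g. as Firebug suggests in the comments):

    library(ranger)

    # Up-weight the expert-labelled rows so they are drawn more often in the
    # bootstrap sample underlying each tree. The value 2 is arbitrary and
    # should be tuned (or treated as a hyperparameter, per the comments).
    w <- ifelse(df$expert, 2, 1)

    fit <- ranger(y ~ .,
                  data         = df[, setdiff(names(df), "expert")],
                  case.weights = w,   # sampling probability proportional to weights
                  num.trees    = 500)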

However, see also this useful answer and comment about how to treat the OOB error estimates if you plan on weighting your data.
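
For example (again just a sketch with the same hypothetical columns), one option is to keep an unweighted random holdout alongside the OOB estimate, since the weighted bootstrap changes which observations end up out of bag:

    library(ranger)

    # An unweighted random holdout estimates error on the overall population;
    # the OOB estimate is affected by the weighted bootstrap (up-weighted rows
    # are in-bag more often, so they appear less in the out-of-bag sets).
    set.seed(1)
    test_idx <- sample(nrow(df), size = round(0.2 * nrow(df)))
    train <- df[-test_idx, ]
    test  <- df[test_idx, ]

    fit_w <- ranger(y ~ .,
                    data         = train[, setdiff(names(train), "expert")],
                    case.weights = ifelse(train$expert, 2, 1),
                    num.trees    = 500)

    oob_error  <- fit_w$prediction.error   # weight-affected OOB error
    pred       <- predict(fit_w, data = test[, setdiff(names(test), "expert")])$predictions
    test_error <- mean(pred != test$y)     # error on the untouched holdout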

mkt