
I am building a classification model with mislabeled training data: on the order of ~70% of the training labels are correct and ~30% are incorrect. Knowing this, how can I quantify the error rate of my model? For example, if I have 85% accuracy on the test set, how many of those 85% come from the 70% of records that are actually labeled correctly?

I should also say that the labels aren't mislabeled completely at random. There is certainly a relationship between my predictors and whether or not a label is correct. I have a few hundred possible labels and around 1 million records. The data are survey responses describing occupations, so a common source of mislabeling is write-ins containing words such as "Office manager", which could plausibly land in any number of codes.

Is there any literature on this? Maybe some sort of confidence interval I can build for the error rate?
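To make the question concrete, here is a toy binary simulation. It uses a uniform random 30% flip rate and a simple nearest-centroid classifier, neither of which matches my actual setting (my noise is predictor-dependent and multi-class), but it shows the decomposition I'm asking about:

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n):
    # two well-separated Gaussian classes in 2-D (a stand-in for real data)
    y = rng.integers(0, 2, n)
    X = rng.normal(size=(n, 2)) + 2.0 * y[:, None]
    return X, y

X_tr, y_tr_true = make_data(50_000)
X_te, y_te = make_data(10_000)

# flip 30% of the training labels uniformly at random
rho = 0.30
flip = rng.random(len(y_tr_true)) < rho
y_tr_noisy = np.where(flip, 1 - y_tr_true, y_tr_true)

# nearest-centroid classifier fit on the noisy labels
centroids = np.stack([X_tr[y_tr_noisy == k].mean(axis=0) for k in (0, 1)])

def predict(X):
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# true accuracy, measured against clean test labels
acc = (predict(X_te) == y_te).mean()

# if the test labels were corrupted at the same rate, the observed accuracy
# would be a mixture of agreements with correct and with flipped labels:
obs = acc * (1 - rho) + (1 - acc) * rho

# ...and the share of those agreements coming from correctly labeled records:
share_correct = acc * (1 - rho) / obs

print(f"accuracy vs clean labels:          {acc:.3f}")
print(f"expected accuracy vs noisy labels: {obs:.3f}")
print(f"share of agreements from correctly labeled records: {share_correct:.3f}")
```

In this symmetric binary case the relation `obs = acc*(1-rho) + (1-acc)*rho` can be inverted to recover the true accuracy; what I'm looking for is the analogue when the noise depends on the predictors and there are hundreds of classes.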

astel
  • A similar post: https://stats.stackexchange.com/questions/218656/classification-with-noisy-labels/447856#447856 – kjetil b halvorsen Jul 21 '20 at 15:36
    Thanks, yeah it's close, but the answer given assumes a completely random uniform distribution of the incorrect labels (or noisy labels, in the context of that question) – astel Jul 21 '20 at 17:26
  • Do you have any information (or can obtain it) on the more probable mislabelings? If so, tell us. How many different labels? Are some of them close so that equivocation is more probable? Also, there are some similar posts so search this site with keywords like "error class* labe*" (without the quotes) – kjetil b halvorsen Jul 21 '20 at 19:52
  • Here is a survey [Classification in the Presence of Label Noise](https://www.researchgate.net/publication/261601383_Classification_in_the_Presence_of_Label_Noise_A_Survey) – kjetil b halvorsen Jul 21 '20 at 20:34

0 Answers