
I have a dataset with 2 classes: A and B. The problem is that 20% to 30% of the samples in class B are mislabeled (labeled as B when the correct label is A), and I am not able to identify those mistakes.

Is there a way/approach/method to enhance the classification performance in this scenario?

Karolis Koncevičius
naddoth
  • Interesting question. I do not know the answer, but it feels like there should be something that could be done. In particular, you should exploit the fact that only class B samples can be mislabelled. So maybe somehow penalize "B as A" misclassification results less. In addition, if you know that the mistake rate is somewhere around 25%, you could try to add this to the optimisation step as well, so that models with a perfect A classification rate and 25% misclassification for B would get the highest score during optimization (a rough sketch of this follows the comments). But I expect this is quite hard to do. – Karolis Koncevičius Mar 24 '20 at 09:25
  • The only reasonable thing I can think of is, assuming your data is distinct across the two classes, to train a classification model while cutting out part of the data, and to repeat this while cutting out different parts, to see which performs better. If the classes are distinct, the bad data should affect your results. – ajax2112 Mar 24 '20 at 09:27
  • How do you know some labels are mislabeled? – Aksakal Dec 21 '21 at 18:13
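
The second suggestion in the first comment (rewarding models whose error rate on B matches the expected ~25% label noise) could be sketched roughly as follows with scikit-learn; the scoring formula and the assumed noise rate are illustrative choices, not an established metric.

```python
# Rough sketch of the comment's idea: score a model higher when it recovers
# class A perfectly and "misclassifies" roughly 25% of B samples (the assumed
# fraction of B labels that are really A). The formula below is an assumption
# made for illustration only.
from sklearn.metrics import make_scorer, recall_score

ASSUMED_NOISE_RATE = 0.25  # believed fraction of mislabelled B samples

def noisy_label_score(y_true, y_pred):
    recall_a = recall_score(y_true, y_pred, pos_label="A")
    recall_b = recall_score(y_true, y_pred, pos_label="B")
    # Maximal when A recall is 1 and B recall is near 1 - noise rate (~0.75).
    return recall_a - abs(recall_b - (1.0 - ASSUMED_NOISE_RATE))

scorer = make_scorer(noisy_label_score)
# e.g. GridSearchCV(estimator, param_grid, scoring=scorer) during model selection
```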

5 Answers


If you have wrong data and no way to get the true labels, then there is nothing "correct" you can do to recover this information.

You could treat this as an unsupervised (or semi-supervised) problem first, for example by clustering with 2 clusters (since you know there are only 2 labels) to get a model that predicts labels, and then following up with classification. Note that such results may be overly optimistic.
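
A minimal sketch of that cluster-then-classify idea, assuming numeric features and string labels "A"/"B"; the toy data, KMeans, and logistic regression are placeholder choices, not part of the answer.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

# Toy stand-in for the real data: X are the features, y the noisy labels.
X, true_lab = make_blobs(n_samples=600, centers=2, random_state=0)
y = np.where(true_lab == 0, "A", "B")

# Step 1: cluster into 2 groups, ignoring the given labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Step 2: call the cluster that contains most A-labelled points "A"
# (the A labels are assumed trustworthy in this problem).
a_cluster = np.bincount(kmeans.labels_[y == "A"]).argmax()
y_relabelled = np.where(kmeans.labels_ == a_cluster, "A", "B")

# Step 3: train the final classifier on the cluster-derived labels.
clf = LogisticRegression().fit(X, y_relabelled)
```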

user2974951

Under mild assumptions on the noise mechanism and data distribution (e.g. less than $\frac{1}{2}$ of the data is incorrectly labelled), some classifiers can be shown to be consistent in the binary classification setting. A classifier $C_n$, built from the training data, is said to be consistent if $$R(C_n) \to R(C^{\mathrm{Bayes}}) \quad \text{as} \quad n \to \infty,$$ where the risk $R(C) := \mathbb{P}(C(X) \neq Y)$ is minimised by the Bayes classifier $$ C^{\mathrm{Bayes}}(x) := \begin{cases} 1, & \text{if } \eta(x) \geq 1/2\\ 0, & \text{otherwise,} \end{cases}$$ with $\eta(x) := \mathbb{P}(Y = 1 \mid X = x)$.

K-nearest-neighbours and Support Vector Machines can be shown to satisfy this condition, while Linear Discriminant Analysis does not. Since the guarantee only holds as $n \to \infty$, this does not tell you how much data you will need in your case; however, the paper referenced below includes simulation studies that may help build intuition.
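
For intuition, here is a small simulation in the spirit of such studies: flip 25% of one class's training labels and compare a k-NN classifier trained on the noisy labels with one trained on clean labels, both evaluated on a clean test set. The data-generating process below is an arbitrary assumption, not taken from the reference.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic two-class data (an arbitrary stand-in for a real dataset).
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Corrupt the training labels: flip 25% of class-1 ("B") labels to class 0 ("A").
rng = np.random.default_rng(0)
y_noisy = y_tr.copy()
b_idx = np.flatnonzero(y_tr == 1)
flipped = rng.choice(b_idx, size=int(0.25 * len(b_idx)), replace=False)
y_noisy[flipped] = 0

# k-NN trained on noisy labels should stay close to the clean-label accuracy.
for name, labels in [("clean labels", y_tr), ("noisy labels", y_noisy)]:
    knn = KNeighborsClassifier(n_neighbors=25).fit(X_tr, labels)
    print(name, round(knn.score(X_te, y_te), 3))
```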

Reference

Cannings, T. I., Fan, Y. and Samworth, R. J. (2018) Classification with imperfect training labels. https://arxiv.org/abs/1805.11505.

Seraf Fej

In case of wrong data, the best practice in my experience is to get rid of it. Unlike conventional programming, where you build the algorithm and apply it to the data, in machine learning the algorithm comes from the data itself, so wrong data will disrupt your algorithm and you will get poor performance. The data you use in any machine learning algorithm should be as clean and as concise as possible to yield good results.

Michael
  • This is not always a good idea. It could be that those units are wrong precisely because they are different. By removing them, you remove the underlying process which generated the errors. You are in effect making the problem easier. – user2974951 Mar 24 '20 at 09:32
  • You are right if the error process has structure. But if it is random noise, you might fit your model to it, which may not be good. Also consider classifying cats vs. dogs with wrong labels caused by human error, most prevalent in difficult images where it is hard for a human to tell the difference; the model won't be able to tell the difference either if it gets reinforcement on those samples. In contrast, if you remove those, the model might learn patterns that can classify those hard samples correctly. – Michael Mar 24 '20 at 09:39
  • You propose "to get rid of it", but the point is that I don't know which samples of B are incorrectly labeled... – naddoth Mar 24 '20 at 12:58
  • Well, in that case you may do what @user2974951 recommended, and use some clustering to get an assessment of the labels. Again, I think that the training data you use should be as clean and concise as possible to yield good results. – Michael Mar 24 '20 at 16:27

You have a bunch of known As (if I read correctly) and some other cases that may or may not be As. So you want to find the cases in the unknown set that are most similar to the known As. That sounds relatively straightforward. If the As are really different, you'll get a clear break in the similarity function.
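
A rough sketch of what that "break in the similarity function" could look like in practice, assuming numeric features and labels "A"/"B"; the toy data and the average distance to the 5 nearest A-labelled neighbours are illustrative assumptions, not part of the answer.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors

# Toy stand-in for the real data: X are features, y the (partly wrong) labels.
X, true_lab = make_blobs(n_samples=600, centers=2, random_state=0)
y = np.where(true_lab == 0, "A", "B")

# For each B-labelled sample, measure its average distance to its 5 nearest
# A-labelled samples (small distance = very A-like).
nn = NearestNeighbors(n_neighbors=5).fit(X[y == "A"])
dist, _ = nn.kneighbors(X[y == "B"])
closeness_to_a = dist.mean(axis=1)

# Plot or inspect the sorted distances; a clear jump separates B-labelled
# samples that look like As (relabelling candidates) from the rest.
print(np.sort(closeness_to_a)[:10])
```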

Ed Rigdon
  • I don't understand your answer. What do you mean by your last sentence? Please give an example to illustrate your point. – naddoth Mar 24 '20 at 12:56

I'm a little late to this question, but for future readers: Try giving higher sample weights to data with class A. That way your algorithm will have a higher penalty for misclassifying A than for misclassifying B.

If your algorithm doesn't support sample weights, you could try oversampling your data from class A.
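
A minimal sketch of both options with scikit-learn; the weight of 3 for class A, the oversampling factor, and the toy data are illustrative assumptions to tune on your own data.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.utils import resample

# Toy stand-in for the real data.
X, true_lab = make_blobs(n_samples=600, centers=2, random_state=0)
y = np.where(true_lab == 0, "A", "B")

# Option 1: higher penalty for misclassifying A via class weights.
clf = LogisticRegression(class_weight={"A": 3.0, "B": 1.0}).fit(X, y)

# Option 2: oversample class A if the algorithm has no weighting support.
X_a, y_a = X[y == "A"], y[y == "A"]
X_a_extra, y_a_extra = resample(X_a, y_a, n_samples=2 * len(y_a), random_state=0)
X_over, y_over = np.vstack([X, X_a_extra]), np.concatenate([y, y_a_extra])
clf_over = LogisticRegression().fit(X_over, y_over)
```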

There is a danger of overfitting with this method, so make sure to regularize and cross-validate.