How to detect noisy entries in the data set

Question

I have a data set whose entries are described by a list of features X1-X7. The data set contains a small percentage of noise. How can I detect the entries that are affected by noise and exclude them from the data set? Do I need to perform clustering? In that case, all the noisy entries might be grouped into a particular class.

This is just a small subset of my data set.

    X1          X2          X3          X4          X5          X6          Target
1   0.0000000   25.0063992  32.7777974  38.340927   42.7852062  59.369447   1
2   11.6182573  5.6806581   31.5693486  38.362217   43.4083127  60.619126   1
3   0.6963788   35.8701659  45.7370047  61.791145   65.9419381  70.754426   1
4   6.9001030   34.2983548  44.8918688  64.729515   80.6112268  94.598520   0
5   2.2026432   22.1041173  28.4673286  33.847659   39.7518905  43.536192   0

Without knowing how you want to use the data, we can't answer this question. Suppose you wanted to develop a model that estimates X1 using X2-X7 as independent variables. Once you had such a model, we could look for outliers among the residuals using various metrics (Cook's D, hat-value leverage, studentized residuals). Using such metrics, we could identify data points with too much influence on the coefficients and rerun the model without them. But I don't know what you want to achieve with this data. You can't identify the noise before specifying the model. – Sympa, Jan 13 '15 at 23:50
@Gaetan Lion: Thanks for your comment. I am developing a classification model (logistic regression). I added a target (class) column to the data in my post. The problem is that the entries labeled class "1" contain some noise, meaning that some of them do not really belong to "1". My question is how to recognize such entries. Do I need to apply an unsupervised outlier detection algorithm before starting classification? – Klausos Klausos, Jan 14 '15 at 15:52
Well, that's what Anony-Mousse suggests. I am not familiar with that technique. The first question I would raise is whether it will work with your logistic regression framework. – Sympa, Jan 15 '15 at 04:41
I have found this. It is a very useful resource: https://sci2s.ugr.es/noisydata#Introduction%20to%20Noise%20in%20Data%20Mining – Abdulkarim Kanaan, Jun 10 '21 at 01:13

score 1 · Accepted Answer · answered Jan 14 '15 at 09:17

How about using unsupervised outlier detection algorithms such as LOF (Local Outlier Factor) or LoOP (Local Outlier Probabilities)?

AFAICT they are meant to detect noise in your data.
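
To make the suggestion concrete, here is a minimal pure-Python sketch of the LOF idea, not a reference implementation. The toy 2-D points, `k=3`, and the 1.5 cutoff are assumptions chosen for illustration, and ties at the k-th neighbour are simply truncated. A score near 1 means a point sits in a region about as dense as its neighbours' regions; a much larger score marks a candidate outlier.

```python
import math


def k_nearest(points, i, k):
    """Return (distance, index) pairs for the k nearest neighbours of points[i]."""
    dists = sorted(
        (math.dist(points[i], points[j]), j)
        for j in range(len(points))
        if j != i
    )
    return dists[:k]  # simplification: ties at the k-th distance are cut off


def lof_scores(points, k=3):
    """Compute a Local Outlier Factor score for every point."""
    n = len(points)
    neigh = [k_nearest(points, i, k) for i in range(n)]
    k_dist = [nb[-1][0] for nb in neigh]  # distance to the k-th neighbour

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], d) for d, j in neigh[i]]
        mean_reach = sum(reach) / len(reach)
        lrd.append(1.0 / (mean_reach + 1e-12))  # epsilon guards duplicates

    # LOF: average density of the neighbours relative to the point's own density.
    scores = []
    for i in range(n):
        ratios = [lrd[j] / lrd[i] for _, j in neigh[i]]
        scores.append(sum(ratios) / len(ratios))
    return scores


# Hypothetical toy data: five points in a tight cluster and one far away.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (10, 10)]
scores = lof_scores(points, k=3)

threshold = 1.5  # assumed cutoff; tune it for real data
flagged = [i for i, s in enumerate(scores) if s > threshold]
print(flagged)
```

For real data you would probably scale the features first and use a library implementation such as scikit-learn's `LocalOutlierFactor` rather than this sketch. Since the noise in the question is in class "1", one option is to score only the rows of that class.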

Has QUIT--Anony-Mousse

What is 'LoOP'? Google returns search results for a normal loop. Please help! – Anonymous Person Apr 08 '19 at 08:17
Try "loop outlier" then, to narrow down the search... – Has QUIT--Anony-Mousse Apr 08 '19 at 20:58
