
Consider a dataset of n data points x_i in m-dimensional space. Each x_i has a label y_i giving its class. Say there are 5 classes, 1, 2, 3, 4, 5, with a total order among them (i.e. 1<2<3<4<5).

I want to analyse how sensitive the algorithm is to noise in the dataset: I will add progressively more noise to the dataset and check how well the classifier performs when trained on the noisy data.

The question: What is the proper way of adding (generating) the noise?

My guess is that I need to normalize the values and then add noise drawn from a Gaussian distribution, but I am not sure of the proper way to do it. I don't want to make any statistical or mathematical mistake.

Marek

1 Answer


A straightforward way would be to flip some of the class labels. You can specify the proportion of labels to be changed, and the probabilities of different flips (since you have ordering on the labels, you might want to say a flip 1->2 is more likely than 1->5).
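As a rough illustration of the label-flipping idea, here is a minimal sketch using only the Python standard library. The function name `flip_labels`, the `flip_fraction` and `decay` parameters, and the exponential weighting by class distance are all assumptions made for this example; any weighting that favours nearby classes would serve the same purpose.

```python
import math
import random


def flip_labels(labels, classes, flip_fraction, decay=1.0, seed=0):
    """Return a copy of `labels` with a fraction of entries flipped.

    A flipped label y is replaced by a different class c, chosen with
    probability proportional to exp(-decay * |c - y|), so that a flip
    1->2 is more likely than 1->5 (assumed weighting, for illustration).
    """
    rng = random.Random(seed)
    noisy = list(labels)
    n_flip = round(flip_fraction * len(noisy))
    # Pick which positions to corrupt, without replacement.
    for i in rng.sample(range(len(noisy)), n_flip):
        y = noisy[i]
        targets = [c for c in classes if c != y]
        weights = [math.exp(-decay * abs(c - y)) for c in targets]
        noisy[i] = rng.choices(targets, weights=weights, k=1)[0]
    return noisy


labels = [1, 2, 3, 4, 5] * 20
noisy_labels = flip_labels(labels, classes=[1, 2, 3, 4, 5], flip_fraction=0.2)
```

Increasing `flip_fraction` step by step and retraining on `noisy_labels` each time gives a sensitivity curve for label noise. A larger `decay` concentrates flips on adjacent classes, while `decay=0` makes all wrong classes equally likely.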

If you want to reduce the predictiveness of the features, you can multiply a feature vector element-wise with a vector of random variables drawn e.g. from the normal distribution $\mathcal{N}(0, \sigma^{2})$, choosing $\sigma^{2}$ according to how large you want the noise to be.
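The feature perturbation could be sketched as follows, again with only the standard library. The names `standardize` and `perturb_features` and the `mode` switch are hypothetical. The `"multiply"` mode follows the description above; the `"add"` mode is a different technique, namely the additive Gaussian noise on normalized values that the question guesses at, included only for comparison.

```python
import random
import statistics


def standardize(X):
    """Scale each column to zero mean and unit variance (population std)."""
    cols = list(zip(*X))
    means = [statistics.fmean(c) for c in cols]
    # Fall back to 1.0 for constant columns to avoid division by zero.
    stds = [statistics.pstdev(c) or 1.0 for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, means, stds)] for row in X]


def perturb_features(X, sigma, mode="multiply", seed=0):
    """Perturb every feature value with noise drawn from N(0, sigma^2).

    mode="multiply": value * noise, as described in the answer.
    mode="add":      value + noise, the additive variant (assumption).
    """
    rng = random.Random(seed)
    if mode == "multiply":
        return [[v * rng.gauss(0.0, sigma) for v in row] for row in X]
    if mode == "add":
        return [[v + rng.gauss(0.0, sigma) for v in row] for row in X]
    raise ValueError("mode must be 'multiply' or 'add'")


X = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
X_std = standardize(X)
X_mult = perturb_features(X_std, sigma=0.5, mode="multiply")
X_add = perturb_features(X_std, sigma=0.5, mode="add")
```

Standardizing first makes a given `sigma` mean the same thing for every feature, regardless of its original scale. Sweeping `sigma` upward and retraining at each step then measures sensitivity to feature noise.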

Alexis
Ando Saabas
    I'd only add, for future readers, that the "Gaussian" and "normal" distributions referenced above are both shorthand for the "standard normal" distribution. – Eduard Gelman Apr 17 '18 at 22:20