0

Assume we have a dataset of 10 features (a combination of continuous and categorical features). I wish to add noise to each feature separately. Can I use the mean and SD of that particular feature to model my noise (Gaussian noise), or is it okay to choose generic values (mean = 0, SD = 1)?

Does adding noise to categorical features work, or should I go with mode imputation? Is adding noise to each feature separately valid?

When we add noise to the entire dataset in one go, is the Gaussian noise's mean and SD computed with respect to the entire dataset as a whole, or is it inherently calculated separately for each feature?

Thanks in advance.

  • How much noise do you want to add to your continuous feature, and for what reason? You might even want your noise to vary, depending on the observation (e.g., larger variance for larger values). // You certainly can add noise to categorical features by randomly switching the category. Again, for what reason do you want to add noise? // You've tagged this with [tag:smote], which is a technique that [statisticians tend to discourage.](https://stats.stackexchange.com/questions/357466/are-unbalanced-datasets-problematic-and-how-does-oversampling-purport-to-he) Why? – Dave Sep 23 '21 at 12:45
  • Thanks for the response. Reason: the primary reason I want to introduce noise is to add more data. What I would like to know is how much noise, and what type of noise (Gaussian or another distribution), would be the best fit. – Suriya Kumar J S Sep 23 '21 at 14:11
  • Adding noise does not create more data. – Tim Sep 23 '21 at 15:38

2 Answers

0

I understand that you want to add noise to the data as a kind of data augmentation technique.

For the continuous features I would go with Gaussian noise with mean = 0. The SD depends on how much noise you would like to add; it's best to experiment with several values of SD.
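A minimal sketch of this idea (assuming NumPy and a 2-D array `X` of continuous features; the noise fractions below are arbitrary choices to experiment with):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=5.0, scale=2.0, size=(100, 3))  # toy continuous features

# Zero-mean Gaussian noise; try several SDs, here expressed as
# fractions of each feature's own standard deviation.
for frac in (0.01, 0.05, 0.1):
    sd = frac * X.std(axis=0)                       # per-feature noise scale
    X_noisy = X + rng.normal(0.0, sd, size=X.shape)
```

Scaling the SD per feature keeps the perturbation proportional to each feature's spread, which tends to matter when the features live on very different scales.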

For the categorical features, it depends whether they are ordinal or nominal.

If the features are ordinal, you can treat them similarly to the continuous features: first establish a correspondence between the ordinal levels and real numbers (e.g. Excellent = 2, Good = 1, OK = 0, Bad = -1, Awful = -2), then draw a noise value from a Gaussian distribution for each instance of the ordinal feature, add it to the encoded value, and round the result. For example, if the feature value was "Good", which corresponds to 1, and you add a noise value of 0.76, you get 1.76, which rounds to 2 = Excellent.
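As an illustrative sketch of the ordinal scheme (the label-to-integer mapping matches the example above; the SD is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(42)

levels = ["Awful", "Bad", "OK", "Good", "Excellent"]  # ordered worst -> best
offset = 2                                            # so Awful=-2 .. Excellent=2

def perturb_ordinal(values, sd=0.8):
    """Encode labels as integers, add Gaussian noise, round, clip, decode."""
    codes = np.array([levels.index(v) - offset for v in values], dtype=float)
    noisy = np.rint(codes + rng.normal(0.0, sd, size=codes.shape))
    noisy = np.clip(noisy, -offset, len(levels) - 1 - offset).astype(int)
    return [levels[c + offset] for c in noisy]
```

The clipping step keeps perturbed values inside the scale, so "Excellent" plus positive noise stays "Excellent" rather than falling off the end of the mapping.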

If the features are nominal, you can do the following: draw a noise value from a Gaussian distribution. If it falls inside some fixed interval around 0, keep the feature as it is; otherwise, draw from a uniform distribution over the categories to determine a new value for this feature instance.
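A sketch of the nominal scheme (the keep-band width and the category list are placeholders, not prescriptions):

```python
import numpy as np

rng = np.random.default_rng(7)
categories = ["red", "green", "blue"]           # example nominal feature values

def perturb_nominal(values, keep_band=1.0):
    """Keep a value when the Gaussian draw lands in [-keep_band, keep_band];
    otherwise replace it with a uniformly chosen category."""
    out = []
    for v in values:
        if abs(rng.normal()) <= keep_band:
            out.append(v)                       # noise small: keep original
        else:
            out.append(categories[rng.integers(len(categories))])
    return out
```

Note that with a standard normal draw and `keep_band=1.0`, roughly 32% of entries get resampled; widening the band lowers the perturbation rate.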

Again, as I said in the beginning, adding noise involves several parameters. It's best to experiment with various values of the parameters to decide which values give the best results. And it might be the case that this data augmentation technique will not prove itself useful at all for this case.

Nir H.
    The fundamental issue you raise concerns the sense in which "adding noise" could be considered a "data augmentation technique." Could you explain what you mean by "augmentation"? It sounds like you are proposing somehow to create *more* or *better* data by adding noise, but--as pointed out in a comment to the question--that's impossible. – whuber Sep 23 '21 at 15:49
  • Data augmentation is a standard procedure in machine learning (cf. the Wikipedia article). By creating certain perturbations in the data, the set of samples is enriched. Of course, it is not at all equivalent to collecting more data, but often that is impossible or difficult to achieve. Data augmentation is not always effective. For example, it can be used successfully to mitigate overfitting, but if the samples are not representative of the actual data source, data augmentation will be of no use. – Nir H. Sep 23 '21 at 15:59
  • Thank you for that explanation. In the ML sense data augmentation is not a single procedure. It is a set of techniques to apply transformations, appropriate for some particular application, to convert one dataset into another one that is equally likely to arise. It is difficult to see how any *generic* form of "adding noise" could be considered such a procedure. In particular cases, involving specific probability models, some form of careful "noise injection" could be made to work--but that would require a case-by-case demonstration. – whuber Sep 23 '21 at 19:51
0

Adding noise to categorical data can make sense, depending on what you are trying to achieve. For example, in this paper from Google, the authors add noise to the input in order to prevent overfitting and to develop a method for detecting "out of distribution" samples. In the paper they add Gaussian noise to the continuous variables, where the noise amplitude, $\sigma$, is a hyperparameter. For binary and categorical features, they simply flip the value to a different class with a certain probability.
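That recipe could be sketched roughly as follows (`sigma` and the flip probability `p` are hyperparameters here, not the paper's exact values; this simple version may also redraw the original class when flipping):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x_cont, x_cat, n_classes, sigma=0.1, p=0.1):
    """Gaussian noise on continuous features; random class flips on categoricals."""
    x_cont = x_cont + rng.normal(0.0, sigma, size=x_cont.shape)
    flip = rng.random(x_cat.shape) < p                  # which entries to flip
    new_classes = rng.integers(0, n_classes, size=x_cat.shape)
    x_cat = np.where(flip, new_classes, x_cat)
    return x_cont, x_cat
```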

ofer-a