
There's a tag here called "data-transformation" described as

Mathematical re-expression, often nonlinear, of data values. Data are often transformed either to meet the assumptions of a statistical model or to make the results of an analysis more interpretable.

Could someone run off a list of data transformation techniques, noting which are linear and which are non-linear, so I can confirm what I think data-transformation means? Is it only about re-expressing raw data in an alternative form? (But wouldn't that just be manipulating what exists in reality, thereby creating synthetic (false) data? A philosophical explanation to complement the list could help.)

I'm especially interested in the methods used in machine learning whose purpose is to improve a model's out-of-sample performance.

I leave the question openly general to get as many leads as possible, but a brief description of where each technique is commonly applied would be helpful too.

develarist
    The question is too broad. Any question that asks for a list of techniques usually is. A log transform is a "data transformation" as is a square root or a cubic root or... – AdamO Jul 19 '20 at 13:02
  • Now that I get a sense of what this tag is referring to based on the answers so far, I will probably follow this up with a more focused question. How should I ask that basic transformations be excluded, given that the tag itself brings to mind the most basic numerical transformations? – develarist Jul 19 '20 at 13:21
    We think of data transformations as stated in the quotation: *re-expressions.* Thus, the underlying observations do not change, but only *how we represent them numerically* is modified. The concept is similar to you reporting your temperature as 37 degrees C and another reporting it as 98.6 degrees F: they are the same temperature, differently expressed. Or, for a nonlinear example, a chemist might report a concentration as -1 (log 10) and a non-chemist will report the same thing as 0.1. For one example of how this works (out of many), see https://stats.stackexchange.com/a/62147/919. – whuber Jul 19 '20 at 20:27

1 Answer


The most common types of data transformation are:

  1. natural log transformation
  2. base-10 log transformation
  3. squared transformation (second order polynomial)
  4. cubic transformation (third order polynomial)
  5. square root transformation
  6. cubic root transformation
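The six transformations above can be sketched in plain Python on a small toy sample (the values here are made up for illustration and are assumed to be strictly positive, so the logarithms are defined):

```python
import math

x = [1.0, 4.0, 9.0, 16.0]  # toy positive data

log_e   = [math.log(v) for v in x]     # 1. natural log
log_10  = [math.log10(v) for v in x]   # 2. base-10 log
squared = [v ** 2 for v in x]          # 3. squared (2nd-order polynomial term)
cubed   = [v ** 3 for v in x]          # 4. cubic (3rd-order polynomial term)
sqrt_x  = [math.sqrt(v) for v in x]    # 5. square root
cbrt_x  = [v ** (1 / 3) for v in x]    # 6. cube root
```

Note that the log, root, and polynomial transforms are all nonlinear: they change the spacing between values, not just their units.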

Among the most important reasons, we may cite:

a) strategy to produce a more normal (less skewed) distribution (note: normality of the raw data is not required for regression; the usual distributional assumption concerns the residuals, not the variables themselves)

b) strategy to re-scale values

c) strategy to include polynomial terms so as to improve a given model.

Caveat: some transformations require checking the data beforehand for zero values (e.g. the logarithm, which is undefined at zero) or negative values (e.g. the square root and the logarithm, which are undefined there).
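Two common workarounds for that caveat can be sketched as follows; both are modeling choices rather than rules, and the shift-then-log(1+x) trick in particular changes the interpretation of the transformed values:

```python
import math

x = [0.0, 3.0, -4.0]  # toy data containing a zero and a negative value

# log fails on 0 and negatives: one workaround is log(1 + x) after
# shifting the data so its minimum becomes 0.
shift = -min(x)  # 4.0 here
log_shifted = [math.log1p(v + shift) for v in x]

# square root fails on negatives; a signed cube root preserves the sign
# and is defined for every real number.
signed_cbrt = [math.copysign(abs(v) ** (1 / 3), v) for v in x]
```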

With regard to Machine Learning, the most common transformations are:

a) normalization (z-score)

b) median normalization (the z-score formula computed with the difference from the median instead of the mean)

c) min-max scaling (rescale each value to a chosen range, typically [0, 1], using the observed minimum and maximum as the limits)

The main reason is to provide a similar scale to different features.
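The three scalings above can be sketched with the standard library alone (in practice scikit-learn's StandardScaler, RobustScaler, and MinMaxScaler do the same job; the data here are made up, and dividing the median-centred variant by the standard deviation is one choice among several possible scale estimates):

```python
import statistics

x = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # toy feature values

# a) z-score: subtract the mean, divide by the standard deviation
mu, sd = statistics.mean(x), statistics.pstdev(x)
z = [(v - mu) / sd for v in x]

# b) median-centred variant: the median replaces the mean
med = statistics.median(x)
z_med = [(v - med) / sd for v in x]

# c) min-max: rescale to [0, 1]
lo, hi = min(x), max(x)
minmax = [(v - lo) / (hi - lo) for v in x]
```

After any of these, every feature lives on a comparable scale, which is what distance-based and gradient-based learners need.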

Hopefully that helps.

Marcos
  • It would help if you said what “median normalization” and “min-max“ are. Perhaps edit those equations into your answer. – Dave Jul 19 '20 at 14:10
    "This most common types of data transformation are:..." [ [citation needed](https://en.wikipedia.org/wiki/Citation_needed) ] – Alexis Jul 19 '20 at 16:05
  • With median normalization, we use the difference from the median (instead of the mean) in the z-score. – Marcos Jul 22 '20 at 12:21
  • With min-max, we set the variable to have a distribution within the selected minimum and maximum values. – Marcos Jul 22 '20 at 12:22