
When I am given a variable, I usually decide whether to take its logarithm based on gut feeling. Mostly I base it on its distribution - if it has a long tail (like salaries, GDP, ...) I use logarithms.

However, when I need to preprocess a large number of variables, I use ad hoc techniques. With some tweaking I can arrive at the "desired" results, but without good justification for the choices.

Is there a common or widely accepted way to decide whether to scale a (single) variable with log (or, say, square root)?

Of course, for more refined techniques the scaling should be related to the method used, the meaning of particular parameters, or their relations. But, e.g., for deciding whether to use a log scale in a plot, the distribution of a single variable should suffice.

Requirements:

  • It should be relatively method-agnostic (I can do further rescaling, if needed).
  • It should be based only on the distribution of values (not, e.g., the semantics of the data).
  • It should be a sensible rule for choosing scales in plots.


Piotr Migdal

2 Answers


As a rule of thumb, try to make the data fit a (standard) normal distribution, a uniform distribution or any other distribution where the values are more or less “evenly” distributed.

As a measurement, one thing that you could aim for is to maximize the distribution’s entropy for a fixed variance; since the normal distribution has the maximum entropy among all distributions with a given variance, this criterion again points toward making the data approximately normal.

So, if your data is approximately log-normally distributed, taking its logarithm would probably be a good idea, since afterwards it would be approximately normally distributed.
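For illustration, a minimal sketch in Python of checking whether a log transform makes a sample look more normal (assumes NumPy/SciPy; the data here is a synthetic log-normal sample, and the choice of D’Agostino’s normality test is mine):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = rng.lognormal(mean=0.0, sigma=1.0, size=10_000)  # synthetic long-tailed sample

# D'Agostino's K^2 normality test: a lower statistic means "closer to normal"
stat_raw, _ = stats.normaltest(x)
stat_log, _ = stats.normaltest(np.log(x))

print(f"normality statistic  raw: {stat_raw:.1f}   log: {stat_log:.1f}")
# For log-normal data the log-transformed version scores far lower,
# i.e. it looks much more like a normal sample.
```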

Another way to determine how to preprocess the data would be to transform it to a scale on which an additive perturbation of a certain size is equally significant regardless of the value being perturbed. For example, if a 5% raise in salary can be said to be equally significant no matter how much money you earn, you should probably take the logarithm of the data, since that would make an additive perturbation equally significant for all values.
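To make the salary example concrete: a 5% raise multiplies a salary by a factor of 1.05, and on a log scale that factor becomes the same additive shift for every salary level: $$\log(1.05\,x) - \log(x) = \log(1.05) \approx 0.049,$$ which is independent of $x$.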

HelloGoodbye
  • `As a rule of thumb, try to make the data fit a (standard) normal distribution, a uniform distribution or any other distribution where the values are more or less “evenly” distributed`. Is this rule of thumb applicable for algorithms? – spectre Dec 09 '21 at 16:48
  • @spectre I don’t understand your question. What do you mean by “applicable for algorithms”? – HelloGoodbye Dec 10 '21 at 17:15
  • You mention in your answer trying to make the data fit a normal distribution. Is this applicable to all algorithms? If my data is right-skewed and I am using a RandomForest algorithm, should I then transform all my variables to a normal distribution? (RandomForest does not require the data to be normally distributed) – spectre Dec 11 '21 at 05:08
  • @spectre While it was a while ago, I may have gone with the requirement "it should be a sensible rule for choosing scales in plots" given by OP. No algorithm I know of requires the data to be normally distributed, but some algorithms will work better if it is. I know too little about random forests to be able to say whether that is also true for that algorithm. But **if** the only thing that matters is which order the numbers come in when sorted (along different axes) and not their actual values, making the data normally distributed will not make a difference and will therefore be unnecessary. – HelloGoodbye Dec 11 '21 at 05:37
  • @spectre On a related note, it is usually a good thing if the different predictors (input variables) are as little correlated as possible, i.e., that their correlation is very close to 0. I'm quite sure this is desirable even when using random forests, so if your predictors are heavily correlated, it might be a good idea to transform them to new variables that are not correlated. The process of transforming the predictors in this way (along with normalizing them such that they get variance 1) is known as [whitening](https://en.wikipedia.org/wiki/Whitening_transformation); see the sketch after these comments. – HelloGoodbye Jan 04 '22 at 12:51
  • AFAIK 0 correlation between features is applicable only for linear algorithms (think linear regression, logistic regression)! But I may be wrong. Is there any pythonic implementation of whitening? – spectre Jan 05 '22 at 05:51
  • @spectre Why would that be the case? You can definitely still apply whitening and perform training afterwards even with a non-linear algorithm, but I can’t guarantee that it will never impede the training. In most cases I would think that it speeds up the training slightly, though. – HelloGoodbye Jan 06 '22 at 23:52
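On the Python question above: one readily available option is scikit-learn's `PCA` with `whiten=True` (a minimal sketch under the assumption that PCA whitening is acceptable; ZCA whitening is another common choice):

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Toy data: two heavily correlated predictors (the second is a noisy copy of the first)
a = rng.normal(size=(1000, 1))
X = np.hstack([a, a + 0.1 * rng.normal(size=(1000, 1))])

# PCA with whiten=True rotates the predictors so they are uncorrelated
# and rescales each component to unit variance
X_white = PCA(whiten=True).fit_transform(X)

print(np.round(np.cov(X_white, rowvar=False), 3))  # approximately the identity matrix
```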

(So to be kosher and not mix the question with an answer.)

Right now I am using the scale that minimizes the following ratio: $$\frac{\sqrt[4]{\langle (x - \bar{x})^4 \rangle}}{\sqrt{\langle (x - \bar{x})^2 \rangle}}$$ That is, after normalizing a variable (i.e. to mean 0 and variance 1) I am looking to have the 4th moment as low as possible (so as to penalize too long-tailed, or otherwise dispersed, distributions).
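A minimal sketch of this criterion in Python (the helper names and the candidate set of transforms - identity, square root, log - are my own choices for illustration):

```python
import numpy as np

def tail_ratio(x):
    """The ratio above: 4th root of the 4th central moment over the standard deviation."""
    d = np.asarray(x, dtype=float)
    d = d - d.mean()
    return (d ** 4).mean() ** 0.25 / np.sqrt((d ** 2).mean())

def choose_scale(x):
    """Return the candidate transform with the smallest ratio (sqrt/log only if x > 0)."""
    x = np.asarray(x, dtype=float)
    candidates = {"identity": x}
    if np.all(x > 0):
        candidates["sqrt"] = np.sqrt(x)
        candidates["log"] = np.log(x)
    scores = {name: tail_ratio(values) for name, values in candidates.items()}
    return min(scores, key=scores.get), scores

# Example: a long-tailed (log-normal) sample should end up preferring the log scale
rng = np.random.default_rng(0)
x = rng.lognormal(sigma=1.5, size=10_000)
print(choose_scale(x))
```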

For me it works (but I am not sure whether that is just my particular use of it, or whether there are easy pitfalls).

Piotr Migdal
  • 4
    This is clever but highly non-robust: the fourth moment is extremely sensitive to outliers. – whuber Jul 22 '15 at 15:54
  • @whuber In some sense I want to be sensitive to outliers (even a single outlier can wreck the plotting scale). But you are right that it may result e.g. in choosing a logarithmic scale based on a single erroneous datapoint. – Piotr Migdal Jul 22 '15 at 16:52
  • 2
    Seems to me a lot quicker and simpler just to look at minimum and maximum, which will deal with outliers too. Rule 1. (sorry if it seems too obvious) If the minimum is not positive, plain logarithms are inapplicable and probably don't make sense any way. (Side comment: some people are happy with adding a constant first to make logarithms defined.) Rule 2. The benefit of using logarithms increases with max/min. I don't seek cutoffs for Rule 2, as .e.g. in a scatter plot, the benefits of using a transformation will depend on the other variable too. – Nick Cox Apr 15 '16 at 10:05
  • Why the 4th moment? Intuitively, it seems smart, but how did you come up with the method? Can you show that it is optimal in some way? – HelloGoodbye Sep 17 '18 at 04:10
  • Minimizing the standardized 4th moment, a.k.a. the kurtosis, is in fact not optimal, since it results in a PDF with [two very sharp peaks](https://en.wikipedia.org/wiki/Kurtosis#Moors'_interpretation). See my answer for a (probably) better measure to optimize. – HelloGoodbye Sep 17 '18 at 05:26
  • @HelloGoodbye It is a rule of thumb, nothing sophisticated. At the same time - for many distributions no simple rescaling, linear or not, makes sense. In that case one needs to dig deeper - e.g. using some latent variables (e.g. Item Response Theory) or Bayesian models. – Piotr Migdal Sep 17 '18 at 20:05
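A minimal sketch of the min/max heuristic from Nick Cox's comment above (the cutoff on max/min is my own placeholder, since the comment deliberately avoids fixing one):

```python
import numpy as np

def suggest_log_scale(x, ratio_threshold=100.0):
    """Rule 1: a log scale needs strictly positive values.
    Rule 2: the benefit grows with max/min; the threshold here is arbitrary."""
    x = np.asarray(x, dtype=float)
    if x.min() <= 0:
        return False
    return x.max() / x.min() > ratio_threshold

print(suggest_log_scale([3, 5, 8, 13]))           # False: values span a small range
print(suggest_log_scale([0.2, 4, 300, 90_000]))   # True: values span many orders of magnitude
```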