Why log-transform to normal distribution for decision trees?

Question

On page 304 of chapter 8 of An Introduction to Statistical Learning with Applications in R (James et al.), the authors say:

We use the Hitters data set to predict a baseball player’s Salary based on Years (the number of years that he has played in the major leagues) and Hits (the number of hits that he made in the previous year). We first remove observations that are missing Salary values, and log-transform Salary so that its distribution has more of a typical bell-shape. (Recall that Salary is measured in thousands of dollars.)

No additional motivation for the log-transform is given. Since the data are being fed into a decision tree algorithm, why was it important to force the data into a normal distribution? I thought most, if not all, decision tree algorithms were invariant to scale changes.

Sycorax · Accepted Answer · 2019-01-02T16:59:21.693

In this case, the salary is the target (dependent variable/outcome) of the decision tree, not one of the features (independent variables/predictors). You are correct that decision trees are insensitive to the scale of the predictors. The target is a different matter: I suspect there are a small number of extremely large salaries, so transforming the salaries might improve predictions, because a fit that minimizes squared error will not be so strongly influenced by these large values.
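As a rough illustration of that last point, here is a minimal sketch. It relies on the fact that, under squared-error loss, the best constant prediction for a set of targets (for example, the observations in one leaf of a regression tree) is their mean. The salary figures are hypothetical and are not taken from the Hitters data.

```python
import math

def sse_optimal_prediction(values):
    """The constant that minimizes squared error over a set of targets is their mean."""
    return sum(values) / len(values)

# Hypothetical salaries in thousands of dollars; the last one is an extreme value.
salaries = [100.0, 150.0, 200.0, 250.0, 5000.0]

# Fit on the raw scale: the single large salary pulls the mean upward.
raw_pred = sse_optimal_prediction(salaries)

# Fit on the log scale, then back-transform (this is the geometric mean).
log_pred = math.exp(sse_optimal_prediction([math.log(s) for s in salaries]))

print(f"prediction fitted on raw salaries: {raw_pred:.1f}")
print(f"prediction fitted on log salaries, back-transformed: {log_pred:.1f}")
```

On the raw scale, the prediction is larger than four of the five salaries. On the log scale, it stays much closer to the bulk of the data.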

Dave Harris · Answer 2 · 2019-01-02T20:04:21.963

I downloaded last year's salaries. It is very likely they follow a Pareto distribution. The histogram is shown below.

The pdf of the Pareto distribution is $$\frac{\alpha{x_m}^\alpha}{x^{\alpha+1}}.$$

The scale parameter $x_m$ is \$545,000, the lowest salary last year. I estimated the shape parameter, $\alpha$, as 0.7848238 using MLE. This matters because when $\alpha<2,$ then the distribution has no variance. More properly, its variance is undefined. If any of your variables lacks a mean or a variance, then you cannot use anything that minimizes squared loss.
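The answer does not say how the maximum likelihood estimate was computed. One common route, sketched here as an assumption, takes the sample minimum as the scale and uses the closed form $\hat\alpha = n / \sum_i \log(x_i / x_m)$. The data below are synthetic draws from a Pareto distribution with assumed parameters, not the actual salary data.

```python
import math
import random

def pareto_shape_mle(values):
    """Closed-form MLE of the Pareto shape, taking the sample minimum as the scale x_m."""
    x_m = min(values)
    log_ratios = sum(math.log(v / x_m) for v in values)
    return len(values) / log_ratios

# Synthetic data only: inverse-transform sampling from a Pareto with assumed parameters.
rng = random.Random(0)
true_alpha, x_m = 0.78, 545_000.0
sample = [x_m * (1.0 - rng.random()) ** (-1.0 / true_alpha) for _ in range(20_000)]

alpha_hat = pareto_shape_mle(sample)
print(f"estimated shape: {alpha_hat:.3f}")
```

With a sample this large, the estimate should land close to the assumed shape. A real analysis would also check whether the Pareto model fits the data at all.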

The distribution of the log of the variable does have a variance, so you can use least-squares-style methodologies on it. This is actually a serious omission from your textbook. Some things, like stock market returns, which have neither a mean nor a variance, or baseball salaries, which lack a variance, will make OLS models meaningless. The log is not inherently the best treatment, but it does work.

Taking the log does not give you a bell shape.

This is entirely about being certain that all of your data has a variance. If all of the assumptions for OLS are met, then the underlying distributions do not matter. They can be insane looking, but variance has to be defined everywhere.

EDIT: As Therkel pointed out in the comments, when $\alpha<1$ no mean exists either. There is also a comment by Cliff AB that I should take up. He argues that the distribution is doubly bounded and so a finite variance and mean exist. As an economist, I would disagree with that. It is true that there is only so much wealth in the world, but it is also true that we have no idea what it is. Furthermore, that wealth is changing every second of every day as people make individual choices.

The worker who does not pick that one apple reduces wealth if that apple is never picked and reduces available wealth regardless. An apple on a tree has no income value until it is picked and processed. This makes the right-hand side constraint stochastic. For the purposes of baseball, the stochastic effect should be considered to be zero.

Baseball, as a percentage of world output, is so minuscule that you could ignore it. The same is true for American football, North American hockey, or, for that matter, live stage theater for the whole United States.

The fact that you can model this data with a Pareto distribution means you have no mean or variance if the estimates are valid. If you take the log, you end up with finite variance. If you divide the data by its minimum value and take the logs, you end up with the exponential distribution, which is well enough behaved, but then you get interpretation problems.
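The exponential claim can be checked directly from the Pareto survival function $P(X > x) = (x_m/x)^\alpha$ for $x \ge x_m$, which follows from the pdf given earlier. As a short sketch, let $Y = \log(X/x_m)$ (the symbol $Y$ is introduced here only for illustration). Then for $y \ge 0$:

$$P(Y > y) = P\left(X > x_m e^{y}\right) = \left(\frac{x_m}{x_m e^{y}}\right)^{\alpha} = e^{-\alpha y}.$$

This is the survival function of an exponential distribution with rate $\alpha$, so

$$E[Y] = \frac{1}{\alpha}, \qquad \operatorname{Var}(Y) = \frac{1}{\alpha^{2}},$$

both finite for every $\alpha > 0$. Since $\log X = \log x_m + Y$, the log of the salary itself has the same finite variance.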

Note that for a shape parameter $\alpha < 1$ the distribution does not even have a mean. — Therkel, Jan 02 '19 at 07:22
"This matters because when α<2, then the distribution has no variance." - No, because we know the variance *is* bounded for salaries, as there is both an upper and lower bound on possible salaries (0 and all the money in the world, for example), regardless of whether the data appears to fit well with a Pareto distribution with unbounded variance. — Cliff AB, Jan 02 '19 at 17:34
@Cliff AB: I'm not sure the boundedness of salaries is really relevant. The mean/variance will be so badly defined that it is, for practical/inferential purposes, better to assume they do not exist. I wrote about that: https://stats.stackexchange.com/questions/94402/what-is-the-difference-between-finite-and-infinite-variance/100161#100161 — kjetil b halvorsen, Jan 02 '19 at 17:51
@CliffAB it isn't bounded on the right. The right side boundary, although it does exist, is stochastic. Not only does no one know the planetary budget constraint, but it is also in constant motion. — Dave Harris, Jan 02 '19 at 19:54
@Therkel thanks for pointing that out. I was working from memory. — Dave Harris, Jan 02 '19 at 19:54
