50

When transforming variables, do you have to use the same transformation for all of them? For example, can I pick and choose differently transformed variables, as in:

Let $x_1, x_2, x_3, x_4$ be age, length of employment, length of residence, and income.

$$Y = \beta_1\sqrt{x_1} + \beta_2\left(-\frac{1}{x_2}\right) + \beta_3\log(x_3)$$

Or, must you be consistent with your transforms and use the same one throughout? As in:

$$Y = \beta_1\log(x_1) + \beta_2\log(x_2) + \beta_3\log(x_3)$$

My understanding is that the goal of transformation is to address the problem of normality. Looking at histograms of each variable, we can see that they present very different distributions, which leads me to believe that the required transformations differ on a variable-by-variable basis.

## R Code
library(foreign)  # provides read.spss()
library(Hmisc)    # adds a hist() method for data frames, used below
df <- read.spss(file="http://www.bertelsen.ca/R/logistic-regression.sav",
                use.value.labels=T, to.data.frame=T)
hist(df[1:7])  # histograms of the first seven variables

[histograms of the first seven variables]

Lastly, how valid is it to transform variables using $\log(x_n + 1)$ when $x_n$ contains $0$ values? Does this transform need to be applied consistently across all variables, or can it be used ad hoc, even for variables that do not contain $0$s?

## R Code 
plot(df[1:7])  # scatterplot matrix of the first seven variables

[scatterplot matrix of the first seven variables]

whuber
Brandon Bertelsen

2 Answers

69

One transforms the dependent variable to achieve approximate symmetry and homoscedasticity of the residuals. Transformations of the independent variables have a different purpose: after all, in this regression all the independent values are taken as fixed, not random, so "normality" is inapplicable. The main objective in these transformations is to achieve linear relationships with the dependent variable (or, really, with its logit). (This objective over-rides auxiliary ones such as reducing excess leverage or achieving a simple interpretation of the coefficients.) These relationships are a property of the data and the phenomena that produced them, so you need the flexibility to choose appropriate re-expressions of each of the variables separately from the others. Specifically, not only is it not a problem to use a log, a root, and a reciprocal, it's rather common. The principle is that there is (usually) nothing special about how the data are originally expressed, so you should let the data suggest re-expressions that lead to effective, accurate, useful, and (if possible) theoretically justified models.

The histograms--which reflect the univariate distributions--often hint at an initial transformation, but are not dispositive. Accompany them with scatterplot matrices so you can examine the relationships among all the variables.
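For instance, here is a minimal sketch of that idea in R, reusing the data frame `df` from the question's code; the column indices and the particular re-expressions below are placeholders rather than recommendations, and they assume the selected columns are numeric:

```r
pairs(df[1:7])                      # relationships on the original scales

df_re      <- df[1:7]
df_re[, 2] <- sqrt(df_re[, 2])      # e.g. try a square root for column 2
df_re[, 4] <- log(df_re[, 4] + 1)   # e.g. a started log for column 4
pairs(df_re)                        # re-examine the relationships
```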


Transformations like $\log(x + c)$ where $c$ is a positive constant "start value" can work--and can be indicated even when no value of $x$ is zero--but sometimes they destroy linear relationships. When this occurs, a good solution is to create two variables. One of them equals $\log(x)$ when $x$ is nonzero and otherwise is anything; it's convenient to let it default to zero. The other, let's call it $z_x$, is an indicator of whether $x$ is zero: it equals 1 when $x = 0$ and is 0 otherwise. These terms contribute a sum

$$\beta \log(x) + \beta_0 z_x$$

to the estimate. When $x \gt 0$, $z_x = 0$ so the second term drops out leaving just $\beta \log(x)$. When $x = 0$, "$\log(x)$" has been set to zero while $z_x = 1$, leaving just the value $\beta_0$. Thus, $\beta_0$ estimates the effect when $x = 0$ and otherwise $\beta$ is the coefficient of $\log(x)$.
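A minimal R sketch of this construction, assuming a hypothetical data frame `d` with a response `y` and a regressor `x` that contains zeros (an ordinary `lm` is used for simplicity; the same two terms can go into a logistic regression with `glm`):

```r
# Hypothetical data: x contains zeros, y is the response.
d <- data.frame(x = c(0, 0, 1, 3, 10, 30),
                y = c(2.1, 1.9, 3.0, 4.2, 5.9, 7.4))

d$log_x <- ifelse(d$x > 0, log(d$x), 0)  # log(x), defaulting to 0 when x == 0
d$z_x   <- as.numeric(d$x == 0)          # indicator: 1 when x == 0, 0 otherwise

# The fit then contains the sum  beta * log_x + beta_0 * z_x  described above.
fit <- lm(y ~ log_x + z_x, data = d)
summary(fit)
```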

whuber
  • Very helpful description, thanks for the direction and the detail on my subquestion as well. – Brandon Bertelsen Nov 23 '10 at 20:57
  • http://pareonline.net/getvn.asp?v=15&n=12 Osborne (2002) recommends anchoring the minimum value in a distribution at exactly 1.0. http://pareonline.net/getvn.asp?v=8&n=6 – Chris Jun 17 '15 at 22:22
  • @Chris Thank you for the interesting references. I think the reasoning is invalid, though. There is no mathematical basis to the assertion that "similar to square root transformations, numbers between 0 and 1 are treated differently than those above 1.0. Thus a distribution to be transformed via this method should be anchored at 1.00." The logarithm doesn't suddenly change its behavior at $1$! For an example of a more nuanced and statistically appropriate approach to this question, see my answer at http://stats.stackexchange.com/a/30749. – whuber Jun 17 '15 at 22:36
  • The authors are pointing out that log(0.99) is negative while log(1.01) is positive... I will read your post. :) – Chris Jun 17 '15 at 22:41
  • @Chris All Box-Cox transformations transition from negative to positive at $1$, too. That's irrelevant for a nonlinear transformation, though, because it can be followed up by any linear transformation without changing its effects on variance or linearity of a relationship with another variable. Thus, if your client is allergic to negative numbers, just add a suitable constant *after* the transformation. Adding the constant *before* the transformation, though, can have a profound effect--and that's why no recommendation always to use $1$ could possibly be right. – whuber Jun 17 '15 at 22:44
  • In one of my datasets that I am working on, I noticed that if I shifted the dependent response variable to anchor at 1 and used a Box-Cox transformation to eliminate the skew, the resulting transformation was weakened, lending credence to your critique. ;) – Chris Jun 17 '15 at 23:07
  • @whuber My previous question was very silly (will probably delete comment). Of course $\beta_0$ pertains to the $z_x$ dummy indicator, and NOT to the constant in the model. Thank you again for the extensive and clear explanations of this setup; very helpful for my work. Overall I prefer this parametrization as opposed to [this other, equivalent approach](http://stats.stackexchange.com/a/6565/36515). – landroni Jul 07 '15 at 18:37
  • @whuber *The main objective in these transformations is to achieve linear relationships with the dependent variable [...]. (This objective over-rides auxiliary ones such as [...] achieving a simple interpretation of the coefficients.)* This is also one of the central messages coming from CAR... While I am starting to understand the need for transforming data, I'm still a bit unclear on how to approach the interpretation of resulting coefficients. With `log` it's simple, as it allows transformation while maintaining relatively straightforward interpretations. – landroni Jul 08 '15 at 08:16
  • @whuber But, to take an example from my real data, what if your variable requires e.g. a Yeo-Johnson transformation with `lambda=-2` (cf `symbox()` and `yjPower()` in `car`) to achieve symmetry and near-normality? How do you approach interpretation of the associated coefficients then? Such transformations seem to limit comments strictly to the sign and significance of the estimated effect... (In CAR Fox and Weisberg advocate for the use of `effects` displays, especially when dealing with models containing transformations and interactions, but I haven't quite gotten the hang of them yet.) – landroni Jul 08 '15 at 08:17
  • @landroni If the data show that (say) a log transform produces a linear relationship between two variables, then that is useful and insightful information. If you were not familiar with the logarithm, you would find that transformation is difficult to interpret, too. But that's a subjective issue--it says nothing about the data. Interpretability is a matter of *familiarity*. That means understanding the mathematical nature of the transformation. It eventually comes with study and experience. – whuber Jul 08 '15 at 12:56
  • Thanks so much for this clear answer! I am implementing this approach in an analysis I'm conducting, but I find I'm seeing very high multicollinearity as a result; the correlations between the zero-indicator variables and the variables indicating the (log transformed) value of that same variable above 0 range from .91 to .99 in my data. Should I just not attempt to interpret the resulting coefficients individually? – Rose Hartman Apr 02 '17 at 02:20
  • @Rose This tends to be a problem with binary variables that are highly skewed (so that one value predominates). The multicollinearity can create large standard errors of the coefficient estimates, making it unreliable to interpret the coefficients. How to deal with that depends on the amount of data you have, the purpose of the regression, theories about the relationships, and much more. – whuber Apr 02 '17 at 17:52
  • @whuber -- how does your "two variables" method with an indicator relate to zero-inflated models? Seems very similar. – abalter Sep 03 '19 at 02:31
  • @abalter They share some features. One difference is that zero-inflated models concern the *conditional response* whereas the models discussed here concern the *regressor variables.* This is an important distinction, because the former is random while the latter is determinate. – whuber Sep 03 '19 at 11:32
1

This is a very old post and my first entry here. It does not really answer the question at hand, but I put the material below together some years ago when I studied econometrics. The layout is arguably not great, but I think it may still be useful for anyone starting to delve into statistics.

There are some rules of thumb for taking logs (do not take them for granted). See, for example, Wooldridge, Introductory Econometrics, p. 46.

  • When a variable is a positive dollar amount, the log is often taken (wages, firm sales, market value, ...).
  • Same for variables such as population, number of employees, school enrollments etc. (Why? - see below).
  • Variables measured in years (education, experience, tenure, age and so on) are usually not transformed (in original form).
  • Percentages (or proportions), like unemployment rates, participation rates, percentage of students passing exams, etc., are seen either way, with a tendency to be used in level form. If you take a regression coefficient involving the original variable (it does not matter whether it is the independent or the dependent variable), you will have a percentage-point change interpretation.

The table below (after Wooldridge) summarizes what happens in regressions under the various transformations:

| Model | Dependent variable | Independent variable | Interpretation of $\beta_1$ |
| --- | --- | --- | --- |
| level-level | $y$ | $x$ | $\Delta y = \beta_1 \Delta x$ |
| level-log | $y$ | $\log(x)$ | $\Delta y = (\beta_1/100)\,\%\Delta x$ |
| log-level | $\log(y)$ | $x$ | $\%\Delta y = 100\,\beta_1\,\Delta x$ |
| log-log | $\log(y)$ | $\log(x)$ | $\%\Delta y = \beta_1\,\%\Delta x$ |
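As a quick illustration of the log-log row, here is a small R sketch on simulated data (the numbers are arbitrary, not from the question's dataset), in which the slope of a log-log regression recovers an elasticity:

```r
set.seed(1)

# Simulate y with a constant elasticity of roughly 1.5 with respect to x.
x <- runif(200, min = 1, max = 100)
y <- exp(0.5 + 1.5 * log(x) + rnorm(200, sd = 0.2))

fit <- lm(log(y) ~ log(x))
coef(fit)["log(x)"]  # close to 1.5: a 1% increase in x goes with about a 1.5% increase in y
```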

Now, apart from the interpretation of the coefficients in regressions (which is in itself useful), the log has various interesting properties. I wrote this up a few years ago and am simply copy-pasting it here (please excuse that I have not changed the formatting or made the charts prettier, etc.).

Why is the natural logarithm such a natural choice?

Gilbert Strang's lecture Growth Rates and Log Graphs supplements what follows, so it is worth watching.
The List of Logarithmic Identities and Why Log Returns are also good references.

There are 6 main reasons why we use the natural logarithm:

1. The log difference approximates percent change
2. The log difference is independent of the direction of change
3. Logarithmic Scales
4. Symmetry
5. Data is more likely normally distributed
6. Data is more likely homoscedastic

Reason 1: The log difference approximates percent change

Why is that? Well, there are several ways to show this.

If you have two values, say Old = 1.0 and New = 1.01:

Property 1: A simple percent calculation shows the change is 1%:

$$\frac{New - Old}{Old} = \frac{New}{Old} - 1 = \frac{1.01}{1.0} - 1 = 0.01$$

(If a numerical computation returns something very slightly different from 0.01, that is not an error in the exact percent calculation but ordinary floating-point rounding; see the Python docs on floating-point arithmetic, "Rounding in Python", and "Is floating point math broken?".)

But how does the log approximation work?

Property 2 (the quotient rule of logarithms, cf. Khan Academy's logarithmic properties):

$$\ln\left(\frac{u}{v}\right) = \ln(u) - \ln(v)$$

This allows you to greatly simplify certain expressions.

Property 3: for small $x$,

$$\ln(1 + x) \approx x$$

Now, combining the established properties, we can rewrite

$$x = \frac{New - Old}{Old} = \frac{New}{Old} - 1$$

using

$$\ln(1 + x) \approx x$$

which gives

$$\ln\left(1 + \frac{New}{Old} - 1\right) = \ln\left(\frac{New}{Old}\right) \approx \frac{New - Old}{Old}$$

which, using the quotient rule $\ln\left(\frac{u}{v}\right) = \ln(u) - \ln(v)$, can be rewritten as

$$\ln(New) - \ln(Old) \approx \frac{New - Old}{Old}$$
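A quick numerical check of this approximation in R (the values are arbitrary):

```r
old <- 100
new <- 103

(new - old) / old    # exact percent change: 0.03
log(new) - log(old)  # log difference: 0.02955880, close to 0.03
```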

Reason 2: The log difference is independent of the direction of change

Another point worth noting is that a move from 1.1 to 1 is roughly a 9.1% decrease, while a move from 1 to 1.1 is a 10% increase. The log difference, $\ln(1.1) - \ln(1.0) \approx 0.0953$, is independent of the direction of change and always lies between 9.1% and 10%. Moreover, if you flip the values in the log difference, all that changes is the sign, not the value itself.
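The same check in R:

```r
log(1.1) - log(1.0)  #  0.09531018 (going up from 1.0 to 1.1)
log(1.0) - log(1.1)  # -0.09531018 (going back down: same size, opposite sign)

(1.1 - 1.0) / 1.0    #  0.10   (a +10% change on the way up)
(1.0 - 1.1) / 1.1    # -0.0909 (about a -9.1% change on the way down)
```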


Reason 3: Logarithmic Scales

A variable that grows at a constant growth rate increases by larger and larger increments over time. Take a variable $x$ that grows over time at a constant growth rate, say at 3% per year.

Now, if we plot $x$ against time using a standard (linear) vertical scale, the plot looks exponential: the increase in $x$ becomes larger and larger over time. Another way of representing the evolution of $x$ is to use a logarithmic scale to measure $x$ on the vertical axis. The property of the logarithmic scale is that the same proportional increase in the variable is represented by the same vertical distance on the scale. Since the growth rate is constant in this example, the plot becomes a perfectly straight line (an R sketch of this comparison follows the list below).

This shows the effect of logarithmic scales on the vertical axis nicely.
The reason is that the distances between 0.1 and 1, 1 and 10, 10 and 100, and so forth are the same on the logarithmic scale.
Reason 4 (Symmetry) explains this in more detail.

In contrast to these examples, economic variables such as GDP do not grow at a constant growth rate every year.

  • Their growth rate may be higher in some decades and lower in others.
  • Yet, when looking at their evolution over time, it is often more informative to use a logarithmic scale than a linear scale.
  • For instance, GDP is several times bigger now than it was 100 years ago. On a linear scale the curve becomes steeper and steeper, and it is very difficult to see whether the economy is growing faster or slower than it was 50 or 100 years ago.
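A minimal R sketch of the linear-versus-log-scale comparison for a series growing at a constant 3% per year (the base value of 100 is arbitrary):

```r
years <- 0:100
x     <- 100 * 1.03^years  # constant 3% growth from a base of 100

op <- par(mfrow = c(1, 2))
plot(years, x, type = "l", main = "Linear scale")          # looks exponential
plot(years, x, type = "l", log = "y", main = "Log scale")  # a straight line
par(op)
```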

Reason 4: Symmetry

A logarithmic transformation reduces positive skewness because it compresses the upper end (tail) of the distribution while stretching out the lower end. The reason is that the distances between 0.1 and 1, 1 and 10, 10 and 100, and 100 and 1000 are the same on the logarithmic scale. You can also see this in the log-scale plots above.

This has another important implication:

  • If you apply a logarithmic transformation to a set of data, the mean (average) of the logs is the log of the geometric mean of the original data, which is approximately the log of the original (arithmetic) mean when the spread is modest, whatever base of logarithm you use.
  • However, only for natural logs is the measure of spread called the standard deviation (SD) of the logged data approximately equal to the coefficient of variation (the ratio of the SD to the mean) on the original scale (see the sketch after this list).
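A small R illustration on simulated right-skewed data (the distribution and its parameters are arbitrary choices):

```r
set.seed(42)
x <- rlnorm(10000, meanlog = 0, sdlog = 0.3)  # right-skewed, modest spread

op <- par(mfrow = c(1, 2))
hist(x,      main = "Original scale")  # positively skewed
hist(log(x), main = "Log scale")       # roughly symmetric
par(op)

mean(log(x)); log(mean(x))  # close, because the spread is modest
sd(log(x))                  # about 0.30
sd(x) / mean(x)             # coefficient of variation, also about 0.30
```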

Reason 5: Data is more likely normally distributed

Let's start with the log-normal distribution.

A variable $x$ has a log-normal distribution if $\log(x)$ is normally distributed. A log-normal distribution results if a random variable is the product of a large number of independent, identically distributed variables; this is demonstrated below. It is analogous to the normal distribution, which results if the variable is the sum of a large number of independent, identically distributed variables.

Log-normal distribution:

$\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable.

Shapiro-Wilk Test for Normality

If the p-value is $\leq 0.05$, you would reject the null hypothesis that the sample came from a normal distribution. To put it loosely, it would be unlikely to see such a sample if it really came from a normal distribution. (The test is available, for example, in SciPy's stats module.)

The following demonstrates that taking products of random samples from a uniform distribution results in an approximately log-normal distribution.

Defining

$$\mu = \ln\left(\frac{m}{\sqrt{1 + \frac{v}{m^{2}}}}\right), \qquad \sigma^{2} = \ln\left(1 + \frac{v}{m^{2}}\right),$$

where $m$ and $v$ are the mean and variance of the (untransformed) variable, gives the parameters of the matching log-normal distribution.

The probability density function of the log-normal distribution is

$$p(x) = \frac{1}{\sigma x \sqrt{2\pi}}\, \exp\left(-\frac{(\ln(x) - \mu)^2}{2\sigma^2}\right),$$

where $\mu$ is the mean and $\sigma$ is the standard deviation of the normally distributed logarithm of the variable, which we just computed above. Given the formula, we can easily calculate and plot the PDF (see the sketch below).
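A compact R version of this demonstration (the number of factors, the Uniform(1, 2) range, and the sample size are arbitrary choices):

```r
set.seed(1)

# Each observation is the product of 50 independent Uniform(1, 2) draws.
n     <- 10000
prods <- replicate(n, prod(runif(50, min = 1, max = 2)))

# By the CLT the logs (sums of 50 iid terms) are approximately normal;
# shapiro.test() accepts at most 5000 values, hence the subset.
shapiro.test(log(prods[1:5000]))

# Recover mu and sigma from the sample mean m and variance v via the
# formulas above, then overlay the implied log-normal density.
m     <- mean(prods)
v     <- var(prods)
mu    <- log(m / sqrt(1 + v / m^2))
sigma <- sqrt(log(1 + v / m^2))

hist(prods, breaks = 100, freq = FALSE, main = "Products of uniform samples")
curve(dlnorm(x, meanlog = mu, sdlog = sigma), add = TRUE, lwd = 2)
```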

Reason 6: Data is more likely homoscedastic

Often, measurements are seen to vary on a percentage basis, say by 10%. In such a case:

  • something with a typical value of 80 might jump around within a range of $\pm 8$, while
  • something with a typical value of 150 might jump around within a range of $\pm 15$.

Even if it's not on an exact percentage basis, groups that tend to have larger values often also tend to have greater within-group variability. A logarithmic transformation frequently makes the within-group variability more similar across groups. If the measurement does vary on a percentage basis, the variability will be constant on the logarithmic scale.

Let's finish by generating a conditional distribution of $y$ given $x$ with a variance $f(x)$. In plain English, we need something where the variability in the data increases as $x$ increases; a sketch follows below.
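A minimal R sketch of this (the variance structure and the coefficients are arbitrary choices):

```r
set.seed(7)

x <- runif(500, min = 1, max = 100)
# Multiplicative (percentage-type) noise, so the spread of y grows with x.
y <- 5 * x * exp(rnorm(500, sd = 0.2))

op <- par(mfrow = c(1, 2))
plot(x, y,      main = "Original scale")  # fan shape: heteroscedastic
plot(x, log(y), main = "log(y)")          # roughly constant spread
par(op)
```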

AKdemy