
Why are Ratios "Dangerous" in Statistical Modelling?

A friend told me today that it is unwise to use the ratio of two variables as a covariate in a regression model, and that it is better to include the same two variables separately in the regression model. However, when I asked why this was, I did not get an answer.

I spent some time trying to read about this and found the following points:

  • In the context of probability, ratios are not always well behaved. For example, the Cauchy distribution is the ratio of two independent standard Normal distributions, and the mean of the Cauchy distribution is undefined.

  • Ratios have a problem of "spurious correlation". For example, if you generate random points from two independent Normal distributions and then take the ratio of each pair, you may find that the ratios show statistical correlation, even though the data came from independent random Normal distributions.

  • Ratios can become very large in magnitude, and run the risk of division by zero.

  • Supposedly, if a plot of one variable against the other is not a diagonal line (at 45 degrees) passing through the origin, the ratio is meaningless (I don't understand why).
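The first and third bullet points can be illustrated with a quick simulation (a sketch of my own in NumPy; the sample sizes are arbitrary). The ratio of two independent standard Normals is standard Cauchy, so its sample mean never settles down and extreme values appear routinely:

```python
import numpy as np

rng = np.random.default_rng(0)

# The ratio of two independent standard Normals follows a standard
# Cauchy distribution, whose mean is undefined: sample means do not
# converge as n grows, and extreme ratios appear routinely.
for n in [100, 10_000, 1_000_000]:
    z1 = rng.standard_normal(n)
    z2 = rng.standard_normal(n)
    ratio = z1 / z2
    print(f"n={n:>9}: mean={ratio.mean():10.2f}, "
          f"max |ratio|={np.abs(ratio).max():12.1f}")
```

The sample median stays near zero (the Cauchy median is defined), while the sample mean jumps around erratically no matter how large n gets.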

These are some of the reasons I found for why ratios might be "dangerous" in statistical modelling, and why it might be better to use the numerator and denominator of the ratio as separate variables in the regression model (or any statistical model). Are there any other main reasons why it might be considered dangerous to use ratios in statistical models?

Thanks!

Stephan Kolassa
stats_noob
  • The Cauchy argument is only meaningful if the covariates are considered random. Can you elaborate on the second bullet point? What is the ratio correlated with? Certainly it is correlated with the variables that it is comprised of. An important argument against ratio covariates would be interpretation. There are many values of the individual components that can lead to the same ratio value. Such a regression model treats these disparate components as the same group. A good place for ratios is hypothesis testing when investigating the ratio of means or variances. – Geoffrey Johnson Dec 18 '21 at 12:43
  • Relevant: https://stats.stackexchange.com/questions/58664/ratios-in-regression-aka-questions-on-kronmal/410465#410465 and https://stats.stackexchange.com/questions/299722/ive-heard-that-ratios-or-inverses-of-random-variables-often-are-problematic-in – kjetil b halvorsen Dec 18 '21 at 12:55
  • 1
    Another reference: https://www.jstor.org/stable/2983064 – Frank Harrell Dec 18 '21 at 13:36
  • 2
    The answers will differ depending on whether this variable is an explanatory variable or the response variable: which one do you have in mind? – whuber Dec 18 '21 at 22:22

2 Answers


The "dangerous" part of the ratio is the inverted denominator

If you have a ratio term involving two explanatory variables in a regression model, this can be written as the interaction term:

$$\frac{x_{1,i}}{x_{2,i}} = x_{1,i} \times \frac{1}{x_{2,i}}.$$

Now, there is nothing inherently problematic or dangerous about having an interaction term involving the explanatory variable $x_{1,i}$; indeed, many regression models include interaction terms like this. However, it is arguably quite "dangerous" to have a model term that inverts the explanatory variable $x_{2,i}$: if this value is small for some data points, the explanatory term will "explode" at those points, giving them large positive or negative values and making them high-leverage points in the regression (i.e., points that strongly affect the OLS fit).

Be careful painting this situation with too broad a brush, because terms of this kind are not always dangerous. Indeed, if the explanatory variable $x_{2,i}$ was already "explosive" (say, because it was already the inverse of a stable random variable with a mean near zero) then inversion may actually make it more stable instead of more explosive. As a general rule, if we invert a random variable with relatively low kurtosis, and a mean near zero, we will tend to get a random variable with high kurtosis (i.e., high probability of extreme values), and vice versa.

Here we have concentrated on the term involving an inverted explanatory variable. Of course, it is possible that the interaction with $x_{1,i}$ could aggravate the explosive nature of this term, particularly if large values of $x_{1,i}$ tend to go with small values of $x_{2,i}$. But as you can see, it is really the inversion that is the "dangerous" part. Whether or not the ratio term is "dangerous" largely comes down to whether or not the inverted term $1/x_{2,i}$ is "dangerous" in its own right. If $x_{2,i}$ has some small values then this term will be quite explosive and yield high-leverage data points.
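A small simulation makes the leverage point concrete (this is my own sketch in NumPy, not part of the answer above; the distributions and sample size are illustrative assumptions). When one denominator value happens to land close to zero, the corresponding ratio dominates the hat matrix of the fit:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200
x1 = rng.normal(5.0, 1.0, n)
x2 = rng.normal(0.5, 0.5, n)   # occasionally lands very close to zero
ratio = x1 / x2                # "explodes" where x2 is near zero

# Leverage (hat values) for a regression with intercept + ratio term
X = np.column_stack([np.ones(n), ratio])
H = X @ np.linalg.inv(X.T @ X) @ X.T
leverage = np.diag(H)

print("largest leverage:", leverage.max())
print("average leverage:", leverage.mean())   # always p/n = 2/200
```

The data point with the most extreme ratio value necessarily has the largest hat value, so a single near-zero denominator can dominate the OLS fit.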

Ben

Actually, the reason is kind of simple. Suppose you calculate the CV multiple times from bootstrap resamples. The CV is $\frac{SD}{Mean}$. Now suppose that the mean is not close to zero, but could be, let's say one time in a million. What happens then is that we might get a CV that is -1000 times the median of the other CV values. So the problem with ratios of random variables is that the more data we have, the wilder the mean value may become, because of the divide-by-almost-zero problem in the denominator.
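Here is a rough simulation of that scenario (my own sketch in NumPy, not taken from the paper; the sample size and the shift of the mean are arbitrary assumptions). Bootstrap resamples whose mean lands near zero produce wildly large CV values:

```python
import numpy as np

rng = np.random.default_rng(2)

# A sample whose mean is deliberately forced to be small (0.1)
data = rng.standard_normal(50)
data = data - data.mean() + 0.1

# Bootstrap the coefficient of variation CV = SD / mean
cvs = np.empty(10_000)
for b in range(cvs.size):
    boot = rng.choice(data, size=data.size, replace=True)
    cvs[b] = boot.std(ddof=1) / boot.mean()

# Resamples whose mean lands near zero make the CV explode, so the
# extremes dwarf the typical (median) CV value.
print("median CV:", np.median(cvs))
print("largest |CV|:", np.abs(cvs).max())
```

The median CV is modest, but the occasional near-zero bootstrap mean produces CV values hundreds of times larger in magnitude, and of either sign.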

EDIT: For a more exact example, which I am crudely summarizing here, see: Brody JP, Williams BA, Wold BJ, Quake SR (2002) Significance and statistical errors in the analysis of DNA microarray data. Proc Natl Acad Sci 99(20):12975–12978.

Carl
  • This argument is invalid, because the size of the mean is completely arbitrary--it can be changed by a simple change in measurement scale, for instance--and thereby has no inherent relationship whatsoever to the sampling variability of the CV. The *real* source of the problem is deeper, as explained at https://stats.stackexchange.com/questions/299722. – whuber Dec 18 '21 at 15:17
  • 1
    @whuber I am not concerned about the scale of the denominator, but rather the probability of its resulting in a close approximation of a divide by zero error, and my thinking on this is quite similar to the post that you wrote. Perhaps you didn't like the way I said it? I'll give you a +1 on your post anyway. – Carl Dec 18 '21 at 17:56
  • Your concern is the correct one but your post doesn't seem to express that very clearly, because it uses only qualitative phrases like "close to zero." Moreover, this seems to cover a point already made in the question at "Ratios have the potential of becoming very large numbers and run the risk of division by zero." – whuber Dec 18 '21 at 18:01
  • 1
    @whuber Very large *magnitude* numbers to be more precise, and yes it might seem that way, but actually I am just crudely summarizing the results of a paper on that subject: Brody JP, Williams BA, Wold BJ, Quake SR (2002) Significance and statistical errors in the analysis of DNA microarray data. [Proc Natl Acad Sci 99(20):12975–12978](https://www.pnas.org/content/pnas/99/20/12975.full.pdf) – Carl Dec 18 '21 at 18:14
  • 1
    That paper focuses on ratios of (independent) zero-mean Gaussians. That's a nice example of the problem discussed in the thread I linked to, but doesn't seem to be covered adequately by your description here. – whuber Dec 18 '21 at 19:02
  • 1
    Agreed, which is why I gave the link, for further information. – Carl Dec 18 '21 at 19:49