Is computing the average of a ratio the correct approach, and how to do it with nested data?

Question

In general, is computing the average of a ratio appropriate?

And secondly, is the nested model below appropriate for doing this?

Here is a data set created from the Iris data that resembles my situation closely.

Imagine we wanted to compute the average and std error of the Length-to-Width Ratio of Iris Petals, irrespective of the Species. To do this, we measure Petal Length and Petal Width for 20 irises, and note the Species. However, we just measured the first 20 irises we found so Species is not balanced in the data. We finished with a sample of: 4 Setosa; 10 Versicolor; and 6 Virginica. From this data, the Petal Length and Width ratio is computed. Because it is believed there will be correlation within Species, we want to account for this when computing the average.

Here is an example data set:

set.seed(1234)

# Draw an unbalanced sample: 4 setosa, 10 versicolor, 6 virginica
d_setosa <- iris[iris$Species == 'setosa', ]
d_setosa <- d_setosa[sample(seq_len(50), size = 4, replace = FALSE), ]

d_versicolor <- iris[iris$Species == 'versicolor', ]
d_versicolor <- d_versicolor[sample(seq_len(50), size = 10, replace = FALSE), ]

d_virginica <- iris[iris$Species == 'virginica', ]
d_virginica <- d_virginica[sample(seq_len(50), size = 6, replace = FALSE), ]

# Stack the three subsamples into one data frame of 20 flowers
iris_sampled <- rbind(d_setosa, d_versicolor, d_virginica)

The length-to-width ratio is as follows:

iris_sampled$Petal.L_to_W_Ratio <- iris_sampled$Petal.Length / 
    iris_sampled$Petal.Width

The obvious, but presumably incorrect, thing to do would be to simply average the ratio over all 20 flowers:

mean(iris_sampled$Petal.L_to_W_Ratio)
# Standard error of the mean; nrow(iris_sampled) is 20 here
sd(iris_sampled$Petal.L_to_W_Ratio) / sqrt(nrow(iris_sampled))

This produces mean: 3.846 and std error: 0.481

Using a linear mixed-model we would get the following:
(One of my questions is whether this model is specified correctly.)

library(lme4)  
ratio.m <- lmer(Petal.L_to_W_Ratio ~ (1|Species), data = iris_sampled)  
summary(ratio.m)

Random effects:  
 Groups   Name        Variance Std.Dev.  
 Species  (Intercept) 6.511    2.552     
 Residual             1.368    1.170   
Number of obs: 20, groups:  Species, 3  

Fixed effects:  
            Estimate Std. Error t value  
(Intercept)    4.381      1.500   2.921

This produced a result of mean: 4.381 and std error: 1.500

Here is a plot of the data.

I can see that the nested mean is slightly higher, which makes sense because there were 10 versicolor irises, which would heavily pull the simple mean away from the 4 setosa irises.

score 1 · Answer 1 · answered Aug 09 '21 at 22:02

In general, is computing the average of a ratio appropriate?

As with so many things, it depends.

First, put aside temporarily the issue of ratios. If you expect that any outcome value will depend systematically on something like species, then it makes sense to take that into account in your calculations. That doesn't mean it's wrong to take the average over all the sampled flowers; it's just that the value won't be very useful, because the average you get will depend on the numbers of each species in your sample.
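To make that dependence on sample composition concrete, here is a minimal sketch in base R. It rebuilds an unbalanced sample along the lines of the one in the question; the names `n_per_species`, `reweighted_mean`, and `equal_weight_mean` are only illustrative. The overall average is just the species means weighted by how many flowers of each species happened to be sampled, so a different mix of species would give a different answer.

```r
# Rebuild an unbalanced sample like the question's: 4 / 10 / 6 flowers
set.seed(1234)
n_per_species <- c(setosa = 4, versicolor = 10, virginica = 6)
iris_sampled <- do.call(rbind, lapply(names(n_per_species), function(sp) {
  d <- iris[iris$Species == sp, ]
  d[sample(seq_len(nrow(d)), size = n_per_species[[sp]]), ]
}))
ratio <- iris_sampled$Petal.Length / iris_sampled$Petal.Width

# Per-species means and sample sizes (same factor-level order in both)
species_means <- as.vector(tapply(ratio, iris_sampled$Species, mean))
species_n <- as.vector(table(iris_sampled$Species))

pooled_mean <- mean(ratio)                                      # simple average
reweighted_mean <- weighted.mean(species_means, w = species_n)  # same number
equal_weight_mean <- mean(species_means)                        # each species counts once

print(c(pooled = pooled_mean,
        reweighted = reweighted_mean,
        equal_weight = equal_weight_mean))
```

Which of these weightings is "right" depends on the population you want the average to describe.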

Back now to ratios. Potential problems with them are discussed extensively on this page. In particular: if the denominator can be close to zero, then ratio values can blow up and averages would be highly variable and useless.
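A small simulation illustrates the near-zero-denominator problem. This is a sketch on made-up normal data, not on the iris measurements; the means and standard deviations are arbitrary assumptions chosen so that one denominator stays well away from zero and the other does not.

```r
set.seed(1)
n <- 1000
numerator <- rnorm(n, mean = 10, sd = 1)

# Denominator comfortably away from zero
denom_safe <- rnorm(n, mean = 5, sd = 0.5)
# Denominator that is often close to (or on either side of) zero
denom_risky <- rnorm(n, mean = 0.5, sd = 0.5)

ratio_safe <- numerator / denom_safe
ratio_risky <- numerator / denom_risky

# Compare the spread of the two sets of ratios
print(summary(ratio_safe))
print(summary(ratio_risky))
print(c(sd_safe = sd(ratio_safe), sd_risky = sd(ratio_risky)))
```

With the risky denominator a handful of extreme ratios dominate the mean, so repeating the simulation with another seed can change the average substantially.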

Some things, on the other hand, are inherently ratios. Intensive variables come to mind: speed in miles per hour, concentrations in moles per liter, disease incidence in cases per 100,000 population. In those cases the denominators are necessarily positive, the numerators tend to increase proportionately to the denominators, and averages of those ratios make lots of sense.

Your case, the Petal.L_to_W_Ratio, is someplace in between. You expect some relationship between length and width, and the individual values are positive, but you don't necessarily have the close proportionality you expect from intensive variables.

You need to consider the variance of the ratio when you build a model, which can be a problem. (Even your simple overall averaging is building a model.) Different types of models are based on different assumptions about the variance of the outcome. For example, the simple linear random-effect model you show assumes that the variance about the predicted values is independent of the predicted value, and (ideally) that the residuals about those values are normally distributed. That doesn't seem to be the case for Petal.L_to_W_Ratio, with a lot more variability among the ratios within the higher-average-ratio setosa species.
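One quick way to check the constant-variance assumption is to tabulate the mean and standard deviation of the ratio within each species. The sketch below uses the full built-in `iris` data rather than the 20-flower sample, and the data frame name `by_species` is only illustrative.

```r
# Length-to-width ratio for all 150 flowers in the built-in iris data
ratio <- iris$Petal.Length / iris$Petal.Width

# Mean and standard deviation of the ratio within each species
by_species <- data.frame(
  Species = levels(iris$Species),
  mean_ratio = as.vector(tapply(ratio, iris$Species, mean)),
  sd_ratio = as.vector(tapply(ratio, iris$Species, sd))
)
print(by_species)
```

If the standard deviations differ a lot between species, and especially if they grow with the mean, a model that assumes one common residual variance is questionable.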

How to handle the variance then becomes important. If you're going to do a ratio, sometimes it helps to put the numerically larger value in the denominator, in this case to calculate a Petal.W_to_L_Ratio instead. Something as simple as working with the logs of the ratios, instead of the ratios themselves, might work adequately. A properly chosen generalized linear model, based on your understanding of your subject matter and the nature of your measurements, might do better.
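As a rough sketch of the first two suggestions, the code below compares the within-species spread of the original ratio, the inverted ratio, and the log ratio on the full `iris` data. The helper `sd_by_species` and the column names are illustrative, and whether any of these transformations is adequate for your own data is something to check rather than assume.

```r
ratio_LW <- iris$Petal.Length / iris$Petal.Width

# Helper (hypothetical name): standard deviation of a vector within each species
sd_by_species <- function(v) as.vector(tapply(v, iris$Species, sd))

spread <- data.frame(
  Species = levels(iris$Species),
  sd_L_to_W = sd_by_species(ratio_LW),       # original ratio
  sd_W_to_L = sd_by_species(1 / ratio_LW),   # larger value in the denominator
  sd_log = sd_by_species(log(ratio_LW))      # log of the ratio
)
print(spread)

# The log of a ratio is a difference of logs, so no explicit division is needed
log_ratio <- log(iris$Petal.Length) - log(iris$Petal.Width)

# Averaging on the log scale and back-transforming gives a geometric mean ratio
geo_mean_by_species <- exp(tapply(log_ratio, iris$Species, mean))
print(geo_mean_by_species)
```

Note that a back-transformed mean of logs is a geometric mean, which answers a slightly different question than the arithmetic mean of the ratios.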

Or, depending on why you want to calculate the ratio, there might be a better way to accomplish the task. Say that you did a detailed study of both width and length values, but in an upcoming field study you could only easily measure length. If you still wanted to estimate the corresponding width values based on data from your detailed study, you might just model width directly as a function of length and species and avoid the explicit ratio calculations completely.
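A minimal sketch of that alternative, using the full `iris` data as the stand-in "detailed study". The additive model form and the three `new_flowers` rows are assumptions for illustration only; your subject matter might call for an interaction or a different functional form.

```r
# Model width directly as a function of length and species
fit <- lm(Petal.Width ~ Petal.Length + Species, data = iris)
print(coef(summary(fit)))

# Hypothetical field measurements: length and species only
new_flowers <- data.frame(
  Petal.Length = c(1.5, 4.5, 5.5),
  Species = factor(c("setosa", "versicolor", "virginica"),
                   levels = levels(iris$Species))
)

# Predicted widths with 95% prediction intervals
pred <- predict(fit, newdata = new_flowers, interval = "prediction")
print(cbind(new_flowers, pred))
```

No ratio is ever computed here, so the problems with small denominators do not arise.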

Finally, be careful in how you think about the intercept of a random-effect model like yours as an "average," again putting aside the issue of the ratio. The intercept in your model is the mean of a distribution of species-mean values that (a) has been forced to be a normal distribution and (b) is weighted toward the cases with the largest number of observations. Is that the "average" you really want to determine? Or, maybe (with only 3 species), is all you care about the actual mean values within each species and the differences among them, as determined by a fixed-effects model? Again, it depends on what you're trying to accomplish.
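For comparison, here is a sketch of the fixed-effects alternative in base R on an unbalanced sample like the one in the question. The column name `ratio` and the equal weights `w` are illustrative choices; the standard errors also rely on `lm`'s common-variance assumption, which, as noted above, is doubtful for these ratios.

```r
# Rebuild an unbalanced sample like the question's: 4 / 10 / 6 flowers
set.seed(1234)
n_per_species <- c(setosa = 4, versicolor = 10, virginica = 6)
iris_sampled <- do.call(rbind, lapply(names(n_per_species), function(sp) {
  d <- iris[iris$Species == sp, ]
  d[sample(seq_len(nrow(d)), size = n_per_species[[sp]]), ]
}))
iris_sampled$ratio <- iris_sampled$Petal.Length / iris_sampled$Petal.Width

# Cell-means coding: one coefficient per species, no common intercept
fit_fixed <- lm(ratio ~ 0 + Species, data = iris_sampled)
print(coef(summary(fit_fixed)))

# One possible "average": weight each species equally, whatever its sample size
w <- rep(1 / 3, 3)
est <- sum(w * coef(fit_fixed))
se <- sqrt(drop(t(w) %*% vcov(fit_fixed) %*% w))
print(c(estimate = est, std_error = se))
```

Changing `w` makes the choice of "average" explicit, instead of leaving it to the weighting implied by a random-effect model.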

Thank you @EdM! I appreciate the information on the question of working with ratios. In the last paragraph you mentioned the interpretation of the intercept in the random-effect model. My goal really is to reduce the influence of repeated correlated measurements using a random-effect. In my particular data I have multiple samples from human subjects that are not equal in number for all subjects (subject is the random-effect). In my case, would the intercept of the random-effects model correctly account for the correlations in the data? — jeffalltogether, Aug 09 '21 at 22:38
@jeffalltogether that’s a way to account for correlations within subjects arising from different intercepts. Some models need further random effects for individual-specific slopes with respect to predictor variables or time. If so, you need to decide whether intercepts and slopes should be allowed to be correlated, too. These choices depend on the specific study, its measurements, and its goals. Many threads on mixed models and random effects on this site can help point the way. If those don’t help, ask a new question more directly related to your own study. — EdM, Aug 10 '21 at 02:37
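
For reference, a sketch of how those random-effect choices look in lme4's formula syntax. It uses the `sleepstudy` data that ships with lme4 purely as a stand-in for repeated measurements on subjects; it is not the data discussed in this thread, and the model names are illustrative.

```r
library(lme4)

# Random intercepts only: subjects differ in their baseline level
m_int <- lmer(Reaction ~ Days + (1 | Subject), data = sleepstudy)

# Random intercepts and slopes, allowed to be correlated
m_corr <- lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)

# Random intercepts and slopes, forced to be uncorrelated
m_uncorr <- lmer(Reaction ~ Days + (Days || Subject), data = sleepstudy)

# Estimated variances (and correlation, where modeled) of the random effects
print(VarCorr(m_corr))
print(VarCorr(m_uncorr))

# Likelihood-ratio comparison; anova() refits the models with ML first
print(anova(m_int, m_corr))
```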
