0

This question is both related to math and coding, and for transparency, I also posted it on Stack Overflow for the coding part. I need to modify a log regression when I change my y-axis data from y to 100-y (y is a percentage). I understand that log(100-y)~x is a different model to log(y)~x, but considering the shape of the curve f(x)=-exp(x), I would expect an exponential lm() model to work. It seems I am wrong, why? I went through the site, it seems to be a known (and more complicated than I thought) issue/question, reading here, and here for example. Even if I don't get the exact "reversed" model, why does lm(exp(100-y)~x not provide a decent estimate? I also understand that a log-linked gaussian GLM could make it, but I have issues to properly code it and plot. And it is less convenient as I cannot really have a r² if I understand properly.

I use the following reprex, the initial data are:

m=c(3,1,3,2,4,7,10,2,1,44,11,24,1,3,5,11,7,17,8,10,17,34,31,10,7,3,1,6,4,3,4,7,11,14,6,8,34,80,23,54,2,3,2,1,13,8,10,15,2,15,8,11,8,12,9,18,85)
q=c(91,78,72,84,74,76,81,80,70,130,131,136,53,56,57,111,60,80,78,76,84,107,85,76,57,57,45,50,68,61,52,59,60,60,51,63,95,93,85,97,52,39,34,40,70,124,90,68,39,81,50,60,74,70,59,53,98)
Age=c(rep("Young",24),rep("Old",33))
df=data.frame(Age,m,q)

After looking at the plot, I decided to test a linear regression for the "Young" category, and a log one for the "Old" category:

model_lin_df <- df %>% group_by(Age) %>% do(model = glance(lm(m ~ q,data = .))) %>% unnest(model)
> model_lin_df
# A tibble: 2 x 13
  Age   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <chr>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 Old       0.370         0.350 16.4       18.2 0.000172     1 -138.   282.  287.    8339.          31    33
2 Young     0.418         0.391  8.97      15.8 0.000646     1  -85.7  177.  181.    1771.          22    24

model_exp_df <- df %>% group_by(Age) %>% do(model = glance(lm(log(m) ~ q,data = .))) %>% unnest(model)
> model_exp_df
# A tibble: 2 x 13
    Age   r.squared adj.r.squared sigma statistic    p.value    df logLik   AIC   BIC deviance df.residual  nobs
      <chr>     <dbl>         <dbl> <dbl>     <dbl>      <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
    1 Old       0.517         0.502 0.776      33.2 0.00000242     1  -37.4  80.8  85.3     18.7          31    33
    2 Young     0.362         0.333 0.916      12.5 0.00187        1  -30.9  67.8  71.4     18.5          22    24

Here is the plot with the fitted lines:

model_lin <- lm(m ~ Age/q + 0, df,na.action=na.exclude)
model_exp <- lm(log(m) ~ Age/q + 0, df,na.action=na.exclude)

df_plot <- ggplot(df, aes(x = q, y = m, shape=Age, color=Age)) + geom_point() +
                geom_line(aes(y = ifelse(Age == "Young", fitted(model_lin), NA))) +
                geom_line(aes(y = ifelse(Age == "Old", exp(fitted(model_exp)), NA))) +
                theme_bw() + theme(panel.grid = element_blank())

enter image description here

The nice part is that it is a lm model, I can add the r² and p-value. But I realized that in fact, I should plot the y-axis as 100-y, the data would biologicaly make more sense:

s=100-m
df=cbind(df,s)

The linear model provides the same r² and p-value, which was intuitive:

model_lin_df2 <- df %>% group_by(Age) %>% do(model = glance(lm(s ~ q,data = .))) %>% unnest(model)
> model_lin_df2
# A tibble: 2 x 13
  Age   r.squared adj.r.squared sigma statistic  p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <chr>     <dbl>         <dbl> <dbl>     <dbl>    <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 Old       0.370         0.350 16.4       18.2 0.000172     1 -138.   282.  287.    8339.          31    33
2 Young     0.418         0.391  8.97      15.8 0.000646     1  -85.7  177.  181.    1771.          22    24

But I don't find how to "reverse" / back transform the exponential model. I would intuitively think that a -exp function would do it, but no:

model_exp_df2 <- df %>% group_by(Age) %>% do(model = glance(lm(100-exp(s) ~ q,data = .))) %>% unnest(model)

> model_exp_df2
# A tibble: 2 x 13
  Age   r.squared adj.r.squared   sigma statistic p.value    df logLik   AIC   BIC deviance df.residual  nobs
  <chr>     <dbl>         <dbl>   <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>    <dbl>       <int> <int>
1 Old       0.216        0.191  2.26e42      8.55 0.00640     1 -3264. 6534. 6539.  1.58e86          31    33
2 Young     0.114        0.0736 3.19e42      2.83 0.107       1 -2382. 4770. 4773.  2.24e86          22    24

model_lin2 <- lm(s ~ Age/q + 0, df,na.action=na.exclude)
model_exp2 <- lm((100-exp(s)) ~ Age/q + 0, df,na.action=na.exclude)

df_plot2 <- ggplot(df, aes(x = q, y = s, shape=Age, color=Age)) + geom_point() +
              geom_line(aes(y = ifelse(Age == "Young", fitted(model_lin2), NA))) +
              geom_line(aes(y = ifelse(Age == "Old", exp(fitted(model_exp2)), NA))) +
              theme_bw() + theme(panel.grid = element_blank())

enter image description here

Is there a lm() model that could fit the data for the "Old" category? It would be great because it easily provides r² and p-value. If not, why, and is a glm with log link the only option? I am less used to that one, I cannot figure out how to plot it for one category so far. Any insight will be appreciated!

Mata
  • 101
  • 2

0 Answers0