
I recently used multiple linear regression to model monthly species abundance (y) as a function of environmental variables (x) from 2005 to 2016.

To satisfy the assumptions of multiple linear regression I had to apply a transformation, abs(y - mean(y)), to the response variable. Having completed model selection, I wanted to see how well the model could predict y using the 2017 values of x, so I used the predict() function. The result was returned on the transformed scale, which is of no use to me, so I removed the transformation and used the following script:

mod1 <- lm(y ~ x1 + x2, data = mydata)
new.df <- data.frame(x1 = c(),
                     x2 = c())
predict(mod1, new.df)
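For reference, a self-contained sketch of this workflow with simulated numbers (my actual x1, x2, and y values are omitted above) looks like this:

```r
# Self-contained sketch with made-up data; the real x1/x2/y values are not shown above
set.seed(42)
train <- data.frame(x1 = rnorm(144), x2 = rnorm(144))        # 144 months: 2005-2016
train$y <- 2 + 0.5 * train$x1 - 0.3 * train$x2 + rnorm(144)  # simulated response
mod1 <- lm(y ~ x1 + x2, data = train)

new.df <- data.frame(x1 = rnorm(12), x2 = rnorm(12))         # 12 months of 2017 predictors
pred <- predict(mod1, new.df)                                # one prediction per 2017 month
length(pred)
```

Note that predict() always returns values on the scale of the response that was passed to lm(), so a model fit to abs(y - mean(y)) produces predictions on that transformed scale.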

I compared the result to the actual monthly species abundance data for 2017, and the predictions were very accurate.

I have two questions:

1) Can I report these predictions when the MLR model does not satisfy the assumptions?

2) As the initial model selection was based on models fit to transformed data, is it appropriate for me to report predictions from the model without the transformation?

I have seen many answers to questions that may appear similar to this one that seem to suggest assumptions do not need to be satisfied for making predictions; however, I have been unable to find any reference for this in the published literature.

Stefan
Jo Harris
  • *I had to apply a transformation (abs(**x**-mean(**x**))) to the **response variable*** Can you clarify this point? Response variable is usually used to describe y, the dependent variable. – Penguin_Knight Apr 24 '18 at 15:13
  • By applying that transformation you guaranteed that one important assumption (at least) is *not* satisfied: all your responses are now associated with one another, perhaps strongly so, through the incorporation of their mean within every one of the transformed values. That is the very opposite of the *independence* that is assumed. Moreover, you have no way of inverting that "transformation," so of what use would it be? In what sense were you able to "remove" it? – whuber Apr 24 '18 at 15:14
  • @Penguin_Knight thanks, it was a typo, I have edited the question. – Jo Harris Apr 24 '18 at 15:15
  • I'd suggest using a different distribution that is able to model your data. Then you don't have the trouble of transforming and back-transforming etc. Is species abundance a proportion with values between 0 and 1? Or is it a number of species, i.e. a count value? – Stefan Apr 24 '18 at 15:17
  • What were you trying to achieve with that transformation? It's very strange, in that you've folded your data around the mean. If your data were symmetric, what was formerly the lowest value is now tied for the highest value after transformation. – mkt Apr 24 '18 at 15:18
  • @whuber I am a student and followed my supervisor's advice, in hindsight it was unhelpful. It is too late to change as my thesis is due in a couple of weeks and I do not have time to revisit over 100 models so I am hoping there is a reference that says I do not need to satisfy the assumptions to use predict() so I can use the data. – Jo Harris Apr 24 '18 at 15:20
  • I'm sorry that you got bad advice, but you cannot justify an improper analysis with a citation. Even if such a citation did exist, it would be wrong - plenty of incorrect things get published, after all. But the fact of their having escaped close scrutiny during peer review does not make them correct, or a valid basis for action. – mkt Apr 24 '18 at 15:22
  • Of course you can use `predict`: like the sorcerer's apprentice, it will perform exactly the computation you require, whether or not it makes any sense or has any justification. A thesis is supposed to be published, so you need to apply publication standards to its preparation. That would include, at the very minimum, eliminating--or at least clearly and explicitly acknowledging--any elements you believe could be incorrect or misleading. – whuber Apr 24 '18 at 15:23
  • @Stefan my response variable is a proportion (%) – Jo Harris Apr 24 '18 at 15:24
  • In that case, I would have a look at beta regression. R has a package called `betareg` that could do the job: https://cran.r-project.org/web/packages/betareg/vignettes/betareg.pdf I am working with similar data and asked related questions [here](https://stats.stackexchange.com/questions/305272/inferential-statistics-on-mass-fractions-continuous-proportions) and [here](https://stats.stackexchange.com/questions/309047/zero-inflated-beta-regression-using-gamlss-for-vegetation-cover-data). – Stefan Apr 24 '18 at 15:30
  • @whuber Thank you for the advice. I did try many other transformations but this was the only one that satisfied all assumptions and did not give me autocorrelation issues. – Jo Harris Apr 24 '18 at 15:30
  • @Stefan thanks but it was my understanding that with beta regression no observation can equal exactly zero or exactly one and many of my observations do? I will look into it further. – Jo Harris Apr 24 '18 at 15:33
  • Yes, that is true; however, depending on the number of zeros and ones, you could apply a transformation that brings those numbers between 0 and 1 (see page 3 under `2. Beta regression`). However, that all depends on your specific data, of which you did not provide an example. Alternatively, have a look at the link to my second question above. – Stefan Apr 24 '18 at 15:40
  • This is not a "transformation" in the sense you probably intended because (1) it is not invertible and (2) it depends on all the data at once rather than on just a single number at a time. Although there's nothing mathematically wrong with it, in most statistical applications it likely would not give rise to a useful model. – whuber Apr 24 '18 at 16:06
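To make the non-invertibility point from the comments concrete, here is a tiny base-R example (toy values, not the questioner's data):

```r
# Toy values only; illustrates why abs(y - mean(y)) cannot be undone
y  <- c(2, 4, 6)
ty <- abs(y - mean(y))   # mean(y) is 4, so ty is c(2, 0, 2)
# The distinct responses 2 and 6 both map to 2, so there is no inverse
# mapping from the transformed scale back to the original y.
print(ty)
```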

0 Answers