As background: I collected yield data covering the past 5-6 decades, and high-yielding varieties (HYV) were introduced over time at the location the data come from. I am looking at the relationship between yield and rainfall, but the introduction of HYV might obscure the true impact of the monsoon on yield, so I am detrending the data to remove the HYV effect.

I did a linear regression of yield against time in R:

mdl1 <- lm(yield ~ time, data=data)

and then removed the linear trend by taking the residuals of the above regression:

yield.res <- resid(mdl1)

Now I am using these residuals for my subsequent analysis. For example, the relationship between yield and rainfall is:

mdl2 <- lm(yield.res ~ rain, data=data)

In this case, does yield.res have to be normally distributed before I run this regression? If so, what sort of transformation should I use? Since yield.res contains both negative and positive values, I am unsure how to go about it.

user53020

1 Answer

This is a very odd way to try to analyze your data. Do you really believe that the relationship between yield and time is strictly linear? Do the results support that? Why not include both time and rain in a single multiple regression model? Moreover, since it isn't safe to assume the relationship with time is strictly linear, you can use a spline function of time like so:

library(splines)
# df = 3 is an illustrative choice; ns(time) with no df or knots reduces to a linear term
mdl <- lm(yield ~ ns(time, df = 3) + rain, data=data)
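To make the case for the single model concrete, here is a minimal simulated sketch. Everything in it is hypothetical: the numbers, the linear trend, and the names `sim`, `two_step` and `joint` are made up for illustration and are not taken from the question's data. Rainfall is given a slight drift over time so that it is correlated with the trend, which is the situation where the two approaches disagree.

```r
set.seed(1)
n    <- 60
time <- 1:n

# Hypothetical data: rain drifts with time; yield has a time trend
# plus a rain effect of 0.5 (both values are assumptions)
rain  <- 100 - 0.5 * time + rnorm(n, sd = 10)
yield <- 20 + 0.8 * time + 0.5 * rain + rnorm(n, sd = 3)
sim   <- data.frame(yield, time, rain)

# Two-step approach from the question:
# detrend yield, then regress the residuals on rain
yield.res <- resid(lm(yield ~ time, data = sim))
two_step  <- lm(yield.res ~ rain, data = sim)

# Single model: time and rain are estimated together
joint <- lm(yield ~ time + rain, data = sim)

# Share of rain's variance explained by time
r2_rain_time <- summary(lm(rain ~ time, data = sim))$r.squared

# Compare the two rain slopes
c(two_step = unname(coef(two_step)["rain"]),
  joint    = unname(coef(joint)["rain"]))
```

With a linear trend, the two-step slope equals the joint slope multiplied by `1 - r2_rain_time`, so it is pulled toward zero whenever rain and time are correlated. The two only coincide when rain has no linear association with time in the sample.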

To answer your explicit question, regression methods do not assume the marginal distribution of $Y$ (in this case, yield.res) is normally distributed (see the thread "What if residuals are normally distributed, but y is not?"). Instead, they assume that the residuals (more accurately, the errors of the data-generating process) are normally distributed.
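As a rough illustration of that distinction, the sketch below simulates a response whose marginal distribution is clearly bimodal even though the errors are normal. The data and the names `g`, `y` and `fit` are hypothetical. The normality check is then applied to the residuals of the fitted model rather than to the response itself.

```r
set.seed(2)
n <- 200

# Hypothetical predictor: a two-group indicator with a large effect
g <- rep(c(0, 1), each = n / 2)

# Errors are normal, but the marginal distribution of y is bimodal
y   <- 10 + 8 * g + rnorm(n, sd = 1)
fit <- lm(y ~ g)

# Shapiro-Wilk test on the raw response vs. on the model residuals
p_marginal <- shapiro.test(y)$p.value
p_resid    <- shapiro.test(resid(fit))$p.value

c(marginal = p_marginal, residuals = p_resid)
```

The test on the raw response rejects normality decisively, yet that says nothing about whether the model's error assumption holds; the residuals are the relevant quantity. For a visual check, `qqnorm(resid(fit)); qqline(resid(fit))` is usually more informative than a formal test.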

gung - Reinstate Monica
  • In addition, the idea of using residuals only makes even halfway sense in the very special case of the ordinary linear model. – Frank Harrell Aug 03 '14 at 16:09
  • Hmm, I wonder what the downvote is for? I could possibly correct this, if I knew what was wrong. – gung - Reinstate Monica Aug 04 '14 at 02:18
  • One problem is that even if the original observations are independent the residuals will not be. But the biggest problem is that you don't formulate the correct model up front and fashion optimal estimation and hypothesis testing off of that model. – Frank Harrell Aug 04 '14 at 12:34