I have data on investment preferences from one year before Covid and from during the Covid lockdown.

Some changes appear with a simple t-test. I want to assess whether these changes are particularly strong for specific demographics (e.g., older individuals ($X_1$), individuals with lower income ($X_2$), etc.).

Should I include the initial level of my dependent variable in the regressions? Basically, if I want to use OLS regressions to investigate which independent variables correlate with the change in my dependent variable, which model is preferable?

Model 1 (apparently called the Change Score Method): $Y_2 - Y_1 = \beta_1 X_1 + \beta_2 X_2$

Model 2 (apparently called the Regressor Variable Method): $Y_2 = \beta_1 X_1 + \beta_2 X_2 + \beta_3 Y_1$

Thank you so much for your help. Any reference would also be much appreciated!

L. M.

1 Answer

Both methods are used in practice (see here for an example); which one to choose depends on the question you want to answer. If you are mostly interested in "change", you can use

(Y2-Y1) ~ X1 + X2            # (1)

The baseline (Y1) should not be added as a predictor to the equation above: Y1 is itself part of the difference (Y2-Y1), so the two are correlated by construction. See the comments below by @EdM and here.
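The built-in correlation between the baseline and the difference can be written out. As a sketch, assume (purely for illustration, this is not stated in the question) that $Y_1$ and $Y_2$ share the same variance $\sigma^2$ and have correlation $\rho$:

$$\operatorname{Cov}(Y_2 - Y_1,\, Y_1) = \operatorname{Cov}(Y_2, Y_1) - \operatorname{Var}(Y_1) = (\rho - 1)\,\sigma^2$$

$$\operatorname{Var}(Y_2 - Y_1) = 2\,(1 - \rho)\,\sigma^2$$

$$\operatorname{Corr}(Y_2 - Y_1,\, Y_1) = \frac{(\rho - 1)\,\sigma^2}{\sigma \cdot \sigma\sqrt{2\,(1 - \rho)}} = -\sqrt{\frac{1 - \rho}{2}}$$

Under these assumptions, two completely unrelated measurements ($\rho = 0$) still give a correlation of about $-0.71$ between baseline and change, and the correlation only approaches zero as $\rho$ approaches 1.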

On the other hand, if you want to discuss factors affecting "final value", you can use

Y2 ~ X1 + X2 + Y1            # (2)
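One way to see how (1) and (2) relate, using the $\beta$ notation from the question, is to subtract $Y_1$ from both sides of (2):

$$Y_2 - Y_1 = \beta_1 X_1 + \beta_2 X_2 + (\beta_3 - 1)\, Y_1$$

Read this way, (1) is the special case of (2) in which the baseline coefficient is fixed at $\beta_3 = 1$, whereas (2) estimates $\beta_3$ from the data. The coefficients on $X_1$ and $X_2$ can therefore differ between the two models whenever the estimated $\beta_3$ is not close to 1 and the covariates are related to the baseline.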

However, since the repeated measurements (Y1 and Y2, at two time points) are taken on the same subjects, a mixed model is also often used, including the interactions with time, as commented by @dbwilson below:

Y ~ X1 + X2 + time + X1*time + X2*time + (1|subject)

The following shorter formula is equivalent, because X1*time expands to X1 + time + X1:time:

Y ~ X1*time + X2*time + (1|subject)            # (3)
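The point made in the comments by @dbwilson, that the time interactions in (3) estimate the same effects as the covariates in (1), can be checked numerically. The sketch below is plain Python on made-up data with a single binary covariate and complete two-wave data; the names `x`, `y1`, `y2` and all numeric values are hypothetical. It does not fit a mixed model. Instead it relies on the fact that, with a binary covariate and two time points, the interaction coefficient in a least-squares fit of `y ~ x*time` is the difference-in-differences of the four cell means.

```python
import random

random.seed(0)

# Hypothetical two-wave data: x is a binary covariate (e.g. 1 = older),
# y1 and y2 are the outcome before and during the lockdown.
n = 200
x = [i % 2 for i in range(n)]
y1 = [random.gauss(50, 10) for _ in range(n)]
y2 = [a + 2.0 + 3.0 * g + random.gauss(0, 5) for a, g in zip(y1, x)]


def mean(values):
    return sum(values) / len(values)


# Change-score model (1): OLS slope of (y2 - y1) on x.
d = [b - a for a, b in zip(y1, y2)]
mx, md = mean(x), mean(d)
slope_change = (sum((xi - mx) * (di - md) for xi, di in zip(x, d))
                / sum((xi - mx) ** 2 for xi in x))


# Long-format model y ~ x*time: with binary x and binary time, the
# interaction coefficient is a difference-in-differences of cell means.
def cell_mean(y, group):
    return mean([yi for yi, xi in zip(y, x) if xi == group])


interaction = ((cell_mean(y2, 1) - cell_mean(y1, 1))
               - (cell_mean(y2, 0) - cell_mean(y1, 0)))

# The two estimates agree up to floating-point rounding.
print(round(slope_change, 6), round(interaction, 6))
```

In this balanced, complete two-wave setting, adding a random intercept per subject should not move the point estimate of the interaction, so the mixed model mainly changes how the standard errors are obtained. With missing waves or more than two time points, the two approaches can diverge.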

There is another method commonly used, especially in biomedical literature: "Percent change", i.e.

(100*(Y2-Y1)/Y1) ~ X1 + X2            # (4)

It is not correct to keep Y1 as a predictor in this last method, because the baseline and the percent change are strongly correlated by construction.
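A small simulation illustrates why the baseline and the percent change are tied together. This is a sketch on made-up data, assuming baseline and follow-up are drawn independently from the same hypothetical uniform range, so the baseline carries no real information about the follow-up; `pearson` is a helper written here for the example.

```python
import random

random.seed(1)

# Hypothetical data: baseline and follow-up are drawn independently,
# so nothing about the baseline truly predicts the follow-up.
n = 5000
y1 = [random.uniform(50, 150) for _ in range(n)]
y2 = [random.uniform(50, 150) for _ in range(n)]
pct_change = [100 * (b - a) / a for a, b in zip(y1, y2)]


def pearson(u, v):
    """Pearson correlation computed directly from its definition."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    var_u = sum((a - mu) ** 2 for a in u)
    var_v = sum((b - mv) ** 2 for b in v)
    return cov / (var_u * var_v) ** 0.5


r = pearson(y1, pct_change)
print(f"corr(Y1, percent change) = {r:.2f}")
```

The correlation comes out strongly negative even though the two waves are unrelated, because Y1 appears both in the numerator and in the denominator of the percent change. A regression of percent change on Y1 would pick up this artefact as an apparent baseline effect.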

I think this last method (percent change) is the most understandable.

See here for more information on this topic.

rnso
  • Thank you so much for this detailed answer. In the end, given that I was mostly interested in change, I used (Y2-Y1) ~ X1 + X2. It is, however, interesting to see the last two methods you propose. Thank you again! – L. M. Jul 21 '20 at 10:30
  • Regressing the difference against the initial value is not a good idea. See [this answer](https://stats.stackexchange.com/a/476445/28500) and its links and [this answer](https://stats.stackexchange.com/a/476453/28500) to the question ["What are the worst (commonly adopted) ideas/principles in statistics?"](https://stats.stackexchange.com/q/476424/28500) – EdM Jul 21 '20 at 11:21
  • I have added a note regarding this in the answer above. – rnso Jul 21 '20 at 11:30
  • In the mixed model, the X1*time and X2*time interactions estimate the same effects as the X1 and X2 effects in the change score model. The code, however, should be Y~X1+X2+time+X1*time+X2*time + (1|subject). – dbwilson Jul 21 '20 at 11:55
  • I have added this to the answer above with your reference. – rnso Jul 21 '20 at 12:35