I am using Stata 13 to estimate a simple model with interaction terms. To give the coefficients a meaningful interpretation at zero, and to reduce multicollinearity, I am mean-centering variables.
I am wondering when to do this: should I center using all observations before estimating the regression, or only using the observations that enter the regression? The question stems from the pattern of missing values in my data: when a variable is centered on the full data, its mean is not zero over the observations that actually entered the regression.
Maybe an example helps in making the point:
clear
set more off
sysuse auto.dta
*Randomly replace weight with missings
gen tomis = ceil(10*runiform())
replace weight=. if tomis==1
*Center mpg
sum mpg, meanonly
gen cmpg = mpg-r(mean)
*Regression
qui reg price cmpg weight foreign
qui gen sample = 1 if e(sample)
*Center mpg when in sample
sum mpg if sample==1, meanonly
gen cmpgs= mpg-r(mean)
*Sums
sum mpg cmpg cmpgs
sum mpg cmpg cmpgs if sample==1
In the example above I mean-center mpg to get cmpg. The mean of cmpg is thus (close to) zero over all observations. However, the mean of cmpg is 0.278 over the observations that entered the regression. Does that make sense, or should I center based on the observations that enter the regression, as I do when generating cmpgs?
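If I understand it correctly, the nonzero mean is just the gap between the two means of mpg. Writing S for the set of observations that entered the regression and n_S for their number (my own notation, not Stata output), the mean of cmpg over S should be:

```latex
\[
\frac{1}{n_S}\sum_{i \in S}\left(\mathrm{mpg}_i - \overline{\mathrm{mpg}}_{\mathrm{all}}\right)
  = \overline{\mathrm{mpg}}_{S} - \overline{\mathrm{mpg}}_{\mathrm{all}}
\]
```

So the 0.278 would be the mean of mpg in the estimation sample minus the mean of mpg over all observations, and cmpgs has mean zero over S by construction.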