I wonder how to generate such data, so that in single variable regression feature coefficient would be positive, and in multiple regression would be negative.
So I read several related questions on Cross Validated. As I understand it, there are two ways to produce this behavior: when the features are correlated (collinearity), or when the feature values are not orthogonal.
So I wrote code that generates data for the collinearity case:
set.seed(1)
xPositive = rnorm(100)
xNegative = xPositive / 3 + rnorm(100, 0, .15)
y = 5 * xPositive - 3 * xNegative + rnorm(100)
fitXNegative = glm(y~xNegative)
fixBothX = glm(y~xNegative+xPositive)
print(paste("Coefficient for xNegative when fitting only xNegative:", coef(fitXNegative)['xNegative']))
print(paste("Coefficient for xNegative when fitting both x'es:", coef(fixBothX)['xNegative']))
print(paste("That is because of correlation between x'es:", cor(xNegative, xPositive)))
The output is:
Coefficient for xNegative when fitting only xNegative: 9.57682149821626
Coefficient for xNegative when fitting both x'es: -2.85321980961991
That is because of correlation between x'es: 0.911618247307253
So the example works.
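For what it's worth, the size of the flipped coefficient can be checked against the omitted-variable decomposition: the single-variable slope equals the two-variable slope for xNegative plus the xPositive slope times the slope from regressing xPositive on xNegative. A quick sanity check on the same simulated data (using lm, which fits the same Gaussian model as glm here):

```r
set.seed(1)
xPositive <- rnorm(100)
xNegative <- xPositive / 3 + rnorm(100, 0, .15)
y <- 5 * xPositive - 3 * xNegative + rnorm(100)

# Slopes from the two-variable fit, and from the auxiliary regression
# of the omitted feature on the included one
b <- coef(lm(y ~ xNegative + xPositive))
delta <- coef(lm(xPositive ~ xNegative))["xNegative"]

# Omitted-variable identity (exact in-sample for least squares):
# short slope = b_negative + b_positive * delta
short <- coef(lm(y ~ xNegative))["xNegative"]
print(unname(short))                                    # ~9.58, as above
print(unname(b["xNegative"] + b["xPositive"] * delta))  # identical value
```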
Similarly, I tried to come up with a non-orthogonality example. Here is my best attempt so far:
set.seed(1)
xPositive = runif(100, 100, 120)
xNegative = runif(100, 100, 120)
y = 13 * xPositive - 11 * xNegative
fitXNegative = glm(y~0+xNegative)
fitBothX = glm(y~0+xNegative+xPositive)
print(paste("Coefficient for xNegative when fitting only xNegative:", coef(fitXNegative)['xNegative']))
print(paste("Coefficient for xNegative when fitting both x'es:", coef(fitBothX)['xNegative']))
The output is:
Coefficient for xNegative when fitting only xNegative: 1.97024384743124
Coefficient for xNegative when fitting both x'es: -11
(The two-variable fit recovers the true coefficient -11 exactly, up to floating-point rounding, because y here is a noiseless linear combination of the two features.)
But as you may have noticed, I manually excluded the intercept from the model (the 0 + in the formulas), because otherwise the coefficient doesn't change its sign. Also, if you center the data in this example:
xPositive = xPositive - 110
xNegative = xNegative - 110
then you effectively get rid of the non-orthogonality, and the coefficient no longer changes its sign.
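To make that concrete, here is the centered version of the second example; the xNegative coefficient stays negative whether or not xPositive is in the model:

```r
set.seed(1)
xPositive <- runif(100, 100, 120) - 110  # centered around the population mean
xNegative <- runif(100, 100, 120) - 110
y <- 13 * xPositive - 11 * xNegative

# With (nearly) orthogonal columns, both fits give a negative slope,
# so there is no sign change to observe
single <- coef(lm(y ~ 0 + xNegative))["xNegative"]
both   <- coef(lm(y ~ 0 + xNegative + xPositive))["xNegative"]
print(c(single = unname(single), both = unname(both)))
```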
So, regarding this second non-orthogonality example, I have the following questions:

1. How can I construct a better data-generating mechanism, so that the coefficient's sign change can be observed without forcing the intercept to zero?
2. I am checking for non-orthogonality between the features with the formula
nonOrthogonalityMeasure = mean(xPositive * xNegative)
and concluding that the features are orthogonal when this value is approximately zero. Is this a correct method?
3. Are there other situations in which the coefficient may change its sign, besides the two described above?
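For reference, here is what that measure gives on the raw and centered data from the second example. After subtracting the means it reduces to (n-1)/n times the sample covariance (R's cov divides by n-1, mean by n):

```r
set.seed(1)
xPositive <- runif(100, 100, 120)
xNegative <- runif(100, 100, 120)

# Raw columns: the mean cross-product is dominated by the means (~110^2)
print(mean(xPositive * xNegative))

# Centered columns: the same measure is (n-1)/n times the sample covariance,
# and close to zero for independent features
xP <- xPositive - mean(xPositive)
xN <- xNegative - mean(xNegative)
print(mean(xP * xN))
print(cov(xPositive, xNegative) * 99 / 100)  # identical value
```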