Let's say I have a linear regression: $$y \sim 1 + x_1 + x_2$$
where the range of $x_2$ is $[0,10]$. I fit this model using lm
or rlm
with regression weights in R
. When I collect the residuals and plot them against $x_1$, I find that the residuals show a pattern with respect to the variable $x_1$: the $R^2$ of regressing the residuals onto $x_1$ is $20\%$. Is that possible? What could be the causes?
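To illustrate part of what confuses me: with plain OLS, the residuals are by construction orthogonal to every included regressor, so a *linear* regression of residuals on $x_1$ should give $R^2 = 0$ exactly. Here is a small simulated sketch (in Python/NumPy rather than R, purely for illustration; all numbers are made up) where an omitted nonlinear term produces a clearly visible residual pattern even though that linear $R^2$ is essentially zero:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 500
x1 = rng.uniform(0, 100, n)
x2 = rng.uniform(0, 10, n)
# True model has a quadratic term in x1 that the fitted model omits
y = 1 + x1 + x2 + 0.05 * (x1 - 50) ** 2 + rng.normal(0, 1, n)

# Fit y ~ 1 + x1 + x2 by OLS
X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Regress the residuals on x1 (with intercept) and compute R^2
Z = np.column_stack([np.ones(n), x1])
g, *_ = np.linalg.lstsq(Z, resid, rcond=None)
fitted = Z @ g
r2_linear = 1 - np.sum((resid - fitted) ** 2) / np.sum((resid - resid.mean()) ** 2)
print(r2_linear)  # essentially 0: OLS residuals are orthogonal to included regressors

# Yet a *nonlinear* pattern is clearly present in the residuals
print(np.corrcoef(resid, (x1 - 50) ** 2)[0, 1])  # strong association
```

So a nonzero linear $R^2$ like my $20\%$ would seem to require something beyond plain unweighted lm, e.g. the rlm fit or the regression weights.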
After fitting the same linear regression as above, suppose I take a smaller portion of the data, say all the data with $x_2<6$. I then collect the residuals and $x_1$ of this subset and plot the subsetted residuals against the subsetted $x_1$. I find that the residuals still show a pattern with respect to the variable $x_1$, and the $R^2$ of regressing the subsetted residuals onto the subsetted $x_1$ is again $20\%$.
(The two $20\%$ values above are just examples; they are not related. Maybe there is even a theory saying that one should definitely be larger than the other, etc.)
Is that possible? What could be the causes?
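Continuing the same illustrative simulation (Python/NumPy, made-up numbers): if the pattern comes from something in $x_1$ that is independent of $x_2$, say an omitted nonlinear term, then subsetting on $x_2$ would not remove it, which matches what I see:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1000
x1 = rng.uniform(0, 100, n)
x2 = rng.uniform(0, 10, n)
# Omitted quadratic term in x1, independent of x2
y = 1 + x1 + x2 + 0.05 * (x1 - 50) ** 2 + rng.normal(0, 1, n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Restrict to the subset x2 < 6 and check whether the x1 pattern persists
mask = x2 < 6
corr_full = np.corrcoef(resid, (x1 - 50) ** 2)[0, 1]
corr_sub = np.corrcoef(resid[mask], (x1[mask] - 50) ** 2)[0, 1]
print(corr_full, corr_sub)  # both strong: subsetting on x2 does not remove the pattern
```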
Edit: Let me try to describe the shape of the pattern.
Let's say the range of $x_1$ is $[0, 100]$.
At around $x_1=1$, the residuals are in a vertical band of $[-0.1, 0.1]$.
At around $x_1=10$, the residuals are in a vertical band of $[-1, 1]$.
...
At around $x_1=100$, the residuals are in a vertical band of $[-10, 10]$.
I intentionally chose these numbers so you can see that the upper and lower bands grow roughly linearly as $x_1$ increases. I know this is heteroskedasticity. But since I am concerned about "bias", not inference, I guess I don't need to worry about the heteroskedasticity...
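The banding I describe can be reproduced with noise whose standard deviation grows linearly in $x_1$; this sketch (again Python/NumPy for illustration, hypothetical numbers) also shows that the coefficient estimates stay close to the truth under this kind of heteroskedasticity, which is why I think bias is a separate issue:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
x1 = rng.uniform(0, 100, n)
x2 = rng.uniform(0, 10, n)
# Noise standard deviation grows linearly with x1 (multiplicative heteroskedasticity)
y = 1 + x1 + x2 + rng.normal(0, 0.1 * np.maximum(x1, 1), n)

X = np.column_stack([np.ones(n), x1, x2])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta

# Slope estimates remain close to the true values (1 and 1) despite the heteroskedasticity
print(beta)

# Residual spread in a low-x1 band vs a high-x1 band
lo = resid[(x1 > 5) & (x1 < 15)].std()
hi = resid[(x1 > 85) & (x1 < 95)].std()
print(lo, hi)  # the spread is several times larger in the high-x1 band
```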