
In linear regression I often see homoscedasticity and independence of errors listed as assumptions (for example on Wikipedia). But I would think that independence of errors would imply homoscedasticity. Look at this error plot example: [Error plot]

Could the errors be independent? I would think that they are dependent, in the sense that a large error would mean an increased chance of a large error on the following observation. What am I missing?

bgst
  • You confuse serial correlation with variance. – whuber Jan 09 '21 at 17:52
  • So independence means serial correlation? In the sense that the value of one error should predict (with some uncertainty) the value of the next one? – bgst Jan 09 '21 at 18:12
  • Independence implies *no* serial correlation. It doesn't imply anything about heteroscedasticity. – whuber Jan 09 '21 at 18:13
  • It is not true that independence implies nothing about heteroscedasticity. It implies no correlation, and no correlation among the squared terms either, which is the usual kind of heteroscedasticity. Other forms go away as well. However, a bit of formalization about the model and the assumptions you refer to is needed. Terminology can produce ambiguities. – markowitz Jan 09 '21 at 19:50
  • @bgst; from the graph I suppose that your data are cross-sectional. Is that so? – markowitz Jan 09 '21 at 20:42
  • @markowitz Although most of your comment is garbled, making it hard to decipher, the initial "it is not true" statement just isn't so. I don't think there's a terminological issue, because "independent," "heteroscedastic," and "correlated" are very, very standard and well known terms. – whuber Jan 09 '21 at 21:00
  • @markowitz, the graph is supposed to be an error plot (the x-axis is the predictor and the y-axis is the error). – bgst Jan 09 '21 at 21:02
  • @whuber, you say that independence implies no serial correlation. I understand that. But what I don't understand is *why* it doesn't imply homoscedasticity as well. From my plot it is obvious that we can use one error to say something about the magnitude of the following error. We can't say anything about the value though, and maybe that is the meaning of "independence" in this context? – bgst Jan 09 '21 at 21:17
  • The *independence* of variables says something about their *joint* probability function. *Heteroscedasticity,* in the broadest possible sense, means the variables have different *marginal* distributions. – whuber Jan 09 '21 at 22:32
  • I think that I get it now. Independence means that P(A∩B)=P(A)*P(B) where A is an event for an error and B is an event for the following error. Heteroscedasticity means that A and B follow the same distribution. Would that be accurate? – bgst Jan 10 '21 at 00:31
  • What you show represents independence between two events. Here we are interested in independence between two, or more, random variables. Moreover, we have to be careful about which random variables we are interested in (these are the ambiguities I meant before). Several misconceptions circulate around the topic of "regression"; I suggest you rely on one book only, at the beginning. However, in one widely shared definition heteroscedasticity is something like $\sigma^2_i = x_i \sigma^2$. So in this case the variance of the error depends on the predictors. – markowitz Jan 10 '21 at 09:35
  • We can consider them ($\sigma^2_i$ and $x_i$) as two random variables, and they are not independent. Indeed, a graph like yours is frequently used precisely to show how heteroskedasticity appears. – markowitz Jan 10 '21 at 09:35
  • I think the confusion arises from what we consider as the random variables. If it is the predictor and the corresponding error, then independence implies homoscedasticity. If it is two consecutive errors, then we can have independence and heteroscedasticity. – bgst Jan 10 '21 at 11:11
  • Indeed, the confusion you describe is the most evident one. Moreover, some other problems can appear; that is where my earlier requests for clarification came from. Maybe later I will add something about this in my answer. – markowitz Jan 10 '21 at 13:37
  • I edited my answer. I hope it can help clear up most of the confusion. – markowitz Jan 11 '21 at 14:46

2 Answers


What you are missing is information about a study's design. Independence is something that comes from the study design - it is NOT implied by homoscedasticity.

To give you a simple example, imagine that you are measuring temperature (degrees Celsius) at your local airport using a sensor, with the goal being to see how temperature changes over time. If you measure temperature every day (once per day, say at noon), then you can expect the resulting daily values of temperature to be correlated with each other if they come from days which are close to each other. If you aggregated the daily temperature values collected within each year to get a yearly temperature value, then it is possible that the yearly temperature values may no longer be correlated over time (because the temporal distance of one year between consecutive values is large enough).

Staying with this example of measuring a response variable (e.g., temperature) over time (e.g., every year) and trying to regress that variable against time, you can expect to encounter all possible combinations of situations, depending on what variable you are measuring and how frequent your measurements are:

(1) independent errors and homoscedasticity;
(2) independent errors and heteroscedasticity;
(3) temporally dependent errors and homoscedasticity;
(4) temporally dependent errors and heteroscedasticity.

This fact alone hints that you are wrong to assume that independence implies homoscedasticity; if that were the case, we would never encounter situation (2) in practice. However, the statistical literature is full of examples to the contrary!
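For instance, a quick simulation sketch of situation (2) could look like the following (NumPy, with arbitrary made-up numbers): each error is drawn independently of all the others, yet the spread grows over time, so the errors are heteroscedastic while remaining independent.

```python
# Minimal sketch: independent but heteroscedastic errors (situation 2).
import numpy as np

rng = np.random.default_rng(0)

t = np.arange(1, 201)                # "time" index, e.g. years
sigma_t = 0.5 + 0.02 * t             # spread grows with time -> heteroscedastic
errors = rng.normal(0.0, sigma_t)    # each error drawn independently of the others

# Independence across observations: the lag-1 sample correlation is near zero.
lag1_corr = np.corrcoef(errors[:-1], errors[1:])[0, 1]
print(f"lag-1 correlation of errors: {lag1_corr:.3f}")

# Heteroscedasticity: the spread in the second half exceeds that of the first.
print(f"std, first half : {errors[:100].std():.2f}")
print(f"std, second half: {errors[100:].std():.2f}")
```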

@whuber already hinted that you are confusing two distinct concepts: dependence of errors and variance of errors.

In the context of the temperature versus time example, the variance of errors simply quantifies the amount of spread you can expect to encounter in your temperature values about the underlying temporal trend in these values. Under the assumption of homoscedasticity, the amount of spread is unaffected by the passage of time (i.e., it remains constant over time). However, under the assumption of heteroscedasticity, the amount of spread is affected by the passage of time - for example, the amount of spread can increase over time. Spread is about how far you can expect an observed temperature value to be at time $t$ relative to what is 'typical' for that time $t$.

The dependence of errors looks at something different altogether: if you know something about the value of temperature at the current time $t$, does that tell you anything about the value of temperature at time $t + 1$? If the temperature values are independent from each other, knowing that temperature is high today should have no bearing on what the temperature value will be tomorrow. If the temperature values are NOT independent from each other (e.g., they are positively correlated), knowing that temperature is high today will tell you that the temperature value will also be high tomorrow. How high it will be will depend on the strength of the correlation.
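A similar quick sketch can illustrate the dependence question separately from the variance question: below, both error series have constant marginal variance (homoscedastic), but only the second one is serially dependent (an AR(1) process with a coefficient of 0.8, chosen arbitrarily for illustration), so knowing today's error tells you something about tomorrow's.

```python
# Sketch: homoscedastic errors, with and without serial dependence.
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Independent errors: today's deviation says nothing about tomorrow's.
indep = rng.normal(0.0, 1.0, n)

# AR(1) errors: a high deviation today tends to be followed by a high one tomorrow.
shocks = rng.normal(0.0, 1.0, n)
ar1 = np.zeros(n)
ar1[0] = shocks[0] / np.sqrt(1 - 0.8 ** 2)   # start from the stationary distribution
for i in range(1, n):
    ar1[i] = 0.8 * ar1[i - 1] + shocks[i]

for name, e in [("independent", indep), ("AR(1)", ar1)]:
    lag1 = np.corrcoef(e[:-1], e[1:])[0, 1]
    print(f"{name:12s} lag-1 correlation: {lag1:.2f}")
```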

Isabella Ghement
  • Let's say that the temperatures from your example behave as in the error plot in my original post. I think I'm supposed to say that it looks like independent errors and heteroscedasticity, right? But if the temperature at t is extreme (very high or very low), it is more likely that the temperature at t+1 is extreme? Isn't this called dependence? Shouldn't the temperature at t+1 be completely unrelated to the temperature at t if there is independence? – bgst Jan 09 '21 at 23:43
  • @Isabella Ghement, you are right that study design and context matter. However, situations 2 and 3 of your list seem problematic to me. Some equations could help us understand each other. – markowitz Jan 10 '21 at 00:03
  • You cannot assess independence or dependence from the plot you shared in your post. If you truly were in a setting where your response variable was temperature and your predictor variable was time (e.g., year), you would assess whether temporal dependence is present among the model errors by examining ACF and PACF plots of the residuals from a simple linear regression of temperature on time. Note that these plots make sense when time is measured at equally spaced intervals (e.g., every year). – Isabella Ghement Jan 10 '21 at 00:57
  • I provided my 2 cents - now it is up to you to read up more on this subject and make sure you understand the flaws in your thinking. Cross Validated can point you in the right direction but you need to go deeper and do more self-study. You had multiple people confirm that you are not framing this properly - this should be enough to make you question your current understanding of independence and variability. – Isabella Ghement Jan 10 '21 at 01:04
  • By the way, if temperature and time in my example were linearly related (e.g., temperature tends to increase over time), there is a way to look for signs of positive temporal dependence among temperature values in the scatterplot of temperature versus time: you would just look for a series of several consecutive temperature values located above the linear trend line, followed by a series of several consecutive values of temperature values located below the trend line, and so on. This is totally different from looking for increasing/decreasing spread in temperature values over time! – Isabella Ghement Jan 10 '21 at 01:11

As I said at the beginning in the comments, some more detail would be useful to avoid ambiguities and misconceptions. Anyway, I propose here a fairly general, basic answer.

Start by defining the linear regression model as

$y = X'\beta + \epsilon$

In a widely shared definition (for example from Bruce Hansen) heteroskedasticity is:

$V[\epsilon|X=x] = E[\epsilon^2|X=x] = \sigma^2(x)$ (a function of $X$)

while homoskedasticity means that

$V[\epsilon|X=x] = E[\epsilon^2|X=x] = \sigma^2$ (a positive constant)

Therefore, if the regression error ($\epsilon$) and the regressors ($X$) are independent, we cannot have heteroskedasticity, so homoskedasticity must hold. In your (@bgst) plot the residuals are shown as a function of $X$, so heteroskedasticity (defined as above) appears evident. Indeed, a graph like that is frequently used to show how heteroskedasticity appears.
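To spell out that step under the definitions above: independence between $\epsilon$ and $X$ means the conditional distribution of $\epsilon$ given $X=x$ does not depend on $x$, so (assuming the second moment exists)

$V[\epsilon|X=x] = E[\epsilon^2|X=x] = E[\epsilon^2] = \sigma^2$ for every $x$,

which is exactly the homoskedasticity condition.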

However heteroskedasticity can be defined differently. For example:

$V[\epsilon|Z=z] = E[\epsilon^2|Z=z] = \sigma^2(z)$

where $Z$ is a set of variables not necessarily shared with the regressor set. So even if the regression error ($\epsilon$) and the regressors ($X$) are independent, heteroskedasticity can appear.

Moreover, heteroskedasticity is sometimes presented as a property of the error covariance matrix, and hence as temporal or spatial dependence among the errors. In a time-series framework a GARCH process is a common example of heteroskedasticity, and it implies some dependence among the errors; more precisely, it implies correlation among some squared error terms.

The last case can be included in the definition $V[\epsilon|Z=z] = \sigma^2(z)$

where $Z$ represents past values of the squared errors.
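A tiny simulation sketch (an ARCH(1)-type process with arbitrary coefficients, used only as an illustration) shows this pattern: the errors themselves are serially uncorrelated, but their squares are correlated, which is the conditional heteroskedasticity of the GARCH family.

```python
# Sketch: ARCH(1)-type errors, uncorrelated in levels but correlated in squares.
import numpy as np

rng = np.random.default_rng(2)
n = 5000
eps = np.zeros(n)
for i in range(1, n):
    sigma2_i = 0.2 + 0.5 * eps[i - 1] ** 2   # variance driven by the past squared error
    eps[i] = np.sqrt(sigma2_i) * rng.normal()

lag1_level = np.corrcoef(eps[:-1], eps[1:])[0, 1]
lag1_square = np.corrcoef(eps[:-1] ** 2, eps[1:] ** 2)[0, 1]
print(f"lag-1 correlation of errors        : {lag1_level:.2f}")   # close to 0
print(f"lag-1 correlation of squared errors: {lag1_square:.2f}")  # clearly positive
```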

The message is this: the concept of heteroskedasticity has much to do with that of stochastic dependence; however, context, definitions, and details matter a lot.

markowitz
  • But if heteroscedasticity disappears, then there is homoscedasticity, right? So you are saying that independence implies homoscedasticity? – bgst Jan 09 '21 at 21:43
  • Exactly, that is so. – markowitz Jan 09 '21 at 21:52
  • So why list homoscedasticity as a separate assumption if it is implied by independence? – bgst Jan 09 '21 at 22:38
  • The list of assumptions needed is debatable; it is not stated identically in every reference, and context matters. However, neither fully independent errors nor homoscedasticity is mandatory. Some detail about your model could help. In your case, though, heteroscedasticity appears and the residuals are not independent of the predictors. Since you are new here, I remind you that if you find my reply helpful you can upvote it. – markowitz Jan 09 '21 at 23:25
  • I'm very thankful for your replies but right now they are contradicting the other replies as far as I can tell. So I need to figure out who is right. – bgst Jan 09 '21 at 23:45
  • "Therefore if regression error (ϵ) and regressors (X) are independent we cannot have heteroskedasticity" is confusing because it refers to a *different* sense of independence than used in the question! "Independence of errors" means the $\epsilon_i$ are conditionally independent (given $X$), not that the errors are independent of $X.$ – whuber Jan 11 '21 at 14:49
  • My reply is not confusing; at most the question is. The title is about independence "of errors," not independence "among errors." I feared from the beginning that ambiguities like that could appear, and I did not know what the asker had in mind; indeed, I asked for clarification. Later bgst replied that in the graph (where the question comes from) "x (axis) is predictor and y (axis) is error". For this reason I considered $\sigma^2(x)$ as the first case. – markowitz Jan 11 '21 at 15:08