One can break this system into $T$ different univariate multiple regressions or, equivalently, reshape the LHS so that instead of an $N \times T$ matrix one has a column vector with $NT$ elements. Using this formalism, we can write the system above as
$$ y_{i}^t = \sum_j x_{ij}w_{j}^t + \epsilon_{i}^t \Leftrightarrow \textbf{Y}^t_{N \times 1} = \textbf{X}_{N \times K} \textbf{W}^t_{K \times 1} +\textbf{E}^t_{N \times 1} $$
which is essentially the same equation, except that $t$ is now a superscript to emphasize that we are only interested in the linear system corresponding to the $t$-th component (column) of $\textbf{Y}$. As explained on Wikipedia for a univariate regression problem, one can find the best estimate of $\textbf{W}^t$ (denoted $\beta$ there) from the following formula:
$$ \hat{\textbf{W}}^t = [(\textbf{X}^T\textbf{X})^{-1}\textbf{X}^{T}]\textbf{Y}^t $$
Therefore, the best estimate (in the least-squares sense) of the weights for the $t$-th univariate system is nothing but the product of a matrix that is constant over $t$ and $\textbf{Y}^t$. Having this, I can now partially answer my question:
As long as the observation matrix is such that $\textbf{X}^T\textbf{X}$ is invertible, the weights for the different components $t$ can be estimated completely independently, since only the observations enter the bracketed matrix above and not the labels. As a result, no special issue arises during the estimation process.
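As a quick sanity check of this claim (a minimal sketch with arbitrary toy dimensions, unrelated to the data used later), one can compute the bracketed matrix once, apply it to each column of $\textbf{Y}$ separately, and compare the result with a single joint least-squares solve:
import numpy as np
N, K, T = 50, 5, 3  # toy sizes: samples, independent variables, dependent variables
X = np.random.normal(size=(N, K))
Y = np.random.normal(size=(N, T))
# the bracketed matrix, constant over t
P = np.linalg.inv(X.T @ X) @ X.T
# estimate each W^t separately ...
W_separate = np.column_stack([P @ Y[:, t] for t in range(T)])
# ... and all of them at once in a joint least-squares solve
W_joint, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(W_separate, W_joint))  # True: the per-component estimates coincide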
The correlation between components (for example, the purely linear dependency between $k$ and $k'$ as above) does exist anyway. I checked it using scikit-learn in the following manner: assume there is a simple relationship between the last and the first two dependent variables:
$$ \textbf{Y}^T = 1+2\textbf{Y}^1 + 3 \textbf{Y}^2$$
and we compute our best estimate of $\textbf{W}^t$. In particular (ignoring the noise for the moment) we have
$$
\textbf{Y}^T = 1 + 2(c_1 + \textbf{X}\hat{\textbf{W}}^1) + 3(c_2 + \textbf{X}\hat{\textbf{W}}^2) = c_T + \textbf{X}\hat{\textbf{W}}^T
$$
which implies $c_T = 1 + 2c_1 + 3c_2$ and $\hat{\textbf{W}}^T = 2\hat{\textbf{W}}^1 + 3\hat{\textbf{W}}^2$. So I investigated the equality of $\hat{\textbf{W}}^T$ and $2 \hat{\textbf{W}}^1 + 3 \hat{\textbf{W}}^2$:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import linear_model

N = 100  # number of samples
K = 10   # number of independent variables
T = 4    # number of dependent variables
X = np.random.normal(5, 3, (N, K))
W = np.random.normal(0, 10, (K, T-1))
# let's assume that the first T-1 dependent variables follow an exact linear regression
y = np.dot(X, W)
# let's assume that the last dependent variable depends on the first two via
# Y_T = 1 + 2*Y_1 + 3*Y_2
y = np.c_[y, (1 + 2*y[:, 0] + 3*y[:, 1]).reshape(N, 1)]
# just adding some noise
y += np.random.normal(0, 2, (N, T))
model = linear_model.LinearRegression()
model.fit(X, y)
# model.coef_ has shape (T, K): one row of weights per dependent variable
# comparing the weights implied by the constraint with those found by the fit
plt.plot(2*model.coef_[0, :] + 3*model.coef_[1, :], label='from constraint')
plt.plot(model.coef_[-1, :], marker='x', linestyle='', color='r', label='from optimization')
plt.legend()
plt.show()
which confirms the equality perfectly:

From my point of view, the whole confusion is due to the fact that my dependent variables are not linearly independent; the model (despite the correctness of its estimates) contains redundancy and is better avoided.
Maybe a better question at this point is how one can avoid making such a mistake, namely wrongly assuming that the labels are independent.
I have a handy answer to that, which admittedly is not the noblest one: computing the correlation between the multiple labels. Another approach would be to compute the ratio of the mutual information of the labels to their joint entropy, $I(\textbf{Y}_i, \textbf{Y}_j) / H(\textbf{Y}_i,\textbf{Y}_j)$, as a metric that also captures nonlinear dependencies (a minimal sketch of estimating this ratio is given below).
In my dummy example, the correlation approach leads to the following matrix:
import pandas as pd
df = pd.DataFrame(data=y, columns=['a', 'b', 'c', 'd'])
fig = plt.figure()
ax = fig.gca()
# heatmap of the pairwise correlations between the labels
pos = ax.imshow(df.corr(), cmap='Blues')
fig.colorbar(pos, ax=ax)
plt.show()

I'm open to any suggestion for something that can reveal the dependency between the labels (in a regression problem) other than a simple correlation matrix.
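For the mutual-information-based ratio mentioned above, here is a minimal histogram-based sketch, reusing y from the script above (the estimator and the number of bins are arbitrary choices of mine, so take it as an illustration rather than a recommendation):
def normalized_mi(a, b, bins=20):
    # histogram-based estimate of I(a, b) / H(a, b) for two 1-D samples
    joint, _, _ = np.histogram2d(a, b, bins=bins)
    p_xy = joint / joint.sum()                 # joint distribution
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal of a
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal of b
    nz = p_xy > 0                              # avoid log(0)
    mi = np.sum(p_xy[nz] * np.log(p_xy[nz] / (p_x @ p_y)[nz]))
    h_xy = -np.sum(p_xy[nz] * np.log(p_xy[nz]))
    return mi / h_xy

# e.g. applied to the first and last labels of the toy data generated above
print(normalized_mi(y[:, 0], y[:, 3]))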
P.S.: Since I realized that the main issue is not the structure of $\mathrm{cov}[\textbf{E}]$ but the flawed assumption that the coefficients are independent, I didn't dive into the differences between the two approaches when it comes to the noise term. Nevertheless, this question highlights some differences between the two which might be fruitful for answering this one as well.