ANCOVA-style regression with PCA on the covariates

Question

I anticipate having to run a t-test with multiple covariates, so an ANCOVA-style problem, but with covariates that are correlated with each other (but not with the group variable).

To avoid dubious standard errors on the parameter estimates, I thought that I would use PCA on the covariates and then retain all of the PCs. This way, I keep all of the information in the covariates but avoid the issue of correlations between them wrecking my standard errors. Since I don't care to do inference on the covariates, this made sense to me. I went ahead with a simulation to see if my plan would give me added power and maintain the type I error rate.

Using an intercept of $3$ and a group variable coefficient of $0.2$, I got as far as the attached code when I encountered this:

Output

                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.648
Method:                 Least Squares   F-statistic:                     23.56
Date:                Sat, 06 Jun 2020   Prob (F-statistic):           1.49e-10
Time:                        18:27:45   Log-Likelihood:                -65.894
No. Observations:                  50   AIC:                             141.8
Df Residuals:                      45   BIC:                             151.3
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2754      0.189     12.052      0.000       1.895       2.656
x1             1.0204      0.273      3.741      0.001       0.471       1.570
x2             0.8992      0.256      3.511      0.001       0.383       1.415
x3            -1.0757      0.251     -4.286      0.000      -1.581      -0.570
x4            -0.9662      0.313     -3.091      0.003      -1.596      -0.337
==============================================================================
Omnibus:                        0.231   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.891   Jarque-Bera (JB):                0.429
Skew:                           0.033   Prob(JB):                        0.807
Kurtosis:                       2.551   Cond. No.                         4.35
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS (PCA-style) Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.648
Method:                 Least Squares   F-statistic:                     23.56
Date:                Sat, 06 Jun 2020   Prob (F-statistic):           1.49e-10
Time:                        18:27:45   Log-Likelihood:                -65.894
No. Observations:                  50   AIC:                             141.8
Df Residuals:                      45   BIC:                             151.3
Df Model:                           4
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          6.4051      1.030      6.217      0.000       4.330       8.480
x1            -7.6244      2.128     -3.583      0.001     -11.910      -3.338
x2            -0.9076      0.110     -8.226      0.000      -1.130      -0.685
x3             8.3323      2.034      4.096      0.000       4.236      12.429
x4            -2.7167      0.633     -4.291      0.000      -3.992      -1.442
==============================================================================
Omnibus:                        0.231   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.891   Jarque-Bera (JB):                0.429
Skew:                           0.033   Prob(JB):                        0.807
Kurtosis:                       2.551   Cond. No.                         36.4
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

The intercept and the coefficient on the group variable (x1) are way off in the PCAed model! The estimate for the group variable in the non-PCA model is off for this particular seed, but when I change the seed, the confidence interval from the model on the original data tends to capture $0.2$, while the PCA model is way off almost every time.

This plan made a lot of sense to me, yet it seems to have serious issues. Have I made a coding error? Have I missed something about principal components? What's going on?

One idea I had was to take the p-value from the PCAed model but the point estimate from the model on the original data. But what if I want a confidence interval for the coefficient?

import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
import scipy.stats

np.random.seed(2020)

# Define sample size
#
N = 50

# Define the parameter 4-vector WITHOUT an intercept
#
beta_1 = np.array([0.2, 1, -1, -1])

# Define categorical predictor
#
g = np.random.binomial(1, 0.5, N)

# Define covariance matrix of covariates
#
S = np.array([[1, -0.8, 0.7], [-0.8, 1, -0.8], [0.7, -0.8, 1]])

# Define matrix of covariates
#
covs = np.random.multivariate_normal(np.array([0, 0, 0]), S, N)

# Combine all predictors into one matrix
#
X = np.c_[g, covs]

# Make three PCs and add them to g to give the PCAed model matrix
#
pca = PCA(n_components=3)
pca.fit(X)
diag = pca.transform(X)
X_pca = np.c_[g, diag]

# Simulate the expected value of the response variable
#
y_hat = np.matmul(X, beta_1)

# Simulate error term, using the mean as the intercept, beta_0
#
err = np.random.normal(3, 1, N)

# Simulate response variable
#
y = y_hat + err

# Fit full model on original data
#
orig = sm.OLS(y, sm.tools.add_constant(X)).fit()

# Fit full model on PCAed data
#
pca_ed = sm.OLS(y, sm.tools.add_constant(X_pca)).fit()

print(orig.summary())
print(pca_ed.summary())

score 1 · Answer 1 · answered Jun 08 '20 at 18:07

The effect of g (0.2) is small relative to the error term, which is N(3,1), so it will be really hard to separate what goes into the intercept from what goes into g. I re-ran it with

beta_1 = np.array([2, 1, -1, -1])
err = np.random.normal(0, 1, N)

That got somewhat closer to the true value. As for why the coefficients are off, I saw this in the code:

pca = PCA(n_components=3)
pca.fit(X)
diag = pca.transform(X)
X_pca = np.c_[g, diag]

All the predictors, including g, are PCA-transformed; the first 3 components are kept and then combined with g again. This means you are putting g back together with PCs that are themselves linear combinations of g:

pca = PCA(n_components=3)
pca.fit(X)
diag = pca.transform(X)
X_pca = np.c_[g, diag]
np.round(np.corrcoef(X_pca.T),3)

array([[ 1.   , -0.099,  0.955, -0.25 ],
       [-0.099,  1.   , -0.   , -0.   ],
       [ 0.955, -0.   ,  1.   , -0.   ],
       [-0.25 , -0.   , -0.   ,  1.   ]])

You can see that g is correlated with the PCs, most strongly with the second (0.955), which defeats the purpose. Maybe try something like:

pca = PCA(n_components=3)
pca.fit(X[:,1:])
diag = pca.transform(X[:,1:])
X_pca = np.c_[g, diag]
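
To illustrate why it matters which columns go into the PCA, here is a minimal sketch using NumPy only. The PCA is done by hand through an SVD of the centered columns, as a stand-in for sklearn's `PCA`; the seed, the sample size, and the helper names `pca_scores` and `group_coef` are assumptions made up for this example, not part of the original code. It compares the coefficient on g when the PCs are built from all four columns with the coefficient when they are built from the three covariates alone.

```python
import numpy as np

rng = np.random.default_rng(1)  # arbitrary seed, for illustration only
n = 200

# Same structure as the question: binary group, three correlated covariates.
g = rng.binomial(1, 0.5, n).astype(float)
S = np.array([[1.0, -0.8, 0.7], [-0.8, 1.0, -0.8], [0.7, -0.8, 1.0]])
covs = rng.multivariate_normal(np.zeros(3), S, n)
y = 3 + 0.2 * g + covs @ np.array([1.0, -1.0, -1.0]) + rng.normal(0, 1, n)


def pca_scores(data, n_components):
    """Hand-rolled PCA: center columns, project onto leading right singular vectors."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


def group_coef(other_columns):
    """Fit y ~ 1 + g + other_columns by least squares; return the coefficient on g."""
    design = np.column_stack([np.ones(n), g, other_columns])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef[1]


b_raw = group_coef(covs)                                         # original covariates
b_mixed = group_coef(pca_scores(np.column_stack([g, covs]), 3))  # g leaks into the PCs
b_clean = group_coef(pca_scores(covs, 3))                        # PCs from covariates only

print(f"raw covariates:    {b_raw:.4f}")
print(f"PCs of [g, covs]:  {b_mixed:.4f}")
print(f"PCs of covs only:  {b_clean:.4f}")
```

In this sketch the first and third numbers should agree up to floating-point error, while the second generally differs: once the PCs contain part of g, the coefficient on g no longer measures the group effect on its own, even though the overall fit can be unchanged (the two summaries in the question report the same R-squared).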

StupidWolf

I've found a mistake in my code. Where I did `pca.fit(X)` and `diag = pca.transform(X)`, I should have used `covs` instead of `X`. Having made the change, I get the same parameter estimate on `g` either way, but not the same intercept. Any ideas what's going on? Importantly, however, in some other simulations I've run, the standard errors increase for the parameter estimates of variables that are correlated with other predictors, but not for the parameter estimate of the uncorrelated predictor, even if the other variables are correlated with each other. – Dave Jun 09 '20 at 00:08
My guess is that the PCA is centered but the covs aren't, or that it's really splitting the effect between the intercept and g. On the second point: do you mean that the standard errors on the correlated variables are high while those on the PCs are low? – StupidWolf Jun 12 '20 at 05:19
The standard errors of the coefficients are obtained as the square root of the diagonal of the unscaled coefficient covariance matrix multiplied by the MSE. If it is like the second example, your coefficients are larger, so their variances (covariance with themselves) will be larger. Your R-squared is the same, so the MSE remains unchanged. – StupidWolf Jun 12 '20 at 05:46
https://stats.stackexchange.com/questions/44838/how-are-the-standard-errors-of-coefficients-calculated-in-a-regression – StupidWolf Jun 12 '20 at 05:46
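
For reference, the standard OLS formulas behind the last comment can be written out as follows. The notation is an assumption of this note, not taken from the thread: $X$ is the design matrix including the constant column, $n$ its number of rows, $p$ its number of columns, and $R_j^2$ the R-squared from regressing predictor $j$ on the remaining predictors.

$$
\widehat{\operatorname{Var}}(\hat{\beta}) = \hat{\sigma}^2 (X^\top X)^{-1},
\qquad
\hat{\sigma}^2 = \frac{\lVert y - X\hat{\beta} \rVert^2}{n - p},
\qquad
\operatorname{SE}(\hat{\beta}_j) = \sqrt{\hat{\sigma}^2 \left[(X^\top X)^{-1}\right]_{jj}}
$$

For any non-intercept predictor in a model with a constant, the same quantity can be rewritten as

$$
\operatorname{SE}(\hat{\beta}_j)^2 = \frac{\hat{\sigma}^2}{\left(1 - R_j^2\right)\sum_{i=1}^{n} (x_{ij} - \bar{x}_j)^2}
$$

so a coefficient's standard error is inflated only through that predictor's own $R_j^2$. A predictor that is nearly uncorrelated with the others has $R_j^2$ close to $0$, however strongly the other predictors are correlated among themselves.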

score 0 · Accepted Answer · answered Jun 12 '20 at 11:54

The answer is that the way I call PCA from sklearn results in the covariates being centered to have $0$ mean (but not unit variance).

Linear dimensionality reduction using Singular Value Decomposition of the data to project it to a lower dimensional space. The input data is centered but not scaled for each feature before applying the SVD.

(The emphasis is mine.)
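
A short derivation shows why this centering moves only the intercept. The symbols are assumptions of this sketch: $x_i$ is the row vector of covariates for observation $i$, $\bar{x}$ is its sample mean, $W$ is the square orthogonal matrix of principal directions when all components are kept, $z_i$ is the row of PC scores, and $\gamma$ is the column vector of covariate coefficients.

$$
z_i = (x_i - \bar{x})\,W, \qquad W W^\top = I \quad\Longrightarrow\quad x_i = \bar{x} + z_i W^\top
$$

$$
\beta_0 + \beta_1 g_i + x_i \gamma = \left(\beta_0 + \bar{x}\gamma\right) + \beta_1 g_i + z_i \left(W^\top \gamma\right)
$$

So the model on the PCs has intercept $\beta_0 + \bar{x}\gamma$, the coefficient on the group variable is unchanged, and the two intercepts coincide exactly when $\bar{x} = 0$, that is, when the covariates in the original model are centered as well.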

I also found another mistake in the PCA code: I was calling PCA on all four predictors, not just the three covariates. This explains why $\hat{\beta_1}$ was changing between the two models...$x_1$ wasn't the same in both!

When I center the covariates and only do PCA on them, I get the same intercept and $\hat{\beta_1}$ (code and output below).

What I have found is that the standard error inflates for a predictor that is correlated with another predictor, but the standard error on $\hat{\beta_1}$ remains about the same whether the covariates are correlated or not. So running PCA on the covariates in an ANCOVA-style regression problem with multiple correlated covariates does not help.
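
One way to check this numerically, as a rough sketch rather than a proof, is through variance inflation factors (VIFs): the standard error of a coefficient is inflated by how well that predictor is explained by the other predictors, and by nothing else in the design. The code below uses NumPy only. The seed, the sample size, and the `vifs` helper are assumptions made for illustration, and the PCA is done by hand through an SVD instead of sklearn. VIFs are read off the diagonal of the inverse correlation matrix of the predictors.

```python
import numpy as np

rng = np.random.default_rng(2)  # arbitrary seed, for illustration only
n = 500

# Binary group variable, independent of three mutually correlated covariates.
g = rng.binomial(1, 0.5, n).astype(float)
S = np.array([[1.0, -0.8, 0.7], [-0.8, 1.0, -0.8], [0.7, -0.8, 1.0]])
covs = rng.multivariate_normal(np.zeros(3), S, n)

# Full-rank PCA of the covariates only (centered, not scaled).
centered = covs - covs.mean(axis=0)
_, _, vt = np.linalg.svd(centered, full_matrices=False)
pcs = centered @ vt.T


def vifs(predictors):
    """VIF_j = 1 / (1 - R_j^2), the diagonal of the inverse correlation matrix."""
    return np.diag(np.linalg.inv(np.corrcoef(predictors, rowvar=False)))


vif_raw = vifs(np.column_stack([g, covs]))  # order: g, cov0, cov1, cov2
vif_pca = vifs(np.column_stack([g, pcs]))   # order: g, PC1, PC2, PC3

print("VIFs with raw covariates:", np.round(vif_raw, 3))
print("VIFs with PCs:           ", np.round(vif_pca, 3))
```

In this sketch the VIF for g should be close to 1 and identical in both parameterizations, because the covariates and their full set of PCs span the same space. The inflation sits on the raw covariates, and rotating them into PCs removes it there without changing anything for g.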

import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA

np.random.seed(2020)

# Define sample size
#
N = 50

# Define the parameter 4-vector WITHOUT an intercept
#
beta_1 = np.array([0.2, 1, -1, -1])

# Define categorical predictor
#
g = np.random.binomial(1, 0.5, N)

# Define covariance matrix of covariates
#
S = np.array([[1, -0.8, 0.7], [-0.8, 1, -0.8], [0.7, -0.8, 1]])

# Define matrix of covariates
#
covs = np.random.multivariate_normal(np.array([0, 0, 0]), S, N)

# Center the covariates (subtract each column's mean)
#
covs = covs - covs.mean(axis=0)

# Combine all predictors into one matrix
#
X = np.c_[g, covs]

# Make three PCs and add them to g to give the PCAed model matrix
#
pca = PCA(n_components=3)
pca.fit(covs)
diag = pca.transform(covs)
X_pca = np.c_[g, diag]

# Simulate the expected value of the response variable
#
y_hat = np.matmul(X, beta_1)

# Simulate error term, using the mean as the intercept, beta_0
#
err = np.random.normal(3, 1, N)

# Simulate response variable
#
y = y_hat + err

# Fit full model on original data
#
orig = sm.OLS(y, sm.tools.add_constant(X)).fit()

# Fit full model on PCAed data
#
pca_ed = sm.OLS(y, sm.tools.add_constant(X_pca)).fit()

print(orig.summary())
print(pca_ed.summary())

                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.648
Method:                 Least Squares   F-statistic:                     23.56
Date:                Fri, 12 Jun 2020   Prob (F-statistic):           1.49e-10
Time:                        07:53:13   Log-Likelihood:                -65.894
No. Observations:                  50   AIC:                             141.8
Df Residuals:                      45   BIC:                             151.3
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2753      0.188     12.111      0.000       1.897       2.654
x1             1.0204      0.273      3.741      0.001       0.471       1.570
x2             0.8992      0.256      3.511      0.001       0.383       1.415
x3            -1.0757      0.251     -4.286      0.000      -1.581      -0.570
x4            -0.9662      0.313     -3.091      0.003      -1.596      -0.337
==============================================================================
Omnibus:                        0.231   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.891   Jarque-Bera (JB):                0.429
Skew:                           0.033   Prob(JB):                        0.807
Kurtosis:                       2.551   Cond. No.                         4.32
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.677
Model:                            OLS   Adj. R-squared:                  0.648
Method:                 Least Squares   F-statistic:                     23.56
Date:                Fri, 12 Jun 2020   Prob (F-statistic):           1.49e-10
Time:                        07:53:13   Log-Likelihood:                -65.894
No. Observations:                  50   AIC:                             141.8
Df Residuals:                      45   BIC:                             151.3
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2753      0.188     12.111      0.000       1.897       2.654
x1             1.0204      0.273      3.741      0.001       0.471       1.570
x2            -0.6313      0.087     -7.233      0.000      -0.807      -0.455
x3            -0.3441      0.285     -1.207      0.234      -0.918       0.230
x4            -1.5435      0.371     -4.164      0.000      -2.290      -0.797
==============================================================================
Omnibus:                        0.231   Durbin-Watson:                   2.074
Prob(Omnibus):                  0.891   Jarque-Bera (JB):                0.429
Skew:                           0.033   Prob(JB):                        0.807
Kurtosis:                       2.551   Cond. No.                         4.32
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.