I anticipate having to run a t-test with multiple covariates, so an ANCOVA-style problem, but with covariates that are correlated with each other (but not with the group variable).
To get out of issues related to dubious standard errors on the parameter estimates, I thought that I would use PCA on the covariates and then retain all of the PCs. This way, I keep all of the information in the covariates but avoid the issue of correlations between then wrecking my standard errors. Since I don't care to do inference on the covariates, this made sense to me. I went ahead with a simulation to see if my plan would give me added power and maintain the type I error rate.
Using an intercept of $3$ and a group variable coefficient of $0.2$, I got as far as the attached code when I encountered this:
Output
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.677
Model: OLS Adj. R-squared: 0.648
Method: Least Squares F-statistic: 23.56
Date: Sat, 06 Jun 2020 Prob (F-statistic): 1.49e-10
Time: 18:27:45 Log-Likelihood: -65.894
No. Observations: 50 AIC: 141.8
Df Residuals: 45 BIC: 151.3
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 2.2754 0.189 12.052 0.000 1.895 2.656
x1 1.0204 0.273 3.741 0.001 0.471 1.570
x2 0.8992 0.256 3.511 0.001 0.383 1.415
x3 -1.0757 0.251 -4.286 0.000 -1.581 -0.570
x4 -0.9662 0.313 -3.091 0.003 -1.596 -0.337
==============================================================================
Omnibus: 0.231 Durbin-Watson: 2.074
Prob(Omnibus): 0.891 Jarque-Bera (JB): 0.429
Skew: 0.033 Prob(JB): 0.807
Kurtosis: 2.551 Cond. No. 4.35
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
OLS (PCA-style) Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.677
Model: OLS Adj. R-squared: 0.648
Method: Least Squares F-statistic: 23.56
Date: Sat, 06 Jun 2020 Prob (F-statistic): 1.49e-10
Time: 18:27:45 Log-Likelihood: -65.894
No. Observations: 50 AIC: 141.8
Df Residuals: 45 BIC: 151.3
Df Model: 4
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 6.4051 1.030 6.217 0.000 4.330 8.480
x1 -7.6244 2.128 -3.583 0.001 -11.910 -3.338
x2 -0.9076 0.110 -8.226 0.000 -1.130 -0.685
x3 8.3323 2.034 4.096 0.000 4.236 12.429
x4 -2.7167 0.633 -4.291 0.000 -3.992 -1.442
==============================================================================
Omnibus: 0.231 Durbin-Watson: 2.074
Prob(Omnibus): 0.891 Jarque-Bera (JB): 0.429
Skew: 0.033 Prob(JB): 0.807
Kurtosis: 2.551 Cond. No. 36.4
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
The intercept and coefficient on the group variable (x1
) are way off in the PCAed model! The group variable in the non-PCA model is off for this particular seed, but when I have changed it up, I tend to capture $0.2$ in the confidence interval for the model of the original data, while the PCA model is way off almost every time.
This plan made a lot of sense to me, yet it seems to have serious issues. Have I made a coding error? Have I missed something about principal components? What's going on?
One idea I had was to take the p-value from the PCAed model but the point estimate from the model on the original data. But what if I want a confidence interval for the coefficient?
import numpy as np
import statsmodels.api as sm
from sklearn.decomposition import PCA
import scipy.stats
np.random.seed(2020)
# Define sample size
#
N = 50
# Define the parameter 4-vector WITHOUT an intercept
#
beta_1 = np.array([0.2, 1, -1, -1])
# Define categorical predictor
#
g = np.random.binomial(1, 0.5, N)
# Define covariance matrix of covariates
#
S = np.array([[1, -0.8, 0.7], [-0.8, 1, -0.8], [0.7, -0.8, 1]])
# Define matrix of covariates
#
covs = np.random.multivariate_normal(np.array([0, 0, 0]), S, N)
# Combine all predictors into one matrix
#
X = np.c_[g, covs]
# Make three PCs and add them to g to give the PCAed model matrix
#
pca = PCA(n_components=3)
pca.fit(X)
diag = pca.transform(X)
X_pca = np.c_[g, diag]
# Simulate the expected value of the response variable
#
y_hat = np.matmul(X, beta_1)
# Simulate error term, using the mean as the intercept, beta_0
#
err = np.random.normal(3, 1, N)
# Simulate response variable
#
y = y_hat + err
# Fit full model on original data
#
orig = sm.OLS(y, sm.tools.add_constant(X)).fit()
# Fit full model on PCAed data
#
pca_ed = sm.OLS(y, sm.tools.add_constant(X_pca)).fit()
print(orig.summary())
print(pca_ed.summary())