
I am fitting a multiple linear regression to a data set in which some of the predictors are "transformations" of others (though I'm not entirely sure whether the transformations are linear).

For the sake of an example, suppose we have three predictors, $A$, $B$, and $C$, that completely explain the variance of some dependent variable $Y$. However, $B$ and $C$ are highly correlated because $C$ is a transformation of $B$.

The transformation is $C_i = \sum_{j=0}^{23} V_j B_{i-j}$, where $V$ is a vector of 24 weights, so each $C_i$ is a weighted sum of the 24 most recent values of $B$.
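
For concreteness, here is a minimal NumPy sketch of that transformation (the values of $B$ and $V$ are made up, and values of $B$ before the start of the series are treated as zero):

import numpy as np

rng = np.random.default_rng(0)
B = rng.normal(size=500)   # made-up values for B
V = rng.normal(size=24)    # the 24 weights

# C_i = sum_{j=0}^{23} V_j * B_{i-j}; np.convolve computes exactly this sum
C = np.convolve(B, V)[:len(B)]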

My questions for you are the following:

  1. Is it possible to completely eliminate multicollinearity among the predictors, given that I know exactly how some of them are derived from others?

  2. If so, how do I go about eliminating this multicollinearity?

Thank you!

Edit: thanks to @curiositasisasinbutstillcuriou, I am getting very close to a solution. However, I need confirmation that my Python code for recovering the original predictors' coefficients makes sense. Most of the code below was taken from https://www.statology.org/principal-components-regression-in-python/; the last line is my attempt at recovering the original coefficients.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import scale 
from sklearn import model_selection
from sklearn.model_selection import RepeatedKFold
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

### Fit the PCR Model ###

# X is a dataframe of the original predictors
# y is a dataframe of the dependent variable

# standardize the predictors and compute the principal-component scores
pca = PCA()
X_reduced = pca.fit_transform(scale(X))

# define cross validation method
cv = RepeatedKFold(n_splits=10, n_repeats=3, random_state=1)

regr = LinearRegression()
mse = []

# calculate MSE with only the intercept
score = -1*model_selection.cross_val_score(regr,
           np.ones((len(X_reduced),1)), y, cv=cv,
           scoring='neg_mean_squared_error').mean()    
mse.append(score)

# calculate MSE using cross-validation, adding one component at a time
for i in np.arange(1, len(X.columns) + 1):
    score = -1*model_selection.cross_val_score(regr,
               X_reduced[:,:i], y, cv=cv, scoring='neg_mean_squared_error').mean()
    mse.append(score)
    
# plot cross-validation results    
plt.plot(mse)
plt.xlabel('Number of Principal Components')
plt.ylabel('MSE')
plt.title('PCR cross-validation')
plt.show()

# choose the number of components with the lowest cross-validated MSE
# (mse[0] is the intercept-only model, so the argmin is the component count)
n_components = int(np.argmin(mse))

### Use the Final Model to Make Predictions ###

# split the dataset into training (70%) and testing (30%) sets
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.3,random_state=0) 

# standardize each set and project it onto the first n_components principal components
# (note: scale() standardizes the training and test sets separately here)
X_reduced_train = pca.fit_transform(scale(X_train))[:,:n_components]
X_reduced_test = pca.transform(scale(X_test))[:,:n_components]

# train PCR model on training data 
regr = LinearRegression()
regr.fit(X_reduced_train, y_train)

# calculate RMSE on the test set
pred = regr.predict(X_reduced_test)
print(np.sqrt(mean_squared_error(y_test, pred)))

# find coefficients for original predictors
orig_coef = np.matmul(regr.coef_, pca.components_[:n_components,:])
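
As a sanity check of that last line (on made-up data, using the same imports as above, and only in the case where all components are kept so that nothing is discarded), I believe the mapped-back coefficients should reproduce a plain linear regression on the scaled predictors:

# made-up toy data; X_toy, y_toy, pca_toy, etc. are hypothetical names
rng = np.random.default_rng(1)
X_toy = pd.DataFrame(rng.normal(size=(200, 3)), columns=['A', 'B', 'C'])
y_toy = X_toy['A'] + 2*X_toy['B'] + 3*X_toy['C'] + rng.normal(size=200)

Xs = scale(X_toy)                                            # standardized predictors
pca_toy = PCA().fit(Xs)
ols = LinearRegression().fit(Xs, y_toy)                      # plain OLS on scaled X
pcr = LinearRegression().fit(pca_toy.transform(Xs), y_toy)   # regression on all PCs
back = np.matmul(pcr.coef_, pca_toy.components_)             # same mapping as my last line
print(np.allclose(back, ols.coef_))                          # expect True

(Note that these are coefficients for the standardized predictors; to express them in the original units I would still divide each one by the corresponding predictor's standard deviation.)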
  • Do you have the untransformed data? Even if not, if you normalize the data you have and then apply PCA, you should end up with data that has no multicollinearity (as far as I know). Have you tried that? – curiositasisasinbutstillcuriou Feb 03 '22 at 23:06
  • Yes, I have the untransformed and transformed data. Do I apply Principal Component Analysis with all the predictors: $A$, $B$, and $C$? – Florent H Feb 03 '22 at 23:57
  • Yes. If you are going to do PCA, you can use all of the variables. You don't need to do any transformations of the variables other than standardization of all your predictors, because PCA will change them radically (standardization of the predictors is important--without that, PCA might not actually deal with multicollinearity). You can then use the principal components for regression on Y. I'm not sure if it will improve your model's performance at all, but at least you will know there's no multicollinearity! – curiositasisasinbutstillcuriou Feb 04 '22 at 00:44
  • Oh, also: you want to train the PCA on the *training set* and find the standardization parameters and then apply them to the test set. You don't want to do separate PCAs/standardizations on the train/test sets separately--you only come up with those parameters from the training set. I hope that makes sense. – curiositasisasinbutstillcuriou Feb 04 '22 at 00:49
  • Thank you for all the help! All this makes a lot of sense. However, the interpretability of the coefficients is very important. Will I be able to "reverse engineer" the coefficients of the new PCA predictors to find the coefficients of the original predictors ($A$, $B$, and $C$)? – Florent H Feb 04 '22 at 01:13
  • Sure thing. Yes, this is a good question. Your principal component coefficients are probably not really interpretable on their own (they might be for some data sets, depending on how the variables relate to each other, but there's no reason to count on it). You can get the beta coefficients back to the original predictors. Here is an example in R: https://rpubs.com/esobolewska/pcr-step-by-step. – curiositasisasinbutstillcuriou Feb 04 '22 at 01:42
  • My answer below is about the best I can do here, I'm not a statistician or anything. Accept the answer if it's enough for you or you can wait for someone else smarter to tell me why I'm wrong! – curiositasisasinbutstillcuriou Feb 04 '22 at 01:54
  • Thanks again. Can you point out exactly what line of R code in the linked article gets the coefficients back to the original predictors? I read the whole article, but I am having a hard time interpreting the code because I am not at all familiar with R. – Florent H Feb 04 '22 at 02:48
  • Sorry about that! First you need to extract the linear model coefficients for the principal component variables as a matrix. Next you extract the rotations from the PCA as a matrix. Then you multiply them together using matrix multiplication. It's the last code block in section 3.5.2 PCR. The accepted answer to this question gives the technical explanation: https://stats.stackexchange.com/questions/241890/coefficients-of-principal-components-regression-in-terms-of-original-regressors – curiositasisasinbutstillcuriou Feb 04 '22 at 04:42
  • You can only recover original coefficients in PCR if the number of components is equal to number of original variables, but by setting the number of components equal to number of original variables you've done nothing to reduce multicollinearity. – Always Right Never Left Feb 04 '22 at 06:07
  • @AlwaysRightNeverLeft First, the number of components used does not matter with regard to fixing multicollinearity--whether the number of components matches the number of original variables or not doesn't matter. Test this with VIF. Second, to convert back, you only need to be able to do matrix multiplication---that means the number of columns in your rotation matrix needs to match your number of PC regression coefficients used. That is not going to be problematic, whether you use 10 components or 5. You just adjust your columns accordingly...But if you want to add an answer below, do so. – curiositasisasinbutstillcuriou Feb 04 '22 at 14:00
  • @AlwaysRightNeverLeft The original variables are the *rows* of your rotation matrix, not the *columns*. – curiositasisasinbutstillcuriou Feb 04 '22 at 14:08
  • @curiositasisasinbutstillcuriou - thank you so much for all your help again. Could you please take a look at my edit in the question to see if I understood correctly? – Florent H Feb 04 '22 at 19:24
  • @FlorentH My python is a bit weak and I couldn't get that to run. What version of python did you use to code it? I think what you have should work if pca.components_[:n_components,:] gets you the eigenvector matrix from pca. I think you have to reverse the order of the matrix multiplication (pca.components_[:n_components,:] comes first). Other than that, it looks good to me at least. – curiositasisasinbutstillcuriou Feb 04 '22 at 20:26
  • @FlorentH Frank Harrell who is a legit expert and wrote a regression textbook commented below. For what it's worth, the kind of multicollinearity that your data has won't be a problem because it will always be consistent across your data...all the same, at least you have a way to remove it in other problematic cases! – curiositasisasinbutstillcuriou Feb 05 '22 at 00:38
  • This is a generalization of *orthogonal polynomials.* Much can be learned from studying them, and the same conclusions apply. – whuber Feb 05 '22 at 00:54
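
A minimal sketch of the VIF check mentioned in the comments above (made-up data; assumes statsmodels is installed). Because the principal-component scores are mutually orthogonal, every VIF should come out essentially equal to 1, even when the raw predictors are nearly collinear:

import numpy as np
from sklearn.preprocessing import scale
from sklearn.decomposition import PCA
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 3))
X_demo[:, 2] = X_demo[:, 1] + 0.01*rng.normal(size=300)   # make two columns nearly collinear

scores = PCA().fit_transform(scale(X_demo))               # principal-component scores
print([variance_inflation_factor(scores, i) for i in range(scores.shape[1])])  # each ~1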

1 Answer


Per Frank Harrell's comment (see below), this kind of multicollinearity (produced by your transformations) will not be a problem, because it will be consistent in-sample and out-of-sample.

All the same, if you wanted to rule out multicollinearity affecting your regression, here are two straightforward options:

  1. You can standardize your predictors and then apply PCA. Your predictors will then no longer be multicollinear, although the model may not be any better from a predictive standpoint. You can convert the coefficients for the PCA variables back to the original variables by extracting the PCA rotations and doing matrix multiplication (a sketch is given after this list).

  2. You can also do regression using a tree-based model instead. The performance of a tree-based model should not be strongly impacted by multicollinearity. But, as far as I know, you will not be able to obtain regression coefficients; you will need to look at an alternative measure--like variable importance. See more here: https://medium.com/@manepriyanka48/multicollinearity-in-tree-based-models-b971292db140
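
Here is a minimal, self-contained sketch of option 1 (the data, the variable names, and the use of a scikit-learn Pipeline are my own; the key points are fitting the scaler and PCA on the training data only, then mapping the component coefficients back with the rotation matrix):

import numpy as np
import pandas as pd
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# made-up predictors A, B, C and response y
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(300, 3)), columns=['A', 'B', 'C'])
y = X['A'] + 2*X['B'] - X['C'] + rng.normal(size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

n_components = 2   # in practice, chosen by cross-validation
pcr = Pipeline([
    ('scale', StandardScaler()),               # standardization is fit on the training set only
    ('pca', PCA(n_components=n_components)),
    ('ols', LinearRegression()),
])
pcr.fit(X_train, y_train)                      # the test set only ever passes through transform()

# map the component coefficients back to the (standardized) original predictors
rotation = pcr.named_steps['pca'].components_   # shape (n_components, n_original_variables)
beta_pc = pcr.named_steps['ols'].coef_          # shape (n_components,)
beta_scaled = beta_pc @ rotation                # one coefficient per original predictor
beta_original = beta_scaled / pcr.named_steps['scale'].scale_   # back to the original units
print(dict(zip(X.columns, beta_original)))

For option 2 you would not recover coefficients at all; you would instead look at something like the feature_importances_ attribute of a fitted RandomForestRegressor.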

  • This kind of multicollinearity is harmless. See chapter 4 of hbiostat.org/rms – Frank Harrell Feb 04 '22 at 20:14
  • Thank you for your comment @FrankHarrell! To make sure I understand, you mean when transformed variables are derived from the originals and are, as you said in your book, *connected algebraically*? I'm not sure I fully grasped precisely the kind of multicollinearity that is harmless. – curiositasisasinbutstillcuriou Feb 04 '22 at 20:41
  • That's it. The variables will always be consistent with each other, both in-sample and out-of-sample. Havoc happens when out-of-sample collinearities differ from training sample collinearities. Predicted values are not disturbed by extreme collinearity if it's consistent. – Frank Harrell Feb 04 '22 at 21:48
  • @FrankHarrell - Thank you so much for your valuable insight! I am happy to learn that the multicollinearity in my data set is harmless; however, since I am applying regression in a machine learning context and not a statistical inference context, I would just like to confirm with you that the coefficients obtained with a simple MLR would be indeed accurate and not prone to errors stemming from "harmful" multicollinearity. Finally, would the multicollinearity still be harmless if a predictor was defined as the product of two other predictors? (i.e. $D$ = $A$ * $B$). Thank you so much! – Florent H Feb 05 '22 at 20:28