How to systematically remove collinear variables (pandas columns) in Python?

Question

Thus far, I have removed collinear variables as part of the data preparation process by looking at correlation tables and eliminating variables whose correlation is above a certain threshold. Is there a more accepted way of doing this? Additionally, I am aware that looking at the correlation between only two variables at a time is not ideal; measures like VIF take into account potential correlation across several variables. How would one go about systematically choosing variable combinations that do not exhibit multicollinearity?

I have my data within a pandas data frame and am using sklearn's models.
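
For reference, the pairwise-threshold approach described in the question might look roughly like the sketch below. The function name `drop_correlated`, the 0.9 threshold, and the demo data are illustrative assumptions, not something taken from the question.

```python
import numpy as np
import pandas as pd


def drop_correlated(df, threshold=0.9):
    """Drop one column from every pair whose |correlation| exceeds threshold."""
    corr = df.corr().abs()
    # Keep only the strict upper triangle so each pair is inspected once.
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    upper = corr.where(mask)
    # A column is dropped if it is highly correlated with any earlier column.
    to_drop = [col for col in upper.columns if (upper[col] > threshold).any()]
    return df.drop(columns=to_drop)


# Hypothetical demo data: "a_copy" is almost identical to "a".
rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
demo = pd.DataFrame({"a": a, "b": b, "a_copy": a + 0.01 * rng.normal(size=200)})
reduced = drop_correlated(demo, threshold=0.9)
print(list(reduced.columns))
```

This only ever compares two columns at a time, which is the limitation the question points out: a column that is a combination of several others can slip through.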

You might want to consider Partial Least Squares Regression or Principal Components Regression. One of these is probably supported. — spdrnl, Jun 01 '15 at 18:57
I see. So if I understand correctly, running PCA would then give me a set of independent principal components, which I could then use as covariates for my model, since each of the principal components is not collinear with the others? — orange1, Jun 01 '15 at 20:04
Exactly. Some of the components are likely to turn out irrelevant. This is easier than dropping variables. — spdrnl, Jun 01 '15 at 20:13
Hm, so my intention is primarily to run the model for explanatory rather than predictive purposes. How would one go about interpreting a model that used principal components as covariates? — orange1, Jun 01 '15 at 20:35
In that case it does not help since interpreting components is somewhat of a dark art. — spdrnl, Jun 02 '15 at 13:42
https://stackoverflow.com/questions/27651702/checking-for-multicollinearity-in-python http://stackoverflow.com/a/25833792/535665 — jseabold, Jun 02 '15 at 16:45
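
To illustrate the PCA point from the comments: the sketch below computes principal component scores with a plain NumPy SVD (an assumption made for self-containment; sklearn's `PCA` would be the more usual choice) and checks that the scores are uncorrelated with one another. The data are made up. Note that "uncorrelated" is what PCA guarantees, which is weaker than "independent".

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=500)
x2 = x1 + 0.1 * rng.normal(size=500)   # strongly collinear with x1
x3 = rng.normal(size=500)
X = np.column_stack([x1, x2, x3])

# Centre the columns, then project onto the right singular vectors.
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
scores = Xc @ Vt.T                      # principal component scores

# Off-diagonal correlations between the scores are zero up to rounding.
corr = np.corrcoef(scores, rowvar=False)
off_diag = corr - np.diag(np.diag(corr))
print(np.abs(off_diag).max())

# Share of total variance carried by each component.
explained = s**2 / np.sum(s**2)
print(explained)

# Each row of Vt mixes all original variables, which is why the
# components are hard to interpret in an explanatory model.
print(Vt)
```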

score 20 · Answer 1 · edited Sep 10 '18 at 17:17

Thanks SpanishBoy - It is a good piece of code. @ilanman: This checks VIF values and then drops variables whose VIF is more than 5. By "performance", I think he means run time. The above code took me about 3 hours to run on about 300 variables, 5000 rows.

By the way, I have modified it to remove some extra loops. I've also made it a bit cleaner and made it return the dataframe with the reduced set of variables. This version cut my run time in half! My code is below. Hope it helps.

from statsmodels.stats.outliers_influence import variance_inflation_factor    

def calculate_vif_(X, thresh=5.0):
    variables = list(range(X.shape[1]))
    dropped = True
    while dropped:
        dropped = False
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix)
               for ix in range(X.iloc[:, variables].shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X.iloc[:, variables].columns[maxloc] +
                  '\' at index: ' + str(maxloc))
            del variables[maxloc]
            dropped = True

    print('Remaining variables:')
    print(X.columns[variables])
    return X.iloc[:, variables]

Thank you. Have you compared the outputs of both functions? I saw an R function (package `usdm`, method `vifstep`) for VIF, and its run time was really good. As I said before, the variant above and yours (optimized by half) are very slow compared with the R one. Any other ideas on how to optimize it further? — SpanishBoy, Apr 12 '17 at 17:22
I have a question about this approach. Let's say that we have features A, B and C, and A is correlated with C. If you loop over the features, A and C will both have VIF > 5, hence they will be dropped. In reality, shouldn't you re-calculate the VIF every time you drop a feature? In my example you'd drop both A and C, but if you calculate VIF(C) after A is dropped, it is not going to be > 5. — Titus Pullo, Jun 24 '19 at 13:26
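
On the two comments above: the `while` loop in this answer already recomputes every VIF after each drop, so in the A/B/C scenario only one of A and C should be removed. The sketch below shows that behaviour with a faster way of getting the VIFs, namely the diagonal of the inverse correlation matrix. This is a swapped-in technique, not the statsmodels function: it corresponds to regressions that include an intercept, so the numbers may differ from `variance_inflation_factor`, which, as far as I know, does not add a constant unless the design matrix contains one. The names `vif_from_corr` and `drop_high_vif` and the demo data are hypothetical.

```python
import numpy as np
import pandas as pd


def vif_from_corr(df):
    """VIF of every column, taken from the diagonal of the inverse correlation matrix."""
    corr = np.corrcoef(df.to_numpy(dtype=float), rowvar=False)
    return pd.Series(np.diag(np.linalg.inv(corr)), index=df.columns)


def drop_high_vif(df, thresh=5.0):
    """Drop the column with the largest VIF, recompute, and repeat."""
    cols = list(df.columns)
    while len(cols) > 1:
        vif = vif_from_corr(df[cols])
        worst = vif.idxmax()
        if vif[worst] <= thresh:
            break
        cols.remove(worst)   # the next pass recomputes VIF without this column
    return df[cols]


# Hypothetical demo data matching the A/B/C example from the comment.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
demo = pd.DataFrame({
    "A": a,
    "B": rng.normal(size=500),
    "C": a + 0.1 * rng.normal(size=500),  # C is nearly a copy of A
})
kept = drop_high_vif(demo)
print(list(kept.columns))
```

One matrix inversion per pass replaces one regression per column, which is where the speed-up would come from. Exactly collinear columns make the correlation matrix singular, so this sketch assumes they have been removed beforehand.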

score 4 · Answer 2 · answered Sep 29 '16 at 16:06

You can try the code below:

from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X):

    '''X - pandas dataframe'''
    thresh = 5.0
    # A list (not a range) so that entries can be deleted below.
    variables = list(range(X.shape[1]))

    for i in range(len(variables)):
        # Select columns by position; X[variables] would look up integer labels.
        vif = [variance_inflation_factor(X.iloc[:, variables].values, ix) for ix in range(len(variables))]
        print(vif)
        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X.columns[variables[maxloc]] + '\' at index: ' + str(maxloc))
            del variables[maxloc]

    print('Remaining variables:')
    print(X.columns[variables])
    # Note: this returns the original frame; the surviving columns are only printed.
    return X

It works, but I don't like the performance of that approach.

Do you want to comment a little more on what this approach does? And why you don't like the performance? — ilanman, Sep 29 '16 at 16:49

score 3 · Answer 3 · answered Dec 13 '17 at 18:27

I tried SpanishBoy's answer and found several errors when running it on a data frame. Here is a debugged solution.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor

def calculate_vif_(X, thresh=100):
    cols = X.columns
    variables = np.arange(X.shape[1])
    dropped = True
    while dropped:
        dropped = False
        # Recompute every VIF on the columns that are still in play.
        c = X[cols[variables]].values
        vif = [variance_inflation_factor(c, ix) for ix in np.arange(c.shape[1])]

        maxloc = vif.index(max(vif))
        if max(vif) > thresh:
            print('dropping \'' + X[cols[variables]].columns[maxloc] + '\' at index: ' + str(maxloc))
            variables = np.delete(variables, maxloc)
            dropped = True

    print('Remaining variables:')
    print(X.columns[variables])
    return X[cols[variables]]

I also had no issues with performance, but have not tested it extensively.

This is nice and works for me, except that it returns the ominous warning: `RuntimeWarning: divide by zero encountered in double_scalars` — user2205916, Jul 05 '18 at 03:54
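
A likely cause of that warning (an assumption, since the commenter's data are not shown): VIF is 1 / (1 - R²), so when one column is an exact linear combination of the others, R² is 1 and the division is by zero. The resulting VIF is infinite and the column is simply dropped on that pass. A rank check such as the sketch below can confirm whether exact collinearity is present; the helper name `has_exact_collinearity` and the demo frame are hypothetical.

```python
import numpy as np
import pandas as pd


def has_exact_collinearity(df):
    """True when some column is an exact linear combination of the others."""
    values = df.to_numpy(dtype=float)
    return np.linalg.matrix_rank(values) < values.shape[1]


rng = np.random.default_rng(0)
demo = pd.DataFrame({"x": rng.normal(size=50), "y": rng.normal(size=50)})
demo["z"] = demo["x"] + demo["y"]      # exact linear combination of x and y

print(has_exact_collinearity(demo[["x", "y"]]))
print(has_exact_collinearity(demo))
```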
