Is there a function in Python to create a matched pairs dataset?
e.g.
df_matched = construct_matched_pairs(df_users_who_did_something,
df_all_other_users,
..)
The easiest way I've found is to use NearestNeighbors from sklearn:
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import NearestNeighbors

def get_matching_pairs(treated_df, non_treated_df, scaler=True):
    treated_x = treated_df.values
    non_treated_x = non_treated_df.values

    if scaler is True:
        scaler = StandardScaler()

    if scaler:
        # Fit the scaler on the treated group and apply the same
        # transformation to both groups so distances are comparable.
        scaler.fit(treated_x)
        treated_x = scaler.transform(treated_x)
        non_treated_x = scaler.transform(non_treated_x)

    # For each treated row, find the single closest non-treated row.
    nbrs = NearestNeighbors(n_neighbors=1, algorithm='ball_tree').fit(non_treated_x)
    distances, indices = nbrs.kneighbors(treated_x)
    indices = indices.reshape(indices.shape[0])
    matched = non_treated_df.iloc[indices]
    return matched
Example below:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
treated_df = pd.DataFrame()
np.random.seed(1)
size_1 = 200
size_2 = 1000
treated_df['x'] = np.random.normal(0,1,size=size_1)
treated_df['y'] = np.random.normal(50,20,size=size_1)
treated_df['z'] = np.random.normal(0,100,size=size_1)
non_treated_df = pd.DataFrame()
# two different populations
non_treated_df['x'] = list(np.random.normal(0,3,size=size_2)) + list(np.random.normal(-1,2,size=2*size_2))
non_treated_df['y'] = list(np.random.normal(50,30,size=size_2)) + list(np.random.normal(-100,2,size=2*size_2))
non_treated_df['z'] = list(np.random.normal(0,200,size=size_2)) + list(np.random.normal(13,200,size=2*size_2))
matched_df = get_matching_pairs(treated_df, non_treated_df)

fig, ax = plt.subplots(figsize=(6, 6))
plt.scatter(non_treated_df['x'], non_treated_df['y'], alpha=0.3, label='All non-treated')
plt.scatter(treated_df['x'], treated_df['y'], label='Treated')
plt.scatter(matched_df['x'], matched_df['y'], marker='x', label='Matched')
plt.legend()
plt.xlim(-1, 2)
plt.show()
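Note that this matches with replacement: the same non-treated row can be selected for several treated rows. A quick check on the returned frame's index shows how often that happens:

# Matching is with replacement, so the same non-treated row may be
# reused; count how many matches repeat an earlier one.
n_reused = matched_df.index.duplicated().sum()
print(f"{n_reused} of {len(matched_df)} matches reuse a non-treated row")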
As an answer to your question, you will also find libraries and small recipes that deal with propensity score matching. One such library describes itself as:

Implements propensity-score matching and eventually will implement balance diagnostics

That library also comes with an article explaining what it actually does and its main features.
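If you specifically want propensity-score matching rather than matching on the raw covariates, a minimal sketch using only scikit-learn could look like the following (the function name get_propensity_matched_pairs is just illustrative, and it reuses the treated_df and non_treated_df frames from the example above): a logistic regression estimates each row's probability of being treated, and the nearest-neighbour step from before is then run on that one-dimensional score.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

def get_propensity_matched_pairs(treated_df, non_treated_df):
    # Stack both groups and label treatment status (1 = treated, 0 = not).
    covariates = pd.concat([treated_df, non_treated_df], ignore_index=True)
    treatment = np.r_[np.ones(len(treated_df)), np.zeros(len(non_treated_df))]

    # Estimate the propensity score P(treated | covariates) with a logistic model.
    model = LogisticRegression(max_iter=1000).fit(covariates.values, treatment)
    scores = model.predict_proba(covariates.values)[:, 1]
    treated_scores = scores[:len(treated_df)].reshape(-1, 1)
    non_treated_scores = scores[len(treated_df):].reshape(-1, 1)

    # Nearest-neighbour match on the 1-D propensity score.
    nbrs = NearestNeighbors(n_neighbors=1).fit(non_treated_scores)
    _, indices = nbrs.kneighbors(treated_scores)
    return non_treated_df.iloc[indices.ravel()]

matched_on_score = get_propensity_matched_pairs(treated_df, non_treated_df)

The key design choice is that all covariates are collapsed into a single score before matching, which is what distinguishes propensity-score matching from the raw nearest-neighbour matching above; the dedicated libraries build diagnostics on top of this basic step.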