Propensity Score Matching (PSM) is a statistical technique used in observational studies to reduce selection bias when estimating the causal effect of a treatment or intervention. It involves pairing treated and untreated units with similar propensity scores, which represent the probability of receiving the treatment based on observed covariates.
Key Concepts
- Propensity Score: The probability of a unit receiving the treatment, given its covariates.
- Matching: Pairing units from the treatment and control groups with similar propensity scores.
- Covariate Balance: After matching, the treatment and control groups should have similar distributions of the observed covariates, which makes them comparable.
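In symbols, the first and third concepts can be written compactly. The notation here is an assumption for illustration: T is the binary treatment indicator, X the vector of observed covariates, and e(x) the propensity score. The second expression is the balancing property: among units with the same propensity score, treatment status is independent of the observed covariates.

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad T \perp X \mid e(X)
```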
Steps in Propensity Score Matching
- Estimate Propensity Scores: Use logistic regression or another model to estimate propensity scores for each unit based on covariates.
- Match Units: Pair treated units with untreated units that have similar propensity scores using methods like nearest neighbor or caliper matching.
- Assess Balance: Check whether covariates are balanced between the matched treatment and control groups.
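- Estimate Treatment Effect: Compare outcomes between the matched groups to estimate the causal effect of the treatment. When each treated unit is matched to untreated units, this is typically the average treatment effect on the treated (ATT).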
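The balance check in the third step is often done with the standardized mean difference (SMD) of each covariate. The following is a minimal sketch, not a definitive implementation: the function name `standardized_mean_difference` and the age values are hypothetical. A commonly used rule of thumb treats absolute values below about 0.1 as acceptable balance.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Difference in group means divided by the pooled standard deviation."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    # Pool the two sample variances with equal weight
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    if pooled_sd == 0:
        # The SMD is undefined when neither group has any variation
        return float("nan")
    return float((x_treated.mean() - x_control.mean()) / pooled_sd)

# Hypothetical ages of treated units and their matched controls
age_treated = [30, 40, 50]
age_control = [32, 41, 49]

smd = standardized_mean_difference(age_treated, age_control)
print(f"SMD for age: {smd:.3f}")
```

In practice the same calculation would be repeated for every covariate, before and after matching, to see whether matching improved balance.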
Matching Methods
Several methods are used for matching units based on propensity scores:
- Nearest Neighbor Matching: Matches each treated unit with the closest untreated unit based on propensity score.
- Caliper Matching: Matches units only if their propensity scores are within a predefined threshold.
- Radius Matching: Matches treated units with all untreated units within a specified range of propensity scores.
- Kernel Matching: Compares each treated unit with a weighted average of untreated units, giving larger weights to untreated units whose propensity scores are closer to that of the treated unit.
- Stratification Matching: Divides units into strata based on propensity scores and compares treated and untreated units within each stratum.
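To make the difference between plain nearest neighbor matching and caliper matching concrete, here is a minimal sketch of greedy one-to-one nearest neighbor matching without replacement that discards pairs outside a caliper. The function name `caliper_match`, the caliper of 0.05, and the scores are all hypothetical choices for illustration.

```python
import numpy as np

def caliper_match(ps_treated, ps_control, caliper=0.05):
    """Greedy 1:1 nearest neighbor matching without replacement.

    Returns (treated_position, control_position) pairs whose propensity
    scores differ by at most `caliper`.
    """
    ps_treated = np.asarray(ps_treated, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    available = np.ones(len(ps_control), dtype=bool)
    pairs = []
    for i, score in enumerate(ps_treated):
        if not available.any():
            break  # no controls left to match
        dist = np.abs(ps_control - score)
        dist[~available] = np.inf  # skip controls that are already used
        j = int(np.argmin(dist))
        if dist[j] <= caliper:
            pairs.append((i, j))
            available[j] = False
        # Otherwise the treated unit stays unmatched and is dropped
    return pairs

# Hypothetical propensity scores
ps_treated = [0.30, 0.62, 0.90]
ps_control = [0.28, 0.35, 0.60, 0.10]

pairs = caliper_match(ps_treated, ps_control, caliper=0.05)
print(pairs)  # [(0, 0), (1, 2)]
```

The third treated unit (score 0.90) has no control within the caliper, so it is left unmatched. This illustrates the trade-off of a caliper: closer matches at the cost of a smaller matched sample. Greedy matching also depends on the order in which treated units are processed.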
Example of PSM in Python
Using scikit-learn to estimate propensity scores with logistic regression and to match units by nearest neighbor:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
# Example dataset
data = pd.DataFrame({
    'age': [25, 30, 45, 50, 35],
    'income': [30000, 40000, 50000, 60000, 45000],
    'treatment': [1, 0, 1, 0, 1],
    'outcome': [1, 0, 1, 0, 1]
})
# Estimate propensity scores
model = LogisticRegression()
model.fit(data[['age', 'income']], data['treatment'])
data['propensity_score'] = model.predict_proba(data[['age', 'income']])[:, 1]
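# Match each treated unit to the untreated unit with the closest propensity score
# (matching is with replacement, so an untreated unit can be reused)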
treated = data[data['treatment'] == 1]
untreated = data[data['treatment'] == 0]
matcher = NearestNeighbors(n_neighbors=1)
matcher.fit(untreated[['propensity_score']])
distances, indices = matcher.kneighbors(treated[['propensity_score']])
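# Create matched dataset: one row per treated unit, holding its matched control
matched = untreated.iloc[indices.flatten()].reset_index(drop=True)
# Record the index (in `data`) of the treated unit each control was matched to
matched['matched_to'] = treated.index.values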
print(matched)
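Once every treated unit has a matched control, the final step can be sketched as the mean of the paired outcome differences, which estimates the average treatment effect on the treated (ATT). This is only an illustrative sketch: the function name `att_from_pairs` and the outcome values are hypothetical, and it produces a point estimate without a standard error.

```python
import numpy as np

def att_from_pairs(y_treated, y_matched_control):
    """Mean outcome difference between treated units and their matched controls."""
    y_treated = np.asarray(y_treated, dtype=float)
    y_matched_control = np.asarray(y_matched_control, dtype=float)
    if y_treated.shape != y_matched_control.shape:
        raise ValueError("each treated unit needs exactly one matched control")
    return float(np.mean(y_treated - y_matched_control))

# Hypothetical outcomes; position k in both lists belongs to the same matched pair
y_treated = [12.0, 9.5, 11.0]
y_matched_control = [10.0, 9.0, 8.5]

att = att_from_pairs(y_treated, y_matched_control)
print(f"Estimated ATT: {att:.3f}")
```

Inference on such an estimate should account for the matching step, for example because controls reused in matching with replacement make the pairs dependent.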
Applications of PSM
Propensity score matching is widely used in fields such as:
- Healthcare: Evaluating the effectiveness of treatments or medical interventions.
- Economics: Analyzing policy impacts using observational data.
- Education: Assessing the impact of programs on student performance.
- Marketing: Measuring the effects of campaigns or promotions.
Advantages
- Reduces Selection Bias: Balances observed covariates between treatment and control groups.
- Improves Causal Inference: Provides a framework for estimating treatment effects in observational studies.
- Simple and Intuitive: Easy to implement and interpret.
Limitations
- Unmeasured Confounding: PSM cannot account for unobserved variables that influence treatment assignment.
- Data Loss: Matching may exclude units that cannot be paired, reducing the sample size.
- Model Dependency: Results depend on correct specification of the model used to estimate propensity scores; a misspecified model can leave covariates imbalanced after matching.