Propensity Score Matching

Propensity Score Matching (PSM) is a statistical technique used in observational studies to reduce selection bias when estimating the causal effect of a treatment or intervention. It involves pairing treated and untreated units with similar propensity scores, which represent the probability of receiving the treatment based on observed covariates.

Key Concepts

  • Propensity Score: The probability of a unit receiving the treatment, given its covariates.
  • Matching: Pairing units from the treatment and control groups with similar propensity scores.
  • Covariate Balance: After matching, the treatment and control groups should have similar distributions of the observed covariates, so that they are comparable.
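
As a compact restatement of the concepts above, the propensity score and its balancing property can be written as follows. The symbols are labels chosen here for illustration, not taken from the source: T is the treatment indicator (1 = treated), X the vector of observed covariates, and e(x) the propensity score.

```latex
e(x) = \Pr(T = 1 \mid X = x), \qquad T \perp X \mid e(X)
```

The second statement says that, among units with the same propensity score, treated and untreated units have the same distribution of observed covariates. It says nothing about unobserved variables.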

Steps in Propensity Score Matching

  1. Estimate Propensity Scores: Use logistic regression or another model to estimate propensity scores for each unit based on covariates.
  2. Match Units: Pair treated units with untreated units that have similar propensity scores using methods like nearest neighbor or caliper matching.
  3. Assess Balance: Check whether covariates are balanced between the matched treatment and control groups.
  4. Estimate Treatment Effect: Compare outcomes between the matched groups to estimate the causal effect of the treatment.
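
Step 3 is often carried out with the standardized mean difference (SMD) of each covariate. The following is a minimal sketch, assuming NumPy is available; the function name `standardized_mean_difference` and the age values are hypothetical and chosen only for illustration.

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """Standardized mean difference (SMD) of one covariate between two groups."""
    x_treated = np.asarray(x_treated, dtype=float)
    x_control = np.asarray(x_control, dtype=float)
    # Pooled standard deviation: square root of the mean of the two sample variances
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2.0)
    return (x_treated.mean() - x_control.mean()) / pooled_sd

# Hypothetical ages of matched treated and control units
age_treated = [30, 40, 50]
age_control = [32, 38, 47]
smd_age = standardized_mean_difference(age_treated, age_control)
print(f"SMD for age: {smd_age:.3f}")
```

A commonly quoted rule of thumb treats an absolute SMD below about 0.1 as adequate balance, but the threshold is a convention rather than a guarantee.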

Matching Methods

Several methods are used for matching units based on propensity scores:

  • Nearest Neighbor Matching: Matches each treated unit with the closest untreated unit based on propensity score.
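  • Caliper Matching: Matches units only if their propensity scores differ by no more than a predefined threshold (the caliper); treated units with no untreated unit inside the caliper stay unmatched.
  • Radius Matching: Matches each treated unit with all untreated units whose propensity scores lie within a specified radius of its own score.
  • Kernel Matching: Compares each treated unit with a weighted average of untreated units, giving larger weights to units with closer propensity scores.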
  • Stratification Matching: Divides units into strata based on propensity scores and compares treated and untreated units within each stratum.
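
To make the caliper idea concrete, here is a minimal sketch of greedy 1:1 nearest-neighbor matching without replacement, restricted to a caliper. The function name `caliper_match`, the default caliper of 0.05, and the scores are assumptions made for this illustration, not part of any library.

```python
import numpy as np

def caliper_match(ps_treated, ps_control, caliper=0.05):
    """Greedy 1:1 nearest-neighbor matching without replacement, within a caliper.

    Returns a list of (treated_position, control_position) pairs.
    """
    ps_treated = np.asarray(ps_treated, dtype=float)
    ps_control = np.asarray(ps_control, dtype=float)
    available = set(range(len(ps_control)))  # control units not yet used
    pairs = []
    for i, score in enumerate(ps_treated):
        if not available:
            break
        # Closest still-unused control unit for this treated unit
        j = min(available, key=lambda k: abs(ps_control[k] - score))
        # Accept the pair only if it falls inside the caliper
        if abs(ps_control[j] - score) <= caliper:
            pairs.append((i, j))
            available.remove(j)
    return pairs

# Hypothetical propensity scores
pairs = caliper_match([0.30, 0.62, 0.90], [0.28, 0.60, 0.35], caliper=0.05)
print(pairs)  # [(0, 0), (1, 1)]
```

The third treated unit (score 0.90) has no control within 0.05 and is dropped, which illustrates the trade-off of a caliper: closer matches at the cost of a smaller matched sample. Greedy matching also depends on the order in which treated units are processed.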

Example of PSM in Python

Using scikit-learn's `LogisticRegression` to estimate propensity scores and `NearestNeighbors` to match units (the dataset is a toy example, far too small for real inference):

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

# Example dataset
data = pd.DataFrame({
    'age': [25, 30, 45, 50, 35],
    'income': [30000, 40000, 50000, 60000, 45000],
    'treatment': [1, 0, 1, 0, 1],
    'outcome': [1, 0, 1, 0, 1]
})

# Estimate propensity scores
model = LogisticRegression()
model.fit(data[['age', 'income']], data['treatment'])
data['propensity_score'] = model.predict_proba(data[['age', 'income']])[:, 1]

# Match treated and untreated units
treated = data[data['treatment'] == 1]
untreated = data[data['treatment'] == 0]

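# 1-nearest-neighbor matching on the propensity score. This matches with
# replacement: one untreated unit can be matched to several treated units.
matcher = NearestNeighbors(n_neighbors=1)
matcher.fit(untreated[['propensity_score']])
distances, indices = matcher.kneighbors(treated[['propensity_score']])

# Create matched dataset: row i is the untreated match of the i-th treated unit
matched = untreated.iloc[indices.flatten()].reset_index(drop=True)
matched['matched_to'] = treated.index.values  # index of the treated unit in `data`
print(matched)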
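
The example above stops at matching. The final step, estimating the effect, can be sketched as the mean outcome difference across matched pairs, which estimates the average treatment effect on the treated (ATT). This is a minimal sketch: the function name `att_from_matched_pairs` and the outcome values are hypothetical, and it assumes 1:1 matching with the two arrays aligned pair by pair.

```python
import numpy as np

def att_from_matched_pairs(y_treated, y_matched_control):
    """Average treatment effect on the treated (ATT) from 1:1 matched pairs."""
    y_treated = np.asarray(y_treated, dtype=float)
    y_matched_control = np.asarray(y_matched_control, dtype=float)
    if y_treated.shape != y_matched_control.shape:
        raise ValueError("Each treated unit needs exactly one matched control outcome")
    # Mean of the within-pair outcome differences
    return float(np.mean(y_treated - y_matched_control))

# Hypothetical outcomes: position i in both lists belongs to the same matched pair
y_treated = [12.0, 15.0, 11.0]
y_matched_control = [10.0, 14.0, 12.0]
att = att_from_matched_pairs(y_treated, y_matched_control)
print(f"Estimated ATT: {att:.2f}")
```

A point estimate like this says nothing about uncertainty; standard errors for matched samples need additional care and are not covered by this sketch.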

Applications of PSM

Propensity score matching is widely used in fields such as:

  • Healthcare: Evaluating the effectiveness of treatments or medical interventions.
  • Economics: Analyzing policy impacts using observational data.
  • Education: Assessing the impact of programs on student performance.
  • Marketing: Measuring the effects of campaigns or promotions.

Advantages

  • Reduces Selection Bias: Balances observed covariates between treatment and control groups.
  • Improves Causal Inference: Provides a framework for estimating treatment effects in observational studies.
  • Simple and Intuitive: Easy to implement and interpret.

Limitations

  • Unmeasured Confounding: PSM cannot account for unobserved variables that influence treatment assignment.
  • Data Loss: Matching may exclude units that cannot be paired, reducing the sample size.
  • Model Dependency: Results depend on the correctness of the model used to estimate propensity scores.


  Source: IT위키
  * This page is mirrored from 공대위키 and may contain errors or omissions. See 공대위키 for the original document.