Skip to main content

Lee & Wooldridge Difference-in-Differences estimator for small cross-sectional sample sizes

Project description

lwdid: Difference-in-Differences Estimator for Small Cross-Sectional Samples

Version Python License

Python implementation of the Lee and Wooldridge (2025) difference-in-differences estimator for panel data with small cross-sectional sample sizes.

Overview

This package implements the methodology described in Lee and Wooldridge (2025), providing valid inference for difference-in-differences estimation when the number of treated or control units is small.

Reference: Lee, S. J., and Wooldridge, J. M. (2025). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes. Available at SSRN 5325686.

Authors: Xuanyu Cai, Wenli Xu

Key Features

The package provides inference for small cross-sectional samples by transforming panel data into cross-sectional regressions:

  • Designed for settings with small numbers of treated or control units
  • Exact t-based inference available under classical linear model assumptions (normality and homoskedasticity)
  • Works best with large time dimensions, where the central limit theorem across time supports normality
  • Serial correlation handled through unit-specific transformations
  • Unit-specific linear trends and seasonal patterns
  • Heteroskedasticity-robust inference (HC1/HC3) for moderate sample sizes
  • Randomization inference for finite-sample validity without distributional assumptions

Transformation Methods

Four transformation methods are available:

  • demean: Unit-specific demeaning (Procedure 2.1)
  • detrend: Unit-specific detrending (Procedure 3.1)
  • demeanq: Quarterly demeaning with seasonal effects
  • detrendq: Quarterly detrending with linear trends and seasonal effects

Installation

pip install lwdid

Or install from source:

git clone https://github.com/gorgeousfish/lwdid-py.git
cd lwdid-py
pip install .

Quick Start

Basic Example

import pandas as pd
from lwdid import lwdid

# Load panel data
data = pd.read_csv('smoking.csv')
# Note: 'd' is the column name for treatment indicator in this dataset

# Estimate ATT with exact inference
results = lwdid(
    data,
    y='lcigsale',      # outcome variable
    d='d',             # treatment indicator (0/1)
    ivar='state',      # unit identifier
    tvar='year',       # time variable
    post='post',       # post-treatment indicator
    rolling='detrend', # transformation: demean, detrend, demeanq, detrendq
    vce=None           # None: exact inference; 'hc3': heteroskedasticity-robust
)

# View results
print(results.summary())
print(f"ATT: {results.att:.4f} (SE: {results.se_att:.4f})")
print(f"95% CI: [{results.ci_lower:.4f}, {results.ci_upper:.4f}]")

# Export results
results.to_excel('results.xlsx')
results.to_latex('results.tex')

Advanced Usage

Randomization Inference

# Randomization inference for finite-sample validity without distributional assumptions
# Default: bootstrap resampling
# Alternative: permutation-based (Fisher randomization inference)
results = lwdid(
    data, 'lcigsale', 'd', 'state', 'year', 'post', 'detrend',
    ri=True,               # enable randomization inference
    ri_method='bootstrap', # 'bootstrap' (default) or 'permutation'
    rireps=1000,           # number of replications
    seed=42
)
print(f"RI p-value: {results.ri_pvalue:.4f}")

Control Variables

# Include time-invariant control variables
# Note: Controls must be constant within each unit across all periods
# For time-varying variables, use pre-treatment mean or first value

# Create time-invariant controls from time-varying variables
data_with_controls = data.copy()
for var in ['retprice', 'beer']:
    # Use pre-treatment period mean
    pre_mean = data[data['post']==0].groupby('state')[var].mean()
    data_with_controls[f'{var}_pre'] = data_with_controls['state'].map(pre_mean)

results = lwdid(
    data_with_controls, 'lcigsale', 'd', 'state', 'year', 'post', 'detrend',
    controls=['retprice_pre', 'beer_pre'],  # time-invariant covariates
    vce='hc3'
)

Quarterly Data

# Quarterly panel with seasonal effects
# Example: data with columns [unit, year, quarter, outcome, d, post]
results = lwdid(
    data, 'outcome', 'd', 'unit',
    tvar=['year', 'quarter'],  # composite time variable
    post='post',
    rolling='detrendq'         # quarterly detrending
)

Capabilities

Core Features

  • Transformation methods: demean, detrend, demeanq, detrendq
  • Inference options: Exact (under normality), HC1 robust, HC3 robust, cluster-robust
  • Control variables: Time-invariant covariates with automatic centering
  • Period-specific effects: Estimate ATT for each post-treatment period
  • Randomization inference: Bootstrap (default) or permutation-based p-values for finite-sample validity
  • Visualization: Time series plots comparing treated and control units
  • Export formats: Excel (multi-sheet), CSV, LaTeX tables

Validation

The implementation has been validated for numerical accuracy and consistency with the methodology described in Lee and Wooldridge (2025).

Requirements

  • Python ≥ 3.8, <3.13
  • numpy ≥ 1.20, <3.0
  • pandas ≥ 1.3, <3.0
  • scipy ≥ 1.7, <2.0
  • statsmodels ≥ 0.13, <1.0
  • matplotlib ≥ 3.3 (visualization)
  • openpyxl ≥ 3.1 (Excel export)

API Reference

Main Function

lwdid(data, y, d, ivar, tvar, post, rolling, **options)

Required Parameters:

  • data (DataFrame): Panel data in long format
  • y (str): Outcome variable
  • d (str): Unit-level treatment indicator Dᵢ (0/1)
    • Important: Must be time-invariant (constant within each unit across all periods)
    • Do not pass time-varying treatment indicator Wᵢₜ = Dᵢ × postₜ
    • If you have Wᵢₜ, construct Dᵢ first: data['D_i'] = data.groupby('unit')['W_it'].transform('max')
  • ivar (str): Unit identifier
  • tvar (str or list): Time variable (must be numeric)
    • Annual data: Single column name (str), e.g., tvar='year'
    • Quarterly data: List of two column names [year, quarter], e.g., tvar=['year', 'quarter']
    • Important: All time variables must contain numeric values (int or float)
  • post (str): Post-treatment indicator (0/1)
  • rolling (str): Transformation method
    • 'demean': Standard DiD with unit fixed effects
    • 'detrend': DiD with unit-specific linear trends
    • 'demeanq': Quarterly data with seasonal effects
    • 'detrendq': Quarterly data with trends and seasonal effects

Optional Parameters:

  • vce (str or None): Variance estimator (default: None, case-insensitive)
    • None: Homoskedastic standard errors (exact inference under normality)
    • 'robust' or 'hc1': HC1 heteroskedasticity-robust standard errors
    • 'hc3': HC3 small-sample adjusted heteroskedasticity-robust standard errors
    • 'cluster': Cluster-robust standard errors (requires cluster_var)
  • cluster_var (str): Cluster variable for cluster-robust standard errors (required when vce='cluster')
    • Must be a column name in the data
    • Clusters are typically defined at a higher aggregation level than units
    • Inference uses G-1 degrees of freedom, where G is the number of clusters
  • controls (list of str): Time-invariant control variables
    • Controls are included only if both N₁ > K+1 and N₀ > K+1 (where K is the number of controls)
    • If conditions are not met, controls are excluded and a warning is issued
  • ri (bool): Enable randomization inference (default: False)
  • ri_method (str): Resampling method for randomization inference (default: 'bootstrap')
    • 'bootstrap': With-replacement resampling
    • 'permutation': Without-replacement permutation (Fisher randomization inference)
  • rireps (int): Number of replications for randomization inference (default: 1000)
  • seed (int): Random seed for reproducibility
  • graph (bool): Generate visualization (default: False)
    • If plotting fails, a warning is issued and estimation continues unaffected
  • gid (str/int): Unit identifier for plotting (default: None for treated group mean)
  • graph_options (dict): Matplotlib plotting options (default: None)
    • Supported keys: figsize, title, xlabel, ylabel, legend_loc, dpi, savefig

Returns: LWDIDResults object with the following attributes:

Attribute Type Description
att float Average treatment effect on treated
se_att float Standard error
t_stat float t-statistic
pvalue float Two-sided p-value
ci_lower, ci_upper float 95% confidence interval
att_by_period DataFrame Period-specific treatment effects
ri_pvalue float Randomization inference p-value (if ri=True)
rireps int Number of RI replications (if ri=True)
ri_method str RI method used: 'bootstrap' or 'permutation' (if ri=True)
ri_valid int Number of successful RI replications (if ri=True)
ri_failed int Number of failed RI replications (if ri=True)
nobs int Number of observations in the cross-sectional regression (equals number of units)
n_treated int Number of treated units
n_control int Number of control units
df_resid int Residual degrees of freedom (N - K - 1)
df_inference int Degrees of freedom used for inference (G - 1 for cluster-robust SE, df_resid otherwise)

Methods:

  • summary(): Print formatted results table
  • plot(gid=None, graph_options=None): Visualize transformed outcomes over time
    • Plots residualized outcomes after removing unit-specific patterns
    • Useful for assessing parallel trends assumption
    • gid: Unit identifier to plot (default: treated group mean)
    • graph_options: Dictionary of matplotlib options
  • to_excel(path): Export to Excel workbook
  • to_csv(path): Export period-specific effects to CSV
  • to_latex(path): Export to LaTeX table

Usage Guidelines

Inference Choice:

  • Use vce=None for exact inference when N is small and normality is plausible
  • Use vce='hc3' for moderate samples (N ≥ 10) or when heteroskedasticity is suspected
  • Use vce='cluster' for cluster-robust inference (requires cluster_var)
    • Inference uses df = G - 1 (number of clusters minus 1)
    • Actual degrees of freedom stored in results.df_inference
  • Use randomization inference (ri=True) for finite-sample validity without distributional assumptions
    • Randomization inference uses homoskedastic standard errors to construct the null distribution
    • The vce option affects only classical t-based inference, not the randomization inference p-value

Data Format:

  • Panel structure:
    • Data must be in long format (one row per unit-time observation)
    • Each (unit, time) combination must be unique
    • Time index must form a continuous sequence
    • Panels may be balanced or unbalanced across units
  • Treatment timing (common timing assumption):
    • All treated units must begin treatment in the same period
    • The post indicator must be a function of time only
    • Treatment must be persistent (no reversals)
    • Staggered adoption is not supported (see Lee and Wooldridge 2025, Section 7)
  • Time variable format:
    • Annual data: Single numeric column (e.g., tvar='year')
    • Quarterly data: Two numeric columns (e.g., tvar=['year', 'quarter'])
  • Reserved column names: Avoid d_, post_, tindex, tq, ydot, ydot_postavg, firstpost

Examples

California Smoking Restrictions

# Analysis with single treated unit (N_treated = 1, N_control = 38)
data = pd.read_csv('smoking.csv')
results = lwdid(
    data,
    y='lcigsale',
    d='d',
    ivar='state',
    tvar='year',
    post='post',
    rolling='detrend',
    vce=None
)
print(results.summary())

See examples/smoking.ipynb for complete example.

Testing

The package includes comprehensive tests:

pytest tests/

Authors

Xuanyu Cai, Wenli Xu

Contributing

Contributions are welcome. Please submit bug reports or feature requests via the issue tracker.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

lwdid-0.1.0.tar.gz (1.0 MB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

lwdid-0.1.0-py3-none-any.whl (55.9 kB view details)

Uploaded Python 3

File details

Details for the file lwdid-0.1.0.tar.gz.

File metadata

  • Download URL: lwdid-0.1.0.tar.gz
  • Upload date:
  • Size: 1.0 MB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for lwdid-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d57c9b9dc3adfa61e53e4129cb204df9e33217aa45612e72b1790c6aa6efac89
MD5 afc1ed1ba0263fd294bc98ab5b71770b
BLAKE2b-256 c8c87a7b16881dc2e684ead8b05abf1c2635522eb06b868b2e7bb9a94c86b26c

See more details on using hashes here.

File details

Details for the file lwdid-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: lwdid-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 55.9 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for lwdid-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8003e984ff840286c339df0361b85e40f41120174015901fc96b090dcc9b2d49
MD5 d20b4d4ce03c999c11e89b95eda231a6
BLAKE2b-256 0ef389d9a0200a3fb719d49950a31e9802c64ac209362e1edbe203ef66c75ffe

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page