Lee & Wooldridge Difference-in-Differences estimator for small cross-sectional sample sizes

lwdid: Difference-in-Differences Estimator for Small Cross-Sectional Samples

Python implementation of the Lee and Wooldridge (2025) difference-in-differences estimator for panel data with small cross-sectional sample sizes.

Overview

This package implements the methodology described in Lee and Wooldridge (2025), providing valid inference for difference-in-differences estimation when the number of treated or control units is small.

Reference: Lee, S. J., and Wooldridge, J. M. (2025). Simple Approaches to Inference with Difference-in-Differences Estimators with Small Cross-Sectional Sample Sizes. Available at SSRN 5325686.

Authors: Xuanyu Cai, Wenli Xu

Key Features

The package provides inference for small cross-sectional samples by transforming panel data into cross-sectional regressions:

Designed for settings with small numbers of treated or control units
Exact t-based inference available under classical linear model assumptions (normality and homoskedasticity)
Works best with a large time dimension, where averaging over many periods makes the normality assumption more plausible through a central limit argument across time
Serial correlation handled through unit-specific transformations
Support for unit-specific linear trends and seasonal patterns
Heteroskedasticity-robust inference (HC1/HC3) for moderate sample sizes
Randomization inference for finite-sample validity without distributional assumptions
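
As a rough illustration of the cross-sectional step described above, the sketch below regresses a unit-level transformed outcome on an intercept and a 0/1 treatment dummy and computes the homoskedastic standard error that classical t-based inference relies on. It is a minimal sketch under stated assumptions, not the package's internal code: the function name `cross_section_ols` and the toy numbers are hypothetical, and only numpy is used.

```python
import numpy as np

def cross_section_ols(ydot, d):
    """OLS of a unit-level transformed outcome on an intercept and a 0/1 dummy.

    Returns the coefficient on d, its homoskedastic standard error,
    the t-statistic, and the residual degrees of freedom.
    """
    ydot = np.asarray(ydot, dtype=float)
    d = np.asarray(d, dtype=float)
    X = np.column_stack([np.ones_like(d), d])        # intercept and treatment dummy
    beta, *_ = np.linalg.lstsq(X, ydot, rcond=None)  # OLS coefficients
    resid = ydot - X @ beta
    df = len(ydot) - X.shape[1]                      # N minus number of coefficients
    sigma2 = resid @ resid / df                      # homoskedastic error variance
    cov = sigma2 * np.linalg.inv(X.T @ X)
    att = float(beta[1])
    se = float(np.sqrt(cov[1, 1]))
    return att, se, att / se, df

# Hypothetical transformed outcomes: one treated unit, four controls.
ydot = [3.0, 0.5, -0.5, 1.0, -1.0]
d = [1, 0, 0, 0, 0]
att, se, t_stat, df = cross_section_ols(ydot, d)
print(f"ATT = {att:.3f}, SE = {se:.3f}, t = {t_stat:.3f}, df = {df}")
```

With a binary regressor the coefficient on the dummy equals the difference in group means, so the estimate remains computable even with a single treated unit. The t-statistic would then be compared with a t distribution on `df` degrees of freedom.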

Transformation Methods

Four transformation methods are available:

demean: Unit-specific demeaning (Procedure 2.1)
detrend: Unit-specific detrending (Procedure 3.1)
demeanq: Quarterly demeaning with seasonal effects
detrendq: Quarterly detrending with linear trends and seasonal effects
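
The two annual transformations can be sketched on a toy panel as follows. This is one reading of the descriptions above, not the package's implementation: `demean_unit` subtracts each unit's pre-treatment mean from its post-treatment outcomes and averages the result, while `detrend_unit` fits a unit-specific linear trend on pre-treatment periods, extrapolates it into the post-treatment periods, and averages the deviations. The helper names, column names, and data are hypothetical.

```python
import numpy as np
import pandas as pd

def demean_unit(g):
    """Average post-period outcome minus the unit's pre-period mean."""
    pre, post = g[g["post"] == 0], g[g["post"] == 1]
    return float((post["y"] - pre["y"].mean()).mean())

def detrend_unit(g):
    """Average post-period outcome minus the extrapolated pre-period linear trend."""
    pre, post = g[g["post"] == 0], g[g["post"] == 1]
    slope, intercept = np.polyfit(
        pre["t"].to_numpy(dtype=float), pre["y"].to_numpy(dtype=float), deg=1
    )
    fitted = intercept + slope * post["t"].to_numpy(dtype=float)
    return float((post["y"].to_numpy(dtype=float) - fitted).mean())

# Toy panel: unit A follows y = 2 + t and jumps by 3 after treatment; unit B is flat.
t = np.arange(1, 7)
post = (t >= 5).astype(int)
panel = pd.DataFrame({
    "unit": ["A"] * 6 + ["B"] * 6,
    "t": np.tile(t, 2),
    "post": np.tile(post, 2),
    "y": np.concatenate([2 + t + 3 * post, np.full(6, 5.0)]),
})

demeaned = {u: demean_unit(g) for u, g in panel.groupby("unit")}
detrended = {u: detrend_unit(g) for u, g in panel.groupby("unit")}
print(demeaned)
print(detrended)
```

In this toy data, demeaning attributes unit A's pre-existing trend to the post-treatment period, whereas detrending recovers the jump of 3. That difference is the motivation for choosing 'detrend' when units follow distinct linear trends.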

Installation

pip install lwdid

Or install from source:

git clone https://github.com/gorgeousfish/lwdid-py.git
cd lwdid-py
pip install .

Quick Start

Basic Example

import pandas as pd
from lwdid import lwdid

# Load panel data
data = pd.read_csv('smoking.csv')
# Note: 'd' is the column name for treatment indicator in this dataset

# Estimate ATT with exact inference
results = lwdid(
    data,
    y='lcigsale',      # outcome variable
    d='d',             # treatment indicator (0/1)
    ivar='state',      # unit identifier
    tvar='year',       # time variable
    post='post',       # post-treatment indicator
    rolling='detrend', # transformation: demean, detrend, demeanq, detrendq
    vce=None           # None: exact inference; 'hc3': heteroskedasticity-robust
)

# View results
print(results.summary())
print(f"ATT: {results.att:.4f} (SE: {results.se_att:.4f})")
print(f"95% CI: [{results.ci_lower:.4f}, {results.ci_upper:.4f}]")

# Export results
results.to_excel('results.xlsx')
results.to_latex('results.tex')

Advanced Usage

Randomization Inference

# Randomization inference for finite-sample validity without distributional assumptions
# Default: bootstrap resampling
# Alternative: permutation-based (Fisher randomization inference)
results = lwdid(
    data, 'lcigsale', 'd', 'state', 'year', 'post', 'detrend',
    ri=True,               # enable randomization inference
    ri_method='bootstrap', # 'bootstrap' (default) or 'permutation'
    rireps=1000,           # number of replications
    seed=42
)
print(f"RI p-value: {results.ri_pvalue:.4f}")
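
To illustrate the idea behind the 'permutation' option, the sketch below computes a Fisher-style p-value on a tiny sample by listing every possible assignment of the treatment label. It is only a sketch under assumptions: the package draws rireps random assignments rather than enumerating them, and the test statistic used here (the absolute difference in mean transformed outcomes) is an illustrative choice, not necessarily the one the package uses. All names and numbers are hypothetical.

```python
from itertools import combinations

import numpy as np

def diff_in_means(ydot, treated_idx):
    """Mean of the treated units minus mean of the remaining units."""
    mask = np.zeros(len(ydot), dtype=bool)
    mask[list(treated_idx)] = True
    return ydot[mask].mean() - ydot[~mask].mean()

def exact_permutation_pvalue(ydot, d):
    """Share of all treatment assignments at least as extreme as the observed one."""
    ydot = np.asarray(ydot, dtype=float)
    d = np.asarray(d).astype(bool)
    observed = diff_in_means(ydot, np.flatnonzero(d))
    n_treated = int(d.sum())
    stats = np.array([
        diff_in_means(ydot, idx)
        for idx in combinations(range(len(ydot)), n_treated)
    ])
    # Small tolerance guards against floating-point ties.
    return float(np.mean(np.abs(stats) >= abs(observed) - 1e-12))

# Hypothetical transformed outcomes: one treated unit, four controls.
ydot = [3.0, 0.5, -0.5, 1.0, -1.0]
d = [1, 0, 0, 0, 0]
p_value = exact_permutation_pvalue(ydot, d)
print(f"Exact permutation p-value: {p_value:.2f}")
```

With one treated unit among five, only five assignments exist, so the smallest attainable p-value is 1/5. This shows why permutation p-values are coarse when the number of units is very small.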

Control Variables

# Include time-invariant control variables
# Note: Controls must be constant within each unit across all periods
# For time-varying variables, use pre-treatment mean or first value

# Create time-invariant controls from time-varying variables
data_with_controls = data.copy()
for var in ['retprice', 'beer']:
    # Use pre-treatment period mean
    pre_mean = data[data['post']==0].groupby('state')[var].mean()
    data_with_controls[f'{var}_pre'] = data_with_controls['state'].map(pre_mean)

results = lwdid(
    data_with_controls, 'lcigsale', 'd', 'state', 'year', 'post', 'detrend',
    controls=['retprice_pre', 'beer_pre'],  # time-invariant covariates
    vce='hc3'
)
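
Before passing controls, it can help to confirm that each one really is constant within units. The helper below is a hypothetical pre-check written for this README, not part of the lwdid API; it lists any columns that take more than one distinct value within a unit.

```python
import pandas as pd

def time_varying_columns(df, unit, columns):
    """Return the columns that take more than one distinct value within any unit."""
    n_distinct = df.groupby(unit)[list(columns)].nunique(dropna=False)
    return [c for c in columns if (n_distinct[c] > 1).any()]

# Toy data: 'retprice' varies over time, 'retprice_pre' does not.
toy = pd.DataFrame({
    "state": ["CA", "CA", "NV", "NV"],
    "retprice": [1.0, 1.2, 0.9, 0.9],
    "retprice_pre": [1.1, 1.1, 0.9, 0.9],
})
varying = time_varying_columns(toy, "state", ["retprice", "retprice_pre"])
print(varying)
```

Any column reported here would need to be collapsed to a unit-level value, for example its pre-treatment mean, before being used as a control.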

Quarterly Data

# Quarterly panel with seasonal effects
# Example: data with columns [unit, year, quarter, outcome, d, post]
results = lwdid(
    data, 'outcome', 'd', 'unit',
    tvar=['year', 'quarter'],  # composite time variable
    post='post',
    rolling='detrendq'         # quarterly detrending
)

Capabilities

Core Features

Transformation methods: demean, detrend, demeanq, detrendq
Inference options: Exact (under normality and homoskedasticity), HC1 robust, HC3 robust, cluster-robust
Control variables: Time-invariant covariates with automatic centering
Period-specific effects: Estimate ATT for each post-treatment period
Randomization inference: Bootstrap (default) or permutation-based p-values for finite-sample validity
Visualization: Time series plots comparing treated and control units
Export formats: Excel (multi-sheet), CSV, LaTeX tables
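
Period-specific effects can be pictured as one cross-sectional comparison per post-treatment period. The sketch below assumes the outcome has already been transformed and stored in long format; with a binary treatment indicator and no controls, the regression coefficient in each period equals the treated-minus-control difference in means. The function name, column names, and data are hypothetical, and the package's att_by_period output may be organized differently.

```python
import pandas as pd

def att_by_period(df, ydot, d, tvar):
    """Treated-minus-control mean of the transformed outcome in each period."""
    means = df.groupby([tvar, d])[ydot].mean().unstack(d)
    return (means[1] - means[0]).rename("att")

# Toy transformed outcomes for two post-treatment periods.
toy = pd.DataFrame({
    "unit": ["A", "B", "C"] * 2,
    "year": [5, 5, 5, 6, 6, 6],
    "d": [1, 0, 0] * 2,
    "ydot": [2.0, 0.5, -0.5, 3.0, 1.0, 0.0],
})
effects = att_by_period(toy, "ydot", "d", "year")
print(effects)
```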

Validation

The implementation has been validated for numerical accuracy and consistency with the methodology described in Lee and Wooldridge (2025).

Requirements

Python ≥ 3.8, <3.13
numpy ≥ 1.20, <3.0
pandas ≥ 1.3, <3.0
scipy ≥ 1.7, <2.0
statsmodels ≥ 0.13, <1.0
matplotlib ≥ 3.3 (visualization)
openpyxl ≥ 3.1 (Excel export)

API Reference

Main Function

lwdid(data, y, d, ivar, tvar, post, rolling, **options)

Required Parameters:

data (DataFrame): Panel data in long format
y (str): Outcome variable
d (str): Unit-level treatment indicator Dᵢ (0/1)
- Important: Must be time-invariant (constant within each unit across all periods)
- Do not pass time-varying treatment indicator Wᵢₜ = Dᵢ × postₜ
- If you have Wᵢₜ, construct Dᵢ first: data['D_i'] = data.groupby('unit')['W_it'].transform('max')
ivar (str): Unit identifier
tvar (str or list): Time variable (must be numeric)
- Annual data: Single column name (str), e.g., tvar='year'
- Quarterly data: List of two column names [year, quarter], e.g., tvar=['year', 'quarter']
- Important: All time variables must contain numeric values (int or float)
post (str): Post-treatment indicator (0/1)
rolling (str): Transformation method
- 'demean': Standard DiD with unit fixed effects
- 'detrend': DiD with unit-specific linear trends
- 'demeanq': Quarterly data with seasonal effects
- 'detrendq': Quarterly data with trends and seasonal effects
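
If the data only contain the time-varying indicator Wᵢₜ, both the unit-level indicator and the post indicator can be recovered under common treatment timing. The sketch below uses a toy frame with hypothetical column names (unit, year, W_it); the post construction assumes that a period counts as post-treatment whenever any unit is treated in it.

```python
import pandas as pd

toy = pd.DataFrame({
    "unit": ["A", "A", "A", "B", "B", "B"],
    "year": [2000, 2001, 2002] * 2,
    "W_it": [0, 1, 1, 0, 0, 0],   # unit A is treated from 2001 onward
})

# Unit-level indicator: 1 if the unit is ever treated.
toy["D_i"] = toy.groupby("unit")["W_it"].transform("max")

# Period-level indicator: 1 if any unit is treated in that period.
toy["post"] = toy.groupby("year")["W_it"].transform("max")

# Sanity check: the original indicator is the product of the two.
assert (toy["W_it"] == toy["D_i"] * toy["post"]).all()
print(toy)
```

The resulting D_i and post columns are the ones to pass as d and post.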

Optional Parameters:

vce (str or None): Variance estimator (default: None; string values are case-insensitive)
- None: Homoskedastic standard errors (exact inference under normality)
- 'robust' or 'hc1': HC1 heteroskedasticity-robust standard errors
- 'hc3': HC3 small-sample adjusted heteroskedasticity-robust standard errors
- 'cluster': Cluster-robust standard errors (requires cluster_var)
cluster_var (str): Cluster variable for cluster-robust standard errors (required when vce='cluster')
- Must be a column name in the data
- Clusters are typically defined at a higher aggregation level than units
- Inference uses G-1 degrees of freedom, where G is the number of clusters
controls (list of str): Time-invariant control variables
- Controls are included only if both N₁ > K+1 and N₀ > K+1 (where K is the number of controls)
- If conditions are not met, controls are excluded and a warning is issued
ri (bool): Enable randomization inference (default: False)
ri_method (str): Resampling method for randomization inference (default: 'bootstrap')
- 'bootstrap': With-replacement resampling
- 'permutation': Without-replacement permutation (Fisher randomization inference)
rireps (int): Number of replications for randomization inference (default: 1000)
seed (int): Random seed for reproducibility
graph (bool): Generate visualization (default: False)
- If plotting fails, a warning is issued and estimation continues unaffected
gid (str/int): Unit identifier for plotting (default: None, which plots the treated group mean)
graph_options (dict): Matplotlib plotting options (default: None)
- Supported keys: figsize, title, xlabel, ylabel, legend_loc, dpi, savefig

Returns: LWDIDResults object with the following attributes:

Attribute	Type	Description
`att`	float	Average treatment effect on treated
`se_att`	float	Standard error
`t_stat`	float	t-statistic
`pvalue`	float	Two-sided p-value
`ci_lower`, `ci_upper`	float	95% confidence interval
`att_by_period`	DataFrame	Period-specific treatment effects
`ri_pvalue`	float	Randomization inference p-value (if `ri=True`)
`rireps`	int	Number of RI replications (if `ri=True`)
`ri_method`	str	RI method used: 'bootstrap' or 'permutation' (if `ri=True`)
`ri_valid`	int	Number of successful RI replications (if `ri=True`)
`ri_failed`	int	Number of failed RI replications (if `ri=True`)
`nobs`	int	Number of observations in the cross-sectional regression (equals number of units)
`n_treated`	int	Number of treated units
`n_control`	int	Number of control units
`df_resid`	int	Residual degrees of freedom (N - K - 1)
`df_inference`	int	Degrees of freedom used for inference (G - 1 for cluster-robust SE, df_resid otherwise)

Methods:

summary(): Formatted results table (the examples pass its return value to print())
plot(gid=None, graph_options=None): Visualize transformed outcomes over time
- Plots residualized outcomes after removing unit-specific patterns
- Useful for assessing parallel trends assumption
- gid: Unit identifier to plot (default: treated group mean)
- graph_options: Dictionary of matplotlib options
to_excel(path): Export to Excel workbook
to_csv(path): Export period-specific effects to CSV
to_latex(path): Export to LaTeX table

Usage Guidelines

Inference Choice:

Use vce=None for exact inference when N is small and normality is plausible
Use vce='hc3' for moderate samples (N ≥ 10) or when heteroskedasticity is suspected
Use vce='cluster' for cluster-robust inference (requires cluster_var)
- Inference uses df = G - 1 (number of clusters minus 1)
- Actual degrees of freedom stored in results.df_inference
Use randomization inference (ri=True) for finite-sample validity without distributional assumptions
- Randomization inference uses homoskedastic standard errors to construct the null distribution
- The vce option affects only classical t-based inference, not the randomization inference p-value
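
To make the difference between the robust options concrete, the sketch below computes HC1 and HC3 standard errors for the coefficient on a 0/1 treatment dummy using the standard sandwich formulas. It is a numpy-only illustration with hypothetical names and data, not the package's code path.

```python
import numpy as np

def robust_se(ydot, d, kind="hc3"):
    """HC1 or HC3 standard error for the coefficient on a 0/1 dummy."""
    ydot = np.asarray(ydot, dtype=float)
    d = np.asarray(d, dtype=float)
    X = np.column_stack([np.ones_like(d), d])
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    resid = ydot - X @ (xtx_inv @ X.T @ ydot)
    # Diagonal of the hat matrix X (X'X)^-1 X'.
    leverage = np.einsum("ij,jk,ik->i", X, xtx_inv, X)
    if kind == "hc1":
        weights = resid**2 * n / (n - k)             # degrees-of-freedom scaling
    elif kind == "hc3":
        weights = resid**2 / (1.0 - leverage) ** 2   # leverage adjustment
    else:
        raise ValueError("kind must be 'hc1' or 'hc3'")
    cov = xtx_inv @ (X.T * weights) @ X @ xtx_inv    # sandwich estimator
    return float(np.sqrt(cov[1, 1]))

# Hypothetical transformed outcomes: three treated units, four controls.
ydot = [2.0, 3.0, 4.0, 0.0, 1.0, -1.0, 0.0]
d = [1, 1, 1, 0, 0, 0, 0]
se_hc1 = robust_se(ydot, d, "hc1")
se_hc3 = robust_se(ydot, d, "hc3")
print(f"HC1 SE = {se_hc1:.4f}, HC3 SE = {se_hc3:.4f}")
```

In a regression on an intercept and a dummy, each unit's leverage is one over its group size. With a single treated unit the leverage equals 1, so the HC3 weight is undefined, which is one reason to reserve HC3 for settings with more than a handful of units in each group.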

Data Format:

Panel structure:
- Data must be in long format (one row per unit-time observation)
- Each (unit, time) combination must be unique
- Time index must form a continuous sequence (consecutive periods with no gaps)
- Panels may be balanced or unbalanced across units
Treatment timing (common timing assumption):
- All treated units must begin treatment in the same period
- The post indicator must be a function of time only
- Treatment must be persistent (no reversals)
- Staggered adoption is not supported (see Lee and Wooldridge 2025, Section 7)
Time variable format:
- Annual data: Single numeric column (e.g., tvar='year')
- Quarterly data: Two numeric columns (e.g., tvar=['year', 'quarter'])
Reserved column names: Avoid d_, post_, tindex, tq, ydot, ydot_postavg, firstpost
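
Several of the requirements above can be checked before estimation. The function below is a hypothetical pre-flight check for annual data with an integer-valued time column; it is not part of lwdid, and the package may perform its own validation with different messages.

```python
import pandas as pd

def check_panel(df, unit, time, d, post):
    """Return a list of problems found; an empty list means all checks passed."""
    problems = []
    if df.duplicated([unit, time]).any():
        problems.append("duplicate (unit, time) rows")
    periods = sorted(df[time].unique())
    if periods != list(range(int(periods[0]), int(periods[-1]) + 1)):
        problems.append("time index has gaps")
    if (df.groupby(unit)[d].nunique() > 1).any():
        problems.append("treatment indicator varies within unit")
    if (df.groupby(time)[post].nunique() > 1).any():
        problems.append("post indicator varies across units within a period")
    post_by_time = df.groupby(time)[post].max().sort_index()
    if not post_by_time.is_monotonic_increasing:
        problems.append("post indicator switches back to 0")
    return problems

# Toy panel that satisfies the requirements.
good = pd.DataFrame({
    "unit": ["A"] * 3 + ["B"] * 3,
    "year": [2000, 2001, 2002] * 2,
    "d": [1, 1, 1, 0, 0, 0],
    "post": [0, 1, 1] * 2,
})
# Same panel with one year removed, which leaves a gap in the time index.
bad = good[good["year"] != 2001]

print(check_panel(good, "unit", "year", "d", "post"))
print(check_panel(bad, "unit", "year", "d", "post"))
```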

Examples

California Smoking Restrictions

# Analysis with a single treated unit (N_treated = 1, N_control = 38)
import pandas as pd
from lwdid import lwdid

data = pd.read_csv('smoking.csv')
results = lwdid(
    data,
    y='lcigsale',
    d='d',
    ivar='state',
    tvar='year',
    post='post',
    rolling='detrend',
    vce=None
)
print(results.summary())

See examples/smoking.ipynb for the complete example.

Testing

The package includes a test suite. Run it from a source checkout:

pytest tests/

Authors

Xuanyu Cai, Wenli Xu

Contributing

Contributions are welcome. Please submit bug reports or feature requests via the issue tracker.

Release history

0.2.3: Feb 17, 2026
0.2.1: Feb 14, 2026
0.2.0: Feb 10, 2026
0.1.0: Nov 27, 2025 (this version)

Download files

Source Distribution: lwdid-0.1.0.tar.gz (1.0 MB), uploaded Nov 27, 2025, Source
Built Distribution: lwdid-0.1.0-py3-none-any.whl (55.9 kB), uploaded Nov 27, 2025, Python 3

File details

Details for the file lwdid-0.1.0.tar.gz.

File metadata

Download URL: lwdid-0.1.0.tar.gz
Upload date: Nov 27, 2025
Size: 1.0 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for lwdid-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`d57c9b9dc3adfa61e53e4129cb204df9e33217aa45612e72b1790c6aa6efac89`
MD5	`afc1ed1ba0263fd294bc98ab5b71770b`
BLAKE2b-256	`c8c87a7b16881dc2e684ead8b05abf1c2635522eb06b868b2e7bb9a94c86b26c`

File details

Details for the file lwdid-0.1.0-py3-none-any.whl.

File metadata

Download URL: lwdid-0.1.0-py3-none-any.whl
Upload date: Nov 27, 2025
Size: 55.9 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.12.9

File hashes

Hashes for lwdid-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`8003e984ff840286c339df0361b85e40f41120174015901fc96b090dcc9b2d49`
MD5	`d20b4d4ce03c999c11e89b95eda231a6`
BLAKE2b-256	`0ef389d9a0200a3fb719d49950a31e9802c64ac209362e1edbe203ef66c75ffe`
