StataFlow: A Python econometrics toolkit aligned with Stata 17

StataFlow

StataFlow (stataflow) is a Python econometrics toolkit that reproduces Stata 17 estimation results with high precision. It provides both a Stata-compatible command layer (for researchers migrating from Stata) and a native Python estimator layer (for advanced users who want direct control).

What you can do today

  • Run Stata-style commands in Python: regress, reghdfe, ivregress 2sls, logit, ppmlhdfe, did_imputation, csdid, rdrobust, and more.
  • Obtain coefficients, standard errors, t/z-statistics, p-values, and confidence intervals that are field-level verified against Stata 17.
  • Work with high-dimensional fixed effects (HDFE), IV/2SLS, binary/count models, and DID/event-study estimators.
  • Use Stata-style factor-variable syntax (i.group##c.post, c.x1#c.x2, x1##x2) and space-separated absorb strings directly in wrapper commands. Bare variables inside # / ## are treated as continuous, matching common Stata usage.
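
To illustrate what a `##` term denotes, the sketch below expands `i.group##c.post` by hand with pandas: dummies for each non-base level of the factor, the continuous variable, and their products. This is not StataFlow's parser. The helper name `expand_full_factorial` and the column labels are made up for this example, and the lowest level is taken as the base, as in Stata's default.

```python
import pandas as pd

def expand_full_factorial(df, factor, continuous):
    """Hand-rolled expansion of i.<factor>##c.<continuous>.

    Hypothetical helper for illustration; not part of StataFlow.
    """
    levels = sorted(df[factor].unique())
    out = pd.DataFrame(index=df.index)
    for level in levels[1:]:  # the lowest level is the omitted base category
        dummy = (df[factor] == level).astype(float)
        out[f"{level}.{factor}"] = dummy                               # main effect
        out[f"{level}.{factor}#c.{continuous}"] = dummy * df[continuous]  # interaction
    out[continuous] = df[continuous].astype(float)                     # continuous main effect
    return out

demo = pd.DataFrame({"group": [1, 2, 3, 2], "post": [0.0, 1.0, 1.0, 0.0]})
design = expand_full_factorial(demo, "group", "post")
print(design)
```

A single `#` term would keep only the product columns and omit the main effects.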

What is not yet supported

  • Multi-way clustering: regress supports two-way clustering (Cameron-Gelbach-Miller 2011); all other commands currently use single-cluster robust inference only.
  • Direct post-estimation on wrapper returns: the compat.stata wrappers return ResultSchema objects; predict and margins are available on the core estimator layer only.
  • Full command surfaces for community commands: reghdfe, ivreghdfe, ppmlhdfe, did_imputation, eventstudyinteract, csdid, and rdrobust are implemented as verified high-frequency subsets, not complete Stata command reproductions. Unsupported options are explicitly rejected rather than silently ignored.
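
For readers unfamiliar with two-way clustering, the following NumPy sketch shows the textbook decomposition: the covariance clustered on the first dimension, plus the one clustered on the second, minus the one clustered on their intersection. It is an illustration under simplifying assumptions (plain OLS, no small-sample corrections) and is not StataFlow's implementation. The function names and the simulated data are invented for the example.

```python
import numpy as np

def cluster_meat(X, resid, groups):
    """Sum over clusters g of (X_g' e_g)(X_g' e_g)'."""
    scores = X * resid[:, None]  # per-observation score contributions
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(groups):
        s = scores[groups == g].sum(axis=0)
        meat += np.outer(s, s)
    return meat

def twoway_cluster_vcov(X, y, g1, g2):
    """OLS coefficients and two-way clustered covariance (no small-sample factors)."""
    bread = np.linalg.inv(X.T @ X)
    beta = bread @ X.T @ y
    resid = y - X @ beta
    # Intersection clusters: one label per distinct (g1, g2) pair
    g12 = np.array([f"{a}|{b}" for a, b in zip(g1, g2)])
    meat = (cluster_meat(X, resid, g1)
            + cluster_meat(X, resid, g2)
            - cluster_meat(X, resid, g12))
    return beta, bread @ meat @ bread

# Simulated data, for illustration only
rng = np.random.default_rng(0)
n = 200
firm = rng.integers(0, 20, size=n)
year = rng.integers(0, 10, size=n)
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([1.0, 2.0]) + rng.normal(size=n)

beta, vcov = twoway_cluster_vcov(X, y, firm, year)
print(beta)
print(vcov)
```

Because one term is subtracted, the resulting matrix is not guaranteed to be positive semi-definite in finite samples, which is one reason production implementations apply further adjustments.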

Completeness legend

  • Stable — synthetic + real-data dual-run verified; core API is unlikely to change.
  • Alpha — high-frequency paths are implemented and verified, but the command surface is still a subset of the full Stata community command.
  • Alpha — Partial — a verifiable implementation exists, but large functional areas are still missing (e.g., fuzzy RD for rdrobust, weights beyond aweight).

See the Command Support Matrix for the detailed status of each command.


Installation

pip install StataFlow

Requirements: Python 3.10+, NumPy, pandas, SciPy.

For development (editable install from source):

git clone https://github.com/ZhenHaoFu810/StataFlow.git
cd StataFlow
pip install -e .

Quick start

Stata-compatible command layer (recommended)

All compat.stata wrappers return a ResultSchema object with coefficients, standard errors, and fit statistics. They do not expose .predict() or .margins() directly—use the core estimator layer below for post-estimation.

import pandas as pd
from stataflow.compat.stata import regress, reghdfe, ivregress_2sls, logit

# df is assumed to be a pandas DataFrame that already holds the columns used below

# OLS with robust standard errors
result = regress(df, y="wage", x=["edu", "exper"], vce="robust")

# High-dimensional fixed effects (reghdfe)
result = reghdfe(
    df, y="wage", x=["edu", "exper"],
    absorb="firm_id year_id", vce="cluster", cluster="industry"
)

# Factor-variable syntax in HDFE
result = reghdfe(
    df, y="wage", x=["i.industry##c.post"], absorb="firm_id year_id"
)

# 2SLS
result = ivregress_2sls(
    df, y="lwage", x_exog=["edu"], x_endog=["exper"],
    instruments=["age", "kidslt6"], vce="robust"
)

# Logit
result = logit(df, y="inlf", x=["nwifeinc", "educ", "exper"])

For runnable examples, see the examples/ directory.

Native Python estimator layer (advanced)

from stataflow import OLS, FixedEffectsOLS, AbsorbingOLS, Logit, IV2SLS

model = OLS(data=df, y="wage", x=["edu", "exper"])
result = model.fit(vce="robust")
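
As background on what `vce="robust"` conventionally means for Stata's regress, here is a NumPy sketch of OLS with heteroskedasticity-robust (HC1) standard errors, where the sandwich estimator is scaled by n / (n - k). It shows the textbook computation only and is not StataFlow's code. The function name `ols_hc1` and the simulated data are assumptions made for the example.

```python
import numpy as np

def ols_hc1(X, y):
    """OLS coefficients and HC1 robust standard errors (textbook formula)."""
    n, k = X.shape
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    meat = (X * resid[:, None] ** 2).T @ X          # X' diag(e^2) X
    vcov = n / (n - k) * xtx_inv @ meat @ xtx_inv   # HC1 degrees-of-freedom scaling
    return beta, np.sqrt(np.diag(vcov))

# Simulated data, for illustration only
rng = np.random.default_rng(1)
n = 100
X = np.column_stack([np.ones(n), rng.normal(size=n)])
y = X @ np.array([0.5, 1.5]) + rng.normal(size=n)

beta, se = ols_hc1(X, y)
print(beta, se)
```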

Supported commands

Command             Python entry                               Core capabilities
------------------  -----------------------------------------  --------------------------------------------------
regress             stataflow.compat.stata.regress             OLS, robust, cluster, aweight
xtreg, fe           stataflow.compat.stata.xtreg_fe            Fixed effects (within), cluster
areg                stataflow.compat.stata.areg                Single absorb variable FE
reghdfe             stataflow.compat.stata.reghdfe             1+ group HDFE, cluster, singleton drop
ivregress 2sls      stataflow.compat.stata.ivregress_2sls      2SLS, robust, cluster
ivreghdfe           stataflow.compat.stata.ivreghdfe           IV + 1+ group HDFE, cluster
logit               stataflow.compat.stata.logit               MLE, robust, cluster
probit              stataflow.compat.stata.probit              MLE, robust, cluster
poisson             stataflow.compat.stata.poisson             MLE, robust, cluster
ppmlhdfe            stataflow.compat.stata.ppmlhdfe            PPML + 1+ group HDFE
did_imputation      stataflow.compat.stata.did_imputation      BJS DID imputation
eventstudyinteract  stataflow.compat.stata.eventstudyinteract  Sun & Abraham IW estimator
csdid               stataflow.compat.stata.csdid               Callaway-Sant'Anna DID (method="reg" only)
rdrobust            stataflow.compat.stata.rdrobust            Sharp RD local polynomial (bwselect="mserd", covs)

Full details: docs/command-support-matrix/README.md
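
The "singleton drop" entry for reghdfe refers to removing observations that are alone in a fixed-effect group. A minimal pandas sketch of one common approach follows: drop such rows, then repeat, because each removal can create new singletons in another dimension. The helper `drop_singletons` and the toy panel are hypothetical. Whether StataFlow follows exactly this procedure is not stated here.

```python
import pandas as pd

def drop_singletons(df, absorb):
    """Repeatedly drop rows whose group in any absorb variable has one observation."""
    out = df
    while True:
        keep = pd.Series(True, index=out.index)
        for col in absorb:
            counts = out[col].map(out[col].value_counts())  # group size per row
            keep &= counts > 1
        if keep.all():
            return out
        out = out[keep]

# Toy panel, for illustration only
panel = pd.DataFrame({
    "firm_id": [1, 1, 1, 2, 3, 3],
    "year_id": [2000, 2001, 2002, 2000, 2001, 2001],
})
kept = drop_singletons(panel, ["firm_id", "year_id"])
print(kept.index.tolist())  # [4, 5]
```

In this toy panel a single pass would remove only two rows; the cascade across firm and year removes four.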


Validation philosophy

Every public command is validated with two lines of evidence:

  1. Synthetic / controlled cases — formula, degrees of freedom, sample screening, edge cases.
  2. Real public datasets — field-level comparison against Stata 17 on openly available economic/financial data.

A command is considered "done" only when both lines pass and the source-to-Python mapping is documented. We do not accept "statistical equivalence" without explicit mathematical or source-code justification.
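
A field-level comparison of this kind can be sketched as follows, assuming the Python and Stata outputs have each been collected into a dictionary of arrays keyed by field name. The function `compare_fields`, the field names, the tolerance, and the numbers are all placeholders for illustration; the project's actual comparison harness and tolerances are not described here.

```python
import numpy as np

def compare_fields(python_result, stata_result, rtol=1e-8):
    """Return the names of fields that are missing or differ beyond the tolerance."""
    mismatches = []
    for field, expected in stata_result.items():
        actual = python_result.get(field)
        if actual is None or not np.allclose(actual, expected, rtol=rtol, atol=0.0):
            mismatches.append(field)
    return mismatches

# Made-up numbers; the second standard error is deliberately off
stata_ref = {"b": [1.25, -0.5], "se": [0.1, 0.2]}
py_out = {"b": [1.25, -0.5], "se": [0.1, 0.2000001]}
print(compare_fields(py_out, stata_ref))  # ['se']
```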

Public evidence and results are available in research/results/validation/.

Running tests

# Unit and integration tests (fast)
pytest tests/ -v --ignore=tests/golden/

# Golden dual-run tests (require Stata 17)
pytest tests/golden/ -v
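
Since the golden tests need a local Stata installation, it can help to check for one before running them. The sketch below looks for a Stata console executable on the PATH. The candidate names are assumptions about a typical Unix installation, and how the test suite itself locates Stata is not documented here.

```python
import shutil

def find_stata(candidates=("stata-mp", "stata-se", "stata")):
    """Return the path of the first candidate executable found on PATH, else None."""
    for name in candidates:
        path = shutil.which(name)
        if path:
            return path
    return None

stata_path = find_stata()
if stata_path is None:
    print("Stata not found on PATH; skip tests/golden/")
else:
    print(f"Using Stata at {stata_path}")
```

A helper like this could back a `pytest.mark.skipif` condition so that the golden tests are skipped cleanly on machines without Stata.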

Project structure

  • src/stataflow/estimators/ — Core Python estimators (OLS, AbsorbingOLS, Logit, PPMLHDFE, DIDImputation, etc.)
  • src/stataflow/compat/stata/ — Stata command wrappers (regress(), reghdfe(), ivregress_2sls(), etc.)
  • docs/command-support-matrix/ — Per-command support matrices
  • examples/ — Runnable demonstration scripts
  • tests/ — Unit and integration tests

Default target version

Stata 17


Documentation


Governance

  • Codex — project goals, architecture, review gates, and statistical-dispute arbitration.
  • Claude Code — implementation, testing, and evidence backfill.
