A package to create representative microdata for the US.
PolicyEngine US Data
Installation
While it is possible to install from PyPI:
pip install policyengine-us-data
the recommended installation is
pip install -e .[dev]
which installs the package and its development dependencies in editable mode (so that
changes to the package code are reflected immediately); policyengine-us-data is a dev
package and is not intended for direct access.
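To confirm that an install is visible to the current interpreter, a minimal sketch using only the standard library might look like this. The helper name installed_version is hypothetical; the distribution name is the one used in the pip command above.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name="policyengine-us-data"):
    """Return the installed version string, or None if the distribution is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    print(installed_version() or "policyengine-us-data is not installed")
```

The check only reports what the packaging metadata says; it does not distinguish an editable install from a regular one.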
Pull Requests
PRs must come from branches pushed to PolicyEngine/policyengine-us-data, not from
personal forks. The PR workflow hard-fails fork-based PRs before the real test suite
runs because the required secrets are unavailable there.
Before opening a PR, push the current branch to the upstream repo:
make push-pr-branch
That target pushes the current branch to the upstream remote and sets tracking so
gh pr create opens the PR from PolicyEngine/policyengine-us-data.
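As a local pre-check before opening a PR, the remote that a branch tracks can be compared against the upstream repository. The sketch below is an illustration only: is_upstream_remote is a hypothetical helper, it assumes GitHub-style HTTPS or SSH remote URLs, and it is not the check the PR workflow itself runs.

```python
import re

UPSTREAM = "PolicyEngine/policyengine-us-data"


def is_upstream_remote(url, upstream=UPSTREAM):
    """True if a git remote URL (HTTPS or SSH form) points at the upstream repo."""
    match = re.search(
        r"github\.com[:/](?P<slug>[^/]+/[^/]+?)(?:\.git)?/?$", url.strip()
    )
    # Compare case-insensitively, since GitHub owner and repo names are not case-sensitive.
    return bool(match) and match.group("slug").lower() == upstream.lower()
```

The URL to test can be obtained with git remote get-url followed by the remote name.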
SSA Data Sources
The following SSA data sources are used in this project:
- Latest Trustee's Report (2025) - Source for social_security_aux.csv (extracted via extract_ssa_costs.py)
- Single Year Supplementary Tables (2025) - Long-range demographic and economic projections
- Single Year Age Demographic Projections (2024 - latest published) - Source for SSPopJul_TR2024.csv population data
Pipeline Overview
PolicyEngine constructs its representative household datasets through a multi-step pipeline. Public survey data is merged, stratified, and cloned to geographic variants per household. Each clone is simulated through PolicyEngine US with stochastic take-up, then calibrated via L0-regularized optimization against administrative targets at the national, state, and congressional district levels, producing geographically representative datasets.
The Enhanced CPS (make data-legacy) produces a national-only calibrated dataset. For the current geography-specific pipeline, see docs/calibration.md.
The repo currently contains two calibration tracks:
- Legacy Enhanced CPS (make data-legacy), which uses the older EnhancedCPS / build_loss_matrix() path for national-only calibration.
- Unified calibration (docs/calibration.md), which uses storage/calibration/policy_data.db and the sparse matrix + L0 pipeline for current national and geography-specific builds.
For detailed calibration usage, see docs/calibration.md and modal_app/README.md.
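To make the idea of L0-regularized calibration concrete, here is a toy sketch of one possible objective: squared relative error against the targets plus a penalty on the number of households with non-zero weight. Everything here is an assumption for illustration (the function name, the matrix layout, the loss form, the numbers); the actual loss and optimizer used by the pipeline are documented in docs/calibration.md and are not reproduced here. The function only evaluates the objective; it does not optimize it.

```python
def calibration_objective(matrix, weights, targets, l0_penalty):
    """Sum of squared relative errors plus an L0 penalty on the weights.

    matrix[i][j] is household j's contribution to target i (hypothetical layout).
    """
    loss = 0.0
    for row, target in zip(matrix, targets):
        estimate = sum(m * w for m, w in zip(row, weights))
        loss += ((estimate - target) / target) ** 2
    # The L0 "norm" counts the households that keep a non-zero weight.
    nonzero = sum(1 for w in weights if w != 0)
    return loss + l0_penalty * nonzero


# Made-up example: three households, a population target and an income target.
matrix = [
    [1, 1, 1],           # each household counts once toward population
    [1000, 2000, 3000],  # each household's income
]
targets = [300, 600000]

dense = calibration_objective(matrix, [100, 100, 100], targets, l0_penalty=0.5)
sparse = calibration_objective(matrix, [150, 0, 150], targets, l0_penalty=0.5)
```

Both weight vectors hit the two targets exactly, so the penalty term alone separates them: the sparse solution scores lower because it uses two households instead of three.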
Running the Full Pipeline
The pipeline runs as sequential steps in Modal:
make pipeline # prints the steps below
# 1. Build data (CPS/PUF/ACS → source-imputed stratified CPS)
make build-data-modal
# 2. Build calibration matrices (CPU, ~10h)
make build-matrices
# 3. Fit weights (GPU, county + national in parallel)
make calibrate-both
# 4. Build H5 files (state/district/city + national in parallel)
make stage-all-h5s
# 5. Promote to versioned HF paths
make promote
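Because each step depends on the output of the previous one, a wrapper that runs the targets in order and stops at the first failure can be handy. The sketch below assumes the five make targets listed above; the run_pipeline helper and its injectable run parameter are hypothetical conveniences, not part of the repository.

```python
import subprocess

PIPELINE_TARGETS = [
    "build-data-modal",
    "build-matrices",
    "calibrate-both",
    "stage-all-h5s",
    "promote",
]


def run_pipeline(targets=PIPELINE_TARGETS, run=subprocess.run):
    """Run each make target in order and return the targets that completed.

    check=True makes subprocess.run raise CalledProcessError on a non-zero
    exit status, so later steps never start after a failure.
    """
    completed = []
    for target in targets:
        run(["make", target], check=True)
        completed.append(target)
    return completed
```

Passing a different run callable makes it possible to dry-run the sequence without invoking make.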
Building the Paper
Prerequisites
The paper requires a LaTeX distribution (e.g., TeXLive or MiKTeX) with the following packages:
- graphicx (for figures)
- amsmath (for mathematical notation)
- natbib (for bibliography management)
- hyperref (for PDF links)
- booktabs (for tables)
- geometry (for page layout)
- microtype (for typography)
- xcolor (for colored links)
On Ubuntu/Debian, you can install these with:
sudo apt-get install texlive-latex-base texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended
On macOS with Homebrew:
brew install --cask mactex
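To check whether the packages listed above are already available, one option is to ask kpsewhich (the file lookup tool that ships with TeX Live and MacTeX) for each package's .sty file. This is a sketch under that assumption; the helper names are hypothetical, and MiKTeX users may need a different lookup.

```python
import shutil
import subprocess

REQUIRED_PACKAGES = [
    "graphicx", "amsmath", "natbib", "hyperref",
    "booktabs", "geometry", "microtype", "xcolor",
]


def kpsewhich_finds(package):
    """True if kpsewhich locates <package>.sty in the TeX installation."""
    if shutil.which("kpsewhich") is None:
        return False  # no TeX distribution on PATH
    result = subprocess.run(
        ["kpsewhich", f"{package}.sty"], capture_output=True, text=True
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def missing_packages(packages=REQUIRED_PACKAGES, finder=kpsewhich_finds):
    """Return the packages that the finder could not locate."""
    return [p for p in packages if not finder(p)]
```

An empty list from missing_packages() suggests the prerequisites are in place before running the paper build.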
Building
To build the paper:
make paper
To clean LaTeX build files:
make clean-paper
The output PDF will be at paper/main.pdf.
Building the Documentation
Prerequisites
The documentation uses Jupyter Book 2 (pre-release) with MyST. To install:
# Install Jupyter Book 2 pre-release
pip install --pre "jupyter-book==2.*"
# Install MyST CLI
npm install -g mystmd
Building
To build and serve the documentation locally:
cd docs
myst start
Or alternatively from the project root:
jupyter book start docs
Both commands will start a local server at http://localhost:3001 where you can view the documentation.
A legacy Makefile command also exists:
make documentation
Note: The Makefile uses the older jb command syntax, which may not work with Jupyter Book 2. Use myst start or jupyter book start docs instead.
TRACE provenance output
Each US data release now publishes both:
- release_manifest.json
- trace.tro.jsonld
The release manifest remains the operational source of truth for:
- published artifact paths and checksums
- build IDs and timestamps
- build-time policyengine-us provenance
trace.tro.jsonld is a generated TRACE declaration built from that manifest. It gives a
standards-based provenance export over the same release artifacts, including a
composition fingerprint across the release manifest and the artifacts it describes.
The TRO uses the canonical TROv 0.1 vocabulary and
surfaces PolicyEngine-specific build provenance under the https://policyengine.org/trace/0.1#
extension namespace. Structured fields on the performance node
(pe:dataBuildFingerprint, pe:builtWithModelVersion, pe:builtWithModelGitSha,
pe:dataBuildId, pe:emittedIn) let a verifier cross-check this TRO against the
certified-bundle TRO emitted by policyengine.py without parsing prose.
The emitted TRO is validated against policyengine_us_data/schemas/trace_tro.schema.json.
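The cross-check enabled by the structured fields on the performance node could look roughly like the sketch below. It assumes the two performance nodes have already been extracted as flat dictionaries keyed by the pe: field names, and that the four fields compared here are expected to agree between the data-release TRO and the certified-bundle TRO; neither assumption is confirmed by this README. pe:emittedIn is left out on the guess that it differs between the two. The helper name and the sample values are hypothetical.

```python
PE_FIELDS = (
    "pe:dataBuildFingerprint",
    "pe:builtWithModelVersion",
    "pe:builtWithModelGitSha",
    "pe:dataBuildId",
)


def mismatched_fields(data_node, bundle_node, fields=PE_FIELDS):
    """Return {field: (data_value, bundle_value)} for fields that differ or are missing."""
    mismatches = {}
    for field in fields:
        data_value = data_node.get(field)
        bundle_value = bundle_node.get(field)
        if data_value is None or bundle_value is None or data_value != bundle_value:
            mismatches[field] = (data_value, bundle_value)
    return mismatches


# Placeholder values, not real build identifiers.
data_node = {
    "pe:dataBuildFingerprint": "sha256:aaaa",
    "pe:builtWithModelVersion": "0.0.0",
    "pe:builtWithModelGitSha": "abc123",
    "pe:dataBuildId": "build-1",
}
bundle_node = dict(data_node, **{"pe:dataBuildId": "build-2"})
report = mismatched_fields(data_node, bundle_node)
```

An empty result means the compared fields agree; here report flags only pe:dataBuildId.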
Important boundary:
- the TRACE file does not replace the release manifest
- the TRACE file does not decide model/data compatibility
For the broader certified-bundle architecture, see policyengine.py release bundles and the official TRACE specification.
File details
Details for the file policyengine_us_data-1.89.1.tar.gz.
File metadata
- Download URL: policyengine_us_data-1.89.1.tar.gz
- Upload date:
- Size: 55.6 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 28734e45e0ef96f28bb735051408d3ca57734f0f1c152c5f7a8de0067de7e751 |
| MD5 | 6c3d024aaa0839393783cc8df6904cfe |
| BLAKE2b-256 | 041493eca6fae9ca6a40ad59cfe76a1ec3efc9749af4bb3e19b2fbd1ccad5a92 |
File details
Details for the file policyengine_us_data-1.89.1-py3-none-any.whl.
File metadata
- Download URL: policyengine_us_data-1.89.1-py3-none-any.whl
- Upload date:
- Size: 48.3 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.12
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0160e856c47e95db039e7059ab568d88d5c4e6ec87364016a1dc6db1a27b88e7 |
| MD5 | 4040cf37f2c1e95f4af24fac092e1319 |
| BLAKE2b-256 | 3b61d5f61ed542fd343c5eb99cc98e78da387713e2897b887bdba0677c7e98ac |
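A downloaded file can be checked against the SHA256 digests listed above with the standard library. This is a minimal sketch; the helper names are hypothetical, and the file is read in chunks because the distributions are tens of megabytes.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, expected_hex):
    """True if the file's SHA-256 digest equals the published hex digest."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, matches_published("policyengine_us_data-1.89.1.tar.gz", "28734e45e0ef96f28bb735051408d3ca57734f0f1c152c5f7a8de0067de7e751") should return True for an intact copy of the source distribution.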