
A package to create representative microdata for the US.


PolicyEngine US Data

Installation

While it is possible to install from PyPI:

pip install policyengine-us-data

the recommended installation is

pip install -e .[dev]

which installs the package in editable mode together with its development dependencies, so that changes to the package code take effect immediately. policyengine-us-data is a development package and is not intended for direct access.
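To confirm that either installation route succeeded, the installed version can be read from the package metadata. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of this package.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return version(dist_name)
    except PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version("policyengine-us-data")
    print(found if found else "policyengine-us-data is not installed")
```

The lookup reads distribution metadata only, so it does not import the package or trigger any data downloads.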

Pull Requests

PRs must come from branches pushed to PolicyEngine/policyengine-us-data, not from personal forks. The PR workflow hard-fails fork-based PRs before the real test suite runs because the required secrets are unavailable there.

Before opening a PR, push the current branch to the upstream repo:

make push-pr-branch

That target pushes the current branch to the upstream remote and sets tracking so gh pr create opens the PR from PolicyEngine/policyengine-us-data.
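As a quick pre-flight check, a remote URL can be compared against the upstream repository before opening a PR. This is a sketch only: the helper names are hypothetical, and it assumes the remote is hosted on github.com and addressed with a standard HTTPS or SSH URL.

```python
import re
from typing import Optional

UPSTREAM = "PolicyEngine/policyengine-us-data"


def repo_slug(remote_url: str) -> Optional[str]:
    """Extract 'owner/name' from an HTTPS or SSH GitHub remote URL."""
    match = re.search(
        r"github\.com[:/]([^/]+/[^/]+?)(?:\.git)?/?$", remote_url.strip()
    )
    return match.group(1) if match else None


def is_upstream(remote_url: str) -> bool:
    """True when the remote points at the upstream repo rather than a fork."""
    slug = repo_slug(remote_url)
    return slug is not None and slug.lower() == UPSTREAM.lower()


if __name__ == "__main__":
    print(is_upstream("git@github.com:PolicyEngine/policyengine-us-data.git"))
    print(is_upstream("https://github.com/someone/policyengine-us-data"))
```

The URL to check can be obtained with git remote get-url followed by the remote's name; which remote name your clone uses for the upstream repo depends on your local setup.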

SSA Data Sources

The following SSA data sources are used in this project:

Pipeline Overview

PolicyEngine constructs its representative household datasets through a multi-step pipeline. Public survey data is merged, stratified, and cloned into geographic variants of each household. Each clone is simulated through PolicyEngine US with stochastic take-up, then calibrated via L0-regularized optimization against administrative targets at the national, state, and congressional district levels. The result is a set of geographically representative datasets.
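The calibration step can be illustrated with a toy reweighting problem. The sketch below is not the project's implementation: it uses plain gradient descent on squared relative error, omits the L0 penalty and the stochastic take-up simulation, and runs on made-up data. All function names are hypothetical.

```python
import math


def calibrate(features, targets, steps=2000, lr=0.2):
    """Fit positive household weights so weighted column sums approach targets.

    features: one row per household, one column per target.
    targets:  the total each weighted column should sum to.
    """
    n_targets = len(targets)
    log_w = [0.0] * len(features)  # every weight starts at exp(0) = 1
    for _ in range(steps):
        weights = [math.exp(u) for u in log_w]
        # Relative error of each weighted total against its target.
        rel_err = [
            (sum(w * row[j] for w, row in zip(weights, features)) - targets[j])
            / targets[j]
            for j in range(n_targets)
        ]
        # Gradient of sum(rel_err ** 2) with respect to each log-weight.
        for i, row in enumerate(features):
            grad = weights[i] * sum(
                2.0 * rel_err[j] * row[j] / targets[j] for j in range(n_targets)
            )
            log_w[i] -= lr * grad
    return [math.exp(u) for u in log_w]


def weighted_totals(features, weights):
    """Weighted sum of each feature column."""
    n_cols = len(features[0])
    return [sum(w * row[j] for w, row in zip(weights, features)) for j in range(n_cols)]


# Toy data (made up): columns are [household count, income in $ thousands].
households = [[1, 10], [1, 20], [1, 50]]
targets = [6, 200]
weights = calibrate(households, targets)
print(weighted_totals(households, weights))
```

Optimizing log-weights keeps every weight positive without an explicit constraint. An L0 penalty, as used in the real pipeline, would additionally push many weights to exactly zero so that the calibrated dataset stays sparse.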

The Enhanced CPS (make data-legacy) produces a national-only calibrated dataset. For the current geography-specific pipeline, see docs/calibration.md.

The repo currently contains two calibration tracks:

  • Legacy Enhanced CPS (make data-legacy), which uses the older EnhancedCPS / build_loss_matrix() path for national-only calibration.
  • Unified calibration (docs/calibration.md), which uses storage/calibration/policy_data.db and the sparse matrix + L0 pipeline for current national and geography-specific builds.

For detailed calibration usage, see docs/calibration.md and modal_app/README.md.

Running the Full Pipeline

The pipeline runs as sequential steps in Modal:

make pipeline   # prints the steps below

# 1. Build data (CPS/PUF/ACS → source-imputed stratified CPS)
make build-data-modal

# 2. Build calibration matrices (CPU, ~10h)
make build-matrices

# 3. Fit weights (GPU, county + national in parallel)
make calibrate-both

# 4. Build H5 files (state/district/city + national in parallel)
make stage-all-h5s

# 5. Promote to versioned HF paths
make promote
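For unattended runs, the five targets above can be chained from a small wrapper that stops at the first failure. This is a hypothetical sketch, not part of the repo. It assumes each make target returns only after its step has finished, which the source does not state, and it defaults to a dry run because the real steps are long-running.

```python
import subprocess

# Make targets in the order listed above.
PIPELINE_TARGETS = [
    "build-data-modal",
    "build-matrices",
    "calibrate-both",
    "stage-all-h5s",
    "promote",
]


def run_pipeline(targets=PIPELINE_TARGETS, dry_run=True):
    """Run each make target in order; a failing step raises and halts the rest."""
    commands = [["make", target] for target in targets]
    for cmd in commands:
        if dry_run:
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError on a non-zero exit code.
            subprocess.run(cmd, check=True)
    return commands


if __name__ == "__main__":
    run_pipeline(dry_run=True)
```

Pass dry_run=False from the project root to execute the steps for real.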

Building the Paper

Prerequisites

The paper requires a LaTeX distribution (e.g., TeXLive or MiKTeX) with the following packages:

  • graphicx (for figures)
  • amsmath (for mathematical notation)
  • natbib (for bibliography management)
  • hyperref (for PDF links)
  • booktabs (for tables)
  • geometry (for page layout)
  • microtype (for typography)
  • xcolor (for colored links)

On Ubuntu/Debian, you can install these with:

sudo apt-get install texlive-latex-base texlive-latex-recommended texlive-latex-extra texlive-fonts-recommended

On macOS with Homebrew:

brew install --cask mactex
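To check that the packages listed above are visible to the TeX installation before building, each style file can be looked up with kpsewhich, the file-lookup tool that ships with TeX Live. The sketch below assumes kpsewhich is on the PATH and that each package provides a .sty file of the same name; the helper names are hypothetical.

```python
import shutil
import subprocess

REQUIRED_PACKAGES = [
    "graphicx", "amsmath", "natbib", "hyperref",
    "booktabs", "geometry", "microtype", "xcolor",
]


def kpsewhich_lookup(package):
    """Return True if kpsewhich can locate <package>.sty."""
    result = subprocess.run(
        ["kpsewhich", f"{package}.sty"], capture_output=True, text=True
    )
    return result.returncode == 0 and bool(result.stdout.strip())


def missing_packages(packages=REQUIRED_PACKAGES, lookup=kpsewhich_lookup):
    """Return the packages for which the lookup function reports no match."""
    return [pkg for pkg in packages if not lookup(pkg)]


if __name__ == "__main__":
    if shutil.which("kpsewhich") is None:
        print("kpsewhich not found; is a LaTeX distribution installed?")
    else:
        missing = missing_packages()
        print("missing:", ", ".join(missing) if missing else "none")
```

The lookup function is a parameter so the check can be exercised without a TeX installation, for example in CI.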

Building

To build the paper:

make paper

To clean LaTeX build files:

make clean-paper

The output PDF will be at paper/main.pdf.

Building the Documentation

Prerequisites

The documentation uses Jupyter Book 2 (pre-release) with MyST. To install:

# Install Jupyter Book 2 pre-release
pip install --pre "jupyter-book==2.*"

# Install MyST CLI
npm install -g mystmd

Building

To build and serve the documentation locally:

cd docs
myst start

Or alternatively from the project root:

jupyter book start docs

Both commands will start a local server at http://localhost:3001 where you can view the documentation.

A legacy Makefile target also exists:

make documentation

Note: This target uses the older jb command syntax, which may not work with Jupyter Book 2. Use myst start or jupyter book start docs instead.
