# microtrade

Turn monthly drops of zipped fixed-width (FWF) trade microdata into Hive-partitioned Parquet datasets, one per trade type.
microtrade streams each raw `<trade_type>_<YYYYMM>.zip` directly from its zip archive (no extraction, bounded memory), slices columns according to a versioned YAML spec, and writes `year=YYYY/month=MM/part-0.parquet` atomically under a per-type dataset root. Monthly runs reprocess all months year-to-date (YTD) of the current year; prior years are frozen.
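The streaming slice can be sketched with the standard library alone (a minimal illustration of the technique, not the package's internal API; the file name, sample records, and field offsets are made up):

```python
import io
import zipfile

# Build a tiny in-memory zip holding a fixed-width file (stand-in for a raw drop).
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("imports_202401.txt", "US2024010000123\nCA2024010000456\n")

# Hypothetical column spec: (name, start, length) slices of each record.
SPEC = [("country", 0, 2), ("period", 2, 6), ("value", 8, 7)]

rows = []
with zipfile.ZipFile(buf) as zf:
    # ZipFile.open() returns a file-like object that decompresses on the fly,
    # so the inner FWF is never extracted to disk or fully loaded in memory.
    with zf.open("imports_202401.txt") as fh:
        for raw in io.TextIOWrapper(fh, encoding="utf-8"):
            line = raw.rstrip("\n")
            rows.append({name: line[start:start + length]
                         for name, start, length in SPEC})

print(rows[0])  # {'country': 'US', 'period': '202401', 'value': '0000123'}
```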
Three trade types are supported, each with its own distinct schema:

- `imports`
- `exports_us`
- `exports_nonus`
## Requirements
- Python 3.12+
- uv for environment and dependency management
## Install

```shell
uv sync
```

This resolves and installs runtime and dev dependencies into `.venv/` based on `pyproject.toml` and `uv.lock`.
## Usage

### Import the schema workbook (once per schema version)
The authoritative schema lives in an Excel workbook. Sheets are mapped positionally: the first sheet becomes `imports`, the second `exports_us`, the third `exports_nonus` (sheet names are ignored). Each sheet's field table is autodetected by looking for a header row containing `Position`, `Description`, `Length`, and `Type`; rows with `Description = Blank` are FWF padding and are skipped. See `examples/microdata-layout.xls` for a reference workbook with the expected shape.
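The autodetection rule can be sketched with plain Python (a simplified illustration, not the package's parser; the sample sheet rows are invented):

```python
# A sheet as a list of rows, as an Excel reader might yield it (sample data).
sheet = [
    ["Record layout for imports", None, None, None],
    [None, None, None, None],
    ["Position", "Description", "Length", "Type"],  # <- header row to find
    ["1-2", "Country code", 2, "char"],
    ["3-8", "Period", 6, "char"],
    ["9-15", "Blank", 7, "char"],                   # FWF padding, skipped
]

REQUIRED = {"Position", "Description", "Length", "Type"}

def find_header(rows):
    """Return the index of the first row containing all required labels."""
    for i, row in enumerate(rows):
        if REQUIRED <= {str(c) for c in row if c is not None}:
            return i
    raise ValueError("no field table found in sheet")

header = find_header(sheet)
fields = [row for row in sheet[header + 1:] if row[1] != "Blank"]
print(header, [f[1] for f in fields])  # 2 ['Country code', 'Period']
```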
Convert the workbook to versioned YAML specs under `src/microtrade/specs/`:

```shell
uv run microtrade import-spec examples/microdata-layout.xls \
  --effective-from 2020-01
```

The resulting YAML files are the runtime contract; review and commit them. Re-run with `--force` to replace an existing version. When a workbook changes, run again with a later `--effective-from`; the pipeline picks the appropriate spec per period automatically and prints a column-level diff against the previous version.
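Per-period resolution amounts to picking the latest spec whose effective-from is not after the target period. A hypothetical sketch (the dict shape is an assumption, not the package's actual spec model):

```python
# Each spec version carries an effective_from period ("YYYY-MM"); assumed shape.
specs = [
    {"effective_from": "2020-01", "version": 1},
    {"effective_from": "2023-07", "version": 2},
]

def resolve(specs, period):
    """Return the spec in effect for a given 'YYYY-MM' period."""
    eligible = [s for s in specs if s["effective_from"] <= period]
    if not eligible:
        raise LookupError(f"no spec effective for {period}")
    # "YYYY-MM" strings sort chronologically, so max() picks the latest match.
    return max(eligible, key=lambda s: s["effective_from"])

print(resolve(specs, "2022-04")["version"])  # 1
print(resolve(specs, "2024-01")["version"])  # 2
```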
### Ingest raw monthly zips

```shell
uv run microtrade ingest \
  --input /path/to/raw_zips \
  --output /path/to/datasets
```
Defaults: year-to-date of the current calendar year, all three trade types, zstd-compressed Parquet. Common flags:

| Flag | Default | Purpose |
|---|---|---|
| `--type imports` | all | Repeat for multiple; limits processing |
| `--year 2024` | unset | Process a single year (disables YTD logic) |
| `--month 4` | unset | Combine with `--year` for one-shot re-ingest |
| `--all` | off | Process every year present under `--input` |
| `--chunk-rows 250000` | 250000 | Rows per Parquet row group / memory batch |
| `--compression zstd` | zstd | Parquet compression codec |
| `--encoding utf-8` | utf-8 | Text encoding of the inner FWF |
Per-partition outcomes are logged as JSON lines under `<output>/_manifests/<trade_type>/<run_id>.jsonl`, and a one-line summary is printed at the end. The exit code is non-zero if any partition failed; other partitions in the same run still complete.
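A JSONL manifest of this shape can be written and summarized like so (a minimal sketch; the record fields and run id are assumptions, not the package's actual manifest schema):

```python
import json
import tempfile
from pathlib import Path

# Hypothetical per-partition outcome records for one run.
outcomes = [
    {"trade_type": "imports", "year": 2024, "month": 1, "status": "ok", "rows": 120000},
    {"trade_type": "imports", "year": 2024, "month": 2, "status": "error",
     "error": "bad record length"},
]

manifest_dir = Path(tempfile.mkdtemp()) / "_manifests" / "imports"
manifest_dir.mkdir(parents=True)
manifest = manifest_dir / "run-20240301.jsonl"

# One JSON object per line: append-friendly and greppable.
with manifest.open("w", encoding="utf-8") as fh:
    for rec in outcomes:
        fh.write(json.dumps(rec) + "\n")

# Reading it back: any failure maps to a non-zero exit code.
failed = []
for line in manifest.read_text(encoding="utf-8").splitlines():
    rec = json.loads(line)
    if rec["status"] != "ok":
        failed.append(rec)

print(len(failed))  # 1
```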
## Output layout

```
output/
  imports/
    year=2024/month=01/part-0.parquet
    year=2024/month=02/part-0.parquet
    ...
  exports_us/
    year=2024/month=01/part-0.parquet
    ...
  exports_nonus/
    ...
  _manifests/
    imports/<run_id>.jsonl
    exports_us/<run_id>.jsonl
    exports_nonus/<run_id>.jsonl
```
Partition columns (`year`, `month`) are encoded in the directory path only, not duplicated inside each Parquet file. Read with any Hive-aware scanner:

```python
import polars as pl

df = pl.scan_parquet("output/imports", hive_partitioning=True).collect()
```

Or with DuckDB:

```sql
SELECT * FROM read_parquet('output/imports/**/*.parquet', hive_partitioning=1);
```
## Architecture

```
discover.scan(input_dir)      -> list[RawInput(trade_type, year, month, path)]
schema.resolve(specs, period) -> Spec effective for that period
ingest.iter_record_batches    -> pyarrow.RecordBatch stream (bounded memory)
write.PartitionWriter         -> year=/month=/part-0.parquet.tmp, atomic rename
pipeline.run                  -> orchestrates the above + JSONL manifest
```
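The atomic-rename step is the standard write-to-temp-then-`os.replace` pattern (a generic illustration, not the `PartitionWriter` internals; the payload bytes are a placeholder):

```python
import os
import tempfile
from pathlib import Path

partition = Path(tempfile.mkdtemp()) / "year=2024" / "month=01"
partition.mkdir(parents=True)
final = partition / "part-0.parquet"
tmp = final.with_name(final.name + ".tmp")  # part-0.parquet.tmp

# Write the full file under a temporary name first...
tmp.write_bytes(b"PAR1...placeholder parquet bytes...PAR1")

# ...then atomically swap it into place. os.replace is atomic on POSIX
# (and on the same volume on Windows), so readers never observe a
# half-written part-0.parquet.
os.replace(tmp, final)

print(final.exists(), tmp.exists())  # True False
```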
Key invariants:

- Excel is the upstream source of truth; committed YAML under `src/microtrade/specs/` is the runtime contract. The pipeline never reads Excel at runtime.
- Each partition write is idempotent: re-running YTD cleanly replaces the current year's partitions, leaving prior years untouched.
- The zip is decompressed on the fly via `zipfile.ZipFile.open()`; the raw FWF is never extracted to disk and never fully materialized in memory.
- Per-partition failures are recorded in the manifest but do not abort the run; one bad month will not block the rest.
## Development

```shell
uv run pytest                                    # full suite with coverage
uv run pytest tests/test_pipeline.py::test_name  # single test
uv run ruff format                               # auto-format
uv run ruff check                                # lint
uv run mypy src                                  # strict type check
uv run pre-commit run --all-files                # all pre-commit hooks
```
Tests build synthetic Excel workbooks, YAML specs, and FWF zips on the fly in `tests/_helpers.py` rather than checking in binary fixtures, so the exercised code paths match the real production workflow end-to-end.
## Status

The pipeline is feature-complete: scaffolding, Excel → YAML import, discover + ingest + write, and the orchestrated CLI subcommands (`ingest`, `import-spec`, `inspect`, `validate-specs`) are all landed and covered. Reference YAML specs generated from `examples/microdata-layout.xls` ship under `src/microtrade/specs/`; replace them by running `microtrade import-spec` against the real schema workbook (typically with a later `--effective-from`, which preserves the historical layouts). Run `microtrade validate-specs` after importing to catch dtype conflicts between versions.
## License

MIT (see LICENSE).