
Project description

microtrade

Turn monthly drops of zipped fixed-width (FWF) trade microdata into Hive-partitioned Parquet datasets, one per trade type.

microtrade streams each raw <trade_type>_<YYYYMM>.zip directly from its zip archive (no extraction, bounded memory), slices columns according to a versioned YAML spec, and writes year=YYYY/month=MM/part-0.parquet atomically under a per-type dataset root. Monthly runs reprocess all months YTD of the current year; prior years are frozen.
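The stream-from-zip approach needs nothing beyond the standard library; a minimal sketch (the field offsets and helper name here are hypothetical — the real column positions come from the versioned YAML spec):

```python
import io
import zipfile

# Hypothetical (start, end) byte offsets for three fields; the actual
# pipeline derives these from the versioned YAML spec per period.
FIELDS = {"hs_code": (0, 6), "country": (6, 9), "value": (9, 18)}

def iter_fwf_records(zip_path: str, member: str, encoding: str = "utf-8"):
    """Stream records line by line from inside the zip -- no extraction,
    no full materialization in memory."""
    with zipfile.ZipFile(zip_path) as zf, zf.open(member) as raw:
        for line in io.TextIOWrapper(raw, encoding=encoding):
            yield {name: line[a:b].strip() for name, (a, b) in FIELDS.items()}
```

Because `ZipFile.open()` returns a file-like object, decompression happens lazily as each line is consumed, which keeps memory bounded regardless of archive size.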

Three trade types are supported, each with its own distinct schema:

  • imports
  • exports_us
  • exports_nonus

Requirements

  • Python 3.12+
  • uv for environment and dependency management

Install

uv sync

This resolves and installs runtime + dev dependencies into .venv/ based on pyproject.toml and uv.lock.

Usage

Import the schema workbook (once per schema version)

The authoritative schema lives in an Excel workbook. Sheets are mapped positionally: the first sheet becomes imports, the second exports_us, the third exports_nonus (sheet names are ignored). Each sheet's field table is autodetected by looking for a row containing Position, Description, Length, and Type; rows with Description = Blank are FWF padding and are skipped. See examples/microdata-layout.xls for a reference workbook with the expected shape.
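The autodetection logic can be sketched roughly as follows, operating on a sheet already read into rows of cells (the helper name and the exact termination rule are assumptions, not the package's actual internals):

```python
HEADER = {"Position", "Description", "Length", "Type"}

def extract_fields(rows: list[list[object]]) -> list[dict]:
    """Find the header row, then collect field rows, skipping FWF padding."""
    header_idx = next(
        i for i, row in enumerate(rows)
        if HEADER <= {str(c).strip() for c in row if c is not None}
    )
    cols = [str(c).strip() for c in rows[header_idx]]
    fields = []
    for row in rows[header_idx + 1:]:
        rec = dict(zip(cols, row))
        if not rec.get("Position"):            # end of the field table
            break
        if str(rec.get("Description", "")).strip() == "Blank":
            continue                           # FWF padding -> skipped
        fields.append(rec)
    return fields
```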

Convert the workbook to versioned YAML specs under src/microtrade/specs/:

uv run microtrade import-spec examples/microdata-layout.xls \
    --effective-from 2020-01

The resulting YAML files are the runtime contract — review and commit them. Re-run with --force to replace an existing version. When a workbook changes, run again with a later --effective-from; the pipeline picks the appropriate spec per period automatically, and a column-level diff against the previous version is printed.
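Per-period spec resolution amounts to picking the latest version whose effective-from is not after the period; a sketch under an assumed data shape (specs keyed by their `YYYY-MM` effective-from string):

```python
def resolve_spec(specs: dict[str, dict], period: str) -> dict:
    """Pick the latest spec effective for `period` (both 'YYYY-MM' strings).

    'YYYY-MM' strings sort lexicographically in chronological order,
    so plain string comparison is sufficient.
    """
    eligible = [v for v in sorted(specs) if v <= period]
    if not eligible:
        raise LookupError(f"no spec effective for {period}")
    return specs[eligible[-1]]
```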

Ingest raw monthly zips

uv run microtrade ingest \
    --input  /path/to/raw_zips   \
    --output /path/to/datasets

Defaults: year-to-date of the current calendar year, all three trade types, zstd-compressed Parquet. Common flags:

Flag                 Default  Purpose
--type imports       all      Repeat for multiple; limits processing
--year 2024          unset    Process a single year (disables YTD logic)
--month 4            unset    Combine with --year for one-shot re-ingest
--all                off      Process every year present under --input
--chunk-rows 250000  250000   Rows per Parquet row group / memory batch
--compression zstd   zstd     Parquet compression codec
--encoding utf-8     utf-8    Text encoding of the inner FWF

Per-partition outcomes are logged as JSON lines under <output>/_manifests/<trade_type>/<run_id>.jsonl, and a one-line summary is printed at the end. The exit code is non-zero if any partition failed; other partitions in the same run still complete.
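Since the manifest is plain JSON lines, it can be consumed with the standard library alone; a sketch of the summary/exit-code logic (the record fields shown are assumptions, not the package's documented manifest schema):

```python
import json

def summarize_manifest(lines: list[str]) -> tuple[str, int]:
    """One-line summary plus exit code: non-zero if any partition failed."""
    records = [json.loads(line) for line in lines if line.strip()]
    failed = [r for r in records if r.get("status") != "ok"]
    summary = f"{len(records) - len(failed)} ok, {len(failed)} failed"
    return summary, 1 if failed else 0
```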

Output layout

output/
  imports/
    year=2024/month=01/part-0.parquet
    year=2024/month=02/part-0.parquet
    ...
  exports_us/
    year=2024/month=01/part-0.parquet
    ...
  exports_nonus/
    ...
  _manifests/
    imports/<run_id>.jsonl
    exports_us/<run_id>.jsonl
    exports_nonus/<run_id>.jsonl

Partition columns (year, month) are encoded in the directory path only, not duplicated inside each Parquet file. Read with any Hive-aware scanner:

import polars as pl

df = pl.scan_parquet("output/imports", hive_partitioning=True).collect()

Or with DuckDB:

SELECT * FROM read_parquet('output/imports/**/*.parquet', hive_partitioning=1);

Architecture

discover.scan(input_dir)       -> list[RawInput(trade_type, year, month, path)]
schema.resolve(specs, period)  -> Spec effective for that period
ingest.iter_record_batches     -> pyarrow.RecordBatch stream (bounded memory)
write.PartitionWriter          -> year=/month=/part-0.parquet.tmp, atomic rename
pipeline.run                   -> orchestrates the above + JSONL manifest

Key invariants:

  • Excel is the upstream source of truth; committed YAML under src/microtrade/specs/ is the runtime contract. The pipeline never reads Excel at runtime.
  • Each partition write is idempotent: re-running YTD cleanly replaces the current year's partitions, leaving prior years untouched.
  • The zip is decompressed on the fly via zipfile.ZipFile.open(); the raw FWF is never extracted to disk and never fully materialized in memory.
  • Per-partition failures are recorded in the manifest but do not abort the run; one bad month will not block the rest.
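The atomic write behind the idempotent replaces follows the standard temp-file-then-rename pattern; a minimal sketch (bytes instead of a real Parquet writer, for brevity):

```python
import os

def atomic_write(path: str, data: bytes) -> None:
    """Write to a .tmp sibling, then rename over the target.

    Readers never observe a half-written part-0.parquet: os.replace
    either leaves the old file in place or installs the complete new one.
    """
    os.makedirs(os.path.dirname(path), exist_ok=True)
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())           # durable before the rename
    os.replace(tmp, path)              # atomic on POSIX filesystems
```

Re-running a period simply replaces `part-0.parquet` in place, which is what makes the YTD reprocessing safe to repeat.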

Development

uv run pytest                                     # full suite with coverage
uv run pytest tests/test_pipeline.py::test_name   # single test
uv run ruff format                                # auto-format
uv run ruff check                                 # lint
uv run mypy src                                   # strict type check
uv run pre-commit run --all-files                 # all pre-commit hooks

Tests build synthetic Excel workbooks, YAML specs, and FWF zips on the fly in tests/_helpers.py rather than checking in binary fixtures, so the exercised code paths match the real production workflow end-to-end.
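Building such a fixture on the fly takes only a few lines of standard library; a sketch (the member name, column widths, and two-field layout are illustrative, not the real schema):

```python
import io
import zipfile

def make_fwf_zip(rows: list[tuple[str, str]],
                 widths: tuple[int, int] = (6, 9)) -> bytes:
    """Pack (code, value) pairs into a fixed-width member of an in-memory zip."""
    body = "".join(
        code.ljust(widths[0]) + value.rjust(widths[1], "0") + "\n"
        for code, value in rows
    )
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("imports_202401.txt", body)
    return buf.getvalue()
```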

Status

The pipeline is feature-complete: scaffolding, Excel → YAML import, discover + ingest + write, and the orchestrated CLI subcommands (ingest, import-spec, inspect, validate-specs) are all landed and covered. Reference YAML specs generated from examples/microdata-layout.xls ship under src/microtrade/specs/; replace them by running microtrade import-spec against the real schema workbook (typically with a later --effective-from, which preserves the historical layouts). Run microtrade validate-specs after importing to catch dtype conflicts between versions.

License

MIT (see LICENSE).

Project details


Download files

Download the file for your platform.

Source Distribution

microtrade_fwf-0.1.0.tar.gz (98.4 kB view details)

Uploaded Source

Built Distribution


microtrade_fwf-0.1.0-py3-none-any.whl (30.8 kB view details)

Uploaded Python 3

File details

Details for the file microtrade_fwf-0.1.0.tar.gz.

File metadata

  • Download URL: microtrade_fwf-0.1.0.tar.gz
  • Upload date:
  • Size: 98.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for microtrade_fwf-0.1.0.tar.gz
Algorithm Hash digest
SHA256 d99566f8fe21c2c799a5f175865f85b87d0cf1b68b77cab633fe028991a52dd3
MD5 444fdbdf3984d141b3587d93e7f850e8
BLAKE2b-256 63b21da6fb45b85e72a97b78d6ca5ad0c1c6e4a1c4ff681412617d7c2fe64d3c


Provenance

The following attestation bundles were made for microtrade_fwf-0.1.0.tar.gz:

Publisher: publish.yml on twedl/microtrade

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file microtrade_fwf-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: microtrade_fwf-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 30.8 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for microtrade_fwf-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 a7f7fc848824b68596547cb4d5ae6db20b17672da49025fe13f8201d17af6c6b
MD5 227f5a18c9ca61114d9a782399904dc0
BLAKE2b-256 0386ea6a56e1d70a358e812f1778f080cd914276142d6e865730e39b39efdd13


Provenance

The following attestation bundles were made for microtrade_fwf-0.1.0-py3-none-any.whl:

Publisher: publish.yml on twedl/microtrade

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
