SDMX protocol connector (ECB, Eurostat, IMF, OECD, BIS, World Bank, ILO) for the parsimony framework

parsimony-sdmx

Flat SDMX catalog builder. Harvests dataflows from statistical agencies (ECB, Eurostat, IMF, World Bank) and writes two kinds of parquet file per agency: one dataset-level catalog, plus one series-level catalog per dataset. The output is ready to feed a FAISS or SQL index.

Layout

outputs/{AGENCY}/datasets.parquet        # columns: dataset_id, agency_id, title
outputs/{AGENCY}/series/{DATASET}.parquet # columns: id, dataset_id, title

Titles are composed as "CODE1: label1 - CODE2: label2 - …" over the DSD's non-TIME_PERIOD dimensions in DSD order. ECB series additionally append TITLE / TITLE_COMPL fetched from the per-series XML endpoint. HTML-embedded descriptions (common on Eurostat) are stripped before they reach parquet.
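
The title rule above can be sketched as follows. This is a minimal illustration, not the package's actual code: compose_title, strip_html and the dict-shaped inputs are hypothetical names, and it assumes CODE is the series' code for a dimension and label is that code's codelist label. The HTML stripping is a simple regex pass, which may differ from what the package does.

```python
import re
from html import unescape

_TAG_RE = re.compile(r"<[^>]+>")


def strip_html(text: str) -> str:
    """Drop tags, decode entities, and collapse runs of whitespace."""
    no_tags = _TAG_RE.sub(" ", text)
    return " ".join(unescape(no_tags).split())


def compose_title(
    dimension_order: list[str],
    series_key: dict[str, str],
    codelists: dict[str, dict[str, str]],
) -> str:
    """Build 'CODE1: label1 - CODE2: label2' in DSD dimension order.

    Hypothetical helper: TIME_PERIOD is skipped, and a code with no
    codelist entry falls back to the code itself as its label.
    """
    parts = []
    for dim in dimension_order:
        if dim == "TIME_PERIOD":
            continue
        code = series_key[dim]
        label = strip_html(codelists.get(dim, {}).get(code, code))
        parts.append(f"{code}: {label}")
    return " - ".join(parts)


dims = ["FREQ", "REF_AREA", "TIME_PERIOD"]
key = {"FREQ": "B", "REF_AREA": "U2"}
labels = {
    "FREQ": {"B": "Business"},
    "REF_AREA": {"U2": "<p>Euro <b>area</b></p>"},
}
print(compose_title(dims, key, labels))
```

Keeping the function free of I/O matches the description of core/ as pure: the caller resolves codelists first and passes plain dicts in.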

Install

Requires Python ≥ 3.12 and uv.

make install              # uv sync --all-extras

Usage

# Print every dataset the agency exposes
parsimony-sdmx -a ESTAT --list-datasets

# Write only outputs/{AGENCY}/datasets.parquet (no series fetched)
parsimony-sdmx -a ECB --catalog

# Fetch one dataset
parsimony-sdmx -a ECB -d YC

# Fetch every dataset the agency exposes
parsimony-sdmx -a ESTAT --all

# Preview what an --all run would write, without fetching
parsimony-sdmx -a ESTAT --all --dry-run

# Rebuild datasets whose parquet already exists (resume contract: file present → skip)
parsimony-sdmx -a ECB -d YC --force

Or via Make shortcuts:

make catalog AGENCY=ECB            # datasets.parquet only
make fetch AGENCY=ECB DATASET=YC   # single dataset
make fetch-all AGENCY=ESTAT        # every dataset for the agency
make list AGENCY=IMF_DATA          # enumerate datasets

Supported agencies: ECB, ESTAT, IMF_DATA, WB_WDI. Exit codes: 0 if every dataset finished as ok or empty, 1 if at least one dataset failed.
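
The exit-code contract can be sketched as a small reduction over per-dataset outcomes. The Status enum and exit_code function are hypothetical names used only for illustration. The ok / empty / failed vocabulary is the only part taken from the text above.

```python
from enum import Enum


class Status(Enum):
    OK = "ok"
    EMPTY = "empty"
    FAILED = "failed"


def exit_code(outcomes: dict[str, Status]) -> int:
    """Return 1 if any dataset failed, else 0 (ok and empty both count as success)."""
    return 1 if any(s is Status.FAILED for s in outcomes.values()) else 0


run = {"YC": Status.OK, "EXR": Status.EMPTY}
print(exit_code(run))
```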

Architecture

Four packages under parsimony_sdmx/:

  • core/ — pure, I/O-free: record dataclasses, title composition, codelist resolution, outcome types, domain exceptions.
  • io/ — boundary effects: atomic parquet writers, hardened lxml iterparse, HTTPS-only bounded HTTP session, path safety helpers, exception classification.
  • providers/ — per-agency adapters behind a narrow CatalogProvider protocol; ECB/ESTAT/IMF share a common sdmx1 flow helper, WB diverges with a path × decade sweep because its SDMX endpoint doesn't expose series_keys.
  • cli/ — argparse front-end, orchestrator that forks one subprocess per dataset (mp.spawn) for memory isolation, a psutil-backed memory monitor that kills the largest child above a threshold and writes OOM markers, atomic .tmp/ cleanup, and an operator-readable end-of-run summary.

Why subprocess-per-dataset

sdmx1 caches structure messages at module level with no public invalidation hook. Running every dataset in its own subprocess is the only reliable way to start each fetch with a clean cache. Large catalogs (8 k+ Eurostat dataflows) have hit real OOMs in production; the memory monitor catches runaway workers before the kernel OOM killer does, preserving a classifiable failure instead of an opaque exit 137.

Resume

Filesystem-backed: a dataset is skipped if outputs/{AGENCY}/series/{DATASET}.parquet already exists. Writes land in .tmp/ first and are os.replace-d atomically, so the canonical path exists iff the previous run completed. --force overrides (and logs a count of overwrites).
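
The resume contract can be sketched with the standard library alone. should_skip and atomic_write_bytes are hypothetical names, and the payload is raw bytes rather than a real parquet table to keep the example dependency-free. Note that os.replace is only atomic when the temporary and final paths sit on the same filesystem, which is presumably why .tmp/ lives under the output tree.

```python
import os
import tempfile
from pathlib import Path


def should_skip(final_path: Path, force: bool = False) -> bool:
    """Resume rule: an existing canonical file means the dataset is done."""
    return final_path.exists() and not force


def atomic_write_bytes(final_path: Path, payload: bytes, tmp_dir: Path) -> None:
    """Write to tmp_dir first, then move into place with os.replace."""
    tmp_dir.mkdir(parents=True, exist_ok=True)
    final_path.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = tmp_dir / (final_path.name + ".partial")
    with open(tmp_path, "wb") as fh:
        fh.write(payload)
        fh.flush()
        os.fsync(fh.fileno())  # make sure the bytes are on disk before the rename
    os.replace(tmp_path, final_path)  # readers see the old state or the new one


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    target = root / "outputs" / "ECB" / "series" / "YC.parquet"
    if not should_skip(target):
        atomic_write_bytes(target, b"placeholder", root / ".tmp")
    print(should_skip(target), should_skip(target, force=True))
```

A crash before os.replace leaves only a stray file in .tmp/, so the next run sees no canonical file and fetches the dataset again.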

Development

make check         # ruff + mypy strict + fast tests
make test          # fast tests only (skip subprocess-path tests)
make test-slow     # subprocess tests (fork real mp.Process children)
make test-all      # everything
make format        # ruff format + --fix
make clean         # wipe caches + build artifacts

Hardening enforced by default:

  • mypy --strict across source and tests
  • ruff with E, F, W, I, B, UP, S (security rules on)
  • Hardened lxml.iterparse (no entity resolution, no DTD load, no network)
  • HTTPS-only bounded_get with a configurable byte cap
  • Path traversal guards on every on-disk write
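
As an illustration of the last item, a path traversal guard can be as small as the sketch below. safe_join is a hypothetical name and not necessarily how the package's path safety helpers are written. It resolves the candidate path and rejects anything that lands outside the output root.

```python
import tempfile
from pathlib import Path


def safe_join(root: Path, *parts: str) -> Path:
    """Join parts under root, refusing results that escape root."""
    root_resolved = root.resolve()
    candidate = root_resolved.joinpath(*parts).resolve()
    if not candidate.is_relative_to(root_resolved):
        raise ValueError(f"path escapes output root: {candidate}")
    return candidate


with tempfile.TemporaryDirectory() as d:
    root = Path(d)
    print(safe_join(root, "ECB", "series", "YC.parquet").name)
    try:
        # A dataset id such as "../../etc/passwd" must never reach the writer.
        safe_join(root, "ECB", "series", "../../../escape.parquet")
    except ValueError as exc:
        print("rejected:", type(exc).__name__)
```

Resolving before the check matters: comparing unresolved strings would let ".." segments or symlinks slip through.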

Project layout

parsimony_sdmx/
├── core/      # pure domain logic
├── io/        # boundary-layer effects
├── providers/ # per-agency adapters
└── cli/       # argparse → orchestrator → worker → parquet

tests/        # flat: test_<module>.py per source module

Test suite: 312 tests. 4 of them (@pytest.mark.slow) fork real subprocesses to exercise the orchestrator's timeout / OOM classification / clean-exit paths; the other 308 run in < 2 s.
