SDMX protocol connector (ECB, Eurostat, IMF, OECD, BIS, World Bank, ILO) for the parsimony framework

parsimony-sdmx

SDMX connector plugin for parsimony. Harvests dataflow listings and per-dataset series keys from statistical agencies (ECB, Eurostat, IMF, World Bank), composes human-readable titles from the DSD + codelists, and publishes one parquet + FAISS bundle per catalog via parsimony publish.

Part of the parsimony-connectors monorepo. Distributed standalone on PyPI as parsimony-sdmx.

No separate builder CLI, no intermediate on-disk cache — every call hits the live agency endpoint inside a spawned subprocess.

Supported agencies

| Agency ID | Source |
|---|---|
| ECB | European Central Bank SDMX 2.1 |
| ESTAT | Eurostat SDMX 2.1 |
| IMF_DATA | IMF SDMX 3 (sdmx.imf.org) |
| WB_WDI | World Bank SDMX 2.1 (custom path × decade sweep) |

Connectors

| Name | Kind | Description |
|---|---|---|
| enumerate_sdmx_datasets | enumerator | One row per dataflow across every supported agency. Drives the sdmx_datasets catalog. |
| enumerate_sdmx_series | enumerator | One row per series key for a single (agency, dataset_id). Drives one sdmx_series_<agency>_<dataset_id> catalog per dataset. |
| sdmx_fetch | connector | Live observation fetch for a series key against the agency endpoint. |

Install

pip install parsimony-sdmx

Pulls in parsimony-core>=0.4,<0.5 automatically. For local publishing you also want the standard extra on parsimony-core, which adds FAISS, BM25, and sentence-transformers (the default embedder stack):

pip install "parsimony-core[standard]"

Verify discovery:

python -c "from parsimony import discover; print([p.name for p in discover.iter_providers()])"

Quick start

import asyncio
from parsimony_sdmx import CONNECTORS

async def main():
    connectors = CONNECTORS.bind_env()
    result = await connectors["sdmx_fetch"](
        agency="ECB",
        dataset_id="YC",
        key="B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y",
    )
    print(result.data.head())

asyncio.run(main())

For multi-plugin composition:

from parsimony import discover
connectors = discover.load_all().bind_env()

Catalog publishing

This plugin's namespaces are dynamic — one per (agency, dataset_id) pair discovered at publish time, plus one static cross-agency catalog:

  • sdmx_datasets — one cross-agency catalog of every dataflow.
  • sdmx_series_<agency>_<dataset_id> — one per-dataset catalog of series keys, e.g. sdmx_series_ecb_yc.

The plugin exports CATALOGS as an async generator function. It yields the static sdmx_datasets namespace first, then one sdmx_series_<agency>_<dataset_id> namespace per dataflow returned by the live agency listing. RESOLVE_CATALOG(namespace) provides the cheap reverse lookup used by --only: it parses a namespace string back into (agency, dataset_id) without enumerating the full listing.
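The reverse lookup can be pictured as plain string parsing. The sketch below is illustrative only: `resolve_namespace` and `KNOWN_AGENCIES` are hypothetical names, it returns a tuple where the real RESOLVE_CATALOG returns a callable or None, and how the dataset token is mapped back to its original case is an assumption left open here. Agency IDs can themselves contain underscores (IMF_DATA, WB_WDI), so the sketch matches against the known agency list instead of splitting on the first underscore.

```python
from typing import Optional, Tuple

# Agency IDs from the "Supported agencies" table.
KNOWN_AGENCIES = ("ECB", "ESTAT", "IMF_DATA", "WB_WDI")
SERIES_PREFIX = "sdmx_series_"


def resolve_namespace(namespace: str) -> Optional[Tuple[str, str]]:
    """Hypothetical parser: namespace -> (agency, dataset token), or None."""
    if not namespace.startswith(SERIES_PREFIX):
        return None  # e.g. the static "sdmx_datasets" catalog
    rest = namespace[len(SERIES_PREFIX):]
    # Try longer agency IDs first so a short ID cannot shadow a longer one.
    for agency in sorted(KNOWN_AGENCIES, key=len, reverse=True):
        prefix = agency.lower() + "_"
        if rest.startswith(prefix) and len(rest) > len(prefix):
            # The dataset token is returned exactly as it appears in the namespace.
            return agency, rest[len(prefix):]
    return None


print(resolve_namespace("sdmx_series_ecb_yc"))
print(resolve_namespace("sdmx_series_imf_data_cpi"))  # "cpi" is a made-up dataset token
print(resolve_namespace("sdmx_datasets"))
```

No network call is involved, which is what makes the --only path cheap compared with walking every agency listing.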

Publish a single catalog

Publish by name with --only (pure string lookup, no listing walk — RESOLVE_CATALOG fast-path):

parsimony publish \
  --provider sdmx \
  --target "file:///tmp/parsimony-smoke/{namespace}" \
  --only sdmx_series_ecb_yc

The {namespace} placeholder is substituted before the push. The target scheme is what decides where the bundle lands:

| Scheme | Destination | Extra required |
|---|---|---|
| file://<path> | Local filesystem | (none) |
| hf://<repo> | Hugging Face dataset repo | standard |
| s3://<bucket> | S3 bucket | s3 |
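As a rough illustration of the placeholder substitution and scheme dispatch described above, the following sketch expands a target template and looks up which extra the scheme would need. `expand_target` and `EXTRA_FOR_SCHEME` are hypothetical names and `example-bucket` is a made-up bucket; the actual logic inside parsimony publish may differ.

```python
from typing import Optional, Tuple
from urllib.parse import urlsplit

# Scheme -> extra, mirroring the table above (None means no extra is needed).
EXTRA_FOR_SCHEME = {"file": None, "hf": "standard", "s3": "s3"}


def expand_target(template: str, namespace: str) -> Tuple[str, str, Optional[str]]:
    """Substitute {namespace}, then split the URL into (scheme, location, extra)."""
    url = template.replace("{namespace}", namespace)
    parts = urlsplit(url)
    if parts.scheme not in EXTRA_FOR_SCHEME:
        raise ValueError(f"unsupported target scheme: {parts.scheme!r}")
    location = parts.netloc + parts.path
    return parts.scheme, location, EXTRA_FOR_SCHEME[parts.scheme]


print(expand_target("file:///tmp/parsimony-smoke/{namespace}", "sdmx_series_ecb_yc"))
print(expand_target("s3://example-bucket/{namespace}", "sdmx_datasets"))
```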

A local publish produces:

/tmp/parsimony-smoke/sdmx_series_ecb_yc/
├── entries.parquet   # rows: (namespace, code, title, description, tags, metadata, embedding)
├── embeddings.faiss  # FAISS index aligned with entries.parquet
└── meta.json         # catalog metadata + embedder fingerprint

Publish everything

Omit --only to drive the async generator through every agency listing, publishing one bundle per discovered (agency, dataset_id) pair plus the static sdmx_datasets catalog. Expect a long run — ESTAT alone has 8 k+ dataflows:

parsimony publish \
  --provider sdmx \
  --target "file:///tmp/parsimony-smoke/{namespace}"

An agency that fails listing is skipped with a warning; the run continues for the others.
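The skip-and-continue behaviour can be sketched with an async generator that wraps each agency listing in its own try/except. Everything here is a stand-in under stated assumptions: `iter_namespaces` and the two fake listers are hypothetical, no live endpoint is called, and lower-casing the namespace parts is inferred from the sdmx_series_ecb_yc example.

```python
import asyncio
import logging
from typing import AsyncIterator, Awaitable, Callable, Dict, List

logger = logging.getLogger("sdmx_catalogs_sketch")
Lister = Callable[[], Awaitable[List[str]]]


async def iter_namespaces(listers: Dict[str, Lister]) -> AsyncIterator[str]:
    yield "sdmx_datasets"  # the static cross-agency catalog comes first
    for agency, list_dataflows in listers.items():
        try:
            dataset_ids = await list_dataflows()
        except Exception as exc:  # listing failed: warn and move on
            logger.warning("skipping %s: listing failed (%s)", agency, exc)
            continue
        for dataset_id in dataset_ids:
            yield f"sdmx_series_{agency.lower()}_{dataset_id.lower()}"


async def fake_ecb() -> List[str]:
    return ["YC", "EXR"]  # stand-in listing, not a live call


async def fake_estat() -> List[str]:
    raise ConnectionError("simulated outage")


async def collect() -> List[str]:
    listers = {"ECB": fake_ecb, "ESTAT": fake_estat}
    return [ns async for ns in iter_namespaces(listers)]


namespaces = asyncio.run(collect())
print(namespaces)
```

Because the failure is caught per agency, one unreachable endpoint costs only that agency's catalogs for the run.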

Overnight chain (ESTAT → IMF_DATA → WB_WDI)

The three-agency long tail (~10 k flows total, several pinning ~5 GB of heap) needs process recycling: CPython often does not return freed heap to the OS, so a single long-lived Python process eventually OOMs the host. scripts/publish_overnight.sh wraps scripts/publish_agency.py in a per-batch restart loop:

cd /home/espinet/ockham/parsimony-connectors                  # one-time
uv sync --all-packages --extra publish

cd packages/sdmx                                              # then, every run
mkdir -p logs
nohup bash scripts/publish_overnight.sh > logs/overnight.log 2>&1 &
echo $! > logs/overnight.pid

The wrapper recycles the publisher every PARSIMONY_PUBLISH_BATCH_SIZE flows (default 15) and passes --resume so that namespaces with a meta.json already on disk are skipped, which makes the run safe to interrupt and restart. Output stages to ~/.cache/parsimony/catalogs/sdmx/<namespace>/ (set PARSIMONY_CACHE_DIR to redirect). Per-agency stdout lands in logs/publish_<agency>.log; the wrapper banner goes to logs/overnight.log.
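The resume-and-batch bookkeeping amounts to a filter plus a chunker. The sketch below is an approximation, not the contents of scripts/publish_agency.py: `pending_namespaces` and `batches` are hypothetical helpers, and a temporary directory stands in for the real cache root.

```python
import os
import tempfile
from pathlib import Path
from typing import Iterator, List


def pending_namespaces(root: Path, namespaces: List[str]) -> List[str]:
    """Keep only namespaces with no meta.json on disk yet (the --resume rule)."""
    return [ns for ns in namespaces if not (root / ns / "meta.json").exists()]


def batches(items: List[str], size: int) -> Iterator[List[str]]:
    """Split the work list into chunks; each chunk would get its own process."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


batch_size = int(os.environ.get("PARSIMONY_PUBLISH_BATCH_SIZE", "15"))

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    finished = root / "sdmx_series_ecb_yc"
    finished.mkdir()
    (finished / "meta.json").write_text("{}")  # marks this namespace as published
    todo = pending_namespaces(
        root, ["sdmx_series_ecb_yc", "sdmx_series_ecb_exr", "sdmx_datasets"]
    )
    plan = list(batches(todo, batch_size))

print(plan)
```

Keying the skip on meta.json means an interrupted namespace without that file is simply redone on the next run.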

Monitor and ship:

tail -f logs/overnight.log                          # batch-level events
uv run parsimony cache info                         # catalogs subtree size
ls -1 ~/.cache/parsimony/catalogs/sdmx | wc -l      # namespace count

# When done — push to HF (per-provider dir is the namespaced root):
hf upload ockham/sdmx ~/.cache/parsimony/catalogs/sdmx/

Search a published bundle

import asyncio
from parsimony.catalog import Catalog

async def main():
    cat = await Catalog.from_url("file:///tmp/parsimony-smoke/sdmx_series_ecb_yc")
    for hit in await cat.search("10 year yield", 3):
        print(f"{hit.similarity:.3f}  {hit.code}  {hit.title[:80]}")

asyncio.run(main())

The same Catalog.from_url(...) call works against hf://, s3://, and file:// URLs. At query time the FAISS (dense) and BM25 (lexical) rankings are combined via reciprocal rank fusion (RRF).
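For readers unfamiliar with RRF, here is a minimal sketch of the fusion step. It is a generic illustration, not parsimony's implementation: `rrf_fuse` is a hypothetical helper, the codes are placeholders, and k=60 is the constant commonly used in the literature, not a value confirmed for this library.

```python
from collections import defaultdict
from typing import Dict, List


def rrf_fuse(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Score each code by the sum of 1 / (k + rank) over all rankings."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=lambda code: scores[code], reverse=True)


dense = ["a", "b", "c"]    # e.g. the order returned by the FAISS index
lexical = ["b", "c", "a"]  # e.g. the order returned by BM25
print(rrf_fuse([dense, lexical]))
```

RRF uses only ranks, so the cosine similarities and BM25 scores never need to be put on a common scale.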

Plugin contract

The package implements the standard parsimony plugin contract, exported at the top level of parsimony_sdmx:

| Export | Role |
|---|---|
| CONNECTORS | Connectors collection: two enumerators + the sdmx_fetch connector. |
| CATALOGS | Async generator function: yields every catalog this plugin can publish. |
| RESOLVE_CATALOG | namespace -> Callable \| None: cheap reverse lookup for --only. |

SDMX endpoints are public; no environment variables are required.

Architecture

parsimony_sdmx/
├── core/         pure domain logic: record dataclasses, title composition,
│                 codelist resolution, outcome types, domain exceptions
├── io/           boundary effects: atomic parquet writers, hardened lxml
│                 iterparse, HTTPS-only bounded session, path safety
├── providers/    per-agency adapters behind a narrow `CatalogProvider`
│                 protocol; ECB/ESTAT/IMF share a common sdmx1 flow helper,
│                 WB diverges with a path × decade sweep
├── connectors/   parsimony `@enumerator` surface + ``sdmx_fetch`` live
│                 observation connector
└── _isolation/   subprocess-spawning boundary for every sdmx1 call

Title composition

Each series row's title is built per DSD:

  • ECB: uses the TITLE / TITLE_COMPL natural-language attributes fetched via the portal side-channel, producing titles like "All euro area yield curve - 10-year spot rate". These are short, semantic, and directly embedder-friendly.
  • ESTAT / IMF_DATA / WB_WDI: no natural-language attributes are exposed, so titles fall back to compose_series_title(), which concatenates "DIM: label - DIM: label - …" across every dimension in DSD order. The result is longer (80-150 tokens) but still searchable.

The codelist-composed form used to be prefixed onto ECB titles as well ("base | TITLE - TITLE_COMPL"), but it duplicated content the natural language already expresses and inflated embedding cost quadratically (BERT attention is O(N²)). It now serves only as a fallback when TITLE_COMPL is absent. The raw SDMX series key is always available in the code column, so keyword-exact queries are unaffected.
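The two title paths can be sketched as follows. This is an illustration of the rule described above, not the package's code: `compose_title_sketch` and `series_title_sketch` are hypothetical names (the real compose_series_title() signature is not shown in this document), the dimension labels are made up, and treating a missing TITLE the same as a missing TITLE_COMPL is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def compose_title_sketch(dims: List[Tuple[str, str]]) -> str:
    """Join "DIM: label" pairs in the order given (DSD order is assumed)."""
    return " - ".join(f"{dim}: {label}" for dim, label in dims)


def series_title_sketch(
    attrs: Dict[str, Optional[str]], dims: List[Tuple[str, str]]
) -> str:
    """Prefer natural-language attributes; fall back to the composed form."""
    title, compl = attrs.get("TITLE"), attrs.get("TITLE_COMPL")
    if title and compl:
        return f"{title} - {compl}"
    return compose_title_sketch(dims)


dims = [("FREQ", "Daily"), ("REF_AREA", "Euro area")]  # illustrative labels
print(series_title_sketch({"TITLE": "Yield curve", "TITLE_COMPL": "10-year spot rate"}, dims))
print(series_title_sketch({}, dims))
```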

Why subprocess isolation

sdmx1 caches parsed structure messages (DSDs, codelists, dataflows) at module scope with no public invalidation hook. A long-lived Python process that imports it accumulates cache monotonically until OOM. Process death is the only working way to flush that cache.

Every sdmx1-touching call (list_datasets for listings, fetch_series for per-dataset sweeps) runs inside a freshly spawned process that is discarded after the call — never pooled. A ProcessPoolExecutor would retain sdmx1 in each worker across tasks and defeat the invariant.
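The "one fresh process per call, never pooled" invariant can be illustrated with the standard library alone. Note the substitution: the package uses mp.Process children (see Development), while this sketch uses subprocess.run with a throwaway interpreter because it is simpler to run standalone. `call_in_fresh_process` and the child payload are hypothetical.

```python
import json
import os
import subprocess
import sys

# Child program: stands in for an sdmx1-touching call. It reports its own PID
# so the parent can confirm the work ran outside the parent process.
CHILD_CODE = "import json, os; print(json.dumps({'pid': os.getpid(), 'rows': 3}))"


def call_in_fresh_process() -> dict:
    """Run the child in a brand-new interpreter that exits after one call."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return json.loads(done.stdout)


result = call_in_fresh_process()
print(result["rows"], result["pid"] != os.getpid())
```

Any module-level cache built up in the child disappears when the child exits, which is exactly what a reused pool worker would not give you.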

The two entry points in _isolation handle payload size differently:

  • list_datasets returns up to ~8 k dataflow tuples through an mp.Queue that the parent drains before proc.join() — the feeder thread blocks on the OS pipe buffer once pickled bytes exceed ~64 KB, so join-before-read deadlocks. Regression-guarded by test_listing.py::test_large_payload_does_not_deadlock.
  • fetch_series writes the series parquet to a caller-supplied tmpdir inside the child and returns only a small DatasetOutcome envelope. The parent reads the parquet back and the tmpdir is discarded. Disk is the transport.
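The drain-before-join ordering applies to any OS pipe, not only mp.Queue. The sketch below shows the same hazard with a subprocess stdout pipe as a stand-in (it is not the package's _isolation code): the child writes more than a typical pipe buffer holds, so the parent must read before it waits.

```python
import subprocess
import sys

# Child writes ~200 KB to stdout, well past a typical ~64 KB pipe buffer.
CHILD_CODE = "import sys; sys.stdout.write('x' * 200_000)"

proc = subprocess.Popen([sys.executable, "-c", CHILD_CODE], stdout=subprocess.PIPE)
assert proc.stdout is not None

# Correct order: drain the pipe first, then wait for the child to exit.
payload = proc.stdout.read()
proc.stdout.close()
exit_code = proc.wait(timeout=60)

# Wrong order (do not do this): calling proc.wait() before reading can block
# forever, because the child cannot finish writing into a full pipe.
print(len(payload), exit_code)
```

With mp.Queue the equivalent rule is to call queue.get() for every expected item before proc.join().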

Under load (ESTAT with ~8 k dataflows, ECB YC with ~2 k series) the parent process stays sdmx1-free — verified by test_listing.py::test_plugin_surface_import_does_not_pull_sdmx.
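A check of this kind can be approximated by importing the entry module in a fresh interpreter and inspecting sys.modules. This is a guess at the shape of such a test, not the contents of test_listing.py; `import_pulls_in` is a hypothetical helper, and standard-library modules stand in for parsimony_sdmx and sdmx so the sketch runs anywhere.

```python
import subprocess
import sys


def import_pulls_in(entry_module: str, forbidden: str) -> bool:
    """True if importing entry_module in a fresh interpreter loads forbidden."""
    code = (
        "import importlib, sys\n"
        f"importlib.import_module({entry_module!r})\n"
        f"name = {forbidden!r}\n"
        "hit = any(m == name or m.startswith(name + '.') for m in sys.modules)\n"
        "print(int(hit))\n"
    )
    done = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return done.stdout.strip() == "1"


# Standard-library stand-ins:
print(import_pulls_in("json", "wave"))          # json has no reason to load wave
print(import_pulls_in("json", "json.decoder"))  # json imports its own decoder
```

For the real check the call would look like import_pulls_in("parsimony_sdmx", "sdmx"); running it in a child interpreter keeps modules already loaded by the test runner from skewing the answer.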

Development

# Fast tier (306 tests, ~3 s) — excludes slow + integration markers
uv run --package parsimony-sdmx pytest packages/sdmx/tests -q

# Subprocess regression tier (2 tests, ~2 s) — real mp.Process children
uv run --package parsimony-sdmx pytest packages/sdmx/tests -m slow -v

# Lint + type check
uv run --package parsimony-sdmx ruff check packages/sdmx/
uv run --package parsimony-sdmx mypy packages/sdmx/parsimony_sdmx/

Hardening defaults: HTTPS-only bounded HTTP session, hardened lxml.iterparse (no entity resolution, no DTD load, no network), path traversal guards on every on-disk write.
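As one example of what a path traversal guard can look like, the sketch below resolves a caller-supplied relative path under a root and rejects anything that lands outside it. `safe_join` is a hypothetical helper written for illustration; the guards in parsimony_sdmx/io may be implemented differently.

```python
import tempfile
from pathlib import Path


def safe_join(root: Path, relative: str) -> Path:
    """Resolve relative under root; refuse anything that escapes root."""
    root = root.resolve()
    candidate = (root / relative).resolve()
    if candidate != root and root not in candidate.parents:
        raise ValueError(f"path escapes root: {relative!r}")
    return candidate


root = Path(tempfile.gettempdir()) / "parsimony-smoke"
print(safe_join(root, "sdmx_series_ecb_yc/entries.parquet").name)

try:
    safe_join(root, "../outside.parquet")
except ValueError:
    print("rejected")
```

Resolving both sides before comparing is what catches ".." segments and absolute paths in the supplied name.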

License

See LICENSE.
