SDMX protocol connector (ECB, Eurostat, IMF, OECD, BIS, World Bank, ILO) for the parsimony framework

parsimony-sdmx

SDMX connector plugin for parsimony. Harvests dataflow listings and per-dataset series keys from statistical agencies (ECB, Eurostat, IMF, World Bank), composes human-readable titles from the DSD + codelists, and exposes lazy Catalog declarations that maintainers can build and push directly.

Part of the parsimony-connectors monorepo. Distributed standalone on PyPI as parsimony-sdmx.

No separate builder CLI, no intermediate on-disk cache — every call hits the live agency endpoint inside a spawned subprocess.

Supported agencies

Agency ID | Source
ECB | European Central Bank SDMX 2.1
ESTAT | Eurostat SDMX 2.1
IMF_DATA | IMF SDMX 3 (sdmx.imf.org)
WB_WDI | World Bank SDMX 2.1 (custom path × decade sweep)

Connectors

Name | Kind | Description
enumerate_sdmx_datasets | enumerator | One row per dataflow per agency (sdmx_datasets_<agency> namespaces).
enumerate_sdmx_series | connector (dynamic schema) | One row per series key for a single (agency, dataset_id). Drives one sdmx_series_<agency>_<dataset_id> catalog per dataset.
sdmx_fetch | connector | Live observation fetch for a series key against the agency endpoint.
sdmx_datasets_search | connector | Structured search over per-agency dataset catalogs.
sdmx_series_search | connector | Structured search over per-dataset series catalogs.

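Five registered connectors total (2 enumeration + 1 fetch + 2 search). Only enumerate_sdmx_datasets is declared with @enumerator; enumerate_sdmx_series is a plain @connector because its schema is dynamic.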

Dynamic schema: enumerate_sdmx_series

Per-dataset series enumeration returns a wide DataFrame whose columns depend on the SDMX datastructure definition for that flow. The output schema is therefore dynamic per call — it cannot be declared statically on @enumerator. The connector stays a plain @connector that returns raw pd.DataFrame rows; catalog builders project entities with entities_from_connector after the framework applies the per-call schema.
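
To make the dynamic-schema point concrete, the sketch below splits series keys into one column per DSD dimension, so two flows with different dimension lists yield different column sets. It uses plain dictionaries instead of pandas, and the helper name rows_for_flow and both dimension lists are illustrative assumptions, not part of the package.

```python
# Illustrative only: shows why the column set depends on the flow's DSD.
# `rows_for_flow` and the dimension lists are assumptions for this sketch.

def rows_for_flow(dimension_ids, series_keys):
    """Split dot-separated series keys into one column per DSD dimension."""
    rows = []
    for key in series_keys:
        parts = key.split(".")
        if len(parts) != len(dimension_ids):
            raise ValueError(f"{key!r} does not match {len(dimension_ids)} dimensions")
        row = {"code": key}  # the raw series key is always kept
        row.update(zip(dimension_ids, parts))
        rows.append(row)
    return rows


# A seven-dimension flow and a two-dimension flow produce different columns.
yield_curve_dims = ["FREQ", "REF_AREA", "CURRENCY", "PROVIDER_FM",
                    "INSTRUMENT_FM", "PROVIDER_FM_ID", "DATA_TYPE_FM"]
wide = rows_for_flow(yield_curve_dims, ["B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y"])
narrow = rows_for_flow(["FREQ", "REF_AREA"], ["M.ES"])

print(sorted(wide[0]))    # column names for the first flow
print(sorted(narrow[0]))  # column names for the second flow
```

Because the column names are only known once the datastructure definition has been read, a schema declared at decoration time could not cover both calls.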

Install

pip install parsimony-sdmx

Pulls in parsimony-core>=0.6,<0.7 automatically. Local catalog publishing uses the core catalog stack (hybrid BM25+vector or BM25-only per field):

pip install "parsimony-core[standard]"

Verify discovery:

python -c "from parsimony import discover; print([p.name for p in discover.iter_providers()])"

Quick start

import asyncio
from parsimony_sdmx import CONNECTORS

async def main():
    connectors = CONNECTORS
    result = await connectors["sdmx_fetch"](
        dataset_key="ECB-YC",
        series_key="B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y",
    )
    print(result.data.head())

asyncio.run(main())

For multi-plugin composition:

from parsimony import discover
connectors = discover.load_all()

Catalog building

Catalog building is an operator workflow in scripts/build_catalog.py. Indexing policy lives in parsimony_sdmx/catalog_policy.py: hybrid BM25+vector per field when unique text count is below 1,000, otherwise BM25-only.
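
As a rough illustration of that per-field rule, the sketch below picks an index mode from the number of unique texts in a field. The function name, the constant, and the "hybrid" / "bm25" labels are assumptions made for this example; the actual contents of catalog_policy.py are not shown in this README.

```python
# Sketch of the per-field threshold rule; names and labels are hypothetical.
HYBRID_MAX_UNIQUE = 1_000  # threshold stated in this README


def index_mode_for_field(values):
    """Return 'hybrid' below the unique-text threshold, otherwise 'bm25'."""
    unique_texts = {v for v in values if v}  # ignore empty values
    return "hybrid" if len(unique_texts) < HYBRID_MAX_UNIQUE else "bm25"


print(index_mode_for_field(["Monthly", "Annual", "Monthly"]))
print(index_mode_for_field(f"title {i}" for i in range(1_000)))
```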

Namespaces:

  • sdmx_datasets_<agency> — one dataset catalog per agency (e.g. sdmx_datasets_ecb).
  • sdmx_series_<agency>_<dataset_id> — per-flow series catalogs for selected macro/finance flows.
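
The naming pattern can be sketched as two small helpers. They assume that lowercasing is the only normalisation applied, which matches the ECB examples in this README but is not confirmed for every agency or dataset ID; the function names are hypothetical.

```python
# Hypothetical helpers; lowercasing is an assumption based on the
# sdmx_datasets_ecb and sdmx_series_ecb_yc examples in this README.

def dataset_namespace(agency):
    """Namespace of the per-agency dataset catalog."""
    return f"sdmx_datasets_{agency.lower()}"


def series_namespace(agency, dataset_id):
    """Namespace of the per-flow series catalog."""
    return f"sdmx_series_{agency.lower()}_{dataset_id.lower()}"


print(dataset_namespace("ECB"))
print(series_namespace("ECB", "YC"))
```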

Build and push

# One agency: full dataset index + selected series catalogs
uv run python scripts/build_catalog.py --catalog agency --agency ECB \
  --save-root /tmp/parsimony-catalogs/sdmx --push-root hf://parsimony-dev/sdmx

# Full portfolio (all agencies)
uv run python scripts/build_catalog.py --catalog portfolio \
  --save-root /tmp/parsimony-catalogs/sdmx --push-root hf://parsimony-dev/sdmx \
  --parallel 2 --keep-going

Use --save-root /tmp/sdmx to write local snapshots under namespace subdirectories. Use --push <url> for one explicit catalog URL or --push-root <root> for namespace subdirectories.

A local build produces:

/tmp/parsimony-smoke/sdmx_series_ecb_yc/
├── entries.parquet
├── indexes/
└── meta.json

Build an agency batch

uv run python scripts/build_catalog.py --catalog agency --agency ECB --push-root hf://parsimony-dev/sdmx
uv run python scripts/build_catalog.py --catalog agency --agency ESTAT --max-catalogs 30 --save-root /tmp/sdmx

An agency that fails listing raises before building; individual $DV_* derived views are skipped because they are not fetchable series catalogs.

Expected search workflow (agents and maintainers)

SDMX catalogs are built for structured field search first, not open-ended semantic Q&A.

  1. sdmx_datasets_search(agency='ECB', query=...) on sdmx_datasets_ecb, using either a structured clause (code: ECB|YC) or title text.
  2. Read the returned dimensions manifest (present only when a series catalog exists for that flow).
  3. sdmx_series_search(flow_id='ECB/YC', ...) with structured dimension clauses.
  4. sdmx_fetch with the series key from the search results.

High-cardinality fields (especially title on large series catalogs) are indexed BM25-only once their unique value count reaches 1,000. Prefer structured FIELD: value clauses over long natural-language probes on those catalogs.
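
A small helper can keep such clauses consistent. The sketch below joins FIELD: value pairs with the && separator that appears in this README's search example; the helper name is hypothetical and the catalog's full query grammar is not documented here.

```python
# Hypothetical helper: builds a structured query string from field/value pairs.

def structured_query(clauses):
    """Join FIELD: value pairs with ' && ', as in this README's examples."""
    if not clauses:
        raise ValueError("at least one clause is required")
    return " && ".join(f"{field}: {value}" for field, value in clauses.items())


print(structured_query({"REF_AREA": "Spain", "FREQ": "Monthly"}))
# prints: REF_AREA: Spain && FREQ: Monthly
```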

Override the catalog root for local dev: PARSIMONY_SDMX_CATALOG_URL=file:///tmp/sdmx (default publish target: hf://parsimony-dev/sdmx).
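
One plausible way to resolve a catalog URL from that variable is sketched below. How the package itself reads the variable is not shown in this README, so the function name and the join logic are assumptions.

```python
import os

DEFAULT_CATALOG_ROOT = "hf://parsimony-dev/sdmx"  # default publish target


def catalog_url(namespace, environ=None):
    """Resolve a namespace against the overridable catalog root (sketch)."""
    env = os.environ if environ is None else environ
    root = env.get("PARSIMONY_SDMX_CATALOG_URL") or DEFAULT_CATALOG_ROOT
    return f"{root.rstrip('/')}/{namespace}"


# With the override set, lookups go to the local snapshot directory.
print(catalog_url("sdmx_datasets_ecb",
                  {"PARSIMONY_SDMX_CATALOG_URL": "file:///tmp/sdmx"}))
# Without it, the default root is used.
print(catalog_url("sdmx_datasets_ecb", {}))
```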

Search a published bundle

import asyncio
from parsimony.catalog import Catalog

async def main():
    datasets = await Catalog.load("hf://parsimony-dev/sdmx/sdmx_datasets_ecb")
    flows, _ = await datasets.search("code: ECB|YC", limit=3)
    print("datasets", flows[0].code, flows[0].title[:80])

    series = await Catalog.load("hf://parsimony-dev/sdmx/sdmx_series_ecb_yc")
    hits, _ = await series.search("REF_AREA: Spain && FREQ: Monthly", limit=3)
    for hit in hits:
        print(f"{hit.score:.3f}  {hit.code}  {hit.title[:80]}")

asyncio.run(main())

The same Catalog.load(...) works against hf:// and file:// URLs. Structured queries intersect candidates across fields; plain text without field syntax falls back to the title index only.
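
The intersection behaviour can be pictured with a toy model. The sketch below uses exact-match lookups in an in-memory dictionary, whereas the real catalog ranks with BM25 and vector indexes; every name and the toy index are invented for illustration and none of this is the parsimony Catalog API.

```python
# Toy model only: exact-match lookups stand in for the real ranked indexes.

def parse_clauses(query):
    """Split 'FIELD: value && FIELD: value' into (field, value) pairs."""
    clauses = []
    for part in query.split("&&"):
        field, sep, value = part.partition(":")
        if not sep:
            # No field syntax: treat the whole query as a title lookup.
            return [("title", query.strip())]
        clauses.append((field.strip(), value.strip()))
    return clauses


def search(index, query):
    """Intersect the candidate sets produced by each clause."""
    candidate_sets = [
        index.get(field, {}).get(value.lower(), set())
        for field, value in parse_clauses(query)
    ]
    return set.intersection(*candidate_sets)


# field -> lowercased value -> matching series codes (made-up data)
toy_index = {
    "REF_AREA": {"spain": {"S1", "S2"}, "germany": {"S3"}},
    "FREQ": {"monthly": {"S1", "S3"}},
    "title": {"yield curve": {"S1"}},
}

print(search(toy_index, "REF_AREA: Spain && FREQ: Monthly"))
print(search(toy_index, "yield curve"))
```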

Validate a built or published snapshot:

uv run python scripts/validate_catalog.py --catalog-url file:///tmp/parsimony-catalogs/sdmx/sdmx_series_ecb_yc
uv run python scripts/validate_catalog.py \
  --catalog-url file:///tmp/parsimony-catalogs/sdmx/sdmx_datasets_ecb \
  --catalog-root file:///tmp/parsimony-catalogs/sdmx \
  --queries-file packages/sdmx/catalog_tests/queries.yaml

Plugin contract

The package implements the standard parsimony plugin contract, exported at the top level of parsimony_sdmx:

Export | Role
CONNECTORS | Connectors collection: the two enumeration connectors, sdmx_fetch, and the two search connectors.

SDMX endpoints are public; no environment variables are required.

Architecture

parsimony_sdmx/
├── core/         pure domain logic: record dataclasses, title composition,
│                 codelist resolution, outcome types, domain exceptions
├── io/           boundary effects: atomic parquet writers, hardened lxml
│                 iterparse, HTTPS-only bounded session, path safety
├── providers/    per-agency adapters behind a narrow `CatalogProvider`
│                 protocol; ECB/ESTAT/IMF share a common sdmx1 flow helper,
│                 WB diverges with a path × decade sweep
├── connectors/   parsimony `@enumerator` surface + `sdmx_fetch` live
│                 observation connector
└── _isolation/   subprocess-spawning boundary for every sdmx1 call

Title composition

Each series row's title is built per DSD:

  • ECB: uses the TITLE / TITLE_COMPL natural-language attributes fetched via the portal side-channel, giving titles like "All euro area yield curve - 10-year spot rate" that are short, semantic, and directly embedder-friendly.
  • ESTAT / IMF_DATA / WB_WDI: no natural-language attributes are exposed, so titles fall back to compose_series_title(), which concatenates "DIM: label - DIM: label - …" across every dimension in DSD order. These titles are longer (80-150 tokens) but still searchable.

The codelist-composed form is used only as a fallback when TITLE_COMPL is absent. Appending it to a natural-language title would lengthen the embedded text without adding signal, and self-attention cost in BERT-style encoders grows as O(N²) in the token count N. The raw SDMX series key is always available in the code column, so keyword-exact queries are unaffected.
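
The fallback logic can be sketched as follows. The real signature of compose_series_title() is not shown in this README, so the function names, the argument shapes, and the choice to fall back to the raw code when a label is missing are all assumptions.

```python
# Sketch only: names, arguments, and the raw-code fallback are hypothetical.

def compose_title_sketch(series_key, dimension_ids, codelists):
    """Build 'DIM: label - DIM: label' across dimensions in DSD order."""
    parts = []
    for dim, code in zip(dimension_ids, series_key.split(".")):
        label = codelists.get(dim, {}).get(code, code)  # raw code if unlabeled
        parts.append(f"{dim}: {label}")
    return " - ".join(parts)


def series_title(series_key, dimension_ids, codelists, natural_title=None):
    """Prefer a natural-language title; compose from codelists otherwise."""
    return natural_title or compose_title_sketch(series_key, dimension_ids, codelists)


codelists = {"FREQ": {"M": "Monthly"}, "REF_AREA": {"ES": "Spain"}}
print(series_title("M.ES", ["FREQ", "REF_AREA"], codelists))
# prints: FREQ: Monthly - REF_AREA: Spain
print(series_title("M.ES", ["FREQ", "REF_AREA"], codelists,
                   natural_title="Example natural-language title"))
# prints: Example natural-language title
```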

Why subprocess isolation

sdmx1 caches parsed structure messages (DSDs, codelists, dataflows) at module scope with no public invalidation hook. A long-lived Python process that imports it accumulates cache monotonically until OOM. Process death is the only working way to flush that cache.

Every sdmx1-touching call (list_datasets for listings, fetch_series for per-dataset sweeps) runs inside a freshly spawned process that is discarded after the call — never pooled. A ProcessPoolExecutor would retain sdmx1 in each worker across tasks and defeat the invariant.

The two entry points in _isolation handle payload size differently:

  • list_datasets returns up to ~8 k dataflow tuples through an mp.Queue that the parent drains before proc.join() — the feeder thread blocks on the OS pipe buffer once pickled bytes exceed ~64 KB, so join-before-read deadlocks. Regression-guarded by test_listing.py::test_large_payload_does_not_deadlock.
  • fetch_series writes the series parquet to a caller-supplied tmpdir inside the child and returns only a small DatasetOutcome envelope. The parent reads the parquet back and the tmpdir is discarded. Disk is the transport.
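
Both transports can be imitated with the standard library. The sketch below swaps in subprocess and JSON for the package's mp.Process, mp.Queue, and parquet, so it illustrates the ordering rules rather than the actual _isolation code; the child programs and function names are invented for the example.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Child for the pipe transport: emits a payload well above 64 KB.
PIPE_CHILD = """
import json, sys
json.dump([["FLOW_%d" % i, "x" * 40] for i in range(8000)], sys.stdout)
"""

# Child for the disk transport: writes bulk data to a file in the directory
# passed as argv[1] and prints only a small envelope.
DISK_CHILD = """
import json, sys
from pathlib import Path
out = Path(sys.argv[1]) / "series.json"
out.write_text(json.dumps(list(range(2000))))
json.dump({"status": "ok", "path": str(out)}, sys.stdout)
"""


def list_via_pipe():
    """Fresh interpreter per call; drain the pipe before waiting on the child."""
    proc = subprocess.Popen([sys.executable, "-c", PIPE_CHILD],
                            stdout=subprocess.PIPE)
    # communicate() reads stdout to EOF and only then reaps the child.
    # Calling proc.wait() first could block forever once the pipe buffer fills.
    payload, _ = proc.communicate()
    if proc.returncode != 0:
        raise RuntimeError(f"child exited with {proc.returncode}")
    return json.loads(payload)


def fetch_via_disk():
    """Child writes the bulk result to a tmpdir; parent reads it back."""
    with tempfile.TemporaryDirectory() as tmpdir:
        done = subprocess.run([sys.executable, "-c", DISK_CHILD, tmpdir],
                              capture_output=True, check=True)
        envelope = json.loads(done.stdout)  # small status message only
        rows = json.loads(Path(envelope["path"]).read_text())
    return envelope["status"], rows  # tmpdir is gone at this point


flows = list_via_pipe()
status, rows = fetch_via_disk()
print(len(flows), status, len(rows))
```

In both functions the child interpreter exits after a single call, so anything it cached at module scope is released with the process.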

Under load (ESTAT with ~8 k dataflows, ECB YC with ~2 k series) the parent process stays sdmx1-free — verified by test_listing.py::test_plugin_surface_import_does_not_pull_sdmx.
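
A check of this kind can be approximated by running an import in a fresh interpreter and inspecting sys.modules afterwards. The helper below is a hypothetical sketch, not the test named above, and it is demonstrated with standard-library modules so that it runs without the package installed.

```python
import subprocess
import sys


def import_pulls_module(import_statement, module_name):
    """Run the import in a fresh interpreter and report whether
    `module_name` ended up in sys.modules."""
    probe = (
        f"{import_statement}\n"
        "import sys\n"
        f"print({module_name!r} in sys.modules)"
    )
    done = subprocess.run([sys.executable, "-c", probe],
                          capture_output=True, text=True, check=True)
    return done.stdout.strip() == "True"


print(import_pulls_module("import json", "json"))  # json is loaded by itself
print(import_pulls_module("import json", "sdmx"))  # json does not load sdmx
```

Applied to this package, the same probe would be called with "import parsimony_sdmx" and "sdmx", assuming sdmx is the import name of sdmx1.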

Development

# Fast tier (306 tests, ~3 s) — excludes slow + integration markers
uv run --package parsimony-sdmx pytest packages/sdmx/tests -q

# Subprocess regression tier (2 tests, ~2 s) — real mp.Process children
uv run --package parsimony-sdmx pytest packages/sdmx/tests -m slow -v

# Lint + type check
uv run --package parsimony-sdmx ruff check packages/sdmx/
uv run --package parsimony-sdmx mypy packages/sdmx/parsimony_sdmx/

Hardening defaults: HTTPS-only bounded HTTP session, hardened lxml.iterparse (no entity resolution, no DTD load, no network), path traversal guards on every on-disk write.
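
Two of these guards are simple enough to sketch with the standard library. The functions below are illustrative assumptions about what an HTTPS-only check and a path traversal guard can look like; they are not the package's io/ implementation.

```python
from pathlib import Path
from urllib.parse import urlsplit


def require_https(url):
    """Reject any URL whose scheme is not https."""
    if urlsplit(url).scheme.lower() != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url!r}")
    return url


def safe_child_path(root, name):
    """Resolve `name` under `root` and refuse paths that escape it."""
    root_path = Path(root).resolve()
    candidate = (root_path / name).resolve()
    if not candidate.is_relative_to(root_path):  # requires Python 3.9+
        raise ValueError(f"{name!r} escapes {root_path}")
    return candidate


print(require_https("https://example.org/data"))
print(safe_child_path("/tmp/sdmx", "sdmx_series_ecb_yc/entries.parquet"))
```

Resolving both paths before comparing them means that "../" segments and absolute names are caught in the same check.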

Provider

License

See LICENSE.
