SDMX protocol connector (ECB, Eurostat, IMF, OECD, BIS, World Bank, ILO) for the parsimony framework
parsimony-sdmx
SDMX connector plugin for parsimony. Harvests dataflow listings and per-dataset series keys from statistical agencies (ECB, Eurostat, IMF, World Bank), composes human-readable titles from the DSD + codelists, and publishes one parquet + FAISS bundle per catalog via parsimony publish.
Part of the parsimony-connectors monorepo. Distributed standalone on PyPI as parsimony-sdmx.
No separate builder CLI, no intermediate on-disk cache — every call hits the live agency endpoint inside a spawned subprocess.
Supported agencies
| Agency ID | Source |
|---|---|
| ECB | European Central Bank SDMX 2.1 |
| ESTAT | Eurostat SDMX 2.1 |
| IMF_DATA | IMF SDMX 3 (sdmx.imf.org) |
| WB_WDI | World Bank SDMX 2.1 (custom path × decade sweep) |
Connectors
| Name | Kind | Description |
|---|---|---|
| enumerate_sdmx_datasets | enumerator | One row per dataflow across every supported agency. Drives the sdmx_datasets catalog. |
| enumerate_sdmx_series | enumerator | One row per series key for a single (agency, dataset_id). Drives one sdmx_series_<agency>_<dataset_id> catalog per dataset. |
| sdmx_fetch | connector | Live observation fetch for a series key against the agency endpoint. |
Install
pip install parsimony-sdmx
Pulls in parsimony-core>=0.4,<0.5 automatically. For local publishing you also want the standard extra on parsimony-core, which adds FAISS, BM25, and sentence-transformers (the default embedder stack):
pip install "parsimony-core[standard]"
Verify discovery:
python -c "from parsimony import discover; print([p.name for p in discover.iter_providers()])"
Quick start
import asyncio

from parsimony_sdmx import CONNECTORS


async def main():
    connectors = CONNECTORS.bind_env()
    result = await connectors["sdmx_fetch"](
        agency="ECB",
        dataset_id="YC",
        key="B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y",
    )
    print(result.data.head())


asyncio.run(main())
For multi-plugin composition:
from parsimony import discover
connectors = discover.load_all().bind_env()
Catalog publishing
This plugin's namespaces are dynamic — one per (agency, dataset_id) pair discovered at publish time, plus one static cross-agency catalog:
- sdmx_datasets: one cross-agency catalog of every dataflow.
- sdmx_series_<agency>_<dataset_id>: one per-dataset catalog of series keys, e.g. sdmx_series_ecb_yc.
The plugin exports CATALOGS as an async generator function. It yields the static sdmx_datasets namespace first, then one sdmx_series_<agency>_<dataset_id> namespace per dataflow returned by the live agency listing. RESOLVE_CATALOG(namespace) provides the cheap reverse lookup used by --only: it parses a namespace string back into (agency, dataset_id) without enumerating the full listing.
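The namespace round trip described here can be sketched in a few lines. This is an illustration only: `namespace_for` and `parse_namespace` are hypothetical helpers, not the plugin's API, and the lowercasing rule is inferred from the `sdmx_series_ecb_yc` example. Because agency IDs such as `IMF_DATA` contain underscores, the sketch matches known agency prefixes (longest first) instead of splitting on `_`.

```python
from __future__ import annotations

# Agency IDs taken from the "Supported agencies" table.
AGENCIES = ("ECB", "ESTAT", "IMF_DATA", "WB_WDI")
PREFIX = "sdmx_series_"


def namespace_for(agency: str, dataset_id: str) -> str:
    """Build a per-dataset namespace (lowercasing is an assumption)."""
    return f"{PREFIX}{agency}_{dataset_id}".lower()


def parse_namespace(namespace: str) -> tuple[str, str] | None:
    """Recover (agency, dataset_id) from the string alone, with no listing call."""
    if not namespace.startswith(PREFIX):
        return None  # e.g. the static "sdmx_datasets" catalog
    rest = namespace[len(PREFIX):]
    # Try longer agency IDs first so "imf_data_x" is not misread.
    for agency in sorted(AGENCIES, key=len, reverse=True):
        head = agency.lower() + "_"
        if rest.startswith(head) and len(rest) > len(head):
            return agency, rest[len(head):]
    return None
```

Note that the dataset ID comes back in the namespace's lowercase form; how the real lookup restores the agency's original casing is not shown here.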
Publish a single catalog
Publish by name with --only (pure string lookup, no listing walk — RESOLVE_CATALOG fast-path):
parsimony publish \
--provider sdmx \
--target "file:///tmp/parsimony-smoke/{namespace}" \
--only sdmx_series_ecb_yc
The {namespace} placeholder is substituted before the push. The target scheme is what decides where the bundle lands:
| Scheme | Destination | Extra required |
|---|---|---|
| file://<path> | Local filesystem | none |
| hf://<repo> | Hugging Face dataset repo | standard |
| s3://<bucket> | S3 bucket | s3 |
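As a rough illustration of the substitution and scheme dispatch, the sketch below expands the `{namespace}` placeholder and splits the result into a scheme and a location. `resolve_target` is a hypothetical helper written for this README, not part of parsimony; it only mirrors the three schemes listed in the table.

```python
from __future__ import annotations

from urllib.parse import urlsplit

SUPPORTED_SCHEMES = {"file", "hf", "s3"}


def resolve_target(template: str, namespace: str) -> tuple[str, str]:
    """Expand {namespace}, then return (scheme, location)."""
    url = template.replace("{namespace}", namespace)
    parts = urlsplit(url)
    if parts.scheme not in SUPPORTED_SCHEMES:
        raise ValueError(f"unsupported target scheme: {parts.scheme!r}")
    if parts.scheme == "file":
        # file:///tmp/x has an empty netloc; the path is the location.
        return "file", parts.path
    # hf://repo/... and s3://bucket/...: netloc is the repo or bucket.
    return parts.scheme, parts.netloc + parts.path
```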
A local publish produces:
/tmp/parsimony-smoke/sdmx_series_ecb_yc/
├── entries.parquet # rows: (namespace, code, title, description, tags, metadata, embedding)
├── embeddings.faiss # FAISS index aligned with entries.parquet
└── meta.json # catalog metadata + embedder fingerprint
Publish everything
Omit --only to drive the async generator through every agency listing, publishing one bundle per discovered (agency, dataset_id) pair plus the static sdmx_datasets catalog. Expect a long run — ESTAT alone has 8 k+ dataflows:
parsimony publish \
--provider sdmx \
--target "file:///tmp/parsimony-smoke/{namespace}"
An agency that fails listing is skipped with a warning; the run continues for the others.
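The skip-on-failure behaviour can be pictured as an async generator that wraps each agency listing in its own try/except. The sketch below is an assumption about the shape, not the plugin's code: `iter_namespaces` and the `list_dataset_ids` callback are hypothetical names, and the generator yields plain namespace strings for simplicity.

```python
import asyncio
import logging
from typing import AsyncIterator, Awaitable, Callable, Iterable, Sequence

log = logging.getLogger("sdmx_sketch")

# Hypothetical callback type: agency ID -> dataset IDs from a live listing.
Lister = Callable[[str], Awaitable[Iterable[str]]]


async def iter_namespaces(
    agencies: Sequence[str], list_dataset_ids: Lister
) -> AsyncIterator[str]:
    """Yield the static catalog first, then one namespace per dataflow."""
    yield "sdmx_datasets"
    for agency in agencies:
        try:
            dataset_ids = await list_dataset_ids(agency)
        except Exception as exc:  # one failing agency must not stop the run
            log.warning("listing failed for %s, skipping: %s", agency, exc)
            continue
        for dataset_id in dataset_ids:
            yield f"sdmx_series_{agency}_{dataset_id}".lower()
```

Catching the failure per agency, rather than around the whole loop, is what lets the remaining agencies proceed.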
Overnight chain (ESTAT → IMF_DATA → WB_WDI)
The three-agency long tail (~10 k flows total, several pinning ~5 GB of heap)
needs process recycling: CPython rarely returns freed heap memory to the OS,
so a single Python process eventually OOMs the host. scripts/publish_overnight.sh
wraps scripts/publish_agency.py in a per-batch restart loop:
cd /home/espinet/ockham/parsimony-connectors # one-time
uv sync --all-packages --extra publish
cd packages/sdmx # then, every run
mkdir -p logs
nohup bash scripts/publish_overnight.sh > logs/overnight.log 2>&1 &
echo $! > logs/overnight.pid
The wrapper recycles the publisher every PARSIMONY_PUBLISH_BATCH_SIZE
flows (default 15) and passes --resume so namespaces with a
meta.json already on disk are skipped — safe to interrupt and restart.
Output stages to ~/.cache/parsimony/catalogs/sdmx/<namespace>/
(PARSIMONY_CACHE_DIR to redirect). Per-agency stdout lands in
logs/publish_<agency>.log; the wrapper banner goes to logs/overnight.log.
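The batching and resume rules can be sketched as two small helpers. These are illustrative only: `pending` and `batches` are hypothetical names, and the actual wrapper is a shell script whose internals are not reproduced here. The only facts taken from the text are the default batch size of 15 and the "meta.json already on disk" test.

```python
from __future__ import annotations

from pathlib import Path
from typing import Iterator, Sequence


def pending(root: Path, namespaces: Sequence[str]) -> list[str]:
    """Keep only namespaces without a meta.json under root/<namespace>/."""
    return [ns for ns in namespaces if not (root / ns / "meta.json").is_file()]


def batches(items: Sequence[str], size: int = 15) -> Iterator[list[str]]:
    """Split the work list into chunks; each chunk would get a fresh process."""
    if size < 1:
        raise ValueError("batch size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])
```

Because completion is judged by a file on disk, an interrupted run loses at most the namespace that was in flight.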
Monitor and ship:
tail -f logs/overnight.log # batch-level events
uv run parsimony cache info # catalogs subtree size
ls -1 ~/.cache/parsimony/catalogs/sdmx | wc -l # namespace count
# When done — push to HF (per-provider dir is the namespaced root):
hf upload ockham/sdmx ~/.cache/parsimony/catalogs/sdmx/
Search a published bundle
import asyncio

from parsimony.catalog import Catalog


async def main():
    cat = await Catalog.from_url("file:///tmp/parsimony-smoke/sdmx_series_ecb_yc")
    for hit in await cat.search("10 year yield", 3):
        print(f"{hit.similarity:.3f} {hit.code} {hit.title[:80]}")


asyncio.run(main())
The same Catalog.from_url(...) works against hf://, s3://, and file:// URLs. FAISS and BM25 results are combined via reciprocal rank fusion (RRF) at query time.
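For readers unfamiliar with RRF, here is a minimal sketch of the general technique: each ranked list contributes 1 / (k + rank) to a document's score. The function name `rrf` and the constant k = 60 (a common default in the literature) are assumptions; the values parsimony actually uses are not stated in this README.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Iterable, Sequence


def rrf(rankings: Iterable[Sequence[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse several ranked lists of codes into one scored ranking."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, code in enumerate(ranking, start=1):
            scores[code] += 1.0 / (k + rank)
    # Highest score first; ties broken by code for a stable order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))
```

A code that appears near the top of both the dense (FAISS) and sparse (BM25) lists outranks one that tops only a single list, without any score normalisation between the two retrievers.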
Plugin contract
The package implements the standard parsimony plugin contract, exported at the top level of parsimony_sdmx:
| Export | Role |
|---|---|
| CONNECTORS | Connectors collection: two enumerators + the sdmx_fetch connector. |
| CATALOGS | Async generator function that yields every catalog this plugin can publish. |
| RESOLVE_CATALOG | namespace -> Callable \| None; cheap reverse lookup for --only. |
SDMX endpoints are public; no environment variables are required.
Architecture
parsimony_sdmx/
├── core/ pure domain logic: record dataclasses, title composition,
│ codelist resolution, outcome types, domain exceptions
├── io/ boundary effects: atomic parquet writers, hardened lxml
│ iterparse, HTTPS-only bounded session, path safety
├── providers/ per-agency adapters behind a narrow `CatalogProvider`
│ protocol; ECB/ESTAT/IMF share a common sdmx1 flow helper,
│ WB diverges with a path × decade sweep
├── connectors/ parsimony `@enumerator` surface + ``sdmx_fetch`` live
│ observation connector
└── _isolation/ subprocess-spawning boundary for every sdmx1 call
Title composition
Each series row's title is built per DSD:
- ECB: uses the TITLE/TITLE_COMPL natural-language attributes fetched via the portal side-channel. Titles like "All euro area yield curve - 10-year spot rate". Short, semantic, directly embedder-friendly.
- ESTAT / IMF_DATA / WB_WDI: no natural-language attributes exposed; falls back to compose_series_title(), which concatenates "DIM: label - DIM: label - …" across every dimension in DSD order. Longer (80-150 tokens) but still searchable.
The codelist-composed form used to be prefixed onto ECB titles as well ("base | TITLE - TITLE_COMPL"), but it duplicated content the natural language already expresses and inflated embedding cost quadratically (BERT attention is O(N²)). It now serves only as a fallback when TITLE_COMPL is absent. The raw SDMX series key is always available in the code column, so keyword-exact queries are unaffected.
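The two title paths can be illustrated with a short sketch. Both functions are hypothetical stand-ins: the real `compose_series_title()` signature is not documented here, so the sketch uses its own name, and joining TITLE and TITLE_COMPL with " - " is an assumption based on the format quoted above. The labels in any example call are made up, not real codelist values.

```python
from __future__ import annotations

from typing import Mapping, Sequence


def compose_title_sketch(
    dim_order: Sequence[str],
    key_values: Mapping[str, str],
    labels: Mapping[str, Mapping[str, str]],
) -> str:
    """Join "DIM: label" parts in DSD order; fall back to the raw code."""
    parts = []
    for dim in dim_order:
        code = key_values[dim]
        label = labels.get(dim, {}).get(code, code)
        parts.append(f"{dim}: {label}")
    return " - ".join(parts)


def pick_title(title: str | None, title_compl: str | None, fallback: str) -> str:
    """Prefer natural-language attributes; use the composed form otherwise."""
    if title_compl:
        return f"{title} - {title_compl}" if title else title_compl
    return fallback  # TITLE_COMPL absent: codelist-composed title
```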
Why subprocess isolation
sdmx1 caches parsed structure messages (DSDs, codelists, dataflows) at module scope with no public invalidation hook. A long-lived Python process that imports it accumulates cache monotonically until OOM. Process death is the only working way to flush that cache.
Every sdmx1-touching call (list_datasets for listings, fetch_series for per-dataset sweeps) runs inside a freshly spawned process that is discarded after the call — never pooled. A ProcessPoolExecutor would retain sdmx1 in each worker across tasks and defeat the invariant.
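The invariant is easiest to see with a toy stand-in. The sketch below does not use sdmx1 or the package's `_isolation` module; it swaps in `subprocess` with a fresh interpreter per call, and a dictionary named `_CACHE` plays the role of a library's module-scope cache. `cache_size_after_call` is a hypothetical helper.

```python
import subprocess
import sys

# Child program: a module-level cache that would only ever grow
# if the same interpreter served many calls.
CHILD = """
import sys
_CACHE = {}
_CACHE[sys.argv[1]] = "parsed structure"
print(len(_CACHE))
"""


def cache_size_after_call(dataset_id: str, timeout: float = 60.0) -> int:
    """Run one call in a brand-new interpreter and report its cache size."""
    done = subprocess.run(
        [sys.executable, "-c", CHILD, dataset_id],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    # The child has exited by now, so its cache is gone with it.
    return int(done.stdout.strip())
```

Every call reports a cache of size one, however many calls precede it. A pooled worker running the same code would report a growing number instead.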
The two entry points in _isolation handle payload size differently:
- list_datasets returns up to ~8 k dataflow tuples through an mp.Queue that the parent drains before proc.join(). The feeder thread blocks on the OS pipe buffer once pickled bytes exceed ~64 KB, so join-before-read deadlocks. Regression-guarded by test_listing.py::test_large_payload_does_not_deadlock.
- fetch_series writes the series parquet to a caller-supplied tmpdir inside the child and returns only a small DatasetOutcome envelope. The parent reads the parquet back and the tmpdir is discarded. Disk is the transport.
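Both transport styles have a close analogue with plain pipes, sketched below under loud assumptions: the sketch uses `subprocess` and JSON rather than the package's `mp.Queue` and parquet, and `list_rows` / `fetch_rows` are hypothetical names. The ordering rule is the same in both worlds: read the child's output before waiting for it to exit.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Child that emits a payload far larger than a typical 64 KB pipe buffer.
LIST_CHILD = """
import json, sys
rows = [["ESTAT", f"flow_{i}", "x" * 40] for i in range(int(sys.argv[1]))]
json.dump(rows, sys.stdout)
"""

# Child that writes the bulk data to disk and prints a small envelope.
FETCH_CHILD = """
import json, sys
from pathlib import Path
out = Path(sys.argv[1]) / "series.json"
out.write_text(json.dumps([{"code": f"S{i}"} for i in range(2000)]))
json.dump({"ok": True, "path": str(out), "rows": 2000}, sys.stdout)
"""


def list_rows(n: int) -> list:
    """Large payload through the pipe: drain first, then wait."""
    proc = subprocess.Popen(
        [sys.executable, "-c", LIST_CHILD, str(n)],
        stdout=subprocess.PIPE,
        text=True,
    )
    # communicate() reads stdout to EOF before waiting; calling
    # proc.wait() first could block forever on a full pipe.
    out, _ = proc.communicate(timeout=120)
    if proc.returncode != 0:
        raise RuntimeError(f"child exited with {proc.returncode}")
    return json.loads(out)


def fetch_rows() -> list:
    """Disk as transport: only a small envelope crosses the pipe."""
    with tempfile.TemporaryDirectory() as tmp:
        done = subprocess.run(
            [sys.executable, "-c", FETCH_CHILD, tmp],
            capture_output=True, text=True, check=True, timeout=120,
        )
        envelope = json.loads(done.stdout)
        # Read the bulk data back before the tmpdir is removed.
        return json.loads(Path(envelope["path"]).read_text())
```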
Under load (ESTAT with ~8 k dataflows, ECB YC with ~2 k series) the parent process stays sdmx1-free — verified by test_listing.py::test_plugin_surface_import_does_not_pull_sdmx.
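A check of this kind can be sketched by importing the package in a fresh interpreter and inspecting `sys.modules`. The helper below is a hypothetical illustration, not the actual test; it is demonstrated with standard-library modules so it runs anywhere.

```python
import subprocess
import sys


def import_pulls(module: str, forbidden: str) -> bool:
    """Return True if importing `module` also loads `forbidden`."""
    code = f"import sys, {module}; print({forbidden!r} in sys.modules)"
    done = subprocess.run(
        [sys.executable, "-c", code],
        capture_output=True,
        text=True,
        check=True,
        timeout=60,
    )
    return done.stdout.strip() == "True"
```

For this package the call would presumably be `import_pulls("parsimony_sdmx", "sdmx")`, assuming `sdmx` is the import name sdmx1 installs under. Running the check in a child interpreter keeps the result independent of whatever the test process has already imported.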
Development
# Fast tier (306 tests, ~3 s) — excludes slow + integration markers
uv run --package parsimony-sdmx pytest packages/sdmx/tests -q
# Subprocess regression tier (2 tests, ~2 s) — real mp.Process children
uv run --package parsimony-sdmx pytest packages/sdmx/tests -m slow -v
# Lint + type check
uv run --package parsimony-sdmx ruff check packages/sdmx/
uv run --package parsimony-sdmx mypy packages/sdmx/parsimony_sdmx/
Hardening defaults: HTTPS-only bounded HTTP session, hardened lxml.iterparse (no entity resolution, no DTD load, no network), path traversal guards on every on-disk write.
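Two of these guards are simple enough to sketch with the standard library. The helpers below are hypothetical and only illustrate the general techniques (resolve-then-compare for path traversal, scheme check for HTTPS-only); the package's own `io/` implementations may differ.

```python
from pathlib import Path
from urllib.parse import urlsplit


def safe_child(root: Path, name: str) -> Path:
    """Resolve root/name and refuse anything that escapes root."""
    base = root.resolve()
    candidate = (base / name).resolve()
    # Rejects "..", absolute paths, and names that resolve to root itself.
    if candidate == base or not candidate.is_relative_to(base):
        raise ValueError(f"unsafe path component: {name!r}")
    return candidate


def require_https(url: str) -> str:
    """Return the URL unchanged, or raise if it is not HTTPS."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url!r}")
    return url
```

`Path.is_relative_to` requires Python 3.9 or later.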
Provider
- SDMX standard: https://sdmx.org
- ECB SDMX: https://data.ecb.europa.eu/help/api/overview
- Eurostat SDMX: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+SDMX+2.1
- IMF SDMX: https://datahelp.imf.org/knowledgebase/articles/667681-using-sdmx-to-query-imf-data
- World Bank SDMX: https://datahelpdesk.worldbank.org/knowledgebase/articles/889398-developer-information-overview
License
See LICENSE.