SDMX protocol connector (ECB, Eurostat, IMF, OECD, BIS, World Bank, ILO) for the parsimony framework
parsimony-sdmx
SDMX connector plugin for parsimony. Harvests dataflow listings and per-dataset series keys from statistical agencies (ECB, Eurostat, IMF, World Bank), composes human-readable titles from the DSD + codelists, and exposes lazy Catalog declarations that maintainers can build and push directly.
Part of the parsimony-connectors monorepo. Distributed standalone on PyPI as parsimony-sdmx.
No separate builder CLI, no intermediate on-disk cache — every call hits the live agency endpoint inside a spawned subprocess.
Supported agencies
| Agency ID | Source |
|---|---|
| `ECB` | European Central Bank SDMX 2.1 |
| `ESTAT` | Eurostat SDMX 2.1 |
| `IMF_DATA` | IMF SDMX 3 (sdmx.imf.org) |
| `WB_WDI` | World Bank SDMX 2.1 (custom path × decade sweep) |
Connectors
| Name | Kind | Description |
|---|---|---|
| `enumerate_sdmx_datasets` | enumerator | One row per dataflow per agency (`sdmx_datasets_<agency>` namespaces). |
| `enumerate_sdmx_series` | connector (dynamic schema) | One row per series key for a single `(agency, dataset_id)`. Drives one `sdmx_series_<agency>_<dataset_id>` catalog per dataset. |
| `sdmx_fetch` | connector | Live observation fetch for a series key against the agency endpoint. |
| `sdmx_datasets_search` | connector | Structured search over per-agency dataset catalogs. |
| `sdmx_series_search` | connector | Structured search over per-dataset series catalogs. |
Five registered connectors in total (2 enumeration connectors + 1 fetch + 2 search).
Dynamic schema: enumerate_sdmx_series
Per-dataset series enumeration returns a wide DataFrame whose columns depend
on the SDMX datastructure definition for that flow. The output schema is
therefore dynamic per call — it cannot be declared statically on
@enumerator. The connector stays a plain @connector that returns raw
pd.DataFrame rows; catalog builders project entities with
entities_from_connector after the framework applies the per-call schema.
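
To illustrate why the schema cannot be fixed ahead of time, the sketch below builds rows whose columns follow the dimension list of each flow. The flow names, dimension lists and the `series_rows` helper are all hypothetical (they are not taken from parsimony-sdmx or from a real DSD), and plain dicts stand in for the DataFrame.

```python
from typing import Dict, List

# Hypothetical, hand-written dimension lists. A real DSD is fetched from the agency.
DSD_DIMENSIONS: Dict[str, List[str]] = {
    "FLOW_A": ["FREQ", "REF_AREA", "CURRENCY"],
    "FLOW_B": ["FREQ", "GEO", "UNIT", "INDICATOR"],
}


def series_rows(flow: str, series_keys: List[str]) -> List[Dict[str, str]]:
    """Split dot-separated series keys into one column per dimension of the flow."""
    dims = DSD_DIMENSIONS[flow]
    rows = []
    for key in series_keys:
        parts = key.split(".")
        if len(parts) != len(dims):
            raise ValueError(f"{key!r} does not match the {len(dims)} dimensions of {flow}")
        row = {"code": key}          # the raw series key is always kept
        row.update(zip(dims, parts)) # one column per dimension, in DSD order
        rows.append(row)
    return rows


rows_a = series_rows("FLOW_A", ["M.ES.EUR"])
rows_b = series_rows("FLOW_B", ["A.DE.PC.GDP"])
print(list(rows_a[0]))  # columns for FLOW_A
print(list(rows_b[0]))  # a different column set for FLOW_B
```

The two calls return different column sets, which is the property that rules out a single static schema declaration.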
Install
```bash
pip install parsimony-sdmx
```
Pulls in parsimony-core>=0.6,<0.7 automatically. Local catalog publishing uses the core catalog stack (hybrid BM25+vector or BM25-only per field):
```bash
pip install "parsimony-core[standard]"
```
Verify discovery:
```bash
python -c "from parsimony import discover; print([p.name for p in discover.iter_providers()])"
```
Quick start
```python
import asyncio

from parsimony_sdmx import CONNECTORS


async def main():
    connectors = CONNECTORS
    # Fetch live observations for one series key of the ECB yield-curve flow.
    result = await connectors["sdmx_fetch"](
        dataset_key="ECB-YC",
        series_key="B.U2.EUR.4F.G_N_A.SV_C_YM.SR_10Y",
    )
    print(result.data.head())


asyncio.run(main())
```
For multi-plugin composition:
```python
from parsimony import discover

connectors = discover.load_all()
```
Catalog building
Catalog building is an operator workflow in scripts/build_catalog.py. Indexing policy lives in parsimony_sdmx/catalog_policy.py: hybrid BM25+vector per field when unique text count is below 1,000, otherwise BM25-only.
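
The threshold rule can be sketched as a small function. This is an illustration of the policy as stated above, not the contents of `catalog_policy.py`; the function name and the `"hybrid"` / `"bm25"` labels are assumptions.

```python
from typing import Iterable

HYBRID_THRESHOLD = 1_000  # threshold stated in the policy description


def choose_index_mode(values: Iterable[str], threshold: int = HYBRID_THRESHOLD) -> str:
    """Return 'hybrid' when a field has fewer than `threshold` unique texts, else 'bm25'.

    The mode labels are hypothetical names used only for this sketch.
    """
    unique_count = len(set(values))
    return "hybrid" if unique_count < threshold else "bm25"


low_cardinality = choose_index_mode(["Monthly", "Annual", "Monthly"])
high_cardinality = choose_index_mode(f"title {i}" for i in range(1_000))
print(low_cardinality, high_cardinality)
```

A field with exactly 1,000 unique texts already falls on the BM25-only side, since the hybrid branch requires a count strictly below the threshold.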
Namespaces:
- `sdmx_datasets_<agency>`: one dataset catalog per agency (e.g. `sdmx_datasets_ecb`).
- `sdmx_series_<agency>_<dataset_id>`: per-flow series catalogs for selected macro/finance flows.
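
As a rough sketch, the two namespace patterns can be written as helper functions. The helper names are hypothetical, and plain lowercasing is an assumption inferred from the examples `sdmx_datasets_ecb` and `sdmx_series_ecb_yc`; the package may normalise identifiers differently.

```python
def datasets_namespace(agency: str) -> str:
    """Namespace of the per-agency dataset catalog (assumes simple lowercasing)."""
    return f"sdmx_datasets_{agency.lower()}"


def series_namespace(agency: str, dataset_id: str) -> str:
    """Namespace of a per-flow series catalog (assumes simple lowercasing)."""
    return f"sdmx_series_{agency.lower()}_{dataset_id.lower()}"


print(datasets_namespace("ECB"))
print(series_namespace("ECB", "YC"))
```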
Build and push
```bash
# One agency: full dataset index + selected series catalogs
uv run python scripts/build_catalog.py --catalog agency --agency ECB \
  --save-root /tmp/parsimony-catalogs/sdmx --push-root hf://parsimony-dev/sdmx

# Full portfolio (all agencies)
uv run python scripts/build_catalog.py --catalog portfolio \
  --save-root /tmp/parsimony-catalogs/sdmx --push-root hf://parsimony-dev/sdmx \
  --parallel 2 --keep-going
```
Use --save-root /tmp/sdmx to write local snapshots under namespace subdirectories. Use --push <url> for one explicit catalog URL or --push-root <root> for namespace subdirectories.
A local build produces:
```
/tmp/parsimony-smoke/sdmx_series_ecb_yc/
├── entries.parquet
├── indexes/
└── meta.json
```
Build an agency batch
```bash
uv run python scripts/build_catalog.py --catalog agency --agency ECB --push-root hf://parsimony-dev/sdmx
uv run python scripts/build_catalog.py --catalog agency --agency ESTAT --max-catalogs 30 --save-root /tmp/sdmx
```
If an agency's dataflow listing fails, the build raises before any catalog is built. Individual `$DV_*` derived views are skipped because they are not fetchable series catalogs.
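
The skip rule for derived views amounts to a prefix filter. The sketch below is an illustration only: the helper name and the sample dataset IDs are made up, and the only thing taken from the text above is the `$DV_*` pattern.

```python
from fnmatch import fnmatchcase
from typing import List


def is_derived_view(dataset_id: str) -> bool:
    """True for IDs matching the `$DV_*` pattern (case-sensitive)."""
    return fnmatchcase(dataset_id, "$DV_*")


def buildable(dataset_ids: List[str]) -> List[str]:
    """Drop derived views, keeping the original order of the remaining IDs."""
    return [d for d in dataset_ids if not is_derived_view(d)]


# Hypothetical listing, for illustration only.
kept = buildable(["FLOW_A", "$DV_1234", "FLOW_B"])
print(kept)
```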
Expected search workflow (agents and maintainers)
SDMX catalogs are built for structured field search first, not open-ended semantic Q&A.
1. `sdmx_datasets_search(agency='ECB', query=...)` on `sdmx_datasets_ecb`: structured `code: ECB|YC` or title text.
2. Read the returned `dimensions` manifest (present only when a series catalog exists for that flow).
3. `sdmx_series_search(flow_id='ECB/YC', ...)`: structured dimension clauses.
4. `sdmx_fetch` with the series key from search results.
High-cardinality fields (especially title on large series catalogs) may be BM25-only when unique value count reaches 1,000 or more. Prefer structured FIELD: value clauses over long natural-language probes on those catalogs.
Override the catalog root for local dev: PARSIMONY_SDMX_CATALOG_URL=file:///tmp/sdmx (default publish target: hf://parsimony-dev/sdmx).
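
A minimal sketch of how such an override could resolve a catalog URL, assuming the root and the namespace are simply joined with a slash. The `catalog_url` helper is hypothetical; only the environment variable name and the default root come from the text above.

```python
import os
from typing import Mapping

ENV_VAR = "PARSIMONY_SDMX_CATALOG_URL"
DEFAULT_CATALOG_ROOT = "hf://parsimony-dev/sdmx"  # default publish target


def catalog_url(namespace: str, environ: Mapping[str, str] = os.environ) -> str:
    """Join the (possibly overridden) catalog root with a namespace subdirectory."""
    root = environ.get(ENV_VAR, DEFAULT_CATALOG_ROOT).rstrip("/")
    return f"{root}/{namespace}"


default_url = catalog_url("sdmx_datasets_ecb", environ={})
local_url = catalog_url("sdmx_datasets_ecb", environ={ENV_VAR: "file:///tmp/sdmx"})
print(default_url)
print(local_url)
```

Passing the environment as a parameter keeps the sketch easy to exercise without mutating `os.environ`.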
Search a published bundle
```python
import asyncio

from parsimony.catalog import Catalog


async def main():
    # Dataset-level catalog: look up the flow by its structured code.
    datasets = await Catalog.load("hf://parsimony-dev/sdmx/sdmx_datasets_ecb")
    flows, _ = await datasets.search("code: ECB|YC", limit=3)
    print("datasets", flows[0].code, flows[0].title[:80])

    # Series-level catalog: filter by dimension clauses.
    series = await Catalog.load("hf://parsimony-dev/sdmx/sdmx_series_ecb_yc")
    hits, _ = await series.search("REF_AREA: Spain && FREQ: Monthly", limit=3)
    for hit in hits:
        print(f"{hit.score:.3f} {hit.code} {hit.title[:80]}")


asyncio.run(main())
```
The same Catalog.load(...) works against hf:// and file:// URLs. Structured queries intersect candidates across fields; plain text without field syntax falls back to the title index only.
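
The intersect-or-fall-back behaviour can be pictured with a toy in-memory matcher. This is a sketch under loud assumptions: the real query grammar and ranking live in parsimony-core and are not reproduced here, the `parse_query` and `search` helpers are hypothetical, and substring matching stands in for the BM25 and vector indexes.

```python
from typing import Dict, List


def parse_query(query: str) -> Dict[str, str]:
    """Split 'FIELD: value && FIELD: value' into clauses; bare text maps to 'title'."""
    clauses: Dict[str, str] = {}
    for part in query.split("&&"):
        part = part.strip()
        if not part:
            continue
        field, sep, value = part.partition(":")
        if sep and field.strip() and value.strip():
            clauses[field.strip()] = value.strip()
        else:
            clauses["title"] = part  # no field syntax: title only
    return clauses


def search(rows: List[Dict[str, str]], query: str) -> List[Dict[str, str]]:
    """Intersect per-field candidate sets (case-insensitive substring match)."""
    clauses = parse_query(query)
    if not clauses:
        return []
    candidates = set(range(len(rows)))
    for field, value in clauses.items():
        matches = {i for i, row in enumerate(rows) if value.lower() in row.get(field, "").lower()}
        candidates &= matches
    return [rows[i] for i in sorted(candidates)]


# Hypothetical rows, for illustration only.
rows = [
    {"code": "A", "title": "Spain monthly rate", "REF_AREA": "Spain", "FREQ": "Monthly"},
    {"code": "B", "title": "Spain annual rate", "REF_AREA": "Spain", "FREQ": "Annual"},
    {"code": "C", "title": "France monthly rate", "REF_AREA": "France", "FREQ": "Monthly"},
]
structured = [r["code"] for r in search(rows, "REF_AREA: Spain && FREQ: Monthly")]
plain = [r["code"] for r in search(rows, "annual")]
print(structured, plain)
```

One limitation of this toy parser is that plain text containing a colon would be read as a field clause; how the real parser treats that case is not stated here.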
Validate a built or published snapshot:
```bash
uv run python scripts/validate_catalog.py --catalog-url file:///tmp/parsimony-catalogs/sdmx/sdmx_series_ecb_yc

uv run python scripts/validate_catalog.py \
  --catalog-url file:///tmp/parsimony-catalogs/sdmx/sdmx_datasets_ecb \
  --catalog-root file:///tmp/parsimony-catalogs/sdmx \
  --queries-file packages/sdmx/catalog_tests/queries.yaml
```
Plugin contract
The package implements the standard parsimony plugin contract, exported at the top level of parsimony_sdmx:
| Export | Role |
|---|---|
| `CONNECTORS` | `Connectors` collection: two enumerators, `sdmx_fetch`, and two search connectors. |
SDMX endpoints are public; no environment variables are required.
Architecture
```
parsimony_sdmx/
├── core/         pure domain logic: record dataclasses, title composition,
│                 codelist resolution, outcome types, domain exceptions
├── io/           boundary effects: atomic parquet writers, hardened lxml
│                 iterparse, HTTPS-only bounded session, path safety
├── providers/    per-agency adapters behind a narrow `CatalogProvider`
│                 protocol; ECB/ESTAT/IMF share a common sdmx1 flow helper,
│                 WB diverges with a path × decade sweep
├── connectors/   parsimony `@enumerator` surface + `sdmx_fetch` live
│                 observation connector
└── _isolation/   subprocess-spawning boundary for every sdmx1 call
```
Title composition
Each series row's title is built per DSD:
- ECB: uses the `TITLE` / `TITLE_COMPL` natural-language attributes fetched via the portal side-channel. Titles like `"All euro area yield curve - 10-year spot rate"`. Short, semantic, directly embedder-friendly.
- ESTAT / IMF_DATA / WB_WDI: no natural-language attributes exposed; falls back to `compose_series_title()`, which concatenates `"DIM: label - DIM: label - …"` across every dimension in DSD order. Longer (80-150 tokens) but still searchable.
The codelist-composed form is used only as a fallback when TITLE_COMPL is absent — duplicating it onto natural-language titles inflates embedding cost quadratically (BERT attention is O(N²)) without adding signal. The raw SDMX series key is always available in the code column, so keyword-exact queries are unaffected.
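
The fallback format can be sketched as follows. This is not the package's `compose_series_title()`: the function name, its signature, the use of the dimension ID as the `DIM` part, and the fallback to the raw code when a label is missing are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def compose_title_sketch(
    dimensions: List[Tuple[str, str]],
    codelists: Dict[str, Dict[str, str]],
) -> str:
    """Join 'DIM: label' pairs with ' - ', keeping the given (DSD) order.

    `dimensions` is a list of (dimension_id, code) pairs. A code without a
    codelist label is shown as-is (an assumption of this sketch).
    """
    parts = []
    for dim_id, code in dimensions:
        label = codelists.get(dim_id, {}).get(code, code)
        parts.append(f"{dim_id}: {label}")
    return " - ".join(parts)


# Hypothetical codelists and series key, for illustration only.
codelists = {"FREQ": {"M": "Monthly"}, "REF_AREA": {"ES": "Spain"}}
title = compose_title_sketch([("FREQ", "M"), ("REF_AREA", "ES"), ("UNIT", "PC")], codelists)
print(title)
```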
Why subprocess isolation
sdmx1 caches parsed structure messages (DSDs, codelists, dataflows) at module scope with no public invalidation hook. A long-lived Python process that imports it accumulates cache monotonically until OOM. Process death is the only working way to flush that cache.
Every sdmx1-touching call (list_datasets for listings, fetch_series for per-dataset sweeps) runs inside a freshly spawned process that is discarded after the call — never pooled. A ProcessPoolExecutor would retain sdmx1 in each worker across tasks and defeat the invariant.
The two entry points in _isolation handle payload size differently:
- `list_datasets` returns up to ~8 k dataflow tuples through an `mp.Queue` that the parent drains before `proc.join()`: the feeder thread blocks on the OS pipe buffer once pickled bytes exceed ~64 KB, so join-before-read deadlocks. Regression-guarded by `test_listing.py::test_large_payload_does_not_deadlock`.
- `fetch_series` writes the series parquet to a caller-supplied tmpdir inside the child and returns only a small `DatasetOutcome` envelope. The parent reads the parquet back and the tmpdir is discarded. Disk is the transport.
Under load (ESTAT with ~8 k dataflows, ECB YC with ~2 k series) the parent process stays sdmx1-free — verified by test_listing.py::test_plugin_surface_import_does_not_pull_sdmx.
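
The spawn-and-discard pattern with disk as the transport can be sketched with the standard library alone. This is a simplified stand-in, not the `_isolation` module: it uses `subprocess` instead of `multiprocessing`, JSON instead of parquet, and a made-up child program and function name. `subprocess.run` drains the child's output before waiting for it to exit, which mirrors the drain-before-join ordering described above.

```python
import json
import subprocess
import sys
import tempfile
from pathlib import Path

# Child program (hypothetical): writes the bulky payload to disk and prints
# only a small envelope on stdout.
CHILD_CODE = """
import json, sys
out_path, n = sys.argv[1], int(sys.argv[2])
rows = [{"code": f"S{i}"} for i in range(n)]  # stand-in for a series sweep
with open(out_path, "w", encoding="utf-8") as fh:
    json.dump(rows, fh)
print(json.dumps({"ok": True, "rows": len(rows)}))
"""


def fetch_in_fresh_process(n_rows: int):
    """Run one call in a new interpreter that exits afterwards; read the payload from disk."""
    with tempfile.TemporaryDirectory() as tmpdir:
        out_path = Path(tmpdir) / "series.json"
        proc = subprocess.run(
            [sys.executable, "-c", CHILD_CODE, str(out_path), str(n_rows)],
            capture_output=True,
            text=True,
            check=True,
            timeout=60,
        )
        envelope = json.loads(proc.stdout)  # small result envelope
        rows = json.loads(out_path.read_text(encoding="utf-8"))  # bulky payload
    return envelope, rows  # the tmpdir is gone by now


envelope, rows = fetch_in_fresh_process(5_000)
print(envelope, len(rows))
```

Because every call starts a new interpreter, anything the child cached at module scope disappears with the process, and the parent never imports the child's dependencies.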
Development
```bash
# Fast tier (306 tests, ~3 s): excludes slow + integration markers
uv run --package parsimony-sdmx pytest packages/sdmx/tests -q

# Subprocess regression tier (2 tests, ~2 s): real mp.Process children
uv run --package parsimony-sdmx pytest packages/sdmx/tests -m slow -v

# Lint + type check
uv run --package parsimony-sdmx ruff check packages/sdmx/
uv run --package parsimony-sdmx mypy packages/sdmx/parsimony_sdmx/
```
Hardening defaults: HTTPS-only bounded HTTP session, hardened lxml.iterparse (no entity resolution, no DTD load, no network), path traversal guards on every on-disk write.
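
Two of these guards are easy to picture with the standard library. The sketch below is illustrative only: `safe_join` and `require_https` are hypothetical names, not the helpers in `parsimony_sdmx/io`, and the real checks may be stricter.

```python
from pathlib import Path
from urllib.parse import urlsplit


def safe_join(root: Path, name: str) -> Path:
    """Resolve `name` under `root` and refuse anything that escapes it."""
    root = root.resolve()
    target = (root / name).resolve()
    if not target.is_relative_to(root):  # requires Python 3.9+
        raise ValueError(f"path escapes root: {name!r}")
    return target


def require_https(url: str) -> str:
    """Return the URL unchanged if its scheme is https, otherwise raise."""
    if urlsplit(url).scheme != "https":
        raise ValueError(f"refusing non-HTTPS URL: {url!r}")
    return url


root = Path("/tmp/sdmx-sketch")
inside = safe_join(root, "sdmx_series_ecb_yc/entries.parquet")
print(inside)

for bad in ("../outside", "/etc/passwd"):
    try:
        safe_join(root, bad)
    except ValueError as exc:
        print("rejected:", exc)
```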
Provider
- SDMX standard: https://sdmx.org
- ECB SDMX: https://data.ecb.europa.eu/help/api/overview
- Eurostat SDMX: https://wikis.ec.europa.eu/display/EUROSTATHELP/API+SDMX+2.1
- IMF SDMX: https://datahelp.imf.org/knowledgebase/articles/667681-using-sdmx-to-query-imf-data
- World Bank SDMX: https://datahelpdesk.worldbank.org/knowledgebase/articles/889398-developer-information-overview
License
See LICENSE.