sdmxflow

Download SDMX datasets into a reproducible, append-only on-disk layout for data warehouse and periodic refresh workflows.

sdmxflow is designed for the common “ELT input dataset” pattern:

  • pull a dataset from an SDMX provider,
  • store it locally in a stable folder structure,
  • refresh it periodically,
  • keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
  • export the reference data (codelists) required to interpret coded columns.

Status: early but functional. Current provider support is Eurostat (source_id="ESTAT").


Why sdmxflow

Many SDMX ingestion solutions focus on “get me data” (often very flexibly), but stop short of the metadata needed for downstream analytics and governance:

  • dataset versioning (what changed upstream and when),
  • artifact locations and repeatability,
  • codelists/reference data exported alongside the facts.

There are also community solutions (for example a dlt extension shared by Martin Salo) that are great for flexible extraction, and this project started from that direction. sdmxflow builds on those ideas but focuses more strongly on a warehouse-friendly artifact layout and richer metadata + codelist outputs.

sdmxflow aims to be a pragmatic building block for warehouse pipelines: straightforward API, deterministic output layout, and predictable refresh behavior.

Where we come from:

  • Early prototyping and the “bring SDMX into warehouse refresh workflows” motivation were influenced by Martin Salo’s SDMX dlt extension gist.
  • The heavy lifting for SDMX protocol/model parsing is powered by the sdmx1 Python package.

Features

  • Append-only refresh: downloads and appends only when the upstream dataset has changed.
  • Warehouse-friendly layout:
    • dataset.csv (facts)
    • metadata.json (versions + fetch info)
    • codelists/ (reference tables)
  • Fast upstream change detection (Eurostat): reads the last-updated timestamp from SDMX annotations.
  • User-friendly logging at INFO and detailed diagnostics at DEBUG.
  • Optional per-run log file capture via save_logs=True.

Non-goals (for now):

  • full multi-provider support,
  • a full-blown orchestration framework,
  • a “do everything” SDMX exploration UI.

Installation

From PyPI (recommended)

Install the latest release:

pip install sdmxflow

From source (this repository)

This project uses uv for development.

git clone https://github.com/knifflig/sdmxflow
cd sdmxflow
uv sync --group dev

Quickstart

The primary entrypoint is SdmxDataset.

from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/lfsa_egai2d"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
    # Optional:
    # agency_id="ESTAT",
    # key=...,        # provider-specific key restriction
    # params={...},   # provider-specific passthrough params
    save_logs=True,  # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()
print("Appended new version:", result.appended)
print("Dataset CSV:", result.dataset_csv)
print("Metadata JSON:", result.metadata_json)
print("Codelists dir:", result.codelists_dir)

What fetch() does

fetch() is designed for scheduled refresh jobs:

  1. Fetch upstream “last updated” timestamp.
  2. Compare with the latest locally recorded timestamp in metadata.json.
  3. If unchanged: leave dataset.csv untouched, but still ensure that metadata and codelists are in place.
  4. If changed: download and append a new slice to dataset.csv, then update metadata + codelists.
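
The decision in steps 2 to 4 can be sketched in a few lines. This is an illustration only, not sdmxflow's actual implementation: parse_utc and needs_download are hypothetical helpers, and the timestamps are assumed to be ISO-8601 strings.

```python
from __future__ import annotations

from datetime import datetime, timezone


def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp and normalise it to UTC."""
    parsed = datetime.fromisoformat(ts.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def needs_download(upstream: str, local_latest: str | None) -> bool:
    """Return True when nothing is recorded locally or upstream differs."""
    if local_latest is None:
        return True
    return parse_utc(upstream) != parse_utc(local_latest)


if __name__ == "__main__":
    # Same instant written with two different offsets: no download needed.
    print(needs_download("2024-05-01T12:00:00+02:00", "2024-05-01T10:00:00Z"))
```

Comparing parsed datetimes rather than raw strings keeps the check stable if the same instant is ever written with a different UTC offset.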

Output layout

sdmxflow writes a stable folder structure under your chosen out_dir:

<out_dir>/
	dataset.csv
	metadata.json
	codelists/
		... generated reference CSVs ...
	logs/                     # only when save_logs=True
		<agency>__<dataset>__<timestamp>.log

dataset.csv

  • Append-only across versions.
  • Includes a leading last_updated column (UTC ISO-8601) indicating which upstream version a row belongs to.
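
To make the append-only layout concrete, here is a minimal sketch of how a slice tagged with a leading last_updated column could be appended to a CSV file. It is not sdmxflow's internal code: append_slice is a hypothetical helper, and the column names used when calling it are up to the caller.

```python
from __future__ import annotations

import csv
from pathlib import Path


def append_slice(
    dataset_csv: Path,
    last_updated: str,
    header: list[str],
    rows: list[list[str]],
) -> int:
    """Append rows tagged with last_updated and return how many were written."""
    # Write the header only when the file is new or empty.
    is_new = not dataset_csv.exists() or dataset_csv.stat().st_size == 0
    with dataset_csv.open("a", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        if is_new:
            writer.writerow(["last_updated", *header])
        for row in rows:
            # The version timestamp always goes in the first column.
            writer.writerow([last_updated, *row])
    return len(rows)
```

Because earlier rows are never rewritten, each upstream version remains readable as its own slice of the file.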

metadata.json

Stores dataset identity and version history, such as:

  • upstream timestamps,
  • fetch times,
  • HTTP URL/status/headers (when available),
  • number of rows appended for each version.
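
A downstream job might read the version history roughly as follows. The JSON document and every key in it (versions, last_updated, rows_appended) are assumptions made up for this sketch. The real metadata.json schema may differ, so inspect a generated file before relying on any field name.

```python
from __future__ import annotations

import json

# Hypothetical document: every key and value below is an assumption for
# illustration, not the schema that sdmxflow actually writes.
EXAMPLE_METADATA = """
{
  "dataset_id": "lfsa_egai2d",
  "versions": [
    {"last_updated": "2024-05-01T10:00:00+00:00", "rows_appended": 1200},
    {"last_updated": "2024-06-01T10:00:00+00:00", "rows_appended": 1250}
  ]
}
"""


def latest_version(metadata: dict) -> dict | None:
    """Return the version entry with the greatest last_updated, if any."""
    versions = metadata.get("versions", [])
    # ISO-8601 strings with the same UTC offset sort chronologically.
    return max(versions, key=lambda v: v["last_updated"], default=None)


def total_rows(metadata: dict) -> int:
    """Sum the rows appended across all recorded versions."""
    return sum(v["rows_appended"] for v in metadata.get("versions", []))


if __name__ == "__main__":
    metadata = json.loads(EXAMPLE_METADATA)
    print(latest_version(metadata), total_rows(metadata))
```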

codelists/

Contains exported codelists needed to interpret coded dataset columns.
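
As an illustration of how a codelist helps interpret a coded column, the sketch below maps codes to labels using only the standard library. The codelist columns (code, label), the geo column, and the sample rows are all hypothetical. Check the generated CSVs for the actual file and column names.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, Iterator


def load_codelist(lines: Iterable[str]) -> dict[str, str]:
    """Map each code to its label (the column names are assumed)."""
    return {row["code"]: row["label"] for row in csv.DictReader(lines)}


def add_labels(
    rows: Iterable[dict[str, str]],
    column: str,
    codelist: dict[str, str],
) -> Iterator[dict[str, str]]:
    """Yield each row with an extra '<column>_label' field."""
    for row in rows:
        # Unknown codes get an empty label instead of raising.
        yield {**row, f"{column}_label": codelist.get(row[column], "")}


# Made-up sample data standing in for a codelist CSV and two fact rows.
geo_codelist = load_codelist(io.StringIO("code,label\nDE,Germany\nFR,France\n"))
facts = [{"geo": "DE", "value": "1.0"}, {"geo": "XX", "value": "2.0"}]
decoded = list(add_labels(facts, "geo", geo_codelist))
```

In a warehouse the same lookup would typically be a join between the fact table and the loaded codelist table.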


Logging

sdmxflow's log output is designed to stay readable in production.

  • At INFO level, fetch() emits exactly three user-facing messages:
    1. intention (what, where),
    2. version decision (download vs. already up to date),
    3. completion summary (artifact paths).
  • Enable DEBUG for rich diagnostics.
  • If you pass save_logs=True, sdmxflow writes a per-run debug log file under <out_dir>/logs/.
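
Switching between the two levels can be done with the standard logging module. The sketch below assumes that sdmxflow's loggers live under the "sdmxflow" namespace, which is the usual convention for Python packages but is not confirmed here. configure_logging is a hypothetical helper.

```python
import logging


def configure_logging(debug: bool = False) -> None:
    """Send log records to the console and pick the level for sdmxflow."""
    logging.basicConfig(
        level=logging.INFO,
        format="%(asctime)s %(levelname)s %(name)s: %(message)s",
    )
    # Assumption: the package logs under the "sdmxflow" logger namespace.
    level = logging.DEBUG if debug else logging.INFO
    logging.getLogger("sdmxflow").setLevel(level)


if __name__ == "__main__":
    configure_logging(debug=True)
```

Setting the level on the package logger rather than the root logger keeps other libraries at INFO while sdmxflow emits its detailed diagnostics.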

Integrating into warehouse workflows

Typical patterns:

  • Airflow / Dagster / Prefect task: call fetch() on a schedule; downstream tasks ingest dataset.csv into your warehouse.
  • dbt sources: load dataset.csv into a staging table and build models on top.
  • Lakehouse: treat <out_dir> as a partitioned artifact folder; metadata.json provides lineage.

Because the dataset is append-only, you can:

  • reprocess from scratch (read the full file), or
  • incrementally process “new versions” by filtering on last_updated.
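
Incremental processing can be sketched as a filter against a stored watermark. rows_newer_than is a hypothetical helper, and the plain string comparison assumes that every last_updated value uses the same ISO-8601 format and UTC offset, so that lexical order matches chronological order.

```python
from __future__ import annotations

import csv
import io
from typing import Iterable, Iterator


def rows_newer_than(
    lines: Iterable[str],
    watermark: str | None,
) -> Iterator[dict[str, str]]:
    """Yield rows whose last_updated is strictly greater than the watermark."""
    for row in csv.DictReader(lines):
        # With no watermark yet, every row counts as new.
        if watermark is None or row["last_updated"] > watermark:
            yield row


# Made-up sample standing in for dataset.csv; geo and value are hypothetical.
SAMPLE = (
    "last_updated,geo,value\n"
    "2024-05-01T10:00:00+00:00,DE,1.0\n"
    "2024-06-01T10:00:00+00:00,DE,1.1\n"
)

if __name__ == "__main__":
    for row in rows_newer_than(io.StringIO(SAMPLE), "2024-05-01T10:00:00+00:00"):
        print(row)
```

After a successful load, the consumer would store the greatest last_updated it has seen and pass it as the watermark on the next run.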

Provider support and limitations

  • Supported:
    • Eurostat (source_id="ESTAT")

Planned/possible future work (not guaranteed):

  • additional SDMX sources,
  • richer metadata capture (more SDMX structure fields),
  • export formats beyond CSV/JSON.

Development

Install dev dependencies:

uv sync --group dev

Run tests:

uv run pytest

Run lint/format:

uv run ruff check .
uv run ruff format .

Contributing

Contributions are welcome.

Good first contributions:

  • improvements to metadata extraction,
  • better codelist export coverage,
  • adding new provider support behind a clean interface,
  • documentation and examples.

Please open an issue before large changes.


License

Licensed under the Apache License, Version 2.0. See LICENSE.md.

