sdmxflow
Download SDMX datasets into a reproducible, append-only on-disk layout for data warehouse and periodic refresh workflows.
sdmxflow is designed for the common “ELT input dataset” pattern:
- pull a dataset from an SDMX provider,
- store it locally in a stable folder structure,
- refresh it periodically,
- keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
- export the reference data (codelists) required to interpret coded columns.
Status: early but functional. Current provider support is Eurostat (source_id="ESTAT").
Why sdmxflow
Many SDMX ingestion solutions focus on “get me data” (often very flexibly), but stop short of the metadata needed for downstream analytics and governance:
- dataset versioning (what changed upstream and when),
- artifact locations and repeatability,
- codelists/reference data exported alongside the facts.
There are also community solutions (for example a dlt extension shared by Martin Salo) that are great for flexible extraction, and this project started from that direction. sdmxflow builds on those ideas but focuses more strongly on a warehouse-friendly artifact layout and richer metadata + codelist outputs.
sdmxflow aims to be a pragmatic building block for warehouse pipelines: straightforward API, deterministic output layout, and predictable refresh behavior.
Where we come from:
- Early prototyping and the “bring SDMX into warehouse refresh workflows” motivation was influenced by Martin Salo’s SDMX dlt extension gist.
- The heavy lifting for SDMX protocol/model parsing is powered by the sdmx1 Python package.
Features
- Append-only refresh: only downloads and appends when upstream changed.
- Warehouse-friendly layout:
  - dataset.csv (facts)
  - metadata.json (versions + fetch info)
  - codelists/ (reference tables)
- Fast upstream change detection (Eurostat): uses SDMX annotations for last-updated.
- User-friendly logging at INFO and detailed diagnostics at DEBUG.
- Optional per-run log file capture via save_logs=True.
Non-goals (for now):
- full multi-provider support,
- a full-blown orchestration framework,
- a “do everything” SDMX exploration UI.
Installation
From PyPI (recommended)
Once published:
pip install sdmxflow
From source (this repository)
This project uses uv for development.
git clone https://github.com/knifflig/sdmxflow
cd sdmxflow
uv sync --group dev
Quickstart
The primary entrypoint is SdmxDataset.
from pathlib import Path
from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/lfsa_egai2d"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
    # Optional:
    # agency_id="ESTAT",
    # key=...,        # provider-specific key restriction
    # params={...},   # provider-specific passthrough params
    save_logs=True,   # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()
print("Appended new version:", result.appended)
print("Dataset CSV:", result.dataset_csv)
print("Metadata JSON:", result.metadata_json)
print("Codelists dir:", result.codelists_dir)
What fetch() does
fetch() is designed for scheduled refresh jobs:
- Fetch the upstream “last updated” timestamp.
- Compare it with the latest locally recorded timestamp in metadata.json.
- If unchanged: do nothing to the dataset (but still ensure metadata + codelists are in place).
- If changed: download and append a new slice to dataset.csv, then update metadata + codelists.
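The decision step above boils down to a timestamp comparison. A minimal sketch (an illustration of the idea, not sdmxflow's actual implementation): since the timestamps are UTC ISO-8601 strings, lexicographic comparison orders them chronologically.

```python
from typing import Optional

def needs_refresh(upstream_ts: str, recorded_ts: Optional[str]) -> bool:
    """Decide whether a new slice must be downloaded.

    Both timestamps are UTC ISO-8601 strings (e.g. "2024-05-01T10:00:00Z"),
    so plain string comparison orders them chronologically.
    """
    if recorded_ts is None:  # no local version yet: always download
        return True
    return upstream_ts > recorded_ts

# First run: nothing recorded locally, so download.
print(needs_refresh("2024-05-01T10:00:00Z", None))                    # True
# Upstream unchanged: skip the download.
print(needs_refresh("2024-05-01T10:00:00Z", "2024-05-01T10:00:00Z"))  # False
```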
Output layout
sdmxflow writes a stable folder structure under your chosen out_dir:
<out_dir>/
  dataset.csv
  metadata.json
  codelists/
    ... generated reference CSVs ...
  logs/                  # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
dataset.csv
- Append-only across versions.
- Includes a leading last_updated column (UTC ISO-8601) indicating which upstream version a row belongs to.
metadata.json
Stores dataset identity and version history, such as:
- upstream timestamps,
- fetch times,
- HTTP URL/status/headers (when available),
- number of rows appended for each version.
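Since metadata.json is plain JSON, the version history can be inspected with the standard library alone. The field names in this sketch (versions, rows_appended) are illustrative assumptions, not the documented schema; check your own metadata.json for the actual keys.

```python
import json
from pathlib import Path

# Hypothetical metadata.json content; real field names may differ.
metadata_path = Path("./out/lfsa_egai2d/metadata.json")
metadata_path.parent.mkdir(parents=True, exist_ok=True)
metadata_path.write_text(json.dumps({
    "dataset_id": "lfsa_egai2d",
    "versions": [
        {"last_updated": "2024-05-01T10:00:00Z", "rows_appended": 1200},
        {"last_updated": "2024-08-01T10:00:00Z", "rows_appended": 80},
    ],
}))

# Read the history back and list one line per upstream version.
meta = json.loads(metadata_path.read_text())
for version in meta["versions"]:
    print(version["last_updated"], version["rows_appended"])
```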
codelists/
Contains exported codelists needed to interpret coded dataset columns.
Logging
sdmxflow is built to be readable in production logs.
- At INFO level, fetch() emits exactly three user-facing messages:
  - intention (what, where),
  - version decision (download vs. already up to date),
  - completion summary (artifact paths).
- Enable DEBUG for rich diagnostics.
- If you pass save_logs=True, sdmxflow writes a per-run debug log file under <out_dir>/logs/.
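Standard library logging configuration is enough to surface these messages in your own process. The logger name "sdmxflow" below is an assumption based on the usual convention of packages logging under their own name.

```python
import logging

# Show INFO-level messages on stderr with a simple format.
logging.basicConfig(level=logging.INFO,
                    format="%(levelname)s %(name)s: %(message)s")

# For rich diagnostics, lower just the (assumed) sdmxflow logger to DEBUG
# instead of making every library in the process verbose.
logging.getLogger("sdmxflow").setLevel(logging.DEBUG)

print(logging.getLogger("sdmxflow").getEffectiveLevel() == logging.DEBUG)  # True
```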
Integrating into warehouse workflows
Typical patterns:
- Airflow / Dagster / Prefect task: call fetch() on a schedule; downstream tasks ingest dataset.csv into your warehouse.
- dbt sources: load dataset.csv into a staging table and build models on top.
- Lakehouse: treat <out_dir> as a partitioned artifact folder; metadata.json provides lineage.
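As a minimal illustration of the staging-table pattern, dataset.csv can be bulk-loaded with the standard library alone. The column names other than last_updated are hypothetical; a real dataset.csv carries the dataset's own coded columns.

```python
import csv
import sqlite3
from pathlib import Path

# A tiny stand-in for a dataset.csv produced by sdmxflow.
csv_path = Path("dataset.csv")
csv_path.write_text(
    "last_updated,geo,value\n"
    "2024-05-01T10:00:00Z,DE,1.5\n"
    "2024-05-01T10:00:00Z,FR,2.1\n"
)

# Stage the rows into a SQL table (in-memory here; any warehouse works).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stg_dataset (last_updated TEXT, geo TEXT, value REAL)")
with csv_path.open(newline="") as f:
    rows = [(r["last_updated"], r["geo"], float(r["value"]))
            for r in csv.DictReader(f)]
conn.executemany("INSERT INTO stg_dataset VALUES (?, ?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM stg_dataset").fetchone()[0]
print(count)  # 2
```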
Because the dataset is append-only, you can:
- reprocess from scratch (read the full file), or
- incrementally process “new versions” by filtering on last_updated.
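Incremental processing then reduces to keeping rows whose last_updated is newer than the watermark your pipeline last loaded. A stdlib sketch (column names other than last_updated are hypothetical):

```python
import csv
import io

# Stand-in for dataset.csv: two upstream versions appended over time.
data = io.StringIO(
    "last_updated,geo,value\n"
    "2024-05-01T10:00:00Z,DE,1.5\n"
    "2024-08-01T10:00:00Z,DE,1.6\n"
)

watermark = "2024-05-01T10:00:00Z"  # last version already loaded downstream

# UTC ISO-8601 strings compare chronologically, so > works directly.
new_rows = [r for r in csv.DictReader(data) if r["last_updated"] > watermark]
print(len(new_rows))             # 1
print(new_rows[0]["value"])      # 1.6
```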
Provider support and limitations
- Supported:
  - Eurostat (source_id="ESTAT")
Planned/possible future work (not guaranteed):
- additional SDMX sources,
- richer metadata capture (more SDMX structure fields),
- export formats beyond CSV/JSON.
Development
Install dev dependencies:
uv sync --group dev
Run tests:
uv run pytest
Run lint/format:
uv run ruff check .
uv run ruff format .
Contributing
Contributions are welcome.
Good first contributions:
- improvements to metadata extraction,
- better codelist export coverage,
- adding new provider support behind a clean interface,
- documentation and examples.
Please open an issue before large changes.
Contact
- Henry Zehe: https://github.com/knifflig
License
Licensed under the Apache License, Version 2.0. See LICENSE.md.
Credits and acknowledgements
- Martin Salo (https://github.com/salomartin) and the SDMX dlt extension gist that helped inform early direction and requirements: https://gist.github.com/salomartin/d4ee7170f678b0b44554af46fe8efb3f
- sdmx1 (https://github.com/khaeru/sdmx/) and its maintainers/contributors: sdmxflow relies on sdmx1 for core SDMX handling.
- Zensical (https://zensical.org/, https://github.com/zensical/zensical) is used to build and publish this documentation site. Zensical is licensed under the MIT License.