# sdmxflow

Download SDMX datasets into a reproducible, append-only on-disk layout for data warehouse refresh workflows.

sdmxflow turns SDMX datasets (Eurostat today) into deterministic, append-only warehouse refresh artifacts: a facts CSV, a versioned metadata trail, and exported codelists.
- **Problem:** SDMX is easy to query, but harder to operationalize for warehouses (repeatable artifacts, refresh semantics, reference data, governance).
- **Solution:** sdmxflow fetches a dataset and writes a stable on-disk layout that you can load into your warehouse on a schedule.
- **Proof:** Eurostat is supported now (`source_id="ESTAT"`), with append-only refresh and last-updated change detection.
> [!NOTE]
> Status: early but functional.
> Supported providers: Eurostat (`source_id="ESTAT"`).
> Docs: https://knifflig.github.io/sdmxflow/
sdmxflow is designed for the common “ELT input dataset” pattern:
- pull a dataset from an SDMX provider,
- store it locally in a stable folder structure,
- refresh it periodically,
- keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
- export the reference data (codelists) required to interpret coded columns.
## Quickstart

The primary entrypoint is `SdmxDataset`.

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
    out_dir=Path("./out/lfsa_egai2d"),
    source_id="ESTAT",
    dataset_id="lfsa_egai2d",
    # Optional:
    # agency_id="ESTAT",
    # key=...,        # provider-specific key restriction
    # params={...},   # provider-specific passthrough params
    save_logs=True,   # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()

# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir
```
## What you get on disk

```
<out_dir>/
  dataset.csv     # append-only facts across versions
  metadata.json   # version history + fetch metadata
  codelists/      # exported reference tables
  logs/           # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
```
## Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

```python
from pathlib import Path

from sdmxflow.dataset import SdmxDataset

def refresh_eurostat_lfsa_egai2d() -> None:
    ds = SdmxDataset(
        out_dir=Path("/data/sdmx/lfsa_egai2d"),
        source_id="ESTAT",
        dataset_id="lfsa_egai2d",
    )
    ds.fetch()
```

Then:

- load `<out_dir>/dataset.csv` into a staging table,
- define it as a dbt source,
- build models on top; select the newest version via the `last_updated` column.
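The "load into staging, then pick the newest version" step can be sketched as follows. This is a minimal, self-contained illustration: SQLite stands in for the warehouse, and the `geo`/`value` columns and sample rows are made up for the example (a real extract carries the dataset's own dimension and value columns).

```python
import csv
import sqlite3
import tempfile
from pathlib import Path

# Stand-in for <out_dir>/dataset.csv with two upstream versions appended.
out_dir = Path(tempfile.mkdtemp())
csv_path = out_dir / "dataset.csv"
csv_path.write_text(
    "last_updated,geo,value\n"
    "2024-01-01T00:00:00Z,DE,1.0\n"
    "2024-06-01T00:00:00Z,DE,1.1\n"
)

# Load the full append-only file into a staging table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE staging (last_updated TEXT, geo TEXT, value REAL)")
with csv_path.open(newline="") as fh:
    conn.executemany(
        "INSERT INTO staging VALUES (:last_updated, :geo, :value)",
        csv.DictReader(fh),
    )

# Downstream model: keep only rows from the newest upstream version.
# UTC ISO-8601 timestamps compare correctly as plain strings.
latest = conn.execute(
    "SELECT geo, value FROM staging "
    "WHERE last_updated = (SELECT MAX(last_updated) FROM staging)"
).fetchall()
print(latest)  # [('DE', 1.1)]
```

In dbt, the same filter would typically live in a staging model on top of the source table.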
## How refresh works

```mermaid
graph TD
    A["Scheduled job<br/>cron / Airflow / Prefect"] --> B["Fetch upstream last-updated<br/>SDMX annotations (Eurostat)"]
    B --> C{"Local metadata.json<br/>exists?"}
    C -- Yes --> D{"Upstream<br/>changed?"}
    C -- No --> E0
    D -- No --> G["No new version<br/>Keep dataset.csv<br/>Ensure metadata + codelists"]
    D -- Yes --> E1
    subgraph DL[" "]
        direction LR
        E1["Download new slice"]
        E0["Download initial slice"]
    end
    style DL fill:transparent,stroke:transparent
    E0 --> F["Append rows to dataset.csv<br/>append-only, adds last_updated column"]
    E1 --> F
    F --> H["Update metadata history<br/>Export codelists"]
    G --> I["Warehouse ingestion step<br/>dbt / COPY / load job"]
    H --> I
```
`fetch()` is designed for scheduled refresh jobs:

- Fetch the upstream "last updated" timestamp.
- Compare it with the latest locally recorded timestamp in `metadata.json`.
- If unchanged: leave the dataset untouched (but still ensure metadata + codelists exist).
- If changed: download and append a new slice to `dataset.csv`, then update metadata + codelists.
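The decision step above can be sketched roughly as follows. This is not sdmxflow's actual implementation; the `versions`/`last_updated` shape of `metadata.json` is an assumption made for illustration.

```python
import json
import tempfile
from pathlib import Path

def needs_download(metadata_path: Path, upstream_last_updated: str) -> bool:
    """Download only when upstream changed (or on the very first run).

    Assumes metadata.json stores a version history where each entry carries
    a 'last_updated' timestamp; the real schema may differ.
    """
    if not metadata_path.exists():
        return True  # initial slice
    history = json.loads(metadata_path.read_text()).get("versions", [])
    if not history:
        return True
    local_latest = max(v["last_updated"] for v in history)
    # UTC ISO-8601 timestamps compare correctly as plain strings.
    return upstream_last_updated > local_latest

# Demo: first run (no metadata.json yet) vs. unchanged upstream.
tmp = Path(tempfile.mkdtemp())
first_run = needs_download(tmp / "metadata.json", "2024-06-01T00:00:00Z")
(tmp / "metadata.json").write_text(
    json.dumps({"versions": [{"last_updated": "2024-06-01T00:00:00Z"}]})
)
unchanged = needs_download(tmp / "metadata.json", "2024-06-01T00:00:00Z")
print(first_run, unchanged)  # True False
```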
## Use cases
- Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging.
- Keep reference codelists versioned alongside fact extracts for governance.
- Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.
## Why sdmxflow

sdmxflow is intentionally opinionated about operationalizing SDMX datasets for warehouse refresh jobs.

- Compared to SDMX client libraries: they fetch data; sdmxflow produces deterministic refresh artifacts plus a metadata trail and codelists.
- Compared to flexible extractors: sdmxflow focuses on a stable layout and predictable refresh semantics.

See "Credits and acknowledgements" below for project influences and dependencies.
## Features

- Append-only refresh: downloads and appends only when upstream changed.
- Warehouse-friendly layout: `dataset.csv` (facts), `metadata.json` (versions + fetch info), `codelists/` (reference tables).
- Fast upstream change detection (Eurostat): uses SDMX annotations for last-updated.
- User-friendly logging at `INFO` and detailed diagnostics at `DEBUG`.
- Optional per-run log file capture via `save_logs=True`.
Non-goals (for now):
- full multi-provider support,
- a full-blown orchestration framework,
- a “do everything” SDMX exploration UI.
## Installation

### From PyPI (recommended)

```shell
pip install sdmxflow
```

If you run sdmxflow inside an orchestration environment that pins `packaging<25.1` (for example, Prefect 3.6.x), install `sdmxflow>=0.1.1` so that dependency resolution can avoid `sdmx1==2.25.1` (which declares `packaging>=26`).
### From source (this repository)

This project uses uv for development.

```shell
git clone https://github.com/knifflig/sdmxflow
cd sdmxflow
uv sync --group dev
```
## Output layout

sdmxflow writes a stable folder structure under your chosen `out_dir`:

```
<out_dir>/
  dataset.csv
  metadata.json
  codelists/
    ... generated reference CSVs ...
  logs/    # only when save_logs=True
    <agency>__<dataset>__<timestamp>.log
```
### dataset.csv

- Append-only across versions.
- Includes a leading `last_updated` column (UTC ISO-8601) indicating which upstream version a row belongs to.
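Because UTC ISO-8601 timestamps sort lexically, selecting the newest version needs no date parsing. A small illustration (the `geo`/`value` columns and rows are invented for the example):

```python
import csv
import io

# Illustrative append-only content: two upstream versions of the same series.
sample = io.StringIO(
    "last_updated,geo,value\n"
    "2024-01-01T00:00:00Z,DE,1.0\n"
    "2024-06-01T00:00:00Z,DE,1.1\n"
)
rows = list(csv.DictReader(sample))

# Plain string max() is enough: ISO-8601 UTC timestamps sort lexically.
newest = max(r["last_updated"] for r in rows)
latest_rows = [r for r in rows if r["last_updated"] == newest]
print(latest_rows)
```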
### metadata.json
Stores dataset identity and version history, such as:
- upstream timestamps,
- fetch times,
- HTTP URL/status/headers (when available),
- number of rows appended for each version.
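For a feel of what this trail enables, here is a hypothetical `metadata.json` being summarized. The field names (`versions`, `rows_appended`, and so on) are assumptions for illustration only; the real schema may differ.

```python
import json

# Hypothetical metadata.json contents -- illustrative schema, not the real one.
metadata = json.loads("""
{
  "source_id": "ESTAT",
  "dataset_id": "lfsa_egai2d",
  "versions": [
    {"last_updated": "2024-01-01T00:00:00Z",
     "fetched_at": "2024-01-02T03:00:00Z",
     "http_status": 200,
     "rows_appended": 1200},
    {"last_updated": "2024-06-01T00:00:00Z",
     "fetched_at": "2024-06-02T03:00:00Z",
     "http_status": 200,
     "rows_appended": 1180}
  ]
}
""")

# A refresh job can audit row counts and find the newest version from here.
total_rows = sum(v["rows_appended"] for v in metadata["versions"])
latest = max(v["last_updated"] for v in metadata["versions"])
print(total_rows, latest)  # 2380 2024-06-01T00:00:00Z
```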
### codelists/
Contains exported codelists needed to interpret coded dataset columns.
## Logging

sdmxflow is built to be readable in production logs.

- At `INFO` level, `fetch()` emits exactly three user-facing messages:
  - intention (what, where),
  - version decision (download vs. already up to date),
  - completion summary (artifact paths).
- Enable `DEBUG` for rich diagnostics.
- If you pass `save_logs=True`, sdmxflow writes a per-run debug log file under `<out_dir>/logs/`.
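A typical setup using the standard `logging` module might look like this. The logger name `"sdmxflow"` is an assumption based on the package name; check the package's actual logger hierarchy before relying on it.

```python
import logging

# Global default: INFO keeps production logs readable.
logging.basicConfig(format="%(levelname)s %(name)s: %(message)s", level=logging.INFO)

# Opt into rich diagnostics for sdmxflow only (logger name assumed).
logging.getLogger("sdmxflow").setLevel(logging.DEBUG)

level_name = logging.getLevelName(
    logging.getLogger("sdmxflow").getEffectiveLevel()
)
print(level_name)  # DEBUG
```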
## Provider support and limitations

Supported:

- Eurostat (`source_id="ESTAT"`)

Planned/possible future work (not guaranteed):
- additional SDMX sources,
- richer metadata capture (more SDMX structure fields),
- export formats beyond CSV/JSON.
## FAQ

**Does sdmxflow load into my warehouse directly?**

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists). You load them using your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

**Does it support providers besides Eurostat?**

Not yet. Eurostat (`source_id="ESTAT"`) is the current supported provider.

**Does it deduplicate data?**

It is append-only across upstream versions. Each appended slice is marked with a `last_updated` value so downstream jobs can select the newest version (or reprocess full history).

**How does it detect upstream changes?**

For Eurostat, it uses SDMX annotations to obtain a last-updated timestamp and compares it to the latest locally recorded timestamp.
## Development

Install dev dependencies:

```shell
uv sync --group dev
```

Run tests:

```shell
uv run pytest
```

Run lint/format:

```shell
uv run ruff check .
uv run ruff format .
```
## Contributing
Contributions are welcome.
Good first contributions:
- improvements to metadata extraction,
- better codelist export coverage,
- adding new provider support behind a clean interface,
- documentation and examples.
Please open an issue before large changes.
## Contact

- Henry Zehe: https://github.com/knifflig

## License

Licensed under the Apache License, Version 2.0. See LICENSE.md.

## Credits and acknowledgements

- Martin Salo (https://github.com/salomartin) and the SDMX `dlt` extension gist that helped inform early direction and requirements: https://gist.github.com/salomartin/d4ee7170f678b0b44554af46fe8efb3f
- `sdmx1` (https://github.com/khaeru/sdmx/) and its maintainers/contributors: sdmxflow relies on `sdmx1` for core SDMX handling.
- Zensical (https://zensical.org/, https://github.com/zensical/zensical) is used to build and publish this documentation site. Zensical is licensed under the MIT License.