
sdmxflow


sdmxflow turns SDMX datasets (Eurostat today) into deterministic, append-only warehouse refresh artifacts: facts CSV + versioned metadata trail + exported codelists.

Problem: SDMX is easy to query, but harder to operationalize for warehouses (repeatable artifacts, refresh semantics, reference data, governance).

Solution: sdmxflow fetches a dataset and writes a stable on-disk layout that you can load into your warehouse on a schedule.

Proof: Eurostat is supported now (source_id="ESTAT"), with append-only refresh and last-updated change detection.

[!NOTE]
Status: early but functional.
Supported providers: Eurostat (source_id="ESTAT").
Docs: https://knifflig.github.io/sdmxflow/

sdmxflow is designed for the common “ELT input dataset” pattern:

  • pull a dataset from an SDMX provider,
  • store it locally in a stable folder structure,
  • refresh it periodically,
  • keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
  • export the reference data (codelists) required to interpret coded columns.

Quickstart

The primary entrypoint is SdmxDataset.

from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
	out_dir=Path("./out/lfsa_egai2d"),
	source_id="ESTAT",
	dataset_id="lfsa_egai2d",
	# Optional:
	# agency_id="ESTAT",
	# key=...,        # provider-specific key restriction
	# params={...},   # provider-specific passthrough params
	save_logs=True,  # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()

# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir

What you get on disk

<out_dir>/
	dataset.csv          # append-only facts across versions
	metadata.json        # version history + fetch metadata
	codelists/           # exported reference tables
	logs/                # only when save_logs=True
		<agency>__<dataset>__<timestamp>.log

Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

from pathlib import Path

from sdmxflow.dataset import SdmxDataset


def refresh_eurostat_lfsa_egai2d() -> None:
	ds = SdmxDataset(
		out_dir=Path("/data/sdmx/lfsa_egai2d"),
		source_id="ESTAT",
		dataset_id="lfsa_egai2d",
	)
	ds.fetch()

Then:

  • load <out_dir>/dataset.csv into a staging table,
  • define it as a dbt source,
  • build models on top; select the newest version via the last_updated column.
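
The "select the newest version" step above can be sketched with the standard library alone (the `last_updated` column name comes from the layout described in this README; the file path and helper name are illustrative):

```python
import csv


def newest_version_rows(dataset_csv):
    """Return only the rows belonging to the most recent upstream version."""
    with open(dataset_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return []
    # last_updated is UTC ISO-8601, so lexicographic max == chronological max
    latest = max(r["last_updated"] for r in rows)
    return [r for r in rows if r["last_updated"] == latest]
```

In a real warehouse you would express the same filter in SQL (e.g. a dbt model selecting `WHERE last_updated = (SELECT MAX(last_updated) FROM staging)`).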

How refresh works

graph TD
	A["Scheduled job<br/>cron / Airflow / Prefect"] --> B["Fetch upstream last-updated<br/>SDMX annotations (Eurostat)"]
	B --> C{"Local metadata.json<br/>exists?"}
	C -- Yes --> D{"Upstream<br/>changed?"}
	C -- No --> E0

	D -- No --> G["No new version<br/>Keep dataset.csv<br/>Ensure metadata + codelists"]
	D -- Yes --> E1

	subgraph DL[" "]
		direction LR
		E1["Download new slice"]
		E0["Download initial slice"]
	end
	style DL fill:transparent,stroke:transparent

	E0 --> F["Append rows to dataset.csv<br/>append-only, adds last_updated column"]
	E1 --> F
	F --> H["Update metadata history<br/>Export codelists"]
	G --> I["Warehouse ingestion step<br/>dbt / COPY / load job"]
	H --> I

fetch() is designed for scheduled refresh jobs:

  1. Fetch upstream “last updated” timestamp.
  2. Compare with the latest locally recorded timestamp in metadata.json.
  3. If unchanged: leave the dataset untouched (metadata and codelists are still ensured).
  4. If changed: download and append a new slice to dataset.csv, then update metadata + codelists.
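
The decision in steps 1–3 can be sketched as follows. This is a minimal illustration, not sdmxflow's actual implementation; the metadata.json field names (`versions`, `last_updated`) are assumptions and the real schema may differ:

```python
import json
from pathlib import Path


def needs_refresh(metadata_json: Path, upstream_last_updated: str) -> bool:
    """Compare the upstream timestamp with the latest one recorded locally."""
    if not metadata_json.exists():
        return True  # no local history yet: fetch the initial slice
    history = json.loads(metadata_json.read_text(encoding="utf-8"))
    recorded = [v["last_updated"] for v in history.get("versions", [])]
    # ISO-8601 UTC strings compare correctly as plain strings
    return not recorded or upstream_last_updated > max(recorded)
```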

Use cases

  • Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging.
  • Keep reference codelists versioned alongside fact extracts for governance.
  • Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.

Why sdmxflow

sdmxflow is intentionally opinionated about operationalizing SDMX datasets for warehouse refresh jobs.

  • Compared to SDMX client libraries: they fetch data; sdmxflow produces deterministic refresh artifacts + metadata trail + codelists.
  • Compared to flexible extractors: sdmxflow focuses on stable layout and predictable refresh semantics.

See “Credits and acknowledgements” below for project influences and dependencies.


Features

  • Append-only refresh: only downloads and appends when upstream changed.
  • Warehouse-friendly layout: dataset.csv (facts), metadata.json (versions + fetch info), codelists/ (reference tables).
  • Fast upstream change detection (Eurostat): uses SDMX annotations for last-updated.
  • User-friendly logging at INFO and detailed diagnostics at DEBUG.
  • Optional per-run log file capture via save_logs=True.

Non-goals (for now):

  • full multi-provider support,
  • a full-blown orchestration framework,
  • a “do everything” SDMX exploration UI.

Installation

From PyPI (recommended)

pip install sdmxflow

If you run sdmxflow inside an orchestration environment that pins packaging<25.1 (for example, Prefect 3.6.x), install sdmxflow>=0.1.1 so that dependency resolution can avoid sdmx1==2.25.1 (which declares packaging>=26).

From source (this repository)

This project uses uv for development.

git clone https://github.com/knifflig/sdmxflow
cd sdmxflow
uv sync --group dev

Output layout

sdmxflow writes a stable folder structure under your chosen out_dir:

<out_dir>/
	dataset.csv
	metadata.json
	codelists/
		... generated reference CSVs ...
	logs/                     # only when save_logs=True
		<agency>__<dataset>__<timestamp>.log

dataset.csv

  • Append-only across versions.
  • Includes a leading last_updated column (UTC ISO-8601) indicating which upstream version a row belongs to.

metadata.json

Stores dataset identity and version history, such as:

  • upstream timestamps,
  • fetch times,
  • HTTP URL/status/headers (when available),
  • number of rows appended for each version.
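
As an illustration only (the exact schema is defined by sdmxflow and may differ; every field name and value below is hypothetical), a metadata.json covering one fetched version might look like:

```json
{
  "source_id": "ESTAT",
  "dataset_id": "lfsa_egai2d",
  "versions": [
    {
      "last_updated": "2024-05-01T07:00:00Z",
      "fetched_at": "2024-05-02T03:15:00Z",
      "url": "https://ec.europa.eu/eurostat/api/...",
      "status": 200,
      "rows_appended": 12480
    }
  ]
}
```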

codelists/

Contains exported codelists needed to interpret coded dataset columns.


Logging

sdmxflow is built to be readable in production logs.

  • At INFO level, fetch() emits exactly three user-facing messages:
    1. intention (what, where),
    2. version decision (download vs. already up to date),
    3. completion summary (artifact paths).
  • Enable DEBUG for rich diagnostics.
  • If you pass save_logs=True, sdmxflow writes a per-run debug log file under <out_dir>/logs/.

Provider support and limitations

  • Supported:
    • Eurostat (source_id="ESTAT")

Planned/possible future work (not guaranteed):

  • additional SDMX sources,
  • richer metadata capture (more SDMX structure fields),
  • export formats beyond CSV/JSON.

FAQ

Does sdmxflow load into my warehouse directly?

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists). You load them using your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

Does it support providers besides Eurostat?

Not yet. Eurostat (source_id="ESTAT") is the current supported provider.

Does it deduplicate data?

It is append-only across upstream versions. Each appended slice is marked with a last_updated value so downstream jobs can select the newest version (or reprocess full history).

How does it detect upstream changes?

For Eurostat, it uses SDMX annotations to obtain a last-updated timestamp and compares it to the latest locally recorded timestamp.


Development

Install dev dependencies:

uv sync --group dev

Run tests:

uv run pytest

Run lint/format:

uv run ruff check .
uv run ruff format .

Contributing

Contributions are welcome.

Good first contributions:

  • improvements to metadata extraction,
  • better codelist export coverage,
  • adding new provider support behind a clean interface,
  • documentation and examples.

Please open an issue before large changes.


License

Licensed under the Apache License, Version 2.0. See LICENSE.md.


Credits and acknowledgements
