
sdmxflow


sdmxflow turns SDMX datasets (Eurostat today) into deterministic, append-only warehouse refresh artifacts: facts CSV + versioned metadata trail + exported codelists.

Problem: SDMX is easy to query, but harder to operationalize for warehouses (repeatable artifacts, refresh semantics, reference data, governance).

Solution: sdmxflow fetches a dataset and writes a stable on-disk layout that you can load into your warehouse on a schedule.

Proof: Eurostat is supported now (source_id="ESTAT"), with append-only refresh and last-updated change detection.

[!NOTE]
Status: early but functional.
Supported providers: Eurostat (source_id="ESTAT").
Docs: https://knifflig.github.io/sdmxflow/

sdmxflow is designed for the common “ELT input dataset” pattern:

  • pull a dataset from an SDMX provider,
  • store it locally in a stable folder structure,
  • refresh it periodically,
  • keep a minimal but useful metadata trail (versions, timestamps, URLs, status, row counts),
  • export the reference data (codelists) required to interpret coded columns.

Quickstart

The primary entrypoint is SdmxDataset.

from pathlib import Path

from sdmxflow.dataset import SdmxDataset

ds = SdmxDataset(
	out_dir=Path("./out/lfsa_egai2d"),
	source_id="ESTAT",
	dataset_id="lfsa_egai2d",
	# Optional:
	# agency_id="ESTAT",
	# key=...,        # provider-specific key restriction
	# params={...},   # provider-specific passthrough params
	save_logs=True,  # writes <out_dir>/logs/<agency>__<dataset>__<timestamp>.log
)

result = ds.fetch()

# `result` contains paths to the artifacts that were created/updated:
# - result.dataset_csv
# - result.metadata_json
# - result.codelists_dir

What you get on disk

<out_dir>/
	dataset.csv          # append-only facts across versions
	metadata.json        # version history + fetch metadata
	codelists/           # exported reference tables
	logs/                # only when save_logs=True
		<agency>__<dataset>__<timestamp>.log

Integrations (Airflow/dbt style)

The intended workflow is: fetch artifacts → load into your warehouse → model downstream.

Example (Airflow task pseudocode):

from pathlib import Path

from sdmxflow.dataset import SdmxDataset


def refresh_eurostat_lfsa_egai2d() -> None:
	ds = SdmxDataset(
		out_dir=Path("/data/sdmx/lfsa_egai2d"),
		source_id="ESTAT",
		dataset_id="lfsa_egai2d",
	)
	ds.fetch()

Then:

  • load <out_dir>/dataset.csv into a staging table,
  • define it as a dbt source,
  • build models on top; select the newest version via the last_updated column.
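
The "select the newest version" step above can be sketched with the standard library alone (the `last_updated` column name comes from the layout described in this README; the file path and helper name are illustrative):

```python
import csv


def newest_version_rows(dataset_csv):
    """Return only the rows belonging to the most recent upstream version."""
    with open(dataset_csv, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))
    if not rows:
        return []
    # last_updated is UTC ISO-8601, so lexicographic max == chronological max
    latest = max(r["last_updated"] for r in rows)
    return [r for r in rows if r["last_updated"] == latest]
```

In a real warehouse you would express the same filter in SQL (e.g. a dbt model selecting `WHERE last_updated = (SELECT MAX(last_updated) FROM staging)`).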

How refresh works

graph TD
	A["Scheduled job<br/>cron / Airflow / Prefect"] --> B["Fetch upstream last-updated<br/>SDMX annotations (Eurostat)"]
	B --> C{"Local metadata.json<br/>exists?"}
	C -- Yes --> D{"Upstream<br/>changed?"}
	C -- No --> E0

	D -- No --> G["No new version<br/>Keep dataset.csv<br/>Ensure metadata + codelists"]
	D -- Yes --> E1

	subgraph DL[" "]
		direction LR
		E1["Download new slice"]
		E0["Download initial slice"]
	end
	style DL fill:transparent,stroke:transparent

	E0 --> F["Append rows to dataset.csv<br/>append-only, adds last_updated column"]
	E1 --> F
	F --> H["Update metadata history<br/>Export codelists"]
	G --> I["Warehouse ingestion step<br/>dbt / COPY / load job"]
	H --> I

fetch() is designed for scheduled refresh jobs:

  1. Fetch upstream “last updated” timestamp.
  2. Compare with the latest locally recorded timestamp in metadata.json.
  3. If unchanged: leave the dataset untouched (metadata and codelists are still ensured).
  4. If changed: download and append a new slice to dataset.csv, then update metadata + codelists.
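
The decision in steps 1–3 can be sketched as follows. This is a minimal illustration, not sdmxflow's actual implementation; the metadata.json field names (`versions`, `last_updated`) are assumptions and the real schema may differ:

```python
import json
from pathlib import Path


def needs_refresh(metadata_json: Path, upstream_last_updated: str) -> bool:
    """Compare the upstream timestamp with the latest one recorded locally."""
    if not metadata_json.exists():
        return True  # no local history yet: fetch the initial slice
    history = json.loads(metadata_json.read_text(encoding="utf-8"))
    recorded = [v["last_updated"] for v in history.get("versions", [])]
    # ISO-8601 UTC strings compare correctly as plain strings
    return not recorded or upstream_last_updated > max(recorded)
```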

Use cases

  • Refresh Eurostat indicators nightly into Postgres/Snowflake/BigQuery staging.
  • Keep reference codelists versioned alongside fact extracts for governance.
  • Produce reproducible ELT inputs (facts + metadata + reference tables) for analysts.

Why sdmxflow

sdmxflow is intentionally opinionated about operationalizing SDMX datasets for warehouse refresh jobs.

  • Compared to SDMX client libraries: they fetch data; sdmxflow produces deterministic refresh artifacts + metadata trail + codelists.
  • Compared to flexible extractors: sdmxflow focuses on stable layout and predictable refresh semantics.

See “Credits and acknowledgements” below for project influences and dependencies.


Features

  • Append-only refresh: only downloads and appends when upstream changed.
  • Warehouse-friendly layout: dataset.csv (facts), metadata.json (versions + fetch info), codelists/ (reference tables).
  • Fast upstream change detection (Eurostat): uses SDMX annotations for last-updated.
  • User-friendly logging at INFO and detailed diagnostics at DEBUG.
  • Optional per-run log file capture via save_logs=True.

Non-goals (for now):

  • full multi-provider support,
  • a full-blown orchestration framework,
  • a “do everything” SDMX exploration UI.

Installation

From PyPI (recommended)

pip install sdmxflow

If you run sdmxflow inside an orchestration environment that pins packaging<25.1 (for example, Prefect 3.6.x), install sdmxflow>=0.1.1 so that dependency resolution can avoid sdmx1==2.25.1 (which declares packaging>=26).

From source (this repository)

This project uses uv for development.

git clone https://github.com/knifflig/sdmxflow
cd sdmxflow
uv sync --group dev

Output layout

sdmxflow writes a stable folder structure under your chosen out_dir:

<out_dir>/
	dataset.csv
	metadata.json
	codelists/
		... generated reference CSVs ...
	logs/                     # only when save_logs=True
		<agency>__<dataset>__<timestamp>.log

dataset.csv

  • Append-only across versions.
  • Includes a leading last_updated column (UTC ISO-8601) indicating which upstream version a row belongs to.

metadata.json

Stores dataset identity and version history, such as:

  • upstream timestamps,
  • fetch times,
  • HTTP URL/status/headers (when available),
  • number of rows appended for each version.
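
As an illustration only (the exact schema is defined by sdmxflow and may differ; every field name and value below is hypothetical), a metadata.json covering one fetched version might look like:

```json
{
  "source_id": "ESTAT",
  "dataset_id": "lfsa_egai2d",
  "versions": [
    {
      "last_updated": "2024-05-01T07:00:00Z",
      "fetched_at": "2024-05-02T03:15:00Z",
      "url": "https://ec.europa.eu/eurostat/api/...",
      "status": 200,
      "rows_appended": 12480
    }
  ]
}
```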

codelists/

Contains exported codelists needed to interpret coded dataset columns.


Logging

sdmxflow is built to be readable in production logs.

  • At INFO level, fetch() emits exactly three user-facing messages:
    1. intention (what, where),
    2. version decision (download vs. already up to date),
    3. completion summary (artifact paths).
  • Enable DEBUG for rich diagnostics.
  • If you pass save_logs=True, sdmxflow writes a per-run debug log file under <out_dir>/logs/.

Provider support and limitations

  • Supported:
    • Eurostat (source_id="ESTAT")

Planned/possible future work (not guaranteed):

  • additional SDMX sources,
  • richer metadata capture (more SDMX structure fields),
  • export formats beyond CSV/JSON.

FAQ

Does sdmxflow load into my warehouse directly?

No. It produces deterministic on-disk artifacts (CSV/JSON/codelists). You load them using your existing tooling (Airflow, dbt, COPY/LOAD jobs, etc.).

Does it support providers besides Eurostat?

Not yet. Eurostat (source_id="ESTAT") is the current supported provider.

Does it deduplicate data?

It is append-only across upstream versions. Each appended slice is marked with a last_updated value so downstream jobs can select the newest version (or reprocess full history).

How does it detect upstream changes?

For Eurostat, it uses SDMX annotations to obtain a last-updated timestamp and compares it to the latest locally recorded timestamp.


Development

Install dev dependencies:

uv sync --group dev

Run tests:

uv run pytest

Run lint/format:

uv run ruff check .
uv run ruff format .

Contributing

Contributions are welcome.

Good first contributions:

  • improvements to metadata extraction,
  • better codelist export coverage,
  • adding new provider support behind a clean interface,
  • documentation and examples.

Please open an issue before large changes.


License

Licensed under the Apache License, Version 2.0. See LICENSE.md.


Credits and acknowledgements
