dlt-saga

Config-driven data ingestion and historization framework, built on dlt.

Why dlt-saga?

dlt is an excellent Python library for building data pipelines. dlt-saga adds the operational layer that teams need to run dlt at scale:

| What you get | How |
| --- | --- |
| Zero-code pipelines | Drop a YAML file in configs/; no Python needed for common sources |
| SCD2 historization | write_disposition: append+historize turns any snapshot table into a full change history with _dlt_valid_from / _dlt_valid_to |
| dbt-style selectors | saga ingest --select "tag:daily,group:api" (union, intersection, glob patterns) |
| Multi-environment profiles | profiles.yml with dev/prod targets, service account impersonation, per-environment datasets |
| Plugin architecture | Register custom sources and destinations via packages.yml or Python entry points; no framework fork needed |
| Cloud-agnostic | BigQuery today, Databricks and DuckDB included, more via plugins |

If you are already using dlt directly and finding yourself re-implementing incremental state management, environment switching, or SCD2 transforms — dlt-saga is the config layer you are building.

Installation

pip install dlt-saga[bigquery]          # BigQuery
pip install dlt-saga[databricks,azure]  # Databricks on Azure
pip install dlt-saga                    # DuckDB only (no cloud dependencies)

Quick Start

# 1. Create and scaffold a project
mkdir my-pipelines && cd my-pipelines
saga init                               # prompts for destination and credentials

# 2. Authenticate to your destination (skip for DuckDB)
#    See: https://github.com/Glitni/dlt-saga/wiki/Getting-Started

# 3. List available pipelines
saga list

# 4. Run a pipeline
saga ingest --select "example__sample"

See the Getting Started guide for a full walkthrough, or browse example/ for a minimal runnable setup.

Local execution is the default. Use --orchestrate to fan out to parallel workers (requires orchestration: configured in saga_project.yml).

CLI Commands

All commands are subcommands under the saga entry point and share common options: --select, --verbose, --profile, --target.

Selectors (dbt-style)

Selectors filter which pipelines to run. They work across all commands.

| Syntax | Meaning | Example |
| --- | --- | --- |
| name | Exact pipeline name | --select google_sheets__my_pipeline |
| *glob* | Glob pattern | --select "*balance*" |
| tag:name | Filter by tag | --select "tag:daily" (schedule-aware; see Configuration → Scheduling tags) |
| group:name | Filter by source group | --select "group:google_sheets" |
| space-separated | UNION (OR) | --select "tag:daily group:filesystem" |
| comma-separated | INTERSECTION (AND) | --select "tag:daily,group:google_sheets" |
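
The union and intersection rules can be sketched in a few lines of Python. This is an illustrative model only, not dlt-saga's implementation: the Pipeline class, matches_term, and select are hypothetical names, the sample pipelines are made up, and the schedule-aware behaviour of tags is ignored.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical stand-in for a discovered pipeline config."""
    name: str
    group: str
    tags: frozenset = field(default_factory=frozenset)


def matches_term(pipeline: Pipeline, term: str) -> bool:
    """Match one selector term: tag:<x>, group:<x>, or a name / glob pattern."""
    if term.startswith("tag:"):
        return term[len("tag:"):] in pipeline.tags
    if term.startswith("group:"):
        return pipeline.group == term[len("group:"):]
    # An exact name is simply a glob pattern without wildcards.
    return fnmatchcase(pipeline.name, term)


def select(pipelines, selector: str):
    """Space-separated clauses are OR-ed; comma-separated terms in a clause are AND-ed."""
    clauses = selector.split()
    return [
        p for p in pipelines
        if any(all(matches_term(p, t) for t in clause.split(",")) for clause in clauses)
    ]


# Made-up pipelines for illustration.
PIPELINES = [
    Pipeline("google_sheets__budget", "google_sheets", frozenset({"daily"})),
    Pipeline("filesystem__balance_sheet", "filesystem", frozenset({"weekly"})),
    Pipeline("api__orders", "api", frozenset({"daily"})),
]

daily_sheets = select(PIPELINES, "tag:daily,group:google_sheets")  # intersection
daily_or_files = select(PIPELINES, "tag:daily group:filesystem")   # union
```

In this model the comma binds tighter than the space, so "tag:daily,group:api group:filesystem" would read as (daily AND api) OR filesystem. Whether dlt-saga resolves mixed selectors the same way should be checked against the CLI Reference.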

Common Examples

# List pipelines
saga list                                        # All enabled pipelines
saga list --resource-type ingest                 # Ingest-enabled only
saga list --resource-type historize              # Historize-enabled only
saga list --select "tag:daily"                   # Filtered by tag

# Ingest
saga ingest --select "tag:daily"
saga ingest --select "group:api" --workers 8
saga ingest --full-refresh --select "my_pipeline"
saga ingest --select "group:api" --start-value-override "2026-01-01"  # Backfill

# Historize (SCD2)
saga historize --select "tag:daily"
saga historize --full-refresh --select "filesystem__*"

# Run (ingest + historize sequentially)
saga run --select "tag:daily"

# Update BigQuery access controls
saga update-access --select "group:google_sheets"

# Target a specific environment
saga ingest --target prod --select "tag:daily"   # production (with impersonation)

Adding a New Pipeline

Create a YAML config file in configs/<source_type>/ — that's it. The framework auto-discovers configs.
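
As a rough mental model of auto-discovery, the sketch below walks a configs/ directory and derives one pipeline name per YAML file. The function discover_configs is hypothetical, and the "<source_type>__<file name>" naming rule is an assumption inferred from names such as google_sheets__my_pipeline used elsewhere in this README; the framework's actual discovery logic may differ.

```python
from pathlib import Path


def discover_configs(root: Path) -> dict[str, Path]:
    """Map assumed pipeline names to YAML files found under root/<source_type>/."""
    found = {}
    for path in sorted(root.glob("*/*")):
        # Only YAML files count as pipeline configs in this sketch.
        if not path.is_file() or path.suffix not in {".yml", ".yaml"}:
            continue
        # Assumed naming rule: "<source_type>__<file stem>".
        found[f"{path.parent.name}__{path.stem}"] = path
    return found
```

Under these assumptions, a file at configs/google_sheets/my_pipeline.yml would be listed as google_sheets__my_pipeline.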

Supported source types out of the box: API, Database (PostgreSQL, MySQL, SQL Server, and more via ConnectorX), Filesystem (GCS, SFTP, local), Google Sheets, and SharePoint.

See the Pipeline Types guide for config examples for each source type, and the Configuration reference for all available fields.

Write Dispositions and Historize

The write_disposition field controls what operations are enabled for a pipeline:

| Value | Ingest | Historize | Use Case |
| --- | --- | --- | --- |
| append | Yes | No | Raw event/log data |
| merge | Yes | No | Upsert on primary key |
| replace | Yes | No | Full refresh each run |
| append+historize | Yes | Yes | Snapshot → SCD2 |
| historize | No | Yes | External data → SCD2 |
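
The table can also be read as a lookup from write_disposition to the operations it enables. The sketch below encodes it that way; OPERATIONS and runnable_steps are hypothetical names, and the idea that a combined run simply skips a disabled step is an assumption rather than documented behaviour.

```python
# (ingest enabled, historize enabled) per write_disposition, copied from the table.
OPERATIONS = {
    "append": (True, False),
    "merge": (True, False),
    "replace": (True, False),
    "append+historize": (True, True),
    "historize": (False, True),
}


def runnable_steps(write_disposition: str) -> list[str]:
    """Return the steps a combined ingest + historize run would execute, in order."""
    ingest, historize = OPERATIONS[write_disposition]
    steps = []
    if ingest:
        steps.append("ingest")
    if historize:
        steps.append("historize")
    return steps
```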

Historize transforms raw snapshot data into SCD2 tables with _dlt_valid_from, _dlt_valid_to, and _dlt_is_deleted columns. See the Historize guide for the full reference.
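
To make the SCD2 idea concrete, here is a minimal in-memory sketch that turns a sequence of snapshots into versioned rows carrying the three columns named above. It is not the framework's engine: the historize function and the row layout are hypothetical. Representing an open version with _dlt_valid_to = None and recording a deletion as a separate marker row are both assumptions made for illustration.

```python
def historize(snapshots):
    """Build SCD2-style history from (timestamp, {primary_key: row}) snapshots.

    Snapshots must be in ascending time order.
    """
    history = []  # every version ever seen, in creation order
    current = {}  # primary key -> latest version dict (also stored in history)

    def open_version(key, data, ts, deleted):
        version = {
            "key": key,
            "data": data,
            "_dlt_valid_from": ts,
            "_dlt_valid_to": None,  # assumption: None marks the open version
            "_dlt_is_deleted": deleted,
        }
        history.append(version)
        current[key] = version

    for ts, rows in snapshots:
        for key, row in rows.items():
            latest = current.get(key)
            if latest and not latest["_dlt_is_deleted"] and latest["data"] == row:
                continue  # unchanged since the previous snapshot
            if latest:
                latest["_dlt_valid_to"] = ts  # close the superseded version
            open_version(key, row, ts, deleted=False)

        # Keys missing from this snapshot are treated as deleted.
        for key, latest in list(current.items()):
            if key not in rows and not latest["_dlt_is_deleted"]:
                latest["_dlt_valid_to"] = ts
                open_version(key, latest["data"], ts, deleted=True)

    return history


# Made-up snapshots: key 1 changes on day 2, key 2 disappears on day 3.
SNAPSHOTS = [
    ("2026-01-01", {1: {"name": "a"}, 2: {"name": "b"}}),
    ("2026-01-02", {1: {"name": "a2"}, 2: {"name": "b"}}),
    ("2026-01-03", {1: {"name": "a2"}}),
]
HISTORY = historize(SNAPSHOTS)
```

With these sample snapshots the result has four rows: two versions of key 1, one closed version of key 2, and one deletion marker for key 2. The Historize guide remains the reference for how dlt-saga actually represents open and deleted rows.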

Further Reading

  • Getting Started — Full walkthrough: install, init, first pipeline
  • Architecture — Three-layer design, plugin system, execution flow
  • Pipeline Types — Config reference for API, Database, Filesystem, Sheets, SharePoint
  • Configuration — Hierarchical config, all options reference
  • Profiles — Multi-environment setup, service account impersonation
  • Historize (SCD2) — Snapshot tables → slowly changing dimensions
  • CLI Reference — All commands, flags, and the programmatic API
  • Deployment — Orchestration, Cloud Run, worker setup
  • Performance — Parallel execution, worker tuning, backfill
  • Plugin Development — Custom sources, destinations, hooks

Origin

dlt-saga is derived from an internal data ingestion framework originally built by Glitni for Amedia, a leading Nordic media group, as the ingestion layer of Amedia's data platform. Amedia supported open-sourcing the project and continues to fund ongoing development through their partnership with Glitni, enabling the framework to be shared with the broader community.

Project Structure

dlt-saga/
├── dlt_saga/              # Main package
│   ├── cli.py            #   CLI entry point (saga command)
│   ├── pipelines/        #   Built-in source implementations
│   │   ├── api/          #     Generic REST API pipeline
│   │   ├── database/     #     Database source (ConnectorX)
│   │   ├── filesystem/   #     Filesystem / GCS source
│   │   ├── google_sheets/#     Google Sheets source
│   │   └── sharepoint/   #     SharePoint source
│   ├── historize/        #   SCD2 historization engine
│   ├── destinations/     #   Destination implementations
│   │   ├── bigquery/     #     BigQuery
│   │   └── duckdb/       #     DuckDB (local development)
│   ├── pipeline_config/  #   Config discovery and parsing
│   ├── schemas/          #   Bundled static schemas (dlt_common.json)
│   └── utility/          #   Shared utilities (CLI, naming, orchestration)
├── example/              # Minimal runnable consumer project (DuckDB)
├── wiki/                 # Documentation (synced to GitHub wiki)
└── .dlt/                 # dlt runtime config overrides
