dlt-saga
Config-driven data ingestion and historization framework, built on dlt.
Why dlt-saga?
dlt is an excellent Python library for building data pipelines. dlt-saga adds the operational layer that teams need to run dlt at scale:
| What you get | How |
|---|---|
| Zero-code pipelines | Drop a YAML file in configs/ — no Python needed for common sources |
| SCD2 historization | write_disposition: append+historize turns any snapshot table into a full change history with _dlt_valid_from / _dlt_valid_to |
| dbt-style selectors | saga ingest --select "tag:daily,group:api" — union, intersection, glob patterns |
| Multi-environment profiles | profiles.yml with dev/prod targets, service account impersonation, per-environment datasets |
| Plugin architecture | Register custom sources and destinations via packages.yml or Python entry points — no framework fork needed |
| Cloud-agnostic | BigQuery today, Databricks and DuckDB included, more via plugins |
If you are already using dlt directly and finding yourself re-implementing incremental state management, environment switching, or SCD2 transforms — dlt-saga is the config layer you are building.
Installation
pip install dlt-saga[bigquery] # BigQuery
pip install dlt-saga[databricks,azure] # Databricks on Azure
pip install dlt-saga # DuckDB only (no cloud dependencies)
Quick Start
# 1. Create and scaffold a project
mkdir my-pipelines && cd my-pipelines
saga init # prompts for destination and credentials
# 2. Authenticate to your destination (skip for DuckDB)
# See: https://github.com/Glitni/dlt-saga/wiki/Getting-Started
# 3. List available pipelines
saga list
# 4. Run a pipeline
saga ingest --select "example__sample"
See the Getting Started guide for a full walkthrough, or browse example/ for a minimal runnable setup.
Local execution is the default. Use --orchestrate to fan out to parallel workers (requires orchestration: configured in saga_project.yml).
CLI Commands
All commands are subcommands under the saga entry point and share common options:
--select, --verbose, --profile, --target.
Selectors (dbt-style)
Selectors filter which pipelines to run. They work across all commands.
| Syntax | Meaning | Example |
|---|---|---|
| name | Exact pipeline name | --select google_sheets__my_pipeline |
| *glob* | Glob pattern | --select "*balance*" |
| tag:name | Filter by tag | --select "tag:daily" (schedule-aware; see Configuration → Scheduling tags) |
| group:name | Filter by source group | --select "group:google_sheets" |
| space-separated | UNION (OR) | --select "tag:daily group:filesystem" |
| comma-separated | INTERSECTION (AND) | --select "tag:daily,group:google_sheets" |
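The union and intersection rules in the table can be pictured with a small Python sketch. This is not dlt-saga's selector code: the `Pipeline` class, `matches_term`, and `select` are hypothetical names made up for illustration, and the sketch ignores the schedule-aware behaviour of tags.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Pipeline:
    # Hypothetical stand-in for a discovered pipeline config.
    name: str
    group: str
    tags: frozenset = field(default_factory=frozenset)


def matches_term(pipeline, term):
    """Match one atomic term: tag:x, group:x, or a pipeline name / glob."""
    if term.startswith("tag:"):
        return term[len("tag:"):] in pipeline.tags
    if term.startswith("group:"):
        return pipeline.group == term[len("group:"):]
    # An exact name is simply a glob without wildcards.
    return fnmatchcase(pipeline.name, term)


def select(pipelines, selector):
    """Space-separated parts are ORed; comma-separated terms in a part are ANDed."""
    parts = [part.split(",") for part in selector.split()]
    return [
        p.name
        for p in pipelines
        if any(all(matches_term(p, term) for term in terms) for terms in parts)
    ]


PIPELINES = [
    Pipeline("google_sheets__budget", "google_sheets", frozenset({"daily"})),
    Pipeline("filesystem__balance_export", "filesystem", frozenset({"weekly"})),
    Pipeline("api__orders", "api", frozenset({"daily"})),
]

union = select(PIPELINES, "tag:daily group:filesystem")
intersection = select(PIPELINES, "tag:daily,group:google_sheets")
globbed = select(PIPELINES, "*balance*")
```

With this sample data the union selector picks all three pipelines, the intersection picks only `google_sheets__budget`, and the glob picks only `filesystem__balance_export`.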
Common Examples
# List pipelines
saga list # All enabled pipelines
saga list --resource-type ingest # Ingest-enabled only
saga list --resource-type historize # Historize-enabled only
saga list --select "tag:daily" # Filtered by tag
# Ingest
saga ingest --select "tag:daily"
saga ingest --select "group:api" --workers 8
saga ingest --full-refresh --select "my_pipeline"
saga ingest --select "group:api" --start-value-override "2026-01-01" # Backfill
# Historize (SCD2)
saga historize --select "tag:daily"
saga historize --full-refresh --select "filesystem__*"
# Run (ingest + historize sequentially)
saga run --select "tag:daily"
# Update BigQuery access controls
saga update-access --select "group:google_sheets"
# Target a specific environment
saga ingest --target prod --select "tag:daily" # production (with impersonation)
Adding a New Pipeline
Create a YAML config file in configs/<source_type>/ — that's it. The framework auto-discovers configs.
Supported source types out of the box: API, Database (PostgreSQL, MySQL, SQL Server, and more via ConnectorX), Filesystem (GCS, SFTP, local), Google Sheets, and SharePoint.
See the Pipeline Types guide for config examples for each source type, and the Configuration reference for all available fields.
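As a rough mental model of auto-discovery, the sketch below walks a configs/ directory and derives one pipeline name per YAML file. It is an illustration under assumptions, not the framework's discovery code: the function `discover_configs` is hypothetical, the `<source_type>__<file name>` naming is inferred from names such as google_sheets__my_pipeline in the examples above, and only one directory level is scanned.

```python
import tempfile
from pathlib import Path


def discover_configs(root):
    """Map an assumed pipeline name to its config path for root/<source_type>/*.yml."""
    found = {}
    for path in sorted(Path(root).glob("*/*")):
        if path.suffix not in {".yml", ".yaml"}:
            continue  # skip anything that is not a YAML config
        source_type = path.parent.name
        # Assumed naming convention: <source_type>__<file name without extension>
        found[f"{source_type}__{path.stem}"] = path
    return found


# Build a throwaway configs/ tree to demonstrate the walk.
with tempfile.TemporaryDirectory() as tmp:
    configs = Path(tmp) / "configs"
    (configs / "google_sheets").mkdir(parents=True)
    (configs / "google_sheets" / "budget.yml").write_text("write_disposition: append\n")
    (configs / "api").mkdir()
    (configs / "api" / "orders.yaml").write_text("write_disposition: merge\n")
    (configs / "api" / "notes.txt").write_text("not a config\n")
    discovered = sorted(discover_configs(configs))

print(discovered)
```

Run saga list in a real project to see the names the framework actually assigns.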
Write Dispositions and Historize
The write_disposition field controls what operations are enabled for a pipeline:
| Value | Ingest | Historize | Use Case |
|---|---|---|---|
| append | Yes | No | Raw event/log data |
| merge | Yes | No | Upsert on primary key |
| replace | Yes | No | Full refresh each run |
| append+historize | Yes | Yes | Snapshot → SCD2 |
| historize | No | Yes | External data → SCD2 |
Historize transforms raw snapshot data into SCD2 tables with _dlt_valid_from, _dlt_valid_to, and _dlt_is_deleted columns. See the Historize guide for the full reference.
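To make the snapshot-to-SCD2 idea concrete, here is a minimal in-memory sketch. It is not the historize engine: the functions `payload` and `historize` are hypothetical, timestamps are plain strings, and the treatment of deletions (closing the open version and adding a row with _dlt_is_deleted set to True) is an assumption about what that column means. Only the three column names come from the text above.

```python
def payload(version):
    """Return the business columns of a version, without the _dlt_ bookkeeping."""
    return {k: v for k, v in version.items() if not k.startswith("_dlt_")}


def historize(snapshots, key="id"):
    """Fold time-ordered (timestamp, rows) snapshots into SCD2 versions."""
    history = []   # every version ever emitted, in emission order
    current = {}   # key value -> the currently open version for that key
    for ts, rows in snapshots:
        seen = set()
        for row in rows:
            k = row[key]
            seen.add(k)
            open_version = current.get(k)
            if (
                open_version is not None
                and not open_version["_dlt_is_deleted"]
                and payload(open_version) == row
            ):
                continue  # unchanged since the previous snapshot
            if open_version is not None:
                open_version["_dlt_valid_to"] = ts  # close the old version
            version = {**row, "_dlt_valid_from": ts, "_dlt_valid_to": None,
                       "_dlt_is_deleted": False}
            history.append(version)
            current[k] = version
        # Keys missing from this snapshot are treated as deleted (assumption).
        for k, open_version in list(current.items()):
            if k not in seen and not open_version["_dlt_is_deleted"]:
                open_version["_dlt_valid_to"] = ts
                tombstone = {**payload(open_version), "_dlt_valid_from": ts,
                             "_dlt_valid_to": None, "_dlt_is_deleted": True}
                history.append(tombstone)
                current[k] = tombstone
    return history


snapshots = [
    ("2026-01-01", [{"id": 1, "plan": "free"}, {"id": 2, "plan": "pro"}]),
    ("2026-01-02", [{"id": 1, "plan": "pro"}, {"id": 2, "plan": "pro"}]),
    ("2026-01-03", [{"id": 1, "plan": "pro"}]),
]
history = historize(snapshots)
```

In this example id 1 ends up with two versions (free, then pro), while id 2 has one closed version plus a deletion marker starting at the third snapshot. A real engine would typically do this in SQL on the destination; see the Historize guide for the actual behaviour.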
Community
- GitHub Issues — bug reports and feature requests
- GitHub Discussions — questions, ideas, show & tell
- Contributing guide — how to get involved
- dlt community — dlt Slack / Discord
Further Reading
- Getting Started — Full walkthrough: install, init, first pipeline
- Architecture — Three-layer design, plugin system, execution flow
- Pipeline Types — Config reference for API, Database, Filesystem, Sheets, SharePoint
- Configuration — Hierarchical config, all options reference
- Profiles — Multi-environment setup, service account impersonation
- Historize (SCD2) — Snapshot tables → slowly changing dimensions
- CLI Reference — All commands, flags, and the programmatic API
- Deployment — Orchestration, Cloud Run, worker setup
- Performance — Parallel execution, worker tuning, backfill
- Plugin Development — Custom sources, destinations, hooks
Origin
dlt-saga is derived from an internal data ingestion framework originally built by Glitni for Amedia, a leading Nordic media group, as the ingestion layer of Amedia's data platform. Amedia supported open-sourcing the project and continues to fund ongoing development through their partnership with Glitni, enabling the framework to be shared with the broader community.
Project Structure
dlt-saga/
├── dlt_saga/ # Main package
│ ├── cli.py # CLI entry point (saga command)
│ ├── pipelines/ # Built-in source implementations
│ │ ├── api/ # Generic REST API pipeline
│ │ ├── database/ # Database source (ConnectorX)
│ │ ├── filesystem/ # Filesystem / GCS source
│ │ ├── google_sheets/# Google Sheets source
│ │ └── sharepoint/ # SharePoint source
│ ├── historize/ # SCD2 historization engine
│ ├── destinations/ # Destination implementations
│ │ ├── bigquery/ # BigQuery
│ │ └── duckdb/ # DuckDB (local development)
│ ├── pipeline_config/ # Config discovery and parsing
│ ├── schemas/ # Bundled static schemas (dlt_common.json)
│ └── utility/ # Shared utilities (CLI, naming, orchestration)
├── example/ # Minimal runnable consumer project (DuckDB)
├── wiki/ # Documentation (synced to GitHub wiki)
└── .dlt/ # dlt runtime config overrides