dlt-saga

Config-driven data ingestion and historization framework, built on dlt.

Why dlt-saga?

dlt is an excellent Python library for building data pipelines. dlt-saga adds the operational layer that teams need to run dlt at scale:

| What you get | How |
| --- | --- |
| Zero-code pipelines | Drop a YAML file in configs/; no Python needed for common sources |
| SCD2 historization | write_disposition: append+historize turns any snapshot table into a full change history with _dlt_valid_from / _dlt_valid_to |
| dbt-style selectors | saga ingest --select "tag:daily,group:api" supports union, intersection, and glob patterns |
| Multi-environment profiles | profiles.yml with dev/prod targets, service account impersonation, per-environment datasets |
| Plugin architecture | Register custom sources and destinations via packages.yml or Python entry points; no framework fork needed |
| Cloud-agnostic | BigQuery, Databricks, and DuckDB included; more via plugins |

If you are already using dlt directly and finding yourself re-implementing incremental state management, environment switching, or SCD2 transforms — dlt-saga is the config layer you are building.

Installation

pip install dlt-saga[bigquery]          # BigQuery
pip install dlt-saga[databricks,azure]  # Databricks on Azure
pip install dlt-saga                    # DuckDB only (no cloud dependencies)

Quick Start

# 1. Create and scaffold a project
mkdir my-pipelines && cd my-pipelines
saga init                               # prompts for destination and credentials

# 2. Authenticate to your destination (skip for DuckDB)
#    See: https://github.com/Glitni/dlt-saga/wiki/Getting-Started

# 3. List available pipelines
saga list

# 4. Run a pipeline
saga ingest --select "example__sample"

See the Getting Started guide for a full walkthrough, or browse example/ for a minimal runnable setup.

Local execution is the default. Use --orchestrate to fan out to parallel workers; this requires orchestration: to be configured in saga_project.yml.

CLI Commands

All commands are subcommands under the saga entry point and share common options: --select, --verbose, --profile, --target.

Selectors (dbt-style)

Selectors filter which pipelines to run. They work across all commands.

| Syntax | Meaning | Example |
| --- | --- | --- |
| name | Exact pipeline name | --select google_sheets__my_pipeline |
| *glob* | Glob pattern | --select "*balance*" |
| tag:name | Filter by tag | --select "tag:daily" |
| group:name | Filter by source group | --select "group:google_sheets" |
| space-separated | UNION (OR) | --select "tag:daily group:filesystem" |
| comma-separated | INTERSECTION (AND) | --select "tag:daily,group:google_sheets" |
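
The selector rules above can be modelled in a few lines of Python. This is only an illustrative sketch of the semantics, not dlt-saga's actual implementation: the Pipeline class, the matches_term and select helpers, and the sample pipelines are all hypothetical, and comma is assumed to bind tighter than space when both appear in one selector.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Pipeline:
    # Hypothetical stand-in for a discovered pipeline config.
    name: str
    group: str
    tags: frozenset = field(default_factory=frozenset)


def matches_term(pipeline, term):
    """Check one atomic term: tag:x, group:x, or a name / glob pattern."""
    if term.startswith("tag:"):
        return term[len("tag:"):] in pipeline.tags
    if term.startswith("group:"):
        return term[len("group:"):] == pipeline.group
    # A pattern without wildcards only matches the exact name.
    return fnmatchcase(pipeline.name, term)


def select(pipelines, selector):
    """Space-separated parts are OR-ed; comma-separated terms in a part are AND-ed."""
    parts = selector.split()
    return [
        p for p in pipelines
        if any(all(matches_term(p, t) for t in part.split(",")) for part in parts)
    ]


PIPELINES = [
    Pipeline("google_sheets__budget", "google_sheets", frozenset({"daily"})),
    Pipeline("google_sheets__balance_sheet", "google_sheets", frozenset({"weekly"})),
    Pipeline("filesystem__invoices", "filesystem", frozenset({"weekly"})),
    Pipeline("api__orders", "api", frozenset({"daily"})),
]


def names(selector):
    return [p.name for p in select(PIPELINES, selector)]


print(names("tag:daily group:filesystem"))     # union
print(names("tag:daily,group:google_sheets"))  # intersection
print(names("*balance*"))                      # glob
```

With these sample pipelines, the union selects three pipelines, the intersection only google_sheets__budget, and the glob only google_sheets__balance_sheet.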

Common Examples

# List pipelines
saga list                                        # All enabled pipelines
saga list --resource-type ingest                 # Ingest-enabled only
saga list --resource-type historize              # Historize-enabled only
saga list --select "tag:daily"                   # Filtered by tag

# Ingest
saga ingest --select "tag:daily"
saga ingest --select "group:api" --workers 8
saga ingest --full-refresh --select "my_pipeline"
saga ingest --select "group:api" --start-value-override "2026-01-01"  # Backfill

# Historize (SCD2)
saga historize --select "tag:daily"
saga historize --full-refresh --select "filesystem__*"

# Run (ingest + historize sequentially)
saga run --select "tag:daily"

# Update BigQuery access controls
saga update-access --select "group:google_sheets"

# Target a specific environment
saga ingest --target prod --select "tag:daily"   # production (with impersonation)
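
For intuition about what a flag like --workers 8 implies, here is a minimal Python sketch of bounded parallel execution. It is an assumption about the general pattern, not a description of how saga schedules work internally: run_pipeline and run_all are hypothetical names, and the stand-in run function fails on purpose for one pipeline to show that a single failure need not abort the rest.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipeline(name):
    """Hypothetical stand-in for one pipeline run; fails for names ending in '__broken'."""
    if name.endswith("__broken"):
        raise RuntimeError(f"{name} failed")
    return f"{name}: ok"


def run_all(names, workers):
    """Run pipelines with at most `workers` concurrent runs, collecting failures."""
    succeeded, failed = [], {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_pipeline, n): n for n in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                future.result()  # re-raises any exception from the worker
                succeeded.append(name)
            except Exception as exc:
                failed[name] = str(exc)
    return sorted(succeeded), failed


succeeded, failed = run_all(["api__orders", "api__customers", "api__broken"], workers=8)
print(succeeded, failed)
```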

Adding a New Pipeline

Create a YAML config file in configs/<source_type>/ — that's it. The framework auto-discovers configs.

Supported source types out of the box: API, Database (PostgreSQL, MySQL, SQL Server, and more via ConnectorX), Filesystem (GCS, SFTP, local), Google Sheets, and SharePoint.

See the Pipeline Types guide for config examples for each source type, and the Configuration reference for all available fields.
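
To make the auto-discovery idea concrete, the sketch below walks a configs/ directory and derives one pipeline name per YAML file. It is a hedged illustration only: the discover_configs helper is hypothetical, and the <source_type>__<file name> naming rule is an assumption inferred from names such as google_sheets__my_pipeline used elsewhere in this README.

```python
from pathlib import Path
import tempfile


def discover_configs(root):
    """Map an assumed pipeline name to each YAML file under <root>/<source_type>/."""
    found = {}
    for path in sorted(Path(root).glob("*/*")):
        if path.suffix in {".yml", ".yaml"}:
            # Assumed convention: <source_type>__<file stem>.
            found[f"{path.parent.name}__{path.stem}"] = path
    return found


with tempfile.TemporaryDirectory() as tmp:
    configs = Path(tmp) / "configs"
    # Two config files plus one non-YAML file that should be ignored.
    for rel in ("google_sheets/my_pipeline.yml", "filesystem/invoices.yaml", "api/README.md"):
        target = configs / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()
    discovered = sorted(discover_configs(configs))

print(discovered)
```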

Write Dispositions and Historize

The write_disposition field controls what operations are enabled for a pipeline:

| Value | Ingest | Historize | Use Case |
| --- | --- | --- | --- |
| append | Yes | No | Raw event/log data |
| merge | Yes | No | Upsert on primary key |
| replace | Yes | No | Full refresh each run |
| append+historize | Yes | Yes | Snapshot → SCD2 |
| historize | No | Yes | External data → SCD2 |
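
The table above can be read as a lookup from write_disposition to the set of enabled operations, which is roughly the kind of filter that saga list --resource-type applies. The following is a sketch under that reading; the OPERATIONS mapping mirrors the table, while enabled_for and the sample pipeline names are hypothetical.

```python
# Operations enabled by each write_disposition value, as listed in the table.
OPERATIONS = {
    "append": {"ingest"},
    "merge": {"ingest"},
    "replace": {"ingest"},
    "append+historize": {"ingest", "historize"},
    "historize": {"historize"},
}


def enabled_for(resource_type, dispositions):
    """Return the pipelines whose write_disposition enables the given operation."""
    return [name for name, wd in dispositions.items() if resource_type in OPERATIONS[wd]]


# Hypothetical pipelines and their configured write_disposition.
configured = {
    "api__events": "append",
    "google_sheets__budget": "append+historize",
    "external__customers": "historize",
}

ingestable = enabled_for("ingest", configured)
historizable = enabled_for("historize", configured)
print(ingestable, historizable)
```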

Historize transforms raw snapshot data into SCD2 tables with _dlt_valid_from, _dlt_valid_to, and _dlt_is_deleted columns. See the Historize guide for the full reference.
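
As a rough illustration of what turning snapshots into SCD2 rows means, the sketch below folds a sequence of full snapshots into versioned rows carrying the three columns named above. It is not the dlt-saga historize engine: the historize function here is hypothetical, runs in memory rather than in the warehouse, and assumes that a key missing from a snapshot closes the open version and opens a marker row with _dlt_is_deleted set to True.

```python
from datetime import datetime


def _payload(version):
    """Strip the SCD2 bookkeeping columns from a version row."""
    return {c: v for c, v in version.items() if not c.startswith("_dlt_")}


def _open_version(values, taken_at, deleted):
    return {**values, "_dlt_valid_from": taken_at, "_dlt_valid_to": None,
            "_dlt_is_deleted": deleted}


def historize(snapshots, key):
    """Fold time-ordered (timestamp, rows) snapshots into SCD2 version rows."""
    history = []  # every version row, closed or open
    current = {}  # key value -> currently open version row
    for taken_at, rows in snapshots:
        seen = {row[key]: row for row in rows}
        for k, row in seen.items():
            open_row = current.get(k)
            if open_row and not open_row["_dlt_is_deleted"] and _payload(open_row) == row:
                continue  # unchanged: the open version stays open
            if open_row:
                open_row["_dlt_valid_to"] = taken_at  # close the old version
            current[k] = _open_version(row, taken_at, deleted=False)
            history.append(current[k])
        for k in [k for k in current if k not in seen]:
            open_row = current[k]
            if open_row["_dlt_is_deleted"]:
                continue  # already marked as deleted
            open_row["_dlt_valid_to"] = taken_at
            current[k] = _open_version(_payload(open_row), taken_at, deleted=True)
            history.append(current[k])
    return history


snapshots = [
    (datetime(2026, 1, 1), [{"id": 1, "status": "new"}, {"id": 2, "status": "new"}]),
    (datetime(2026, 1, 2), [{"id": 1, "status": "paid"}, {"id": 2, "status": "new"}]),
    (datetime(2026, 1, 3), [{"id": 1, "status": "paid"}]),
]
history = historize(snapshots, key="id")
for version in history:
    print(version)
```

Three snapshots of two records yield four version rows here: id 1 changes once (two versions), and id 2 gets its original version plus a deletion marker opened on the third snapshot.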

Further Reading

  • Getting Started — Full walkthrough: install, init, first pipeline
  • Architecture — Three-layer design, plugin system, execution flow
  • Pipeline Types — Config reference for API, Database, Filesystem, Sheets, SharePoint
  • Configuration — Hierarchical config, all options reference
  • Profiles — Multi-environment setup, service account impersonation
  • Historize (SCD2) — Snapshot tables → slowly changing dimensions
  • CLI Reference — All commands, flags, and the programmatic API
  • Deployment — Orchestration, Cloud Run, worker setup
  • Performance — Parallel execution, worker tuning, backfill
  • Plugin Development — Custom sources, destinations, hooks

Project Structure

dlt-saga/
├── dlt_saga/              # Main package
│   ├── cli.py            #   CLI entry point (saga command)
│   ├── pipelines/        #   Built-in source implementations
│   │   ├── api/          #     Generic REST API pipeline
│   │   ├── database/     #     Database source (ConnectorX)
│   │   ├── filesystem/   #     Filesystem / GCS source
│   │   ├── google_sheets/#     Google Sheets source
│   │   └── sharepoint/   #     SharePoint source
│   ├── historize/        #   SCD2 historization engine
│   ├── destinations/     #   Destination implementations
│   │   ├── bigquery/     #     BigQuery
│   │   └── duckdb/       #     DuckDB (local development)
│   ├── pipeline_config/  #   Config discovery and parsing
│   ├── schemas/          #   Bundled static schemas (dlt_common.json)
│   └── utility/          #   Shared utilities (CLI, naming, orchestration)
├── example/              # Minimal runnable consumer project (DuckDB)
├── wiki/                 # Documentation (synced to GitHub wiki)
└── .dlt/                 # dlt runtime config overrides
