dlt-saga

Config-driven data ingestion and historization framework, built on dlt.

Why dlt-saga?

dlt is an excellent Python library for building data pipelines. dlt-saga adds the operational layer that teams need to run dlt at scale:

| What you get | How |
| --- | --- |
| Zero-code pipelines | Drop a YAML file in configs/; no Python needed for common sources |
| SCD2 historization | write_disposition: append+historize turns any snapshot table into a full change history with _dlt_valid_from / _dlt_valid_to |
| dbt-style selectors | saga ingest --select "tag:daily,group:api"; supports union, intersection, and glob patterns |
| Multi-environment profiles | profiles.yml with dev/prod targets, service account impersonation, per-environment datasets |
| Plugin architecture | Register custom sources and destinations via packages.yml or Python entry points; no framework fork needed |
| Cloud-agnostic | BigQuery, Databricks, and DuckDB are included; more destinations can be added via plugins |

If you are already using dlt directly and finding yourself re-implementing incremental state management, environment switching, or SCD2 transforms — dlt-saga is the config layer you are building.

Installation

pip install dlt-saga[bigquery]          # BigQuery
pip install dlt-saga[databricks,azure]  # Databricks on Azure
pip install dlt-saga                    # DuckDB only (no cloud dependencies)

Quick Start

# 1. Create and scaffold a project
mkdir my-pipelines && cd my-pipelines
saga init                               # prompts for destination and credentials

# 2. Authenticate to your destination (skip for DuckDB)
#    See: https://github.com/Glitni/dlt-saga/wiki/Getting-Started

# 3. List available pipelines
saga list

# 4. Run a pipeline
saga ingest --select "example__sample"

See the Getting Started guide for a full walkthrough, or browse example/ for a minimal runnable setup.

Local execution is the default. Use --orchestrate to fan out to parallel workers (this requires the orchestration: key to be configured in saga_project.yml).
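To make the fan-out idea concrete, here is a minimal sketch that runs several pipelines on a pool of local threads. It is not how --orchestrate distributes work: run_pipeline and fan_out are hypothetical names, and threads only stand in for whatever workers the orchestration setting actually provisions.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def run_pipeline(name: str) -> str:
    """Hypothetical placeholder for running a single pipeline."""
    return f"{name}: ok"


def fan_out(names, workers=4):
    """Run each pipeline on a worker thread and collect one result per name."""
    results = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(run_pipeline, name): name for name in names}
        for future in as_completed(futures):
            name = futures[future]
            try:
                results[name] = future.result()
            except Exception as exc:
                # One failing pipeline should not hide the others' results.
                results[name] = f"failed: {exc}"
    return results
```

Collecting results per pipeline, rather than stopping at the first exception, keeps one broken source from masking the status of the rest.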

CLI Commands

All commands are subcommands under the saga entry point and share common options: --select, --verbose, --profile, --target.
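If you call the CLI from a scheduler or a test harness, a small helper that assembles the argument list can keep quoting mistakes out of the picture. The sketch below uses only the options named above; build_saga_command is a hypothetical helper, and it assumes --profile and --target each take one value and --verbose is a plain switch.

```python
import shlex


def build_saga_command(command, select=None, profile=None, target=None, verbose=False):
    """Assemble an argv list for a saga subcommand (hypothetical helper)."""
    argv = ["saga", command]
    if select:
        argv += ["--select", select]
    if profile:
        # Assumption: --profile takes a single value.
        argv += ["--profile", profile]
    if target:
        argv += ["--target", target]
    if verbose:
        # Assumption: --verbose is a boolean switch.
        argv.append("--verbose")
    return argv


argv = build_saga_command("ingest", select="tag:daily group:filesystem", target="prod")
# shlex.join renders the list as a copy-pasteable shell line with safe quoting.
print(shlex.join(argv))
```

Passing the list straight to subprocess.run(argv, check=True) avoids shell quoting entirely, which matters for selectors that contain spaces.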

Selectors (dbt-style)

Selectors filter which pipelines to run. They work across all commands.

| Syntax | Meaning | Example |
| --- | --- | --- |
| name | Exact pipeline name | --select google_sheets__my_pipeline |
| *glob* | Glob pattern | --select "*balance*" |
| tag:name | Filter by tag | --select "tag:daily" |
| group:name | Filter by source group | --select "group:google_sheets" |
| space-separated | UNION (OR) | --select "tag:daily group:filesystem" |
| comma-separated | INTERSECTION (AND) | --select "tag:daily,group:google_sheets" |
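The selector rules in the table can be sketched in a few lines of Python. This is an illustration of the semantics, not dlt-saga's implementation: the Pipeline class, matches_term, and select are hypothetical names, and it assumes commas bind tighter than spaces, so a selector is a union of intersections.

```python
from dataclasses import dataclass, field
from fnmatch import fnmatchcase


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical stand-in for a discovered pipeline config."""
    name: str
    group: str
    tags: frozenset = field(default_factory=frozenset)


def matches_term(pipeline: Pipeline, term: str) -> bool:
    """Match one selector term: tag:NAME, group:NAME, or a name/glob."""
    if term.startswith("tag:"):
        return term[len("tag:"):] in pipeline.tags
    if term.startswith("group:"):
        return term[len("group:"):] == pipeline.group
    # A term without wildcards only matches the exact pipeline name.
    return fnmatchcase(pipeline.name, term)


def select(pipelines, selector: str):
    """Space-separated parts are ORed; comma-separated terms are ANDed."""
    selected = []
    for pipeline in pipelines:
        for part in selector.split():
            if all(matches_term(pipeline, term) for term in part.split(",")):
                selected.append(pipeline)
                break  # one matching part is enough for a union
    return selected
```

With pipelines tagged daily in the google_sheets and api groups, "tag:daily,group:google_sheets" keeps only the first, while "tag:daily group:filesystem" keeps every daily pipeline plus everything in the filesystem group.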

Common Examples

# List pipelines
saga list                                        # All enabled pipelines
saga list --resource-type ingest                 # Ingest-enabled only
saga list --resource-type historize              # Historize-enabled only
saga list --select "tag:daily"                   # Filtered by tag

# Ingest
saga ingest --select "tag:daily"
saga ingest --select "group:api" --workers 8
saga ingest --full-refresh --select "my_pipeline"
saga ingest --select "group:api" --start-value-override "2026-01-01"  # Backfill

# Historize (SCD2)
saga historize --select "tag:daily"
saga historize --full-refresh --select "filesystem__*"

# Run (ingest + historize sequentially)
saga run --select "tag:daily"

# Update BigQuery access controls
saga update-access --select "group:google_sheets"

# Target a specific environment
saga ingest --target prod --select "tag:daily"   # production (with impersonation)

Adding a New Pipeline

Create a YAML config file in configs/<source_type>/ — that's it. The framework auto-discovers configs.

Supported source types out of the box: API, Database (PostgreSQL, MySQL, SQL Server, and more via ConnectorX), Filesystem (GCS, SFTP, local), Google Sheets, and SharePoint.

See the Pipeline Types guide for config examples for each source type, and the Configuration reference for all available fields.
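Auto-discovery can be pictured as a directory walk. The sketch below is an assumption-laden illustration rather than the framework's code: discover_configs is a hypothetical function, and the <source_type>__<file name> naming rule is only inferred from pipeline names such as google_sheets__my_pipeline used elsewhere in this README.

```python
from pathlib import Path


def discover_configs(root: Path) -> dict[str, Path]:
    """Map assumed pipeline names to YAML files found in root/<source_type>/."""
    found = {}
    for path in sorted(root.glob("*/*")):
        if path.is_file() and path.suffix in {".yml", ".yaml"}:
            source_type = path.parent.name
            # Assumption: pipeline name is "<source_type>__<file stem>".
            found[f"{source_type}__{path.stem}"] = path
    return found
```

Calling discover_configs(Path("configs")) on a project with configs/google_sheets/my_pipeline.yml would return a single entry keyed google_sheets__my_pipeline under these assumptions.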

Write Dispositions and Historize

The write_disposition field controls what operations are enabled for a pipeline:

| Value | Ingest | Historize | Use Case |
| --- | --- | --- | --- |
| append | Yes | No | Raw event/log data |
| merge | Yes | No | Upsert on primary key |
| replace | Yes | No | Full refresh each run |
| append+historize | Yes | Yes | Snapshot → SCD2 |
| historize | No | Yes | External data → SCD2 |
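The table above reads naturally as a lookup from write_disposition to the operations it enables. The sketch below transcribes it into a dictionary; OPERATIONS and is_enabled are hypothetical names used for illustration, not part of the dlt-saga API.

```python
# Operations enabled per write_disposition, transcribed from the table above.
OPERATIONS = {
    "append": {"ingest"},
    "merge": {"ingest"},
    "replace": {"ingest"},
    "append+historize": {"ingest", "historize"},
    "historize": {"historize"},
}


def is_enabled(write_disposition: str, operation: str) -> bool:
    """Return True if the operation applies to the given write_disposition."""
    if write_disposition not in OPERATIONS:
        raise ValueError(f"unknown write_disposition: {write_disposition!r}")
    return operation in OPERATIONS[write_disposition]
```

Under this reading, a pipeline with write_disposition: historize would be skipped by an ingest run and picked up by a historize run.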

Historize transforms raw snapshot data into SCD2 tables with _dlt_valid_from, _dlt_valid_to, and _dlt_is_deleted columns. See the Historize guide for the full reference.
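For readers new to SCD2, the following sketch shows one way snapshots can be folded into versioned rows with these three columns. It is a simplified in-memory illustration, not the historize engine: the function names are hypothetical, and it assumes that an open version has _dlt_valid_to set to None and that a key missing from a snapshot gets a tombstone row with _dlt_is_deleted set to True.

```python
META = ("_dlt_valid_from", "_dlt_valid_to", "_dlt_is_deleted")


def payload(version):
    """Strip the SCD2 bookkeeping columns from a version row."""
    return {col: val for col, val in version.items() if col not in META}


def historize(snapshots, key="id"):
    """Build SCD2 rows from (timestamp, rows) snapshots ordered by time."""
    history = []  # every version, closed or open
    current = {}  # key value -> open version (same dict object as in history)

    def open_version(row, valid_from, deleted=False):
        version = {**row, "_dlt_valid_from": valid_from,
                   "_dlt_valid_to": None, "_dlt_is_deleted": deleted}
        history.append(version)
        current[row[key]] = version

    for taken_at, rows in snapshots:
        seen = set()
        for row in rows:
            seen.add(row[key])
            previous = current.get(row[key])
            if previous is not None:
                if not previous["_dlt_is_deleted"] and payload(previous) == row:
                    continue  # unchanged: keep the open version as-is
                previous["_dlt_valid_to"] = taken_at  # close the old version
            open_version(row, taken_at)
        for k, previous in list(current.items()):
            if k not in seen and not previous["_dlt_is_deleted"]:
                # Assumed deletion handling: close the row, open a tombstone.
                previous["_dlt_valid_to"] = taken_at
                open_version(payload(previous), taken_at, deleted=True)
    return history
```

Feeding it three daily snapshots in which one row changes and another disappears yields four versions: two closed originals, one open updated row, and one open tombstone.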

Further Reading

  • Getting Started — Full walkthrough: install, init, first pipeline
  • Architecture — Three-layer design, plugin system, execution flow
  • Pipeline Types — Config reference for API, Database, Filesystem, Sheets, SharePoint
  • Configuration — Hierarchical config, all options reference
  • Profiles — Multi-environment setup, service account impersonation
  • Historize (SCD2) — Snapshot tables → slowly changing dimensions
  • CLI Reference — All commands, flags, and the programmatic API
  • Deployment — Orchestration, Cloud Run, worker setup
  • Performance — Parallel execution, worker tuning, backfill
  • Plugin Development — Custom sources, destinations, hooks

Project Structure

dlt-saga/
├── dlt_saga/              # Main package
│   ├── cli.py            #   CLI entry point (saga command)
│   ├── pipelines/        #   Built-in source implementations
│   │   ├── api/          #     Generic REST API pipeline
│   │   ├── database/     #     Database source (ConnectorX)
│   │   ├── filesystem/   #     Filesystem / GCS source
│   │   ├── google_sheets/ #     Google Sheets source
│   │   └── sharepoint/   #     SharePoint source
│   ├── historize/        #   SCD2 historization engine
│   ├── destinations/     #   Destination implementations
│   │   ├── bigquery/     #     BigQuery
│   │   └── duckdb/       #     DuckDB (local development)
│   ├── pipeline_config/  #   Config discovery and parsing
│   ├── schemas/          #   Bundled static schemas (dlt_common.json)
│   └── utility/          #   Shared utilities (CLI, naming, orchestration)
├── example/              # Minimal runnable consumer project (DuckDB)
├── wiki/                 # Documentation (synced to GitHub wiki)
└── .dlt/                 # dlt runtime config overrides
