
A lightweight ELT framework in pure Python — SQL transformations, Python ingestion, data contracts and snapshots in a single CLI tool


DataForge


DataForge is a lightweight, pure-Python ELT framework designed for data engineers who want the power of SQL-based transformations and Python-based ingestion in a single, dependency-minimal tool — without the overhead of heavy frameworks.

It combines declarative SQL modeling (inspired by dbt) with programmable data ingestion (inspired by dlt), delivering a complete ELT pipeline from a single command-line interface.


Overview

DataForge organizes data pipelines around two building blocks:

  • Python models — extract data from any source (APIs, files, databases) and load it into your target warehouse
  • SQL models — transform raw data into analytics-ready tables and views using Jinja2-templated SQL

Both run in a single DAG with automatic dependency resolution, parallel execution, and full observability.
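The scheduler's internals aren't documented here, but dependency resolution can be pictured as a topological sort over the graph of `ref()` calls. A minimal sketch using the standard library (the model names are illustrative, not part of DataForge's API):

```python
from graphlib import TopologicalSorter

# Hypothetical dependency graph: each model maps to the set of models it
# ref()s. Python and SQL models live in the same graph.
deps = {
    "ingest_customers": set(),
    "stg_orders": {"ingest_customers"},
    "fct_revenue": {"stg_orders"},
}

# static_order() yields every model after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())
print(order)  # ['ingest_customers', 'stg_orders', 'fct_revenue']
```

Models with no path between them in this graph are the ones eligible for parallel execution.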


Installation

pip install dataforge-dfg

DataForge requires Python 3.11 or higher. The only mandatory dependencies are jinja2 and pyyaml; you install the database driver for your target warehouse separately:

pip install duckdb                  # DuckDB
pip install psycopg2-binary         # PostgreSQL
pip install mysql-connector-python  # MySQL
# SQLite is included in the Python standard library

Quickstart

mkdir my_project && cd my_project
dfg init
dfg run

dfg init detects which database drivers are installed in your environment and guides you through the initial configuration. The result is a ready-to-run project with example models and a structured directory layout.


Core Concepts

SQL Models

SQL files in models/ are compiled through Jinja2 before execution. Dependencies between models are declared with ref() and resolved automatically into a DAG.

-- models/stg_orders.sql
{{ config(materialized='table') }}

SELECT
    id          AS order_id,
    customer_id,
    total_amount,
    status,
    created_at
FROM raw_orders
WHERE status != 'cancelled'

-- models/fct_revenue.sql
{{ config(materialized='table') }}

SELECT
    DATE(created_at) AS date,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY DATE(created_at)

Supported materializations: table, view, incremental.
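For the incremental case, a model might look like the sketch below. Note that `is_incremental()` and `{{ this }}` are assumptions borrowed from dbt's conventions, not confirmed DataForge helpers; inspect `dfg compile` output for the actual template API.

```sql
-- models/fct_revenue_incremental.sql
-- Sketch only: is_incremental() and {{ this }} are assumed helpers.
{{ config(materialized='incremental', unique_key='date') }}

SELECT
    DATE(created_at)  AS date,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
-- on incremental runs, only reprocess days newer than the target table
WHERE DATE(created_at) > (SELECT MAX(date) FROM {{ this }})
{% endif %}
GROUP BY DATE(created_at)
```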

Python Models

Python files in models/ handle the extraction and loading phase. Any function named model(context) that returns a list of dictionaries is a valid model. Schema creation and evolution are handled automatically.

# models/ingest_customers.py
import json
import urllib.request

DEPENDENCIES = []  # no upstream models

def model(context):
    # Fetch records and return them as a list of dicts; DataForge
    # loads the rows and manages the target schema automatically.
    with urllib.request.urlopen("https://api.example.com/customers") as r:
        return json.loads(r.read())["data"]

The context object provides access to project configuration, upstream model data via ref(), and a persistent state mechanism for incremental ingestion.
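The state mechanism is what enables incremental ingestion. The shape of the API below (`context.state` as a persisted dict) is an assumption based on the description above, not documented behavior; the stub context exists only to make the sketch runnable:

```python
# Sketch of incremental ingestion driven by persistent state.
# FakeContext and context.state are illustrative assumptions, not
# DataForge's real interface.

class FakeContext:
    def __init__(self):
        self.state = {}  # the framework would persist this between runs

SOURCE_ROWS = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]

def model(context):
    # Only ingest rows newer than the high-water mark from the last run.
    last_id = context.state.get("max_customer_id", 0)
    rows = [r for r in SOURCE_ROWS if r["id"] > last_id]
    if rows:
        context.state["max_customer_id"] = max(r["id"] for r in rows)
    return rows

ctx = FakeContext()
print(len(model(ctx)))  # 2 rows on the first run
print(model(ctx))       # [] on the second run: nothing new
```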

Data Contracts

Column-level tests are declared in schema.yml and validated with dfg test:

version: 1
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
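For intuition, the two tests declared above amount to the following checks, written here as a plain-Python sketch rather than DataForge's actual implementation (which runs in SQL against the warehouse):

```python
# What dfg test validates, conceptually: each test returns offending rows.

def not_null(rows, column):
    return [r for r in rows if r[column] is None]

def unique(rows, column):
    seen, dupes = set(), []
    for r in rows:
        v = r[column]
        if v in seen:
            dupes.append(r)
        seen.add(v)
    return dupes

orders = [{"order_id": 1}, {"order_id": 2}, {"order_id": 2}]
print(not_null(orders, "order_id"))  # [] -> not_null passes
print(unique(orders, "order_id"))    # [{'order_id': 2}] -> unique fails
```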

Snapshots — SCD Type 2

Snapshots track historical changes to records over time using the Slowly Changing Dimensions Type 2 pattern. Control columns (dfg_valid_from, dfg_valid_to, dfg_is_active) are added automatically.

{% snapshot snapshot_customers %}
{{ config(unique_key='id', updated_at='updated_at') }}
SELECT * FROM {{ ref('stg_customers') }}
{% endsnapshot %}
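The bookkeeping behind SCD Type 2 can be sketched in plain Python. The control-column names match those listed above; the merge logic (single key, no delete handling) is a deliberate simplification of whatever DataForge actually executes:

```python
from datetime import date

def snapshot(history, row, today, unique_key="id"):
    """Expire the active version of a changed record, then append the new one."""
    for rec in history:
        if rec[unique_key] == row[unique_key] and rec["dfg_is_active"]:
            if rec["updated_at"] == row["updated_at"]:
                return history                  # unchanged: no-op
            rec["dfg_is_active"] = False        # close out the old version
            rec["dfg_valid_to"] = today
    history.append({**row,
                    "dfg_valid_from": today,
                    "dfg_valid_to": None,
                    "dfg_is_active": True})
    return history

hist = []
snapshot(hist, {"id": 1, "plan": "free", "updated_at": "2024-01-01"}, date(2024, 1, 1))
snapshot(hist, {"id": 1, "plan": "pro", "updated_at": "2024-02-01"}, date(2024, 2, 1))
print(len(hist))                 # 2 versions of customer 1
print(hist[0]["dfg_is_active"])  # False: the 'free' row was expired
```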

Command Reference

Command              Description
dfg init             Initialize a new project in the current directory
dfg run              Execute the full pipeline (ingestion + transformation)
dfg ingest           Run Python models only
dfg transform        Run SQL models only
dfg test             Validate data contracts
dfg compile          Compile Jinja2 templates and generate the manifest (dry run)
dfg seed             Load static CSV files into the database
dfg snapshot         Run SCD Type 2 snapshots
dfg docs [--serve]   Generate the HTML lineage graph (optionally serve it)
dfg debug            Diagnose environment and database connectivity
dfg log <ID>         Search the daily log by session ID

Project Structure

my_project/
├── dfg_project.toml    # Project configuration
├── profiles.toml       # Database credentials (do not commit)
├── models/             # SQL and Python models
│   └── schema.yml      # Data contracts
├── snapshots/          # SCD Type 2 snapshot definitions
├── seeds/              # Static CSV files
└── target/             # Generated artifacts (do not commit)
    ├── manifest.json
    └── run_results.json

Supported Databases

DataForge works with any database that implements the Python DB-API 2.0 (PEP 249) specification. Tested adapters:

Database     Driver
DuckDB       duckdb
PostgreSQL   psycopg2
MySQL        mysql-connector-python
SQLite       built-in (sqlite3)
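This works because, presumably, only the minimal surface that PEP 249 guarantees (connect, cursor, execute, fetch) is needed. A sketch with the stdlib sqlite3 driver; swapping in duckdb or psycopg2 leaves the calls unchanged:

```python
import sqlite3

# Any PEP 249 driver exposes the same connect/cursor/execute surface.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE raw_orders (id INTEGER, total_amount REAL)")
cur.executemany("INSERT INTO raw_orders VALUES (?, ?)", [(1, 9.5), (2, 20.0)])
cur.execute("SELECT SUM(total_amount) FROM raw_orders")
total = cur.fetchone()[0]
print(total)  # 29.5
conn.close()
```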

License

MIT License. See LICENSE for details.
