DataForge
A lightweight ELT framework in pure Python: SQL transformations, Python ingestion, data contracts, and snapshots in a single CLI tool.
DataForge is a lightweight, pure-Python ELT framework designed for data engineers who want the power of SQL-based transformations and Python-based ingestion in a single, dependency-minimal tool — without the overhead of heavy frameworks.
It combines declarative SQL modeling (inspired by dbt) with programmable data ingestion (inspired by dlt), delivering a complete ELT pipeline from a single command-line interface.
Overview
DataForge organizes data pipelines around two building blocks:
- Python models — extract data from any source (APIs, files, databases) and load it into your target warehouse
- SQL models — transform raw data into analytics-ready tables and views using Jinja2-templated SQL
Both run in a single DAG with automatic dependency resolution, parallel execution, and full observability.
Installation
pip install dataforge-dfg
DataForge requires Python 3.11 or higher. The only mandatory dependencies are jinja2 and pyyaml. You are responsible for installing the database driver of your choice:
pip install duckdb # DuckDB
pip install psycopg2-binary # PostgreSQL
pip install mysql-connector-python # MySQL
# SQLite is included in the Python standard library
Quickstart
mkdir my_project && cd my_project
dfg init
dfg run
dfg init detects which database drivers are installed in your environment and guides you through the initial configuration. The result is a ready-to-run project with example models and a structured directory layout.
Core Concepts
SQL Models
SQL files in models/ are compiled through Jinja2 before execution. Dependencies between models are declared with ref() and resolved automatically into a DAG.
-- models/stg_orders.sql
{{ config(materialized='table') }}

SELECT
    id AS order_id,
    customer_id,
    total_amount,
    status,
    created_at
FROM raw_orders
WHERE status != 'cancelled'
-- models/fct_revenue.sql
{{ config(materialized='table') }}

SELECT
    DATE(created_at) AS date,
    SUM(total_amount) AS revenue
FROM {{ ref('stg_orders') }}
GROUP BY DATE(created_at)
Supported materializations: table, view, incremental.
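An incremental model might look like the sketch below. Note that the `is_incremental()` and `{{ this }}` helpers follow the dbt-style convention and are assumptions here; DataForge's actual incremental API may use different names.

-- models/fct_events_daily.sql (illustrative sketch; is_incremental() and
-- {{ this }} are assumed, dbt-style, and may differ in DataForge)
{{ config(materialized='incremental') }}

SELECT
    DATE(created_at) AS date,
    COUNT(*) AS events
FROM {{ ref('stg_orders') }}
{% if is_incremental() %}
WHERE created_at > (SELECT MAX(date) FROM {{ this }})
{% endif %}
GROUP BY DATE(created_at)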
Python Models
Python files in models/ handle the extract-and-load phase. Any file that defines a function model(context) returning a list of dictionaries is a valid model. Schema creation and evolution are handled automatically.
# models/ingest_customers.py
import json
import urllib.request

DEPENDENCIES = []

def model(context):
    with urllib.request.urlopen("https://api.example.com/customers") as r:
        return json.loads(r.read())["data"]
The context object provides access to project configuration, upstream model data via ref(), and a persistent state mechanism for incremental ingestion.
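A minimal sketch of incremental ingestion using that state mechanism. This assumes `context.state` behaves like a dict persisted between runs; the exact attribute name and semantics are assumptions, and the in-memory source list stands in for a real API call.

```python
# models/ingest_orders.py
# Sketch only: context.state is assumed to be a dict-like store that
# DataForge persists between runs.
DEPENDENCIES = []

# In-memory stand-in for an external API response
SOURCE_ROWS = [
    {"id": 1, "total_amount": 9.50},
    {"id": 2, "total_amount": 3.00},
]

def model(context):
    # High-water mark from the previous run (0 on the first run)
    last_seen = context.state.get("max_order_id", 0)
    new_rows = [r for r in SOURCE_ROWS if r["id"] > last_seen]
    if new_rows:
        context.state["max_order_id"] = max(r["id"] for r in new_rows)
    return new_rows
```

On the first run all rows are returned and the high-water mark is stored; a second run with unchanged source data returns nothing.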
Data Contracts
Column-level tests are declared in schema.yml and validated with dfg test:
version: 1
models:
  - name: stg_orders
    columns:
      - name: order_id
        tests:
          - not_null
          - unique
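The two built-in tests above reduce to simple column predicates. A plain-Python rendering of what the contract asserts (the rows here are illustrative sample data):

```python
# Plain-Python rendering of the not_null and unique checks declared above.
rows = [
    {"order_id": 101},
    {"order_id": 102},
    {"order_id": 103},
]
values = [r["order_id"] for r in rows]

not_null_passes = all(v is not None for v in values)  # no NULLs in the column
unique_passes = len(values) == len(set(values))       # no duplicate keys
```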
Snapshots — SCD Type 2
Snapshots track historical changes to records over time using the Slowly Changing Dimensions Type 2 pattern. Control columns (dfg_valid_from, dfg_valid_to, dfg_is_active) are added automatically.
{% snapshot snapshot_customers %}
{{ config(unique_key='id', updated_at='updated_at') }}
SELECT * FROM {{ ref('stg_customers') }}
{% endsnapshot %}
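Given the control columns above, querying a snapshot table might look like the sketch below (the exact active-flag semantics are assumed):

-- Current state of every customer (one row per id)
SELECT *
FROM snapshot_customers
WHERE dfg_is_active = TRUE;

-- Full history for a single customer, oldest version first
SELECT id, dfg_valid_from, dfg_valid_to
FROM snapshot_customers
WHERE id = 42
ORDER BY dfg_valid_from;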
Command Reference
| Command | Description |
|---|---|
| `dfg init` | Initialize a new project in the current directory |
| `dfg run` | Execute the full pipeline (ingestion + transformation) |
| `dfg ingest` | Run Python models only |
| `dfg transform` | Run SQL models only |
| `dfg test` | Validate data contracts |
| `dfg compile` | Compile Jinja2 templates and generate the manifest (dry run) |
| `dfg seed` | Load static CSV files into the database |
| `dfg snapshot` | Run SCD Type 2 snapshots |
| `dfg docs [--serve]` | Generate the HTML lineage graph |
| `dfg debug` | Diagnose environment and database connectivity |
| `dfg log <ID>` | Search the daily log by session ID |
Project Structure
my_project/
├── dfg_project.toml # Project configuration
├── profiles.toml # Database credentials (do not commit)
├── models/ # SQL and Python models
│ └── schema.yml # Data contracts
├── snapshots/ # SCD Type 2 snapshot definitions
├── seeds/ # Static CSV files
└── target/ # Generated artifacts (do not commit)
├── manifest.json
└── run_results.json
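A profiles.toml might look like the fragment below. This is an illustrative sketch; the actual key names DataForge expects are not documented here and may differ.

# profiles.toml -- illustrative sketch; actual key names may differ
[default]
adapter = "duckdb"
database = "warehouse.duckdb"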
Supported Databases
DataForge works with any database that implements the Python DB-API 2.0 (PEP 249) specification. Tested adapters:
| Database | Driver |
|---|---|
| DuckDB | duckdb |
| PostgreSQL | psycopg2 |
| MySQL | mysql-connector-python |
| SQLite | built-in |
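DB-API 2.0 compliance means every driver exposes the same connect/cursor/execute surface, which is what lets a tool like DataForge stay driver-agnostic. The standard-library sqlite3 module illustrates that surface:

```python
# The PEP 249 surface -- connect(), cursor(), execute(), fetch* --
# demonstrated with sqlite3, the driver bundled in the standard library.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE raw_orders (id INTEGER, status TEXT)")
cur.executemany(
    "INSERT INTO raw_orders VALUES (?, ?)",
    [(1, "paid"), (2, "cancelled"), (3, "paid")],
)
cur.execute("SELECT COUNT(*) FROM raw_orders WHERE status != 'cancelled'")
active_orders = cur.fetchone()[0]  # 2
conn.close()
```

Swapping sqlite3 for duckdb or psycopg2 changes only the `connect()` call; the cursor interactions stay the same.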
License
MIT License. See LICENSE for details.