Atomic file ingestion with content-hash idempotency and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.
Filedge
Files are the universal building block of data engineering. Whether data starts in Kafka, Stripe's API, a partner SFTP, or a CDC stream, every reliable pipeline eventually crystallizes it into a file before it touches the warehouse. Filedge is the load boundary built around that fact: atomic per-file ingestion, content-hash idempotency, and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.
Why files?
Streams are continuous; files are discrete. That discreteness is what makes ingestion auditable: a file has a SHA-256, a row count, a state in the audit DB, and a row-level provenance trail in the destination. Every downstream question — did we load this?, replay this, where did this row come from? — has a deterministic anchor.
Filedge starts where the file lands and ends when its rows are committed. Upstream is your choice: dlt or vendor exporters for APIs, Kafka Connect or Vector for queues, rclone for SFTP. Downstream is your warehouse. The hard part in between (atomic commits, dedupe, bounded retries, lineage) is all Filedge does.
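To make the "deterministic anchor" concrete, here is a minimal sketch of fingerprinting a file by content rather than by name. The helper name `fingerprint` is hypothetical and not part of Filedge's API, and it counts newline bytes as a rough stand-in for a row count (a real CSV reader would also handle headers and quoted newlines).

```python
import hashlib
from pathlib import Path


def fingerprint(path: Path, chunk_size: int = 1 << 20) -> tuple[str, int]:
    """Return (SHA-256 hex digest, newline count), reading the file in chunks.

    Hypothetical helper for illustration; not Filedge's actual implementation.
    """
    digest = hashlib.sha256()
    newlines = 0
    with path.open("rb") as fh:
        # Stream fixed-size chunks so large files never sit fully in memory.
        while chunk := fh.read(chunk_size):
            digest.update(chunk)
            newlines += chunk.count(b"\n")
    return digest.hexdigest(), newlines
```

Because the digest depends only on the bytes, a renamed or re-delivered copy of the same file yields the same key, which is what makes "did we load this?" answerable without filename heuristics.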
What it gives you that a hand-rolled DAG doesn't
| Failure mode | Typical pipeline | Filedge |
|---|---|---|
| Half-written tables after a crash | Manual cleanup | Per-file atomic commit, retry-safe by content hash |
| "Did we already load this file?" | Filename heuristics | SHA-256 dedupe at the entry point |
| "Where did this row come from?" | Grep logs | _source_file_hash + _ingested_at on every row |
| Stale lock from a killed worker | Page someone | Reclaimed automatically on next run |
| One bad file blocks the pipeline | Skip and forget | Bounded retry → terminal FAILED with audit |
| Schema drift in destination | Silent corruption | Loud failure with a clear diff |
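As an illustration of the last row, the sketch below shows one way a "clear diff" between a declared schema and a live table could be computed. It uses SQLite's `PRAGMA table_info` as a stand-in destination; the function name `schema_diff` and the shape of its result are assumptions for this example, not Filedge's actual output.

```python
import sqlite3


def schema_diff(conn: sqlite3.Connection, table: str, expected: dict[str, str]) -> dict:
    """Compare expected {column: declared type} against the live table.

    Hypothetical helper; `table` must be a trusted identifier because it is
    interpolated into the PRAGMA statement.
    """
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    actual = {row[1]: row[2].upper() for row in conn.execute(f"PRAGMA table_info({table})")}
    shared = set(expected) & set(actual)
    return {
        "missing": sorted(set(expected) - set(actual)),
        "unexpected": sorted(set(actual) - set(expected)),
        "type_changed": sorted(
            (name, expected[name].upper(), actual[name])
            for name in shared
            if expected[name].upper() != actual[name]
        ),
    }


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount TEXT, note TEXT)")
diff = schema_diff(conn, "orders", {"order_id": "TEXT", "amount": "REAL", "order_date": "TEXT"})
if any(diff.values()):
    # Failing loudly here is the point: refuse to load rather than write into a drifted table.
    print(f"schema drift detected: {diff}")
```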
How it differs from neighbors
- vs Airbyte / Fivetran / dlt — those fetch (paginate APIs, manage cursors). Filedge lands — it takes whatever they produce as files and makes the write to the warehouse audit-grade. Use them as Fetchers in front of Filedge.
- vs Kafka Connect / Flink / Spark Structured Streaming — streaming systems own continuous offsets and incremental state. Filedge owns the file as the unit of work — simpler to reason about, replay, and audit. Materialize queues to files, then ingest.
- vs Airflow + custom Python loaders — same DAG shape, but partial-load corruption, lock reclaim, retry caps, idempotent CDC apply, and row provenance are already wired in.
- vs Iceberg / Delta tables — those are table formats. Filedge is what writes to them (or to plain BigQuery / Postgres / Databricks tables) with the per-file commit guarantee.
Quick start
Requires uv.
uv sync --extra dev # core (SQLite)
uv sync --extra dev --extra postgres # + PostgreSQL
uv sync --extra dev --extra bigquery # + BigQuery
uv sync --extra dev --extra databricks # + Databricks
uv sync --extra dev --extra duckdb # + DuckDB
Declare a pipeline:
# pipeline.yaml
format: csv
dest_table: orders
write_mode: append # append | truncate | cdc
retry_cap: 3
batch_size: 1000
connector:
  type: sqlite
  url: sqlite:///orders.db
columns:
  - { source: order_id, dest: order_id, type: string, required: true }
  - { source: amount, dest: amount, type: float, required: true }
  - { source: order_date, dest: order_date, type: date }
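To show what a `columns:` block like the one above expresses, here is a rough sketch of renaming and coercing one CSV row. The `COERCE` table, the `map_row` helper, and the treatment of empty strings as missing values are all assumptions made for illustration; Filedge's real validation rules may differ.

```python
from datetime import date

# Hypothetical coercion table keyed by the `type:` values from the config above.
COERCE = {
    "string": str,
    "float": float,
    "date": date.fromisoformat,
}

# The same column declarations as the YAML, written as plain Python dicts.
COLUMNS = [
    {"source": "order_id", "dest": "order_id", "type": "string", "required": True},
    {"source": "amount", "dest": "amount", "type": "float", "required": True},
    {"source": "order_date", "dest": "order_date", "type": "date"},
]


def map_row(raw: dict, columns: list[dict]) -> dict:
    """Rename and coerce one raw row; raise ValueError if a required field is absent."""
    out = {}
    for col in columns:
        value = raw.get(col["source"])
        if value is None or value == "":
            if col.get("required", False):
                raise ValueError(f"missing required column: {col['source']}")
            out[col["dest"]] = None  # optional and absent: load as NULL
            continue
        # Coercion errors (for example float("abc")) propagate as ValueError too.
        out[col["dest"]] = COERCE[col["type"]](value)
    return out


row = map_row({"order_id": "A1", "amount": "19.90", "order_date": "2024-03-01"}, COLUMNS)
```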
Run it:
filedge run --dir ./incoming --config pipeline.yaml --audit-db-url sqlite:///filedge.db
# Committed: 3 Failed: 0 Skipped: 0 New: 3 Reclaimed: 0 Retried: 0
filedge status --audit-db-url sqlite:///filedge.db
# PENDING: 0 PROCESSING: 0 COMMITTED: 3 FAILED: 0
Don't know the schema yet? filedge inspect data.csv samples the file and prints a columns: block with confidence tiers ready to paste.
Connectors
The destination is configured via a connector: block in pipeline.yaml. Built-ins:
| Destination | Extra | Notes |
|---|---|---|
| SQLite | (core) | Default for local dev; configure with type: sqlite and a url |
| PostgreSQL | postgres | COPY bulk load; idempotent via per-hash DELETE |
| BigQuery | bigquery | NDJSON staging + load job; job-ID-keyed idempotency (7-day window) |
| Databricks | databricks | Unity Catalog volume staging |
| DuckDB | duckdb | File-based; single-writer, fails fast if locked |
See docs/guides/run.md for full connector config, credentials, and live-integration test setup.
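The "idempotent via per-hash DELETE" note in the table can be sketched as follows. This uses the standard-library `sqlite3` module as a stand-in destination, not the PostgreSQL connector itself; the `commit_file` helper and the `orders` table layout are assumptions for illustration, while `_source_file_hash` and `_ingested_at` are the provenance columns named earlier.

```python
import sqlite3
from datetime import datetime, timezone


def commit_file(conn: sqlite3.Connection, file_hash: str, rows: list[tuple]) -> None:
    """Delete any rows previously loaded from this file, then insert, in one transaction."""
    ingested_at = datetime.now(timezone.utc).isoformat()
    with conn:  # commits on success, rolls back if any statement raises
        conn.execute("DELETE FROM orders WHERE _source_file_hash = ?", (file_hash,))
        conn.executemany(
            "INSERT INTO orders (order_id, amount, _source_file_hash, _ingested_at) "
            "VALUES (?, ?, ?, ?)",
            [(order_id, amount, file_hash, ingested_at) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders ("
    "order_id TEXT NOT NULL, amount REAL NOT NULL, "
    "_source_file_hash TEXT NOT NULL, _ingested_at TEXT NOT NULL)"
)
rows = [("A1", 19.9), ("A2", 5.0)]
commit_file(conn, "abc123", rows)
commit_file(conn, "abc123", rows)  # simulated retry of the same file
count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

Because the delete and the insert share a transaction, a retry replaces the file's rows instead of duplicating them, and a failure partway through leaves the earlier load untouched.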
How a run works
filedge run
├── Reset FAILED below retry_cap → PENDING
├── Reclaim stale PROCESSING locks → PENDING
├── Connector: ensure destination table exists
├── Hash files in watched dir; enqueue new hashes as PENDING
└── For each PENDING file:
    ├── Audit DB: mark PROCESSING (distributed lock)
    ├── Connector: stream rows → commit (idempotent per file_hash)
    └── Audit DB: mark COMMITTED / FAILED
The audit DB and the destination are separate systems. A crash between connector commit and audit mark leaves the file PROCESSING — the next run reclaims it, and the connector's per-hash idempotency guarantees no duplicate rows.
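The state transitions in the run outline can be sketched with a small audit table. Everything here is an assumption made for illustration: the `files` table layout, the helper names, the reading of `retry_cap` as a maximum number of attempts, and the use of a lock timestamp to decide when a PROCESSING row is stale. Filedge's real audit schema may look quite different.

```python
import sqlite3

# Hypothetical audit table: one row per content hash.
SCHEMA = """
CREATE TABLE files (
    file_hash TEXT PRIMARY KEY,
    state     TEXT NOT NULL DEFAULT 'PENDING',
    attempts  INTEGER NOT NULL DEFAULT 0,
    locked_at REAL
)
"""


def enqueue(conn, file_hash):
    """Register a hash as PENDING; return False if the hash is already known."""
    with conn:
        cur = conn.execute("INSERT OR IGNORE INTO files (file_hash) VALUES (?)", (file_hash,))
    return cur.rowcount == 1


def prepare_run(conn, retry_cap, stale_after, now):
    """Reset retryable failures and reclaim stale locks, both back to PENDING."""
    with conn:
        conn.execute(
            "UPDATE files SET state = 'PENDING' WHERE state = 'FAILED' AND attempts < ?",
            (retry_cap,),
        )
        conn.execute(
            "UPDATE files SET state = 'PENDING', locked_at = NULL "
            "WHERE state = 'PROCESSING' AND locked_at < ?",
            (now - stale_after,),
        )


def claim(conn, file_hash, now):
    """Compare-and-set PENDING -> PROCESSING; at most one caller wins."""
    with conn:
        cur = conn.execute(
            "UPDATE files SET state = 'PROCESSING', attempts = attempts + 1, locked_at = ? "
            "WHERE file_hash = ? AND state = 'PENDING'",
            (now, file_hash),
        )
    return cur.rowcount == 1


def finish(conn, file_hash, ok):
    """Record the terminal outcome of one attempt and release the lock."""
    with conn:
        conn.execute(
            "UPDATE files SET state = ?, locked_at = NULL WHERE file_hash = ?",
            ("COMMITTED" if ok else "FAILED", file_hash),
        )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
enqueue(conn, "abc123")
claim(conn, "abc123", now=0.0)  # a worker takes the file, then is killed
prepare_run(conn, retry_cap=3, stale_after=600.0, now=1000.0)  # the next run reclaims it
```

After the reclaim, the file is PENDING again and will be re-committed. That is safe only because the connector's commit is idempotent per file hash, which is why the two mechanisms are described together.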
More
- Guides: run · scale · inspect · preview · validate · compact · healthcheck · requeue · CDC files · API sources · queue sources
- Domain model: CONTEXT.md
- Architecture decisions:
  - ADR-0001: Single-transaction commit
  - ADR-0002: Content hash as idempotency key
  - ADR-0003: Strict-mode validation
  - ADR-0004: Audit DB / Connector split
  - ADR-0005: SFTP out of scope
  - ADR-0006: API sources fetched to files
  - ADR-0007: Queue source ingestion model
  - ADR-0008: Schema inference confidence tiers
  - ADR-0009: Warehouse CDC applied-file markers
License
Apache 2.0 — see LICENSE.