Atomic file ingestion with content-hash idempotency and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.

Filedge

Files are the universal building block of data engineering. Whether data starts in Kafka, Stripe's API, a partner SFTP, or a CDC stream, every reliable pipeline eventually crystallizes it into a file before it touches the warehouse. Filedge is the load boundary built around that fact: atomic per-file ingestion, content-hash idempotency, and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.

Why files?

Streams are continuous; files are discrete. That discreteness is what makes ingestion auditable: a file has a SHA-256, a row count, a state in the audit DB, and a row-level provenance trail in the destination. Every downstream question — did we load this?, replay this, where did this row come from? — has a deterministic anchor.
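
The content hash is the anchor everything else hangs on. As a minimal sketch of how such a hash could be computed (the function name `file_sha256` and the chunk size are illustrative assumptions, not Filedge's actual internals), a file can be hashed in fixed-size chunks so that large files never have to fit in memory:

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks.

    Hypothetical helper for illustration; not part of Filedge's API.
    """
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # iter() with a sentinel stops when read() returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Because the hash depends only on the bytes, renaming or re-delivering the same file yields the same key, which is what makes a filename-independent "did we load this?" check possible.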

Filedge starts where the file lands and ends when its rows are committed. Upstream is your choice: dlt or vendor exporters for APIs, Kafka Connect or Vector for queues, rclone for SFTP. Downstream is your warehouse. The hard part in between (atomic commits, dedupe, bounded retries, lineage) is all Filedge does.

What it gives you that a hand-rolled DAG doesn't

Failure mode	Typical pipeline	Filedge
Half-written tables after a crash	Manual cleanup	Per-file atomic commit, retry-safe by content hash
"Did we already load this file?"	Filename heuristics	SHA-256 dedupe at the entry point
"Where did this row come from?"	Grep logs	`_source_file_hash` + `_ingested_at` on every row
Stale lock from a killed worker	Page someone	Reclaimed automatically on next run
One bad file blocks the pipeline	Skip and forget	Bounded retry → terminal FAILED with audit
Schema drift in destination	Silent corruption	Loud failure with a clear diff
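
The row-provenance entry in the table can be pictured as stamping every row before it is written. This is a sketch under assumptions: only the column names `_source_file_hash` and `_ingested_at` come from the table above; the helper `with_provenance` and the ISO-8601 timestamp format are hypothetical.

```python
from datetime import datetime, timezone
from typing import Iterable, Iterator, Optional


def with_provenance(
    rows: Iterable[dict],
    file_hash: str,
    ingested_at: Optional[datetime] = None,
) -> Iterator[dict]:
    """Yield copies of each row tagged with its source file hash and load time.

    Hypothetical helper; the timestamp format is an assumption.
    """
    # One timestamp per file, so all rows from the same load share it.
    stamp = (ingested_at or datetime.now(timezone.utc)).isoformat()
    for row in rows:
        yield {**row, "_source_file_hash": file_hash, "_ingested_at": stamp}
```

With those two columns in place, "where did this row come from?" becomes a lookup of `_source_file_hash` against the audit DB rather than a log search.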

How it differs from neighbors

vs Airbyte / Fivetran / dlt — those fetch (paginate APIs, manage cursors). Filedge lands — it takes whatever they produce as files and makes the write to the warehouse audit-grade. Use them as Fetchers in front of Filedge.
vs Kafka Connect / Flink / Spark Structured Streaming — streaming systems own continuous offsets and incremental state. Filedge owns the file as the unit of work — simpler to reason about, replay, and audit. Materialize queues to files, then ingest.
vs Airflow + custom Python loaders — same DAG shape, but partial-load corruption, lock reclaim, retry caps, idempotent CDC apply, and row provenance are already wired in.
vs Iceberg / Delta tables — those are table formats. Filedge is what writes to them (or to plain BigQuery / Postgres / Databricks tables) with the per-file commit guarantee.

Quick start

Requires uv.

uv sync --extra dev                          # core (SQLite)
uv sync --extra dev --extra postgres         # + PostgreSQL
uv sync --extra dev --extra bigquery         # + BigQuery
uv sync --extra dev --extra databricks       # + Databricks
uv sync --extra dev --extra duckdb           # + DuckDB

Declare a pipeline:

# pipeline.yaml
format: csv
dest_table: orders
write_mode: append          # append | truncate | cdc
retry_cap: 3
batch_size: 1000

connector:
  type: sqlite
  url: sqlite:///orders.db

columns:
  - { source: order_id,   dest: order_id,   type: string,  required: true }
  - { source: amount,     dest: amount,     type: float,   required: true }
  - { source: order_date, dest: order_date, type: date }
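
To make the `columns:` block concrete, here is a rough sketch of what applying it to one CSV row might look like. The `map_row` function, the `CASTS` table, and the ISO date format are assumptions made for illustration; Filedge's real validation and casting rules may differ.

```python
from datetime import date

# Mirrors the columns: block from pipeline.yaml above.
COLUMNS = [
    {"source": "order_id", "dest": "order_id", "type": "string", "required": True},
    {"source": "amount", "dest": "amount", "type": "float", "required": True},
    {"source": "order_date", "dest": "order_date", "type": "date"},
]

# Assumed casts: dates are taken to be ISO formatted (YYYY-MM-DD).
CASTS = {"string": str, "float": float, "date": date.fromisoformat}


def map_row(raw: dict, columns: list = COLUMNS) -> dict:
    """Rename and cast one raw CSV row according to the column mapping."""
    out = {}
    for col in columns:
        value = raw.get(col["source"])
        if value is None or value == "":
            if col.get("required", False):
                raise ValueError(f"missing required column {col['source']!r}")
            out[col["dest"]] = None  # optional column left empty
            continue
        out[col["dest"]] = CASTS[col["type"]](value)
    return out
```

A row with an empty `order_date` would pass with `None` in that field, while a row with an empty `amount` would be rejected.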

Run it:

filedge run --dir ./incoming --config pipeline.yaml --audit-db-url sqlite:///filedge.db
# Committed: 3  Failed: 0  Skipped: 0  New: 3  Reclaimed: 0  Retried: 0

filedge status --audit-db-url sqlite:///filedge.db
# PENDING: 0  PROCESSING: 0  COMMITTED: 3  FAILED: 0

Don't know the schema yet? filedge inspect data.csv samples the file and prints a columns: block with confidence tiers ready to paste.

Connectors

The destination is configured via a connector: block in pipeline.yaml. Built-ins:

Destination	Extra	Notes
SQLite	(core)	Default for local dev; configure with `type: sqlite` and a `url`
PostgreSQL	`postgres`	`COPY` bulk load; idempotent via per-hash DELETE
BigQuery	`bigquery`	NDJSON staging + load job; job-ID-keyed idempotency (7-day window)
Databricks	`databricks`	Unity Catalog volume staging
DuckDB	`duckdb`	File-based; single-writer, fails fast if locked

See docs/guides/run.md for full connector config, credentials, and live-integration test setup.
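
The "idempotent via per-hash DELETE" note for PostgreSQL describes a general pattern: delete any rows already tagged with the file's hash, then insert, all in one transaction. The sketch below shows that pattern with the standard-library `sqlite3` module as a stand-in, so it runs without a server. The table layout and the `load_file` function are assumptions, and the real connector uses `COPY` rather than row inserts.

```python
import sqlite3


def load_file(conn: sqlite3.Connection, file_hash: str, rows: list) -> None:
    """Replace all rows for one file hash in a single transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        # Remove anything a previous (possibly crashed) attempt left behind.
        conn.execute("DELETE FROM orders WHERE _source_file_hash = ?", (file_hash,))
        conn.executemany(
            "INSERT INTO orders (order_id, amount, _source_file_hash) VALUES (?, ?, ?)",
            [(order_id, amount, file_hash) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id TEXT, amount REAL, _source_file_hash TEXT)")

rows = [("A1", 9.5), ("A2", 12.0)]
load_file(conn, "abc123", rows)
load_file(conn, "abc123", rows)  # replaying the same file adds no duplicates

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After both calls `count` is still 2, which is the property that lets a reclaimed file be loaded again safely.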

How a run works

filedge run
├── Reset FAILED below retry_cap → PENDING
├── Reclaim stale PROCESSING locks → PENDING
├── Connector: ensure destination table exists
├── Hash files in watched dir; enqueue new hashes as PENDING
└── For each PENDING file:
    ├── Audit DB: mark PROCESSING        (distributed lock)
    ├── Connector: stream rows → commit  (idempotent per file_hash)
    └── Audit DB: mark COMMITTED / FAILED

The audit DB and the destination are separate systems. A crash between connector commit and audit mark leaves the file PROCESSING — the next run reclaims it, and the connector's per-hash idempotency guarantees no duplicate rows.
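
The state transitions in the run tree can be sketched as a small audit table with conditional updates. Everything below is an assumption made for illustration: the `files` table, its columns, the time-based `lock_ttl` rule for deciding that a lock is stale, and the function names. Only the states and the order of steps come from the description above.

```python
import sqlite3
import time

SCHEMA = """
CREATE TABLE IF NOT EXISTS files (
    file_hash TEXT PRIMARY KEY,
    state     TEXT NOT NULL DEFAULT 'PENDING',
    attempts  INTEGER NOT NULL DEFAULT 0,
    locked_at REAL
)
"""


def prepare_run(conn, retry_cap, lock_ttl, now=None):
    """Reset retryable failures and reclaim stale locks before processing."""
    now = time.time() if now is None else now
    with conn:
        conn.execute(
            "UPDATE files SET state = 'PENDING' "
            "WHERE state = 'FAILED' AND attempts < ?",
            (retry_cap,),
        )
        conn.execute(
            "UPDATE files SET state = 'PENDING', locked_at = NULL "
            "WHERE state = 'PROCESSING' AND locked_at < ?",
            (now - lock_ttl,),
        )


def enqueue(conn, file_hash):
    """Register a hash as PENDING; return False if it was already known."""
    with conn:
        cur = conn.execute(
            "INSERT OR IGNORE INTO files (file_hash) VALUES (?)", (file_hash,)
        )
    return cur.rowcount == 1


def claim(conn, file_hash, now=None):
    """Move PENDING -> PROCESSING; only one caller can win this update."""
    now = time.time() if now is None else now
    with conn:
        cur = conn.execute(
            "UPDATE files SET state = 'PROCESSING', attempts = attempts + 1, "
            "locked_at = ? WHERE file_hash = ? AND state = 'PENDING'",
            (now, file_hash),
        )
    return cur.rowcount == 1


def finish(conn, file_hash, ok):
    """Record the outcome of the connector commit."""
    with conn:
        conn.execute(
            "UPDATE files SET state = ?, locked_at = NULL WHERE file_hash = ?",
            ("COMMITTED" if ok else "FAILED", file_hash),
        )
```

The conditional `WHERE state = 'PENDING'` in `claim` is what acts as the lock in this sketch: a second worker's update matches zero rows and it moves on. A file that reaches `retry_cap` attempts is no longer reset by `prepare_run` and stays in FAILED.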

More

Guides: run · scale · inspect · preview · validate · compact · healthcheck · requeue · CDC files · API sources · queue sources
Domain model: CONTEXT.md
Architecture decisions:

License

Apache 2.0 — see LICENSE.

Release history

This version

0.1.1

May 25, 2026

0.1.0

May 25, 2026

Download files

Source Distribution

filedge-0.1.1.tar.gz (293.9 kB)

Uploaded May 25, 2026 Source

Built Distribution

filedge-0.1.1-py3-none-any.whl (54.2 kB)

Uploaded May 25, 2026 Python 3

File details

Details for the file filedge-0.1.1.tar.gz.

File metadata

Download URL: filedge-0.1.1.tar.gz
Upload date: May 25, 2026
Size: 293.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.16

File hashes

Hashes for filedge-0.1.1.tar.gz
Algorithm	Hash digest
SHA256	`6fdeb63f1bb152f514796af3d039cf666bc40cc4a914618cb68af5e169f4f2f6`
MD5	`0ddce2697c3028d00a12d419e09bc280`
BLAKE2b-256	`fdeb73987e67923226ba0054e7c5be803d01d0e31543d9333f4fee3ec551eb26`

File details

Details for the file filedge-0.1.1-py3-none-any.whl.

File metadata

Download URL: filedge-0.1.1-py3-none-any.whl
Upload date: May 25, 2026
Size: 54.2 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: uv/0.11.16

File hashes

Hashes for filedge-0.1.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`fada579a75a9a4a6fad7300594dd2b64c371a002b6866292ef431528501ddf72`
MD5	`06e1d837384bd9a32984c374436f1a5f`
BLAKE2b-256	`37a521f9a5dd872437f0a07082373ea0c60a46b9401611e723ba03331c697c9a`
