
Atomic file ingestion with content-hash idempotency and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.


Filedge


Files are the universal building block of data engineering. Whether data starts in Kafka, Stripe's API, a partner SFTP, or a CDC stream, every reliable pipeline eventually crystallizes it into a file before it touches the warehouse. Filedge is the load boundary built around that fact: atomic per-file ingestion, content-hash idempotency, and a full audit trail — into SQLite, PostgreSQL, BigQuery, Databricks, or DuckDB.

Why files?

Streams are continuous; files are discrete. That discreteness is what makes ingestion auditable: a file has a SHA-256, a row count, a state in the audit DB, and a row-level provenance trail in the destination. Every downstream question — did we load this?, replay this, where did this row come from? — has a deterministic anchor.

Filedge starts where the file lands and ends when its rows are committed. Upstream is your choice: dlt or vendor exporters for APIs, Kafka Connect or Vector for queues, rclone for SFTP. Downstream is your warehouse. The hard part in between (atomic commits, dedupe, retries, lineage) is all Filedge does.

What it gives you that a hand-rolled DAG doesn't

| Failure mode | Typical pipeline | Filedge |
|---|---|---|
| Half-written tables after a crash | Manual cleanup | Per-file atomic commit, retry-safe by content hash |
| "Did we already load this file?" | Filename heuristics | SHA-256 dedupe at the entry point |
| "Where did this row come from?" | Grep logs | `_source_file_hash` + `_ingested_at` on every row |
| Stale lock from a killed worker | Page someone | Reclaimed automatically on next run |
| One bad file blocks the pipeline | Skip and forget | Bounded retry → terminal FAILED with audit |
| Schema drift in destination | Silent corruption | Loud failure with a clear diff |

How it differs from neighbors

  • vs Airbyte / Fivetran / dlt — those fetch (paginate APIs, manage cursors). Filedge lands — it takes whatever they produce as files and makes the write to the warehouse audit-grade. Use them as Fetchers in front of Filedge.
  • vs Kafka Connect / Flink / Spark Structured Streaming — streaming systems own continuous offsets and incremental state. Filedge owns the file as the unit of work — simpler to reason about, replay, and audit. Materialize queues to files, then ingest.
  • vs Airflow + custom Python loaders — same DAG shape, but partial-load corruption, lock reclaim, retry caps, idempotent CDC apply, and row provenance are already wired in.
  • vs Iceberg / Delta tables — those are table formats. Filedge is what writes to them (or to plain BigQuery / Postgres / Databricks tables) with the per-file commit guarantee.

Quick start

Requires uv.

uv sync --extra dev                          # core (SQLite)
uv sync --extra dev --extra postgres         # + PostgreSQL
uv sync --extra dev --extra bigquery         # + BigQuery
uv sync --extra dev --extra databricks       # + Databricks
uv sync --extra dev --extra duckdb           # + DuckDB

Declare a pipeline:

# pipeline.yaml
format: csv
dest_table: orders
write_mode: append          # append | truncate | cdc
retry_cap: 3
batch_size: 1000

connector:
  type: sqlite
  url: sqlite:///orders.db

columns:
  - { source: order_id,   dest: order_id,   type: string,  required: true }
  - { source: amount,     dest: amount,     type: float,   required: true }
  - { source: order_date, dest: order_date, type: date }
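To make the `columns:` block concrete, here is a small sketch of how such a mapping could be applied to CSV rows. The `map_row` helper, the `CASTS` table, the assumption that dates are ISO-formatted, and the choice to store missing optional values as `None` are all illustrative guesses, not Filedge's documented behaviour.

```python
import csv
import io
from datetime import date

# Mirrors the columns: block from pipeline.yaml above.
COLUMNS = [
    {"source": "order_id", "dest": "order_id", "type": "string", "required": True},
    {"source": "amount", "dest": "amount", "type": "float", "required": True},
    {"source": "order_date", "dest": "order_date", "type": "date"},
]

# Hypothetical type coercions; dates are assumed to be ISO 8601 (YYYY-MM-DD).
CASTS = {"string": str, "float": float, "date": date.fromisoformat}


def map_row(raw: dict, columns: list = COLUMNS) -> dict:
    """Rename and coerce one raw CSV row according to the column spec."""
    out = {}
    for col in columns:
        value = raw.get(col["source"])
        if value is None or value == "":
            if col.get("required"):
                raise ValueError(f"missing required column {col['source']!r}")
            out[col["dest"]] = None  # optional and absent: store NULL
            continue
        out[col["dest"]] = CASTS[col["type"]](value)
    return out


sample = "order_id,amount,order_date\nA1,19.99,2024-03-01\nA2,5,\n"
rows = [map_row(r) for r in csv.DictReader(io.StringIO(sample))]
```

A row missing a `required: true` value raises instead of loading a partial record, which is one plausible reading of what `required` is for.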

Run it:

filedge run --dir ./incoming --config pipeline.yaml --audit-db-url sqlite:///filedge.db
# Committed: 3  Failed: 0  Skipped: 0  New: 3  Reclaimed: 0  Retried: 0

filedge status --audit-db-url sqlite:///filedge.db
# PENDING: 0  PROCESSING: 0  COMMITTED: 3  FAILED: 0

Don't know the schema yet? filedge inspect data.csv samples the file and prints a columns: block with confidence tiers ready to paste.

Connectors

The destination is configured via a connector: block in pipeline.yaml. Built-ins:

| Destination | Extra | Notes |
|---|---|---|
| SQLite | (core) | Default for local dev; configure with `type: sqlite` and a `url` |
| PostgreSQL | `postgres` | COPY bulk load; idempotent via per-hash DELETE |
| BigQuery | `bigquery` | NDJSON staging + load job; job-ID-keyed idempotency (7-day window) |
| Databricks | `databricks` | Unity Catalog volume staging |
| DuckDB | `duckdb` | File-based; single-writer, fails fast if locked |

See docs/guides/run.md for full connector config, credentials, and live-integration test setup.
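The "idempotent via per-hash DELETE" idea from the table can be sketched with the standard-library `sqlite3` module. This is a simplified illustration, not the PostgreSQL connector itself: the `commit_file` function and the exact table layout are assumptions, and the real connector uses COPY rather than row-by-row inserts.

```python
import sqlite3


def commit_file(conn: sqlite3.Connection, file_hash: str, rows: list) -> None:
    """Load one file's rows atomically; safe to call again for the same hash."""
    # "with conn" wraps the block in one transaction:
    # commit on success, rollback if anything raises.
    with conn:
        # Remove rows left behind by an earlier attempt at this same file.
        conn.execute("DELETE FROM orders WHERE _source_file_hash = ?", (file_hash,))
        conn.executemany(
            "INSERT INTO orders (order_id, amount, _source_file_hash) VALUES (?, ?, ?)",
            [(order_id, amount, file_hash) for order_id, amount in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (order_id TEXT, amount REAL, _source_file_hash TEXT)"
)

rows = [("A1", 10.0), ("A2", 20.5)]
commit_file(conn, "abc123", rows)
commit_file(conn, "abc123", rows)  # simulated retry of the same file

count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(count)  # 2: the retry replaced the rows instead of duplicating them
```

Because the delete and the insert share one transaction, a failure partway through leaves the table as it was before the attempt.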

How a run works

filedge run
├── Reset FAILED below retry_cap → PENDING
├── Reclaim stale PROCESSING locks → PENDING
├── Connector: ensure destination table exists
├── Hash files in watched dir; enqueue new hashes as PENDING
└── For each PENDING file:
    ├── Audit DB: mark PROCESSING        (distributed lock)
    ├── Connector: stream rows → commit  (idempotent per file_hash)
    └── Audit DB: mark COMMITTED / FAILED

The audit DB and the destination are separate systems. A crash between connector commit and audit mark leaves the file PROCESSING — the next run reclaims it, and the connector's per-hash idempotency guarantees no duplicate rows.
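The state transitions in the run outline can be sketched against a small SQLite audit table. Everything here is an assumption made for illustration: the `files` table layout, the `prepare_run` and `claim` names, counting attempts at claim time, and treating a lock as stale once it is older than a `lock_timeout`. Filedge's real audit schema and staleness rule may differ.

```python
import sqlite3

# Hypothetical audit table: one row per file hash.
SCHEMA = """
CREATE TABLE files (
    file_hash TEXT PRIMARY KEY,
    state     TEXT NOT NULL,
    attempts  INTEGER NOT NULL DEFAULT 0,
    locked_at REAL
)
"""


def prepare_run(conn, retry_cap: int, now: float, lock_timeout: float) -> None:
    """First two steps of a run: reset retryable failures, reclaim stale locks."""
    with conn:
        conn.execute(
            "UPDATE files SET state = 'PENDING' "
            "WHERE state = 'FAILED' AND attempts < ?",
            (retry_cap,),
        )
        conn.execute(
            "UPDATE files SET state = 'PENDING', locked_at = NULL "
            "WHERE state = 'PROCESSING' AND locked_at < ?",
            (now - lock_timeout,),
        )


def claim(conn, file_hash: str, now: float) -> bool:
    """Mark a PENDING file PROCESSING; only one caller can win the update."""
    with conn:
        cur = conn.execute(
            "UPDATE files SET state = 'PROCESSING', attempts = attempts + 1, "
            "locked_at = ? WHERE file_hash = ? AND state = 'PENDING'",
            (now, file_hash),
        )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    [
        ("h_failed", "FAILED", 1, None),       # below retry_cap: retried
        ("h_capped", "FAILED", 3, None),       # at retry_cap: stays terminal
        ("h_stale", "PROCESSING", 1, 0.0),     # lock from a killed worker
        ("h_live", "PROCESSING", 1, 990.0),    # lock still fresh
    ],
)
conn.commit()

prepare_run(conn, retry_cap=3, now=1000.0, lock_timeout=60.0)
states = dict(conn.execute("SELECT file_hash, state FROM files"))
```

The conditional `UPDATE ... WHERE state = 'PENDING'` is what makes the claim act as a lock: a second worker issuing the same statement updates zero rows and moves on.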


License

Apache 2.0 — see LICENSE.
