High-performance Unix file deduplication engine with tiered short-circuit logic and xxHash128.
bgate-unix
High-performance Unix file deduplication engine with tiered short-circuit logic.
Overview
bgate-unix is a fingerprinting gatekeeper that performs strict binary identity deduplication using tiered short-circuit logic. Designed for high-volume Unix pipelines where disk I/O is the bottleneck.
Key Features:
- Sub-millisecond duplicate rejection via O(1) index lookups
- Journaled file moves with crash recovery
- BLOB-based xxHash128 storage for low-collision identity on trusted input
- Atomic link/unlink moves (no TOCTOU races)
The 4-Tier Engine
Incoming File
│
▼
┌─────────────────────────────────────────┐
│ TIER 0: Empty Check │
│ file_size == 0 → SKIP │
│ Cost: stat() only │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ TIER 1: Size Uniqueness │
│ Size not in DB → UNIQUE │
│ Cost: SQLite lookup │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ TIER 2: Fringe Hash (xxh64) │
│ First 64KB + Last 64KB + size │
│ (Last 64KB overlaps if file < 128KB) │
│ Hash not in DB → UNIQUE │
│ Cost: 128KB read max │
└─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────┐
│ TIER 3: Full Hash (xxh128) │
│ Entire file in 256KB chunks │
│ Hash in DB → DUPLICATE │
│ Hash not in DB → UNIQUE │
│ Cost: Full file read │
└─────────────────────────────────────────┘
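The lookup order in the diagram can be sketched as follows. This is a minimal illustration, not the engine's code: `hashlib.blake2b` stands in for xxh64 and xxh128 so the sketch needs only the standard library, plain Python sets stand in for the SQLite index, and the helper names (`fringe_hash`, `full_hash`, `index_file`, `classify`) are hypothetical. How the real engine populates its index is not described here, so the sketch simply indexes every key eagerly.

```python
import hashlib
import os

FRINGE = 64 * 1024  # bytes read from each end of the file (Tier 2)
CHUNK = 256 * 1024  # read size for the full hash (Tier 3)


def fringe_hash(path: str, size: int) -> bytes:
    """Hash the first 64KB, the last 64KB, and the file size."""
    h = hashlib.blake2b(digest_size=8)  # stand-in for xxh64
    with open(path, "rb") as f:
        h.update(f.read(FRINGE))
        f.seek(max(0, size - FRINGE))  # overlaps the head if size < 128KB
        h.update(f.read(FRINGE))
    h.update(size.to_bytes(8, "little"))
    return h.digest()


def full_hash(path: str) -> bytes:
    """Hash the entire file in 256KB chunks."""
    h = hashlib.blake2b(digest_size=16)  # stand-in for xxh128
    with open(path, "rb") as f:
        while chunk := f.read(CHUNK):
            h.update(chunk)
    return h.digest()


def index_file(path: str, sizes: set, fringes: set, fulls: set) -> None:
    """Record all three keys for a known file (eager, for simplicity)."""
    size = os.stat(path).st_size
    sizes.add(size)
    fringes.add((fringe_hash(path, size), size))
    fulls.add(full_hash(path))


def classify(path: str, sizes: set, fringes: set, fulls: set) -> tuple:
    """Return (result, tier), stopping at the first tier that decides."""
    size = os.stat(path).st_size
    if size == 0:
        return ("SKIPPED", 0)    # Tier 0: stat() only
    if size not in sizes:
        return ("UNIQUE", 1)     # Tier 1: no file of this size seen
    if (fringe_hash(path, size), size) not in fringes:
        return ("UNIQUE", 2)     # Tier 2: at most 128KB read
    if full_hash(path) in fulls:
        return ("DUPLICATE", 3)  # Tier 3: full read, match found
    return ("UNIQUE", 3)         # Tier 3: full read, no match
```

Each tier is only reached when the cheaper one before it could not rule the file out, so a full read happens only for files that already match on size and on both fringes.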
Installation
# Using uv (recommended)
uv add bgate-unix
# Using pip
pip install bgate-unix
Requirements: Unix-based OS (Linux, macOS, BSD). Windows is not supported.
CLI Usage
bgate-unix provides a high-performance CLI for pipeline integration.
# Scan a directory and move unique files to vault
bgate scan ./incoming --into ./vault --recursive
# Show index statistics
bgate stats --db dedupe.db
# Recover from an interrupted session
bgate recover --db dedupe.db
Quick Start
As a CLI tool
# Install
pip install bgate-unix
# Scan and move unique files to tiered storage
bgate scan ./incoming --into ./vault --recursive
As a Library
from pathlib import Path
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("dedupe.db") as deduper:
    result = deduper.process_file("incoming/document.pdf")
    match result.result:
        case DedupeResult.UNIQUE:
            print(f"New file (tier {result.tier})")
        case DedupeResult.DUPLICATE:
            print(f"Duplicate of {result.duplicate_of}")
        case DedupeResult.SKIPPED:
            print(f"Skipped: {result.error or 'empty'}")
Usage
File Movement Pipeline
Unique files are atomically moved to a processing directory:
from pathlib import Path
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult  # needed for the comparison below

with FileDeduplicator("index.db", processing_dir=Path("processed/")) as deduper:
    for result in deduper.process_directory("inbound/", recursive=True):
        if result.result == DedupeResult.UNIQUE:
            # result.path is the new location in processed/
            # result.original_path is the source location
            # result.stored_path is also the new location (explicit field)
            print(f"Moved: {result.original_path.name} -> {result.stored_path.name}")
Important: processing_dir must be on the same filesystem as source files (required for atomic os.link).
Batch Processing
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("index.db") as deduper:
    results = list(deduper.process_directory("incoming/", recursive=True))
    unique = sum(1 for r in results if r.result == DedupeResult.UNIQUE)
    dupes = sum(1 for r in results if r.result == DedupeResult.DUPLICATE)
    print(f"Unique: {unique}, Duplicates: {dupes}")
    print(f"Stats: {deduper.stats}")
Database & Recovery
- Strict Schema Enforcement: Engines will hard-stop if a database version mismatch is detected.
- Orphan Recovery: If a crash occurs during file moves, orphaned files are automatically recovered on next connect.
- Emergency Logging: If the database becomes unavailable during a critical I/O operation, orphan records are written to an atomic .jsonl log file for manual recovery.
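As an illustration of the emergency log idea, the sketch below appends one JSON record per line and fsyncs it before returning. It is an assumption-laden example: the helper name `append_orphan_record` is hypothetical, and the field names are borrowed from the `orphan_registry` columns rather than taken from the real log format, which is not documented here.

```python
import json
import os
import time


def append_orphan_record(log_path: str, original_path: str,
                         orphan_path: str, file_size: int) -> None:
    """Append one JSON line to the log and flush it to disk (hypothetical helper)."""
    record = {
        # Field names are assumptions mirroring the orphan_registry columns.
        "original_path": original_path,
        "orphan_path": orphan_path,
        "file_size": file_size,
        "created_at": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
    }
    line = (json.dumps(record) + "\n").encode("utf-8")
    # O_APPEND positions every write at the current end of the file.
    fd = os.open(log_path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        os.write(fd, line)  # one write call per record
        os.fsync(fd)        # make the record durable before returning
    finally:
        os.close(fd)
```

Because every record is a complete line, the file can be read back line by line with `json.loads` during manual recovery.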
Technical Details
Threat Model & Hashing
bgate-unix is designed for trusted internal pipelines.
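- xxHash128: Used as an extremely low-collision identifier for high-volume data (128-bit output, 2^128 possible values). For trusted inputs, accidental collisions are treated as negligible; they are not mathematically impossible, because xxHash128 is a non-cryptographic hash.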
- Deduplication Priority: Speed and durability are prioritized over security.
- Not for Adversarial Input: xxHash128 is not cryptographically secure. Do not rely on bgate-unix for untrusted or malicious files where hash collisions could be intentionally engineered. A cryptographically secure mode (such as BLAKE3 or SHA-256) may be added in future versions.
Sharded Storage Layout
Unique files are stored in a 2-level hex-sharded structure inside processing_dir:
- Path: {processing_dir}/{id[0:2]}/{id[2:16]}{original_suffix}
- Note: id is the full content hash when available (Tier 3), otherwise a unique UUID (Tier 1/2) to preserve "Move-then-Hash" performance.
- Example: processed/a3/bc4f91e2d0f8.pdf
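The path template above can be expressed as a small function. This is a sketch that follows the template literally; `sharded_path` and `new_file_id` are hypothetical names, and using `uuid4().hex` for the UUID fallback is an assumption about its string form.

```python
import uuid
from pathlib import Path
from typing import Optional


def new_file_id(full_hash_hex: Optional[str] = None) -> str:
    """Use the content hash when known, else a random UUID (hex form assumed)."""
    return full_hash_hex if full_hash_hex is not None else uuid.uuid4().hex


def sharded_path(processing_dir: Path, file_id: str, original_suffix: str) -> Path:
    """Build {processing_dir}/{id[0:2]}/{id[2:16]}{original_suffix}."""
    return processing_dir / file_id[0:2] / f"{file_id[2:16]}{original_suffix}"
```

The first two hex characters select one of 256 shard directories, which keeps any single directory from growing too large.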
Database Schema
SQLite with BLOB-based hash storage:
-- Tier 1: Size lookup (existence set)
CREATE TABLE size_index (
    file_size INTEGER PRIMARY KEY
) WITHOUT ROWID;

-- Tier 2: Fringe hash (BLOB)
CREATE TABLE fringe_index (
    fringe_hash BLOB NOT NULL,
    file_size INTEGER NOT NULL,
    file_path TEXT NOT NULL,
    PRIMARY KEY (fringe_hash, file_size)
) WITHOUT ROWID;

-- Tier 3: Full hash (BLOB)
CREATE TABLE full_index (
    full_hash BLOB PRIMARY KEY,
    file_path TEXT NOT NULL
) WITHOUT ROWID;

-- Crash recovery tables
CREATE TABLE orphan_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    original_path TEXT NOT NULL,
    orphan_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    recovered_at TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE(orphan_path)
);

CREATE TABLE move_journal (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_path TEXT NOT NULL,
    dest_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    phase TEXT NOT NULL DEFAULT 'planned',
    completed_at TEXT
);

CREATE TABLE schema_version (
    version INTEGER PRIMARY KEY,
    applied_at TEXT NOT NULL
);
Pragmas: WAL mode, synchronous=FULL, 64MB cache, 256MB mmap.
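For reference, the listed pragmas could be applied with the standard `sqlite3` module roughly like this. The function name `open_index` is hypothetical, and the exact statements the engine issues are an assumption; only the settings themselves come from the line above.

```python
import sqlite3


def open_index(db_path: str) -> sqlite3.Connection:
    """Open a database and apply the documented pragmas (sketch)."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode=WAL")     # write-ahead logging
    conn.execute("PRAGMA synchronous=FULL")     # fsync on every commit
    conn.execute("PRAGMA cache_size=-65536")    # negative value is in KiB: 64MB
    conn.execute("PRAGMA mmap_size=268435456")  # 256MB; SQLite may cap this
    return conn
```

WAL mode requires a file-backed database, so it has no effect on `:memory:` connections.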
Atomic File Moves
Uses hard-link + unlink (os.link / Path.unlink) for atomic same-filesystem moves.
Durability Guarantees
- Signal Deferral: SIGINT/SIGTERM signals are deferred during critical move operations using critical_section().
- Fsync Ordering: File and directory durability is strictly enforced:
  - After linking destination, newly created parent directories are fsynced (top-down).
  - The destination directory is fsynced to persist the new link.
  - The source file is unlinked.
  - The source directory is fsynced to persist the removal.
- FS Enforcement: Cross-device moves are explicitly rejected (EXDEV error) to maintain atomicity.
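The link, fsync, unlink, fsync ordering described above can be sketched with standard `os` calls. This is a simplified illustration, not the engine's implementation: `durable_move` and `fsync_dir` are hypothetical names, the destination directory is assumed to exist already (so the top-down fsync of newly created parents is omitted), and signal deferral is left out.

```python
import errno
import os


def fsync_dir(path: str) -> None:
    """Flush a directory entry change (new or removed name) to disk."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def durable_move(src: str, dst: str) -> None:
    """Move src to dst on the same filesystem via hard link + unlink."""
    src_dir = os.path.dirname(os.path.abspath(src))
    dst_dir = os.path.dirname(os.path.abspath(dst))
    try:
        # link() never overwrites: it raises FileExistsError if dst exists.
        os.link(src, dst)
    except OSError as exc:
        if exc.errno == errno.EXDEV:
            # Different filesystem: refuse instead of falling back to a copy.
            raise OSError(errno.EXDEV, "cross-device move rejected") from exc
        raise
    fsync_dir(dst_dir)  # persist the new link
    os.unlink(src)      # remove the old name
    fsync_dir(src_dir)  # persist the removal
```

At every point in this sequence the file is reachable under at least one name, which is what makes an interrupted move recoverable.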
Crash Recovery
Move operations use phase-based journaling: planned → moving → completed.
On startup, the engine automatically recovers incomplete entries:
- planned: Move never started → Marked as failed.
- moving: File may have been moved but not yet indexed → Engine attempts atomic rollback (link back to source + fsync + unlink destination).
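A startup pass over the journal might look like the sketch below. It assumes the `move_journal` table from the schema section; the function name `recover_journal` is hypothetical, and marking rolled-back `moving` entries as `failed` is an assumption, since the phase they end in is not stated above.

```python
import os
import sqlite3


def _fsync_dir(path: str) -> None:
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def recover_journal(conn: sqlite3.Connection) -> None:
    """Resolve journal entries left in 'planned' or 'moving' (sketch)."""
    rows = conn.execute(
        "SELECT id, source_path, dest_path, phase FROM move_journal "
        "WHERE phase IN ('planned', 'moving')"
    ).fetchall()
    for entry_id, src, dst, phase in rows:
        if phase == "moving" and os.path.exists(dst):
            if not os.path.exists(src):
                os.link(dst, src)  # restore the source name first
                _fsync_dir(os.path.dirname(os.path.abspath(src)))
            os.unlink(dst)         # then drop the destination name
            _fsync_dir(os.path.dirname(os.path.abspath(dst)))
        # 'planned' entries never touched the filesystem; the final phase
        # for rolled-back 'moving' entries is an assumption of this sketch.
        conn.execute(
            "UPDATE move_journal SET phase = 'failed' WHERE id = ?",
            (entry_id,),
        )
    conn.commit()
```

Linking back before unlinking mirrors the forward move: the file always has at least one name, so a crash during recovery can be retried safely.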
Development
git clone https://github.com/mr3od/bgate-unix.git
cd bgate-unix
uv sync --dev
# Run tests
uv run pytest
# Lint
uv run ruff check .
uv run ruff format .
# Type check
uv run ty check src/
License
MIT