Skip to main content

High-performance Unix file deduplication engine with tiered short-circuit logic and xxHash128.

Project description

bgate-unix

High-performance Unix file deduplication engine with tiered short-circuit logic.

PyPI CI Python 3.11+ License: MIT

Overview

bgate-unix is a fingerprinting gatekeeper that performs strict binary identity deduplication using tiered short-circuit logic. Designed for high-volume Unix pipelines where disk I/O is the bottleneck.

Key Features:

  • Sub-millisecond duplicate rejection via O(1) index lookups
  • Journaled file moves with crash recovery
  • BLOB-based xxHash128 storage for collision-proof identity
  • Atomic link/unlink moves (no TOCTOU races)

The 4-Tier Engine

Incoming File
     │
     ▼
┌─────────────────────────────────────────┐
│  TIER 0: Empty Check                    │
│  file_size == 0 → SKIP                  │
│  Cost: stat() only                      │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│  TIER 1: Size Uniqueness                │
│  Size not in DB → UNIQUE                │
│  Cost: SQLite lookup                    │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│  TIER 2: Fringe Hash (xxh64)            │
│  First 64KB + Last 64KB + size          │
│  (Last 64KB overlaps if file < 128KB)   │
│  Hash not in DB → UNIQUE                │
│  Cost: 128KB read max                   │
└─────────────────────────────────────────┘
     │
     ▼
┌─────────────────────────────────────────┐
│  TIER 3: Full Hash (xxh128)             │
│  Entire file in 256KB chunks            │
│  Hash in DB → DUPLICATE                 │
│  Hash not in DB → UNIQUE                │
│  Cost: Full file read                   │
└─────────────────────────────────────────┘

Installation

# Using uv (recommended)
uv add bgate-unix

# Using pip
pip install bgate-unix

Requirements: Unix-based OS (Linux, macOS, BSD). Windows is not supported.

CLI Usage

bgate-unix provides a high-performance CLI for pipeline integration.

# Scan a directory and move unique files to vault
bgate scan ./incoming --into ./vault --recursive

# Show index statistics
bgate stats --db dedupe.db

# Recover from an interrupted session
bgate recover --db dedupe.db

Quick Start

As a CLI tool

# Install
pip install bgate-unix

# Scan and move unique files to tiered storage
bgate scan ./incoming --into ./vault --recursive

As a Library

from pathlib import Path
from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("dedupe.db") as deduper:
    result = deduper.process_file("incoming/document.pdf")
    
    match result.result:
        case DedupeResult.UNIQUE:
            print(f"New file (tier {result.tier})")
        case DedupeResult.DUPLICATE:
            print(f"Duplicate of {result.duplicate_of}")
        case DedupeResult.SKIPPED:
            print(f"Skipped: {result.error or 'empty'}")

Usage

File Movement Pipeline

Unique files are atomically moved to a processing directory:

from pathlib import Path
from bgate_unix import FileDeduplicator

with FileDeduplicator("index.db", processing_dir=Path("processed/")) as deduper:
    for result in deduper.process_directory("inbound/", recursive=True):
        if result.result == DedupeResult.UNIQUE:
            # result.path is the new location in processed/
            # result.original_path is the source location
            # result.stored_path is also the new location (explicit field)
            print(f"Moved: {result.original_path.name} -> {result.stored_path.name}")

Important: processing_dir must be on the same filesystem as source files (required for atomic os.link).

Batch Processing

from bgate_unix import FileDeduplicator
from bgate_unix.engine import DedupeResult

with FileDeduplicator("index.db") as deduper:
    results = list(deduper.process_directory("incoming/", recursive=True))
    
    unique = sum(1 for r in results if r.result == DedupeResult.UNIQUE)
    dupes = sum(1 for r in results if r.result == DedupeResult.DUPLICATE)
    
    print(f"Unique: {unique}, Duplicates: {dupes}")
    print(f"Stats: {deduper.stats}")

Database & Recovery

  • Strict Schema Enforcement: Engines will hard-stop if a database version mismatch is detected.
  • Orphan Recovery: If a crash occurs during file moves, orphaned files are automatically recovered on next connect.
  • Emergency Logging: If the database becomes unavailable during a critical I/O operation, orphan records are written to an atomic .jsonl log file for manual recovery.

Technical Details

Threat Model & Hashing

bgate-unix is designed for trusted internal pipelines.

  • xxHash128: Used as an extremely low-collision identifier for high-volume data (2^128 range). For trusted inputs, collisions are treated as mathematically impossible.
  • Deduplication Priority: Speed and durability are prioritized over security.
  • Not for Adversarial Input: If you are processing untrusted/malicious files where hash collisions could be intentionally engineered, use a cryptographically secure mode (like BLAKE3 or SHA-256) which may be added in future versions.

Sharded Storage Layout

Unique files are stored in a 2-level hex-sharded structure inside processing_dir:

  • Path: {processing_dir}/{id[0:2]}/{id[2:16]}{original_suffix}
  • Note: id is the full content hash when available (Tier 3), otherwise a unique UUID (Tier 1/2) to preserve "Move-then-Hash" performance.
  • Example: processed/a3/bc4f91e2d0f8.pdf

Database Schema

SQLite with BLOB-based hash storage:

-- Tier 1: Size lookup (existence set)
CREATE TABLE size_index (
    file_size INTEGER PRIMARY KEY
) WITHOUT ROWID;

-- Tier 2: Fringe hash (BLOB)
CREATE TABLE fringe_index (
    fringe_hash BLOB NOT NULL,
    file_size INTEGER NOT NULL,
    file_path TEXT NOT NULL,
    PRIMARY KEY (fringe_hash, file_size)
) WITHOUT ROWID;

-- Tier 3: Full hash (BLOB)
CREATE TABLE full_index (
    full_hash BLOB PRIMARY KEY,
    file_path TEXT NOT NULL
) WITHOUT ROWID;

-- Crash recovery tables
CREATE TABLE orphan_registry (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    original_path TEXT NOT NULL,
    orphan_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    recovered_at TEXT,
    status TEXT NOT NULL DEFAULT 'pending',
    UNIQUE(orphan_path)
);

CREATE TABLE move_journal (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    source_path TEXT NOT NULL,
    dest_path TEXT NOT NULL,
    file_size INTEGER NOT NULL,
    created_at TEXT NOT NULL,
    phase TEXT NOT NULL DEFAULT 'planned',
    completed_at TEXT
);

CREATE TABLE schema_version (
    version INTEGER PRIMARY KEY,
    applied_at TEXT NOT NULL
);

Pragmas: WAL mode, synchronous=FULL, 64MB cache, 256MB mmap.

Atomic File Moves

Uses hard-link + unlink (os.link / Path.unlink) for atomic same-filesystem moves.

Durability Guarantees

  • Signal Deferral: SIGINT/SIGTERM signals are deferred during critical move operations using critical_section().
  • Fsync Ordering: File and directory durability is strictly enforced:
    1. After linking destination, newly created parent directories are fsynced (top-down).
    2. The destination directory is fsynced to persist the new link.
    3. The source file is unlinked.
    4. The source directory is fsynced to persist the removal.
  • FS Enforcement: Cross-device moves are explicitly rejected (EXDEV error) to maintain atomicity.

Crash Recovery

Move operations use phase-based journaling: planned → moving → completed.

On startup, the engine automatically recovers incomplete entries:

  • planned: Move never started → Marked as failed.
  • moving: File may have been moved but not yet indexed → Engine attempts atomic rollback (link back to source + fsync + unlink destination).

Development

git clone https://github.com/mr3od/bgate-unix.git
cd bgate-unix
uv sync --dev

# Run tests
uv run pytest

# Lint
uv run ruff check .
uv run ruff format .

# Type check
uv run ty check src/

License

MIT

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

bgate_unix-0.5.0.tar.gz (51.5 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

bgate_unix-0.5.0-py3-none-any.whl (21.5 kB view details)

Uploaded Python 3

File details

Details for the file bgate_unix-0.5.0.tar.gz.

File metadata

  • Download URL: bgate_unix-0.5.0.tar.gz
  • Upload date:
  • Size: 51.5 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bgate_unix-0.5.0.tar.gz
Algorithm Hash digest
SHA256 036d70751ddacb54d8639b0dde2effc36223674380ac273f0db9a1a85ba19c8b
MD5 af79599868cbff61aef3d3b2bdaee69c
BLAKE2b-256 8e238e94629c2d87014381fdcc3523bf22d50f774ef583bf5a20e26224630c5d

See more details on using hashes here.

Provenance

The following attestation bundles were made for bgate_unix-0.5.0.tar.gz:

Publisher: release.yml on mr3od/bgate-unix

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file bgate_unix-0.5.0-py3-none-any.whl.

File metadata

  • Download URL: bgate_unix-0.5.0-py3-none-any.whl
  • Upload date:
  • Size: 21.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for bgate_unix-0.5.0-py3-none-any.whl
Algorithm Hash digest
SHA256 4520d57c0e1b44fb8f7bd973e0e1301d573bf041d00bab475490a974137fa63c
MD5 272732af8f45ad000c786e35c1d0571a
BLAKE2b-256 5aeb93c59f3234f8cbb28d7e5a68b46a84e1e15f5a91dc76adbb6c03a921d503

See more details on using hashes here.

Provenance

The following attestation bundles were made for bgate_unix-0.5.0-py3-none-any.whl:

Publisher: release.yml on mr3od/bgate-unix

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page