High-performance document intelligence gateway and safety guardrail engine.

DocGaurd (Document Intelligence Gateway)

Python 3.8+ • License: MIT

DocGaurd (Document Intelligence Gateway) is a high-performance document validation, security scanning, quality guardrail, and exact token counting engine. Built in Rust with native Python bindings via PyO3, DocGaurd sits between raw document ingestion and downstream LLM/RAG pipelines to prevent system exploitation, database bloat, and unexpected API costs.


Features • Installation • Quick Start • Python API • Telemetry Schema • Supported Formats • Examples • License


Features

  • ✨ Real GPT Tokenization - Integrates tiktoken-rs in Rust to compute exact token counts (not approximations) for OpenAI models such as GPT-4 and GPT-3.5. Counts for models that use a different tokenizer, such as Claude or LLaMA, are estimates.
  • ⚡ Multi-Format Support - Seamlessly extracts text and parses metadata from PDF, TXT, MD, DOCX, PPTX, XLSX, CSV, JSON, XML, and HTML files.
  • 🛡️ Ingestion Security - Built-in security scanners inspect compressed documents and file headers to intercept Zip bombs, compression bombs, and oversized files before they reach system memory.
  • 🔍 Text Quality & OCR Necessity Detection - Evaluates page text density, whitespace-to-character ratio, and empty page signals to flag scanned/image-only documents (requires_ocr) before vector database embedding.
  • 🚀 Native Parallel Batch Processing - Uses Rust's work-stealing thread pool (Rayon) to process thousands of files or directory trees in parallel, without serializing work on Python's GIL.
  • 💾 Global De-duplication - Computes high-performance SHA-256 content hashes in parallel to identify and skip exact duplicate files inside a batch queue automatically.
  • 💰 Dynamic Cost Estimation - Estimates LLM input cost and vector database embedding cost dynamically before making external API requests.
  • 🎯 Intelligent Agent Routing - Classifies text based on heuristic token frequencies and assigns a target downstream AI Agent (e.g., LegalAgent, ProcurementAgent).
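
The de-duplication feature above comes down to hashing each file's bytes and skipping any hash that has already been seen. The following is a minimal pure-Python sketch of that idea using only the standard library's hashlib. The helper name find_duplicates is hypothetical, and this is not DocGaurd's Rust implementation, which hashes in parallel.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def find_duplicates(payloads: Iterable[Tuple[str, bytes]]) -> Tuple[List[str], List[str]]:
    """Split (name, content) pairs into unique names and duplicate names.

    Hypothetical helper for illustration only.
    """
    seen: Dict[str, str] = {}  # SHA-256 hex digest -> first file name with that content
    unique: List[str] = []
    duplicates: List[str] = []
    for name, content in payloads:
        digest = hashlib.sha256(content).hexdigest()
        if digest in seen:
            duplicates.append(name)  # same bytes as an earlier file
        else:
            seen[digest] = name
            unique.append(name)
    return unique, duplicates


batch = [("a.txt", b"hello"), ("b.txt", b"world"), ("copy_of_a.txt", b"hello")]
unique, duplicates = find_duplicates(batch)
print(unique, duplicates)
```

Because the hash covers only the content, two files with different names but identical bytes count as duplicates, which matches the "exact duplicate" wording above.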

Installation

From PyPI (Recommended)

Install pre-compiled native binary wheels instantly on Windows, Linux, or macOS:

pip install docgaurd

No Rust compiler, C libraries, or build tools are required on the host system.

From Source

git clone https://github.com/JIVTESH28/docgaurd.git
cd docgaurd
pip install .

Quick Start

Initialize the Analyzer

import json
import docgaurd

# Initialize the gateway analyzer with custom thresholds
analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-4",                   # Target context window check
    "tokenizer_name": "cl100k_base",           # Tiktoken profile
    "embedding_rate_per_million": 0.02,        # Cost per 1M tokens ($)
    "llm_input_rate_per_million": 5.00,        # Cost per 1M tokens ($)
    "max_file_size": 52428800                  # Max file size (50MB)
})

Python API Usage

Single File Ingestion (Local Disk)

report_str = analyzer.analyze_file("contract.pdf")
report = json.loads(report_str)
print(f"Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")

In-Memory Bytes Ingestion (API Uploads)

uploaded_bytes = b"Sample document text buffer."
report_str = analyzer.analyze_bytes(uploaded_bytes, "invoice.txt")
report = json.loads(report_str)
print(f"Domain Class: {report['document_class']} | RAG Ready: {report['rag_ready']}")

Natively Parallel Batch Processing

file_list = ["agreement.docx", "data.xlsx", "spec.pdf"]
batch_report_str = analyzer.analyze_batch(file_list)
batch_report = json.loads(batch_report_str)

print(f"Successful files: {batch_report['summary']['successful_files']}")
print(f"Duplicates skipped: {batch_report['summary']['duplicate_files']}")

Directory Ingestion (Recursive Scan)

dir_report_str = analyzer.analyze_directory("./archive", recursive=True)
dir_report = json.loads(dir_report_str)
print(f"Total directory tokens: {dir_report['summary']['total_tokens']}")

Ultra-Fast Single-Metric Bypasses

If you only need a single metric and want to bypass the rest of the gateway analysis pipeline (such as security checks, cost estimation, and domain classification), use the sub-millisecond helpers:

# Raw metric count helpers (File-based)
word_count = analyzer.count_words("document.docx")
char_count = analyzer.count_chars("document.docx")
token_count = analyzer.count_tokens("document.docx")

# Raw metric count helpers (Byte-based)
token_count = analyzer.count_tokens_bytes(uploaded_bytes, "invoice.txt")

Telemetry Output Schema

DocGaurd generates a comprehensive, metadata-rich telemetry report for every analyzed file:

{
  "file_name": "contract_agreement.pdf",
  "file_type": "pdf",
  "sha256": "07c270b274dae324f906e0aa3a8d606471931e9c1afc241ddbc8f9ae52baffe7",
  "token_count": 2424,
  "word_count": 1612,
  "character_count": 11448,
  "page_count": 4,
  "requires_ocr": false,
  "quality_score": 0.8,
  "duplicate": false,
  "security_risk": "low",
  "fits_context": true,
  "rag_ready": true,
  "requires_summarization": false,
  "recommended_chunking": "semantic chunking",
  "document_class": "Legal",
  "recommended_agent": "LegalAgent",
  "estimated_embedding_cost": 0.0,
  "estimated_llm_cost": 0.0121,
  "processing_time_ms": 12.34
}
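
The two cost fields in the sample report appear consistent with the Quick Start rates ($5.00 and $0.02 per million tokens) applied to token_count. The sketch below reproduces that arithmetic. The function name estimate_cost is hypothetical, and rounding to four decimal places is an assumption inferred from the sample values, not a documented behavior.

```python
def estimate_cost(token_count: int, rate_per_million: float, ndigits: int = 4) -> float:
    """Dollar cost of token_count tokens at a per-million-token rate.

    The 4-decimal rounding is an assumption based on the sample report.
    """
    return round(token_count * rate_per_million / 1_000_000, ndigits)


llm_cost = estimate_cost(2424, 5.00)        # 2424 * 5.00 / 1e6 = 0.01212
embedding_cost = estimate_cost(2424, 0.02)  # 2424 * 0.02 / 1e6 = 0.00004848
print(llm_cost, embedding_cost)
```

Under this assumption, the embedding cost of a small document rounds to 0.0, which would explain the zero in the sample.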

Telemetry Field Descriptions

| Field | Type | Description |
| --- | --- | --- |
| file_name | String | Base name of the analyzed file. |
| file_type | String | Lowercase file extension (e.g. pdf, docx, txt). |
| sha256 | String | Cryptographic SHA-256 hash of the exact content payload. |
| token_count | Integer | Exact token count for the selected tokenizer profile. |
| word_count | Integer | Number of words, split on Unicode whitespace. |
| character_count | Integer | UTF-8 character length of the extracted document text. |
| page_count | Integer | Page count (e.g. PDF pages, PowerPoint slides, Excel sheets, estimated text lines). |
| requires_ocr | Boolean | True if the document has page structure but low text density (image-only scan). |
| quality_score | Float | Cleanliness index (0.0 - 1.0) graded by text density, metadata, whitespace-to-character ratio, and OCR markers. |
| duplicate | Boolean | True if an identical SHA-256 has already been processed in the concurrent batch queue. |
| security_risk | String | Risk level (low, medium, high) from Zip-bomb and size-threshold checks. |
| fits_context | Boolean | True if token_count fits inside the target model's context window. |
| rag_ready | Boolean | Suitability for search databases (true if secure, non-scanned, and clean). |
| requires_summarization | Boolean | Recommends pre-summarizing if the token count or page density is excessively large. |
| recommended_chunking | String | Suggested chunking strategy (no chunking, fixed, semantic, hierarchical, agentic). |
| document_class | String | Classified topical domain (Finance, Procurement, Legal, HR, Tech Doc, Research, etc.). |
| recommended_agent | String | Recommended downstream AI agent (e.g. LegalAgent). |
| estimated_embedding_cost | Float | Predicted vector database indexing cost. |
| estimated_llm_cost | Float | Predicted input processing cost. |
| processing_time_ms | Float | Internal gateway execution latency in milliseconds. |
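
Since the analyzer returns a JSON string, it can be convenient to load it into a typed object rather than index a raw dict. The sketch below mirrors the fields documented above in a dataclass. The class name TelemetryReport and its from_json method are hypothetical and not part of the docgaurd package; unknown keys are ignored so that extra fields in a report do not break parsing.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TelemetryReport:
    """Typed view of one report (hypothetical helper, not shipped with docgaurd)."""
    file_name: str
    file_type: str
    sha256: str
    token_count: int
    word_count: int
    character_count: int
    page_count: int
    requires_ocr: bool
    quality_score: float
    duplicate: bool
    security_risk: str
    fits_context: bool
    rag_ready: bool
    requires_summarization: bool
    recommended_chunking: str
    document_class: str
    recommended_agent: str
    estimated_embedding_cost: float
    estimated_llm_cost: float
    processing_time_ms: float

    @classmethod
    def from_json(cls, raw: str) -> "TelemetryReport":
        data = json.loads(raw)
        known = {f.name for f in fields(cls)}
        # Drop keys this class does not know; a missing documented key raises TypeError.
        return cls(**{k: v for k, v in data.items() if k in known})


SAMPLE = """{
  "file_name": "contract_agreement.pdf", "file_type": "pdf",
  "sha256": "07c270b274dae324f906e0aa3a8d606471931e9c1afc241ddbc8f9ae52baffe7",
  "token_count": 2424, "word_count": 1612, "character_count": 11448,
  "page_count": 4, "requires_ocr": false, "quality_score": 0.8,
  "duplicate": false, "security_risk": "low", "fits_context": true,
  "rag_ready": true, "requires_summarization": false,
  "recommended_chunking": "semantic chunking", "document_class": "Legal",
  "recommended_agent": "LegalAgent", "estimated_embedding_cost": 0.0,
  "estimated_llm_cost": 0.0121, "processing_time_ms": 12.34
}"""

report = TelemetryReport.from_json(SAMPLE)
print(report.file_name, report.token_count, report.rag_ready)
```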

Supported Formats

| Format | Extension | Extraction Method | Key Features |
| --- | --- | --- | --- |
| PDF | .pdf | Native lopdf Parser | Structural reading, scanned detection, page extraction |
| Word | .docx | Native docx XML Parser | Direct paragraph and table text extraction |
| PowerPoint | .pptx | Native pptx XML Parser | Shape text, slide processing, bullet analysis |
| Excel | .xlsx | Calamine Engine | Spreadsheet parsing, cell extraction, rows estimation |
| CSV | .csv | CSV Parser | Direct row, column parsing, delimiter validation |
| Plain Text | .txt, .md | Unicode Parser | Streaming flat extraction, lossy fallback encoding |
| JSON | .json | Serde JSON | Recursive nested key-value string extraction |
| XML | .xml | Quick XML Parser | Tag-stripped text, element-wise traversal |
| HTML | .html | Quick XML Parser | Element parsing, script/style extraction filtering |

Configuration Limits

| Setting | Default Value | Purpose |
| --- | --- | --- |
| target_model | "gpt-4" | Target context size limit check |
| tokenizer_name | "cl100k_base" | Tokenizer profile (cl100k_base, r50k_base, p50k_base) |
| max_file_size | 52,428,800 bytes (50MB) | Intercept oversized documents |
| embedding_rate_per_million | $0.02 | Custom embedding cost rate |
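
To see how a partial configuration relates to the defaults in the table, the sketch below merges user overrides onto those documented values and applies two basic sanity checks. The names DOCUMENTED_DEFAULTS and build_config are hypothetical, and whether DocumentAnalyzer merges partial dictionaries in exactly this way is an assumption. Settings without a documented default, such as llm_input_rate_per_million, are simply passed through.

```python
from typing import Any, Dict, Optional

# Defaults copied from the table above; nothing else is assumed.
DOCUMENTED_DEFAULTS: Dict[str, Any] = {
    "target_model": "gpt-4",
    "tokenizer_name": "cl100k_base",
    "max_file_size": 50 * 1024 * 1024,  # 52,428,800 bytes
    "embedding_rate_per_million": 0.02,
}


def build_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """Return the documented defaults with user overrides applied on top."""
    config = dict(DOCUMENTED_DEFAULTS)
    config.update(overrides or {})
    if config["max_file_size"] <= 0:
        raise ValueError("max_file_size must be a positive number of bytes")
    if config["embedding_rate_per_million"] < 0:
        raise ValueError("embedding_rate_per_million must not be negative")
    return config


config = build_config({"target_model": "gpt-3.5-turbo", "llm_input_rate_per_million": 1.50})
print(config)
```

The resulting dictionary could then be passed to docgaurd.DocumentAnalyzer(config), as in the Quick Start.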

How the OCR Integration Works

DocGaurd implements a high-performance hybrid OCR gateway under the OcrDocumentAnalyzer class:

  1. Rust-Native Gatekeeping: When a file is submitted, DocGaurd first uses its sub-millisecond Rust parsers to check the file type and structure.
    • If the document is a clean digital file (e.g., text PDF, Word doc, or markdown), the text is extracted instantly, and the heavy OCR engine is completely bypassed.
    • If the file is an image (.png, .jpg, .jpeg, etc.) or is flagged by the Rust quality scanner as a scanned/text-empty PDF (requires_ocr is true), the OCR engine is initialized.
  2. Lazy Loading: To keep package imports fast, PyTorch and EasyOCR model weights are loaded lazily, only when the first scanned document or raw image is encountered.
  3. Hardware Auto-Detection: The engine dynamically autodetects your host hardware to run deep learning models at maximum speed:
    • macOS (Apple Silicon): Natively offloads tensor computations to the GPU via Metal Performance Shaders (MPS).
    • Windows/Linux with GPU: Automatically targets your Nvidia GPU via CUDA.
    • Fallback: Runs on optimized multi-threaded CPU.
  4. Rust Telemetry Reconciliation: Once text is extracted via OCR, the raw text bytes are passed back into DocGaurd's Rust core using a virtual text buffer. The Rust engine then computes exact GPT token budgets (tiktoken-rs), counts words and characters, runs domain classification, and generates cost estimates, reconciling all statistics into a single unified JSON schema.
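
The gatekeeping, lazy-loading, and device-selection steps above can be sketched in plain Python. Everything here is illustrative: should_run_ocr, pick_device, and LazyOcrEngine are hypothetical names, not the internals of OcrDocumentAnalyzer. In a real setup the two availability flags would typically come from torch.cuda.is_available() and torch.backends.mps.is_available(); they are passed in as arguments so the sketch runs without PyTorch. Checking CUDA before MPS is an arbitrary choice, since a host rarely has both.

```python
import os
from typing import Callable, Optional

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # only the extensions named above


def should_run_ocr(file_name: str, requires_ocr: bool) -> bool:
    """Step 1: OCR runs for raw images or files flagged as scanned."""
    extension = os.path.splitext(file_name)[1].lower()
    return extension in IMAGE_EXTENSIONS or requires_ocr


def pick_device(cuda_available: bool, mps_available: bool) -> str:
    """Step 3: prefer a GPU backend, fall back to CPU."""
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


class LazyOcrEngine:
    """Step 2: defer an expensive loader until the first call to get()."""

    def __init__(self, loader: Callable[[], object]) -> None:
        self._loader = loader
        self._engine: Optional[object] = None
        self.load_count = 0  # how many times the loader actually ran

    def get(self) -> object:
        if self._engine is None:
            self._engine = self._loader()
            self.load_count += 1
        return self._engine


# The lambda stands in for loading real model weights.
engine = LazyOcrEngine(lambda: {"device": pick_device(False, True)})
if should_run_ocr("scanned_receipt.jpg", requires_ocr=False):
    engine.get()
    engine.get()  # second call reuses the already-loaded engine
print(engine.load_count, engine.get())
```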

Examples

Example 1: RAG Ingestion Security & Quality Gatekeeper

Ensure that only secure, high-quality, digital documents enter your vector database:

import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer()
report = json.loads(analyzer.analyze_file("user_upload.pdf"))

# Intercept risks at the gateway
if report["security_risk"] == "high":
    raise ValueError(f"CRITICAL: Security exception triggered for {report['file_name']}")

if report["requires_ocr"]:
    print(f"Routing {report['file_name']} to hardware-accelerated OCR pipeline.")
elif not report["rag_ready"]:
    print(f"Skipping {report['file_name']} due to low text quality score: {report['quality_score']}")
else:
    print(f"Ingesting clean document text. Context Size: {report['token_count']} tokens.")
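
For batches, the same decision order can be factored into a pure function and applied to a list of already-parsed reports. This is a sketch only: gate and partition are hypothetical helpers, and the input is assumed to be a list of per-file report dictionaries with the fields documented in the telemetry schema.

```python
from typing import Dict, Iterable, List


def gate(report: dict) -> str:
    """Map one telemetry report to 'reject', 'ocr', 'skip', or 'ingest'."""
    if report["security_risk"] == "high":
        return "reject"
    if report["requires_ocr"]:
        return "ocr"
    if not report["rag_ready"]:
        return "skip"
    return "ingest"


def partition(reports: Iterable[dict]) -> Dict[str, List[str]]:
    """Group file names by gate decision."""
    buckets: Dict[str, List[str]] = {"reject": [], "ocr": [], "skip": [], "ingest": []}
    for report in reports:
        buckets[gate(report)].append(report["file_name"])
    return buckets


reports = [
    {"file_name": "bomb.zip", "security_risk": "high", "requires_ocr": False, "rag_ready": False},
    {"file_name": "scan.pdf", "security_risk": "low", "requires_ocr": True, "rag_ready": False},
    {"file_name": "noisy.txt", "security_risk": "low", "requires_ocr": False, "rag_ready": False},
    {"file_name": "contract.pdf", "security_risk": "low", "requires_ocr": False, "rag_ready": True},
]
buckets = partition(reports)
print(buckets)
```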

Example 2: API Cost Budgeting & Model Window Check

Calculate API transaction costs and verify if a document fits within a model's context window:

import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-3.5-turbo",
    "llm_input_rate_per_million": 1.50
})

report = json.loads(analyzer.analyze_file("long_transcript.txt"))

if not report["fits_context"]:
    print(f"Document exceeds target context window. Recommended chunking strategy: {report['recommended_chunking']}")
else:
    print(f"Document fits. Estimated processing cost: ${report['estimated_llm_cost']:.4f}")

Example 3: Hardware-Accelerated OCR Integration (Metal/CUDA)

Incorporate unified OCR for scanned files directly from the installed package:

import json
from docgaurd import OcrDocumentAnalyzer

# Initialize unified OcrDocumentAnalyzer (auto-routes to Apple Metal MPS or CUDA)
gateway = OcrDocumentAnalyzer()

report_json = gateway.analyze_file("scanned_receipt.jpg")
report = json.loads(report_json)

print(f"OCR Text: {report['text']}")
print(f"OCR Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")

License

This project is licensed under the MIT License.

Download files

| File | Size | Target |
| --- | --- | --- |
| docgaurd-0.1.12.tar.gz | 34.2 kB | Source distribution |
| docgaurd-0.1.12-cp38-abi3-win_amd64.whl | 3.0 MB | CPython 3.8+, Windows x86-64 |
| docgaurd-0.1.12-cp38-abi3-manylinux_2_17_x86_64.manylinux2014_x86_64.whl | 3.3 MB | CPython 3.8+, manylinux: glibc 2.17+ x86-64 |
| docgaurd-0.1.12-cp38-abi3-manylinux_2_17_aarch64.manylinux2014_aarch64.whl | 3.2 MB | CPython 3.8+, manylinux: glibc 2.17+ ARM64 |
| docgaurd-0.1.12-cp38-abi3-macosx_11_0_arm64.whl | 3.1 MB | CPython 3.8+, macOS 11.0+ ARM64 |
| docgaurd-0.1.12-cp38-abi3-macosx_10_12_x86_64.whl | 3.1 MB | CPython 3.8+, macOS 10.12+ x86-64 |
