High-performance document intelligence gateway and safety guardrail engine.

DocGaurd (Document Intelligence Gateway)

Python 3.8+ • License: MIT

DocGaurd (Document Intelligence Gateway) is a high-performance document validation, security scanning, quality guardrail, and exact token counting engine. Built in Rust with native Python bindings via PyO3, DocGaurd sits between raw document ingestion and downstream LLM/RAG pipelines to prevent system exploitation, database bloat, and unexpected API costs.


Features • Installation • Quick Start • Python API • Telemetry Schema • Supported Formats • Examples • License


Features

  • ✨ Real GPT Tokenization - Integrates tiktoken-rs in Rust to compute exact token counts (not approximations) for OpenAI models such as GPT-4 and GPT-3.5. Models with their own tokenizers, such as Claude or LLaMA, split text differently, so treat the count as an estimate for them.
  • ⚡ Multi-Format Support - Seamlessly extracts text and parses metadata from PDF, TXT, MD, DOCX, PPTX, XLSX, CSV, JSON, XML, and HTML files.
  • 🛡️ Ingestion Security - Built-in security scanners inspect compressed documents and file headers to intercept Zip bombs, compression bombs, and oversized resource limits before they reach system memory.
  • 🔍 Text Quality & OCR Necessity Detection - Evaluates page text density, whitespace-to-character ratio, and empty page signals to flag scanned/image-only documents (requires_ocr) before vector database embedding.
  • 🚀 Native Parallel Batch Processing - Utilizes Rust's concurrent work-stealing thread pool (Rayon) to process thousands of files or directory trees in parallel with zero GIL serialization.
  • 💾 Global De-duplication - Computes high-performance SHA-256 content hashes in parallel to identify and skip exact duplicate files inside a batch queue automatically.
  • 💰 Dynamic Cost Estimation - Estimates LLM input cost and vector database embedding cost dynamically before making external API requests.
  • 🎯 Intelligent Agent Routing - Classifies text based on heuristic token frequencies and assigns a target downstream AI Agent (e.g., LegalAgent, ProcurementAgent).
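
As a rough illustration of the kind of pre-extraction check the Ingestion Security bullet describes, the sketch below reads only the declared sizes in a ZIP archive's directory and flags suspicious compression ratios. DOCX, PPTX, and XLSX files are ZIP containers, so the same check applies to them. The function name `looks_like_zip_bomb` and both thresholds are assumptions chosen for illustration; DocGaurd's own scanner is written in Rust and its limits are not documented here.

```python
import io
import zipfile

# Hypothetical limits for illustration only; not DocGaurd's real thresholds.
MAX_RATIO = 100                      # uncompressed bytes per compressed byte
MAX_TOTAL_BYTES = 50 * 1024 * 1024   # 50 MB of declared uncompressed data


def looks_like_zip_bomb(data, max_ratio=MAX_RATIO, max_total=MAX_TOTAL_BYTES):
    """Inspect ZIP directory metadata without extracting any member."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        members = archive.infolist()
        total_uncompressed = sum(info.file_size for info in members)
        total_compressed = sum(info.compress_size for info in members)
    if total_uncompressed > max_total:
        return True
    # Skip the ratio test for empty archives to avoid dividing by zero.
    return total_compressed > 0 and total_uncompressed / total_compressed > max_ratio


# Build a small archive in memory: one megabyte of zeros compresses extremely well.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
    archive.writestr("zeros.bin", b"\0" * 1_000_000)
print(looks_like_zip_bomb(buffer.getvalue()))
```

Declared sizes can be forged, so a production scanner would also cap the number of bytes it actually decompresses.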

Installation

From PyPI (Recommended)

Install pre-compiled native binary wheels instantly on Windows, Linux, or macOS:

pip install docgaurd

No Rust compiler, C libraries, or build tools are required on the host system.

From Source

Building from source compiles the Rust extension, so a Rust toolchain must be installed first:

git clone https://github.com/JIVTESH28/docgaurd.git
cd docgaurd
pip install .

Quick Start

Initialize the Analyzer

import json
import docgaurd

# Initialize the gateway analyzer with custom thresholds
analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-4",                   # Target context window check
    "tokenizer_name": "cl100k_base",           # Tiktoken profile
    "embedding_rate_per_million": 0.02,        # Cost per 1M tokens ($)
    "llm_input_rate_per_million": 5.00,        # Cost per 1M tokens ($)
    "max_file_size": 52428800                  # Max file size (50MB)
})
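
One way to keep overrides consistent is to merge them onto the documented defaults before constructing the analyzer. The sketch below is an assumption-laden helper, not part of DocGaurd: `build_config` and its validation rules are hypothetical, and the default values are copied from the Configuration Limits table.

```python
# Defaults copied from the Configuration Limits table in this README.
DEFAULTS = {
    "target_model": "gpt-4",
    "tokenizer_name": "cl100k_base",
    "max_file_size": 52_428_800,
    "embedding_rate_per_million": 0.02,
    "llm_input_rate_per_million": 5.00,
}


def build_config(**overrides):
    """Return the defaults with overrides applied, rejecting obvious mistakes."""
    unknown = sorted(set(overrides) - set(DEFAULTS))
    if unknown:
        raise KeyError(f"Unknown settings: {unknown}")
    config = {**DEFAULTS, **overrides}
    if config["max_file_size"] <= 0:
        raise ValueError("max_file_size must be positive")
    for key in ("embedding_rate_per_million", "llm_input_rate_per_million"):
        if config[key] < 0:
            raise ValueError(f"{key} must not be negative")
    return config


config = build_config(target_model="gpt-3.5-turbo", llm_input_rate_per_million=1.50)
print(config)
# The result can then be passed on, e.g. docgaurd.DocumentAnalyzer(config).
```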

Python API Usage

Single File Ingestion (Local Disk)

report_str = analyzer.analyze_file("contract.pdf")
report = json.loads(report_str)
print(f"Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")

In-Memory Bytes Ingestion (API Uploads)

uploaded_bytes = b"Sample document text buffer."
report_str = analyzer.analyze_bytes(uploaded_bytes, "invoice.txt")
report = json.loads(report_str)
print(f"Domain Class: {report['document_class']} | RAG Ready: {report['rag_ready']}")

Natively Parallel Batch Processing

file_list = ["agreement.docx", "data.xlsx", "spec.pdf"]
batch_report_str = analyzer.analyze_batch(file_list)
batch_report = json.loads(batch_report_str)

print(f"Successful files: {batch_report['summary']['successful_files']}")
print(f"Duplicates skipped: {batch_report['summary']['duplicate_files']}")
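
The duplicate count above comes from content hashing. A single-threaded Python sketch of the same idea, using the standard `hashlib` module, is shown below; `split_duplicates` is a hypothetical helper, and DocGaurd itself performs this step in parallel in Rust.

```python
import hashlib


def split_duplicates(payloads):
    """Split (name, bytes) pairs into first-seen names and exact-duplicate names."""
    seen = set()
    unique, duplicates = [], []
    for name, data in payloads:
        digest = hashlib.sha256(data).hexdigest()
        if digest in seen:
            duplicates.append(name)   # same bytes already queued under another name
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates


unique, duplicates = split_duplicates(
    [("a.txt", b"same text"), ("b.txt", b"other text"), ("copy_of_a.txt", b"same text")]
)
print(unique, duplicates)
```

Hashing the content rather than the file name means renamed copies are still caught, while files that differ by a single byte are treated as distinct.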

Directory Ingestion (Recursive Scan)

dir_report_str = analyzer.analyze_directory("./archive", recursive=True)
dir_report = json.loads(dir_report_str)
print(f"Total directory tokens: {dir_report['summary']['total_tokens']}")

Ultra-Fast Single-Metric Bypasses

If you only need a single metric and want to bypass the rest of the gateway analysis pipeline (such as security checks, cost estimation, and domain classification), use the sub-millisecond helpers:

# Raw metric count helpers (File-based)
word_count = analyzer.count_words("document.docx")
char_count = analyzer.count_chars("document.docx")
token_count = analyzer.count_tokens("document.docx")

# Raw metric count helpers (Byte-based)
uploaded_bytes = b"Sample document text buffer."
byte_token_count = analyzer.count_tokens_bytes(uploaded_bytes, "invoice.txt")

Telemetry Output Schema

DocGaurd generates a comprehensive, metadata-rich telemetry report for every analyzed file:

{
  "file_name": "contract_agreement.pdf",
  "file_type": "pdf",
  "sha256": "07c270b274dae324f906e0aa3a8d606471931e9c1afc241ddbc8f9ae52baffe7",
  "token_count": 2424,
  "word_count": 1612,
  "character_count": 11448,
  "page_count": 4,
  "requires_ocr": false,
  "quality_score": 0.8,
  "duplicate": false,
  "security_risk": "low",
  "fits_context": true,
  "rag_ready": true,
  "requires_summarization": false,
  "recommended_chunking": "semantic chunking",
  "document_class": "Legal",
  "recommended_agent": "LegalAgent",
  "estimated_embedding_cost": 0.0,
  "estimated_llm_cost": 0.0121,
  "processing_time_ms": 12.34
}
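
For code that consumes these reports, a typed wrapper can make field access less error-prone than raw dictionary lookups. The sketch below is one possible shape, not something DocGaurd ships: `TelemetryReport` is a hypothetical class that models only a subset of the fields and silently ignores the rest.

```python
import json
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class TelemetryReport:
    """Subset of the telemetry fields, typed for convenient access."""
    file_name: str
    token_count: int
    requires_ocr: bool
    quality_score: float
    security_risk: str
    fits_context: bool
    rag_ready: bool

    @classmethod
    def from_json(cls, raw):
        data = json.loads(raw)
        known = {field.name for field in fields(cls)}
        # Drop keys this class does not model so extra report fields do not break parsing.
        return cls(**{key: value for key, value in data.items() if key in known})


raw = json.dumps({
    "file_name": "contract_agreement.pdf",
    "file_type": "pdf",
    "token_count": 2424,
    "requires_ocr": False,
    "quality_score": 0.8,
    "security_risk": "low",
    "fits_context": True,
    "rag_ready": True,
})
report = TelemetryReport.from_json(raw)
print(report.file_name, report.token_count, report.rag_ready)
```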

Telemetry Field Descriptions

| Field | Type | Description |
|---|---|---|
| file_name | String | Base name of the analyzed file. |
| file_type | String | Lowercase file extension (e.g. pdf, docx, txt). |
| sha256 | String | Cryptographic SHA-256 hash of the exact content payload. |
| token_count | Integer | Exact token count for the selected tokenizer profile. |
| word_count | Integer | Number of words, split on Unicode whitespace. |
| character_count | Integer | UTF-8 character length of the extracted document text. |
| page_count | Integer | Page count (e.g. PDF pages, PowerPoint slides, Excel sheets, estimated text lines). |
| requires_ocr | Boolean | True if the document has page structure but low text density (image-only scans). |
| quality_score | Float | Cleanliness index (0.0 - 1.0) graded by density, metadata, ratio, and OCR markers. |
| duplicate | Boolean | True if an identical SHA-256 has already been processed in the concurrent batch queue. |
| security_risk | String | Security risk level (low, medium, high) based on Zip bomb and size-threshold checks. |
| fits_context | Boolean | True if token_count fits inside the target model's context window. |
| rag_ready | Boolean | Suitability for search databases (true if secure, non-scanned, and clean). |
| requires_summarization | Boolean | Recommends pre-summarizing if the token count or page density is excessively large. |
| recommended_chunking | String | Suggested chunking strategy (no chunking, fixed, semantic, hierarchical, agentic). |
| document_class | String | Classified topical domain (Finance, Procurement, Legal, HR, Tech Doc, Research, etc.). |
| recommended_agent | String | Recommended downstream AI agent (e.g. LegalAgent). |
| estimated_embedding_cost | Float | Predicted vector database indexing cost. |
| estimated_llm_cost | Float | Predicted input processing cost. |
| processing_time_ms | Float | Internal gateway execution latency in milliseconds. |
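
To make the `requires_ocr` description concrete, here is a minimal sketch of a text-density heuristic: a document is flagged when most of its pages contain almost no extractable characters. The function name, both thresholds, and the exact rule are assumptions for illustration; the signals DocGaurd actually combines are only summarized above.

```python
def needs_ocr(page_texts, min_chars_per_page=50, max_sparse_fraction=0.5):
    """Flag a document whose pages are mostly empty of extractable text.

    page_texts holds one extracted string per page; thresholds are hypothetical.
    """
    if not page_texts:
        return False  # no page structure at all, so nothing to send to OCR
    sparse_pages = 0
    for text in page_texts:
        visible_chars = len("".join(text.split()))  # ignore all whitespace
        if visible_chars < min_chars_per_page:
            sparse_pages += 1
    return sparse_pages / len(page_texts) > max_sparse_fraction


scanned = ["", "  \n ", ""]                    # three pages with no text layer
digital = ["Lorem ipsum dolor sit amet. " * 10] * 3
print(needs_ocr(scanned), needs_ocr(digital))
```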

Supported Formats

| Format | Extension | Extraction Method | Key Features |
|---|---|---|---|
| PDF | .pdf | Native lopdf Parser | Structural reading, scanned detection, page extraction |
| Word | .docx | Native docx XML Parser | Direct paragraph and table text extraction |
| PowerPoint | .pptx | Native pptx XML Parser | Shape text, slide processing, bullet analysis |
| Excel | .xlsx | Calamine Engine | Spreadsheet parsing, cell extraction, rows estimation |
| CSV | .csv | CSV Parser | Direct row, column parsing, delimiter validation |
| Plain Text | .txt, .md | Unicode Parser | Streaming flat extraction, lossy fallback encoding |
| JSON | .json | Serde JSON | Recursive nested key-value string extraction |
| XML | .xml | Quick XML Parser | Tag-stripped text, element-wise traversal |
| HTML | .html | Quick XML Parser | Element parsing, script/style extraction filtering |
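
When building a batch, it can help to separate files with supported extensions from everything else before calling the analyzer. The sketch below does this with `pathlib`; `partition_by_support` is a hypothetical helper, the extension set is copied from the table above, and the case-insensitive match is an assumption about how DocGaurd treats extensions.

```python
from pathlib import Path

# Extensions listed in the Supported Formats table.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".docx", ".pptx", ".xlsx", ".csv",
    ".txt", ".md", ".json", ".xml", ".html",
}


def partition_by_support(paths):
    """Return (supported, unsupported) lists, preserving the input order."""
    supported, unsupported = [], []
    for path in paths:
        suffix = Path(path).suffix.lower()
        if suffix in SUPPORTED_EXTENSIONS:
            supported.append(str(path))
        else:
            unsupported.append(str(path))
    return supported, unsupported


supported, unsupported = partition_by_support(
    ["agreement.docx", "SPEC.PDF", "setup.exe", "README"]
)
print(supported, unsupported)
```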

Configuration Limits

| Setting | Default Value | Purpose |
|---|---|---|
| target_model | "gpt-4" | Target context size limit check |
| tokenizer_name | "cl100k_base" | Tokenizer profile (cl100k_base, r50k_base, p50k_base) |
| max_file_size | 52,428,800 bytes (50 MB) | Intercept oversized documents |
| embedding_rate_per_million | $0.02 | Custom embedding cost rate |
| llm_input_rate_per_million | $5.00 | Custom LLM input rate |
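
The two rate settings are expressed per million tokens, so a cost estimate is presumably tokens times rate divided by one million. The sketch below applies that arithmetic to the sample report (2,424 tokens). The formula and the rounding to four decimal places are assumptions inferred from the sample values, not confirmed behavior, and `estimate_cost` is a hypothetical name.

```python
def estimate_cost(token_count, rate_per_million, ndigits=4):
    """Cost in dollars for token_count tokens at a per-million-token rate."""
    return round(token_count * rate_per_million / 1_000_000, ndigits)


llm_cost = estimate_cost(2424, 5.00)        # matches estimated_llm_cost in the sample report
embedding_cost = estimate_cost(2424, 0.02)  # rounds to zero, like estimated_embedding_cost
print(llm_cost, embedding_cost)
```

Under these assumptions the embedding cost of a small document rounds to zero, so batch-level totals are more informative than per-file values at low rates.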

Examples

Example 1: RAG Ingestion Security & Quality Gatekeeper

Ensure that only secure, high-quality, digital documents enter your vector database:

import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer()
report = json.loads(analyzer.analyze_file("user_upload.pdf"))

# Intercept risks at the gateway
if report["security_risk"] == "high":
    raise ValueError(f"CRITICAL: Security exception triggered for {report['file_name']}")

if report["requires_ocr"]:
    print(f"Routing {report['file_name']} to hardware-accelerated OCR pipeline.")
elif not report["rag_ready"]:
    print(f"Skipping {report['file_name']} due to low text quality score: {report['quality_score']}")
else:
    print(f"Ingesting clean document text. Context Size: {report['token_count']} tokens.")
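
The same gate can be factored into a small pure function, which makes the decision order easy to unit-test without touching real files. This is a sketch only: `route_document` and its four labels are hypothetical names, while the report keys come from the telemetry schema.

```python
def route_document(report):
    """Map a parsed telemetry report to one of four pipeline decisions."""
    if report["security_risk"] == "high":
        return "reject"   # never let risky payloads past the gateway
    if report["requires_ocr"]:
        return "ocr"      # scanned or image-only, needs text recovery first
    if not report["rag_ready"]:
        return "skip"     # digital but too low quality to embed
    return "ingest"


clean = {"security_risk": "low", "requires_ocr": False, "rag_ready": True}
scanned = {"security_risk": "low", "requires_ocr": True, "rag_ready": False}
print(route_document(clean), route_document(scanned))
```

Checking security first means a high-risk file is rejected even if it would otherwise qualify for OCR or ingestion.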

Example 2: API Cost Budgeting & Model Window Check

Calculate API transaction costs and verify if a document fits within a model's context window:

import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-3.5-turbo",
    "llm_input_rate_per_million": 1.50
})

report = json.loads(analyzer.analyze_file("long_transcript.txt"))

if not report["fits_context"]:
    print(f"Document exceeds target context window. Recommended chunking strategy: {report['recommended_chunking']}")
else:
    print(f"Document fits. Estimated processing cost: ${report['estimated_llm_cost']:.4f}")
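
When a document does not fit and fixed-size chunking is chosen, the number of chunks follows directly from the token count. The sketch below works that out for overlapping windows; `estimate_chunk_count`, the 512-token chunk size, and the 64-token overlap are illustrative assumptions, not DocGaurd defaults.

```python
import math


def estimate_chunk_count(token_count, chunk_size=512, overlap=64):
    """Number of fixed-size, overlapping windows needed to cover token_count tokens."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    if token_count <= 0:
        return 0
    if token_count <= chunk_size:
        return 1
    stride = chunk_size - overlap  # new tokens contributed by each later chunk
    return 1 + math.ceil((token_count - chunk_size) / stride)


print(estimate_chunk_count(2424))  # token count from the sample report
```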

Example 3: Hardware-Accelerated OCR Integration (Metal/CUDA)

Incorporate unified OCR for scanned files using the docgaurd_ocr.py module:

import json
from docgaurd_ocr import OcrDocumentAnalyzer

# Initialize unified OcrDocumentAnalyzer (auto-routes to Apple Metal MPS or CUDA)
gateway = OcrDocumentAnalyzer()

report_json = gateway.analyze_file("scanned_receipt.jpg")
report = json.loads(report_json)

print(f"OCR Text: {report['text']}")
print(f"OCR Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")

License

This project is licensed under the MIT License.


