High-performance document intelligence gateway and safety guardrail engine.
DocGaurd (Document Intelligence Gateway)
DocGaurd (Document Intelligence Gateway) is a high-performance document validation, security scanning, quality guardrail, and exact token counting engine. Built in Rust with native Python bindings via PyO3, DocGaurd sits between raw document ingestion and downstream LLM/RAG pipelines to prevent system exploitation, database bloat, and unexpected API costs.
Features • Installation • Quick Start • Python API • Telemetry Schema • Supported Formats • Examples • License
Features
- ✨ Real GPT Tokenization - Integrates the high-performance `tiktoken-rs` crate in Rust to calculate exact token counts (not approximations) for the selected tiktoken profile, as used by OpenAI models such as GPT-4 and GPT-3.5. Models with their own tokenizers, such as Claude or LLaMA, may count the same text differently.
- ⚡ Multi-Format Support - Extracts text and parses metadata from PDF, TXT, MD, DOCX, PPTX, XLSX, CSV, JSON, XML, and HTML files.
- 🛡️ Ingestion Security - Built-in security scanners inspect compressed documents and file headers to intercept Zip bombs, compression bombs, and files that exceed size limits before they reach system memory.
- 🔍 Text Quality & OCR Necessity Detection - Evaluates page text density, whitespace-to-character ratio, and empty page signals to flag scanned/image-only documents (`requires_ocr`) before vector database embedding.
- 🚀 Native Parallel Batch Processing - Uses Rust's work-stealing thread pool (`Rayon`) to process thousands of files or directory trees in parallel with zero GIL serialization.
- 💾 Global De-duplication - Computes SHA-256 content hashes in parallel to identify and automatically skip exact duplicate files inside a batch queue.
- 💰 Dynamic Cost Estimation - Estimates LLM input cost and vector database embedding cost before any external API request is made.
- 🎯 Intelligent Agent Routing - Classifies text based on heuristic token frequencies and assigns a target downstream AI Agent (e.g., `LegalAgent`, `ProcurementAgent`).
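DOCX, PPTX, and XLSX files are ZIP containers, so the Zip bomb check mentioned above can be pictured with a small pure-Python sketch. This is not DocGaurd's Rust implementation: the function names and both thresholds are hypothetical, and the heuristic shown (comparing declared uncompressed size with compressed size, without extracting anything) is only one common approach.

```python
import io
import zipfile

# Hypothetical thresholds for illustration; not DocGaurd's real limits.
MAX_RATIO = 100.0                 # declared uncompressed bytes per compressed byte
MAX_TOTAL_BYTES = 512 * 1024**2   # declared uncompressed total (512 MiB)


def looks_like_zip_bomb(data: bytes) -> bool:
    """Inspect ZIP metadata only (nothing is extracted) and flag suspicious archives."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        entries = archive.infolist()
    compressed = sum(entry.compress_size for entry in entries)
    uncompressed = sum(entry.file_size for entry in entries)
    if uncompressed > MAX_TOTAL_BYTES:
        return True
    # Guard against division by zero for archives holding only empty files.
    return compressed > 0 and uncompressed / compressed > MAX_RATIO


def make_zip(payload: bytes) -> bytes:
    """Build a small in-memory ZIP archive for the demonstration below."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("payload.bin", payload)
    return buffer.getvalue()


highly_repetitive = make_zip(b"\0" * 5_000_000)   # compresses extremely well
ordinary = make_zip(bytes(range(256)) * 4)        # compresses modestly
print(looks_like_zip_bomb(highly_repetitive), looks_like_zip_bomb(ordinary))
```

Reading sizes from the archive's directory keeps the check cheap, but declared sizes can be forged, so a production scanner would likely also cap the bytes actually produced during decompression.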
Installation
From PyPI (Recommended)
Install pre-compiled native binary wheels instantly on Windows, Linux, or macOS:
```bash
pip install docgaurd
```
(No Rust compiler, C libraries, or build tools are required on the host system.)
From Source
```bash
git clone https://github.com/JIVTESH28/docgaurd.git
cd docgaurd
pip install .
```
Quick Start
Initialize the Analyzer
```python
import json
import docgaurd

# Initialize the gateway analyzer with custom thresholds
analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-4",               # Target context window check
    "tokenizer_name": "cl100k_base",       # Tiktoken profile
    "embedding_rate_per_million": 0.02,    # Cost per 1M tokens ($)
    "llm_input_rate_per_million": 5.00,    # Cost per 1M tokens ($)
    "max_file_size": 52428800              # Max file size (50MB)
})
```
Python API Usage
Single File Ingestion (Local Disk)
```python
report_str = analyzer.analyze_file("contract.pdf")
report = json.loads(report_str)
print(f"Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")
```
In-Memory Bytes Ingestion (API Uploads)
```python
uploaded_bytes = b"Sample document text buffer."
report_str = analyzer.analyze_bytes(uploaded_bytes, "invoice.txt")
report = json.loads(report_str)
print(f"Domain Class: {report['document_class']} | RAG Ready: {report['rag_ready']}")
```
Natively Parallel Batch Processing
```python
file_list = ["agreement.docx", "data.xlsx", "spec.pdf"]
batch_report_str = analyzer.analyze_batch(file_list)
batch_report = json.loads(batch_report_str)
print(f"Successful files: {batch_report['summary']['successful_files']}")
print(f"Duplicates skipped: {batch_report['summary']['duplicate_files']}")
```
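The `duplicate_files` count comes from SHA-256 content hashing. As a rough illustration of the idea only, here is a sequential sketch built on Python's `hashlib`. DocGaurd hashes in parallel in Rust, and the helper names below are hypothetical. Keeping the first file seen for each digest is an assumption, since the source does not say which copy is retained.

```python
import hashlib


def sha256_hex(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a content payload."""
    return hashlib.sha256(payload).hexdigest()


def split_duplicates(named_payloads):
    """Return (unique_names, duplicate_names), keeping the first file seen per digest."""
    seen = set()
    unique, duplicates = [], []
    for name, payload in named_payloads:
        digest = sha256_hex(payload)
        if digest in seen:
            duplicates.append(name)
        else:
            seen.add(digest)
            unique.append(name)
    return unique, duplicates


batch = [
    ("agreement.docx", b"alpha"),
    ("copy_of_agreement.docx", b"alpha"),   # same bytes, different name
    ("spec.pdf", b"beta"),
]
unique, duplicates = split_duplicates(batch)
print(unique, duplicates)
```

Because the digest covers content only, a renamed copy is still detected, while a file that differs by a single byte is treated as new.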
Directory Ingestion (Recursive Scan)
```python
dir_report_str = analyzer.analyze_directory("./archive", recursive=True)
dir_report = json.loads(dir_report_str)
print(f"Total directory tokens: {dir_report['summary']['total_tokens']}")
```
Ultra-Fast Single-Metric Bypasses
If you only need a single metric and want to bypass the rest of the gateway analysis pipeline (such as security checks, cost estimation, and domain classification), use the sub-millisecond helpers:
```python
# Raw metric count helpers (File-based)
word_count = analyzer.count_words("document.docx")
char_count = analyzer.count_chars("document.docx")
token_count = analyzer.count_tokens("document.docx")

# Raw metric count helpers (Byte-based)
token_count = analyzer.count_tokens_bytes(uploaded_bytes, "invoice.txt")
```
Telemetry Output Schema
DocGaurd generates a comprehensive, metadata-rich telemetry report for every analyzed file:
```json
{
  "file_name": "contract_agreement.pdf",
  "file_type": "pdf",
  "sha256": "07c270b274dae324f906e0aa3a8d606471931e9c1afc241ddbc8f9ae52baffe7",
  "token_count": 2424,
  "word_count": 1612,
  "character_count": 11448,
  "page_count": 4,
  "requires_ocr": false,
  "quality_score": 0.8,
  "duplicate": false,
  "security_risk": "low",
  "fits_context": true,
  "rag_ready": true,
  "requires_summarization": false,
  "recommended_chunking": "semantic chunking",
  "document_class": "Legal",
  "recommended_agent": "LegalAgent",
  "estimated_embedding_cost": 0.0,
  "estimated_llm_cost": 0.0121,
  "processing_time_ms": 12.34
}
```
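The two cost fields in the sample report are consistent with plain per-million-token arithmetic. The sketch below assumes the rates from the Quick Start configuration (5.00 and 0.02 dollars per million tokens) and rounding to four decimal places. Neither assumption is confirmed by the source for this particular report, and `estimate_cost` is a hypothetical helper rather than part of the DocGaurd API.

```python
def estimate_cost(token_count: int, rate_per_million: float) -> float:
    """Cost in dollars for `token_count` tokens at a per-million-token rate."""
    return token_count * rate_per_million / 1_000_000


tokens = 2424  # token_count from the sample report
llm_cost = estimate_cost(tokens, 5.00)        # assumed llm_input_rate_per_million
embedding_cost = estimate_cost(tokens, 0.02)  # assumed embedding_rate_per_million

print(round(llm_cost, 4))        # 0.0121
print(round(embedding_cost, 4))  # 0.0
```

Under these assumptions the embedding cost for a document this size is about $0.00005, which is why it appears as 0.0 in the report.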
Telemetry Field Descriptions
| Field | Type | Description |
|---|---|---|
| `file_name` | String | Base name of the analyzed file. |
| `file_type` | String | Lowercase file extension (e.g. pdf, docx, txt). |
| `sha256` | String | Cryptographic SHA-256 hash of the exact content payload. |
| `token_count` | Integer | Exact token count for the selected tokenizer profile. |
| `word_count` | Integer | Number of words, split on Unicode whitespace. |
| `character_count` | Integer | Character length of the extracted (UTF-8) document text. |
| `page_count` | Integer | Page count (e.g. PDF pages, PowerPoint slides, Excel sheets, estimated text lines). |
| `requires_ocr` | Boolean | True if the document has page structures but low text density (image-only scan). |
| `quality_score` | Float | Cleanliness index (0.0 - 1.0) graded by density, metadata, ratio, and OCR markers. |
| `duplicate` | Boolean | True if an identical SHA-256 has already been processed in the concurrent batch queue. |
| `security_risk` | String | Security rating (low, medium, high) based on Zip bomb and size-threshold checks. |
| `fits_context` | Boolean | True if `token_count` fits inside the target model's context window. |
| `rag_ready` | Boolean | Suitability for search databases (true if secure, non-scanned, and clean). |
| `requires_summarization` | Boolean | Recommends pre-summarizing if the token count or page density is excessively large. |
| `recommended_chunking` | String | Suggested chunking strategy (no chunking, fixed, semantic, hierarchical, agentic). |
| `document_class` | String | Classified topical domain (Finance, Procurement, Legal, HR, Tech Doc, Research, etc.). |
| `recommended_agent` | String | Recommended downstream AI Agent (e.g. LegalAgent). |
| `estimated_embedding_cost` | Float | Predicted vector database indexing cost. |
| `estimated_llm_cost` | Float | Predicted input processing cost. |
| `processing_time_ms` | Float | Internal gateway execution latency in milliseconds. |
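To make the `requires_ocr` description more concrete, the following sketch flags documents that have pages but almost no extractable text. It is a simplified illustration and not DocGaurd's scanner: the function names are hypothetical, both thresholds are assumptions, and the whitespace-to-character ratio that the real engine also considers is omitted here.

```python
def text_density_signals(pages):
    """Summarize per-page extracted text: average characters and share of empty pages."""
    stripped = [page.strip() for page in pages]
    total_chars = sum(len(text) for text in stripped)
    empty_pages = sum(1 for text in stripped if not text)
    page_count = len(pages)
    return {
        "chars_per_page": total_chars / page_count if page_count else 0.0,
        "empty_page_ratio": empty_pages / page_count if page_count else 0.0,
    }


def probably_requires_ocr(pages, min_chars_per_page=50, max_empty_ratio=0.5):
    """Flag documents that have pages but almost no extractable text.

    Both thresholds are illustrative assumptions, not DocGaurd's values.
    """
    if not pages:
        return False  # no page structure at all, so nothing to OCR
    signals = text_density_signals(pages)
    return (
        signals["chars_per_page"] < min_chars_per_page
        or signals["empty_page_ratio"] > max_empty_ratio
    )


scanned = ["", "  \n", ""]                     # pages exist, text layer is empty
digital = ["Clause 1. " * 40, "Clause 2. " * 40]
print(probably_requires_ocr(scanned), probably_requires_ocr(digital))
```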
Supported Formats
| Format | Extension | Extraction Method | Key Features |
|---|---|---|---|
| PDF | .pdf | Native lopdf Parser | Structural reading, scanned detection, page extraction |
| Word | .docx | Native docx XML Parser | Direct paragraph and table text extraction |
| PowerPoint | .pptx | Native pptx XML Parser | Shape text, slide processing, bullet analysis |
| Excel | .xlsx | Calamine Engine | Spreadsheet parsing, cell extraction, rows estimation |
| CSV | .csv | CSV Parser | Direct row, column parsing, delimiter validation |
| Plain Text | .txt, .md | Unicode Parser | Streaming flat extraction, lossy fallback encoding |
| JSON | .json | Serde JSON | Recursive nested key-value string extraction |
| XML | .xml | Quick XML Parser | Tag-stripped text, element-wise traversal |
| HTML | .html | Quick XML Parser | Element parsing, script/style extraction filtering |
Configuration Limits
| Setting | Default Value | Purpose |
|---|---|---|
| `target_model` | "gpt-4" | Target context size limit check |
| `tokenizer_name` | "cl100k_base" | Tokenizer profile (cl100k_base, r50k_base, p50k_base) |
| `max_file_size` | 52,428,800 bytes (50MB) | Intercept oversized documents |
| `embedding_rate_per_million` | $0.02 | Custom embedding cost rate |
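The default `max_file_size` is 50 × 1024 × 1024 bytes. As a minimal sketch of what an equivalent check looks like on the caller's side, the hypothetical helper below compares a file's on-disk size with that limit before any parsing happens. Treating a file exactly at the limit as acceptable is an assumption, as the source does not state whether the boundary is inclusive.

```python
import os
import tempfile

MAX_FILE_SIZE = 50 * 1024 * 1024  # 52,428,800 bytes, the documented default


def within_size_limit(path: str, max_file_size: int = MAX_FILE_SIZE) -> bool:
    """Check a file's on-disk size without reading its contents."""
    return os.path.getsize(path) <= max_file_size


# Demonstrate with a throwaway 1 KiB file and a deliberately tiny limit.
with tempfile.NamedTemporaryFile(delete=False) as handle:
    handle.write(b"x" * 1024)
    sample_path = handle.name

try:
    fits_default = within_size_limit(sample_path)
    fits_tiny = within_size_limit(sample_path, max_file_size=512)
finally:
    os.remove(sample_path)

print(MAX_FILE_SIZE, fits_default, fits_tiny)
```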
How the OCR Integration Works
DocGaurd implements a high-performance hybrid OCR gateway under the `OcrDocumentAnalyzer` class:
- Rust-Native Gatekeeping: When a file is submitted, DocGaurd first uses its sub-millisecond Rust parsers to check the file type and structure.
  - If the document is a clean digital file (e.g., text PDF, Word doc, or markdown), the text is extracted instantly, and the heavy OCR engine is completely bypassed.
  - If the file is an image (`.png`, `.jpg`, `.jpeg`, etc.) or is flagged by the Rust quality scanner as a scanned/text-empty PDF (`requires_ocr: True`), the OCR engine is initialized.
- Lazy Loading: To keep package imports sub-millisecond, PyTorch and EasyOCR model weights are loaded lazily, on demand, only when the first scanned document or raw image is encountered.
- Hardware Auto-Detection: The engine detects your host hardware to run deep learning models at maximum speed:
  - macOS (Apple Silicon): Natively offloads tensor computations to the GPU via Metal Performance Shaders (MPS).
  - Windows/Linux with GPU: Automatically targets your Nvidia GPU via CUDA.
  - Fallback: Runs on optimized multi-threaded CPU.
- Rust Telemetry Reconciliation: Once text is extracted via OCR, the raw text bytes are passed back into DocGaurd's Rust core using a virtual text buffer. The Rust engine then computes exact GPT token budgets (`tiktoken-rs`), counts words/characters, runs domain classification, and generates cost estimations, reconciling all statistics back into a single unified JSON schema.
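The gatekeeping and lazy-loading steps above can be sketched in a few lines. Everything here is a stand-in: `get_ocr_engine` returns a dummy callable instead of loading PyTorch or EasyOCR, the helper names are hypothetical, and the extension set covers only the three suffixes the text names explicitly.

```python
from functools import lru_cache
from pathlib import Path

IMAGE_EXTENSIONS = {".png", ".jpg", ".jpeg"}  # the source lists these plus "etc."
load_count = 0  # counts how often the expensive load actually runs


@lru_cache(maxsize=1)
def get_ocr_engine():
    """Stand-in for an expensive model load; runs once, on first use."""
    global load_count
    load_count += 1
    return lambda file_name: f"<ocr text for {file_name}>"


def needs_ocr(file_name: str, requires_ocr: bool) -> bool:
    """Route to OCR for raw images or when the fast pass flagged the file."""
    return Path(file_name).suffix.lower() in IMAGE_EXTENSIONS or requires_ocr


def extract_text(file_name: str, fast_text: str, requires_ocr: bool) -> str:
    """Return fast-path text for digital files, OCR text otherwise."""
    if needs_ocr(file_name, requires_ocr):
        return get_ocr_engine()(file_name)  # engine is loaded lazily here
    return fast_text  # clean digital file: OCR is bypassed entirely


print(extract_text("notes.md", "hello", requires_ocr=False), load_count)
print(extract_text("scan.JPG", "", requires_ocr=False), load_count)
print(extract_text("scanned.pdf", "", requires_ocr=True), load_count)
```

The counter stays at zero until the first image or flagged PDF arrives and never exceeds one afterwards, which is the property that keeps imports cheap for pipelines that only see digital documents.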
Examples
Example 1: RAG Ingestion Security & Quality Gatekeeper
Ensure that only secure, high-quality, digital documents enter your vector database:
```python
import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer()
report = json.loads(analyzer.analyze_file("user_upload.pdf"))

# Intercept risks at the gateway
if report["security_risk"] == "high":
    raise ValueError(f"CRITICAL: Security exception triggered for {report['file_name']}")

if report["requires_ocr"]:
    print(f"Routing {report['file_name']} to hardware-accelerated OCR pipeline.")
elif not report["rag_ready"]:
    print(f"Skipping {report['file_name']} due to low text quality score: {report['quality_score']}")
else:
    print(f"Ingesting clean document text. Context Size: {report['token_count']} tokens.")
```
Example 2: API Cost Budgeting & Model Window Check
Calculate API transaction costs and verify if a document fits within a model's context window:
```python
import json
import docgaurd

analyzer = docgaurd.DocumentAnalyzer({
    "target_model": "gpt-3.5-turbo",
    "llm_input_rate_per_million": 1.50
})
report = json.loads(analyzer.analyze_file("long_transcript.txt"))

if not report["fits_context"]:
    print(f"Document exceeds target context window. Recommended chunking strategy: {report['recommended_chunking']}")
else:
    print(f"Document fits. Estimated processing cost: ${report['estimated_llm_cost']:.4f}")
```
Example 3: Hardware-Accelerated OCR Integration (Metal/CUDA)
Incorporate unified OCR for scanned files directly from the installed package:
```python
import json
from docgaurd import OcrDocumentAnalyzer

# Initialize unified OcrDocumentAnalyzer (auto-routes to Apple Metal MPS or CUDA)
gateway = OcrDocumentAnalyzer()
report_json = gateway.analyze_file("scanned_receipt.jpg")
report = json.loads(report_json)

print(f"OCR Text: {report['text']}")
print(f"OCR Tokens: {report['token_count']} | RAG Ready: {report['rag_ready']}")
```
License
This project is licensed under the MIT License.