A modular document-processing pipeline for AI-powered document intelligence.

Tandon AI Document Intelligence

A Production-Ready Unstructured Document Analytics Framework

This library implements a modular, end-to-end pipeline for processing unstructured documents (PDFs). It moves beyond simple OCR by integrating automated classification, structured extraction (text & tables), LLM-powered enrichment (risk analysis, summarization), and quality validation.

Designed for high-compliance environments (Engineering, Legal, Finance) where data accuracy and semantic understanding are critical.

🏗️ System Architecture

graph TD
    A[Ingestion] --> B{Classification}
    B -->|Digital PDF| C[PyMuPDF + Camelot]
    B -->|Scanned PDF| D[Pre-processing + Tesseract OCR]
    C --> E[Text & Table Stream]
    D --> E
    E --> F[Enrichment & Validation]
    F -->|LLM| G[Summary, Entities, Risk]
    F -->|Analytics| H[Readability, Sentiment, Factuality]
    F --> I{Vector Store}
    I -->|Embeddings| J[ChromaDB]
    J --> K[Semantic Search & Retrieval]
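
The branch at the Classification node can be pictured as a text-density check: a PDF with a real text layer yields many extractable characters per page, while a scanned one yields almost none. The sketch below is only an illustration under that assumption. It takes per-page text that has already been extracted, and the names `classify_pdf`, `route`, `min_chars_per_page`, and `min_text_page_ratio` are hypothetical, not the library's actual API or thresholds.

```python
from typing import List


def classify_pdf(page_texts: List[str],
                 min_chars_per_page: int = 50,
                 min_text_page_ratio: float = 0.5) -> str:
    """Return "digital" or "scanned" from the text already extracted per page.

    Hypothetical heuristic: a page counts as having a text layer when it
    holds at least `min_chars_per_page` non-whitespace characters; the
    document is "digital" when enough pages pass that check.
    """
    if not page_texts:
        return "scanned"  # nothing extractable, so fall back to OCR

    pages_with_text = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    ratio = pages_with_text / len(page_texts)
    return "digital" if ratio >= min_text_page_ratio else "scanned"


def route(page_texts: List[str]) -> str:
    """Map the classification to the extraction branch named in the diagram."""
    branches = {
        "digital": "PyMuPDF + Camelot",
        "scanned": "Pre-processing + Tesseract OCR",
    }
    return branches[classify_pdf(page_texts)]
```

A ratio-based rule rather than an all-or-nothing one keeps a mostly digital document with a few image-only pages on the digital branch; the threshold values here are placeholders to tune on real data.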

🚀 Key Features

Intelligent Ingestion: Automatically detects whether a PDF is Digital (selectable text) or Scanned (image-based).
Hybrid Extraction:
- Digital: Uses PyMuPDF for high-fidelity text extraction and Camelot for structured tables.
- Scanned: Routes through Tesseract OCR for image-to-text conversion, with automatic deskewing and denoising. The OCR backend is pluggable (AWS/Azure).
LLM Enrichment: Uses OpenAI to:
- Summarize content.
- Extract key entities (People, Orgs, Dates).
- Analyze potential Risks (Legal/Financial).
- Check factuality by scoring the summary against the source text to detect hallucinations.
Quality Validation Loop: Automatically scores extraction quality based on text density, OCR noise, and table confidence.
Research-Grade Analytics:
- Readability: Flesch Reading Ease, Gunning Fog Index.
- Semantic: Sentiment Analysis, Subjectivity, Lexical Diversity.
- Clustering: PCA & K-Means visualization of document embeddings.
Vector Store Ready: Generates embeddings (OpenAI) and stores chunked text in ChromaDB for semantic search. Supports Hybrid Search (Keyword + Vector): BM25 keyword rankings (via rank_bm25) are merged with vector results using Reciprocal Rank Fusion (RRF).
Benchmarking & Evaluation: Includes tools for calculating CER/WER, Recall@k, Precision@k, and nDCG against ground truth. Supports dataset manifests and aggregated reporting.
Cost & Token Tracking: Detailed tracking of LLM token usage (Input/Output) and cost estimation per document.
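
The CER and WER figures listed under Benchmarking & Evaluation are edit-distance ratios: the number of insertions, deletions, and substitutions needed to turn the extracted text into the ground truth, divided by the length of the ground truth in characters (CER) or words (WER). The following is a minimal standalone sketch of that definition. The function names are illustrative and are not taken from the library's metrics.py.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (characters or words)."""
    previous = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        current = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_item != hyp_item)
            current.append(min(previous[j] + 1,     # drop a reference item
                               current[j - 1] + 1,  # extra hypothesis item
                               substitution))
        previous = current
    return previous[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character edits / reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)
```

Both rates can exceed 1.0 when the extracted text is much longer than the reference, so they are error ratios rather than percentages capped at 100.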

📦 Installation

Prerequisites

Python 3.9+
System Dependencies:
- tesseract (for OCR)
- ghostscript (required by Camelot)
- tk (required by Camelot)

Install the Library

Clone the repository and install in editable mode:

pip install -e .

🖥️ Usage

1. Run the Web Dashboard (Dash)

We recommend using the Dash dashboard for the best visual experience and advanced analytics.

python dash_app.py

Open http://127.0.0.1:8050 in your browser.

Note: The older Streamlit (app.py) and Gradio (gradio_app.py) apps are available but deprecated.

2. Run Research Benchmarks

To benchmark extraction quality (CER/WER) and retrieval performance (Recall@k) against a dataset manifest, run:

python scripts/run_benchmarks.py \
    --data-dir ./data/test_corpus \
    --manifest ./experiments/dataset_manifest.json \
    --output-csv results.csv \
    --api-key sk-your-key

This will generate:

results.csv: Per-document metrics.
results_summary.csv: Aggregated statistics (Mean CER/WER, Cost, Throughput).
results_retrieval.csv: Retrieval metrics (nDCG@k, MRR) if queries are provided.
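
For readers unfamiliar with the retrieval metrics named above, here is a small sketch of Recall@k, MRR, and nDCG@k using their standard definitions with binary relevance. It assumes each ranked list contains unique document ids; the function names are illustrative and do not describe how run_benchmarks.py computes its columns.

```python
import math
from typing import List, Set, Tuple


def recall_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of the relevant documents that appear in the top k."""
    if not relevant_ids:
        return 0.0
    return len(set(ranked_ids[:k]) & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: List[str], relevant_ids: Set[str]) -> float:
    """1 / rank of the first relevant document, or 0.0 if none is found."""
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs: List[Tuple[List[str], Set[str]]]) -> float:
    """MRR: average reciprocal rank over (ranking, relevant set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def ndcg_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int) -> float:
    """Binary-relevance nDCG@k: DCG of the ranking / DCG of an ideal ranking."""
    dcg = sum(
        1.0 / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
        if doc_id in relevant_ids
    )
    ideal_hits = min(len(relevant_ids), k)
    idcg = sum(1.0 / math.log2(rank + 1) for rank in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0
```

Recall@k only asks whether relevant chunks were retrieved at all, while MRR and nDCG also reward placing them near the top, which is why a benchmark typically reports both kinds.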

3. Interactive Notebooks

Explore the library capabilities with our tutorial notebooks in examples/:

01_pipeline_demo.ipynb: Step-by-step walkthrough of the pipeline.
02_hybrid_search_experiment.ipynb: Compare Vector Search vs. Hybrid Search (Vector + BM25).
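
As background for the hybrid search comparison, Reciprocal Rank Fusion merges several ranked lists by giving each document a score of 1 / (k + rank) per list and summing across lists. The sketch below is a generic version of that technique, not the library's implementation. The constant k = 60 is a commonly used default and is an assumption here, as are the function name and the tie-breaking rule.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse ranked lists of document ids into one list, best first.

    Each list contributes 1 / (k + rank) for every id it contains
    (rank starts at 1). Ties are broken by id to keep output deterministic.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

Because only ranks are used, the BM25 scores and the vector similarities never need to be normalized onto a common scale; a chunk that appears near the top of both lists outranks one that appears in only one of them.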

4. Use in Python Code

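import os
from tandon_ai_doc_intel import DocumentPipeline

# 1. Initialize the pipeline, reading the API key from the environment
#    instead of hard-coding it in source
pipeline = DocumentPipeline(openai_api_key=os.environ["OPENAI_API_KEY"])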

# 2. Process a Document
result = pipeline.process("invoice.pdf")

# 3. Access Insights
print(f"Validation Score: {result.validation_score}")
print(f"Readability (Flesch): {result.readability_score}")
print(f"Risk Level: {result.risk_analysis['risk_level']}")

# 4. Access Structured Data
if result.tables:
    print(f"Found {len(result.tables)} tables.")
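
When results are collected in bulk, it can help to read these fields defensively. The sketch below assumes (this is not confirmed by the documentation above) that a field such as risk_analysis may be missing or None, for example if LLM enrichment did not run. `summarize_result` is a hypothetical helper, and the demo object is a stand-in rather than a real pipeline result.

```python
from types import SimpleNamespace
from typing import Any, Dict


def summarize_result(result: Any) -> Dict[str, Any]:
    """Collect the fields used above into a plain dict, tolerating gaps."""
    # Assumption: attributes can be absent or None; fall back to empty values.
    risk = getattr(result, "risk_analysis", None) or {}
    tables = getattr(result, "tables", None) or []
    return {
        "validation_score": getattr(result, "validation_score", None),
        "readability_score": getattr(result, "readability_score", None),
        "risk_level": risk.get("risk_level", "unknown"),
        "table_count": len(tables),
    }


# Stand-in object for illustration; a real result comes from pipeline.process(...)
demo = SimpleNamespace(validation_score=0.92, readability_score=61.3,
                       risk_analysis=None, tables=[])
print(summarize_result(demo))
```

A flat dict like this is convenient to append to a list and write out as CSV or JSON after processing a folder of documents.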

📂 Project Structure

src/tandon_ai_doc_intel/: Core library package.
- pipeline.py: Orchestrator for the entire flow.
- ingestion.py: Handles file loading (paths, bytes).
- classification.py: Detects digital vs. scanned PDFs.
- extraction/: Modules for PyMuPDF (digital) and Tesseract (scanned).
- enrichment/: LLM integration for summary, entities, and risk.
- analytics.py: Advanced metrics (Readability, Sentiment, NLP).
- validation.py: Quality assurance checks.
- embeddings/: Vector generation and storage.
- metrics.py: Calculation of CER, WER, and Retrieval metrics.
- evaluation.py: Helper class for ground-truth comparison.
scripts/: Utility scripts.
- run_benchmarks.py: Batch processing and evaluation script.
experiments/: Directory for datasets and experiment configurations.
examples/: Example notebooks and scripts.
dash_app.py: The main interactive web application.

🤝 Contributing

1. Fork the repo.
2. Create your feature branch (git checkout -b feature/amazing-feature).
3. Commit your changes (git commit -m 'Add some amazing feature').
4. Push to the branch (git push origin feature/amazing-feature).
5. Open a Pull Request.

Project details

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3

Version 0.1.0, released Dec 10, 2025

Download files

Source Distribution: tandon_ai_doc_intel-0.1.0.tar.gz (27.5 kB), uploaded Dec 10, 2025
Built Distribution: tandon_ai_doc_intel-0.1.0-py3-none-any.whl (29.7 kB), uploaded Dec 10, 2025, Python 3

File details

Details for the file tandon_ai_doc_intel-0.1.0.tar.gz.

File metadata

Download URL: tandon_ai_doc_intel-0.1.0.tar.gz
Upload date: Dec 10, 2025
Size: 27.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for tandon_ai_doc_intel-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`b3c3699a4c8018f59fe5f687feb1a7fecdc2936afe6562b70cf256da6f23ac7e`
MD5	`4b00c8234178ec3e4cd2667e0b4b1ff0`
BLAKE2b-256	`3038cdd4ac06cef74aabc581b9f2e3c5227b418289cd2ab9f5d1ed29b44f99b9`

File details

Details for the file tandon_ai_doc_intel-0.1.0-py3-none-any.whl.

File metadata

Download URL: tandon_ai_doc_intel-0.1.0-py3-none-any.whl
Upload date: Dec 10, 2025
Size: 29.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.2.0 CPython/3.11.14

File hashes

Hashes for tandon_ai_doc_intel-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c04531a810c4b2f16db59c6938d09e7b7fc556ada006557f61b00e5a3170f86e`
MD5	`dd39f6dc4dd93313d5cecac95410433c`
BLAKE2b-256	`85343da146c0b5465643734c14b3967ab1e17b5cf9319cdb70f3175a577b5cc1`
