A modular document-processing pipeline for AI-powered document intelligence.
Tandon AI Document Intelligence
A Production-Ready Unstructured Document Analytics Framework
This library implements a modular, end-to-end pipeline for processing unstructured documents (PDFs). It moves beyond simple OCR by integrating automated classification, structured extraction (text & tables), LLM-powered enrichment (risk analysis, summarization), and quality validation.
Designed for high-compliance environments (Engineering, Legal, Finance) where data accuracy and semantic understanding are critical.
🏗️ System Architecture
graph TD
A[Ingestion] --> B{Classification}
B -->|Digital PDF| C[PyMuPDF + Camelot]
B -->|Scanned PDF| D[Pre-processing + Tesseract OCR]
C --> E[Text & Table Stream]
D --> E
E --> F[Enrichment & Validation]
F -->|LLM| G[Summary, Entities, Risk]
F -->|Analytics| H[Readability, Sentiment, Factuality]
F --> I{Vector Store}
I -->|Embeddings| J[ChromaDB]
J --> K[Semantic Search & Retrieval]
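The classification branch in the diagram can be illustrated with a small routing sketch. This is not the library's implementation: `PageStats`, `classify_pdf`, `route`, and the 50-characters-per-page threshold are hypothetical names and values chosen for illustration, and the per-page character counts are assumed to come from whatever PDF reader is in use.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PageStats:
    """Hypothetical per-page measurement, e.g. gathered with a PDF reader."""
    char_count: int  # characters found in the page's embedded text layer


def classify_pdf(pages: List[PageStats], min_chars_per_page: int = 50) -> str:
    """Return "digital" if a strict majority of pages carry a usable text layer."""
    if not pages:
        raise ValueError("document has no pages")
    text_pages = sum(1 for p in pages if p.char_count >= min_chars_per_page)
    return "digital" if text_pages * 2 > len(pages) else "scanned"


def route(kind: str) -> str:
    """Map a classification label to the extraction branch named in the diagram."""
    routes = {
        "digital": "PyMuPDF + Camelot",
        "scanned": "Pre-processing + Tesseract OCR",
    }
    return routes[kind]


if __name__ == "__main__":
    doc = [PageStats(1200), PageStats(900), PageStats(0)]
    print(route(classify_pdf(doc)))
```

A majority vote is only one possible rule; a mixed document (some digital pages, some scanned) may be better served by classifying each page separately.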
🚀 Key Features
- Intelligent Ingestion: Automatically detects if a PDF is Digital (selectable text) or Scanned (image-based).
- Hybrid Extraction:
  - Digital: Uses PyMuPDF for high-fidelity text extraction and Camelot for structured tables.
  - Scanned: Routes through Tesseract OCR (pluggable with AWS/Azure) for image-to-text conversion. Includes auto-deskewing and denoising.
- LLM Enrichment: Uses OpenAI to:
  - Summarize content.
  - Extract key entities (People, Orgs, Dates).
  - Analyze potential Risks (Legal/Financial).
- Factuality Check: Scores summary against source text to detect hallucinations.
- Quality Validation Loop: Automatically scores extraction quality based on text density, OCR noise, and table confidence.
- Research-Grade Analytics:
  - Readability: Flesch Reading Ease, Gunning Fog Index.
  - Semantic: Sentiment Analysis, Subjectivity, Lexical Diversity.
  - Clustering: PCA & K-Means visualization of document embeddings.
- Vector Store Ready: Generates embeddings (OpenAI) and stores chunked text in ChromaDB for semantic search. Supports Hybrid Search (Keyword + Vector) using Reciprocal Rank Fusion (RRF) combined with rank_bm25.
- Benchmarking & Evaluation: Includes tools for calculating CER/WER, Recall@k, Precision@k, and nDCG against ground truth. Support for Dataset Manifests and Aggregated Reporting.
- Cost & Token Tracking: Detailed tracking of LLM token usage (Input/Output) and cost estimation per document.
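To make the Hybrid Search feature more concrete, here is a minimal sketch of Reciprocal Rank Fusion over two already-ranked result lists. It is an illustration rather than the library's code: the function name `reciprocal_rank_fusion`, the chunk IDs, and the constant `k = 60` (a commonly used default) are assumptions, and the keyword and vector rankings are taken as given instead of being computed with rank_bm25 or ChromaDB.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several rankings: each document scores sum(1 / (k + rank)) over the lists."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):  # ranks are 1-based
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists, best match first.
keyword_hits = ["chunk-3", "chunk-1", "chunk-7"]
vector_hits = ["chunk-1", "chunk-4", "chunk-3"]

fused = reciprocal_rank_fusion([keyword_hits, vector_hits])
print(fused)  # ['chunk-1', 'chunk-3', 'chunk-4', 'chunk-7']
```

Because RRF uses only rank positions, it needs no normalization between BM25 scores and embedding distances, which sit on different scales.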
📦 Installation
Prerequisites
- Python 3.9+
- System Dependencies:
  - tesseract (for OCR)
  - ghostscript (required by Camelot)
  - tk (required by Camelot)
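Before installing, it can help to confirm that the system tools are on the PATH. The following is a small sketch, not part of the library: `find_system_tools` is a hypothetical helper, and the executable names (`tesseract`, and `gs` for Ghostscript) are assumptions for a typical Unix-like install and may differ on Windows. The tk dependency is not covered here.

```python
import shutil
from typing import Dict, Optional

# Assumed executable names on a Unix-like system; adjust for your platform.
REQUIRED_TOOLS = {"tesseract": "tesseract", "ghostscript": "gs"}


def find_system_tools(tools: Dict[str, str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to the resolved executable path, or None if not on PATH."""
    return {name: shutil.which(exe) for name, exe in tools.items()}


if __name__ == "__main__":
    for name, path in find_system_tools().items():
        print(f"{name}: {path or 'NOT FOUND'}")
```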
Install the Library
Clone the repository and install in editable mode:
pip install -e .
🖥️ Usage
1. Run the Web Dashboard (Dash)
We recommend using the Dash dashboard for the best visual experience and advanced analytics.
python dash_app.py
Open http://127.0.0.1:8050 in your browser.
Note: The older Streamlit (app.py) and Gradio (gradio_app.py) apps are available but deprecated.
2. Run Research Benchmarks
For benchmarking extraction quality (CER/WER) and retrieval performance (Recall@k) using a dataset manifest:
python scripts/run_benchmarks.py \
--data-dir ./data/test_corpus \
--manifest ./experiments/dataset_manifest.json \
--output-csv results.csv \
--api-key sk-your-key
This will generate:
- results.csv: Per-document metrics.
- results_summary.csv: Aggregated statistics (Mean CER/WER, Cost, Throughput).
- results_retrieval.csv: Retrieval metrics (nDCG@k, MRR) if queries are provided.
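For readers unfamiliar with CER and WER, both are edit distances normalized by the length of the ground-truth text, counted over characters and words respectively. The sketch below shows the standard definition under simple assumptions (no text normalization, whitespace tokenization); the function names are illustrative and are not taken from the library's metrics.py.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edits divided by the reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    print(cer("invoice", "inv0ice"))                    # one substitution in 7 characters
    print(wer("total amount due", "total amount dve"))  # one wrong word out of 3
```

Both rates can exceed 1.0 when the OCR output contains many spurious insertions.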
3. Interactive Notebooks
Explore the library capabilities with our tutorial notebooks in examples/:
- 01_pipeline_demo.ipynb: Step-by-step walkthrough of the pipeline.
- 02_hybrid_search_experiment.ipynb: Compare Vector Search vs. Hybrid Search (Vector + BM25).
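When comparing the two search modes, the retrieval metrics named earlier can be computed from a ranked list of chunk IDs and a set of relevant IDs. This is a minimal sketch assuming binary relevance and unique IDs in each ranking; the function names are illustrative and may not match those in the library.

```python
import math
from typing import Sequence, Set


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant items that appear in the top k results."""
    if not relevant:
        raise ValueError("relevant set must be non-empty")
    return sum(1 for d in ranked[:k] if d in relevant) / len(relevant)


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for d in ranked[:k] if d in relevant) / k


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(ranked[:k], start=1) if d in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


if __name__ == "__main__":
    ranked = ["a", "b", "c", "d"]
    relevant = {"a", "c"}
    print(recall_at_k(ranked, relevant, 2), precision_at_k(ranked, relevant, 2))
    print(ndcg_at_k(ranked, relevant, 3))
```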
4. Use in Python Code
import os
from tandon_ai_doc_intel import DocumentPipeline
# 1. Initialize Pipeline
pipeline = DocumentPipeline(openai_api_key="sk-...")
# 2. Process a Document
result = pipeline.process("invoice.pdf")
# 3. Access Insights
print(f"Validation Score: {result.validation_score}")
print(f"Readability (Flesch): {result.readability_score}")
print(f"Risk Level: {result.risk_analysis['risk_level']}")
# 4. Access Structured Data
if result.tables:
    print(f"Found {len(result.tables)} tables.")
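The same calls extend naturally to a folder of documents. The sketch below relies only on the `process` method and `validation_score` attribute shown above; `process_folder` and `min_score` are hypothetical, and because the scale of `validation_score` is not documented here, the threshold is left to the caller.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple


def process_folder(pipeline: Any, folder: str,
                   min_score: float) -> Tuple[Dict[str, Any], List[str]]:
    """Process every *.pdf in a folder and split results by validation score.

    `pipeline` is any object exposing process(path) whose result has a
    validation_score attribute, such as the DocumentPipeline shown above.
    """
    accepted: Dict[str, Any] = {}
    needs_review: List[str] = []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        result = pipeline.process(str(pdf))
        if result.validation_score >= min_score:
            accepted[pdf.name] = result
        else:
            needs_review.append(pdf.name)  # flag for manual inspection
    return accepted, needs_review
```

Passing the pipeline in as an argument keeps the helper easy to test with a stub object and avoids creating a new client per document.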
📂 Project Structure
- src/tandon_ai_doc_intel/: Core library package.
  - pipeline.py: Orchestrator for the entire flow.
  - ingestion.py: Handles file loading (paths, bytes).
  - classification.py: Detects digital vs. scanned PDFs.
  - extraction/: Modules for PyMuPDF (digital) and Tesseract (scanned).
  - enrichment/: LLM integration for summary, entities, and risk.
  - analytics.py: Advanced metrics (Readability, Sentiment, NLP).
  - validation.py: Quality assurance checks.
  - embeddings/: Vector generation and storage.
  - metrics.py: Calculation of CER, WER, and Retrieval metrics.
  - evaluation.py: Helper class for ground-truth comparison.
- scripts/: Utility scripts.
  - run_benchmarks.py: Batch processing and evaluation script.
- experiments/: Directory for datasets and experiment configurations.
- examples/: Example notebooks and scripts.
- dash_app.py: The main interactive web application.
🤝 Contributing
- Fork the repo.
- Create your feature branch (git checkout -b feature/amazing-feature).
- Commit your changes (git commit -m 'Add some amazing feature').
- Push to the branch (git push origin feature/amazing-feature).
- Open a Pull Request.