A modular document-processing pipeline for AI-powered document intelligence.

Tandon AI Document Intelligence

A Production-Ready Unstructured Document Analytics Framework

This library implements a modular, end-to-end pipeline for processing unstructured documents (PDFs). It moves beyond simple OCR by integrating automated classification, structured extraction (text & tables), LLM-powered enrichment (risk analysis, summarization), and quality validation.

Designed for high-compliance environments (Engineering, Legal, Finance) where data accuracy and semantic understanding are critical.

🏗️ System Architecture

graph TD
    A[Ingestion] --> B{Classification}
    B -->|Digital PDF| C[PyMuPDF + Camelot]
    B -->|Scanned PDF| D[Pre-processing + Tesseract OCR]
    C --> E[Text & Table Stream]
    D --> E
    E --> F[Enrichment & Validation]
    F -->|LLM| G[Summary, Entities, Risk]
    F -->|Analytics| H[Readability, Sentiment, Factuality]
    F --> I{Vector Store}
    I -->|Embeddings| J[ChromaDB]
    J --> K[Semantic Search & Retrieval]
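
The branch at the Classification node can be illustrated with a simple text-density check. This is a minimal sketch, not the library's classifier: the function name `classify_pdf_pages`, the 50-character threshold, and the majority-vote rule are all assumptions made for the example. It expects per-page text that has already been pulled from the PDF's text layer (for instance by PyMuPDF) and returns the route to take.

```python
from typing import List


def classify_pdf_pages(page_texts: List[str], min_chars_per_page: int = 50) -> str:
    """Return "digital" or "scanned" from per-page text-layer content.

    Hypothetical heuristic: a page counts as digital when its text layer
    holds at least `min_chars_per_page` non-whitespace characters.
    """
    if not page_texts:
        return "scanned"  # nothing extractable, so fall back to the OCR route
    digital_pages = sum(
        1
        for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    # Majority vote: treat the file as digital when most pages have a text layer.
    return "digital" if digital_pages * 2 > len(page_texts) else "scanned"


print(classify_pdf_pages(["Invoice 1042 " * 20, "Total due: 300 USD " * 10]))
print(classify_pdf_pages(["", "  \n"]))
```

A mixed document (some pages scanned, some digital) would need per-page routing rather than a single label; the sketch deliberately ignores that case.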

🚀 Key Features

  1. Intelligent Ingestion: Automatically detects if a PDF is Digital (selectable text) or Scanned (image-based).
  2. Hybrid Extraction:
    • Digital: Uses PyMuPDF for high-fidelity text extraction and Camelot for structured tables.
    • Scanned: Routes through Tesseract OCR (pluggable with AWS/Azure) for image-to-text conversion. Includes auto-deskewing and denoising.
  3. LLM Enrichment: Uses OpenAI to:
    • Summarize content.
    • Extract key entities (People, Orgs, Dates).
    • Analyze potential Risks (Legal/Financial).
    • Check factuality by scoring the summary against the source text to flag likely hallucinations.
  4. Quality Validation Loop: Automatically scores extraction quality based on text density, OCR noise, and table confidence.
  5. Research-Grade Analytics:
    • Readability: Flesch Reading Ease, Gunning Fog Index.
    • Semantic: Sentiment Analysis, Subjectivity, Lexical Diversity.
    • Clustering: PCA & K-Means visualization of document embeddings.
  6. Vector Store Ready: Generates OpenAI embeddings and stores chunked text in ChromaDB for semantic search. Supports hybrid search (keyword + vector) that merges BM25 keyword rankings from rank_bm25 with vector rankings using Reciprocal Rank Fusion (RRF).
  7. Benchmarking & Evaluation: Includes tools for calculating CER/WER, Recall@k, Precision@k, and nDCG against ground truth, with support for dataset manifests and aggregated reporting.
  8. Cost & Token Tracking: Detailed tracking of LLM token usage (Input/Output) and cost estimation per document.
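
To make the hybrid search in feature 6 concrete, here is a minimal sketch of Reciprocal Rank Fusion. It is an illustration under stated assumptions rather than the library's implementation: the function name `reciprocal_rank_fusion` and the document ids are hypothetical, the smoothing constant `k = 60` is the commonly used default rather than a value taken from this project, and the two input lists stand in for rankings that would come from rank_bm25 and from a ChromaDB vector query.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. Ties are broken alphabetically by id.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical rankings: one from keyword search, one from vector search.
keyword_hits = ["doc_a", "doc_b", "doc_c"]
vector_hits = ["doc_c", "doc_a", "doc_d"]
print(reciprocal_rank_fusion([keyword_hits, vector_hits]))
# ['doc_a', 'doc_c', 'doc_b', 'doc_d']
```

Because RRF works on ranks rather than raw scores, BM25 scores and embedding distances never need to be put on a common scale.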

📦 Installation

Prerequisites

  • Python 3.9+
  • System Dependencies:
    • tesseract (for OCR)
    • ghostscript (required by Camelot)
    • tk (required by Camelot)

Install the Library

Clone the repository and install in editable mode:

pip install -e .

🖥️ Usage

1. Run the Web Dashboard (Dash)

We recommend using the Dash dashboard for the best visual experience and advanced analytics.

python dash_app.py

Open http://127.0.0.1:8050 in your browser.

Note: The older Streamlit (app.py) and Gradio (gradio_app.py) apps are available but deprecated.

2. Run Research Benchmarks

To benchmark extraction quality (CER/WER) and retrieval performance (Recall@k) against a dataset manifest, run:

python scripts/run_benchmarks.py \
    --data-dir ./data/test_corpus \
    --manifest ./experiments/dataset_manifest.json \
    --output-csv results.csv \
    --api-key sk-your-key

This will generate:

  • results.csv: Per-document metrics.
  • results_summary.csv: Aggregated statistics (Mean CER/WER, Cost, Throughput).
  • results_retrieval.csv: Retrieval metrics (nDCG@k, MRR) if queries are provided.
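
For readers unfamiliar with the extraction metrics, the sketch below shows how CER and WER are conventionally defined: Levenshtein edit distance between the OCR output and the ground truth, divided by the length of the ground truth in characters or words. The function names are hypothetical and the code does not reproduce the project's metrics module; any text normalization (case folding, whitespace cleanup) the benchmark script may apply is left out.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (strings or word lists)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_item in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_item in enumerate(hyp, start=1):
            cost = 0 if ref_item == hyp_item else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character edits divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# One OCR confusion ("l" read as "1") in a three-word reference.
print(cer("invoice total due", "invoice tota1 due"))
print(wer("invoice total due", "invoice tota1 due"))
```

Both rates can exceed 1.0 when the hypothesis is much longer than the reference, which is worth keeping in mind when averaging them across a corpus.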

3. Interactive Notebooks

Explore the library's capabilities with the tutorial notebooks in examples/.

4. Use in Python Code

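import os
from tandon_ai_doc_intel import DocumentPipeline

# 1. Initialize Pipeline
# Read the key from the environment rather than hard-coding it
# (assumes it is exported as OPENAI_API_KEY).
pipeline = DocumentPipeline(openai_api_key=os.environ["OPENAI_API_KEY"])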

# 2. Process a Document
result = pipeline.process("invoice.pdf")

# 3. Access Insights
print(f"Validation Score: {result.validation_score}")
print(f"Readability (Flesch): {result.readability_score}")
print(f"Risk Level: {result.risk_analysis['risk_level']}")

# 4. Access Structured Data
if result.tables:
    print(f"Found {len(result.tables)} tables.")
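
The readability value printed above is labelled Flesch. How the library computes it is not shown here, so the following is only a sketch of the standard Flesch Reading Ease formula, 206.835 − 1.015 × (words / sentences) − 84.6 × (syllables / words). The function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed for the example; dedicated readability packages use more careful rules and will give slightly different numbers.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate: count vowel groups, drop a silent final 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith("le") and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Standard Flesch Reading Ease; higher scores mean easier text."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    return (206.835
            - 1.015 * (len(words) / len(sentences))
            - 84.6 * (syllables / len(words)))


print(round(flesch_reading_ease("The cat sat on the mat. It was warm."), 2))
```

Short sentences of one-syllable words can score above 100, while dense legal or engineering text often lands well below 50.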

📂 Project Structure

  • src/tandon_ai_doc_intel/: Core library package.
    • pipeline.py: Orchestrator for the entire flow.
    • ingestion.py: Handles file loading (paths, bytes).
    • classification.py: Detects digital vs. scanned PDFs.
    • extraction/: Modules for PyMuPDF (digital) and Tesseract (scanned).
    • enrichment/: LLM integration for summary, entities, and risk.
    • analytics.py: Advanced metrics (Readability, Sentiment, NLP).
    • validation.py: Quality assurance checks.
    • embeddings/: Vector generation and storage.
    • metrics.py: Calculation of CER, WER, and Retrieval metrics.
    • evaluation.py: Helper class for ground-truth comparison.
  • scripts/: Utility scripts.
    • run_benchmarks.py: Batch processing and evaluation script.
  • experiments/: Directory for datasets and experiment configurations.
  • examples/: Example notebooks and scripts.
  • dash_app.py: The main interactive web application.
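
As a reference for the retrieval metrics that metrics.py is described as calculating, the sketch below gives the usual definitions of Precision@k, Recall@k, and nDCG@k. It assumes binary relevance (a document is either relevant or not) and unique document ids in the ranked list; the function names and ids are hypothetical and are not taken from the package.

```python
import math
from typing import Sequence, Set


def precision_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k results that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / k


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of all relevant documents found in the top-k results."""
    if not relevant:
        return 0.0
    return sum(1 for doc_id in ranked[:k] if doc_id in relevant) / len(relevant)


def ndcg_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Binary-relevance nDCG: DCG of the ranking divided by the ideal DCG."""
    dcg = sum(1.0 / math.log2(pos + 1)
              for pos, doc_id in enumerate(ranked[:k], start=1)
              if doc_id in relevant)
    ideal_hits = min(len(relevant), k)
    idcg = sum(1.0 / math.log2(pos + 1) for pos in range(1, ideal_hits + 1))
    return dcg / idcg if idcg else 0.0


# Hypothetical query result: two relevant documents, one retrieved in the top 3.
ranked = ["d3", "d1", "d7", "d2"]
relevant = {"d1", "d2"}
print(precision_at_k(ranked, relevant, 3))
print(recall_at_k(ranked, relevant, 3))
print(ndcg_at_k(ranked, relevant, 3))
```

Precision@k and Recall@k ignore position within the top k, whereas nDCG rewards placing relevant documents earlier, which is why the three are usually reported together.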

🤝 Contributing

  1. Fork the repo.
  2. Create your feature branch (git checkout -b feature/amazing-feature).
  3. Commit your changes (git commit -m 'Add some amazing feature').
  4. Push to the branch (git push origin feature/amazing-feature).
  5. Open a Pull Request.
