Comprehensive document processing toolkit for AI/ML applications

Document AI Toolkit

Production-ready document processing toolkit with AI capabilities for text extraction, table detection, OCR, entity recognition, and document classification.

Features

  • Multi-Format Support: PDF, DOCX, HTML, Markdown, TXT, images, and more
  • Text Extraction: Simple, layout-aware, and structured extraction modes
  • Table Detection: Automatic table detection with cell-level extraction
  • OCR Integration: Built-in OCR support with Tesseract, EasyOCR, PaddleOCR
  • Entity Extraction: Named entity recognition (persons, organizations, dates, etc.)
  • Document Classification: Automatic document type detection
  • Layout Analysis: Detect headers, footers, paragraphs, lists, and more
  • Zero Dependencies Core: Core functionality works without heavy dependencies

Installation

pip install document-ai-toolkit            # Core
pip install "document-ai-toolkit[pdf]"     # PDF support
pip install "document-ai-toolkit[ocr]"     # OCR support
pip install "document-ai-toolkit[full]"    # All features
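
To see which optional pieces are actually present after installing, a quick import check can help. The sketch below uses only the standard library. The OCR names (`pytesseract`, `easyocr`, `paddleocr`) are the usual import names of those projects; whether the `[ocr]` or `[full]` extras install any of them is an assumption, not something stated here.

```python
from importlib.util import find_spec

# Top-level import names to probe. The OCR names are assumptions about
# which backends the extras pull in; adjust them to your environment.
CANDIDATES = ["document_ai_toolkit", "pytesseract", "easyocr", "paddleocr"]


def available_modules(names):
    """Map each top-level module name to True if it can be located for import."""
    return {name: find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, present in available_modules(CANDIDATES).items():
        print(f"{name}: {'installed' if present else 'missing'}")
```

A missing name here only means the module cannot be found in the current environment; it says nothing about which extra would provide it.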

Quick Start

from document_ai_toolkit import DocumentProcessor, ProcessingConfig

# Basic processing
processor = DocumentProcessor()
result = processor.process("document.pdf")

print(result.document.content)
print(f"Pages: {result.document.metadata.page_count}")

# With tables
config = ProcessingConfig(extract_tables=True)
processor = DocumentProcessor(config)
result = processor.process("report.pdf")

for table in result.document.tables:
    print(table.to_dict())

# Classification
from document_ai_toolkit import DocumentClassifier
classifier = DocumentClassifier()
result = classifier.classify("document.pdf")
print(f"Type: {result.document_type.value} ({result.confidence:.0%})")

# Comparison
from document_ai_toolkit import DocumentComparator
comparator = DocumentComparator()
result = comparator.compare("v1.docx", "v2.docx")
print(f"Similarity: {result.similarity_score:.0%}")
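
For more than one file, the basic processing call above can be wrapped in a small loop that keeps going when a single document fails. This is a minimal sketch, not part of the toolkit: `process_many` is a hypothetical helper, and it relies only on `processor.process(path)` and `result.document.content` as shown in the Quick Start. The exceptions the toolkit raises are not documented here, so the sketch catches `Exception` broadly.

```python
from pathlib import Path


def process_many(processor, paths):
    """Run processor.process() on each path, collecting text and errors.

    `processor` is any object with a process(path) method whose result
    exposes .document.content, as in the Quick Start. Results are keyed
    by file name, so paths sharing a name would overwrite each other.
    """
    contents, errors = {}, {}
    for path in map(Path, paths):
        try:
            result = processor.process(str(path))
            contents[path.name] = result.document.content
        except Exception as exc:  # toolkit-specific exception types are unknown
            errors[path.name] = repr(exc)
    return contents, errors


# Usage (requires the package to be installed):
#   from document_ai_toolkit import DocumentProcessor
#   contents, errors = process_many(DocumentProcessor(), ["a.pdf", "b.docx"])
```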

Supported Formats

Format      Extension    Read  Write
PDF         .pdf
Word        .docx
HTML        .html
Markdown    .md
Plain Text  .txt
Images      .png, .jpg

License

MIT License - Pranay M

Download files

Source Distribution

document_ai_toolkit-0.1.0.tar.gz (12.8 kB)

Uploaded Source

Built Distribution

document_ai_toolkit-0.1.0-py3-none-any.whl (14.2 kB)

Uploaded Python 3

File details

Details for the file document_ai_toolkit-0.1.0.tar.gz.

File metadata

  • Download URL: document_ai_toolkit-0.1.0.tar.gz
  • Size: 12.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.2.0 CPython/3.13.5

File hashes

Hashes for document_ai_toolkit-0.1.0.tar.gz
Algorithm Hash digest
SHA256 7fd944b03c1e0366984988e3868833c10011cbab0040e420182cdecb5e30953f
MD5 73475b06dcdfa68fb44a5bf6172447b8
BLAKE2b-256 6f7dec5c414f2c19bb6b3c64563cd2417d49502036055849416869e0efea0ef2
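
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib`. This is a minimal sketch; `sha256_of` and `matches` are illustrative helper names, and the file path in the usage comment assumes the sdist was saved to the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches(path, expected_hex):
    """Compare a file's SHA256 with a published digest, ignoring case."""
    return sha256_of(path) == expected_hex.strip().lower()


# Usage, assuming the sdist is in the current directory:
#   matches("document_ai_toolkit-0.1.0.tar.gz",
#           "7fd944b03c1e0366984988e3868833c10011cbab0040e420182cdecb5e30953f")
```

pip can perform the same check at install time through its hash-checking mode, using `--hash` entries in a requirements file.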

File details

Details for the file document_ai_toolkit-0.1.0-py3-none-any.whl.

File hashes

Hashes for document_ai_toolkit-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 b987e1696ec671468c602b6f3074304a7190e07c01d75c7b4ecc33c3960e697c
MD5 44bd84cc583433e84644b732aa9e070d
BLAKE2b-256 a0a19a32bbdeaad6eeebb25517074cd101306af12632e0978a5d49a53be2eb3e
