Document AI - Intelligent document processing and extraction

Document AI

Documentation: https://zeel-04.github.io/doc-intelligence/

A library for parsing, formatting, and processing documents. Use it to build AI-powered document processing pipelines with structured data extraction and citation support.

Features

  • Extract structured data from PDF documents using LLMs
  • Automatic citation tracking with page numbers, line numbers, and bounding boxes
  • Support for digital PDFs and scanned (image-only) PDFs via OCR
  • Type-safe data models using Pydantic
  • Multi-provider LLM support: OpenAI, Anthropic, Gemini, Ollama
  • Pluggable OCR pipeline — swap in any layout detector or OCR engine

Installation

Requirements

  • Python >= 3.10
  • An API key for your chosen LLM provider (OpenAI, Anthropic, or Gemini) — or a local Ollama server

Install with uv

uv pip install doc-intelligence

Or with pip:

pip install doc-intelligence

Quick Start

Set up your API key (example with OpenAI):

echo "OPENAI_API_KEY=your-api-key-here" > .env

Configure a PDFProcessor once, then pass the document and schema per call:

from doc_intelligence import PDFExtractionMode, PDFProcessor
from pydantic import BaseModel

class License(BaseModel):
    license_name: str

processor = PDFProcessor(
    provider="openai",
    model="gpt-4o-mini",
    include_citations=True,
    extraction_mode=PDFExtractionMode.SINGLE_PASS,
)

result = processor.extract(
    "https://example-files.online-convert.com/document/pdf/example.pdf",
    License,
)
print(f"Extracted data: {result.data}")
print(f"Metadata: {result.metadata}")

Sample Output

The extract method returns an ExtractionResult with .data and .metadata attributes:

result.data
# License(license_name='Attribution-ShareAlike 3.0 Unported')

result.metadata
# {
#     'license_name': {
#         'value': 'Attribution-ShareAlike 3.0 Unported',
#         'citations': [{
#             'page': 0,
#             'bboxes': [{
#                 'x0': 0.201,
#                 'top': 0.859,
#                 'x1': 0.565,
#                 'bottom': 0.872
#             }]
#         }]
#     }
# }
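
The citation metadata can be post-processed with plain Python, for example to highlight cited regions on a rendered page. The sketch below assumes the metadata is a plain dict shaped exactly like the sample output, and that bounding-box values are fractions of the page width and height (the sample values all lie between 0 and 1, but this is an assumption, not documented behavior). iter_citations and to_pixels are hypothetical helpers, not part of doc-intelligence.

```python
from typing import Iterator

# Metadata copied from the sample output above.
metadata = {
    "license_name": {
        "value": "Attribution-ShareAlike 3.0 Unported",
        "citations": [{
            "page": 0,
            "bboxes": [{"x0": 0.201, "top": 0.859, "x1": 0.565, "bottom": 0.872}],
        }],
    }
}


def iter_citations(metadata: dict) -> Iterator[tuple[str, int, dict]]:
    """Yield (field_name, page, bbox) for every cited bounding box."""
    for field, entry in metadata.items():
        for citation in entry.get("citations", []):
            for bbox in citation.get("bboxes", []):
                yield field, citation["page"], bbox


def to_pixels(bbox: dict, width: int, height: int) -> tuple[int, int, int, int]:
    """Scale a bbox to pixel coordinates (left, top, right, bottom).

    Assumes the bbox values are fractions of the page size.
    """
    return (
        round(bbox["x0"] * width),
        round(bbox["top"] * height),
        round(bbox["x1"] * width),
        round(bbox["bottom"] * height),
    )


# Print each cited region scaled to a hypothetical 1000x1000 page image.
for field, page, bbox in iter_citations(metadata):
    print(field, page, to_pixels(bbox, width=1000, height=1000))
```

Pass the real pixel dimensions of your rendered page image in place of the 1000x1000 values used here.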

Scanned PDFs

For image-only PDFs, use strategy=ParseStrategy.SCANNED and supply your own layout detector and OCR engine:

from doc_intelligence import PDFProcessor, ParseStrategy

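# my_layout_detector, my_ocr_engine, and Invoice (a Pydantic model like
# License above) are placeholders that you define yourself.
processor = PDFProcessor(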
    provider="openai",
    strategy=ParseStrategy.SCANNED,
    layout_detector=my_layout_detector,
    ocr_engine=my_ocr_engine,
)
result = processor.extract("scanned_invoice.pdf", Invoice)

See the Scanned PDFs guide and Custom OCR Components docs for details.

Documentation

For more detailed documentation, see the docs directory or visit the documentation site at https://zeel-04.github.io/doc-intelligence/.

Development Setup

Prerequisites:

  • Python 3.10+
  • uv

git clone https://github.com/zeel-04/doc-intelligence.git
cd doc-intelligence
uv venv
uv sync
