Document AI - Intelligent document processing and extraction

Document AI

Documentation: https://zeel-04.github.io/document-ai/

A library for parsing, formatting, and processing documents. Use it to build AI-powered document processing pipelines with structured data extraction and citation support.

Features

  • Extract structured data from PDF documents using LLMs
  • Automatic citation tracking with page numbers, line numbers, and bounding boxes
  • Support for digital PDFs
  • Type-safe data models using Pydantic
  • OpenAI integration with support for reasoning models

Installation

Requirements

  • Python >= 3.10
  • OpenAI API key

Install uv

First, install uv if you haven't already:

curl -LsSf https://astral.sh/uv/install.sh | sh

Install from Source

Clone the repository and install the package:

git clone https://github.com/zeel-04/document-ai.git
cd document-ai
uv sync

Install from Git (Alternative)

You can also install directly from the git repository:

uv pip install git+https://github.com/zeel-04/document-ai.git

Quick Start

Set up your OpenAI API key:

echo "OPENAI_API_KEY=your-api-key-here" > .env
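A missing or placeholder key usually only surfaces later as an authentication error from the API. As a minimal sketch (the helper name `require_api_key` is hypothetical and not part of this library), you could fail early with a clearer message using only the standard library:

```python
import os


def require_api_key(name: str = "OPENAI_API_KEY") -> str:
    """Return the named environment variable, or raise a clear error.

    Hypothetical helper for illustration; it is not part of document-ai.
    """
    value = os.environ.get(name, "").strip()
    # Treat the placeholder from the example .env line as "not set".
    if not value or value == "your-api-key-here":
        raise RuntimeError(
            f"{name} is not set. Add it to .env or export it in your shell."
        )
    return value
```

If you use this, call it after `load_dotenv()` so that values from `.env` are already in the environment.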

Here's a simple example to extract structured data from a PDF:

from dotenv import load_dotenv
from document_ai.processer import DocumentProcessor
from document_ai.llm import OpenAILLM
from pydantic import BaseModel

# Load environment variables
load_dotenv()

# Initialize the LLM
llm = OpenAILLM()

# Create a processor from a PDF file
processor = DocumentProcessor.from_digital_pdf(
    uri="path/to/your/document.pdf",
    llm=llm,
)

# Define your data model.
# To include a citation for a field, add a companion field
# typed as `processor.citation_type`.
class MyData(BaseModel):
    my_data: str
    my_data_citation: processor.citation_type

# Extract structured data
response = processor.extract(
    model="gpt-5-mini",
    reasoning={"effort": "low"},
    response_format=MyData,
)

# Get the extracted data
data = response.model_dump()
print(data)

Sample Output

{
    "my_data": "my data",
    "my_data_citation": [{
        "page": 0,
        "lines": [10],
        "bboxes": [{
            "x0": 0.058823529411764705,
            "top": 0.6095707475757575,
            "x1": 0.5635455037254902,
            "bottom": 0.6221969596969696
        }]
    }]
}
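The bounding box values in the sample all fall between 0 and 1, which suggests they are fractions of the page size. Assuming that is the case, and assuming a top-left origin and a 0-based page index (neither is confirmed here), the following sketch turns a citation into a short description and scales a box to pixel coordinates for a rendered page image. The helper names and the 1000 x 1000 page size are hypothetical.

```python
from typing import Any


def bbox_to_pixels(
    bbox: dict[str, float], width: int, height: int
) -> tuple[int, int, int, int]:
    """Scale a bbox to (left, top, right, bottom) in pixels.

    Assumes the values are fractions of the page width and height.
    """
    return (
        round(bbox["x0"] * width),
        round(bbox["top"] * height),
        round(bbox["x1"] * width),
        round(bbox["bottom"] * height),
    )


def describe_citations(citations: list[dict[str, Any]]) -> list[str]:
    """Render each citation as a short human-readable reference."""
    return [
        f"page index {c['page']}, lines {c['lines']}, {len(c['bboxes'])} box(es)"
        for c in citations
    ]


# The sample output from above, as a plain dict.
sample = {
    "my_data": "my data",
    "my_data_citation": [{
        "page": 0,
        "lines": [10],
        "bboxes": [{
            "x0": 0.058823529411764705,
            "top": 0.6095707475757575,
            "x1": 0.5635455037254902,
            "bottom": 0.6221969596969696,
        }],
    }],
}

citation = sample["my_data_citation"][0]
print(describe_citations(sample["my_data_citation"]))
# → ['page index 0, lines [10], 1 box(es)']
print(bbox_to_pixels(citation["bboxes"][0], width=1000, height=1000))
# → (59, 610, 564, 622)
```

Pass the real pixel dimensions of your rendered page in place of the 1000 x 1000 used here.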

Documentation

For more detailed documentation, see the docs directory or visit the documentation site at https://zeel-04.github.io/document-ai/.

Development Setup

Prerequisites:

  • Python 3.10+
  • uv

git clone https://github.com/zeel-04/document-ai.git
cd document-ai
uv venv
uv sync

Download files

Source Distribution

doc_intelligence-0.1.0.tar.gz (117.6 kB)

Built Distribution

doc_intelligence-0.1.0-py3-none-any.whl (11.4 kB)

File details

Details for the file doc_intelligence-0.1.0.tar.gz.

File metadata

  • Download URL: doc_intelligence-0.1.0.tar.gz
  • Size: 117.6 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.26

File hashes

Hashes for doc_intelligence-0.1.0.tar.gz
Algorithm    Hash digest
SHA256       0ebcf7116e89ba4f3fa0441b7b8628a445118bf488caa90c1fc5bee16b516ea4
MD5          5a0693c016dbc29466ea60805d291cbd
BLAKE2b-256  17b2f0bf4ec5dd038fde5cc2dc181e9a64dc06669f5595c46910e2d79446d914

File details

Details for the file doc_intelligence-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: doc_intelligence-0.1.0-py3-none-any.whl
  • Size: 11.4 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: uv/0.9.26

File hashes

Hashes for doc_intelligence-0.1.0-py3-none-any.whl
Algorithm    Hash digest
SHA256       761af6a751829700ad7e0b32a5a4035b83acab44bd6856e6f1b6bd6b61fa5a9e
MD5          e4643e716e1685e6e4fe61fe39ad696d
BLAKE2b-256  22458a76cb4a96f3815e07cd0f167e658d394975a352653f8cccd57dd58edf7c
