Open-source framework that extracts structured data from unstructured data.

OpenXtract

Python 3.12+ | License: MIT

Turn any document into structured data with AI

Open-source toolkit for extracting clean, structured data from text, images, and PDFs using large language models.


Quick Start

Installation

# Using pip
pip install open-xtract

# Using uv (recommended)
uv add open-xtract

Basic Usage

from pydantic import BaseModel
from open_xtract import OpenXtract

# Define your data structure
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str

# Initialize extractor
ox = OpenXtract(model="openai:gpt-4o")

# Extract from text (image and PDF bytes are passed the same way)
result = ox.extract("Invoice #123 from ACME Corp on 2025-03-01 for $456.78", InvoiceData)
print(result)
# InvoiceData(invoice_number='123', date='2025-03-01', total_amount=456.78, vendor='ACME Corp')
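
Because the printed result above is a Pydantic model instance, it can be serialized like any other model. The sketch below assumes Pydantic v2 and builds the InvoiceData object by hand as a stand-in for the return value of ox.extract, so it runs without an API key:

```python
from pydantic import BaseModel


class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str


# Stand-in for the object returned by ox.extract(...) in the example above.
result = InvoiceData(
    invoice_number="123",
    date="2025-03-01",
    total_amount=456.78,
    vendor="ACME Corp",
)

as_dict = result.model_dump()       # plain dict, e.g. for a database insert
as_json = result.model_dump_json()  # JSON string, e.g. for an HTTP response

# Round trip: the JSON validates back into an equal InvoiceData instance.
restored = InvoiceData.model_validate_json(as_json)
print(restored == result)  # True
```

model_dump, model_dump_json, and model_validate_json are standard Pydantic v2 methods, not part of OpenXtract.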

Features

  • Universal Input Support: Extract from text, images (PNG, JPG), and PDFs
  • Model Agnostic: Works with OpenAI, Anthropic, Google, xAI, and any OpenAI-compatible API
  • Type-Safe: Built on Pydantic for guaranteed data structure validation
  • Fast & Efficient: Optimized extraction pipeline with minimal overhead
  • Precise: Advanced prompt engineering for accurate structured data extraction
  • Simple API: One method to extract from any input type

Detailed Usage

Input Types

An extractor is created with a model identifier, which can be specified in two formats:

  1. With colon: <provider>:<model_string> (e.g., "openai:gpt-4o")
  2. Without colon: <model_string> when provider parameter is provided separately

Examples:

  • OpenXtract(model="openai:gpt-4o")
  • OpenXtract(model="gpt-4o", provider="openai")
  • OpenXtract(model="anthropic:claude-3-5-sonnet-20241022")
  • OpenXtract(model="xai:grok-beta")

The extract method accepts plain text, image bytes, or PDF bytes:

from pydantic import BaseModel
from open_xtract import OpenXtract

class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str

ox = OpenXtract(model="openai:gpt-4o")

# Extract from text
text_result = ox.extract("Invoice #INV-2024-001 from TechCorp dated 2024-03-15 for $1,250.00", InvoiceData)
print(text_result)
# InvoiceData(invoice_number='INV-2024-001', date='2024-03-15', total_amount=1250.0, vendor='TechCorp')

# Extract from image bytes
with open("receipt.png", "rb") as f:
    image_result = ox.extract(f.read(), InvoiceData)

# Extract from PDF bytes (automatically converts pages to images)
with open("invoice.pdf", "rb") as f:
    pdf_result = ox.extract(f.read(), InvoiceData)

Supported Models

# OpenAI
ox = OpenXtract(model="openai:gpt-4o")
ox = OpenXtract(model="openai:gpt-4o-mini")

# Anthropic
ox = OpenXtract(model="anthropic:claude-3-5-sonnet-20241022")
ox = OpenXtract(model="anthropic:claude-3-5-haiku-20241022")

# Google
ox = OpenXtract(model="google:gemini-2.0-flash-exp")

# xAI
ox = OpenXtract(model="xai:grok-beta")

# OpenRouter (proxy to many models)
ox = OpenXtract(model="openrouter:qwen/qwen-2.5-72b-instruct")

Configuration Options

You can configure OpenXtract using environment variables (default) or by passing parameters directly:

# Using environment variables (default)
# Set OPENAI_API_KEY=your-key in your environment or .env file
ox = OpenXtract(model="openai:gpt-4o")

# Pass API key directly
ox = OpenXtract(
    model="openai:gpt-4o",
    api_key="sk-your-api-key-here"
)

# Pass API key and custom base URL
# When api_key and base_url are provided, model can be used without colon
ox = OpenXtract(
    model="gpt-4o",
    api_key="sk-your-api-key-here",
    base_url="https://api.openai.com/v1"
)

# Use model without colon when provider is specified separately
ox = OpenXtract(
    model="gpt-4o",
    provider="openai",
    api_key="sk-your-api-key-here"
)

# Parameters take priority over environment variables
# This will use "direct-key" even if OPENAI_API_KEY is set
ox = OpenXtract(
    model="openai:gpt-4o",
    api_key="direct-key"
)

Complex Data Structures

from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel
from open_xtract import OpenXtract

class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float

class CompanyInfo(BaseModel):
    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None

class DetailedInvoice(BaseModel):
    invoice_number: str
    date: datetime
    due_date: Optional[datetime] = None
    vendor: CompanyInfo
    customer: CompanyInfo
    line_items: List[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float

# Extract complex nested structures
# complex_invoice_text must hold the raw invoice text; define it before this call
ox = OpenXtract(model="openai:gpt-4o")
result = ox.extract(complex_invoice_text, DetailedInvoice)
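
Schema validation confirms types and nesting, but not that the numbers add up. The sketch below, assuming Pydantic v2, uses a trimmed-down model (MiniInvoice, a hypothetical name) and made-up sample values to show nested validation from a plain dict and a simple arithmetic check that could be run on an extracted invoice:

```python
from datetime import datetime
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class MiniInvoice(BaseModel):  # trimmed-down stand-in for DetailedInvoice
    invoice_number: str
    date: datetime
    line_items: List[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float


def totals_are_consistent(inv: MiniInvoice, tol: float = 0.01) -> bool:
    """Check that line items sum to the subtotal and subtotal + tax equals the total."""
    items_sum = sum(item.total for item in inv.line_items)
    expected_total = inv.subtotal + (inv.tax_amount or 0.0)
    return abs(items_sum - inv.subtotal) <= tol and abs(expected_total - inv.total_amount) <= tol


# Made-up sample data shaped like a model response.
raw = {
    "invoice_number": "INV-2024-001",
    "date": "2024-03-15T00:00:00",
    "line_items": [
        {"description": "Widget", "quantity": 2, "unit_price": 500.0, "total": 1000.0},
        {"description": "Setup", "quantity": 1, "unit_price": 250.0, "total": 250.0},
    ],
    "subtotal": 1250.0,
    "tax_amount": 100.0,
    "total_amount": 1350.0,
}

invoice = MiniInvoice.model_validate(raw)  # nested dicts become LineItem objects
ok = totals_are_consistent(invoice)
print(ok)  # True

# A line item with missing required fields is rejected by validation.
try:
    MiniInvoice.model_validate({**raw, "line_items": [{"description": "Widget"}]})
    rejected = False
except ValidationError:
    rejected = True
print(rejected)  # True
```

A check like totals_are_consistent is a cheap way to flag extractions that are well-formed but numerically implausible and may need manual review.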

Use Cases

  • Document Processing: Extract data from invoices, receipts, contracts
  • Data Migration: Convert unstructured legacy data to structured formats
  • Content Analysis: Parse emails, reports, and documents for key information
  • Business Automation: Automate data entry from various document types
  • Form Processing: Extract form data from scanned documents and images

Contributing

See CONTRIBUTING.md for contribution guidelines.

License

MIT - see LICENSE.


Built by Mellow AI
