Open-source framework that extracts structured data from unstructured data.
OpenXtract
Turn any document into structured data with AI
Open-source toolkit for extracting clean, structured data from text, images, and PDFs using state-of-the-art large language models
Quick Start
Installation
# Using pip
pip install open-xtract
# Using uv (recommended)
uv add open-xtract
Basic Usage
from pydantic import BaseModel
from open_xtract import OpenXtract
# Define your data structure
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str
# Initialize extractor
ox = OpenXtract(model="openai:gpt-4o")
# Extract from any input type
result = ox.extract("Invoice #123 from ACME Corp on 2025-03-01 for $456.78", InvoiceData)
print(result)
# InvoiceData(invoice_number='123', date='2025-03-01', total_amount=456.78, vendor='ACME Corp')
Features
- Universal Input Support: Extract from text, images (PNG, JPG), and PDFs
- Model Agnostic: Works with OpenAI, Anthropic, Google, xAI, and any OpenAI-compatible API
- Type-Safe: Built on Pydantic, so every result is validated against the schema you define
- Fast & Efficient: Optimized extraction pipeline with minimal overhead
- Precise: Advanced prompt engineering for accurate structured data extraction
- Simple API: One method to extract from any input type
Detailed Usage
Input Types
The model can be specified in two formats:
- With colon: <provider>:<model_string> (e.g., "openai:gpt-4o")
- Without colon: <model_string> when the provider parameter is provided separately
Examples:
OpenXtract(model="openai:gpt-4o")
OpenXtract(model="gpt-4o", provider="openai")
OpenXtract(model="anthropic:claude-3-5-sonnet-20241022")
OpenXtract(model="xai:grok-beta")
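The two model formats above can be illustrated with a small parsing sketch. The helper name split_model_spec is hypothetical and is not part of the OpenXtract API; it only shows one plausible way to turn either format into a (provider, model_string) pair. It does not cover the case, described under Configuration Options, where a bare model name is combined with api_key and base_url.

```python
from typing import Optional, Tuple


def split_model_spec(model: str, provider: Optional[str] = None) -> Tuple[str, str]:
    """Return (provider, model_string) for either accepted format.

    Hypothetical helper for illustration only; not OpenXtract's own code.
    """
    if provider is not None:
        # Provider passed separately: `model` is already the bare model string.
        return provider, model

    # Split on the first colon only, so model strings that contain
    # slashes or further colons stay intact.
    head, sep, tail = model.partition(":")
    if not sep or not head or not tail:
        raise ValueError(
            f"expected '<provider>:<model_string>' or a separate provider, got {model!r}"
        )
    return head, tail


print(split_model_spec("openai:gpt-4o"))
print(split_model_spec("gpt-4o", provider="openai"))
print(split_model_spec("openrouter:qwen/qwen-2.5-72b-instruct"))
```

Splitting on the first colon only is a deliberate choice here: provider names in the examples never contain a colon, while model strings may contain other punctuation.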
from pydantic import BaseModel
from open_xtract import OpenXtract
class InvoiceData(BaseModel):
    invoice_number: str
    date: str
    total_amount: float
    vendor: str
ox = OpenXtract(model="openai:gpt-4o")
# Extract from text
result = ox.extract("Invoice #INV-2024-001 from TechCorp dated 2024-03-15 for $1,250.00", InvoiceData)
# Extract from image bytes
with open("receipt.png", "rb") as f:
    result = ox.extract(f.read(), InvoiceData)
# Extract from PDF bytes (automatically converts pages to images)
with open("invoice.pdf", "rb") as f:
    result = ox.extract(f.read(), InvoiceData)
print(result)
# InvoiceData(invoice_number='INV-2024-001', date='2024-03-15', total_amount=1250.0, vendor='TechCorp')
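Because a single extract method accepts text, image bytes, and PDF bytes, the input kind has to be told apart somehow. How OpenXtract does this internally is not documented here; the sketch below is an assumption that shows one common approach, checking the well-known file signatures for PNG, JPEG, and PDF. The function name sniff_input_kind is hypothetical.

```python
from typing import Union

# Standard file signatures (magic bytes) for the supported binary formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
JPEG_MAGIC = b"\xff\xd8\xff"
PDF_MAGIC = b"%PDF-"


def sniff_input_kind(data: Union[str, bytes]) -> str:
    """Classify an input as 'text', 'png', 'jpeg', or 'pdf'.

    Hypothetical helper for illustration; not part of the OpenXtract API.
    """
    if isinstance(data, str):
        return "text"
    if data.startswith(PNG_MAGIC):
        return "png"
    if data.startswith(JPEG_MAGIC):
        return "jpeg"
    if data.startswith(PDF_MAGIC):
        return "pdf"
    raise ValueError("unrecognized binary input: expected PNG, JPEG, or PDF bytes")


print(sniff_input_kind("Invoice #123 from ACME Corp"))
print(sniff_input_kind(PDF_MAGIC + b"1.7\n"))
```

A check like this can also be run before calling extract, to fail early on unsupported files instead of sending them to a model.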
Supported Models
# OpenAI
ox = OpenXtract(model="openai:gpt-4o")
ox = OpenXtract(model="openai:gpt-4o-mini")
# Anthropic
ox = OpenXtract(model="anthropic:claude-3-5-sonnet-20241022")
ox = OpenXtract(model="anthropic:claude-3-5-haiku-20241022")
# Google
ox = OpenXtract(model="google:gemini-2.0-flash-exp")
# XAI
ox = OpenXtract(model="xai:grok-beta")
# OpenRouter (proxy to many models)
ox = OpenXtract(model="openrouter:qwen/qwen-2.5-72b-instruct")
Configuration Options
You can configure OpenXtract using environment variables (default) or by passing parameters directly:
# Using environment variables (default)
# Set OPENAI_API_KEY=your-key in your environment or .env file
ox = OpenXtract(model="openai:gpt-4o")
# Pass API key directly
ox = OpenXtract(
    model="openai:gpt-4o",
    api_key="sk-your-api-key-here"
)
# Pass API key and custom base URL
# When api_key and base_url are provided, model can be used without colon
ox = OpenXtract(
    model="gpt-4o",
    api_key="sk-your-api-key-here",
    base_url="https://api.openai.com/v1"
)
# Use model without colon when provider is specified separately
ox = OpenXtract(
    model="gpt-4o",
    provider="openai",
    api_key="sk-your-api-key-here"
)
# Parameters take priority over environment variables
# This will use "direct-key" even if OPENAI_API_KEY is set
ox = OpenXtract(
    model="openai:gpt-4o",
    api_key="direct-key"
)
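The precedence rule above (explicit parameters win over environment variables) can be sketched as follows. This is an illustration under assumptions, not OpenXtract's own code: the helper resolve_api_key is hypothetical, and the <PROVIDER>_API_KEY naming pattern is inferred from OPENAI_API_KEY, the only variable named in this README. Other providers may use different variable names.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    provider: str,
    api_key: Optional[str] = None,
    env: Optional[Mapping[str, str]] = None,
) -> str:
    """Pick the API key: an explicit argument first, then the environment.

    Hypothetical helper; the env var naming scheme is an assumption.
    """
    env = os.environ if env is None else env
    if api_key:
        # A directly passed key always takes priority.
        return api_key

    env_var = f"{provider.upper()}_API_KEY"  # e.g. OPENAI_API_KEY
    try:
        return env[env_var]
    except KeyError:
        raise RuntimeError(f"no api_key given and {env_var} is not set") from None


fake_env = {"OPENAI_API_KEY": "env-key"}
print(resolve_api_key("openai", api_key="direct-key", env=fake_env))
print(resolve_api_key("openai", env=fake_env))
```

Passing the environment as a mapping keeps the sketch easy to test without touching the real process environment.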
Complex Data Structures
from datetime import datetime
from typing import List, Optional
from pydantic import BaseModel
from open_xtract import OpenXtract
class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float
class CompanyInfo(BaseModel):
    name: str
    address: Optional[str] = None
    phone: Optional[str] = None
    email: Optional[str] = None
class DetailedInvoice(BaseModel):
    invoice_number: str
    date: datetime
    due_date: Optional[datetime] = None
    vendor: CompanyInfo
    customer: CompanyInfo
    line_items: List[LineItem]
    subtotal: float
    tax_amount: Optional[float] = None
    total_amount: float
# Extract complex nested structures
# complex_invoice_text is assumed to be a str holding the full invoice text
ox = OpenXtract(model="openai:gpt-4o")
result = ox.extract(complex_invoice_text, DetailedInvoice)
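To see what a nested schema enforces independently of any model call, the sketch below validates a hand-written dictionary with Pydantic alone and then runs a simple consistency check on the totals. SimpleInvoice is a trimmed, hypothetical stand-in for DetailedInvoice, and the sample values are made up for illustration; nothing here calls OpenXtract.

```python
from typing import List, Optional

from pydantic import BaseModel, ValidationError


class LineItem(BaseModel):
    description: str
    quantity: int
    unit_price: float
    total: float


class SimpleInvoice(BaseModel):
    # Trimmed, hypothetical variant of DetailedInvoice for illustration.
    invoice_number: str
    line_items: List[LineItem]
    tax_amount: Optional[float] = None
    total_amount: float


raw = {
    "invoice_number": "INV-2024-001",
    "line_items": [
        {"description": "Consulting", "quantity": 2, "unit_price": 500.0, "total": 1000.0},
        {"description": "Setup fee", "quantity": 1, "unit_price": 250.0, "total": 250.0},
    ],
    "total_amount": 1250.0,
}

# Nested dicts are converted into LineItem instances during validation.
invoice = SimpleInvoice(**raw)

# Cross-field check that the schema alone does not enforce.
computed = sum(item.total for item in invoice.line_items) + (invoice.tax_amount or 0.0)
totals_match = abs(computed - invoice.total_amount) < 0.01

# A value of the wrong type is rejected with a ValidationError.
try:
    SimpleInvoice(**{**raw, "total_amount": "not a number"})
    error_count = 0
except ValidationError as exc:
    error_count = len(exc.errors())

print(totals_match, error_count)
```

Type validation catches malformed fields, but arithmetic consistency (line totals adding up to the invoice total) needs an explicit check like the one above.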
Use Cases
- Document Processing: Extract data from invoices, receipts, contracts
- Data Migration: Convert unstructured legacy data to structured formats
- Content Analysis: Parse emails, reports, and documents for key information
- Business Automation: Automate data entry from various document types
- Form Processing: Extract form data from scanned documents and images
Contributing
See CONTRIBUTING.md for contribution guidelines.
License
MIT - see LICENSE.
Built by Mellow AI
File details
Details for the file open_xtract-0.1.3.tar.gz.
File metadata
- Download URL: open_xtract-0.1.3.tar.gz
- Upload date:
- Size: 144.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | bfc202b947e3ca8c737711bb20da09a5c4e9fa4ded12b6f5b62ee84d3d8a3c3e |
| MD5 | dd6aebb84b7f5b7981e69628b9544697 |
| BLAKE2b-256 | 95a702c1d1da4b95da888b296a3504aa96c6d9cea333447093c4a4d991e22761 |
File details
Details for the file open_xtract-0.1.3-py3-none-any.whl.
File metadata
- Download URL: open_xtract-0.1.3-py3-none-any.whl
- Upload date:
- Size: 9.2 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cc6013cbc04a9b32206e3b87114156bf48c74e215d7d3b70b1b2b8719f5a02a7 |
| MD5 | da9e1ad996e417fdf6e3f3fabdd37a31 |
| BLAKE2b-256 | 8d140dd149f31aba175cfdf78c837bc4c5d0e9247057e250ae2a5e2582dfc20f |