Quantalogic PyZeroX
Maintained Fork: This is a maintained fork of the original ZeroX project by Omni AI, enhanced and actively maintained by Quantalogic. This version provides extended document processing and LLM-powered workflow capabilities, building on the original foundation with additional features, integrations, and improvements for both Python and Node.js environments.
Quantalogic PyZeroX is a cross-platform toolkit for document processing and LLM-powered workflows, supporting both Python and Node.js. It enables rapid prototyping and deployment of AI-driven document pipelines with support for multiple vision models and providers.
Maintained by Quantalogic - A platform dedicated to advancing AI-powered document processing and workflow automation.
Table of Contents
- Features
- Prerequisites
- Quick Start
- Installation
- Usage
- API Reference
- Supported Vision Models
- Supported File Types
- Examples
- Development
- Documentation
- Contributing
- License
Features
- Multi-platform Support: Works seamlessly with both Python and Node.js
- Multiple LLM Providers: OpenAI, Azure OpenAI, AWS Bedrock, Google Gemini, Anthropic
- Document Processing: PDF, Word, Excel, PowerPoint, and 20+ file formats
- OCR to Markdown: Convert documents to structured markdown format
- Data Extraction: Extract structured data using JSON schemas
- Concurrent Processing: Process multiple pages simultaneously for speed
- Format Preservation: Maintain document formatting across pages
- Cross-platform: Works on Windows, macOS, and Linux
Prerequisites
System Dependencies
For Python:
- Python 3.8 or higher
- Poppler (for PDF processing)
For Node.js:
- Node.js 16 or higher
- GraphicsMagick
- Ghostscript
Platform-specific Installation
macOS:
brew install poppler graphicsmagick ghostscript
Ubuntu/Debian:
sudo apt-get update
sudo apt-get install -y poppler-utils graphicsmagick ghostscript
Windows:
- Download and install Poppler from poppler-windows
- Download and install GraphicsMagick from official site
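If you are unsure whether the system dependencies ended up on your PATH, a quick check can save a confusing failure later. The sketch below uses only the Python standard library; the binary names `pdftoppm`, `gm`, and `gs` are the executables normally shipped by Poppler, GraphicsMagick, and Ghostscript respectively:

```python
import shutil

def check_dependencies():
    """Report which document-processing binaries are available on PATH."""
    binaries = {
        "pdftoppm": "Poppler (PDF rendering, used by the Python package)",
        "gm": "GraphicsMagick (used by the Node.js package)",
        "gs": "Ghostscript (used by the Node.js package)",
    }
    return {name: shutil.which(name) is not None for name in binaries}

if __name__ == "__main__":
    for name, found in check_dependencies().items():
        print(f"{name}: {'OK' if found else 'MISSING'}")
```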
Quick Start
Python Quick Start

import asyncio
from pyzerox import zerox
import os

# Set up your API key
os.environ["OPENAI_API_KEY"] = "your-api-key-here"

async def main():
    result = await zerox(
        file_path="path/to/your/document.pdf",
        model="gpt-4o"  # Latest vision-capable model
    )
    print(result)

# Run the example
asyncio.run(main())
Important: PyZeroX requires vision-capable models to process document images. Ensure you're using a model that supports image input.
Node.js Quick Start
import { zerox } from "zerox";

const result = await zerox({
  filePath: "path/to/your/document.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

console.log(result);
Installation
Python Installation
# Install system dependencies (see Prerequisites section)
pip install py-zerox
Node.js Installation
# Install system dependencies (see Prerequisites section)
npm install zerox
Development Installation
# Clone the repository
git clone https://github.com/quantalogic/quantalogic-pyzerox.git
cd quantalogic-pyzerox
# Python development setup
poetry install && poetry build
# Node.js development setup
cd node-zerox && npm install && npx tsc
# Run tests
make test # or individual commands below
poetry run pytest py_zerox/tests/
npm test
Usage
Node.js Usage
Basic Document Processing
Process from URL:
import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://example.com/document.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

Process from Local Path:

import { zerox } from "zerox";
import path from "path";

const result = await zerox({
  filePath: path.resolve(__dirname, "./document.pdf"),
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});
Advanced Configuration
import { zerox } from "zerox";
import { ModelOptions, ModelProvider, ErrorMode } from "zerox/types";

const result = await zerox({
  // Required
  filePath: "path/to/file.pdf",
  credentials: {
    apiKey: "your-api-key",
    // Additional provider-specific credentials as needed
  },

  // Processing Options
  cleanup: true,            // Clear images from tmp after run
  concurrency: 10,          // Number of pages to run at a time
  correctOrientation: true, // Attempts to identify and correct page orientation
  maintainFormat: false,    // Slower but helps maintain consistent formatting

  // Image Processing
  imageDensity: 300,        // DPI for image conversion
  imageHeight: 2048,        // Maximum height for converted images
  maxImageSize: 15,         // Maximum size of images to compress (MB)
  trimEdges: true,          // Trims pixels from edges

  // Error Handling
  errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE
  maxRetries: 1,               // Number of retries on failed pages

  // Data Extraction
  extractOnly: false,       // Extract structured data only
  extractPerPage: false,    // Extract data per page vs entire document
  schema: undefined,        // JSON schema for structured extraction

  // Model Configuration
  model: ModelOptions.OPENAI_GPT_4O,
  modelProvider: ModelProvider.OPENAI,
  llmParams: {},            // Additional LLM parameters

  // Output Options
  outputDir: undefined,     // Save result.md to file
  tempDir: "/tmp",          // Temporary files directory

  // Page Selection
  pagesToConvertAsImages: -1, // -1 for all pages, or array [1,2,3]

  // Custom Prompts
  prompt: "",               // Custom processing instructions
  extractionPrompt: "",     // Custom extraction instructions
});
Multi-Provider Examples
import { zerox } from "zerox";
import { ModelOptions, ModelProvider } from "zerox/types";

// OpenAI
const openaiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.OPENAI,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// Azure OpenAI
const azureResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.AZURE,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.AZURE_API_KEY,
    endpoint: process.env.AZURE_ENDPOINT,
  },
});

// AWS Bedrock
const bedrockResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.BEDROCK,
  model: ModelOptions.BEDROCK_CLAUDE_3_7_SONNET_2025_02,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
    region: process.env.AWS_REGION,
  },
});

// Google Gemini
const geminiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.GOOGLE,
  model: ModelOptions.GOOGLE_GEMINI_2_5_FLASH,
  credentials: {
    apiKey: process.env.GEMINI_API_KEY,
  },
});
Python Usage
Basic Document Processing
import asyncio
from pyzerox import zerox
import os

# Set up environment
os.environ["OPENAI_API_KEY"] = "your-api-key"

async def main():
    result = await zerox(
        file_path="path/to/document.pdf",
        model="gpt-4o-mini"
    )
    print(result)

asyncio.run(main())
Advanced Configuration
import asyncio
from pyzerox import zerox
import os

async def main():
    result = await zerox(
        file_path="https://example.com/document.pdf",
        model="gpt-4o",
        # Processing Options
        cleanup=True,
        concurrency=10,
        maintain_format=False,
        # Page Selection
        select_pages=None,  # None for all, or [1,2,3] for specific pages
        # Output Options
        output_dir="./output",
        temp_dir=None,  # Uses system temp if None
        # Custom Prompts
        custom_system_prompt=None,
        # Additional model parameters
        **{"temperature": 0.1}
    )
    return result

result = asyncio.run(main())
Multi-Provider Examples
import asyncio
from pyzerox import zerox
import os
import json

# OpenAI
async def openai_example():
    os.environ["OPENAI_API_KEY"] = "your-api-key"
    result = await zerox(
        file_path="document.pdf",
        model="gpt-4o-mini"
    )
    return result

# Azure OpenAI
async def azure_example():
    os.environ["AZURE_API_KEY"] = "your-azure-api-key"
    os.environ["AZURE_API_BASE"] = "https://example-endpoint.openai.azure.com"
    os.environ["AZURE_API_VERSION"] = "2023-05-15"
    result = await zerox(
        file_path="document.pdf",
        model="azure/gpt-4o-mini"
    )
    return result

# Google Gemini (Latest)
async def gemini_example():
    os.environ["GEMINI_API_KEY"] = "your-gemini-api-key"
    result = await zerox(
        file_path="document.pdf",
        model="gemini/gemini-2.5-flash"  # Latest Gemini vision model
    )
    return result

# Anthropic Claude (Latest)
async def anthropic_example():
    os.environ["ANTHROPIC_API_KEY"] = "your-anthropic-api-key"
    result = await zerox(
        file_path="document.pdf",
        model="claude-sonnet-4-20250514"  # Latest Claude with exceptional reasoning
    )
    return result
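Since provider choice in the Python API is encoded in the model string (the LiteLLM convention of a `provider/model` prefix), a small lookup can keep that routing in one place. A minimal sketch, with model strings transcribed from the provider examples above:

```python
# Model strings taken from the provider examples above (LiteLLM convention).
MODELS = {
    "openai": "gpt-4o-mini",
    "azure": "azure/gpt-4o-mini",
    "gemini": "gemini/gemini-2.5-flash",
    "anthropic": "claude-sonnet-4-20250514",
}

def model_for(provider: str) -> str:
    """Resolve a provider name to its model string, case-insensitively."""
    try:
        return MODELS[provider.lower()]
    except KeyError:
        raise ValueError(f"unknown provider: {provider!r}") from None

print(model_for("Gemini"))
# → gemini/gemini-2.5-flash
```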
API Reference
Node.js API
zerox(options)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| filePath | string | Required | Path to document (local or URL) |
| credentials | object | Required | API credentials for chosen provider |
| model | ModelOptions | OPENAI_GPT_4O | Model to use for processing |
| modelProvider | ModelProvider | OPENAI | Provider (OPENAI, AZURE, BEDROCK, GOOGLE) |
| cleanup | boolean | true | Clean up temporary files after processing |
| concurrency | number | 10 | Number of pages to process simultaneously |
| maintainFormat | boolean | false | Maintain formatting across pages (slower) |
| extractOnly | boolean | false | Extract structured data only |
| schema | object | undefined | JSON schema for data extraction |
| outputDir | string | undefined | Directory to save output files |
Returns: Promise<ZeroxOutput>
Python API
zerox(file_path, model, **kwargs)
Parameters:
| Parameter | Type | Default | Description |
|---|---|---|---|
| file_path | str | Required | Path to document (local or URL) |
| model | str | "gpt-4o" | Model identifier |
| cleanup | bool | True | Clean up temporary files |
| concurrency | int | 10 | Number of concurrent processes |
| maintain_format | bool | False | Maintain formatting across pages |
| select_pages | Union[int, List[int]] | None | Pages to process (None for all) |
| output_dir | str | None | Directory to save output |
| custom_system_prompt | str | None | Custom system prompt |
Returns: ZeroxOutput
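The exact fields of `ZeroxOutput` are defined by the library itself; the sketch below mocks an assumed shape (a `pages` list plus file and timing metadata, as in the upstream zerox project) purely to illustrate how a result might be consumed. Check the package source for the authoritative definition:

```python
from dataclasses import dataclass, field
from typing import List

# Assumed shape, mirroring the upstream zerox output; NOT imported from pyzerox.
@dataclass
class Page:
    content: str          # per-page markdown
    page: int             # 1-based page number
    content_length: int = 0

@dataclass
class ZeroxOutput:
    file_name: str
    pages: List[Page] = field(default_factory=list)
    completion_time: float = 0.0

def to_markdown(result: ZeroxOutput) -> str:
    """Join per-page markdown into one document."""
    return "\n\n".join(p.content for p in result.pages)

# Hypothetical usage with a mocked result:
result = ZeroxOutput(
    file_name="document.pdf",
    pages=[Page(content="# Page 1", page=1), Page(content="# Page 2", page=2)],
)
print(to_markdown(result))
```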
Supported Vision Models
PyZeroX requires vision-capable models for document processing. All models are supported via LiteLLM, ensuring compatibility with the latest model releases and API updates.
OpenAI Vision Models
Latest Vision Models (2024/2025):
- GPT-4.1 (gpt-4.1) - Next generation multimodal model
- GPT-4.1 Mini (gpt-4.1-mini) - Efficient next-gen model
- o3-mini (o3-mini) - Reasoning model with vision
- o1-mini (o1-mini) - Advanced reasoning capabilities
- GPT-4o (gpt-4o) - Stable multimodal vision model
- GPT-4o Mini (gpt-4o-mini) - Faster, cost-effective vision option
- GPT-4 Turbo (gpt-4-turbo) - Previous generation with vision
Azure OpenAI Vision Models
- GPT-4o (azure/gpt-4o) - Latest multimodal model
- GPT-4o Mini (azure/gpt-4o-mini) - Cost-effective vision option
- GPT-4 Turbo (azure/gpt-4-turbo) - Previous generation with vision
- Format: azure/<deployment-name>
Google Gemini Vision Models (AI Studio)
- Gemini 2.5 Pro (gemini/gemini-2.5-pro) - Most powerful thinking model with vision
- Gemini 2.5 Flash (gemini/gemini-2.5-flash) - High-performance with adaptive thinking
- Gemini 2.0 Flash (gemini/gemini-2.0-flash) - Fast and versatile multimodal model
- Gemini 1.5 Pro (gemini/gemini-1.5-pro) - Large context window with vision
- Gemini 1.5 Flash (gemini/gemini-1.5-flash) - Fast inference with vision
Google Vertex AI Vision Models
- Gemini 2.5 Pro (vertex_ai/gemini-2.5-pro)
- Gemini 2.5 Flash (vertex_ai/gemini-2.5-flash)
- Gemini 2.0 Flash (vertex_ai/gemini-2.0-flash)
- Gemini 1.5 Pro (vertex_ai/gemini-1.5-pro)
- Gemini 1.5 Flash (vertex_ai/gemini-1.5-flash)
AWS Bedrock Vision Models
- Claude 3.7 Sonnet (bedrock/us.anthropic.claude-3-7-sonnet-20250219-v1:0)
- Claude 3.5 Sonnet (bedrock/anthropic.claude-3-5-sonnet-20241022-v2:0)
- Claude 3.5 Haiku (bedrock/anthropic.claude-3-5-haiku-20241022-v1:0)
- Claude 3 Opus (bedrock/anthropic.claude-3-opus-20240229-v1:0)
- Claude 3 Sonnet (bedrock/anthropic.claude-3-sonnet-20240229-v1:0)
- Claude 3 Haiku (bedrock/anthropic.claude-3-haiku-20240307-v1:0)
Anthropic Vision Models (Direct API)
- Claude Opus 4 (claude-opus-4-20250514) - Most capable and intelligent model
- Claude Sonnet 4 (claude-sonnet-4-20250514) - High-performance with exceptional reasoning
- Claude 3.7 Sonnet (claude-3-7-sonnet-20250219) - Latest with extended thinking
- Claude 3.5 Sonnet (claude-3-5-sonnet-20241022) - Enhanced vision capabilities
- Claude 3.5 Haiku (claude-3-5-haiku-20241022) - Fast vision processing
- Claude 3 Opus (claude-3-opus-20240229) - Most capable vision model
- Claude 3 Sonnet (claude-3-sonnet-20240229) - Balanced performance
- Claude 3 Haiku (claude-3-haiku-20240307) - Fast vision processing
Latest Models: This documentation is updated with the latest available vision models as of 2025. All model names and capabilities are sourced from official provider documentation and the LiteLLM compatibility matrix to ensure accuracy and up-to-date information.
Supported File Types
Quantalogic PyZeroX supports a wide range of document formats:
Document Formats:
- PDF, DOC, DOCX, RTF, TXT
- ODT, OTT (OpenDocument)
- HTML, HTM, XML
- WPS, WPD (WordPerfect)
Spreadsheet Formats:
- XLS, XLSX (Excel)
- ODS, OTS (OpenDocument)
- CSV, TSV
Presentation Formats:
- PPT, PPTX (PowerPoint)
- ODP, OTP (OpenDocument)
Image Formats:
- PNG, JPG, JPEG, TIFF, BMP
- SVG, WEBP
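When pointing a pipeline at a mixed directory, it can help to pre-filter on these extensions before submitting anything to a model. A minimal sketch; the extension set below is transcribed from the format lists above:

```python
from pathlib import Path

# Extensions transcribed from the supported-format lists above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".doc", ".docx", ".rtf", ".txt", ".odt", ".ott",
    ".html", ".htm", ".xml", ".wps", ".wpd",
    ".xls", ".xlsx", ".ods", ".ots", ".csv", ".tsv",
    ".ppt", ".pptx", ".odp", ".otp",
    ".png", ".jpg", ".jpeg", ".tiff", ".bmp", ".svg", ".webp",
}

def supported_files(paths):
    """Keep only paths whose extension (case-insensitive) is supported."""
    return [p for p in paths if Path(p).suffix.lower() in SUPPORTED_EXTENSIONS]

print(supported_files(["report.PDF", "notes.docx", "archive.zip"]))
# → ['report.PDF', 'notes.docx']
```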
Examples
Data Extraction Example
import { zerox } from "zerox";

const result = await zerox({
  filePath: "invoice.pdf",
  extractOnly: true,
  schema: {
    type: "object",
    properties: {
      invoice_number: { type: "string" },
      date: { type: "string" },
      total: { type: "number" },
      items: {
        type: "array",
        items: {
          type: "object",
          properties: {
            name: { type: "string" },
            price: { type: "number" },
            quantity: { type: "number" },
          },
        },
      },
    },
  },
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});
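Whatever the provider returns, it is worth validating the extracted object against the schema before trusting it downstream. Below is a minimal stdlib sketch for the invoice schema in the extraction example above; a full solution would use a real JSON Schema validator such as the `jsonschema` package:

```python
def validate_invoice(data: dict) -> list:
    """Return a list of problems; an empty list means the dict matches
    the invoice schema used in the extraction example above."""
    problems = []
    expected = {"invoice_number": str, "date": str, "total": (int, float)}
    for key, typ in expected.items():
        if key not in data:
            problems.append(f"missing field: {key}")
        elif not isinstance(data[key], typ):
            problems.append(f"wrong type for {key}: {type(data[key]).__name__}")
    for i, item in enumerate(data.get("items", [])):
        if not isinstance(item, dict):
            problems.append(f"items[{i}] is not an object")
    return problems

# Hypothetical extracted payload:
print(validate_invoice(
    {"invoice_number": "INV-1", "date": "2025-01-01", "total": 99.5, "items": []}
))
# → []
```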
Batch Processing Example
import asyncio
from pyzerox import zerox
import os

async def process_documents(file_paths):
    results = []
    for file_path in file_paths:
        result = await zerox(
            file_path=file_path,
            model="gpt-4o-mini",
            output_dir="./processed"
        )
        results.append(result)
    return results

# Process multiple documents
files = ["doc1.pdf", "doc2.pdf", "doc3.pdf"]
results = asyncio.run(process_documents(files))
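The loop above processes files one at a time. If the provider's rate limits allow, a bounded-concurrency variant using `asyncio.gather` with a semaphore can be much faster. A sketch; `fake_zerox` is a stand-in coroutine for where the real `zerox(...)` call would go:

```python
import asyncio

async def process_all(file_paths, process_one, max_concurrent=3):
    """Run process_one over all paths, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def bounded(path):
        async with semaphore:
            return await process_one(path)

    # gather preserves input order in its results
    return await asyncio.gather(*(bounded(p) for p in file_paths))

# Stand-in for the real zerox(...) call:
async def fake_zerox(path):
    await asyncio.sleep(0.01)
    return f"processed {path}"

results = asyncio.run(process_all(["doc1.pdf", "doc2.pdf", "doc3.pdf"], fake_zerox))
print(results)
# → ['processed doc1.pdf', 'processed doc2.pdf', 'processed doc3.pdf']
```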
Format Preservation Example
// For documents with complex tables spanning multiple pages
const result = await zerox({
filePath: "financial-report.pdf",
maintainFormat: true, // Slower but better for tables
concurrency: 1, // Required for maintainFormat
credentials: {
apiKey: process.env.OPENAI_API_KEY,
},
});
Development
Project Structure
quantalogic-pyzerox/
├── py_zerox/              # Python package
│   ├── pyzerox/           # Main Python module
│   │   ├── core/          # Core processing logic
│   │   ├── models/        # Data models
│   │   └── processor/     # Document processors
│   └── tests/             # Python tests
├── node-zerox/            # Node.js package
│   ├── src/               # TypeScript source
│   │   ├── models/        # Model definitions
│   │   └── utils/         # Utility functions
│   └── tests/             # Node.js tests
├── docs/                  # Documentation
├── examples/              # Example code
└── shared/                # Shared resources
Building from Source
# Clone the repository
git clone https://github.com/quantalogic/quantalogic-pyzerox.git
cd quantalogic-pyzerox
# Install dependencies
make install
# Build packages
make build
# Run tests
make test
# Run linting
make lint
Testing
# Python tests
poetry run pytest py_zerox/tests/
# Node.js tests
cd node-zerox && npm test
# Integration tests
make test-integration
Documentation
For detailed documentation, see the docs/ directory:
- Project Overview - Purpose, stack, platform support
- Architecture - System structure, data flow, key files
- Build System - Build configs, workflows, troubleshooting
- Testing - Test types, commands, organization
- Development - Code style, patterns, workflows
- Deployment - Packaging, scripts, output locations
- Files Catalog - File groups, entry points, dependencies
Contributing
We welcome contributions! Please see our Contributing Guide for details.
1. Fork the repository
2. Create a feature branch
3. Make your changes
4. Add tests for new functionality
5. Ensure all tests pass
6. Submit a pull request
License
This project is licensed under the MIT License. See the LICENSE file for details.
Credits
This project is a maintained fork of the original ZeroX by Omni AI. We're grateful for their foundational work and continue to build upon their vision.
- Original ZeroX Project: getomni-ai/zerox - The original OCR and document processing toolkit
- LiteLLM - Powers our Python SDK with multi-provider support
- Original PyZeroX project contributors
- The open-source community for inspiration and feedback
Made with love by the Quantalogic team - Advancing AI-powered document processing and workflow automation.
Originally based on ZeroX by Omni AI