OpenAI-Compatible & Dynamic-Batch AI post-processor for docpipe-mini
docpipe-ai
Protocol-oriented & Mixin-based AI content processor for docpipe-mini
docpipe-ai is a flexible and extensible AI content analysis library that uses modern Python design patterns to provide intelligent image content understanding and structured output capabilities.
Features
Protocol-oriented Architecture
- Type Safety: Interface definitions based on typing.Protocol
- Zero-cost Composition: Reusable implementations via Mixin classes
- External Flow Control: Users control document parsing, AI focuses on content processing
Multi-Provider AI Support
- OpenAI: GPT-4o, GPT-4 Turbo and other models
- Anthropic: Claude series models
- GLM: Zhipu AI model support
- Extensible: Easy to add new AI providers
Structured Output
- JSON Schema: Define return data structure
- Type Validation: Pydantic model validation
- Multiple Analysis Types: Contract, table, general analysis
- Smart Content Extraction: Automatically identify key information
Performance Optimization
- Adaptive Batch Processing: Dynamically adjust batch size based on content
- Memory Caching: Avoid duplicate processing of same content
- Concurrent Processing: Support multi-threaded parallel processing
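The batching and caching ideas above can be sketched in a few lines of plain Python. `calculate_batch_size` and `cached_process` below are illustrative helpers only, not part of the docpipe-ai API:

```python
import hashlib

def calculate_batch_size(remaining_items: int, max_concurrency: int = 5) -> int:
    # Adaptive sizing: never exceed twice the concurrency limit,
    # and never request more items than remain.
    return min(remaining_items, max_concurrency * 2)

_cache: dict[str, str] = {}

def cached_process(image_bytes: bytes, process) -> str:
    # Key the cache on a content hash so identical images are
    # only sent to the model once.
    key = hashlib.sha256(image_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = process(image_bytes)
    return _cache[key]
```

Content-hash keys mean the cache survives re-runs over the same document regardless of chunk ordering.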
Quick Start
Installation
pip install docpipe-ai
Basic Usage
from docpipe_ai import create_openai_processor
from docpipe import PyMuPDFSerializer

# 1. Create processor
processor = create_openai_processor(
    api_key="your-api-key",
    model="gpt-4o"
)

# 2. Extract images from PDF
serializer = PyMuPDFSerializer()
images = []
for chunk in serializer.iterate_chunks("document.pdf"):
    if chunk.type == "image":
        images.append(chunk)

# 3. Process images
results = processor.process_images(images)

# 4. View results
for result in results:
    print(f"Page {result.original.page}: {result.processed_text}")
Structured Output
from docpipe_ai import ProcessingConfig, ContentAnalysisType

# Create contract analysis configuration
config = ProcessingConfig.create_contract_analysis_config()
processor = create_openai_processor(
    api_key="your-api-key",
    model="gpt-4o",
    config=config
)

# Process images to get structured data
results = processor.process_images(images)
for result in results:
    if result.structured_data:
        data = result.structured_data
        print(f"Document type: {data['content_type']}")
        print(f"Summary: {data['summary_text']}")

        # Key information
        key_elements = data['content_details']['key_elements']
        for element in key_elements:
            print(f"  - {element}")
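If you prefer typed access over raw dictionaries, the structured payload above maps naturally onto a small model class. The sketch below is dependency-free (the library itself advertises Pydantic validation); the field names follow the example output above, and everything else is an assumption:

```python
from dataclasses import dataclass, field

@dataclass
class ContractAnalysis:
    content_type: str
    summary_text: str
    key_elements: list = field(default_factory=list)

    @classmethod
    def from_dict(cls, data: dict) -> "ContractAnalysis":
        # Pull the nested key_elements list out of content_details.
        return cls(
            content_type=data["content_type"],
            summary_text=data["summary_text"],
            key_elements=data.get("content_details", {}).get("key_elements", []),
        )
```

A Pydantic BaseModel with the same fields would additionally validate types at parse time.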
Documentation
Protocol-oriented API
from docpipe_ai import (
    AdaptiveImageProcessor,
    ProcessingConfig,
    ResponseFormatType,
    ContentAnalysisType,
)

# Advanced usage: custom configuration
config = ProcessingConfig(
    model_name="gpt-4o",
    temperature=0.3,
    max_tokens=1000,
    response_format=ResponseFormatType.STRUCTURED,
    content_analysis_type=ContentAnalysisType.CONTRACT
)
processor = AdaptiveImageProcessor(config)
results = processor.process_batch(image_contents)
Supported Analysis Types
| Type | Description | Use Cases |
|---|---|---|
| CONTRACT | Contract document analysis | Legal documents, agreements |
| TABLE | Table data extraction | Financial reports, data tables |
| DOCUMENT | Document structure analysis | Reports, papers, books |
| GENERAL | General content analysis | Any type of image content |
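To target one of these analysis types, pass it through the configuration shown earlier. A minimal fragment for table extraction, assuming the enum member names mirror the table above:

```python
from docpipe_ai import ProcessingConfig, ContentAnalysisType

# Table extraction configuration (illustrative)
config = ProcessingConfig(content_analysis_type=ContentAnalysisType.TABLE)
```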
AI Provider Configuration
OpenAI
processor = create_openai_processor(
    api_key="your-openai-key",
    model="gpt-4o",
    api_base="https://api.openai.com/v1"
)
Anthropic
processor = create_anthropic_processor(
    api_key="your-anthropic-key",
    model="claude-3-sonnet-20240229"
)
GLM (Zhipu AI)
from docpipe_ai.processors.adaptive_image_processor import AdaptiveImageProcessor

processor = AdaptiveImageProcessor.create_openai_processor(
    api_key="your-glm-key",
    api_base="https://open.bigmodel.cn/api/paas/v4/",
    model="glm-4.5v"
)
Configuration
ProcessingConfig Parameters
config = ProcessingConfig(
    # Basic configuration
    model_name="gpt-4o",       # AI model name
    temperature=0.3,           # Generation temperature
    max_tokens=1000,           # Maximum tokens

    # Response format
    response_format=ResponseFormatType.STRUCTURED,      # Structured output
    content_analysis_type=ContentAnalysisType.GENERAL,  # Analysis type

    # Batch processing configuration
    max_concurrency=5,         # Maximum concurrency
    batch_size=10,             # Batch size

    # Custom Schema
    custom_schema={            # Custom JSON Schema
        "type": "object",
        "properties": {
            "title": {"type": "string"},
            "content": {"type": "string"}
        }
    }
)
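The custom_schema above is plain JSON Schema. Before wiring a schema into the config, it can be useful to sanity-check sample output against it. The validator below is a stdlib-only sketch covering just the object/string subset used here (the full spec is what the jsonschema package implements):

```python
def validate_against_schema(value, schema: dict) -> bool:
    """Minimal check for the 'object' / 'string' subset of JSON Schema."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Properties are optional unless listed in "required";
        # present ones must match their sub-schema.
        return all(
            key not in value or validate_against_schema(value[key], sub)
            for key, sub in props.items()
        )
    if kind == "string":
        return isinstance(value, str)
    return True  # unknown types pass; a real validator handles far more
```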
Project Structure
docpipe-ai/
├── src/docpipe_ai/
│ ├── api/ # Simple user interface
│ ├── core/ # Core protocol definitions
│ ├── mixins/ # Reusable components
│ ├── providers/ # AI Provider abstractions
│ ├── processors/ # Concrete processor implementations
│ ├── data/ # Data structures and configuration
│ └── pipelines/ # Legacy support
├── docs/ # Documentation
├── tests/ # Test files (git ignored)
└── .github/workflows/ # CI/CD configuration
Architecture
Protocol-oriented + Mixin Pattern
from abc import abstractmethod
from typing import Generic, Protocol, TypeVar, runtime_checkable

T = TypeVar("T")

# Define capability interfaces
@runtime_checkable
class Batchable(Protocol):
    @abstractmethod
    def should_process_batch(self, batch_size: int, total_items: int) -> bool: ...

# Provide reusable implementations
class DynamicBatchingMixin(Generic[T]):
    def calculate_optimal_batch_size(self: "Batchable", remaining_items: int) -> int: ...

# Compose usage (ImageContent is the library's image data structure)
class AdaptiveImageProcessor(Batchable, DynamicBatchingMixin[ImageContent]):
    def should_process_batch(self, batch_size: int, total_items: int) -> bool:
        return batch_size <= self.config.max_concurrency * 2
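Because Batchable is decorated with @runtime_checkable, conformance is structural: any class with a matching should_process_batch method passes an isinstance check without inheriting from the protocol. A self-contained illustration (class names here are generic, not part of docpipe-ai):

```python
from typing import Protocol, runtime_checkable

@runtime_checkable
class Batchable(Protocol):
    def should_process_batch(self, batch_size: int, total_items: int) -> bool: ...

class StubProcessor:
    # No inheritance from Batchable -- having the method is enough.
    def should_process_batch(self, batch_size: int, total_items: int) -> bool:
        return batch_size <= 10

class NotBatchable:
    pass

# Structural checks: isinstance inspects method presence, not ancestry.
assert isinstance(StubProcessor(), Batchable)
assert not isinstance(NotBatchable(), Batchable)
```

This is also what makes protocols easy to mock in tests: any stub with the right method signature satisfies the interface.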
Advantages of this design pattern:
- Zero-cost Abstraction: Static type checking (e.g. via mypy) with no runtime overhead
- Flexible Composition: Combine different capabilities as needed
- Easy Testing: Protocols can be easily mocked
- Backward Compatible: Doesn't break existing code
Deployment and Publishing
Automated Publishing
The project is configured with GitHub Actions automated publishing workflow:
# Publish new version
echo '__version__ = "0.2.1"' > src/docpipe_ai/__init__.py
git add src/docpipe_ai/__init__.py
git commit -m "bump: version 0.2.1"
git tag v0.2.1
git push origin v0.2.1
Trigger conditions:
- Pushing a tag matching the v* format
- Manual workflow trigger
For detailed documentation, see: DEPLOYMENT.md
Development
Local Development
# Clone repository
git clone https://github.com/juncaifeng/docpipe-ai.git
cd docpipe-ai
# Install development dependencies
pip install -e ".[dev]"
# Run type checking
mypy src/docpipe_ai
# Run code checking
ruff check src/docpipe_ai
Testing
# Test files are excluded from git to protect privacy
# Create your own test files if needed for testing
python -m pytest tests/
License
MIT License - see LICENSE file for details
Contributing
Issues and Pull Requests are welcome!
Support
- 📧 Email: [your-email@example.com]
- 🐛 Bug Reports: GitHub Issues
- 📖 Documentation: GitHub Wiki
Acknowledgments
Thanks to docpipe-mini for providing the document parsing foundation support.
docpipe-ai - Making AI content analysis simple and powerful! 🚀
Project details
Download files
Source Distribution
Built Distribution
File details
Details for the file docpipe_ai-0.2.1.tar.gz.
File metadata
- Download URL: docpipe_ai-0.2.1.tar.gz
- Upload date:
- Size: 161.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 38d12ee0c25cfc9819170bf0731b099a0e43d38e7eaced7e2ae1376d10076d9d |
| MD5 | 15d12ff18531b00621673666e2bd44a4 |
| BLAKE2b-256 | 35d747117a0e45c47a9d76d8635530d8bf78c251feab58a5227f8da48b6f1f43 |
File details
Details for the file docpipe_ai-0.2.1-py3-none-any.whl.
File metadata
- Download URL: docpipe_ai-0.2.1-py3-none-any.whl
- Upload date:
- Size: 88.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | f7265b1700c0dffda694d7941d98c36aa66460bea0b2c03afeb24ec7cdde48c9 |
| MD5 | e2f9ccf77238d9f284d0717d96243350 |
| BLAKE2b-256 | 54b68ec88a9912fa3f7fa4014d04e04c5360602cecfd44957a147f5cae0b3217 |