Omni Pre-Processor: Document content extraction package
OPP - Omni Pre-Processor
Document content extraction for DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and Image (OCR).
Features
- Multi-format extraction - DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, Image, IPYNB, YouTube URL
- Image OCR - Tesseract and RapidOCR with graceful fallback
- Email extraction - EML (RFC 822) and MSG (Outlook) with attachment recursion
- Audio/Video transcription - Whisper-based ASR
- Format auto-detection - Magic bytes detection (extension not required)
- Resource management - MD5 deduplication, UUID naming for images
- Pipeline orchestrator - detect → extract → manage → report
- CLI interface - Full command-line with batch support
- Output formats - Markdown and XLIFF 1.2/2.0
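To make the "magic bytes" feature concrete, here is a minimal sketch of content-based detection using only the standard library. It is not the code in `opp.detector`: the function name `sniff_bytes`, the signature choices, and the returned labels are assumptions made for illustration. It shows why a ZIP signature alone is not enough, since DOCX, PPTX, and XLSX are all ZIP containers and have to be told apart by their internal part names.

```python
import io
import zipfile

# Hypothetical mapping from a well-known part inside an OOXML package
# to a format label. Not taken from OPP's detector.
_ZIP_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def _sniff_zip(data: bytes) -> str:
    """Look inside a ZIP container to tell OOXML formats apart."""
    try:
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            names = set(archive.namelist())
    except zipfile.BadZipFile:
        return "zip"  # ZIP signature present, but the archive is unreadable
    for marker, label in _ZIP_MARKERS.items():
        if marker in names:
            return label
    return "zip"


def sniff_bytes(data: bytes) -> str:
    """Guess a format from leading bytes, ignoring any file extension."""
    if data.startswith(b"%PDF-"):
        return "pdf"
    if data.startswith(b"\x89PNG\r\n\x1a\n"):
        return "png"
    if data.startswith(b"PK\x03\x04"):
        return _sniff_zip(data)
    return "unknown"
```

A real detector would also need signatures for the remaining formats and some notion of confidence; both are left out here.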
Installation
# Core package
pip install -e .
# With office/data formats (XLSX, CSV, JSON, XML)
pip install -e ".[office]"
# With email and OCR (EML, MSG, Tesseract, RapidOCR)
pip install -e ".[email]"
Quick Start
Python API
from opp import DOCXExtractor, PDFExtractor, PPTXExtractor
from opp.detector import detect_format
from opp.pipeline import OPPPipeline
# Direct extraction
extractor = DOCXExtractor()
result = extractor.extract("document.docx")
print(result.content)
# Auto-detection
fmt, confidence = detect_format("document.docx")
print(f"Format: {fmt.value}, Confidence: {confidence}")
# Full pipeline
pipeline = OPPPipeline(resource_storage_dir="./resources")
result = pipeline.process_file("document.docx")
print(f"Extracted: {len(result.content)} chars, {result.images_stored} images")
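When processing several files from Python, it can help to collect failures instead of stopping at the first one. The helper below is a sketch under assumptions: `process_many` is a hypothetical name, and it assumes that `process_file` signals failure by raising an exception, which the README does not state. It accepts any object with a `process_file(path)` method, so an `OPPPipeline` instance can be passed in.

```python
from typing import Any, Dict, Iterable, Tuple


def process_many(
    pipeline: Any, paths: Iterable[str]
) -> Tuple[Dict[str, Any], Dict[str, str]]:
    """Run pipeline.process_file on each path, separating successes from failures."""
    results: Dict[str, Any] = {}
    failures: Dict[str, str] = {}
    for path in paths:
        try:
            results[path] = pipeline.process_file(path)
        except Exception as exc:  # assumption: errors surface as exceptions
            failures[path] = f"{type(exc).__name__}: {exc}"
    return results, failures
```

Usage would look like `process_many(OPPPipeline(resource_storage_dir="./resources"), ["a.docx", "b.pdf"])`.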
CLI
# Extract to Markdown
opp --target-format=md document.docx
# Extract to XLIFF for translation
opp --target-format=xlf --source-lang=en --target-lang=zh document.docx
# Generate both MD and XLIFF
opp --target-format=both --source-lang=en --target-lang=zh document.docx
# Custom output directory
opp --target-format=md --output-dir ./output document.docx
# Image OCR
opp --ocr-engine tesseract scan.png
# Batch processing
opp --batch file1.docx file2.pdf file3.pptx
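For readers unfamiliar with the XLIFF target used in the commands above, the following sketch builds a bare XLIFF 1.2 skeleton with the standard library. It is not OPP's generator: `build_xliff12`, the `plaintext` datatype, and the one-unit-per-segment layout are assumptions for illustration. The element and attribute names follow the XLIFF 1.2 schema.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff12(segments, source_lang, target_lang, original="document.docx"):
    """Return an XLIFF 1.2 document with one trans-unit per text segment."""
    ET.register_namespace("", XLIFF_NS)  # serialize with a default namespace
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(
        root,
        f"{{{XLIFF_NS}}}file",
        {
            "original": original,
            "source-language": source_lang,
            "target-language": target_lang,
            "datatype": "plaintext",
        },
    )
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for index, text in enumerate(segments, start=1):
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(index)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
        ET.SubElement(unit, f"{{{XLIFF_NS}}}target")  # left empty for the translator
    return ET.tostring(root, encoding="unicode")
```

Calling `build_xliff12(["Hello"], "en", "zh")` mirrors the `--source-lang=en --target-lang=zh` flags shown above.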
Windows Batch Scripts
| Script | Description |
|---|---|
| md.bat | Convert to Markdown |
| en2cn_xliff.bat | English source → Chinese XLIFF |
| cn2en_xliff.bat | Chinese source → English XLIFF |
md.bat "document.docx"
md.bat "folder"
en2cn_xliff.bat "english.docx"
cn2en_xliff.bat "中文.docx"
The scripts support drag-and-drop of files and folders. Logs are saved to logs/.
Project Structure
src/opp/
├── detector.py # Format auto-detection
├── extractors/ # Document extractors
│ ├── docx.py
│ ├── pptx.py
│ ├── pdf.py
│ ├── xlsx.py
│ ├── csv.py
│ ├── json.py
│ ├── xml.py
│ ├── email.py
│ └── image_ocr.py
├── channels/ # Output formatters
│ ├── table_channel.py # DataFrame → Markdown table
│ └── keyvalue_channel.py # dict → XLIFF
├── xliff/ # XLIFF 1.2/2.0 generator
├── pipeline.py # OPPPipeline orchestrator
├── resource_manager.py # Image deduplication
└── cli.py # Command-line interface
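The tree lists `table_channel.py` as turning tabular data into a Markdown table. The sketch below shows the general idea with plain lists rather than a DataFrame, so it runs without pandas. The function name `rows_to_markdown` and the escaping rules are assumptions, not the package's implementation.

```python
def rows_to_markdown(headers, rows):
    """Render a header row and data rows as a Markdown pipe table."""

    def cell(value):
        # Escape pipes and flatten newlines so a value cannot break the row.
        return str(value).replace("|", "\\|").replace("\n", " ")

    lines = [
        "| " + " | ".join(cell(h) for h in headers) + " |",
        "|" + "|".join("---" for _ in headers) + "|",
    ]
    for row in rows:
        lines.append("| " + " | ".join(cell(v) for v in row) + " |")
    return "\n".join(lines)
```

Escaping matters for spreadsheet data, where cell values often contain pipe characters or line breaks.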
Architecture
┌─────────────────────────────────────────┐
│               OPPPipeline               │
│  detect_format() → Extractor → Report   │
└─────────────────────────────────────────┘
┌──────────┐    ┌───────────┐    ┌────────────────┐    ┌──────────────┐
│ detector │───▶│ extractors│───▶│resource_manager│───▶│error_handler │
│  magic   │    │ DOCX/...  │    │   MD5 + UUID   │    │  HTML/text   │
└──────────┘    └───────────┘    └────────────────┘    └──────────────┘
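The "MD5 + UUID" box in the diagram can be illustrated with a short sketch: hash the image bytes to detect duplicates, and give each new image a UUID-based file name. This is a guess at the general technique, not the contents of `resource_manager.py`. The class name `ImageStore`, its in-memory index, and the default `.png` suffix are all hypothetical.

```python
import hashlib
import uuid
from pathlib import Path


class ImageStore:
    """Store image bytes once per unique content, under UUID file names."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)
        self._by_digest = {}  # MD5 hex digest -> stored path (in-memory only)

    def store(self, data: bytes, suffix: str = ".png") -> Path:
        digest = hashlib.md5(data).hexdigest()
        if digest in self._by_digest:
            # Same content seen before: reuse the existing file.
            return self._by_digest[digest]
        path = self.root / f"{uuid.uuid4().hex}{suffix}"
        path.write_bytes(data)
        self._by_digest[digest] = path
        return path
```

MD5 is used here only as a content fingerprint for deduplication, not as a security measure.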
Development
pip install -e ".[dev]"
pytest tests/ -v --cov=src/opp --cov-report=term-missing
Test Coverage
| Module | Tests |
|---|---|
| detector | 13 |
| resource_manager | 18 |
| error_handler | 18 |
| integration | 25 |
| cli | 18 |
| e2e | 52 |
| xliff | 40+ |
| extractors | 140+ |
| Total | 479+ |
Batch Testing
Test files covering all supported formats are available in batch_test/.
opp --target-format=both --source-lang=en --target-lang=zh --output-dir=output batch_test/
MCP Server (Agent-Facing)
The OPP MCP server provides document extraction capabilities to AI agents via the Model Context Protocol. AI assistants can use these tools to process documents without needing to understand OPP's internal architecture.
Why Use the MCP Server?
- Agent integration - Connect OPP to any MCP-compatible AI assistant
- stdio transport - Communication over standard input/output for security
- 5 extraction tools - Cover all major document formats
- Path security - Directory allowlist prevents unauthorized file access
Installation
# Install OPP with MCP server support
pip install -e ".[mcp]"
Quick Start
Start the server manually:
python -m opp.mcp.server
Auto-start with uvx:
uvx opp-mcp-server
Auto-start with npx:
npx opp-mcp-server
Hermes Configuration
Add OPP to your Hermes agent configuration:
agents:
  my-agent:
    tools:
      - name: opp
        type: code
        config:
          server_command: uvx opp-mcp-server
          allowed_directories:
            - /path/to/documents
            - /path/to/output
Available Tools
| Tool | Description |
|---|---|
| extract_document | Extract content from a single document file. Supports DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and images. Returns markdown or structured content. |
| batch_extract | Process multiple files in one request. Takes an array of file paths and processes them sequentially. Returns extraction results for each file. |
| detect_format | Identify the file format of a document using magic bytes detection. Works regardless of file extension. Returns format name and confidence score. |
| generate_markdown | Convert a document to markdown format. Specify source and target languages for proper text processing. |
| generate_xliff | Convert a document to XLIFF format for translation workflows. Requires source-lang and target-lang parameters. |
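MCP clients invoke tools with JSON-RPC 2.0 `tools/call` requests, which the stdio transport carries as one JSON message per line. The sketch below builds such a request for the tools in the table. The helper name `tool_call_request` is hypothetical, and the argument key used in the usage note (`path`) is an assumption, because the README does not list each tool's parameter names.

```python
import json


def tool_call_request(request_id, tool, arguments):
    """Serialize an MCP tools/call request as a single-line JSON-RPC message."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # json.dumps emits no raw newlines by default, as the stdio framing requires.
    return json.dumps(message)
```

For example, `tool_call_request(1, "detect_format", {"path": "/path/to/documents/report.docx"})` yields a line that a client could write to the server's standard input after completing the MCP initialization handshake. In practice an MCP client library handles this framing.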
Security
The MCP server enforces path validation to prevent unauthorized file access.
Allowlist configuration:
# Via environment variable
export OPP_ALLOWED_DIRECTORIES="/allowed/documents,/allowed/output"
# Via configuration file
Configuration file (opp_mcp_config.yaml):
security:
  allowed_directories:
    - /mnt/d/贯维/Documents
    - /mnt/d/贯维/Output
    - ./documents
server:
  host: localhost
  port: 8765
extraction:
  default_target_format: md
  ocr_engine: tesseract
Environment Variables
| Variable | Description | Default |
|---|---|---|
| OPP_ALLOWED_DIRECTORIES | Comma-separated list of allowed directories | Required |
| OPP_RESOURCE_STORAGE_DIR | Directory for extracted images | ./resources |
| OPP_OCR_ENGINE | OCR engine to use | tesseract |
| OPP_LOG_LEVEL | Logging level | INFO |
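A directory allowlist of the kind described under Security is usually enforced by resolving each requested path and checking that it falls under an allowed root. The sketch below shows one way to do that with `OPP_ALLOWED_DIRECTORIES`. It is an illustration, not the server's validation code: the function names `allowed_roots` and `is_allowed` are hypothetical, as is the choice to deny everything when the variable is unset.

```python
import os
from pathlib import Path


def allowed_roots(env=None):
    """Parse the comma-separated OPP_ALLOWED_DIRECTORIES value into resolved paths."""
    env = os.environ if env is None else env
    raw = env.get("OPP_ALLOWED_DIRECTORIES", "")
    return [Path(part.strip()).resolve() for part in raw.split(",") if part.strip()]


def is_allowed(candidate, roots):
    """Return True only if the resolved path is an allowed root or lies inside one."""
    resolved = Path(candidate).resolve()  # collapses ".." and follows symlinks
    return any(resolved == root or root in resolved.parents for root in roots)
```

Comparing resolved paths component by component, rather than by string prefix, keeps `/allowed/documents-evil` from matching `/allowed/documents` and stops `..` segments from escaping an allowed root.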