Omni Pre-Processor: Document content extraction package

OPP - Omni Pre-Processor

Document content extraction for DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and Image (OCR).

Features

  • Multi-format extraction - DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, Image, IPYNB, YouTube URL
  • Image OCR - Tesseract and RapidOCR with graceful fallback
  • Email extraction - EML (RFC 822) and MSG (Outlook) with attachment recursion
  • Audio/Video transcription - Whisper-based ASR
  • Format auto-detection - Magic bytes detection (extension not required)
  • Resource management - MD5 deduplication, UUID naming for images
  • Pipeline orchestrator - detect → extract → manage → report
  • CLI interface - Full command-line with batch support
  • Output formats - Markdown and XLIFF 1.2/2.0
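
The resource-management feature (MD5 deduplication, UUID naming) can be pictured with a small sketch. This is only an illustration under stated assumptions: `ImageStore` and its `store` method are hypothetical names, not OPP's actual resource manager, and the real implementation may differ.

```python
import hashlib
import uuid
from pathlib import Path
from typing import Dict


class ImageStore:
    """Hypothetical stand-in for a resource manager (not OPP's real class)."""

    def __init__(self, storage_dir: str) -> None:
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        # Maps an MD5 hex digest to the path of the file already stored.
        self._by_digest: Dict[str, Path] = {}

    def store(self, data: bytes, suffix: str = ".png") -> Path:
        # MD5 is used here as a content fingerprint, not as a security measure.
        digest = hashlib.md5(data).hexdigest()
        if digest in self._by_digest:
            # Same bytes seen before: reuse the existing file.
            return self._by_digest[digest]
        # A random UUID gives a name that does not depend on the source file.
        path = self.storage_dir / f"{uuid.uuid4().hex}{suffix}"
        path.write_bytes(data)
        self._by_digest[digest] = path
        return path
```

Storing the same image bytes twice returns the same path and writes only one file, which is the behaviour the feature list describes.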

Installation

# Core package
pip install -e .

# With office/data formats (XLSX, CSV, JSON, XML)
pip install -e ".[office]"

# With email and OCR (EML, MSG, Tesseract, RapidOCR)
pip install -e ".[email]"

Quick Start

Python API

from opp import DOCXExtractor, PDFExtractor, PPTXExtractor
from opp.detector import detect_format
from opp.pipeline import OPPPipeline

# Direct extraction
extractor = DOCXExtractor()
result = extractor.extract("document.docx")
print(result.content)

# Auto-detection
fmt, confidence = detect_format("document.docx")
print(f"Format: {fmt.value}, Confidence: {confidence}")

# Full pipeline
pipeline = OPPPipeline(resource_storage_dir="./resources")
result = pipeline.process_file("document.docx")
print(f"Extracted: {len(result.content)} chars, {result.images_stored} images")
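
To illustrate how detection can work without relying on the file extension, here is a minimal magic-bytes sketch. It is not the code behind `detect_format`: the function `sniff_format`, the signature table, and the returned labels are assumptions made for this example, and it returns no confidence score.

```python
import zipfile

# A few well-known file signatures; OPP's own table may differ.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
]
ZIP_MAGIC = b"PK\x03\x04"

# DOCX, PPTX and XLSX are ZIP archives, told apart by a characteristic member.
OOXML_MARKERS = {
    "word/document.xml": "docx",
    "ppt/presentation.xml": "pptx",
    "xl/workbook.xml": "xlsx",
}


def sniff_format(path: str) -> str:
    """Guess a format label from file content only (hypothetical helper)."""
    with open(path, "rb") as fh:
        header = fh.read(8)
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    if header.startswith(ZIP_MAGIC):
        # Look inside the archive to distinguish the Office formats.
        with zipfile.ZipFile(path) as zf:
            members = set(zf.namelist())
        for member, name in OOXML_MARKERS.items():
            if member in members:
                return name
        return "zip"
    return "unknown"
```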

CLI

# Extract to Markdown
opp --target-format=md document.docx

# Extract to XLIFF for translation
opp --target-format=xlf --source-lang=en --target-lang=zh document.docx

# Generate both MD and XLIFF
opp --target-format=both --source-lang=en --target-lang=zh document.docx

# Custom output directory
opp --target-format=md --output-dir ./output document.docx

# Image OCR
opp --ocr-engine tesseract scan.png

# Batch processing
opp --batch file1.docx file2.pdf file3.pptx
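
For readers unfamiliar with XLIFF, the following sketch shows roughly what a minimal XLIFF 1.2 file for an en → zh job contains. It uses only the Python standard library; `build_xliff` is a hypothetical helper and the output produced by `opp --target-format=xlf` may be structured differently.

```python
import xml.etree.ElementTree as ET
from typing import Iterable

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"


def build_xliff(segments: Iterable[str], source_lang: str,
                target_lang: str, original: str) -> str:
    """Build a minimal XLIFF 1.2 document with empty targets (illustrative)."""
    # Serialize the XLIFF namespace as the default namespace.
    ET.register_namespace("", XLIFF_NS)
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for index, text in enumerate(segments, start=1):
        # One trans-unit per extracted segment; the target is left for the translator.
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(index)})
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
        ET.SubElement(unit, f"{{{XLIFF_NS}}}target")
    return ET.tostring(root, encoding="unicode")


print(build_xliff(["Hello", "World"], "en", "zh", "document.docx"))
```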

Windows Batch Scripts

Script            Description
md.bat            Convert to Markdown
en2cn_xliff.bat   English source → Chinese XLIFF
cn2en_xliff.bat   Chinese source → English XLIFF

md.bat "document.docx"
md.bat "folder"

en2cn_xliff.bat "english.docx"
cn2en_xliff.bat "中文.docx"

The scripts support drag-and-drop of files and folders. Logs are saved to logs/.

Project Structure

src/opp/
├── detector.py           # Format auto-detection
├── extractors/           # Document extractors
│   ├── docx.py
│   ├── pptx.py
│   ├── pdf.py
│   ├── xlsx.py
│   ├── csv.py
│   ├── json.py
│   ├── xml.py
│   ├── email.py
│   └── image_ocr.py
├── channels/             # Output formatters
│   ├── table_channel.py   # DataFrame → Markdown table
│   └── keyvalue_channel.py # dict → XLIFF
├── xliff/                # XLIFF 1.2/2.0 generator
├── pipeline.py           # OPPPipeline orchestrator
├── resource_manager.py   # Image deduplication
└── cli.py                # Command-line interface

Architecture

                     ┌─────────────────────────────────────────┐
                     │               OPPPipeline               │
                     │  detect_format() → Extractor → Report   │
                     └─────────────────────────────────────────┘

┌──────────┐    ┌───────────┐    ┌────────────────┐    ┌──────────────┐
│ detector │───▶│ extractors│───▶│resource_manager│───▶│error_handler │
│  magic   │    │ DOCX/...  │    │  MD5 + UUID    │    │ HTML/text    │
└──────────┘    └───────────┘    └────────────────┘    └──────────────┘
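
The flow in the diagram (detect, extract, manage resources, report) can be sketched as a small orchestrator. This is a simplified illustration, not `OPPPipeline` itself: `MiniPipeline`, `Report`, and the callables passed in are all hypothetical names introduced for this example.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple

# An extractor takes a path and returns (text content, list of image blobs).
Extractor = Callable[[str], Tuple[str, List[bytes]]]


@dataclass
class Report:
    fmt: str
    content: str = ""
    images_stored: int = 0
    errors: List[str] = field(default_factory=list)


class MiniPipeline:
    """Hypothetical orchestrator: detect -> extract -> manage -> report."""

    def __init__(self, detect: Callable[[str], str],
                 extractors: Dict[str, Extractor],
                 store_image: Callable[[bytes], object]) -> None:
        self.detect = detect
        self.extractors = extractors
        self.store_image = store_image

    def process(self, path: str) -> Report:
        fmt = self.detect(path)                      # 1. detect
        report = Report(fmt=fmt)
        extractor = self.extractors.get(fmt)
        if extractor is None:
            report.errors.append(f"no extractor for format {fmt!r}")
            return report
        try:
            content, images = extractor(path)        # 2. extract
        except Exception as exc:                     # errors end up in the report
            report.errors.append(str(exc))
            return report
        report.content = content
        for blob in images:                          # 3. manage resources
            self.store_image(blob)
            report.images_stored += 1
        return report                                # 4. report
```

Passing the detector, extractors, and image store in as plain callables keeps each stage replaceable and easy to test in isolation.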

Development

pip install -e ".[dev]"
pytest tests/ -v --cov=src/opp --cov-report=term-missing

Test Coverage

Module             Tests
detector           13
resource_manager   18
error_handler      18
integration        25
cli                18
e2e                52
xliff              40+
extractors         140+
Total              479+

Batch Testing

Test files covering all supported formats are available in batch_test/.

opp --target-format=both --source-lang=en --target-lang=zh --output-dir=output batch_test/

MCP Server (Agent-Facing)

The OPP MCP server provides document extraction capabilities to AI agents via the Model Context Protocol. AI assistants can use these tools to process documents without needing to understand OPP's internal architecture.

Why Use the MCP Server?

  • Agent integration - Connect OPP to any MCP-compatible AI assistant
  • stdio transport - Communication over standard input/output for security
  • 5 extraction tools - Cover all major document formats
  • Path security - Directory allowlist prevents unauthorized file access

Installation

# Install OPP with MCP server support
pip install -e ".[mcp]"

Quick Start

Start the server manually:

python -m opp.mcp.server

Auto-start with uvx:

uvx opp-mcp-server

Auto-start with npx:

npx opp-mcp-server

Hermes Configuration

Add OPP to your Hermes agent configuration:

agents:
  my-agent:
    tools:
      - name: opp
        type: code
        config:
          server_command: uvx opp-mcp-server
          allowed_directories:
            - /path/to/documents
            - /path/to/output

Available Tools

Tool                Description
extract_document    Extract content from a single document file. Supports DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and images. Returns markdown or structured content.
batch_extract       Process multiple files in one request. Takes an array of file paths and processes them sequentially. Returns extraction results for each file.
detect_format       Identify the file format of a document using magic bytes detection. Works regardless of file extension. Returns format name and confidence score.
generate_markdown   Convert a document to markdown format. Specify source and target languages for proper text processing.
generate_xliff      Convert a document to XLIFF format for translation workflows. Requires source-lang and target-lang parameters.
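
MCP clients invoke tools through JSON-RPC 2.0 `tools/call` requests, sent over stdio as one JSON message per line. The sketch below only builds such a message. The argument key `file_path` is an assumption, since the tool schemas are not documented here, and a real client would first complete the MCP `initialize` handshake (normally handled by an MCP client library).

```python
import json
from typing import Any, Dict


def tool_call_message(request_id: int, tool: str,
                      arguments: Dict[str, Any]) -> str:
    """Serialize a JSON-RPC 2.0 tools/call request as a single line."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    # The stdio transport is newline-delimited, so emit exactly one line.
    return json.dumps(message, ensure_ascii=False) + "\n"


# "file_path" is a hypothetical argument name used for illustration only.
line = tool_call_message(1, "extract_document", {"file_path": "document.docx"})
print(line, end="")
```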

Security

The MCP server enforces path validation to prevent unauthorized file access.
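
A directory allowlist check of this kind is commonly written by resolving the requested path and testing whether it lies under an allowed root. The following is a minimal sketch of that idea, not the server's actual validation code; `is_path_allowed` is a hypothetical helper.

```python
from pathlib import Path
from typing import Iterable


def is_path_allowed(candidate: str, allowed_directories: Iterable[str]) -> bool:
    """Return True if candidate resolves to a location inside an allowed root."""
    # resolve() collapses ".." segments and follows symlinks, so a path such as
    # "/allowed/documents/../secret.txt" is judged by where it really points.
    resolved = Path(candidate).resolve()
    for directory in allowed_directories:
        root = Path(directory).resolve()
        # Compare whole path components rather than string prefixes, so that
        # "/allowed/documents_evil" does not pass for "/allowed/documents".
        if resolved == root or root in resolved.parents:
            return True
    return False
```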

Allowlist configuration:

# Via environment variable
export OPP_ALLOWED_DIRECTORIES="/allowed/documents,/allowed/output"

# Via configuration file (opp_mcp_config.yaml, shown below)

Configuration file (opp_mcp_config.yaml):

security:
  allowed_directories:
    - /mnt/d/贯维/Documents
    - /mnt/d/贯维/Output
    - ./documents

server:
  host: localhost
  port: 8765

extraction:
  default_target_format: md
  ocr_engine: tesseract

Environment Variables

Variable                   Description                                   Default
OPP_ALLOWED_DIRECTORIES    Comma-separated list of allowed directories   Required
OPP_RESOURCE_STORAGE_DIR   Directory for extracted images                ./resources
OPP_OCR_ENGINE             OCR engine to use                             tesseract
OPP_LOG_LEVEL              Logging level                                 INFO
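
As an illustration of how these variables and their defaults fit together, here is a small loader sketch. `ServerSettings` and `load_settings` are hypothetical names, and the way the real server reads its environment may differ.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class ServerSettings:
    allowed_directories: List[str]
    resource_storage_dir: str
    ocr_engine: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read the documented variables, applying the defaults from the table."""
    raw = env.get("OPP_ALLOWED_DIRECTORIES", "")
    # Comma-separated list; blank entries and surrounding spaces are ignored.
    allowed = [part.strip() for part in raw.split(",") if part.strip()]
    if not allowed:
        raise ValueError("OPP_ALLOWED_DIRECTORIES is required")
    return ServerSettings(
        allowed_directories=allowed,
        resource_storage_dir=env.get("OPP_RESOURCE_STORAGE_DIR", "./resources"),
        ocr_engine=env.get("OPP_OCR_ENGINE", "tesseract"),
        log_level=env.get("OPP_LOG_LEVEL", "INFO"),
    )
```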
