Omni Pre-Processor: Document content extraction package

Maintainers

1StepMore

OPP - Omni Pre-Processor

Document content extraction for DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and Image (OCR).

Features

Multi-format extraction - DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, Image, IPYNB, YouTube URL
Image OCR - Tesseract and RapidOCR with graceful fallback
Email extraction - EML (RFC 822) and MSG (Outlook) with attachment recursion
Audio/Video transcription - Whisper-based ASR
Format auto-detection - Magic bytes detection (extension not required)
Resource management - MD5 deduplication, UUID naming for images
Pipeline orchestrator - detect → extract → manage → report
CLI - Full command-line interface with batch support
Output formats - Markdown and XLIFF 1.2/2.0
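The resource management feature (MD5 deduplication with UUID naming) can be illustrated with a minimal sketch. The `ResourceStore` class and its `store` method are hypothetical names chosen for this example and are not OPP's actual `resource_manager` API; the sketch only assumes that identical image bytes should be written once and that stored files get random UUID-based names.

```python
import hashlib
import uuid
from pathlib import Path


class ResourceStore:
    """Hypothetical image store; not OPP's real resource manager."""

    def __init__(self, storage_dir):
        self.storage_dir = Path(storage_dir)
        self.storage_dir.mkdir(parents=True, exist_ok=True)
        self._by_digest = {}  # MD5 hex digest -> path of the stored file

    def store(self, data: bytes, suffix: str = ".png") -> Path:
        """Write data once per unique content and return the stored path."""
        digest = hashlib.md5(data).hexdigest()
        if digest in self._by_digest:
            # Same bytes seen before: reuse the existing file.
            return self._by_digest[digest]
        # New content: name the file with a random UUID, not the original name.
        path = self.storage_dir / f"{uuid.uuid4().hex}{suffix}"
        path.write_bytes(data)
        self._by_digest[digest] = path
        return path
```

MD5 is used here only as a content fingerprint, not for security. Storing the same bytes twice returns the same path, so a document that embeds one logo on every page would produce a single file.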

Installation

# Core package (editable install, run from the repository root)
pip install -e .

# With office/data formats (XLSX, CSV, JSON, XML)
pip install -e ".[office]"

# With email and OCR (EML, MSG, Tesseract, RapidOCR)
pip install -e ".[email]"

Quick Start

Python API

from opp import DOCXExtractor, PDFExtractor, PPTXExtractor
from opp.detector import detect_format
from opp.pipeline import OPPPipeline

# Direct extraction
extractor = DOCXExtractor()
result = extractor.extract("document.docx")
print(result.content)

# Auto-detection
fmt, confidence = detect_format("document.docx")
print(f"Format: {fmt.value}, Confidence: {confidence}")

# Full pipeline
pipeline = OPPPipeline(resource_storage_dir="./resources")
result = pipeline.process_file("document.docx")
print(f"Extracted: {len(result.content)} chars, {result.images_stored} images")
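To show what content-based detection can look like, here is a minimal sketch that returns a `(format, confidence)` pair without consulting the file extension. It is not OPP's `detect_format` implementation: the function name `sniff_format`, the format labels, and the confidence values are assumptions made for illustration. The byte signatures themselves are the well-known ones for PDF, PNG, JPEG, OLE2, and ZIP.

```python
import zipfile
from pathlib import Path

# Leading byte signatures for a few formats.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # container used by MSG files
]

# DOCX, PPTX and XLSX are all ZIP archives; a marker entry tells them apart.
ZIP_MARKERS = [
    ("word/document.xml", "docx"),
    ("ppt/presentation.xml", "pptx"),
    ("xl/workbook.xml", "xlsx"),
]


def sniff_format(path):
    """Return (format_name, confidence) based on file content only."""
    path = Path(path)
    with path.open("rb") as fh:
        head = fh.read(8)
    for magic, name in SIGNATURES:
        if head.startswith(magic):
            return name, 1.0
    if head.startswith(b"PK\x03\x04") and zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            names = set(zf.namelist())
            for marker, name in ZIP_MARKERS:
                if marker in names:
                    return name, 1.0
            # EPUB archives carry a 'mimetype' entry with a fixed value.
            if "mimetype" in names and zf.read("mimetype").strip() == b"application/epub+zip":
                return "epub", 1.0
        return "zip", 0.5  # a ZIP archive of unknown kind
    return "unknown", 0.0
```

The ZIP branch matters because the signature alone cannot distinguish DOCX, PPTX, XLSX, and EPUB; the sketch opens the archive and looks for a format-specific entry.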

CLI

# Extract to Markdown
opp --target-format=md document.docx

# Extract to XLIFF for translation
opp --target-format=xlf --source-lang=en --target-lang=zh document.docx

# Generate both MD and XLIFF
opp --target-format=both --source-lang=en --target-lang=zh document.docx

# Custom output directory
opp --target-format=md --output-dir ./output document.docx

# Image OCR
opp --ocr-engine tesseract scan.png

# Batch processing
opp --batch file1.docx file2.pdf file3.pptx

Windows Batch Scripts

Script	Description
`md.bat`	Convert to Markdown
`en2cn_xliff.bat`	English source → Chinese XLIFF
`cn2en_xliff.bat`	Chinese source → English XLIFF

md.bat "document.docx"
md.bat "folder"

en2cn_xliff.bat "english.docx"
cn2en_xliff.bat "中文.docx"

The scripts support drag-and-drop of files and folders. Logs are saved to logs/.

Project Structure

src/opp/
├── detector.py           # Format auto-detection
├── extractors/           # Document extractors
│   ├── docx.py
│   ├── pptx.py
│   ├── pdf.py
│   ├── xlsx.py
│   ├── csv.py
│   ├── json.py
│   ├── xml.py
│   ├── email.py
│   └── image_ocr.py
├── channels/             # Output formatters
│   ├── table_channel.py   # DataFrame → Markdown table
│   └── keyvalue_channel.py # dict → XLIFF
├── xliff/                # XLIFF 1.2/2.0 generator
├── pipeline.py           # OPPPipeline orchestrator
├── resource_manager.py   # Image deduplication
└── cli.py               # Command-line interface
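The tree lists `keyvalue_channel.py` as a dict → XLIFF formatter. A minimal sketch of that idea for XLIFF 1.2, using only the standard library, might look as follows. The function name `dict_to_xliff12`, its parameters, and the `datatype="plaintext"` choice are assumptions for illustration; OPP's own generator may structure its output differently.

```python
import xml.etree.ElementTree as ET

XLIFF_NS = "urn:oasis:names:tc:xliff:document:1.2"
# Serialize the XLIFF namespace as the default namespace (no prefix).
ET.register_namespace("", XLIFF_NS)


def dict_to_xliff12(units, source_lang, target_lang, original="document"):
    """Build an XLIFF 1.2 string with one trans-unit per dict entry."""
    root = ET.Element(f"{{{XLIFF_NS}}}xliff", {"version": "1.2"})
    file_el = ET.SubElement(root, f"{{{XLIFF_NS}}}file", {
        "original": original,
        "source-language": source_lang,
        "target-language": target_lang,
        "datatype": "plaintext",
    })
    body = ET.SubElement(file_el, f"{{{XLIFF_NS}}}body")
    for unit_id, text in units.items():
        unit = ET.SubElement(body, f"{{{XLIFF_NS}}}trans-unit", {"id": str(unit_id)})
        # Only <source> is filled in; <target> is left for the translator.
        ET.SubElement(unit, f"{{{XLIFF_NS}}}source").text = text
    return ET.tostring(root, encoding="unicode")


xliff_text = dict_to_xliff12({"title": "Hello"}, "en", "zh")
```

Each dictionary key becomes a `trans-unit` id, which gives a translation tool a stable handle for mapping translated text back to the source structure.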

Architecture

                     ┌─────────────────────────────────────────┐
                     │              OPPPipeline                  │
                     │  detect_format() → Extractor → Report   │
                     └─────────────────────────────────────────┘

┌──────────┐    ┌───────────┐    ┌────────────────┐    ┌──────────────┐
│ detector │───▶│ extractors│───▶│resource_manager│───▶│error_handler │
│  magic   │    │  DOCX/...  │    │  MD5 + UUID    │    │ HTML/text    │
└──────────┘    └───────────┘    └────────────────┘    └──────────────┘
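The flow in the diagram (detect, extract, manage resources, report) can be sketched with plain callables. Everything named here (`MiniPipeline`, `Report`, and the callable signatures) is hypothetical and only mirrors the diagram; it is not the `OPPPipeline` implementation.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Report:
    fmt: str
    content: str = ""
    images_stored: int = 0
    errors: List[str] = field(default_factory=list)


class MiniPipeline:
    """Hypothetical orchestrator: detect -> extract -> manage -> report."""

    def __init__(self, detect, extractors, store_image):
        self.detect = detect            # path -> (format_name, confidence)
        self.extractors = extractors    # format_name -> callable(path) -> (text, image_blobs)
        self.store_image = store_image  # bytes -> stored location

    def process(self, path) -> Report:
        fmt, _confidence = self.detect(path)
        report = Report(fmt=fmt)
        extractor = self.extractors.get(fmt)
        if extractor is None:
            report.errors.append(f"no extractor registered for {fmt}")
            return report
        try:
            content, images = extractor(path)
        except Exception as exc:  # record the failure instead of aborting
            report.errors.append(str(exc))
            return report
        report.content = content
        # Count distinct stored locations, so deduplicated images count once.
        report.images_stored = len({self.store_image(blob) for blob in images})
        return report
```

Collecting errors in the report rather than raising lets a batch run continue past a single unreadable file.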

Development

pip install -e ".[dev]"
pytest tests/ -v --cov=src/opp --cov-report=term-missing

Test Coverage

Module	Tests
detector	13
resource_manager	18
error_handler	18
integration	25
cli	18
e2e	52
xliff	40+
extractors	140+
Total	479+

Batch Testing

Test files covering all formats are available in batch_test/.

opp --target-format=both --source-lang=en --target-lang=zh --output-dir=output batch_test/

MCP Server (Agent-Facing)

The OPP MCP server provides document extraction capabilities to AI agents via the Model Context Protocol. AI assistants can use these tools to process documents without needing to understand OPP's internal architecture.

Why Use the MCP Server?

Agent integration - Connect OPP to any MCP-compatible AI assistant
stdio transport - Communication over standard input/output for security
5 extraction tools - Cover all major document formats
Path security - Directory allowlist prevents unauthorized file access

Installation

# Install OPP with MCP server support
pip install -e ".[mcp]"

Quick Start

Start the server manually:

python -m opp.mcp.server

Auto-start with uvx:

uvx opp-mcp-server

Auto-start with npx:

npx opp-mcp-server

Hermes Configuration

Add OPP to your Hermes agent configuration:

agents:
  my-agent:
    tools:
      - name: opp
        type: code
        config:
          server_command: uvx opp-mcp-server
          allowed_directories:
            - /path/to/documents
            - /path/to/output

Available Tools

Tool	Description
`extract_document`	Extract content from a single document file. Supports DOCX, PPTX, PDF, XLSX, CSV, JSON, XML, HTML, EPUB, EML, MSG, and images. Returns markdown or structured content.
`batch_extract`	Process multiple files in one request. Takes an array of file paths and processes them sequentially. Returns extraction results for each file.
`detect_format`	Identify the file format of a document using magic bytes detection. Works regardless of file extension. Returns format name and confidence score.
`generate_markdown`	Convert a document to markdown format. Specify source and target languages for proper text processing.
`generate_xliff`	Convert a document to XLIFF format for translation workflows. Requires source-lang and target-lang parameters.
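For readers wiring up a client by hand, the sketch below builds the JSON-RPC 2.0 message that the Model Context Protocol uses to invoke a tool (`tools/call`). The tool name comes from the table above; the argument name `file_path` is an assumption, since the tools' input schemas are not listed here. A real session also needs the MCP `initialize` handshake before any tool call, which an MCP client library normally handles.

```python
import json


def build_tool_call(request_id, tool_name, arguments):
    """Return an MCP tools/call request as a JSON-RPC 2.0 dict."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# 'file_path' is a hypothetical argument name used for illustration.
message = build_tool_call(1, "detect_format", {"file_path": "/path/to/documents/report.docx"})

# Over the stdio transport each message is sent as one line of JSON.
line = json.dumps(message)
```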

Security

The MCP server enforces path validation to prevent unauthorized file access.

Allowlist configuration:

# Via environment variable
export OPP_ALLOWED_DIRECTORIES="/allowed/documents,/allowed/output"

Via configuration file (opp_mcp_config.yaml):

security:
  allowed_directories:
    - /mnt/d/贯维/Documents
    - /mnt/d/贯维/Output
    - ./documents

server:
  host: localhost
  port: 8765

extraction:
  default_target_format: md
  ocr_engine: tesseract
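A directory allowlist check of the kind described in this section can be sketched as below. The function name `is_path_allowed` is hypothetical and this is not the server's actual validation code; it shows one common approach, which is to resolve the requested path first so that `..` segments and symlinks cannot escape an allowed directory.

```python
from pathlib import Path


def is_path_allowed(candidate, allowed_directories):
    """Return True if candidate resolves to a location inside an allowed directory."""
    resolved = Path(candidate).resolve()
    for directory in allowed_directories:
        root = Path(directory).resolve()
        # Compare whole path components, not string prefixes, so that
        # '/data/docs_evil' is not mistaken for a child of '/data/docs'.
        if resolved == root or root in resolved.parents:
            return True
    return False
```

With `["/allowed/documents"]` as the allowlist, a request for `/allowed/documents/../secret.txt` resolves to `/allowed/secret.txt` and is rejected.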

Environment Variables

Variable	Description	Default
`OPP_ALLOWED_DIRECTORIES`	Comma-separated list of allowed directories	Required
`OPP_RESOURCE_STORAGE_DIR`	Directory for extracted images	`./resources`
`OPP_OCR_ENGINE`	OCR engine to use	`tesseract`
`OPP_LOG_LEVEL`	Logging level	`INFO`
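The variables and defaults in the table can be read into a settings object as in the following sketch. The `ServerSettings` class and `load_settings` function are hypothetical names, and raising an error when `OPP_ALLOWED_DIRECTORIES` is missing is an assumption based on the "Required" entry; the server may instead fall back to the configuration file.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass
class ServerSettings:
    allowed_directories: List[str]
    resource_storage_dir: str
    ocr_engine: str
    log_level: str


def load_settings(env: Mapping[str, str] = os.environ) -> ServerSettings:
    """Read OPP_* variables, applying the defaults listed in the table."""
    raw = env.get("OPP_ALLOWED_DIRECTORIES", "")
    # Comma-separated list; ignore empty items and surrounding whitespace.
    allowed = [item.strip() for item in raw.split(",") if item.strip()]
    if not allowed:
        raise ValueError("OPP_ALLOWED_DIRECTORIES is required")
    return ServerSettings(
        allowed_directories=allowed,
        resource_storage_dir=env.get("OPP_RESOURCE_STORAGE_DIR", "./resources"),
        ocr_engine=env.get("OPP_OCR_ENGINE", "tesseract"),
        log_level=env.get("OPP_LOG_LEVEL", "INFO"),
    )
```

Passing a plain dict instead of `os.environ` makes the loader easy to exercise in tests without touching the process environment.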

Release history

This version

0.1.0

May 20, 2026

Download files

Source Distribution

omni_pre_processor-0.1.0.tar.gz (47.7 kB)

Uploaded May 20, 2026 Source

Built Distribution

omni_pre_processor-0.1.0-py3-none-any.whl (67.4 kB)

Uploaded May 20, 2026 Python 3

File details

Details for the file omni_pre_processor-0.1.0.tar.gz.

File metadata

Download URL: omni_pre_processor-0.1.0.tar.gz
Upload date: May 20, 2026
Size: 47.7 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for omni_pre_processor-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`00cda1d74bd7a624739989c0653171698bdd0464b3ee07d29dc139b9e46acdd0`
MD5	`4bb49f8d486500415f533d83a28170e3`
BLAKE2b-256	`e5366e3d72fb68fa950205a4a5e4955acbddc0521f5bb55475c5f32169f0b3c5`

Provenance

The following attestation bundles were made for omni_pre_processor-0.1.0.tar.gz:

Publisher: publish.yml on 1StepMore/Omni_Pre_Processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: omni_pre_processor-0.1.0.tar.gz
- Subject digest: 00cda1d74bd7a624739989c0653171698bdd0464b3ee07d29dc139b9e46acdd0
- Sigstore transparency entry: 1575993434
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: 1StepMore/Omni_Pre_Processor@ecdb6a6a817c3cf3ae60f12cb98377065f49e77c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/1StepMore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ecdb6a6a817c3cf3ae60f12cb98377065f49e77c
- Trigger Event: workflow_dispatch

File details

Details for the file omni_pre_processor-0.1.0-py3-none-any.whl.

File metadata

Download URL: omni_pre_processor-0.1.0-py3-none-any.whl
Upload date: May 20, 2026
Size: 67.4 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for omni_pre_processor-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`7e7e4d0bd90a10ac206bc35a7c0764677fdc00869bcfd78175c9580bedbc5fe4`
MD5	`4f39df15cdc9f01282cb5f310d7d3c70`
BLAKE2b-256	`e07e0a136664da6ca57cc58d7ee2b92dd2d9648f228105d4a9e97f2aa9f4b7a2`

Provenance

The following attestation bundles were made for omni_pre_processor-0.1.0-py3-none-any.whl:

Publisher: publish.yml on 1StepMore/Omni_Pre_Processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: omni_pre_processor-0.1.0-py3-none-any.whl
- Subject digest: 7e7e4d0bd90a10ac206bc35a7c0764677fdc00869bcfd78175c9580bedbc5fe4
- Sigstore transparency entry: 1575993475
- Sigstore integration time: May 20, 2026
Source repository:
- Permalink: 1StepMore/Omni_Pre_Processor@ecdb6a6a817c3cf3ae60f12cb98377065f49e77c
- Branch / Tag: refs/heads/main
- Owner: https://github.com/1StepMore
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@ecdb6a6a817c3cf3ae60f12cb98377065f49e77c
- Trigger Event: workflow_dispatch
