Intelligent document processing with OCR, chunking, and AI summarization

These details have not been verified by PyPI

Project description

DocProcessor

A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.

Features
Installation
Quick Start
Advanced Usage
API Reference
Development
Contributing
License

Features

Multi-format Support: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
Intelligent OCR: Layout-aware PDF text extraction with OCR fallback for images
Semantic Chunking: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
LLM Summarization: Generate concise document summaries (with fallback)
Meilisearch Integration: Built-in support for indexing to Meilisearch
Flexible API: Use components individually or as a unified pipeline

Installation

From PyPI

pip install docprocessor

From GitHub

pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git

For Development

git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"

System Dependencies

For OCR functionality, install system packages:

Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler
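
Before running OCR, it can help to confirm that the system tools are actually on the PATH. The following is a minimal sketch, not part of docprocessor; the executable names (`tesseract`, and `pdftoppm` from Poppler) are assumptions based on the packages listed above.

```python
import shutil

# Assumed executable names: the Tesseract package provides `tesseract`,
# and Poppler provides `pdftoppm` (used for PDF-to-image conversion).
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def check_ocr_tools(tools=REQUIRED_TOOLS):
    """Return a mapping of tool name -> True if the tool is found on PATH."""
    return {tool: shutil.which(tool) is not None for tool in tools}


if __name__ == "__main__":
    for tool, found in check_ocr_tools().items():
        print(f"{tool}: {'found' if found else 'MISSING'}")
```

A missing entry usually means the system package was not installed or its install location is not on the PATH.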

Quick Start

Basic Usage

from docprocessor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False  # Requires LLM client
)

print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")

With LLM Summarization

from docprocessor import DocumentProcessor

# Your LLM client (must have a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500
)

result = processor.process(
    file_path="document.pdf",
    summarize=True
)

print(f"Summary: {result.summary}")
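
The client above only needs a complete_chat method, so clients can be wrapped or swapped freely. The sketch below shows one possible pattern: an offline stand-in client plus a wrapper that falls back to it when the primary client raises. `EchoClient` and `FallbackClient` are hypothetical names, and the message format (a list of dicts with a "content" key) is an assumption; only the `complete_chat(messages, temperature)` signature and the `{"content": ...}` return shape come from the example above.

```python
from typing import Any, Dict, List, Protocol


class ChatClient(Protocol):
    """Shape of the client shown above (signature taken from the example)."""

    def complete_chat(
        self, messages: List[Dict[str, str]], temperature: float
    ) -> Dict[str, Any]:
        ...


class EchoClient:
    """Offline stand-in: 'summarizes' by returning the first few words."""

    def __init__(self, max_words: int = 20) -> None:
        self.max_words = max_words

    def complete_chat(self, messages, temperature):
        # Assumption: the last message carries the text to summarize.
        words = messages[-1]["content"].split()
        return {"content": " ".join(words[: self.max_words])}


class FallbackClient:
    """Try the primary client; use the secondary if the primary raises."""

    def __init__(self, primary: ChatClient, secondary: ChatClient) -> None:
        self.primary = primary
        self.secondary = secondary

    def complete_chat(self, messages, temperature):
        try:
            return self.primary.complete_chat(messages, temperature)
        except Exception:
            return self.secondary.complete_chat(messages, temperature)
```

An instance of `FallbackClient` could then be passed as `llm_client`, so a temporary API outage degrades to a rough summary instead of an error. This is a design suggestion, not a description of the library's built-in fallback.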

With Meilisearch Indexing

from docprocessor import DocumentProcessor, MeiliSearchIndexer

# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)

# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_"  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks"
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10
)

Advanced Usage

Custom Chunking Parameters

from docprocessor import DocumentProcessor

processor = DocumentProcessor(
    chunk_size=1024,      # Larger chunks
    chunk_overlap=100,    # More overlap
    min_chunk_size=200    # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt"
)
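
To build intuition for how the three parameters interact, here is a simplified sliding-window sketch over a list of tokens. It is not the library's algorithm (docprocessor uses LangChain's RecursiveCharacterTextSplitter), and merging a short tail into the previous chunk is an assumption made for illustration; `sliding_window_chunks` is a hypothetical helper.

```python
def sliding_window_chunks(tokens, chunk_size=512, chunk_overlap=50, min_chunk_size=100):
    """Split a token list into overlapping windows.

    Illustrative only: a trailing window shorter than min_chunk_size is
    merged into the previous chunk instead of being emitted on its own.
    """
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    tokens = list(tokens)
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + chunk_size]
        if len(window) < min_chunk_size and chunks:
            # The first chunk_overlap tokens are already in the previous chunk.
            chunks[-1].extend(window[chunk_overlap:])
            break
        chunks.append(window)
        if start + chunk_size >= len(tokens):
            break  # this window reached the end of the input
    return chunks
```

With 2,000 tokens and the settings above (1024, 100, 200), this sketch yields two chunks that share 100 tokens, and the short 152-token tail is folded into the second chunk.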

Extract Text Only

from docprocessor import DocumentProcessor

processor = DocumentProcessor()

extraction = processor.extract_text("document.pdf")

print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")

Multi-Environment Indexing

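from docprocessor import DocumentProcessor, MeiliSearchIndexer

# Build the search documents once (same steps as in the Quick Start)
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index to multiple environments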
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_"
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_"
    }
}

for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"]
    )

    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")
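
In practice you may prefer not to hardcode API keys in source. The sketch below builds one entry of the `environments` dict from environment variables; `load_env_config` and the `MEILI_<ENV>_URL` / `MEILI_<ENV>_API_KEY` variable names are hypothetical and not defined by docprocessor.

```python
import os
from typing import Dict, Mapping


def load_env_config(env_name: str, environ: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Build a config entry like those in `environments` above.

    The variable names read here are hypothetical; adapt them to your setup.
    Raises KeyError if a required variable is missing.
    """
    key = env_name.upper()
    return {
        "url": environ[f"MEILI_{key}_URL"],
        "api_key": environ[f"MEILI_{key}_API_KEY"],
        "prefix": f"{env_name}_",
    }
```

The loop above could then iterate over `{name: load_env_config(name) for name in ("dev", "prod")}`.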

API Reference

DocumentProcessor

Main class for document processing.

Parameters:

ocr_enabled (bool): Enable OCR for PDFs/images. Default: True
chunk_size (int): Target chunk size in tokens. Default: 512
chunk_overlap (int): Overlap between chunks. Default: 50
min_chunk_size (int): Minimum chunk size. Default: 100
summary_target_words (int): Target summary length. Default: 500
llm_client (Optional[Any]): LLM client for summarization
llm_temperature (float): LLM temperature. Default: 0.3

Methods:

process(): Full pipeline (extract, chunk, summarize)
extract_text(): Extract text from document
chunk_text(): Chunk text into segments
summarize_text(): Generate summary
chunks_to_search_documents(): Convert chunks for indexing

MeiliSearchIndexer

Interface for Meilisearch operations.

Parameters:

url (str): Meilisearch server URL
api_key (str): Meilisearch API key
index_prefix (Optional[str]): Prefix for index names

Methods:

index_chunks(): Index multiple documents
index_document(): Index single document
search(): Search an index
delete_document(): Delete by ID
delete_documents_by_filter(): Delete by filter
create_index(): Create new index

DocumentChunk

Data class representing a text chunk.

Attributes:

chunk_id (str): Unique identifier
file_id (str): Source file identifier
output_id (str): Output identifier
project_id (int): Project identifier
filename (str): Source filename
chunk_number (int): Chunk sequence number
total_chunks (int): Total chunks in document
chunk_text (str): The chunk text content
token_count (int): Number of tokens
pages (List[int]): Page numbers (for PDFs)
metadata (Dict): Additional metadata
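
For orientation, the attribute list above maps naturally onto a Python dataclass. The following is an illustrative sketch only: the class name `DocumentChunkSketch`, the empty defaults for pages and metadata, and the `to_dict` helper are assumptions, and the real class may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class DocumentChunkSketch:
    """Mirror of the documented attributes; not the library's own class."""

    chunk_id: str
    file_id: str
    output_id: str
    project_id: int
    filename: str
    chunk_number: int
    total_chunks: int
    chunk_text: str
    token_count: int
    pages: List[int] = field(default_factory=list)           # assumed default
    metadata: Dict[str, Any] = field(default_factory=dict)   # assumed default

    def to_dict(self) -> Dict[str, Any]:
        """Return a plain dict, e.g. for JSON serialization."""
        return asdict(self)
```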

Architecture

DocProcessor consists of several independent components:

ContentExtractor: Extracts text from various file formats
DocumentChunker: Splits text into semantic segments
DocumentSummarizer: Generates LLM-based summaries
MeiliSearchIndexer: Indexes documents to Meilisearch

Each component can be used independently or through the unified DocumentProcessor API.
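
The idea of independent stages behind one entry point can be sketched generically. The code below is not docprocessor's implementation; `run_pipeline` and `PipelineResult` are hypothetical names, and each stage is passed in as a plain callable so any stage can be omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    text: str = ""
    chunks: List[str] = field(default_factory=list)
    summary: Optional[str] = None


def run_pipeline(
    path: str,
    extract: Callable[[str], str],
    chunk: Optional[Callable[[str], List[str]]] = None,
    summarize: Optional[Callable[[str], str]] = None,
) -> PipelineResult:
    """Run extraction, then any optional stages that were supplied."""
    result = PipelineResult(text=extract(path))
    if chunk is not None:
        result.chunks = chunk(result.text)
    if summarize is not None:
        result.summary = summarize(result.text)
    return result
```

Because the stages only share plain strings and lists, each one can be tested or replaced on its own.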

Requirements

Python: 3.10+ (tested on 3.10, 3.11, 3.12)

Core Dependencies:

pdfminer.six - PDF text extraction
pdf2image - PDF to image conversion
pytesseract - OCR engine
opencv-python - Image preprocessing
Pillow - Image handling
python-docx - DOCX extraction
python-pptx - PPTX extraction
langchain-text-splitters - Semantic chunking
tiktoken - Token counting

Optional:

meilisearch - Search engine integration

Examples

See the examples/ directory for more usage examples:

basic_usage.py - Simple document processing
multi_environment.py - Indexing to multiple environments
custom_chunking.py - Advanced chunking options

Development

Using GitHub Codespaces (Recommended)

The easiest way to start developing:

Click the Code button on GitHub
Select Codespaces → Create codespace on main
Wait for the environment to build (includes all dependencies)
Start coding!

The devcontainer automatically installs:

Python 3.11
All system dependencies (Tesseract, Poppler)
Python dependencies in editable mode
Pre-commit hooks
VS Code extensions (Black, isort, flake8, etc.)

Local Development Setup

# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor

# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

# Install Python dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run tests with coverage
pytest --cov=docprocessor

Code Quality

We use automated tools to maintain code quality:

# Format code
black docprocessor tests

# Sort imports
isort docprocessor tests

# Lint
flake8 docprocessor tests

# Type check
mypy docprocessor

# Or run all checks with pre-commit
pre-commit run --all-files

Running Tests

# Run all tests
pytest

# With coverage report
pytest --cov=docprocessor --cov-report=html

# Run specific test file
pytest tests/test_processor.py -v

# Run tests matching pattern
pytest -k "test_extract" -v

Contributing

We love contributions! Please see CONTRIBUTING.md for details on:

Development setup
Code style guidelines
Testing requirements
Pull request process
Issue reporting

Quick tips:

Use the devcontainer for a consistent environment
Write tests for new features
Follow PEP 8 and use pre-commit hooks
Update documentation for API changes
Add entries to CHANGELOG.md

Changelog

See CHANGELOG.md for version history and release notes.

License

MIT License - see LICENSE file for details.

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: info@knowledgeinnovation.eu

Citation

If you use docprocessor in your research or project, please cite:

@software{docprocessor2025,
  title = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year = {2025},
  url = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}

Made with ❤️ by Knowledge Innovation Centre

Project details


Release history

1.1.0 (this version): Oct 30, 2025
1.0.0: Oct 23, 2025

Download files

Source Distribution

docprocessor-1.1.0.tar.gz (36.3 kB)

Uploaded Oct 30, 2025 Source

Built Distribution

docprocessor-1.1.0-py3-none-any.whl (23.5 kB)

Uploaded Oct 30, 2025 Python 3

File details

Details for the file docprocessor-1.1.0.tar.gz.

File metadata

Download URL: docprocessor-1.1.0.tar.gz
Upload date: Oct 30, 2025
Size: 36.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docprocessor-1.1.0.tar.gz
Algorithm	Hash digest
SHA256	`6d85fd38839ca2c5d7486d0a30385dea7ac68807d5cc402d25d402b6661b989a`
MD5	`00296084cc7f74dde6de44bdfb3ec6a3`
BLAKE2b-256	`efb5edf0b167995091cee9b739c738a58b433238ada5bf25f1af6978223379c1`
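
A downloaded file can be checked against the SHA256 digest listed above with Python's standard library. This is a minimal sketch; `sha256_of_file` is a hypothetical helper name, and the file path in the usage note is an assumption about where you saved the download.

```python
import hashlib


def sha256_of_file(path, block_size=65536):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # iter() with a sentinel keeps reading until read() returns b"".
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()
```

For example, `sha256_of_file("docprocessor-1.1.0.tar.gz")` should equal the SHA256 value in the table if the download is intact.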


Provenance

The following attestation bundles were made for docprocessor-1.1.0.tar.gz:

Publisher: release.yml on Knowledge-Innovation-Centre/doc-processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprocessor-1.1.0.tar.gz
- Subject digest: 6d85fd38839ca2c5d7486d0a30385dea7ac68807d5cc402d25d402b6661b989a
- Sigstore transparency entry: 656678508
- Sigstore integration time: Oct 30, 2025
Source repository:
- Permalink: Knowledge-Innovation-Centre/doc-processor@cc84cb55831f2cee1d9f5c8a3095562fe057aced
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/Knowledge-Innovation-Centre
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cc84cb55831f2cee1d9f5c8a3095562fe057aced
- Trigger Event: push

File details

Details for the file docprocessor-1.1.0-py3-none-any.whl.

File metadata

Download URL: docprocessor-1.1.0-py3-none-any.whl
Upload date: Oct 30, 2025
Size: 23.5 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docprocessor-1.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`c2f21a0fda1f6aec9cc1dbe19b892b3e3ea767adde46a04bbd344c7396cc078f`
MD5	`9d7737783691a45a9e8270bf2c9f9b84`
BLAKE2b-256	`e313842ac72a471d0cb065efc24e373792e7811cf966e83688a77770423428c7`


Provenance

The following attestation bundles were made for docprocessor-1.1.0-py3-none-any.whl:

Publisher: release.yml on Knowledge-Innovation-Centre/doc-processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: docprocessor-1.1.0-py3-none-any.whl
- Subject digest: c2f21a0fda1f6aec9cc1dbe19b892b3e3ea767adde46a04bbd344c7396cc078f
- Sigstore transparency entry: 656678516
- Sigstore integration time: Oct 30, 2025
Source repository:
- Permalink: Knowledge-Innovation-Centre/doc-processor@cc84cb55831f2cee1d9f5c8a3095562fe057aced
- Branch / Tag: refs/tags/v1.1.0
- Owner: https://github.com/Knowledge-Innovation-Centre
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: release.yml@cc84cb55831f2cee1d9f5c8a3095562fe057aced
- Trigger Event: push
