Skip to main content

Intelligent document processing with OCR, chunking, and AI summarization

Project description

DocProcessor

Python Version License: MIT Code style: black CI Status

A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.

Table of Contents

Features

  • Multi-format Support: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
  • Intelligent OCR: Layout-aware PDF text extraction with OCR fallback for images
  • Semantic Chunking: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
  • LLM Summarization: Generate concise document summaries (with fallback)
  • Meilisearch Integration: Built-in support for indexing to Meilisearch
  • Flexible API: Use components individually or as a unified pipeline

Installation

From PyPI (Coming Soon)

pip install docprocessor

From GitHub

pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git

For Development

git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"

System Dependencies

For OCR functionality, install system packages:

Ubuntu/Debian:

sudo apt-get install tesseract-ocr poppler-utils

macOS:

brew install tesseract poppler

Quick Start

Basic Usage

from docprocessor import DocumentProcessor

# Initialize processor
processor = DocumentProcessor()

# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False  # Requires LLM client
)

print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")

With LLM Summarization

from docprocessor import DocumentProcessor

# Your LLM client (must have a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}

llm_client = MyLLMClient()

processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500
)

result = processor.process(
    file_path="document.pdf",
    summarize=True
)

print(f"Summary: {result.summary}")

With Meilisearch Indexing

from docprocessor import DocumentProcessor, MeiliSearchIndexer

# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)

# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_"  # Optional environment prefix
)

# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)

# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks"
)

# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10
)

Advanced Usage

Custom Chunking Parameters

processor = DocumentProcessor(
    chunk_size=1024,      # Larger chunks
    chunk_overlap=100,    # More overlap
    min_chunk_size=200    # Higher minimum
)

chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt"
)

Extract Text Only

processor = DocumentProcessor()

extraction = processor.extract_text("document.pdf")

print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")

Multi-Environment Indexing

# Index to multiple environments
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_"
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_"
    }
}

for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"]
    )

    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")

API Reference

DocumentProcessor

Main class for document processing.

Parameters:

  • ocr_enabled (bool): Enable OCR for PDFs/images. Default: True
  • chunk_size (int): Target chunk size in tokens. Default: 512
  • chunk_overlap (int): Overlap between chunks. Default: 50
  • min_chunk_size (int): Minimum chunk size. Default: 100
  • summary_target_words (int): Target summary length. Default: 500
  • llm_client (Optional[Any]): LLM client for summarization
  • llm_temperature (float): LLM temperature. Default: 0.3

Methods:

  • process(): Full pipeline (extract, chunk, summarize)
  • extract_text(): Extract text from document
  • chunk_text(): Chunk text into segments
  • summarize_text(): Generate summary
  • chunks_to_search_documents(): Convert chunks for indexing

MeiliSearchIndexer

Interface for Meilisearch operations.

Parameters:

  • url (str): Meilisearch server URL
  • api_key (str): Meilisearch API key
  • index_prefix (Optional[str]): Prefix for index names

Methods:

  • index_chunks(): Index multiple documents
  • index_document(): Index single document
  • search(): Search an index
  • delete_document(): Delete by ID
  • delete_documents_by_filter(): Delete by filter
  • create_index(): Create new index

DocumentChunk

Data class representing a text chunk.

Attributes:

  • chunk_id (str): Unique identifier
  • file_id (str): Source file identifier
  • output_id (str): Output identifier
  • project_id (int): Project identifier
  • filename (str): Source filename
  • chunk_number (int): Chunk sequence number
  • total_chunks (int): Total chunks in document
  • chunk_text (str): The chunk text content
  • token_count (int): Number of tokens
  • pages (List[int]): Page numbers (for PDFs)
  • metadata (Dict): Additional metadata

Architecture

DocProcessor consists of several independent components:

  1. ContentExtractor: Extracts text from various file formats
  2. DocumentChunker: Splits text into semantic segments
  3. DocumentSummarizer: Generates LLM-based summaries
  4. MeiliSearchIndexer: Indexes documents to Meilisearch

Each component can be used independently or through the unified DocumentProcessor API.

Requirements

Python: 3.10+ (tested on 3.10, 3.11, 3.12)

Core Dependencies:

  • pdfminer.six - PDF text extraction
  • pdf2image - PDF to image conversion
  • pytesseract - OCR engine
  • opencv-python - Image preprocessing
  • Pillow - Image handling
  • python-docx - DOCX extraction
  • python-pptx - PPTX extraction
  • langchain-text-splitters - Semantic chunking
  • tiktoken - Token counting

Optional:

  • meilisearch - Search engine integration

Examples

See the examples/ directory for more usage examples:

  • basic_usage.py - Simple document processing
  • multi_environment.py - Indexing to multiple environments
  • custom_chunking.py - Advanced chunking options

Development

Using GitHub Codespaces (Recommended)

The easiest way to start developing:

  1. Click the Code button on GitHub
  2. Select CodespacesCreate codespace on main
  3. Wait for the environment to build (includes all dependencies)
  4. Start coding!

The devcontainer automatically installs:

  • Python 3.11
  • All system dependencies (Tesseract, Poppler)
  • Python dependencies in editable mode
  • Pre-commit hooks
  • VS Code extensions (Black, isort, flake8, etc.)

Local Development Setup

# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor

# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils

# macOS
brew install tesseract poppler

# Install Python dependencies
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

# Run tests
pytest

# Run tests with coverage
pytest --cov=docprocessor

Code Quality

We use automated tools to maintain code quality:

# Format code
black docprocessor tests

# Sort imports
isort docprocessor tests

# Lint
flake8 docprocessor tests

# Type check
mypy docprocessor

# Or run all checks with pre-commit
pre-commit run --all-files

Running Tests

# Run all tests
pytest

# With coverage report
pytest --cov=docprocessor --cov-report=html

# Run specific test file
pytest tests/test_processor.py -v

# Run tests matching pattern
pytest -k "test_extract" -v

Contributing

We love contributions! Please see CONTRIBUTING.md for details on:

  • Development setup
  • Code style guidelines
  • Testing requirements
  • Pull request process
  • Issue reporting

Quick tips:

  • Use the devcontainer for consistent environment
  • Write tests for new features
  • Follow PEP 8 and use pre-commit hooks
  • Update documentation for API changes
  • Add entries to CHANGELOG.md

Changelog

See CHANGELOG.md for version history and release notes.

License

MIT License - see LICENSE file for details.

Support

Citation

If you use docprocessor in your research or project, please cite:

@software{docprocessor2025,
  title = {docprocessor: Intelligent Document Processing Library},
  author = {Knowledge Innovation Centre},
  year = {2025},
  url = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}

Made with ❤️ by Knowledge Innovation Centre

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

docprocessor-1.1.0.tar.gz (36.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

docprocessor-1.1.0-py3-none-any.whl (23.5 kB view details)

Uploaded Python 3

File details

Details for the file docprocessor-1.1.0.tar.gz.

File metadata

  • Download URL: docprocessor-1.1.0.tar.gz
  • Upload date:
  • Size: 36.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docprocessor-1.1.0.tar.gz
Algorithm Hash digest
SHA256 6d85fd38839ca2c5d7486d0a30385dea7ac68807d5cc402d25d402b6661b989a
MD5 00296084cc7f74dde6de44bdfb3ec6a3
BLAKE2b-256 efb5edf0b167995091cee9b739c738a58b433238ada5bf25f1af6978223379c1

See more details on using hashes here.

Provenance

The following attestation bundles were made for docprocessor-1.1.0.tar.gz:

Publisher: release.yml on Knowledge-Innovation-Centre/doc-processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file docprocessor-1.1.0-py3-none-any.whl.

File metadata

  • Download URL: docprocessor-1.1.0-py3-none-any.whl
  • Upload date:
  • Size: 23.5 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for docprocessor-1.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 c2f21a0fda1f6aec9cc1dbe19b892b3e3ea767adde46a04bbd344c7396cc078f
MD5 9d7737783691a45a9e8270bf2c9f9b84
BLAKE2b-256 e313842ac72a471d0cb065efc24e373792e7811cf966e83688a77770423428c7

See more details on using hashes here.

Provenance

The following attestation bundles were made for docprocessor-1.1.0-py3-none-any.whl:

Publisher: release.yml on Knowledge-Innovation-Centre/doc-processor

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page