Intelligent document processing with OCR, chunking, and AI summarization
DocProcessor
A Python library for processing documents with OCR, semantic chunking, and LLM-based summarization. Designed for building semantic search systems and document analysis workflows.
Features
- Multi-format Support: PDF, DOCX, PPTX, TXT, MD, and images (PNG, JPG, GIF, BMP)
- Intelligent OCR: Layout-aware PDF text extraction with OCR fallback for images
- Semantic Chunking: Smart text segmentation using LangChain's RecursiveCharacterTextSplitter
- LLM Summarization: Generate concise document summaries (with fallback)
- Meilisearch Integration: Built-in support for indexing to Meilisearch
- Flexible API: Use components individually or as a unified pipeline
Installation
From PyPI
pip install docprocessor
From GitHub
pip install git+https://github.com/Knowledge-Innovation-Centre/doc-processor.git
For Development
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
pip install -e ".[dev]"
System Dependencies
For OCR functionality, install system packages:
Ubuntu/Debian:
sudo apt-get install tesseract-ocr poppler-utils
macOS:
brew install tesseract poppler
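Before running OCR, it can help to confirm that the system tools are actually on your PATH. The check below is a minimal sketch, not part of docprocessor: the function name `find_missing_tools` is hypothetical, and it assumes the OCR path needs the `tesseract` executable (called by pytesseract) and `pdftoppm` from Poppler (called by pdf2image).

```python
import shutil

# Executables the OCR path is assumed to need:
# `tesseract` ships with Tesseract, `pdftoppm` ships with Poppler.
REQUIRED_TOOLS = ("tesseract", "pdftoppm")


def find_missing_tools(tools=REQUIRED_TOOLS):
    """Return the names of the given tools that are not found on PATH."""
    return [name for name in tools if shutil.which(name) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing system tools: " + ", ".join(missing))
    else:
        print("All OCR system tools found.")
```

If anything is reported missing, install it with the package manager commands above.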
Quick Start
Basic Usage
from docprocessor import DocumentProcessor
# Initialize processor
processor = DocumentProcessor()
# Process a document
result = processor.process(
    file_path="document.pdf",
    extract_text=True,
    chunk=True,
    summarize=False,  # Summarization requires an LLM client
)
print(f"Extracted {len(result.text)} characters")
print(f"Created {len(result.chunks)} chunks")
With LLM Summarization
from docprocessor import DocumentProcessor
# Your LLM client (must provide a complete_chat method)
class MyLLMClient:
    def complete_chat(self, messages, temperature):
        # Call your LLM API (OpenAI, Anthropic, Mistral, etc.)
        return {"content": "Generated summary here"}
llm_client = MyLLMClient()
processor = DocumentProcessor(
    llm_client=llm_client,
    summary_target_words=500,
)
result = processor.process(
    file_path="document.pdf",
    summarize=True,
)
print(f"Summary: {result.summary}")
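For tests or offline runs you may want a client that satisfies the `complete_chat(messages, temperature)` interface without calling a real API. The class below is a rough sketch under stated assumptions: the name `TruncatingLLMClient` is hypothetical, and it assumes `messages` is an OpenAI-style list of `{"role": ..., "content": ...}` dictionaries, which this README does not confirm.

```python
class TruncatingLLMClient:
    """Offline stand-in: 'summarizes' by keeping the first few words."""

    def __init__(self, max_words=50):
        self.max_words = max_words

    def complete_chat(self, messages, temperature):
        # Assumption: messages is a list of {"role": ..., "content": ...} dicts.
        # The temperature argument is accepted but ignored by this stub.
        text = messages[-1]["content"] if messages else ""
        words = text.split()
        # Return the same {"content": ...} shape used in the example above.
        return {"content": " ".join(words[: self.max_words])}
```

An instance could then be passed as `llm_client` in place of a real client, which keeps test runs deterministic.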
With Meilisearch Indexing
from docprocessor import DocumentProcessor, MeiliSearchIndexer
# Process document
processor = DocumentProcessor()
result = processor.process("document.pdf", chunk=True)
# Index to Meilisearch
indexer = MeiliSearchIndexer(
    url="http://localhost:7700",
    api_key="your_master_key",
    index_prefix="dev_",  # Optional environment prefix
)
# Convert chunks to search documents
search_docs = processor.chunks_to_search_documents(result.chunks)
# Index chunks
indexer.index_chunks(
    chunks=search_docs,
    index_name="document_chunks",
)
# Search
results = indexer.search(
    query="artificial intelligence",
    index_name="document_chunks",
    limit=10,
)
Advanced Usage
Custom Chunking Parameters
processor = DocumentProcessor(
    chunk_size=1024,     # Larger chunks
    chunk_overlap=100,   # More overlap
    min_chunk_size=200,  # Higher minimum
)
chunks = processor.chunk_text(
    text="Your long document text here...",
    filename="document.txt",
)
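To build intuition for how `chunk_size`, `chunk_overlap`, and `min_chunk_size` interact, here is a simplified sliding-window sketch. It is not the library's implementation: docprocessor uses LangChain's RecursiveCharacterTextSplitter, whereas this hypothetical `chunk_words` function counts whitespace-separated words as a stand-in for tokens and folds an undersized final chunk into the previous one.

```python
def chunk_words(text, chunk_size=512, chunk_overlap=50, min_chunk_size=100):
    """Split text into overlapping word windows (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    start = 0
    while start < len(words):
        end = min(start + chunk_size, len(words))
        chunks.append(words[start:end])
        if end == len(words):
            break  # the window reached the end of the text
        start += step

    # Fold an undersized tail into the previous chunk, skipping the
    # overlapping words that the previous chunk already contains.
    if len(chunks) > 1 and len(chunks[-1]) < min_chunk_size:
        tail = chunks.pop()
        chunks[-1].extend(tail[chunk_overlap:])

    return [" ".join(chunk) for chunk in chunks]
```

With `chunk_size=10` and `chunk_overlap=2`, each window starts 8 words after the previous one, so the last 2 words of one chunk reappear as the first 2 words of the next.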
Extract Text Only
processor = DocumentProcessor()
extraction = processor.extract_text("document.pdf")
print(f"Text: {extraction['text']}")
print(f"Pages: {extraction['page_count']}")
print(f"Format: {extraction['metadata']['format']}")
Multi-Environment Indexing
from docprocessor import MeiliSearchIndexer
# Index to multiple environments
# (search_docs comes from processor.chunks_to_search_documents(), as shown above)
environments = {
    "dev": {
        "url": "http://localhost:7700",
        "api_key": "dev_key",
        "prefix": "dev_",
    },
    "prod": {
        "url": "https://search.production.com",
        "api_key": "prod_key",
        "prefix": "prod_",
    },
}
for env_name, config in environments.items():
    indexer = MeiliSearchIndexer(
        url=config["url"],
        api_key=config["api_key"],
        index_prefix=config["prefix"],
    )
    indexer.index_chunks(search_docs, "document_chunks")
    print(f"Indexed to {env_name}")
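Hard-coded API keys are fine for a demo, but in practice you would probably read them from the environment. The helper below is one possible sketch; the function name `load_environment` and the `MEILI_<ENV>_URL`, `MEILI_<ENV>_API_KEY`, and `MEILI_<ENV>_PREFIX` variable names are assumptions made for illustration, not conventions defined by docprocessor.

```python
import os


def load_environment(name, env=os.environ):
    """Build one entry of the environments dict from environment variables."""
    key = name.upper()
    return {
        # Hypothetical variable names; raises KeyError if a required one is unset.
        "url": env[f"MEILI_{key}_URL"],
        "api_key": env[f"MEILI_{key}_API_KEY"],
        # The prefix falls back to "<name>_" when no variable is set.
        "prefix": env.get(f"MEILI_{key}_PREFIX", f"{name}_"),
    }
```

The resulting dictionaries have the same `url`, `api_key`, and `prefix` keys as the `environments` mapping above, so the indexing loop can stay unchanged.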
API Reference
DocumentProcessor
Main class for document processing.
Parameters:
- ocr_enabled (bool): Enable OCR for PDFs/images. Default: True
- chunk_size (int): Target chunk size in tokens. Default: 512
- chunk_overlap (int): Overlap between chunks. Default: 50
- min_chunk_size (int): Minimum chunk size. Default: 100
- summary_target_words (int): Target summary length. Default: 500
- llm_client (Optional[Any]): LLM client for summarization
- llm_temperature (float): LLM temperature. Default: 0.3
Methods:
- process(): Full pipeline (extract, chunk, summarize)
- extract_text(): Extract text from document
- chunk_text(): Chunk text into segments
- summarize_text(): Generate summary
- chunks_to_search_documents(): Convert chunks for indexing
MeiliSearchIndexer
Interface for Meilisearch operations.
Parameters:
- url (str): Meilisearch server URL
- api_key (str): Meilisearch API key
- index_prefix (Optional[str]): Prefix for index names
Methods:
- index_chunks(): Index multiple documents
- index_document(): Index single document
- search(): Search an index
- delete_document(): Delete by ID
- delete_documents_by_filter(): Delete by filter
- create_index(): Create new index
DocumentChunk
Data class representing a text chunk.
Attributes:
- chunk_id (str): Unique identifier
- file_id (str): Source file identifier
- output_id (str): Output identifier
- project_id (int): Project identifier
- filename (str): Source filename
- chunk_number (int): Chunk sequence number
- total_chunks (int): Total chunks in document
- chunk_text (str): The chunk text content
- token_count (int): Number of tokens
- pages (List[int]): Page numbers (for PDFs)
- metadata (Dict): Additional metadata
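The attributes listed for DocumentChunk can be pictured as a plain dataclass. The sketch below mirrors the documented field names and types only; the class name `ChunkSketch`, the helper `chunk_to_dict`, and the empty defaults for `pages` and `metadata` are assumptions, and the real class may differ.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, List


@dataclass
class ChunkSketch:
    """Illustrative mirror of the documented DocumentChunk attributes."""

    chunk_id: str
    file_id: str
    output_id: str
    project_id: int
    filename: str
    chunk_number: int
    total_chunks: int
    chunk_text: str
    token_count: int
    pages: List[int] = field(default_factory=list)          # assumed default
    metadata: Dict[str, Any] = field(default_factory=dict)  # assumed default


def chunk_to_dict(chunk):
    """Flatten a chunk into a plain dict, e.g. for JSON or a search index."""
    return asdict(chunk)
```

A flat dictionary like this is the general shape a search engine expects for a document, which is presumably similar to what `chunks_to_search_documents()` produces.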
Architecture
DocProcessor consists of several independent components:
- ContentExtractor: Extracts text from various file formats
- DocumentChunker: Splits text into semantic segments
- DocumentSummarizer: Generates LLM-based summaries
- MeiliSearchIndexer: Indexes documents to Meilisearch
Each component can be used independently or through the unified DocumentProcessor API.
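The "independent components behind one entry point" idea can be sketched with plain callables. Nothing here is docprocessor code: `PipelineResult` and `run_pipeline` are hypothetical names, and the extract, chunk, and summarize steps are stand-ins you supply yourself.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class PipelineResult:
    text: str
    chunks: List[str] = field(default_factory=list)
    summary: Optional[str] = None


def run_pipeline(
    path: str,
    extract: Callable[[str], str],
    chunk: Optional[Callable[[str], List[str]]] = None,
    summarize: Optional[Callable[[str], str]] = None,
) -> PipelineResult:
    """Run extraction, then whichever optional stages were provided."""
    text = extract(path)
    result = PipelineResult(text=text)
    if chunk is not None:
        result.chunks = chunk(text)
    if summarize is not None:
        result.summary = summarize(text)
    return result
```

Because each stage is just a function, any one of them can be swapped out or called on its own, which is the same trade-off the component list above describes.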
Requirements
Python: 3.10+ (tested on 3.10, 3.11, 3.12)
Core Dependencies:
- pdfminer.six - PDF text extraction
- pdf2image - PDF to image conversion
- pytesseract - OCR engine
- opencv-python - Image preprocessing
- Pillow - Image handling
- python-docx - DOCX extraction
- python-pptx - PPTX extraction
- langchain-text-splitters - Semantic chunking
- tiktoken - Token counting
Optional:
- meilisearch - Search engine integration
Examples
See the examples/ directory for more usage examples:
- basic_usage.py: Simple document processing
- multi_environment.py: Indexing to multiple environments
- custom_chunking.py: Advanced chunking options
Development
Using GitHub Codespaces (Recommended)
The easiest way to start developing:
- Click the Code button on GitHub
- Select Codespaces → Create codespace on main
- Wait for the environment to build (includes all dependencies)
- Start coding!
The devcontainer automatically installs:
- Python 3.11
- All system dependencies (Tesseract, Poppler)
- Python dependencies in editable mode
- Pre-commit hooks
- VS Code extensions (Black, isort, flake8, etc.)
Local Development Setup
# Clone the repository
git clone https://github.com/Knowledge-Innovation-Centre/doc-processor.git
cd doc-processor
# Install system dependencies
# Ubuntu/Debian
sudo apt-get install tesseract-ocr poppler-utils
# macOS
brew install tesseract poppler
# Install Python dependencies
pip install -e ".[dev]"
# Install pre-commit hooks
pre-commit install
# Run tests
pytest
# Run tests with coverage
pytest --cov=docprocessor
Code Quality
We use automated tools to maintain code quality:
# Format code
black docprocessor tests
# Sort imports
isort docprocessor tests
# Lint
flake8 docprocessor tests
# Type check
mypy docprocessor
# Or run all checks with pre-commit
pre-commit run --all-files
Running Tests
# Run all tests
pytest
# With coverage report
pytest --cov=docprocessor --cov-report=html
# Run specific test file
pytest tests/test_processor.py -v
# Run tests matching pattern
pytest -k "test_extract" -v
Contributing
We love contributions! Please see CONTRIBUTING.md for details on:
- Development setup
- Code style guidelines
- Testing requirements
- Pull request process
- Issue reporting
Quick tips:
- Use the devcontainer for consistent environment
- Write tests for new features
- Follow PEP 8 and use pre-commit hooks
- Update documentation for API changes
- Add entries to CHANGELOG.md
Changelog
See CHANGELOG.md for version history and release notes.
License
MIT License - see LICENSE file for details.
Support
- Issues: GitHub Issues
- Discussions: GitHub Discussions
- Email: info@knowledgeinnovation.eu
Citation
If you use docprocessor in your research or project, please cite:
@software{docprocessor2025,
title = {docprocessor: Intelligent Document Processing Library},
author = {Knowledge Innovation Centre},
year = {2025},
url = {https://github.com/Knowledge-Innovation-Centre/doc-processor}
}
Made with ❤️ by Knowledge Innovation Centre