# Contextify

Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking.
Contextify is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.
## Features

- **Multi-format Support**: Process a wide variety of document formats, including:
  - PDF (with table detection, OCR fallback, and complex layout handling)
  - Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
  - Korean documents: HWP, HWPX (Hangul Word Processor)
  - Text formats: TXT, MD, RTF, CSV, HTML
  - Code files: Python, JavaScript, TypeScript, and 20+ other languages
- **Intelligent Text Extraction**:
  - Preserves document structure (headings, paragraphs, lists)
  - Extracts tables as HTML with proper `rowspan`/`colspan` handling
  - Handles merged cells and complex table layouts
  - Extracts and processes inline images
- **OCR Integration**:
  - Pluggable OCR engine architecture
  - Supports OpenAI, Anthropic, Google Gemini, and vLLM backends
  - Automatic OCR fallback for scanned documents and image-based PDFs
- **Smart Chunking**:
  - Semantic text chunking with configurable size and overlap
  - Table-aware chunking that preserves table integrity
  - Protected regions for code blocks and other special content
- **Metadata Extraction**:
  - Extracts document metadata (title, author, creation date, etc.)
  - Formats metadata in a structured, parseable format
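The size/overlap chunking described above can be illustrated in a few lines of plain Python. This is only a sketch of what `chunk_size` and `chunk_overlap` mean, not Contextify's semantic chunker (which additionally respects sentence boundaries, tables, and protected regions):

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split `text` into windows of `chunk_size` characters, where each
    window starts `chunk_size - chunk_overlap` characters after the last."""
    step = chunk_size - chunk_overlap
    if step <= 0:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2)
print(chunks)  # -> ['abcd', 'cdef', 'efgh', 'ghij', 'ij']
```

The overlap means each chunk repeats the tail of the previous one, so context that straddles a boundary is still visible to a model in at least one chunk.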
## Installation

```bash
pip install contextify
```

Or using uv:

```bash
uv add contextify
```
## Quick Start

### Basic Usage

```python
from libs.core.document_processor import DocumentProcessor

# Create a processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Chunk the extracted text
chunks = processor.chunk_text(text, chunk_size=1000, chunk_overlap=200)
for chunk in chunks:
    print(chunk)
```
### With OCR Processing

```python
from libs.core.document_processor import DocumentProcessor
from libs.ocr.ocr_engine import OpenAIOCR

# Initialize the OCR engine
ocr_engine = OpenAIOCR(api_key="sk-...", model="gpt-4o")

# Create a processor with OCR support
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True,
)
```
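The automatic OCR fallback mentioned in Features hinges on a simple decision: a page whose embedded text layer yields almost nothing is probably a scanned image and should be routed to OCR. A minimal sketch of that heuristic — the function name and threshold are illustrative assumptions, not Contextify's actual implementation:

```python
def needs_ocr(page_text: str, min_chars: int = 25) -> bool:
    """Heuristic: treat a page as scanned (OCR candidate) when its
    extractable text layer yields fewer than `min_chars` characters."""
    return len(page_text.strip()) < min_chars

print(needs_ocr("   \n  "))  # empty text layer -> True
print(needs_ocr("A full paragraph of text extracted from the PDF text layer."))  # -> False
```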
## Supported Formats
| Category | Extensions |
|---|---|
| Documents | .pdf, .docx, .doc, .pptx, .ppt, .hwp, .hwpx |
| Spreadsheets | .xlsx, .xls, .csv, .tsv |
| Text | .txt, .md, .rtf |
| Web | .html, .htm, .xml |
| Code | .py, .js, .ts, .java, .cpp, .c, .go, .rs, and more |
| Config | .json, .yaml, .yml, .toml, .ini, .env |
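Internally, a format table like this typically drives dispatch by file extension. The following sketch mirrors the handler names from the Architecture section below, but the dispatch table itself is an assumption for illustration, not Contextify's code:

```python
from pathlib import Path

# Hypothetical extension-to-handler routing table.
HANDLERS = {
    ".pdf": "pdf_handler",
    ".docx": "docx_handler",
    ".pptx": "ppt_handler",
    ".xlsx": "excel_handler",
    ".hwp": "hwp_processor",
    ".hwpx": "hwpx_processor",
}

def pick_handler(path: str) -> str:
    """Map a file path to the name of the handler for its extension."""
    ext = Path(path).suffix.lower()
    if ext not in HANDLERS:
        raise ValueError(f"Unsupported format: {ext}")
    return HANDLERS[ext]

print(pick_handler("report.PDF"))  # -> pdf_handler
```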
## Architecture

```
libs/
├── core/
│   ├── document_processor.py   # Main entry point
│   ├── processor/              # Format-specific handlers
│   │   ├── pdf_handler.py      # PDF processing with V4 engine
│   │   ├── docx_handler.py     # DOCX processing
│   │   ├── ppt_handler.py      # PowerPoint processing
│   │   ├── excel_handler.py    # Excel processing
│   │   ├── hwp_processor.py    # HWP 5.0 OLE processing
│   │   ├── hwpx_processor.py   # HWPX (ZIP/XML) processing
│   │   └── ...
│   └── functions/
│       └── img_processor.py    # Image handling utilities
├── chunking/
│   ├── chunking.py             # Main chunking interface
│   ├── text_chunker.py         # Text-based chunking
│   ├── table_chunker.py        # Table-aware chunking
│   └── page_chunker.py         # Page-based chunking
└── ocr/
    ├── base.py                 # OCR base class
    ├── ocr_processor.py        # OCR processing utilities
    └── ocr_engine/             # OCR engine implementations
        ├── openai_ocr.py
        ├── anthropic_ocr.py
        ├── gemini_ocr.py
        └── vllm_ocr.py
```
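The `ocr/` layout reflects the pluggable engine design: a base class defines a common interface, and each backend module implements it. A minimal sketch of that pattern — the class and method names here are assumptions for illustration, not Contextify's actual interface:

```python
from abc import ABC, abstractmethod

class BaseOCR(ABC):
    """Common interface that every OCR backend implements (illustrative)."""

    @abstractmethod
    def recognize(self, image_bytes: bytes) -> str:
        """Return the text recognized in the given image."""

class DummyOCR(BaseOCR):
    # Stand-in backend; real backends (openai_ocr.py, gemini_ocr.py, ...)
    # would send the image to their respective vision APIs here.
    def recognize(self, image_bytes: bytes) -> str:
        return f"<{len(image_bytes)} bytes of image>"

engine = DummyOCR()
print(engine.recognize(b"\x89PNG"))  # -> <4 bytes of image>
```

Because callers depend only on `BaseOCR`, swapping OpenAI for Gemini or a local vLLM server is a one-line change at construction time.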
## Requirements

- Python 3.14+
- Required dependencies are installed automatically (see `pyproject.toml`)
### System Dependencies

For full functionality, you may need:

- **Tesseract OCR**: for local OCR fallback
- **LibreOffice**: for DOC/RTF conversion (optional)
- **Poppler**: for PDF image extraction
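You can check up front whether these tools are on your `PATH`. This convenience sketch is not part of Contextify; the binary names (`soffice` for LibreOffice, `pdftoppm` for Poppler) are the usual CLI entry points those packages install:

```python
import shutil

def check_system_deps() -> dict[str, bool]:
    """Report which optional system tools are available on PATH."""
    tools = ("tesseract", "soffice", "pdftoppm")
    return {name: shutil.which(name) is not None for name in tools}

for tool, found in check_system_deps().items():
    print(f"{tool}: {'found' if found else 'missing'}")
```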
## Configuration

```python
# Custom configuration
config = {
    "pdf": {
        "extract_images": True,
        "ocr_fallback": True,
    },
    "chunking": {
        "default_size": 1000,
        "default_overlap": 200,
    },
}

processor = DocumentProcessor(config=config)
```
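A nested config dict like this is typically overlaid on library defaults, with user-supplied keys winning and unspecified keys keeping their defaults. A self-contained sketch of such a merge — an assumption about the behavior, not Contextify's code, and the default values shown are made up:

```python
def deep_merge(defaults: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on `defaults` without mutating either."""
    merged = dict(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged

defaults = {
    "pdf": {"extract_images": False, "ocr_fallback": True},
    "chunking": {"default_size": 500, "default_overlap": 50},
}
user = {"pdf": {"extract_images": True}, "chunking": {"default_size": 1000}}
print(deep_merge(defaults, user))
```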
## License

MIT License - see LICENSE for details.

## Contributing

Contributions are welcome! Please feel free to submit a Pull Request.