Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking
Contextifier v2
Contextifier is a Python document processing library that converts documents of various formats into structured, AI-ready text. It applies a uniform 5-stage pipeline to every document format, ensuring consistent and predictable output.
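To make the idea of a uniform 5-stage pipeline concrete, here is a minimal, self-contained sketch. It is not Contextifier's implementation: `PipelineState`, the stage functions, and `run_pipeline` are hypothetical names chosen for illustration, and the "document" is plain UTF-8 text so the example runs without any dependencies. The stage order mirrors the stages listed under Architecture (convert, preprocess, extract metadata, extract content, postprocess).

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class PipelineState:
    """Data handed from one stage to the next (hypothetical container)."""
    raw: bytes
    document: Any = None
    metadata: dict = field(default_factory=dict)
    text: str = ""


def convert(state: PipelineState) -> PipelineState:
    # Stage 1: binary -> format object (here: simply decoded text)
    state.document = state.raw.decode("utf-8")
    return state


def preprocess(state: PipelineState) -> PipelineState:
    # Stage 2: normalize line endings before extraction
    state.document = state.document.replace("\r\n", "\n")
    return state


def extract_metadata(state: PipelineState) -> PipelineState:
    # Stage 3: collect metadata about the document
    state.metadata["line_count"] = len(state.document.splitlines())
    return state


def extract_content(state: PipelineState) -> PipelineState:
    # Stage 4: pull out the text content
    state.text = state.document
    return state


def postprocess(state: PipelineState) -> PipelineState:
    # Stage 5: final cleanup (strip trailing whitespace on every line)
    lines = (line.rstrip() for line in state.text.splitlines())
    state.text = "\n".join(lines).strip()
    return state


# Every format runs through the same ordered stages.
STAGES = (convert, preprocess, extract_metadata, extract_content, postprocess)


def run_pipeline(raw: bytes) -> PipelineState:
    state = PipelineState(raw=raw)
    for stage in STAGES:
        state = stage(state)
    return state


result = run_pipeline(b"Title  \r\nBody line \r\n")
print(result.text)
print(result.metadata)
```

Because every format goes through the same ordered stages, format-specific logic only has to decide what each stage does, not when it runs.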
Key Features
- Broad Format Support: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, HWP, HWPX, RTF, CSV, TSV, TXT, MD, HTML, images, code files, and 80+ extensions
- Intelligent Text Extraction: Preserves document structure (headings, tables, image positions) with automatic metadata extraction
- Table Processing: Converts tables to HTML/Markdown/Text with `rowspan`/`colspan` support for merged cells
- OCR Integration: 5 Vision LLM engines (OpenAI, Anthropic, Google Gemini, AWS Bedrock, vLLM)
- Smart Chunking: 4 strategies with automatic selection — table-aware, page-boundary, protected-region, and recursive splitting
- Immutable Config System: Frozen dataclass-based `ProcessingConfig` controls all behavior
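As an illustration of what HTML output with `rowspan`/`colspan` for merged cells can look like, the sketch below renders a small table model to HTML. The `Cell` class and `render_html_table` function are hypothetical and written only for this example; they are not part of Contextifier's API.

```python
from dataclasses import dataclass
from html import escape


@dataclass(frozen=True)
class Cell:
    """One table cell; spans greater than 1 mark a merged cell (hypothetical model)."""
    text: str
    rowspan: int = 1
    colspan: int = 1


def render_html_table(rows: list[list[Cell]]) -> str:
    """Render rows of cells as an HTML table, emitting span attributes only when needed."""
    lines = ["<table>"]
    for row in rows:
        cells = []
        for cell in row:
            attrs = ""
            if cell.rowspan > 1:
                attrs += f' rowspan="{cell.rowspan}"'
            if cell.colspan > 1:
                attrs += f' colspan="{cell.colspan}"'
            # Escape cell text so characters like & and < stay valid HTML
            cells.append(f"<td{attrs}>{escape(cell.text)}</td>")
        lines.append("  <tr>" + "".join(cells) + "</tr>")
    lines.append("</table>")
    return "\n".join(lines)


rows = [
    [Cell("Region", rowspan=2), Cell("Sales", colspan=2)],
    [Cell("Q1"), Cell("Q2")],
]
print(render_html_table(rows))
```

Keeping the span attributes in the output lets a downstream consumer reconstruct the merged layout, which a flat Markdown or plain-text rendering cannot express directly.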
Installation
pip install contextifier
or
uv add contextifier
Quick Start
1. Basic Text Extraction
from contextifier_new import DocumentProcessor
processor = DocumentProcessor()
text = processor.extract_text("document.pdf")
print(text)
2. Extract + Chunk in One Step
from contextifier_new import DocumentProcessor
processor = DocumentProcessor()
result = processor.extract_chunks("document.pdf")
for i, chunk in enumerate(result.chunks, 1):
    print(f"Chunk {i}: {chunk[:100]}...")
# Save as Markdown files
result.save_to_md("output/chunks")
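For readers curious about how recursive splitting, the fallback chunking strategy, works in general, here is a simplified, self-contained sketch. It is an assumption-laden illustration rather than Contextifier's algorithm: `recursive_split` and its separator list are hypothetical, and the sketch ignores overlap, tables, pages, and protected regions.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", " "),
) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Coarse separators (paragraphs) are tried first; a piece that is still
    too long is split again with the next, finer separator.
    """
    if len(text) <= chunk_size:
        return [text] if text else []

    for i, sep in enumerate(separators):
        if sep not in text:
            continue
        chunks: list[str] = []
        current = ""
        for part in text.split(sep):
            candidate = f"{current}{sep}{part}" if current else part
            if len(candidate) <= chunk_size:
                current = candidate  # keep packing parts into the current chunk
                continue
            if current:
                chunks.append(current)
            if len(part) > chunk_size:
                # The part alone is too long: recurse with finer separators
                chunks.extend(recursive_split(part, chunk_size, separators[i + 1:]))
                current = ""
            else:
                current = part
        if current:
            chunks.append(current)
        return chunks

    # No separator applies: fall back to fixed-width cuts
    return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]


sample = "alpha beta\n\ngamma delta epsilon"
for chunk in recursive_split(sample, chunk_size=12):
    print(repr(chunk))
```

The point of trying coarse separators first is that chunks break at paragraph or line boundaries whenever possible, and only fall back to word or character cuts when a single unit exceeds the size limit.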
3. Custom Configuration
from contextifier_new import DocumentProcessor
from contextifier_new.config import ProcessingConfig, ChunkingConfig, TagConfig
config = ProcessingConfig(
    tags=TagConfig(page_prefix="<page>", page_suffix="</page>"),
    chunking=ChunkingConfig(chunk_size=2000, chunk_overlap=300),
)
processor = DocumentProcessor(config=config)
text = processor.extract_text("report.xlsx")
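The practical consequence of a frozen dataclass-based config is that a config object cannot be changed after creation; variations are derived as new objects. The sketch below shows that behavior with the standard library only. `ChunkSettings` and `Settings` are hypothetical stand-ins, not Contextifier's classes, and the default values are arbitrary placeholders rather than the library's defaults.

```python
from dataclasses import FrozenInstanceError, dataclass, field, replace


@dataclass(frozen=True)
class ChunkSettings:
    # Placeholder defaults for illustration only
    chunk_size: int = 1000
    chunk_overlap: int = 200


@dataclass(frozen=True)
class Settings:
    chunking: ChunkSettings = field(default_factory=ChunkSettings)


base = Settings()

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    base.chunking.chunk_size = 2000
except FrozenInstanceError:
    mutated = False
else:
    mutated = True

# Derive a modified copy instead; the original stays untouched.
custom = replace(
    base,
    chunking=replace(base.chunking, chunk_size=2000, chunk_overlap=300),
)

print(mutated)
print(base.chunking)
print(custom.chunking)
```

Immutability of this kind makes it safe to share one config object across handlers and threads, since no component can alter it for the others.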
4. OCR Integration
from contextifier_new import DocumentProcessor
from contextifier_new.ocr.engines import OpenAIOCREngine
ocr = OpenAIOCREngine.from_api_key("sk-...", model="gpt-4o")
processor = DocumentProcessor(ocr_engine=ocr)
text = processor.extract_text("scanned.pdf", ocr_processing=True)
Supported Formats
| Category | Extensions | Notes |
|---|---|---|
| Documents | .pdf, .docx, .doc, .hwp, .hwpx, .rtf | HWP 5.0+, HWPX supported |
| Presentations | .pptx, .ppt | Slides, notes, and charts extracted |
| Spreadsheets | .xlsx, .xls, .csv, .tsv | Multi-sheet, formulas, charts |
| Text | .txt, .md, .log, .rst | Auto encoding detection |
| Web | .html, .htm, .xhtml | Table/structure preservation |
| Code | .py, .js, .ts, .java, .cpp, .go, .rs, etc. (20+) | Language-aware highlighting |
| Config | .json, .yaml, .toml, .ini, .xml, .env | Structure preservation |
| Images | .jpg, .png, .gif, .bmp, .webp, .tiff | Requires OCR engine |
Architecture
contextifier_new/
├── document_processor.py         # Facade: single public entry point
├── config.py                     # Immutable config system (ProcessingConfig)
├── types.py                      # Shared types / Enums / TypedDicts
├── errors.py                     # Unified exception hierarchy
│
├── handlers/                     # 14 format-specific handlers
│   ├── base.py                   # BaseHandler: enforces 5-stage pipeline
│   ├── registry.py               # HandlerRegistry: extension → handler mapping
│   ├── pdf/                      # PDF (default)
│   ├── pdf_plus/                 # PDF (advanced: table detection, complex layouts)
│   ├── docx/ doc/ pptx/ ppt/     # Office documents
│   ├── xlsx/ xls/ csv/           # Spreadsheets / data
│   ├── hwp/ hwpx/                # Korean word processor
│   ├── rtf/ text/                # RTF / text / code / config
│   └── image/                    # Image (OCR integration)
│
├── pipeline/                     # 5-Stage pipeline ABCs
│   ├── converter.py              # Stage 1: Binary → Format Object
│   ├── preprocessor.py           # Stage 2: Preprocessing
│   ├── metadata_extractor.py     # Stage 3: Metadata extraction
│   ├── content_extractor.py      # Stage 4: Text / table / image / chart extraction
│   └── postprocessor.py          # Stage 5: Final assembly & cleanup
│
├── services/                     # Shared services (DI)
│   ├── tag_service.py            # Page / slide / sheet tag generation
│   ├── image_service.py          # Image saving / tagging / deduplication
│   ├── chart_service.py          # Chart data formatting
│   ├── table_service.py          # Table HTML / MD rendering
│   ├── metadata_service.py       # Metadata formatting
│   └── storage/                  # Storage backends (Local, MinIO, S3, ...)
│
├── chunking/                     # Chunking subsystem
│   ├── chunker.py                # TextChunker: auto strategy selection
│   ├── constants.py              # Protected region patterns
│   └── strategies/               # 4 chunking strategies
│       ├── plain_strategy.py     # Recursive splitting (default fallback)
│       ├── table_strategy.py     # Sheet / table-based splitting
│       ├── page_strategy.py      # Page-boundary splitting
│       └── protected_strategy.py # Protected region preservation
│
└── ocr/                          # OCR subsystem (optional)
    ├── base.py                   # BaseOCREngine ABC
    ├── processor.py              # OCRProcessor: tag detection + engine call
    └── engines/                  # 5 engine implementations
        ├── openai_engine.py
        ├── anthropic_engine.py
        ├── gemini_engine.py
        ├── bedrock_engine.py
        └── vllm_engine.py
Requirements
- Python 3.12+
- Required dependencies are included in `pyproject.toml`
- Optional: LibreOffice (DOC/PPT/RTF conversion), Poppler (PDF image extraction)
Documentation
| Document | Contents |
|---|---|
| QUICKSTART.md | Detailed usage guide & full API reference |
| Process Logic.md | Handler processing flow diagrams |
| ARCHITECTURE.md | Internal architecture specification |
| CHANGELOG.md | Version history |
| CONTRIBUTING.md | Contribution guidelines |
License
Apache License 2.0 — see LICENSE
Contributing
Contributions are welcome! See CONTRIBUTING.md.