Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking

Maintainers

CocoRoF

Project description

Contextifier

Contextifier is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.

Features

Multi-format Support: Processes a wide range of document formats, including
- PDF (with table detection, OCR fallback, and complex layout handling)
- Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
- Korean documents: HWP, HWPX (Hangul Word Processor)
- Text formats: TXT, MD, RTF, CSV, HTML
- Code files: Python, JavaScript, TypeScript, and 20+ languages
Intelligent Text Extraction:
- Preserves document structure (headings, paragraphs, lists)
- Extracts tables as HTML with proper rowspan/colspan handling
- Handles merged cells and complex table layouts
- Extracts and processes inline images
OCR Integration:
- Pluggable OCR engine architecture
- Supports OpenAI, Anthropic, Google Gemini, and vLLM backends
- Automatic OCR fallback for scanned documents or image-based PDFs
Smart Chunking:
- Semantic text chunking with configurable size and overlap
- Table-aware chunking that preserves table integrity
- Protected regions for code blocks and special content
Metadata Extraction:
- Extracts document metadata (title, author, creation date, etc.)
- Formats metadata in a structured, parseable format
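
To make the "configurable size and overlap" parameters concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not Contextifier's chunker, which is described above as semantic and table-aware; the function name `chunk_text` is hypothetical and the windows are measured in plain characters.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list:
    """Split text into character windows; each window repeats the
    last `chunk_overlap` characters of the previous one."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# With chunk_size=4 and chunk_overlap=2, neighbouring chunks share two characters.
print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=2))
```

A larger overlap keeps more shared context between neighbouring chunks at the cost of producing more chunks for the same text.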

Installation

pip install contextifier

Or using uv:

uv add contextifier
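
After installing, a quick check along these lines can confirm that the package is visible to the interpreter and that the interpreter meets the Python 3.12+ requirement. This is a sketch using only the standard library; the helper names `installed_version` and `python_is_supported` are hypothetical.

```python
import sys
from importlib.metadata import PackageNotFoundError, version
from typing import Optional, Tuple


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is not installed."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


def python_is_supported(minimum: Tuple[int, int] = (3, 12)) -> bool:
    """Check the running interpreter against a minimum (major, minor) version."""
    return sys.version_info >= minimum


print("contextifier:", installed_version("contextifier") or "not installed")
print("Python 3.12+:", python_is_supported())
```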

Quick Start

Basic Usage

from contextifier import DocumentProcessor

# Create processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Extract text and chunk in one step
result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

# Access chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i + 1}: {chunk[:100]}...")

# Save chunks to markdown file
result.save_to_md("output/chunks.md")
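
When processing many files, it can help to collect failures instead of stopping at the first one. The sketch below takes the extraction function as an argument, so it can be used with `processor.extract_text` from the example above; the helper name `extract_many` is hypothetical, and the exceptions the library raises are not documented here, so the sketch catches `Exception` broadly.

```python
from typing import Callable, Dict, Iterable, Tuple


def extract_many(
    paths: Iterable,
    extract: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run `extract` on every path; return (texts, errors) keyed by path."""
    texts: Dict[str, str] = {}
    errors: Dict[str, str] = {}
    for path in paths:
        key = str(path)
        try:
            texts[key] = extract(key)
        except Exception as exc:  # record the failure and continue with the next file
            errors[key] = repr(exc)
    return texts, errors


# Usage with the processor from the example above (assumes a local "docs" folder):
#   from pathlib import Path
#   texts, errors = extract_many(Path("docs").glob("*.pdf"), processor.extract_text)
```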

With OCR Processing

from contextifier import DocumentProcessor
from contextifier.ocr.ocr_engine.openai_ocr import OpenAIOCREngine

# Initialize OCR engine
ocr_engine = OpenAIOCREngine(api_key="sk-...", model="gpt-4o")

# Create processor with OCR
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True
)
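
Rather than hard-coding the API key as in the example above, the key can be read from the environment. The sketch below assumes the key is stored in a variable named `OPENAI_API_KEY`; that name and the helper `read_api_key` are assumptions for illustration, not something Contextifier requires.

```python
import os


def read_api_key(env_var: str = "OPENAI_API_KEY") -> str:
    """Return the API key stored in `env_var`, failing early if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"Set the {env_var} environment variable before enabling OCR")
    return key


# Usage with the engine from the example above:
#   ocr_engine = OpenAIOCREngine(api_key=read_api_key(), model="gpt-4o")
```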

Supported Formats

Category	Extensions
Documents	`.pdf`, `.docx`, `.doc`, `.pptx`, `.ppt`, `.hwp`, `.hwpx`
Spreadsheets	`.xlsx`, `.xls`, `.csv`, `.tsv`
Text	`.txt`, `.md`, `.rtf`
Web	`.html`, `.htm`, `.xml`
Code	`.py`, `.js`, `.ts`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, and more
Config	`.json`, `.yaml`, `.yml`, `.toml`, `.ini`, `.env`
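
The table above can be turned into a small lookup for filtering files before they are handed to the processor. This sketch covers only the extensions listed in the table (the "and more" code extensions are left out), and `category_for` is a hypothetical helper, not part of the library.

```python
from pathlib import Path
from typing import Optional

# Extensions copied from the Supported Formats table.
CATEGORIES = {
    "Documents": {".pdf", ".docx", ".doc", ".pptx", ".ppt", ".hwp", ".hwpx"},
    "Spreadsheets": {".xlsx", ".xls", ".csv", ".tsv"},
    "Text": {".txt", ".md", ".rtf"},
    "Web": {".html", ".htm", ".xml"},
    "Code": {".py", ".js", ".ts", ".java", ".cpp", ".c", ".go", ".rs"},
    "Config": {".json", ".yaml", ".yml", ".toml", ".ini", ".env"},
}


def category_for(path: str) -> Optional[str]:
    """Return the table category for a file path, or None if it is not listed."""
    p = Path(path)
    suffix = p.suffix.lower()
    if not suffix and p.name.startswith("."):
        suffix = p.name.lower()  # dotfiles such as ".env" have no suffix
    for category, extensions in CATEGORIES.items():
        if suffix in extensions:
            return category
    return None


print(category_for("report.PDF"))
print(category_for("movie.mp4"))
```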

Architecture

libs/
├── core/
│   ├── document_processor.py    # Main entry point
│   ├── processor/               # Format-specific handlers
│   │   ├── pdf_handler.py       # PDF processing with V4 engine
│   │   ├── docx_handler.py      # DOCX processing
│   │   ├── ppt_handler.py       # PowerPoint processing
│   │   ├── excel_handler.py     # Excel processing
│   │   ├── hwp_processor.py     # HWP 5.0 OLE processing
│   │   ├── hwpx_processor.py    # HWPX (ZIP/XML) processing
│   │   └── ...
│   └── functions/
│       └── img_processor.py     # Image handling utilities
├── chunking/
│   ├── chunking.py              # Main chunking interface
│   ├── text_chunker.py          # Text-based chunking
│   ├── table_chunker.py         # Table-aware chunking
│   └── page_chunker.py          # Page-based chunking
└── ocr/
    ├── base.py                  # OCR base class
    ├── ocr_processor.py         # OCR processing utilities
    └── ocr_engine/              # OCR engine implementations
        ├── openai_ocr.py
        ├── anthropic_ocr.py
        ├── gemini_ocr.py
        └── vllm_ocr.py
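
The `ocr/` layout above suggests a common base class with one implementation per backend. The following is a minimal sketch of that pluggable pattern under stated assumptions: the class names, the `image_to_text` method, and the `ImageStep` consumer are all hypothetical and are not taken from `ocr/base.py`.

```python
from abc import ABC, abstractmethod
from typing import Optional


class BaseOCREngine(ABC):
    """Hypothetical engine interface; the real base class may differ."""

    @abstractmethod
    def image_to_text(self, image_bytes: bytes) -> str:
        """Return the text recognised in an image."""


class StubOCREngine(BaseOCREngine):
    """Stand-in backend that reports the image size instead of calling a model."""

    def image_to_text(self, image_bytes: bytes) -> str:
        return f"[image, {len(image_bytes)} bytes]"


class ImageStep:
    """Hypothetical consumer: uses whichever engine it is given, if any."""

    def __init__(self, ocr_engine: Optional[BaseOCREngine] = None) -> None:
        self.ocr_engine = ocr_engine

    def describe(self, image_bytes: bytes) -> str:
        if self.ocr_engine is None:
            return "[image]"  # no engine configured: leave a placeholder
        return self.ocr_engine.image_to_text(image_bytes)


print(ImageStep().describe(b"\x89PNG"))
print(ImageStep(StubOCREngine()).describe(b"\x89PNG"))
```

Because the consumer depends only on the base interface, swapping one backend for another does not require changes to the code that calls it.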

Requirements

Python 3.12+
Required dependencies are automatically installed (see pyproject.toml)

System Dependencies

For full functionality, you may need:

Tesseract OCR: For local OCR fallback
LibreOffice: For DOC/RTF conversion (optional)
Poppler: For PDF image extraction
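
To see which of these tools are available on the current machine, a check like the one below can be run before processing. The executable names (`tesseract`, `soffice` or `libreoffice`, `pdftoppm`) are the usual command names for these tools; whether Contextifier looks them up under exactly these names is an assumption.

```python
import shutil
from typing import Dict, Optional, Sequence

# Tool label -> candidate executable names to look for on PATH.
SYSTEM_TOOLS: Dict[str, Sequence[str]] = {
    "Tesseract OCR": ("tesseract",),
    "LibreOffice": ("soffice", "libreoffice"),
    "Poppler": ("pdftoppm",),
}


def locate_tools(tools: Dict[str, Sequence[str]] = SYSTEM_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool label to the first executable found on PATH, or None."""
    found: Dict[str, Optional[str]] = {}
    for label, candidates in tools.items():
        found[label] = None
        for exe in candidates:
            path = shutil.which(exe)
            if path:
                found[label] = path
                break
    return found


for label, path in locate_tools().items():
    print(f"{label}: {path or 'not found'}")
```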

Configuration

from contextifier import DocumentProcessor

# Custom configuration
config = {
    "pdf": {
        "extract_images": True,
        "ocr_fallback": True,
    },
    "chunking": {
        "default_size": 1000,
        "default_overlap": 200,
    }
}

# Pass the configuration when creating the processor
processor = DocumentProcessor(config=config)
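
This section does not say whether a partial `config` is merged with the library's defaults. If only a few values need changing, one option is to build the full dictionary yourself by merging overrides into a base configuration. The sketch below assumes the keys shown above as the base; `merge_config` is a hypothetical helper.

```python
import copy
from typing import Any, Dict

# Base values copied from the configuration example above.
BASE_CONFIG: Dict[str, Any] = {
    "pdf": {"extract_images": True, "ocr_fallback": True},
    "chunking": {"default_size": 1000, "default_overlap": 200},
}


def merge_config(base: Dict[str, Any], overrides: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict with `overrides` applied on top of `base`, recursing into nested dicts."""
    merged = copy.deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


config = merge_config(BASE_CONFIG, {"chunking": {"default_size": 500}})
print(config)
```

The merged dictionary can then be passed as `DocumentProcessor(config=config)`.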

License

Apache License 2.0 - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Release history

Version	Release date
0.2.5	Apr 1, 2026
0.2.4	Mar 23, 2026
0.2.2 (this version)	Jan 23, 2026
0.2.0	Jan 22, 2026
0.1.6	Jan 21, 2026
0.1.5	Jan 20, 2026
0.1.4	Jan 20, 2026
0.1.3	Jan 20, 2026
0.1.2	Jan 20, 2026
0.1.1	Jan 19, 2026
0.1.0	Jan 19, 2026

Download files

Source Distribution

contextifier-0.2.2.tar.gz (279.9 kB)

Uploaded Jan 23, 2026 Source

Built Distribution

contextifier-0.2.2-py3-none-any.whl (376.7 kB)

Uploaded Jan 23, 2026 Python 3

File details

Details for the file contextifier-0.2.2.tar.gz.

File metadata

Download URL: contextifier-0.2.2.tar.gz
Upload date: Jan 23, 2026
Size: 279.9 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for contextifier-0.2.2.tar.gz
Algorithm	Hash digest
SHA256	`8b961f5cc101a6a45233243c07d7fc205ac6f024596667dbcb5965db44dc4169`
MD5	`864225befed161d0f9d2f99c53ac65dd`
BLAKE2b-256	`47d94242dd22dc477d508b885199f0c369e3bec9601bdc45b2ec1d414505834d`
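
A downloaded file can be checked against the digests above with the standard library. This is a minimal sketch: `file_digest` is a hypothetical helper, and it assumes the archive has been saved in the current directory under its original file name.

```python
import hashlib


def file_digest(path: str, algorithm: str = "sha256", block_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in blocks to limit memory use."""
    if algorithm == "blake2b-256":
        hasher = hashlib.blake2b(digest_size=32)  # 32 bytes = 256 bits
    else:
        hasher = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            hasher.update(block)
    return hasher.hexdigest()


# SHA256 digest listed above for contextifier-0.2.2.tar.gz.
EXPECTED_SHA256 = "8b961f5cc101a6a45233243c07d7fc205ac6f024596667dbcb5965db44dc4169"

# Usage, once the file has been downloaded:
#   ok = file_digest("contextifier-0.2.2.tar.gz") == EXPECTED_SHA256
#   print("SHA256 matches" if ok else "SHA256 MISMATCH")
```

SHA256 is the digest to rely on for integrity checks; MD5 is listed for completeness but is not collision-resistant.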

Provenance

The following attestation bundles were made for contextifier-0.2.2.tar.gz:

Publisher: publish.yml on CocoRoF/Contextifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: contextifier-0.2.2.tar.gz
- Subject digest: 8b961f5cc101a6a45233243c07d7fc205ac6f024596667dbcb5965db44dc4169
- Sigstore transparency entry: 845967719
- Sigstore integration time: Jan 23, 2026
Source repository:
- Permalink: CocoRoF/Contextifier@9e44593af43c7401405ab5f1932d9a5bdd46fb36
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/CocoRoF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9e44593af43c7401405ab5f1932d9a5bdd46fb36
- Trigger Event: push

File details

Details for the file contextifier-0.2.2-py3-none-any.whl.

File metadata

Download URL: contextifier-0.2.2-py3-none-any.whl
Upload date: Jan 23, 2026
Size: 376.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.7

File hashes

Hashes for contextifier-0.2.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`e7863933424c16b6549d913ce6ff24b630e5d983d8d9fc67f018b50503a6027b`
MD5	`171dd33fabb6e74ab10af6fd48c315d5`
BLAKE2b-256	`38312a63e903dbcbad60a24879bf1e5a3a00dccbd86fbc35ddbb5207572ff1c1`

Provenance

The following attestation bundles were made for contextifier-0.2.2-py3-none-any.whl:

Publisher: publish.yml on CocoRoF/Contextifier

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: contextifier-0.2.2-py3-none-any.whl
- Subject digest: e7863933424c16b6549d913ce6ff24b630e5d983d8d9fc67f018b50503a6027b
- Sigstore transparency entry: 845967722
- Sigstore integration time: Jan 23, 2026
Source repository:
- Permalink: CocoRoF/Contextifier@9e44593af43c7401405ab5f1932d9a5bdd46fb36
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/CocoRoF
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@9e44593af43c7401405ab5f1932d9a5bdd46fb36
- Trigger Event: push
