Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking

Contextifier

Contextifier is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.

Features

  • Multi-format Support: Process a wide variety of document formats including:

    • PDF (with table detection, OCR fallback, and complex layout handling)
    • Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
    • Korean documents: HWP, HWPX (Hangul Word Processor)
    • Text, data, and web formats: TXT, MD, RTF, CSV, HTML
    • Code files: Python, JavaScript, TypeScript, and 20+ other languages
  • Intelligent Text Extraction:

    • Preserves document structure (headings, paragraphs, lists)
    • Extracts tables as HTML with proper rowspan/colspan handling
    • Handles merged cells and complex table layouts
    • Extracts and processes inline images
  • OCR Integration:

    • Pluggable OCR engine architecture
    • Supports OpenAI, Anthropic, Google Gemini, and vLLM backends
    • Automatic OCR fallback for scanned documents or image-based PDFs
  • Smart Chunking:

    • Semantic text chunking with configurable size and overlap
    • Table-aware chunking that preserves table integrity
    • Protected regions for code blocks and special content
  • Metadata Extraction:

    • Extracts document metadata (title, author, creation date, etc.)
    • Emits metadata in a structured, parseable form
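
To make the "configurable size and overlap" idea under Smart Chunking concrete, here is a minimal sketch of fixed-size chunking with overlap. It is not Contextifier's implementation: the function name chunk_text is hypothetical, it counts characters rather than tokens, and it ignores the semantic boundaries, tables, and protected regions that the library says it handles.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 200) -> list[str]:
    """Split text into windows of at most chunk_size characters.

    Consecutive windows share chunk_overlap characters, so content that
    straddles a boundary appears in both neighbouring chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks: list[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
        start += step
    return chunks
```

A larger overlap repeats more text across chunks, which costs storage but reduces the chance that a sentence is cut off from its context.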

Installation

pip install contextifier

Or using uv:

uv add contextifier

Quick Start

Basic Usage

from contextifier import DocumentProcessor

# Create processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Extract text and chunk in one step
result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

# Access chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i + 1}: {chunk[:100]}...")

# Save chunks to markdown file
result.save_to_md("output/chunks.md")

With OCR Processing

from contextifier import DocumentProcessor
from contextifier.ocr.ocr_engine.openai_ocr import OpenAIOCREngine

# Initialize OCR engine
ocr_engine = OpenAIOCREngine(api_key="sk-...", model="gpt-4o")

# Create processor with OCR
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True
)
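
The pluggable-engine and automatic-fallback ideas can be sketched without the library. Everything below is an assumption made for illustration: the OCREngine protocol, its image_to_text method, FakeOCREngine, and the min_chars threshold are hypothetical names, not Contextifier's API, and the library may decide when to fall back to OCR in a different way.

```python
from typing import Protocol


class OCREngine(Protocol):
    """Hypothetical interface: any object with this method can be plugged in."""

    def image_to_text(self, image: bytes) -> str: ...


class FakeOCREngine:
    """Stand-in engine for the demo; a real one would call a vision model."""

    def image_to_text(self, image: bytes) -> str:
        return f"[OCR text from {len(image)} bytes]"


def text_or_ocr(page_text: str, page_image: bytes, engine: OCREngine,
                min_chars: int = 20) -> str:
    """Return the embedded text layer, or OCR output when that layer looks empty."""
    if len(page_text.strip()) >= min_chars:
        return page_text  # the page already carries usable text
    return engine.image_to_text(page_image)  # likely a scanned page
```

Because the function depends only on the protocol, swapping one backend for another would not change the calling code.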

Supported Formats

Category      Extensions
Documents     .pdf, .docx, .doc, .pptx, .ppt, .hwp, .hwpx
Spreadsheets  .xlsx, .xls, .csv, .tsv
Text          .txt, .md, .rtf
Web           .html, .htm, .xml
Code          .py, .js, .ts, .java, .cpp, .c, .go, .rs, and more
Config        .json, .yaml, .yml, .toml, .ini, .env

Architecture

libs/
├── core/
│   ├── document_processor.py    # Main entry point
│   ├── processor/               # Format-specific handlers
│   │   ├── pdf_handler.py       # PDF processing with V4 engine
│   │   ├── docx_handler.py      # DOCX processing
│   │   ├── ppt_handler.py       # PowerPoint processing
│   │   ├── excel_handler.py     # Excel processing
│   │   ├── hwp_processor.py     # HWP 5.0 OLE processing
│   │   ├── hwpx_processor.py    # HWPX (ZIP/XML) processing
│   │   └── ...
│   └── functions/
│       └── img_processor.py     # Image handling utilities
├── chunking/
│   ├── chunking.py              # Main chunking interface
│   ├── text_chunker.py          # Text-based chunking
│   ├── table_chunker.py         # Table-aware chunking
│   └── page_chunker.py          # Page-based chunking
└── ocr/
    ├── base.py                  # OCR base class
    ├── ocr_processor.py         # OCR processing utilities
    └── ocr_engine/              # OCR engine implementations
        ├── openai_ocr.py
        ├── anthropic_ocr.py
        ├── gemini_ocr.py
        └── vllm_ocr.py
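
The layout suggests one entry point that routes each file to a format-specific handler. A common way to build that routing is an extension registry, sketched below. This is an illustration of the pattern only: register, dispatch, and the placeholder handlers are hypothetical and do not mirror the actual code in document_processor.py.

```python
from pathlib import Path
from typing import Callable

Handler = Callable[[str], str]
_HANDLERS: dict[str, Handler] = {}  # maps a lowercase extension to its handler


def register(*extensions: str) -> Callable[[Handler], Handler]:
    """Decorator that records a handler for one or more file extensions."""
    def decorator(func: Handler) -> Handler:
        for ext in extensions:
            _HANDLERS[ext.lower()] = func
        return func
    return decorator


@register(".pdf")
def handle_pdf(path: str) -> str:
    return f"pdf:{path}"  # placeholder for real PDF extraction


@register(".ppt", ".pptx")
def handle_ppt(path: str) -> str:
    return f"ppt:{path}"  # placeholder for real PowerPoint extraction


def dispatch(path: str) -> str:
    """Route a file to the handler registered for its extension."""
    ext = Path(path).suffix.lower()
    try:
        handler = _HANDLERS[ext]
    except KeyError:
        raise ValueError(f"unsupported format: {ext or path}") from None
    return handler(path)
```

With a registry like this, supporting a new format means adding one decorated function rather than editing a central if/elif chain.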

Requirements

  • Python 3.12+
  • Required Python dependencies are installed automatically with the package (see pyproject.toml)

System Dependencies

For full functionality, you may need:

  • Tesseract OCR: For local OCR fallback
  • LibreOffice: For DOC/RTF conversion (optional)
  • Poppler: For PDF image extraction

Configuration

from contextifier import DocumentProcessor

# Custom configuration
config = {
    "pdf": {
        "extract_images": True,
        "ocr_fallback": True,
    },
    "chunking": {
        "default_size": 1000,
        "default_overlap": 200,
    }
}

processor = DocumentProcessor(config=config)
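
A nested config like this is usually combined with defaults so that callers only specify what they change. Whether Contextifier merges a partial config this way is not stated here, so the sketch below is an assumption: merge_config and DEFAULTS are hypothetical, and the default values simply reuse the numbers from the example above.

```python
from copy import deepcopy

# Illustrative defaults only; taken from the example above, not from the library.
DEFAULTS = {
    "pdf": {"extract_images": True, "ocr_fallback": True},
    "chunking": {"default_size": 1000, "default_overlap": 200},
}


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict in which overrides replace defaults key by key.

    Nested dicts are merged recursively; neither input is modified.
    """
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Override a single value and keep every other default.
config = merge_config(DEFAULTS, {"chunking": {"default_size": 500}})
```

The resulting dict could then be passed as the config argument shown above.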

License

Apache License 2.0 - see LICENSE for details.

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

contextifier-0.1.5.tar.gz (source distribution, 238.7 kB)

  • Tags: Source
  • Uploaded using Trusted Publishing via twine/6.1.0 CPython/3.13.7
  • SHA256: 828741428d3d034684af5e84c016a9c8a842ead7f0f449509b8ab289795a685f
  • MD5: d2e3f3dbc41fe3b4479adb6a242f6785
  • BLAKE2b-256: ac637eb1cbd8f1ed5adbf2ed5cb080a15e37fecc6176d5c141c2b4ccf2dbdc87

contextifier-0.1.5-py3-none-any.whl (built distribution, 322.1 kB)

  • Tags: Python 3
  • Uploaded using Trusted Publishing via twine/6.1.0 CPython/3.13.7
  • SHA256: c26ef3b0df4af1c67ab064851d3712dff9284708d40d9c18584ed5c06109414c
  • MD5: 98f1cb0b74ff05845aafbd0729db11fd
  • BLAKE2b-256: 7b19697f946e02a9671d91d0dd5b6ab5c5ec065877c8d7faf561c39fb810d146

Provenance

Attestation bundles for both files were made by the publisher publish.yml on CocoRoF/Contextifier. Attestation values reflect the state when the release was signed and may no longer be current.
