
Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking


Contextifier v2

Contextifier is a Python document processing library that converts documents of various formats into structured, AI-ready text. It applies a uniform 5-stage pipeline to every document format, ensuring consistent and predictable output.

Key Features

  • Broad Format Support: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, HWP, HWPX, RTF, CSV, TSV, TXT, MD, HTML, images, code files, and 80+ extensions
  • Intelligent Text Extraction: Preserves document structure (headings, tables, image positions) with automatic metadata extraction
  • Table Processing: Converts tables to HTML/Markdown/Text with rowspan/colspan support for merged cells
  • OCR Integration: 5 Vision LLM engines — OpenAI, Anthropic, Google Gemini, AWS Bedrock, vLLM
  • Smart Chunking: 4 strategies with automatic selection — table-aware, page-boundary, protected-region, and recursive splitting
  • Immutable Config System: Frozen dataclass-based ProcessingConfig controls all behavior
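
The uniform 5-stage pipeline described above can be pictured as five functions applied in sequence. The sketch below is illustrative only: the function names and the dict-based document object are assumptions, not the library's internal code, but the stage responsibilities match the module names listed under `pipeline/` in the Architecture section.

```python
# Illustrative sketch of a 5-stage document pipeline (hypothetical names).

def convert(raw: bytes) -> dict:
    # Stage 1: binary -> format object (a plain dict stands in here)
    return {"body": raw.decode("utf-8", errors="replace")}

def preprocess(doc: dict) -> dict:
    # Stage 2: normalize the format object
    doc["body"] = doc["body"].strip()
    return doc

def extract_metadata(doc: dict) -> dict:
    # Stage 3: pull out document-level metadata
    doc["meta"] = {"chars": len(doc["body"])}
    return doc

def extract_content(doc: dict) -> dict:
    # Stage 4: text / table / image / chart extraction
    doc["text"] = doc["body"]
    return doc

def postprocess(doc: dict) -> str:
    # Stage 5: final assembly and cleanup
    return doc["text"]

def run_pipeline(raw: bytes) -> str:
    doc = convert(raw)
    for stage in (preprocess, extract_metadata, extract_content):
        doc = stage(doc)
    return postprocess(doc)

print(run_pipeline(b"  hello world  "))  # -> hello world
```

Because every handler follows the same five stages, output stays predictable across formats: only the stage implementations differ per format, not the overall flow.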

Installation

pip install contextifier

or

uv add contextifier

Quick Start

1. Basic Text Extraction

from contextifier_new import DocumentProcessor

processor = DocumentProcessor()
text = processor.extract_text("document.pdf")
print(text)

2. Extract + Chunk in One Step

from contextifier_new import DocumentProcessor

processor = DocumentProcessor()
result = processor.extract_chunks("document.pdf")

for i, chunk in enumerate(result.chunks, 1):
    print(f"Chunk {i}: {chunk[:100]}...")

# Save as Markdown files
result.save_to_md("output/chunks")

3. Custom Configuration

from contextifier_new import DocumentProcessor
from contextifier_new.config import ProcessingConfig, ChunkingConfig, TagConfig

config = ProcessingConfig(
    tags=TagConfig(page_prefix="<page>", page_suffix="</page>"),
    chunking=ChunkingConfig(chunk_size=2000, chunk_overlap=300),
)

processor = DocumentProcessor(config=config)
text = processor.extract_text("report.xlsx")
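
`chunk_size` and `chunk_overlap` behave as in most chunkers: consecutive chunks share roughly `chunk_overlap` characters of context. The minimal character-level sketch below illustrates the idea only; it is not the library's strategy selection, which also splits on separators such as tables and page boundaries.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Naive fixed-size chunking with overlap. Real strategies prefer to
    # split on structure (pages, tables, paragraphs) before falling back
    # to raw character windows like this.
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

print(sliding_chunks("abcdefgh", 4, 2))  # -> ['abcd', 'cdef', 'efgh']
```

With `chunk_size=2000` and `chunk_overlap=300` as configured above, each chunk would start 1700 characters after the previous one, so neighboring chunks repeat 300 characters of shared context.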

4. OCR Integration

from contextifier_new import DocumentProcessor
from contextifier_new.ocr.engines import OpenAIOCREngine

ocr = OpenAIOCREngine.from_api_key("sk-...", model="gpt-4o")
processor = DocumentProcessor(ocr_engine=ocr)

text = processor.extract_text("scanned.pdf", ocr_processing=True)

Supported Formats

| Category | Extensions | Notes |
| --- | --- | --- |
| Documents | .pdf, .docx, .doc, .hwp, .hwpx, .rtf | HWP 5.0+, HWPX supported |
| Presentations | .pptx, .ppt | Slides, notes, and charts extracted |
| Spreadsheets | .xlsx, .xls, .csv, .tsv | Multi-sheet, formulas, charts |
| Text | .txt, .md, .log, .rst | Auto encoding detection |
| Web | .html, .htm, .xhtml | Table/structure preservation |
| Code | .py, .js, .ts, .java, .cpp, .go, .rs, etc. (20+) | Language-aware highlighting |
| Config | .json, .yaml, .toml, .ini, .xml, .env | Structure preservation |
| Images | .jpg, .png, .gif, .bmp, .webp, .tiff | Requires OCR engine |

Architecture

contextifier_new/
├── document_processor.py     # Facade: single public entry point
├── config.py                 # Immutable config system (ProcessingConfig)
├── types.py                  # Shared types / Enums / TypedDicts
├── errors.py                 # Unified exception hierarchy
│
├── handlers/                 # 14 format-specific handlers
│   ├── base.py               #   BaseHandler — enforces 5-stage pipeline
│   ├── registry.py           #   HandlerRegistry — extension → handler mapping
│   ├── pdf/                  #   PDF (default)
│   ├── pdf_plus/             #   PDF (advanced: table detection, complex layouts)
│   ├── docx/ doc/ pptx/ ppt/ #   Office documents
│   ├── xlsx/ xls/ csv/       #   Spreadsheets / data
│   ├── hwp/ hwpx/            #   Korean word processor
│   ├── rtf/ text/            #   RTF / text / code / config
│   └── image/                #   Image (OCR integration)
│
├── pipeline/                 # 5-Stage pipeline ABCs
│   ├── converter.py          #   Stage 1: Binary → Format Object
│   ├── preprocessor.py       #   Stage 2: Preprocessing
│   ├── metadata_extractor.py #   Stage 3: Metadata extraction
│   ├── content_extractor.py  #   Stage 4: Text / table / image / chart extraction
│   └── postprocessor.py      #   Stage 5: Final assembly & cleanup
│
├── services/                 # Shared services (DI)
│   ├── tag_service.py        #   Page / slide / sheet tag generation
│   ├── image_service.py      #   Image saving / tagging / deduplication
│   ├── chart_service.py      #   Chart data formatting
│   ├── table_service.py      #   Table HTML / MD rendering
│   ├── metadata_service.py   #   Metadata formatting
│   └── storage/              #   Storage backends (Local, MinIO, S3, ...)
│
├── chunking/                 # Chunking subsystem
│   ├── chunker.py            #   TextChunker — auto strategy selection
│   ├── constants.py          #   Protected region patterns
│   └── strategies/           #   4 chunking strategies
│       ├── plain_strategy.py     # Recursive splitting (default fallback)
│       ├── table_strategy.py     # Sheet / table-based splitting
│       ├── page_strategy.py      # Page-boundary splitting
│       └── protected_strategy.py # Protected region preservation
│
└── ocr/                      # OCR subsystem (optional)
    ├── base.py               #   BaseOCREngine ABC
    ├── processor.py          #   OCRProcessor — tag detection + engine call
    └── engines/              #   5 engine implementations
        ├── openai_engine.py
        ├── anthropic_engine.py
        ├── gemini_engine.py
        ├── bedrock_engine.py
        └── vllm_engine.py
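
The `HandlerRegistry` maps file extensions to format handlers. The dict-based sketch below illustrates that lookup pattern; the handler names and `resolve` function are hypothetical, not the library's actual API.

```python
# Minimal sketch of an extension -> handler registry (names are illustrative).
from pathlib import Path

REGISTRY: dict[str, str] = {}

def register(handler_name: str, *extensions: str) -> None:
    # Map each extension (case-insensitive) to a handler name.
    for ext in extensions:
        REGISTRY[ext.lower()] = handler_name

register("PdfHandler", ".pdf")
register("XlsxHandler", ".xlsx", ".xls")
register("TextHandler", ".txt", ".md", ".log", ".rst")

def resolve(path: str) -> str:
    # Look up the handler for a file by its extension.
    ext = Path(path).suffix.lower()
    try:
        return REGISTRY[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext}") from None

print(resolve("notes.MD"))  # -> TextHandler
```

A registry like this is what lets one facade (`DocumentProcessor`) dispatch 80+ extensions to 14 handlers without format-specific branching at the call site.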

Requirements

  • Python 3.12+
  • Core dependencies are declared in pyproject.toml and installed automatically
  • Optional: LibreOffice (DOC/PPT/RTF conversion), Poppler (PDF image extraction)
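
You can check up front whether the optional system tools are on PATH: LibreOffice ships the `soffice` command-line binary, and Poppler provides `pdftoppm` for rendering PDF pages to images. The helper below is a convenience sketch, not part of the library.

```python
import shutil

def optional_tools() -> dict[str, bool]:
    # LibreOffice exposes the `soffice` CLI; Poppler provides `pdftoppm`.
    return {
        "libreoffice": shutil.which("soffice") is not None,
        "poppler": shutil.which("pdftoppm") is not None,
    }
```

If `libreoffice` is missing, DOC/PPT/RTF conversion will be unavailable; if `poppler` is missing, PDF image extraction will be.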

Documentation

| Document | Contents |
| --- | --- |
| QUICKSTART.md | Detailed usage guide & full API reference |
| Process Logic.md | Handler processing flow diagrams |
| ARCHITECTURE.md | Internal architecture specification |
| CHANGELOG.md | Version history |
| CONTRIBUTING.md | Contribution guidelines |

License

Apache License 2.0 — see LICENSE

Contributing

Contributions are welcome! See CONTRIBUTING.md.
