
Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking


Contextifier v2

Contextifier is a Python document processing library that converts documents of various formats into structured, AI-ready text. It applies a uniform 5-stage pipeline to every document format, ensuring consistent and predictable output.

Key Features

  • Broad Format Support: PDF, DOCX, DOC, PPTX, PPT, XLSX, XLS, HWP, HWPX, RTF, CSV, TSV, TXT, MD, HTML, images, code files, and 80+ extensions
  • Intelligent Text Extraction: Preserves document structure (headings, tables, image positions) with automatic metadata extraction
  • Table Processing: Converts tables to HTML/Markdown/Text with rowspan/colspan support for merged cells
  • OCR Integration: 5 Vision LLM engines — OpenAI, Anthropic, Google Gemini, AWS Bedrock, vLLM
  • Smart Chunking: 4 strategies with automatic selection — table-aware, page-boundary, protected-region, and recursive splitting
  • Immutable Config System: Frozen dataclass-based ProcessingConfig controls all behavior
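
The uniform 5-stage pipeline described above can be pictured as five functions applied in sequence. The sketch below is illustrative only: the function names and the dict-based document object are assumptions, not the library's internal code, but the stage responsibilities match the module names listed under `pipeline/` in the Architecture section.

```python
# Illustrative sketch of a 5-stage document pipeline (hypothetical names).

def convert(raw: bytes) -> dict:
    # Stage 1: binary -> format object (a plain dict stands in here)
    return {"body": raw.decode("utf-8", errors="replace")}

def preprocess(doc: dict) -> dict:
    # Stage 2: normalize the format object
    doc["body"] = doc["body"].strip()
    return doc

def extract_metadata(doc: dict) -> dict:
    # Stage 3: pull out document-level metadata
    doc["meta"] = {"chars": len(doc["body"])}
    return doc

def extract_content(doc: dict) -> dict:
    # Stage 4: text / table / image / chart extraction
    doc["text"] = doc["body"]
    return doc

def postprocess(doc: dict) -> str:
    # Stage 5: final assembly and cleanup
    return doc["text"]

def run_pipeline(raw: bytes) -> str:
    doc = convert(raw)
    for stage in (preprocess, extract_metadata, extract_content):
        doc = stage(doc)
    return postprocess(doc)

print(run_pipeline(b"  hello world  "))  # -> hello world
```

Because every handler follows the same five stages, output stays predictable across formats: only the stage implementations differ per format, not the overall flow.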

Installation

pip install contextifier

or

uv add contextifier

Quick Start

1. Basic Text Extraction

from contextifier_new import DocumentProcessor

processor = DocumentProcessor()
text = processor.extract_text("document.pdf")
print(text)

2. Extract + Chunk in One Step

from contextifier_new import DocumentProcessor

processor = DocumentProcessor()
result = processor.extract_chunks("document.pdf")

for i, chunk in enumerate(result.chunks, 1):
    print(f"Chunk {i}: {chunk[:100]}...")

# Save as Markdown files
result.save_to_md("output/chunks")

3. Custom Configuration

from contextifier_new import DocumentProcessor
from contextifier_new.config import ProcessingConfig, ChunkingConfig, TagConfig

config = ProcessingConfig(
    tags=TagConfig(page_prefix="<page>", page_suffix="</page>"),
    chunking=ChunkingConfig(chunk_size=2000, chunk_overlap=300),
)

processor = DocumentProcessor(config=config)
text = processor.extract_text("report.xlsx")
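
`chunk_size` and `chunk_overlap` behave as in most chunkers: consecutive chunks share roughly `chunk_overlap` characters of context. The minimal character-level sketch below illustrates the idea only; it is not the library's strategy selection, which also splits on separators such as tables and page boundaries.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    # Naive fixed-size chunking with overlap. Real strategies prefer to
    # split on structure (pages, tables, paragraphs) before falling back
    # to raw character windows like this.
    if chunk_overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk size")
    step = chunk_size - chunk_overlap
    stop = max(len(text) - chunk_overlap, 1)
    return [text[i:i + chunk_size] for i in range(0, stop, step)]

print(sliding_chunks("abcdefgh", 4, 2))  # -> ['abcd', 'cdef', 'efgh']
```

With `chunk_size=2000` and `chunk_overlap=300` as configured above, each chunk would start 1700 characters after the previous one, so neighboring chunks repeat 300 characters of shared context.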

4. OCR Integration

from contextifier_new import DocumentProcessor
from contextifier_new.ocr.engines import OpenAIOCREngine

ocr = OpenAIOCREngine.from_api_key("sk-...", model="gpt-4o")
processor = DocumentProcessor(ocr_engine=ocr)

text = processor.extract_text("scanned.pdf", ocr_processing=True)

Supported Formats

| Category | Extensions | Notes |
| --- | --- | --- |
| Documents | .pdf, .docx, .doc, .hwp, .hwpx, .rtf | HWP 5.0+, HWPX supported |
| Presentations | .pptx, .ppt | Slides, notes, and charts extracted |
| Spreadsheets | .xlsx, .xls, .csv, .tsv | Multi-sheet, formulas, charts |
| Text | .txt, .md, .log, .rst | Auto encoding detection |
| Web | .html, .htm, .xhtml | Table/structure preservation |
| Code | .py, .js, .ts, .java, .cpp, .go, .rs, etc. (20+) | Language-aware highlighting |
| Config | .json, .yaml, .toml, .ini, .xml, .env | Structure preservation |
| Images | .jpg, .png, .gif, .bmp, .webp, .tiff | Requires OCR engine |

Architecture

contextifier_new/
├── document_processor.py     # Facade: single public entry point
├── config.py                 # Immutable config system (ProcessingConfig)
├── types.py                  # Shared types / Enums / TypedDicts
├── errors.py                 # Unified exception hierarchy
│
├── handlers/                 # 14 format-specific handlers
│   ├── base.py               #   BaseHandler — enforces 5-stage pipeline
│   ├── registry.py           #   HandlerRegistry — extension → handler mapping
│   ├── pdf/                  #   PDF (default)
│   ├── pdf_plus/             #   PDF (advanced: table detection, complex layouts)
│   ├── docx/ doc/ pptx/ ppt/ #   Office documents
│   ├── xlsx/ xls/ csv/       #   Spreadsheets / data
│   ├── hwp/ hwpx/            #   Korean word processor
│   ├── rtf/ text/            #   RTF / text / code / config
│   └── image/                #   Image (OCR integration)
│
├── pipeline/                 # 5-Stage pipeline ABCs
│   ├── converter.py          #   Stage 1: Binary → Format Object
│   ├── preprocessor.py       #   Stage 2: Preprocessing
│   ├── metadata_extractor.py #   Stage 3: Metadata extraction
│   ├── content_extractor.py  #   Stage 4: Text / table / image / chart extraction
│   └── postprocessor.py      #   Stage 5: Final assembly & cleanup
│
├── services/                 # Shared services (DI)
│   ├── tag_service.py        #   Page / slide / sheet tag generation
│   ├── image_service.py      #   Image saving / tagging / deduplication
│   ├── chart_service.py      #   Chart data formatting
│   ├── table_service.py      #   Table HTML / MD rendering
│   ├── metadata_service.py   #   Metadata formatting
│   └── storage/              #   Storage backends (Local, MinIO, S3, ...)
│
├── chunking/                 # Chunking subsystem
│   ├── chunker.py            #   TextChunker — auto strategy selection
│   ├── constants.py          #   Protected region patterns
│   └── strategies/           #   4 chunking strategies
│       ├── plain_strategy.py     # Recursive splitting (default fallback)
│       ├── table_strategy.py     # Sheet / table-based splitting
│       ├── page_strategy.py      # Page-boundary splitting
│       └── protected_strategy.py # Protected region preservation
│
└── ocr/                      # OCR subsystem (optional)
    ├── base.py               #   BaseOCREngine ABC
    ├── processor.py          #   OCRProcessor — tag detection + engine call
    └── engines/              #   5 engine implementations
        ├── openai_engine.py
        ├── anthropic_engine.py
        ├── gemini_engine.py
        ├── bedrock_engine.py
        └── vllm_engine.py
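
The `HandlerRegistry` maps file extensions to format handlers. The dict-based sketch below illustrates that lookup pattern; the handler names and `resolve` function are hypothetical, not the library's actual API.

```python
# Minimal sketch of an extension -> handler registry (names are illustrative).
from pathlib import Path

REGISTRY: dict[str, str] = {}

def register(handler_name: str, *extensions: str) -> None:
    # Map each extension (case-insensitive) to a handler name.
    for ext in extensions:
        REGISTRY[ext.lower()] = handler_name

register("PdfHandler", ".pdf")
register("XlsxHandler", ".xlsx", ".xls")
register("TextHandler", ".txt", ".md", ".log", ".rst")

def resolve(path: str) -> str:
    # Look up the handler for a file by its extension.
    ext = Path(path).suffix.lower()
    try:
        return REGISTRY[ext]
    except KeyError:
        raise ValueError(f"unsupported extension: {ext}") from None

print(resolve("notes.MD"))  # -> TextHandler
```

A registry like this is what lets one facade (`DocumentProcessor`) dispatch 80+ extensions to 14 handlers without format-specific branching at the call site.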

Requirements

  • Python 3.12+
  • Core dependencies are declared in pyproject.toml and installed automatically
  • Optional: LibreOffice (DOC/PPT/RTF conversion), Poppler (PDF image extraction)
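
You can check up front whether the optional system tools are on PATH: LibreOffice ships the `soffice` command-line binary, and Poppler provides `pdftoppm` for rendering PDF pages to images. The helper below is a convenience sketch, not part of the library.

```python
import shutil

def optional_tools() -> dict[str, bool]:
    # LibreOffice exposes the `soffice` CLI; Poppler provides `pdftoppm`.
    return {
        "libreoffice": shutil.which("soffice") is not None,
        "poppler": shutil.which("pdftoppm") is not None,
    }
```

If `libreoffice` is missing, DOC/PPT/RTF conversion will be unavailable; if `poppler` is missing, PDF image extraction will be.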

Documentation

| Document | Contents |
| --- | --- |
| QUICKSTART.md | Detailed usage guide & full API reference |
| Process Logic.md | Handler processing flow diagrams |
| ARCHITECTURE.md | Internal architecture specification |
| CHANGELOG.md | Version history |
| CONTRIBUTING.md | Contribution guidelines |

License

Apache License 2.0 — see LICENSE

Contributing

Contributions are welcome! See CONTRIBUTING.md.
