Convert raw documents into AI-understandable context with intelligent text extraction, table detection, and semantic chunking

xgen-doc2chunk

xgen-doc2chunk is a document processing library that converts raw documents into AI-understandable context. It analyzes, restructures, and normalizes content so that language models can reason over documents with higher accuracy and consistency.

Current Version: 0.3.3. See CHANGELOG.md for release history.

Features

Multi-format Support: Process a wide variety of document formats including:
- PDF (adaptive complexity-based processing, multi-column layout, table detection)
- Microsoft Office: DOCX, DOC, PPTX, PPT, XLSX, XLS
- Korean documents: HWP, HWPX (Hangul Word Processor — full support)
- Text formats: TXT, MD, RTF, CSV, TSV, HTML
- Image files: JPG, PNG, GIF, BMP, WebP (via OCR)
- Code files: Python, JavaScript, TypeScript, and 20+ languages
- Config files: JSON, YAML, TOML, INI, ENV, and more
Intelligent Text Extraction:
- Preserves document structure (headings, paragraphs, lists)
- Extracts tables as HTML with proper rowspan/colspan handling
- Handles merged cells and complex table layouts
- Extracts and processes inline images
- Header/footer extraction for DOC, DOCX, HWPX
- Chart and diagram extraction from Office documents
OCR Integration:
- Pluggable OCR engine architecture
- Supports OpenAI, Anthropic, Google Gemini, AWS Bedrock, and vLLM backends
- Automatic OCR fallback for scanned documents or image-based PDFs
- Standalone image file processing (JPG, PNG, etc.)
- Custom image tag pattern support for OCR detection
Smart Chunking:
- Semantic text chunking with configurable size and overlap
- Table-aware chunking that preserves table integrity (HTML & Markdown)
- Page-based chunking with page number metadata
- Protected regions for code blocks, tables, images, charts, and metadata
- Small chunk merging to prevent table-title isolation
- Nested table support in protected region detection
- Position metadata (page number, line numbers, character offsets)
Metadata Extraction:
- Extracts document metadata (title, author, creation date, etc.)
- Formats metadata in a structured, parseable format
- Customizable metadata tag prefixes/suffixes
Storage Backends:
- Local file storage (default)
- MinIO / S3 compatible cloud storage
- Pluggable storage backend architecture
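
To make the "configurable size and overlap" setting under Smart Chunking concrete, here is a minimal character-window sketch. It is not the library's implementation, which is described above as semantic and table-aware. `chunk_with_overlap` is a hypothetical helper, and the only assumption is that consecutive chunks share `chunk_overlap` characters.

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size windows (hypothetical helper, not the library API)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

A larger overlap repeats more context across chunk boundaries, at the cost of producing more chunks for the same input.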

Installation

pip install xgen-doc2chunk

Or using uv:

uv add xgen-doc2chunk

Quick Start

Basic Usage

from xgen_doc2chunk import DocumentProcessor

# Create processor instance
processor = DocumentProcessor()

# Extract text from a document
text = processor.extract_text("document.pdf")
print(text)

# Extract text and chunk in one step
result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    chunk_overlap=200
)

# Access chunks
for i, chunk in enumerate(result.chunks):
    print(f"Chunk {i + 1}: {chunk[:100]}...")

# Save chunks to markdown file
result.save_to_md("output/chunks.md")

With OCR Processing

from xgen_doc2chunk import DocumentProcessor
from xgen_doc2chunk.ocr.ocr_engine.openai_ocr import OpenAIOCR

# Initialize OCR engine
ocr_engine = OpenAIOCR(api_key="sk-...", model="gpt-4o")

# Create processor with OCR
processor = DocumentProcessor(ocr_engine=ocr_engine)

# Extract text with OCR processing enabled
text = processor.extract_text(
    "scanned_document.pdf",
    ocr_processing=True
)

With Position Metadata

from xgen_doc2chunk import DocumentProcessor

processor = DocumentProcessor()

result = processor.extract_chunks(
    "document.pdf",
    chunk_size=1000,
    include_position_metadata=True
)

# Access position metadata per chunk
if result.has_metadata:
    for chunk_data in result.chunks_with_metadata:
        print(f"Page {chunk_data['page_number']}, "
              f"Lines {chunk_data['line_start']}-{chunk_data['line_end']}: "
              f"{chunk_data['text'][:80]}...")
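
For intuition about what line and character positions mean, the sketch below derives them for chunks of a plain string. It is an illustration only and does not describe how the library computes its metadata. `locate_chunks` is a hypothetical helper, and the `char_start` / `char_end` keys are assumed names; only `text`, `line_start`, and `line_end` appear in the example above.

```python
def locate_chunks(text: str, chunks: list[str]) -> list[dict]:
    """Attach 1-based line numbers and 0-based character offsets to each chunk."""
    located = []
    cursor = 0
    for chunk in chunks:
        if not chunk:
            continue  # empty chunks have no position
        start = text.find(chunk, cursor)
        if start == -1:
            raise ValueError("chunk not found in text")
        end = start + len(chunk)
        last_char = end - 1
        located.append({
            "text": chunk,
            "char_start": start,  # assumed key name
            "char_end": end,      # assumed key name (exclusive)
            "line_start": text.count("\n", 0, start) + 1,
            "line_end": text.count("\n", 0, last_char) + 1,
        })
        cursor = start + 1  # later chunks may overlap earlier ones
    return located
```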

Available OCR Engines

from xgen_doc2chunk.ocr.ocr_engine.openai_ocr import OpenAIOCR
from xgen_doc2chunk.ocr.ocr_engine.anthropic_ocr import AnthropicOCR
from xgen_doc2chunk.ocr.ocr_engine.gemini_ocr import GeminiOCR
from xgen_doc2chunk.ocr.ocr_engine.bedrock_ocr import BedrockOCR
from xgen_doc2chunk.ocr.ocr_engine.vllm_ocr import VllmOCR

# OpenAI (recommended)
engine = OpenAIOCR(api_key="sk-...", model="gpt-4o")

# Anthropic Claude
engine = AnthropicOCR(api_key="sk-ant-...", model="claude-sonnet-4-20250514")

# Google Gemini
engine = GeminiOCR(api_key="...", model="gemini-2.0-flash")

# AWS Bedrock
engine = BedrockOCR(
    aws_access_key_id="AKIA...",
    aws_secret_access_key="...",
    aws_region="us-east-1",
    model="anthropic.claude-3-5-sonnet-20241022-v2:0"
)

# vLLM (self-hosted)
engine = VllmOCR(base_url="http://localhost:8000", model="Qwen/Qwen2-VL-7B-Instruct")
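
The engines above are interchangeable because the processor only needs something that turns an image into text. The sketch below shows that pattern in isolation. All names in it (`OCREngine`, `image_to_text`, `StubOCR`, `run_ocr`) are hypothetical; the interface of the library's own `BaseOCR` class is not shown in this document and may differ.

```python
from abc import ABC, abstractmethod


class OCREngine(ABC):
    """Hypothetical engine interface; not the library's BaseOCR."""

    @abstractmethod
    def image_to_text(self, image_bytes: bytes) -> str:
        """Return the text recognized in the given image."""


class StubOCR(OCREngine):
    """Offline engine for tests: returns a fixed string instead of calling a model."""

    def __init__(self, canned_text: str) -> None:
        self.canned_text = canned_text

    def image_to_text(self, image_bytes: bytes) -> str:
        return self.canned_text


def run_ocr(engine: OCREngine, image_bytes: bytes) -> str:
    """Callers depend only on the interface, so engines can be swapped freely."""
    return engine.image_to_text(image_bytes).strip()
```

A stub like this can be handy for exercising a pipeline in tests without network access or API keys.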

Supported Formats

Category	Extensions
Documents	`.pdf`, `.docx`, `.doc`, `.rtf`, `.pptx`, `.ppt`, `.hwp`, `.hwpx`
Spreadsheets	`.xlsx`, `.xls`, `.csv`, `.tsv`
Text	`.txt`, `.md`, `.markdown`
Web	`.html`, `.htm`, `.xhtml`
Images	`.jpg`, `.jpeg`, `.png`, `.gif`, `.bmp`, `.webp`
Code	`.py`, `.js`, `.ts`, `.jsx`, `.tsx`, `.java`, `.cpp`, `.c`, `.go`, `.rs`, `.cs`, `.swift`, `.kt`, `.rb`, `.php`, `.dart`, `.r`, `.scala`, `.sql`, `.vue`, `.svelte`
Config	`.json`, `.yaml`, `.yml`, `.xml`, `.toml`, `.ini`, `.cfg`, `.conf`, `.properties`, `.env`
Script	`.sh`, `.bat`, `.ps1`, `.zsh`, `.fish`
Log	`.log`
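
If you want to pre-filter files before handing them to the processor, the table above can be turned into a lookup. This is a sketch using only the standard library; `FORMAT_CATEGORIES` and `categorize` are hypothetical names, and only part of the table is reproduced here.

```python
from pathlib import Path
from typing import Optional

# Subset of the Supported Formats table; extend with the remaining rows as needed.
FORMAT_CATEGORIES = {
    "Documents": {".pdf", ".docx", ".doc", ".rtf", ".pptx", ".ppt", ".hwp", ".hwpx"},
    "Spreadsheets": {".xlsx", ".xls", ".csv", ".tsv"},
    "Text": {".txt", ".md", ".markdown"},
    "Web": {".html", ".htm", ".xhtml"},
    "Images": {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"},
}


def categorize(path: str) -> Optional[str]:
    """Return the table category for a file path, or None if it is not listed."""
    suffix = Path(path).suffix.lower()  # match extensions case-insensitively
    for category, extensions in FORMAT_CATEGORIES.items():
        if suffix in extensions:
            return category
    return None
```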

Architecture

xgen_doc2chunk/
├── core/
│   ├── document_processor.py       # Main entry point (DocumentProcessor, ChunkResult)
│   ├── processor/                  # Format-specific handlers
│   │   ├── base_handler.py         # Abstract base handler
│   │   ├── pdf_handler.py          # PDF processing (PyMuPDF + pdfplumber)
│   │   ├── docx_handler.py         # DOCX processing
│   │   ├── doc_handler.py          # DOC processing (auto-detects format)
│   │   ├── ppt_handler.py          # PowerPoint processing
│   │   ├── excel_handler.py        # Excel processing (XLSX/XLS)
│   │   ├── csv_handler.py          # CSV/TSV processing
│   │   ├── hwp_handler.py          # HWP (OLE) processing
│   │   ├── hwpx_handler.py         # HWPX (ZIP/XML) processing
│   │   ├── rtf_handler.py          # RTF processing
│   │   ├── text_handler.py         # Plain text / code processing
│   │   ├── html_reprocessor.py     # HTML document processing
│   │   ├── image_file_handler.py   # Standalone image processing (via OCR)
│   │   └── {format}_helper/        # Format-specific utilities
│   └── functions/
│       ├── img_processor.py        # Image handling & tag generation
│       ├── page_tag_processor.py   # Page/slide/sheet tag processing
│       ├── chart_extractor.py      # Chart data extraction
│       ├── chart_processor.py      # Chart formatting
│       ├── metadata_extractor.py   # Metadata extraction & formatting
│       ├── table_extractor.py      # Table data structures
│       ├── table_processor.py      # Table formatting (HTML/Markdown/Text)
│       ├── storage_backend.py      # Pluggable storage (Local, MinIO, S3)
│       ├── preprocessor.py         # File preprocessing
│       ├── file_converter.py       # File format conversion
│       └── utils.py                # General utilities
├── chunking/
│   ├── chunking.py                 # Main chunking API
│   ├── text_chunker.py             # Text-based chunking
│   ├── table_chunker.py            # Table-aware chunking (HTML & Markdown)
│   ├── page_chunker.py             # Page-based chunking
│   ├── sheet_processor.py          # Sheet/metadata processing
│   ├── protected_regions.py        # Protected region detection (nested tables)
│   └── constants.py                # Constants and patterns
└── ocr/
    ├── base.py                     # BaseOCR abstract class
    ├── ocr_processor.py            # OCR processing utilities
    └── ocr_engine/                 # OCR engine implementations
        ├── openai_ocr.py           # OpenAI GPT-4 Vision
        ├── anthropic_ocr.py        # Anthropic Claude Vision
        ├── gemini_ocr.py           # Google Gemini Vision
        ├── bedrock_ocr.py          # AWS Bedrock Vision
        └── vllm_ocr.py             # vLLM (self-hosted)
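
The tree notes that `protected_regions.py` handles nested tables. One common way to find such regions in HTML output is to track nesting depth and record only the outermost spans, as sketched below. This is an assumption about the general technique, not a description of the library's code; `find_table_regions` is a hypothetical helper.

```python
import re

# Matches opening and closing <table> tags; group(1) is "/" for closing tags.
TABLE_TAG = re.compile(r"<(/?)table\b[^>]*>", re.IGNORECASE)


def find_table_regions(text: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of outermost <table> blocks, nested tables included."""
    regions = []
    depth = 0
    region_start = 0
    for match in TABLE_TAG.finditer(text):
        if match.group(1) == "":       # opening tag
            if depth == 0:
                region_start = match.start()
            depth += 1
        elif depth > 0:                # closing tag that matches an open table
            depth -= 1
            if depth == 0:
                regions.append((region_start, match.end()))
    return regions
```

A chunker can then avoid placing a split point inside any returned span, which keeps each table, including its inner tables, in a single chunk.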

Requirements

- Python 3.12+
- Required dependencies are installed automatically (see pyproject.toml)

System Dependencies

For full functionality, you may need:

- Tesseract OCR: for local OCR fallback
- LibreOffice: for DOC/RTF conversion (optional)
- Poppler: for PDF image extraction

Tag Customization

from xgen_doc2chunk import DocumentProcessor

processor = DocumentProcessor(
    # Image tag format (default: [Image:path])
    image_directory="output/images",
    image_tag_prefix="[Image:",
    image_tag_suffix="]",
    
    # Page tag format (default: [Page Number: N])
    page_tag_prefix="[Page Number: ",
    page_tag_suffix="]",
    
    # Slide tag format (default: [Slide Number: N])
    slide_tag_prefix="[Slide Number: ",
    slide_tag_suffix="]",
    
    # Chart tag format
    chart_tag_prefix="[chart]",
    chart_tag_suffix="[/chart]",
    
    # Metadata tag format
    metadata_tag_prefix="<Document-Metadata>",
    metadata_tag_suffix="</Document-Metadata>",
)
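
Because the tags are plain text with a known prefix and suffix, downstream code can locate them with a regular expression. The sketch below splits extracted text into pages using the default page tag. It assumes each tag precedes the content of its page, which this document does not state; `split_by_page_tags` is a hypothetical helper, not part of the library.

```python
import re


def split_by_page_tags(text: str,
                       prefix: str = "[Page Number: ",
                       suffix: str = "]") -> dict[int, str]:
    """Map page number to the text that follows its tag (assumed tag placement)."""
    # Escape the prefix and suffix so characters like "[" are matched literally.
    pattern = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(suffix))
    matches = list(pattern.finditer(text))
    pages = {}
    for index, match in enumerate(matches):
        next_start = matches[index + 1].start() if index + 1 < len(matches) else len(text)
        pages[int(match.group(1))] = text[match.end():next_start].strip()
    return pages
```

If you customize `page_tag_prefix` and `page_tag_suffix`, pass the same values to the parser so both sides agree on the format.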

Documentation

QUICKSTART.md — Comprehensive guide with pipeline overview, OCR setup, and examples
CHANGELOG.md — Release history
CONTRIBUTING.md — Contribution guidelines

License

Apache License 2.0 — see LICENSE for details.

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.

Release history

Version	Release date
0.3.3 (this version)	May 28, 2026
0.3.2	May 19, 2026
0.3.1	May 15, 2026
0.3.0	May 15, 2026
0.2.26	Apr 3, 2026
0.2.23	Mar 24, 2026
0.2.22	Feb 24, 2026
0.2.21	Feb 24, 2026
0.2.20	Feb 20, 2026
0.2.18	Feb 13, 2026
0.2.17	Feb 12, 2026
0.2.15	Feb 11, 2026
0.2.14	Feb 11, 2026
0.2.13	Feb 10, 2026
0.2.12	Feb 9, 2026
0.2.11	Feb 9, 2026
0.2.0	Feb 5, 2026
0.1.54	Feb 5, 2026
0.1.53	Feb 5, 2026
0.1.52	Feb 4, 2026
0.1.51	Feb 4, 2026
0.1.5	Feb 4, 2026
0.1.4	Feb 4, 2026
0.1.3	Feb 2, 2026
0.1.2	Feb 2, 2026
0.1.1	Feb 2, 2026
0.1.0	Feb 2, 2026

Download files

Source Distribution

xgen_doc2chunk-0.3.3.tar.gz (332.3 kB)

Uploaded May 28, 2026 Source

Built Distribution

xgen_doc2chunk-0.3.3-py3-none-any.whl (435.6 kB)

Uploaded May 28, 2026 Python 3

File details

Details for the file xgen_doc2chunk-0.3.3.tar.gz.

File metadata

Download URL: xgen_doc2chunk-0.3.3.tar.gz
Upload date: May 28, 2026
Size: 332.3 kB
Tags: Source
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xgen_doc2chunk-0.3.3.tar.gz
Algorithm	Hash digest
SHA256	`1612f76f855598a896bb4dbe752122cd463938bc72366f26c6bc847968800d53`
MD5	`feb5b0b75c69c331e83aa9a5cbedb9cc`
BLAKE2b-256	`90d7a4067b62db34f7f7eb1602ee35864b698307470d4a390bcb7b93adc5b31d`
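
A downloaded file can be checked against the SHA256 digest listed above with Python's standard `hashlib` module. The helper name `sha256_matches` is hypothetical; the file is read in blocks so large archives do not need to fit in memory.

```python
import hashlib
from pathlib import Path


def sha256_matches(path: str, expected_hex: str, block_size: int = 1 << 20) -> bool:
    """Return True if the file's SHA256 digest equals the expected hex string."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read fixed-size blocks until read() returns an empty bytes object.
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest() == expected_hex.strip().lower()
```

For the source distribution, this would be called as `sha256_matches("xgen_doc2chunk-0.3.3.tar.gz", "1612f76f855598a896bb4dbe752122cd463938bc72366f26c6bc847968800d53")`, assuming the file is in the current directory.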

Provenance

The following attestation bundles were made for xgen_doc2chunk-0.3.3.tar.gz:

Publisher: publish.yml on master0419/xgen_doc2chunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xgen_doc2chunk-0.3.3.tar.gz
- Subject digest: 1612f76f855598a896bb4dbe752122cd463938bc72366f26c6bc847968800d53
- Sigstore transparency entry: 1652784933
- Sigstore integration time: May 28, 2026
Source repository:
- Permalink: master0419/xgen_doc2chunk@4eb0f6b1ccbfb1e6a9f86eaf74160c773aded0df
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/master0419
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4eb0f6b1ccbfb1e6a9f86eaf74160c773aded0df
- Trigger Event: push

File details

Details for the file xgen_doc2chunk-0.3.3-py3-none-any.whl.

File metadata

Download URL: xgen_doc2chunk-0.3.3-py3-none-any.whl
Upload date: May 28, 2026
Size: 435.6 kB
Tags: Python 3
Uploaded using Trusted Publishing? Yes
Uploaded via: twine/6.1.0 CPython/3.13.12

File hashes

Hashes for xgen_doc2chunk-0.3.3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d21b2962776ed3e388918fa0c1d665e38004c52474f6efab0b278da910ed42d5`
MD5	`6fe578171687c649fc1fae0f3aee4625`
BLAKE2b-256	`d26aca79b1a9252869885a24113c4c9ed18ba9bdf5696e7bb0e32bb1ee6be77e`

Provenance

The following attestation bundles were made for xgen_doc2chunk-0.3.3-py3-none-any.whl:

Publisher: publish.yml on master0419/xgen_doc2chunk

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

Statement:
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: xgen_doc2chunk-0.3.3-py3-none-any.whl
- Subject digest: d21b2962776ed3e388918fa0c1d665e38004c52474f6efab0b278da910ed42d5
- Sigstore transparency entry: 1652785116
- Sigstore integration time: May 28, 2026
Source repository:
- Permalink: master0419/xgen_doc2chunk@4eb0f6b1ccbfb1e6a9f86eaf74160c773aded0df
- Branch / Tag: refs/heads/deploy
- Owner: https://github.com/master0419
- Access: public
Publication detail:
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: publish.yml@4eb0f6b1ccbfb1e6a9f86eaf74160c773aded0df
- Trigger Event: push
