A specialized document chunking library for complex document structures

These details have not been verified by PyPI

Project links

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
Topic
- Text Processing :: Markup

Project description

DocChunker

A specialized document chunking library designed to handle complex document structures. DocChunker processes structured DOCX documents containing tables, nested lists, and other complex elements to create semantically meaningful chunks that preserve context. PDF support and image handling are listed under Future Roadmap rather than as current features.
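To make "chunks that preserve context" concrete, here is a minimal sketch that walks a flat list of document elements, tracks the current heading path, and attaches that path to each body element. The `Element` type, the `attach_heading_context` function, and the input format are hypothetical illustrations, not DocChunker's internal representation or API.

```python
from typing import NamedTuple


class Element(NamedTuple):
    """Hypothetical flat document element (not DocChunker's own type)."""
    kind: str       # "heading" or "paragraph"
    text: str
    level: int = 0  # heading level (1 = top level); unused for paragraphs


def attach_heading_context(elements):
    """Return (heading_path, text) pairs for every non-heading element."""
    path = []    # current heading stack, e.g. ["Intro", "Scope"]
    result = []
    for el in elements:
        if el.kind == "heading":
            # Drop headings at the same or a deeper level, then push this one.
            del path[el.level - 1:]
            path.append(el.text)
        else:
            result.append((" > ".join(path), el.text))
    return result


doc = [
    Element("heading", "Intro", 1),
    Element("paragraph", "Welcome."),
    Element("heading", "Scope", 2),
    Element("paragraph", "Covers DOCX."),
    Element("heading", "Usage", 1),
    Element("paragraph", "Call the chunker."),
]
for heading_path, text in attach_heading_context(doc):
    print(f"[{heading_path}] {text}")
```

Carrying the heading path with each piece of text means a chunk retrieved in isolation still says where in the document it came from.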

Key Features

Advanced DOCX Parsing: Handles complex elements like nested lists and tables with merged cells.
Contextual Chunking: Preserves document hierarchy (headings, etc.) within chunks.
Configurable Strategy: Tune chunk size (tokens) and element-based overlap.
Semantic Cohesion: Aims to keep related content (list items, table rows) together.
RAG-Optimized: Produces chunks intended for use in retrieval-augmented generation (RAG) pipelines.
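The idea of a token budget combined with element-based overlap can be sketched in a few lines. This is an illustration under stated assumptions, not DocChunker's actual algorithm: `pack_elements` and `n_tokens` are hypothetical names, tokens are approximated by whitespace-separated words, and each element (a paragraph, list item, or table row) is treated as an indivisible unit.

```python
def n_tokens(text):
    """Approximate token count by whitespace-separated words (an assumption)."""
    return len(text.split())


def pack_elements(elements, chunk_size=200, overlap=1):
    """Greedily pack text elements into chunks of at most chunk_size tokens.

    The last `overlap` elements of a chunk are repeated at the start of the
    next one. Elements are never split, so a single element larger than
    chunk_size ends up in a chunk of its own.
    """
    def used(items):
        return sum(n_tokens(t) for t in items)

    chunks, current = [], []
    fresh = 0  # elements in `current` that have not appeared in a chunk yet
    for el in elements:
        if fresh and used(current) + n_tokens(el) > chunk_size:
            chunks.append(current)
            current = current[-overlap:] if overlap > 0 else []
            # Shrink the overlap if it leaves no room for the new element.
            while current and used(current) + n_tokens(el) > chunk_size:
                current.pop(0)
            fresh = 0
        current.append(el)
        fresh += 1
    if fresh:
        chunks.append(current)
    return chunks


items = ["one two three", "four five", "six seven eight", "nine ten"]
for chunk in pack_elements(items, chunk_size=6, overlap=1):
    print(chunk)
```

Overlapping by whole elements rather than by a fixed number of characters keeps list items and table rows intact at chunk boundaries.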

Installation

DocChunker requires Python 3.9+ and is best installed using uv, a fast Python package installer and resolver.

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate

# Install from PyPI with uv
uv pip install docchunker

Quick Start

from docchunker import DocChunker

# Initialize the chunker with desired settings
chunker = DocChunker(chunk_size=200)

# Process a document
chunks = chunker.process_document("complex_document.docx")

# Work with chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.metadata['type']} - {len(chunk.text)} chars")
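A small helper can make the inspection loop above reusable. The sketch below relies only on the two attributes shown in the Quick Start (`chunk.text` and `chunk.metadata['type']`). The `summarize_chunks` name is hypothetical, the type values "paragraph" and "table" are assumed for illustration, and `metadata` is assumed to behave like a dict. Stand-in objects are used so the example runs without a document on disk.

```python
from collections import Counter
from types import SimpleNamespace


def summarize_chunks(chunks):
    """Return {type: (chunk_count, total_chars)} for a list of chunks."""
    counts, chars = Counter(), Counter()
    for chunk in chunks:
        kind = chunk.metadata.get("type", "unknown")
        counts[kind] += 1
        chars[kind] += len(chunk.text)
    return {kind: (counts[kind], chars[kind]) for kind in counts}


# Stand-ins with the same two attributes as the chunks in the Quick Start.
demo_chunks = [
    SimpleNamespace(text="Intro text", metadata={"type": "paragraph"}),
    SimpleNamespace(text="a | b", metadata={"type": "table"}),
    SimpleNamespace(text="More text", metadata={"type": "paragraph"}),
]
for kind, (count, total) in summarize_chunks(demo_chunks).items():
    print(f"{kind}: {count} chunks, {total} chars")
```

With a real document, the list returned by `chunker.process_document(...)` could be passed in place of `demo_chunks`.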

RAG Demo

For an end-to-end example of building a simple RAG system using DocChunker with LangChain, check out the examples/RAG_demo.ipynb notebook.
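For readers who want the retrieval step in miniature before opening the notebook, here is a toy sketch that ranks chunk texts against a query with bag-of-words cosine similarity. It does not reproduce the notebook and does not use LangChain; the function names and sample texts are made up for illustration, and a real RAG system would typically use embeddings and a vector store instead.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Lowercased word counts for a piece of text."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunk_texts, k=2):
    """Return the k chunk texts most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(chunk_texts,
                    key=lambda t: cosine(q, bag_of_words(t)),
                    reverse=True)
    return ranked[:k]


texts = [
    "Invoices are due within 30 days",
    "The table lists shipping rates by region",
    "Refunds are processed within 14 days",
]
print(retrieve("shipping rates", texts, k=1))
```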

Development

To contribute to DocChunker:

# Clone the repository
git clone https://github.com/vladGriguta/DocChunker
cd DocChunker

# Set up development environment
python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Run tests
pytest

Future Roadmap

Chunk Size Homogenization: Implement strategies to reduce chunk size variance.
LangChain RAG Examples: Provide integration guides for LangChain.
Enhanced Unit Testing: Add more tests for complex tables and lists.
Retrieval Evaluation Framework: Develop a framework to assess chunk effectiveness.
Increased Test Coverage: Systematically improve overall code coverage.
PDF Support: Extend parsing and chunking to PDF documents.
Advanced Element Handling: Support for images (captions/alt-text), headers/footers, footnotes.
Performance Optimizations: Profile and optimize for very large documents.

License

MIT

About the Author

DocChunker is developed by Vlad Griguta, who can be reached on LinkedIn or GitHub.

Release history

- 0.3.0 (Sep 5, 2025)
- 0.2.0 (Aug 27, 2025)
- 0.1.4 (Jun 6, 2025)
- 0.1.3 (Jun 6, 2025)
- 0.1.2 (Jun 5, 2025), this version
- 0.1.0 (Jun 5, 2025)

Download files

Source Distribution

docchunker-0.1.2.tar.gz (1.5 MB), uploaded Jun 5, 2025

Built Distribution

docchunker-0.1.2-py3-none-any.whl (5.0 kB, Python 3), uploaded Jun 5, 2025

File details

Details for the file docchunker-0.1.2.tar.gz.

File metadata

Download URL: docchunker-0.1.2.tar.gz
Upload date: Jun 5, 2025
Size: 1.5 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docchunker-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`a7412bd7c5ac56d5177e665ada9dc1bc99dd88ded8e3e4b45aae3baff903ff4e`
MD5	`d2d847d49b9c912bbbc8be40d95999a1`
BLAKE2b-256	`758f00cad48f56250fd1cb2924d27c7e0351a9acdedcabb417c27760c1213b41`
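
A downloaded file can be checked against the published SHA256 digest with Python's standard `hashlib` module. The sketch below is one way to do it; the helper names `sha256_of` and `verify_download` are illustrative, and the local file path in the commented usage line is an assumption about where the archive was saved.

```python
import hashlib


def sha256_of(path, block_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_download(path, expected_hex):
    """Return True if the file's SHA256 matches the published digest."""
    return sha256_of(path) == expected_hex.strip().lower()


# Published SHA256 for docchunker-0.1.2.tar.gz, copied from the table above.
EXPECTED = "a7412bd7c5ac56d5177e665ada9dc1bc99dd88ded8e3e4b45aae3baff903ff4e"
# Usage, assuming the archive is in the current directory:
# print(verify_download("docchunker-0.1.2.tar.gz", EXPECTED))
```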

File details

Details for the file docchunker-0.1.2-py3-none-any.whl.

File metadata

Download URL: docchunker-0.1.2-py3-none-any.whl
Upload date: Jun 5, 2025
Size: 5.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.9

File hashes

Hashes for docchunker-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d782851b40ad0eba12b4b8ef27970954e26435c9be28784f1591e54a125b0dbf`
MD5	`82c62d485eba76cf10cd221794c3ac48`
BLAKE2b-256	`7771fb8b71c4ea454ccb68153125ce70db7166c1af55e62b4621ad18eff1a1c4`
