
A specialized document chunking library for complex document structures


DocChunker

A specialized document chunking library designed to handle complex document structures in DOCX files. DocChunker processes structured documents containing tables, nested lists, and other complex elements to create semantically meaningful chunks that preserve context. PDF parsing and fuller handling of images, headers/footers, and footnotes are listed under Future Roadmap rather than available today.

Key Features

  • Advanced DOCX Parsing: Handles complex elements like nested lists and tables with merged cells.
  • Contextual Chunking: Preserves document hierarchy (headings, etc.) within chunks.
  • Configurable Strategy: Tune chunk size (tokens) and element-based overlap.
  • Semantic Cohesion: Aims to keep related content (list items, table rows) together.
  • RAG-Optimized: Produces chunks ideal for effective information retrieval.
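
To make "element-based overlap" and "preserves document hierarchy" concrete, here is a minimal sketch of one way such a strategy could work. It is not DocChunker's implementation: the `Element` and `Chunk` classes, the `chunk_elements` function, and the whitespace token count are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    """Hypothetical parsed element: a paragraph, list item, or table row."""
    kind: str
    text: str
    headings: tuple = ()  # heading path the element sits under


@dataclass
class Chunk:
    """Hypothetical chunk: text plus a metadata dictionary."""
    text: str
    metadata: dict = field(default_factory=dict)


def count_tokens(text):
    # Crude whitespace count; a real tokenizer would give different numbers.
    return len(text.split())


def build_chunk(elements):
    # Prefix the chunk with the heading path of its first element so the
    # chunk keeps its place in the document hierarchy (a simplification:
    # overlapped elements may come from the previous section).
    headings = elements[0].headings
    prefix = " > ".join(headings)
    body = "\n".join(e.text for e in elements)
    text = f"{prefix}\n{body}" if prefix else body
    return Chunk(text=text, metadata={"headings": list(headings),
                                      "num_elements": len(elements)})


def chunk_elements(elements, chunk_size=50, overlap_elements=1):
    """Pack whole elements into chunks of roughly chunk_size tokens.

    Elements are never split, and the last `overlap_elements` elements of a
    chunk are repeated at the start of the next one.
    """
    chunks, current = [], []
    for el in elements:
        size = sum(count_tokens(e.text) for e in current)
        if current and size + count_tokens(el.text) > chunk_size:
            chunks.append(build_chunk(current))
            current = current[-overlap_elements:] if overlap_elements else []
        current.append(el)
    if current:
        chunks.append(build_chunk(current))
    return chunks


elements = [
    Element("paragraph", "one two three", ("Intro",)),
    Element("list_item", "four five six", ("Intro",)),
    Element("list_item", "seven eight nine", ("Intro",)),
]
for chunk in chunk_elements(elements, chunk_size=6, overlap_elements=1):
    print(chunk.metadata, repr(chunk.text))
```

Overlapping by whole elements, rather than by a fixed number of characters, means a list item or table row is never cut in half at a chunk boundary.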

Installation

DocChunker requires Python 3.9+ and is best installed using uv, a fast Python package installer and resolver.

# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # On Windows, use: .venv\Scripts\activate

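# Install the published package from PyPI with uv
uv pip install docchunker

# Or, from a clone of the repository, install the listed dependencies
uv pip install -r requirements.txt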

Quick Start

from docchunker import DocChunker

# Initialize the chunker with desired settings
chunker = DocChunker(
    chunk_size=1000,
    chunk_overlap=200,
    preserve_structure=True
)

# Process a document
chunks = chunker.process_document("complex_document.docx")

# Work with chunks
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.metadata['type']} - {len(chunk.text)} chars")
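
Once chunks are available, a quick summary by element type can help sanity-check the output. The sketch below relies only on the two attributes used in the Quick Start, `chunk.text` and `chunk.metadata['type']`. The `summarize_by_type` helper is hypothetical, and the sample chunks and their type labels are stand-ins so the example runs without DocChunker installed.

```python
from collections import defaultdict
from types import SimpleNamespace


def summarize_by_type(chunks):
    """Return {type: (chunk_count, total_chars)} for chunk-like objects."""
    summary = defaultdict(lambda: [0, 0])
    for chunk in chunks:
        kind = chunk.metadata.get("type", "unknown")
        summary[kind][0] += 1
        summary[kind][1] += len(chunk.text)
    return {kind: tuple(stats) for kind, stats in summary.items()}


# Stand-in chunks; the "table" and "paragraph" labels are made up for the demo.
sample_chunks = [
    SimpleNamespace(text="Quarterly results by region", metadata={"type": "table"}),
    SimpleNamespace(text="Install the package first.", metadata={"type": "paragraph"}),
    SimpleNamespace(text="Then run the tests.", metadata={"type": "paragraph"}),
]

for kind, (count, chars) in summarize_by_type(sample_chunks).items():
    print(f"{kind}: {count} chunk(s), {chars} chars")
```

In real use, `sample_chunks` would be replaced by the list returned from `chunker.process_document(...)`.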

Development

To contribute to DocChunker:

# Clone the repository
git clone https://github.com/vladGriguta/DocChunker
cd DocChunker

# Set up development environment
python -m venv .venv
source .venv/bin/activate
uv pip install -e ".[dev]"

# Run tests
pytest

Future Roadmap

  • Chunk Size Homogenization: Implement strategies to reduce chunk size variance.
  • Langchain RAG Examples: Provide integration guides for Langchain.
  • Enhanced Unit Testing: Add more tests for complex tables and lists.
  • Retrieval Evaluation Framework: Develop a framework to assess chunk effectiveness.
  • Increased Test Coverage: Systematically improve overall code coverage.
  • PDF Support: Extend parsing and chunking to PDF documents.
  • Advanced Element Handling: Support for images (captions/alt-text), headers/footers, footnotes.
  • Performance Optimizations: Profile and optimize for very large documents.
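
For the chunk size homogenization item, one possible starting point is simply measuring how uneven chunk sizes are today. The sketch below is an assumption about how that could be done, not part of DocChunker: `size_spread` is a hypothetical helper that reports the mean, the population standard deviation, and the coefficient of variation of chunk lengths in characters.

```python
import statistics


def size_spread(chunk_texts):
    """Summarize how much chunk lengths (in characters) vary."""
    sizes = [len(text) for text in chunk_texts]
    if not sizes:
        raise ValueError("need at least one chunk")
    mean = statistics.fmean(sizes)
    stdev = statistics.pstdev(sizes)
    # Coefficient of variation: spread relative to the mean, 0.0 when uniform.
    cv = stdev / mean if mean else 0.0
    return {"mean": mean, "stdev": stdev, "cv": cv}


print(size_spread(["aaaa", "aaaa"]))
print(size_spread(["aa", "aaaaaa"]))
```

A lower coefficient of variation after a change to the chunking strategy would indicate more uniform chunks.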

License

MIT

About the Author

DocChunker is developed by Vlad Griguta. Connect with me on LinkedIn or GitHub.


