
Document Chunker

A comprehensive Python library for document chunking with intelligent context generation, designed specifically for RAG (Retrieval-Augmented Generation) applications.

Features

  • Multiple Document Formats: Support for PDF and text files (.txt, .md)
  • Flexible Chunking Strategies: Recursive and semantic text splitting
  • Context Generation: AI-powered context generation using OpenAI models
  • Parallel Processing: Multi-threaded context generation for efficiency
  • Multiple Output Formats: JSON and plain-text output
  • CLI Interface: Easy-to-use command-line interface
  • Configurable: Extensive configuration options for different use cases

Installation

pip install contextual-chunker

Quick Start

Using as a Library

from document_chunker import DocumentChunker, create_chunking_config

# Create configuration
config = create_chunking_config(
    openai_api_key="your-openai-api-key",
    chunk_size=1500,
    chunk_overlap=100,
    chunking_strategy="recursive",
    save_contexts=True
)

# Initialize chunker
chunker = DocumentChunker(config)

# Process PDF files
results = chunker.process_pdf_files(["document.pdf"])

# Or process a directory
results = chunker.process_directory("./documents")

# Save results
output_file = chunker.save_results(results)
print(f"Results saved to: {output_file}")

Using the CLI

# Process a single PDF file
document-chunker document.pdf --chunk-size 1000 --output-dir ./output

# Process a directory with custom settings
document-chunker ./documents --strategy semantic --chunk-size 1500 --save-txt

# Process without context generation
document-chunker ./documents --no-context --chunk-size 800

Configuration Options

ChunkingConfig Parameters

  • openai_api_key: Your OpenAI API key (required for context generation)
  • chunk_size: Maximum size of each chunk in characters (default: 1000)
  • chunk_overlap: Overlap between chunks in characters (default: 200)
  • chunking_strategy: "recursive" or "semantic" (default: "recursive")
  • save_contexts: Enable AI context generation (default: True)
  • context_model: OpenAI model for context generation (default: "gpt-4o-mini")
  • parallel_threads: Number of threads for parallel processing (default: 5)
  • output_dir: Directory for output files (default: "./chunked_documents")
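To make the interaction between chunk_size and chunk_overlap concrete, here is a minimal, self-contained sketch of fixed-size chunking with overlap. The chunk_with_overlap helper is illustrative only, not part of this library's API:

```python
def chunk_with_overlap(text: str, chunk_size: int, chunk_overlap: int):
    """Split text into fixed-size chunks where each chunk repeats the
    last `chunk_overlap` characters of the previous one."""
    step = chunk_size - chunk_overlap  # distance between chunk starts
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# With the defaults (chunk_size=1000, chunk_overlap=200), consecutive
# chunks start 800 characters apart, so a 2500-character document
# yields 4 chunks (the last one shorter than chunk_size).
chunks = chunk_with_overlap("a" * 2500, chunk_size=1000, chunk_overlap=200)
```

The overlap means a sentence cut at a chunk boundary still appears whole in the next chunk, which helps retrieval.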

Chunking Strategies

Recursive Text Splitter

Splits text using a hierarchy of separators (paragraphs → sentences → words → characters) while respecting chunk size limits.
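The idea can be sketched in a few lines. This is an illustrative re-implementation under stated assumptions, not the library's actual code; the separator hierarchy and the recursive_split name are chosen for the example:

```python
def recursive_split(text: str, chunk_size: int,
                    separators=("\n\n", "\n", " ", "")):
    """Split on the coarsest separator first; recurse into finer
    separators only for pieces that still exceed chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    sep, rest = separators[0], separators[1:]
    if sep == "":
        # Last resort: hard split on character boundaries.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    chunks, buf = [], ""
    for part in text.split(sep):
        candidate = buf + sep + part if buf else part
        if len(candidate) <= chunk_size:
            buf = candidate  # keep packing pieces into the current chunk
        else:
            if buf:
                chunks.append(buf)
            if len(part) > chunk_size:
                # This piece alone is too big: recurse with finer separators.
                chunks.extend(recursive_split(part, chunk_size, rest))
                buf = ""
            else:
                buf = part
    if buf:
        chunks.append(buf)
    return chunks
```

Paragraph breaks are tried first, so chunks tend to end at natural boundaries and only degrade to word- or character-level splits when a single unit is larger than chunk_size.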

Semantic Text Splitter

Preserves semantic meaning by splitting on paragraph and sentence boundaries first, ensuring coherent chunks.
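A rough sketch of this strategy, again illustrative rather than the library's implementation (the regex sentence split and space-joining of units are simplifying assumptions):

```python
import re

def semantic_split(text: str, chunk_size: int):
    """Greedily pack paragraphs (or sentences, for oversized
    paragraphs) into chunks without breaking mid-sentence."""
    units = []
    for para in text.split("\n\n"):
        if len(para) <= chunk_size:
            units.append(para)
        else:
            # Fall back to sentence boundaries inside long paragraphs.
            units.extend(re.split(r"(?<=[.!?])\s+", para))
    chunks, buf = [], ""
    for unit in units:
        candidate = f"{buf} {unit}".strip() if buf else unit
        if len(candidate) <= chunk_size or not buf:
            buf = candidate  # note: an oversized lone unit is kept whole
        else:
            chunks.append(buf)
            buf = unit
    if buf:
        chunks.append(buf)
    return chunks
```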

Context Generation

The library can automatically generate contextual information for each chunk using OpenAI's models. This context helps improve retrieval accuracy in RAG applications by providing additional information about where each chunk fits within the larger document.
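The pattern behind this looks roughly like the sketch below, using the OpenAI Python SDK directly. The prompt wording and the helper names (build_context_prompt, generate_context) are assumptions for illustration; the library's actual prompt and internals may differ:

```python
def build_context_prompt(document: str, chunk: str) -> str:
    """Prompt asking the model to situate `chunk` within `document`
    (assumed wording, not the library's actual prompt)."""
    return (
        "Here is the full document:\n\n" + document +
        "\n\nHere is one chunk from it:\n\n" + chunk +
        "\n\nWrite a brief context that situates this chunk within the "
        "document to improve retrieval. Answer with the context only."
    )

def generate_context(document: str, chunk: str,
                     model: str = "gpt-4o-mini") -> str:
    # Requires the `openai` package and OPENAI_API_KEY in the environment.
    from openai import OpenAI
    client = OpenAI()
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": build_context_prompt(document, chunk)}],
    )
    return response.choices[0].message.content
```

The generated context is typically stored alongside the chunk (or prepended to it) before embedding, so the retriever sees the chunk's place in the document.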

Output Formats

JSON Output

Structured output containing:

  • Document metadata
  • Individual chunks with content and metadata
  • Context information (if enabled)
  • Processing statistics
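As a sketch of what such a record might contain (every key below is illustrative; the library's actual schema may differ):

```python
# Hypothetical result record -- field names are examples,
# not the library's actual output schema.
example_result = {
    "document": {"path": "document.pdf", "pages": 12},
    "chunks": [
        {
            "content": "First chunk text...",
            "metadata": {"index": 0, "char_start": 0, "char_end": 998},
            "context": "This chunk introduces the paper's main topic.",
        }
    ],
    "stats": {"num_chunks": 1, "strategy": "recursive"},
}
```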

Text Output

Simple text file with all chunks for easy review and debugging.

CLI Usage

document-chunker [OPTIONS] INPUT

Arguments:
  INPUT                    Input file or directory path

Options:
  -o, --output-dir TEXT    Output directory (default: ./chunked_documents)
  -s, --chunk-size INT     Chunk size in characters (default: 1000)
  -p, --chunk-overlap INT  Chunk overlap in characters (default: 200)
  -t, --strategy CHOICE    Chunking strategy: recursive|semantic (default: recursive)
  --no-context             Disable context generation
  --context-model TEXT     OpenAI model for context (default: gpt-4o-mini)
  -j, --threads INT        Parallel threads (default: 5)
  -e, --extensions LIST    File extensions to process (default: .pdf .txt .md)
  --save-txt               Also save chunks to text file
  -r, --recursive          Process directories recursively

Environment Variables

Set your OpenAI API key:

export OPENAI_API_KEY="your-api-key-here"

Or create a .env file:

OPENAI_API_KEY=your-api-key-here
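If you use a .env file, load it before building the configuration. A minimal sketch, assuming the third-party python-dotenv package (not a dependency documented here):

```python
import os

try:
    from dotenv import load_dotenv  # pip install python-dotenv
    load_dotenv()  # reads variables from a local .env file, if present
except ImportError:
    pass  # fall back to variables already set in the shell environment

api_key = os.environ.get("OPENAI_API_KEY")
```

The key can then be passed to create_chunking_config via its openai_api_key parameter.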

Examples

Basic PDF Processing

from document_chunker import DocumentChunker, create_chunking_config

config = create_chunking_config(
    chunk_size=1000,
    save_contexts=False  # Disable context generation
)

chunker = DocumentChunker(config)
results = chunker.process_pdf_files(["research_paper.pdf"])
chunker.save_results(results)

Advanced Configuration with Context

config = create_chunking_config(
    openai_api_key="sk-...",
    chunk_size=1500,
    chunk_overlap=150,
    chunking_strategy="semantic",
    context_model="gpt-4",
    parallel_threads=8,
    save_contexts=True
)

chunker = DocumentChunker(config)
results = chunker.process_directory("./research_papers", recursive=True)
output_file = chunker.save_results(results)

# Also save as text file
from document_chunker import save_chunks_to_txt
save_chunks_to_txt(output_file, "chunks.txt")

Requirements

  • Python 3.8+
  • OpenAI API key (for context generation)
  • PyMuPDF or PyPDF2 (for PDF processing)

License

MIT License

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Download files

Download the file for your platform.

Source Distribution

contextual_chunker-0.1.0.tar.gz (13.8 kB)


Built Distribution


contextual_chunker-0.1.0-py3-none-any.whl (12.4 kB)


File details

Details for the file contextual_chunker-0.1.0.tar.gz.

File metadata

  • Download URL: contextual_chunker-0.1.0.tar.gz
  • Size: 13.8 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.12.6

File hashes

Hashes for contextual_chunker-0.1.0.tar.gz
  • SHA256: 899212cb6b5e32575046d081f19303600bf58a6f9f03e1099abf79207e88ef81
  • MD5: 1577029afa95aea5362fcfb035da2132
  • BLAKE2b-256: fcbee1e9a7bfcb89e0ef2c4d085b2ed6b6b0c2bda51f1f142fc993b575c7f6ec


File details

Details for the file contextual_chunker-0.1.0-py3-none-any.whl.


File hashes

Hashes for contextual_chunker-0.1.0-py3-none-any.whl
  • SHA256: c11027d840af46c901c0dcff0dc45de3e522a41e150358fdf108a369bc3b39df
  • MD5: 008f32bc1d0f9c3af557c3ada07fbe98
  • BLAKE2b-256: 053ac1348b36802405c5272685f8a83196e29539dd0c5d4d4855511513253acc

