A semantic document chunking library

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent

Project description

ChunkIt Pro - Semantic Document Chunking Library

ChunkIt Pro is a Python library for semantic document chunking. It splits documents into meaningful segments based on content similarity rather than arbitrary size limits.

Features

Multiple Document Formats: Supports PDF, DOCX, TXT, and Markdown
Semantic Analysis: Uses embedding models to understand content similarity
Multiple Embedding Providers: OpenAI, VoyageAI, and Sentence Transformers support
Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
Configurable Thresholds: Three methods for similarity threshold computation
Visual Analysis: Generates plots showing similarity patterns
Easy Integration: Simple API for quick integration into existing projects

Installation

Clone the repository:

git clone https://github.com/adw777/chunk_it_pro
cd chunk_it_pro

Install dependencies:

pip install -r requirements.txt

Set up environment variables (create a .env file):

OPENAI_API_KEY=your_openai_api_key_here
VOYAGEAI_API_KEY=your_voyage_api_key_here
OMNIPARSE_API_URL=your_omniparse_api_url_here

Quick Start

Basic Usage

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    
    # Process document
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf",
        embedding_provider="voyage",  # or "openai", "axon"
        threshold_method="percentile",
        percentile=95,
        max_chunk_len=1024
    )
    
    print(f"Created {len(semantic_chunks)} semantic chunks")
    print(f"Similarity threshold: {threshold:.4f}")

# Run the example
asyncio.run(main())

Convenience Function (with default values)

import asyncio
from chunk_it_pro.pipeline import chunk_document

async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="voyage"
    )
    
    # Use the chunks in your application
    for i, chunk in enumerate(semantic_chunks):
        print(f"Chunk {i+1}: {chunk[:100]}...")

asyncio.run(main())

Configuration

Environment Variables

| Variable | Description | Required |
| --- | --- | --- |
| `OPENAI_API_KEY` | OpenAI API key for embeddings | Optional* |
| `VOYAGEAI_API_KEY` | VoyageAI API key for embeddings | Optional* |
| `OMNIPARSE_API_URL` | Omniparse API URL for document parsing | Optional |

*At least one embedding provider API key is required.
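
The rule above can be checked before running the pipeline. The following is a minimal sketch using only the standard library; the helper name `check_environment` is hypothetical and not part of ChunkIt Pro, and only the variable names come from the table.

```python
import os
from typing import Dict, Mapping, Optional

# Variable names come from the table above; the helper itself is hypothetical.
EMBEDDING_KEYS = ("OPENAI_API_KEY", "VOYAGEAI_API_KEY")
OPTIONAL_VARS = ("OMNIPARSE_API_URL",)


def check_environment(env: Optional[Mapping[str, str]] = None) -> Dict[str, bool]:
    """Report which variables are set; fail if no embedding key is present."""
    env = os.environ if env is None else env
    status = {
        name: bool(env.get(name, "").strip())
        for name in EMBEDDING_KEYS + OPTIONAL_VARS
    }
    if not any(status[name] for name in EMBEDDING_KEYS):
        raise RuntimeError("Set at least one of: " + ", ".join(EMBEDDING_KEYS))
    return status
```

Calling `check_environment()` with no argument inspects the current process environment. Passing a dictionary makes the check easy to test.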

Embedding Providers

VoyageAI (Recommended for legal documents):
- Model: voyage-law-2
OpenAI:
- Model: text-embedding-3-large
- High quality general-purpose embeddings
Axon (Local):
- Model: Fine-tuned legal embedding model Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct
- Runs locally, no API costs
- Requires Sentence Transformers setup

Threshold Methods

Percentile (Default): Uses the Nth percentile of cosine distances
Gradient: Finds points with highest gradient change
Local Maxima: Uses local maxima in distance patterns
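
To make the three methods concrete, here is a rough NumPy sketch of how each could turn a series of cosine distances into a single threshold. These functions are illustrative assumptions, not the library's implementation. The gradient and local-maxima variants in particular are one plausible reading of the short descriptions above.

```python
import numpy as np


def percentile_threshold(distances: np.ndarray, percentile: float = 95) -> float:
    """Threshold at the Nth percentile of the distances."""
    return float(np.percentile(distances, percentile))


def gradient_threshold(distances: np.ndarray) -> float:
    """Assumed reading: the sorted distance at which the slope is steepest."""
    ordered = np.sort(np.asarray(distances, dtype=float))
    slopes = np.gradient(ordered)
    return float(ordered[int(np.argmax(slopes))])


def local_maxima_threshold(distances: np.ndarray) -> float:
    """Assumed reading: the median of the interior local maxima."""
    d = np.asarray(distances, dtype=float)
    is_peak = (d[1:-1] > d[:-2]) & (d[1:-1] > d[2:])
    peaks = d[1:-1][is_peak]
    # Fall back to the overall maximum if the series has no interior peak.
    return float(np.median(peaks)) if peaks.size else float(d.max())
```

With the percentile method, a higher percentile yields a higher threshold, so fewer distances exceed it.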

API Reference

SemanticChunkingPipeline

Main class for semantic chunking operations.

from typing import Any, Dict, Optional, Tuple

# Signatures only; method bodies are elided.
class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: Optional[str] = None, voyage_api_key: Optional[str] = None): ...

    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "voyage",
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[list, list, float]: ...

    def get_chunk_statistics(self) -> Dict[str, Any]: ...
    def print_statistics(self): ...

chunk_document Function

Convenience function for quick document processing with default settings.

from typing import Tuple

# Signature only; the body is elided.
async def chunk_document(
    file_path: str,
    embedding_provider: str = "voyage",
    threshold_method: str = "percentile",
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[list, list, float]: ...

Output Files

When save_files=True, the pipeline creates:

initial_chunks.txt: Fixed-size initial chunks (256 tokens each)
semantic_chunks.txt: Final semantic chunks
embeddings.npy: Numpy array of embeddings (for initial_chunks)
cosine_distances.png: Visualization of similarity patterns between initial_chunks
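
The saved embeddings can be reloaded for further analysis. The sketch below assumes `embeddings.npy` holds a 2-D array with one non-zero row per initial chunk; this layout is not confirmed by the source. It recomputes the cosine distance between consecutive chunks, and it writes a small synthetic array to a temporary file so that it runs on its own. The function name is hypothetical.

```python
import os
import tempfile

import numpy as np


def consecutive_cosine_distances(embeddings: np.ndarray) -> np.ndarray:
    """Cosine distance between each row and the row that follows it."""
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    similarities = np.sum(unit[:-1] * unit[1:], axis=1)
    return 1.0 - similarities


# Synthetic stand-in for the pipeline output (assumed shape: n_chunks x dim).
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "embeddings.npy")
    np.save(path, np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]))
    distances = consecutive_cosine_distances(np.load(path))

print(distances)
```

For real output, replace the temporary file with the path to the `embeddings.npy` written by the pipeline.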

Project Structure

chunk_it_pro/
├── chunk_it_pro/             # Main package
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── pipeline.py           # Main pipeline implementation
│   ├── parsers/              # Document parsing modules
│   │   ├── __init__.py
│   │   ├── document_parser.py
│   │   └── omniparse.py
│   ├── chunkers/             # Chunking algorithms
│   │   ├── __init__.py
│   │   ├── initial_chunker.py
│   │   └── semantic_chunker.py
│   ├── embeddings/           # Embedding generation
│   │   ├── __init__.py
│   │   ├── embedding_analyzer.py
│   │   └── voyage_client.py
│   └── utils/                # Utility modules
│       ├── __init__.py
│       └── singleton.py
├── example.py                # Usage examples
├── requirements.txt          # Dependencies
└── README.md                 # Documentation

Algorithm Overview

ChunkIt Pro uses a two-pass algorithm:

Initial Chunking

Parse document to markdown format
Identify structural breakpoints (headers, page breaks)
Create fixed-size chunks (256 tokens) respecting breakpoints
Generate embeddings for each chunk
Compute cosine distances between consecutive chunks
Determine similarity threshold using statistical methods
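
The chunk-creation steps above can be sketched as follows. This is a simplified illustration, not the library's `initial_chunker`. It counts whitespace-separated words instead of model tokens and treats only Markdown headers as breakpoints, so page breaks are ignored. The function name is hypothetical.

```python
import re
from typing import List

# Markdown ATX header, e.g. "## Title"; page-break markers are not handled.
HEADER = re.compile(r"^#{1,6}\s")


def initial_chunks(markdown: str, chunk_size: int = 256) -> List[str]:
    """Chunks of at most chunk_size words; a header always starts a new chunk."""
    chunks: List[str] = []
    current: List[str] = []

    def flush() -> None:
        # Emit the words collected so far as one chunk.
        if current:
            chunks.append(" ".join(current))
            current.clear()

    for line in markdown.splitlines():
        if HEADER.match(line):
            flush()  # structural breakpoint
        for token in line.split():
            current.append(token)
            if len(current) == chunk_size:
                flush()  # size limit reached
    flush()
    return chunks
```

Because a header forces a flush, a chunk can be shorter than `chunk_size` even in the middle of a document.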

Semantic Refinement

Split text into sentences
Generate sentence-level embeddings
Group sentences based on similarity threshold
Merge similar adjacent chunks (respecting max length)
Output final semantic chunks
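
The grouping step can be illustrated with a greedy single pass over precomputed sentence embeddings. This sketch is an assumption about how such grouping might work, not the library's `semantic_chunker`. It folds grouping and the length limit into one loop, and it measures `max_chunk_len` in whitespace-separated words because the source does not state the unit.

```python
from typing import List, Sequence

import numpy as np


def group_sentences(
    sentences: Sequence[str],
    embeddings: np.ndarray,
    threshold: float,
    max_chunk_len: int = 1024,
) -> List[str]:
    """Start a new chunk when the distance to the previous sentence exceeds
    threshold, or when adding the sentence would exceed max_chunk_len words."""
    if len(sentences) == 0:
        return []
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    distances = 1.0 - np.sum(unit[:-1] * unit[1:], axis=1)

    chunks: List[str] = []
    current = [sentences[0]]
    length = len(sentences[0].split())
    for sentence, distance in zip(sentences[1:], distances):
        n_words = len(sentence.split())
        if distance > threshold or length + n_words > max_chunk_len:
            chunks.append(" ".join(current))
            current, length = [], 0
        current.append(sentence)
        length += n_words
    chunks.append(" ".join(current))
    return chunks
```

A lower threshold splits more aggressively, while a smaller `max_chunk_len` caps chunk size regardless of similarity.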

Advanced Usage

Custom Configuration

from chunk_it_pro.config import Config

# Override default settings
Config.INITIAL_CHUNK_SIZE = 512
Config.MAX_CHUNK_LENGTH = 2048
Config.DEFAULT_PERCENTILE = 90

# Check configuration
status = Config.validate_config()
print(status)

Processing Multiple Documents

import asyncio
from pathlib import Path
from chunk_it_pro import SemanticChunkingPipeline

async def process_multiple_documents():
    pipeline = SemanticChunkingPipeline()
    
    documents = Path("documents/").glob("*.pdf")
    
    for doc_path in documents:
        print(f"Processing {doc_path.name}...")
        
        initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
            file_path=str(doc_path),
            save_files=True,
            verbose=False
        )
        
        # Save results with document name
        output_dir = Path("output") / doc_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        
        # Custom processing for each document...

asyncio.run(process_multiple_documents())

Troubleshooting

Common Issues

Missing API Keys: Ensure environment variables are set correctly
Document Parsing Errors: Check if Omniparse API is accessible
Memory Issues: Reduce batch size or chunk size for large documents

Performance Tips

Use VoyageAI for best performance/cost balance
Adjust max_chunk_len based on your use case
Set save_files=False for better performance when processing many documents
Use local embedding models (Axon) for privacy-sensitive applications and to avoid API costs

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.

Release history

- 0.1.4: Jun 30, 2025
- 0.1.3: Jun 27, 2025
- 0.1.2: Jun 27, 2025
- 0.1.0 (this version): Jun 27, 2025

Download files

Source Distribution

chunk_it_pro-0.1.0.tar.gz (26.4 kB)

Uploaded Jun 27, 2025 Source

Built Distribution

chunk_it_pro-0.1.0-py3-none-any.whl (20.0 kB)

Uploaded Jun 27, 2025 Python 3

File details

Details for the file chunk_it_pro-0.1.0.tar.gz.

File metadata

Download URL: chunk_it_pro-0.1.0.tar.gz
Upload date: Jun 27, 2025
Size: 26.4 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for chunk_it_pro-0.1.0.tar.gz
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8e1ba30653ee19493bba2f1a73eb6b64e020d70df64ba0a2a5e97e6dc95bd8b1` |
| MD5 | `79e50ab67388e1302565dc73c06e41c5` |
| BLAKE2b-256 | `e1688247d1349a0a86c74f32a27516f7ce2be4a836e44c118f76abde165e92cc` |

File details

Details for the file chunk_it_pro-0.1.0-py3-none-any.whl.

File metadata

Download URL: chunk_it_pro-0.1.0-py3-none-any.whl
Upload date: Jun 27, 2025
Size: 20.0 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for chunk_it_pro-0.1.0-py3-none-any.whl
| Algorithm | Hash digest |
| --- | --- |
| SHA256 | `8324821588c87b9537a3c917a79aa996dd75897dafa6bc869bdf4e8525187257` |
| MD5 | `935eb39b924d691f00608fb1f35e7139` |
| BLAKE2b-256 | `6a4cc2e52144b69fe333734036e7c30d64e8e6a65a49ade0548fbe2d2a4c0135` |
