Skip to main content

A semantic document chunking library

Project description

ChunkIt Pro - Semantic Document Chunking Library

Python library for document chunking using semantic analysis. ChunkIt Pro breaks down documents into meaningful segments based on content similarity rather than arbitrary size limits.

Features

  • Multiple Document Formats: Supports PDF, DOCX, TXT, MARKDOWN
  • Semantic Analysis: Uses embedding models to understand content similarity
  • Multiple Embedding Providers: OpenAI, VoyageAI, and Sentence Transformers support
  • Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
  • Configurable Thresholds: Three methods for similarity threshold computation
  • Visual Analysis: Generates plots showing similarity patterns
  • Easy Integration: Simple API for quick integration into existing projects

Installation

  1. Clone the repository:
git clone https://github.com/adw777/chunk_it_pro
cd chunk_it_pro
  1. Install dependencies:
pip install -r requirements.txt
  1. Set up environment variables (create a .env file):
OPENAI_API_KEY=your_openai_api_key_here
VOYAGEAI_API_KEY=your_voyage_api_key_here
OMNIPARSE_API_URL=your_omniparse_api_url_here

Quick Start

Basic Usage

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    
    # Process document
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf",
        embedding_provider="voyage",  # or "openai", "axon"
        threshold_method="percentile",
        percentile=95,
        max_chunk_len=1024
    )
    
    print(f"Created {len(semantic_chunks)} semantic chunks")
    print(f"Similarity threshold: {threshold:.4f}")

# Run the example
asyncio.run(main())

Convenience Function (with default values)

import asyncio
from chunk_it_pro.pipeline import chunk_document

async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="voyage"
    )
    
    # Use the chunks in your application
    for i, chunk in enumerate(semantic_chunks):
        print(f"Chunk {i+1}: {chunk[:100]}...")

asyncio.run(main())

Configuration

Environment Variables

Variable Description Required
OPENAI_API_KEY OpenAI API key for embeddings Optional*
VOYAGEAI_API_KEY VoyageAI API key for embeddings Optional*
OMNIPARSE_API_URL Omniparse API URL for document parsing Optional

*At least one embedding provider API key is required.

Embedding Providers

  1. VoyageAI (Recommended for legal documents):

    • Model: voyage-law-2
  2. OpenAI:

    • Model: text-embedding-3-large
    • High quality general-purpose embeddings
  3. Axon (Local):

Threshold Methods

  1. Percentile (Default): Uses the Nth percentile of cosine distances
  2. Gradient: Finds points with highest gradient change
  3. Local Maxima: Uses local maxima in distance patterns

API Reference

SemanticChunkingPipeline

Main class for semantic chunking operations.

class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: str = None, voyage_api_key: str = None)
    
    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "voyage",
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[list, list, float]
    
    def get_chunk_statistics(self) -> Dict[str, Any]
    def print_statistics(self)

chunk_document Function

Convenience function (default) for quick document processing.

async def chunk_document(
    file_path: str,
    embedding_provider: str = "voyage",
    threshold_method: str = "percentile",
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[list, list, float]

Output Files

When save_files=True, the pipeline creates:

  • initial_chunks.txt: Fixed-size initial chunks (256 tokens each)
  • semantic_chunks.txt: Final semantic chunks
  • embeddings.npy: Numpy array of embeddings (for initial_chunks)
  • cosine_distances.png: Visualization of similarity patterns between initial_chunks

Project Structure

chunk_it_pro/
├── chunk_it_pro/             # Main package
│   ├── __init__.py           # Package initialization
│   ├── config.py             # Configuration management
│   ├── pipeline.py           # Main pipeline implementation
│   ├── parsers/              # Document parsing modules
│   │   ├── __init__.py
│   │   ├── document_parser.py
│   │   └── omniparse.py
│   ├── chunkers/             # Chunking algorithms
│   │   ├── __init__.py
│   │   ├── initial_chunker.py
│   │   └── semantic_chunker.py
│   ├── embeddings/           # Embedding generation
│   │   ├── __init__.py
│   │   ├── embedding_analyzer.py
│   │   └── voyage_client.py
│   └── utils/                # Utility modules
│       ├── __init__.py
│       └── singleton.py
├── example.py                # Usage examples
├── requirements.txt          # Dependencies
└── README.md                # Documentation

Algorithm Overview

ChunkIt uses two-pass algorithm:

Initial Chunking

  1. Parse document to markdown format
  2. Identify structural breakpoints (headers, page breaks)
  3. Create fixed-size chunks (256 tokens) respecting breakpoints
  4. Generate embeddings for each chunk
  5. Compute cosine distances between consecutive chunks
  6. Determine similarity threshold using statistical methods

Semantic Refinement

  1. Split text into sentences
  2. Generate sentence-level embeddings
  3. Group sentences based on similarity threshold
  4. Merge similar adjacent chunks (respecting max length)
  5. Output final semantic chunks

Advanced Usage

Custom Configuration

from chunk_it_pro.config import Config

# Override default settings
Config.INITIAL_CHUNK_SIZE = 512
Config.MAX_CHUNK_LENGTH = 2048
Config.DEFAULT_PERCENTILE = 90

# Check configuration
status = Config.validate_config()
print(status)

Processing Multiple Documents

import asyncio
from pathlib import Path
from chunk_it_pro import SemanticChunkingPipeline

async def process_multiple_documents():
    pipeline = SemanticChunkingPipeline()
    
    documents = Path("documents/").glob("*.pdf")
    
    for doc_path in documents:
        print(f"Processing {doc_path.name}...")
        
        initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
            file_path=str(doc_path),
            save_files=True,
            verbose=False
        )
        
        # Save results with document name
        output_dir = Path("output") / doc_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        
        # Custom processing for each document...

asyncio.run(process_multiple_documents())

Troubleshooting

Common Issues

  1. Missing API Keys: Ensure environment variables are set correctly
  2. Document Parsing Errors: Check if Omniparse API is accessible
  3. Memory Issues: Reduce batch size or chunk size for large documents

Performance Tips

  1. Use VoyageAI for best performance/cost balance
  2. Adjust max_chunk_len based on your use case
  3. Set save_files=False for better performance when processing many documents
  4. Use local embedding models (Axon) for privacy-sensitive applications & saving money

Contributing

Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunk_it_pro-0.1.0.tar.gz (26.4 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chunk_it_pro-0.1.0-py3-none-any.whl (20.0 kB view details)

Uploaded Python 3

File details

Details for the file chunk_it_pro-0.1.0.tar.gz.

File metadata

  • Download URL: chunk_it_pro-0.1.0.tar.gz
  • Upload date:
  • Size: 26.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for chunk_it_pro-0.1.0.tar.gz
Algorithm Hash digest
SHA256 8e1ba30653ee19493bba2f1a73eb6b64e020d70df64ba0a2a5e97e6dc95bd8b1
MD5 79e50ab67388e1302565dc73c06e41c5
BLAKE2b-256 e1688247d1349a0a86c74f32a27516f7ce2be4a836e44c118f76abde165e92cc

See more details on using hashes here.

File details

Details for the file chunk_it_pro-0.1.0-py3-none-any.whl.

File metadata

  • Download URL: chunk_it_pro-0.1.0-py3-none-any.whl
  • Upload date:
  • Size: 20.0 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.3

File hashes

Hashes for chunk_it_pro-0.1.0-py3-none-any.whl
Algorithm Hash digest
SHA256 8324821588c87b9537a3c917a79aa996dd75897dafa6bc869bdf4e8525187257
MD5 935eb39b924d691f00608fb1f35e7139
BLAKE2b-256 6a4cc2e52144b69fe333734036e7c30d64e8e6a65a49ade0548fbe2d2a4c0135

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page