Skip to main content

A semantic document chunking library

Project description

ChunkIt Pro - Semantic Document Chunking Library

PyPI version License: MIT

Python library for document chunking using semantic analysis. ChunkIt Pro breaks down documents into meaningful segments based on content similarity rather than arbitrary size limits.

Features

  • Multiple Document Formats: Supports PDF, DOCX, TXT, MARKDOWN
  • Semantic Analysis: Uses embedding models to understand content similarity
  • Multiple Embedding Providers: OpenAI, VoyageAI, and Sentence Transformers support
  • Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
  • Configurable Thresholds: Three methods for similarity threshold computation
  • Visual Analysis: Generates plots showing similarity patterns
  • Easy Integration: Simple API for quick integration into existing projects

Installation

Install from PyPI using pip:

pip install chunk-it-pro

Development Installation

For development:

git clone https://github.com/adw777/chunk-it-pro
cd chunk-it-pro
pip install -e .

Quick Start

Simple Usage (After pip install)

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    
    # Process document with default settings
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf"
    )
    
    print(f"Created {len(semantic_chunks)} semantic chunks")
    for i, chunk in enumerate(semantic_chunks):
        print(f"\nChunk {i+1}: {chunk[:200]}...")

# Run the example
asyncio.run(main())

Convenience Function

import asyncio
from chunk_it_pro.pipeline import chunk_document

async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="voyage"  # or "openai", "axon"
    )
    
    # Use the chunks in your application
    for chunk in semantic_chunks:
        print(f"Chunk length: {len(chunk.split())} words")

asyncio.run(main())

Configuration

Environment Variables

Set up environment variables (create a .env file or set them in your environment):

# Embedding Provider API Keys (at least one required)
OPENAI_API_KEY=your_openai_api_key_here
VOYAGEAI_API_KEY=your_voyage_api_key_here

# Optional: Document parsing service
OMNIPARSE_API_URL=https://your-omniparse-url.com/parse_document
Variable Description Required Default
OPENAI_API_KEY OpenAI API key for embeddings Optional* None
VOYAGEAI_API_KEY VoyageAI API key for embeddings Optional* None
OMNIPARSE_API_URL Omniparse API URL Optional Default URL

*At least one embedding provider API key is required.

Custom Configuration

You can customize all aspects of the chunking process:

import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.config import Config

async def main():
    # Method 1: Override config globally
    Config.INITIAL_CHUNK_SIZE = 512  # Default: 256 tokens
    Config.MAX_CHUNK_LENGTH = 2048   # Default: 1024 tokens
    Config.MIN_CHUNK_SIZE = 20       # Default: 10 tokens
    Config.DEFAULT_PERCENTILE = 90   # Default: 95
    
    pipeline = SemanticChunkingPipeline()
    
    # Method 2: Pass parameters to process_document
    chunks = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="gradient",     # Options: "percentile", "gradient", "local_maxima"
        percentile=90,                   # Only used with "percentile" method
        max_chunk_len=2048,              # Override max chunk length
        embedding_provider="openai",     # Options: "openai", "voyage", "axon"
        save_files=True,                 # Save intermediate files
        verbose=True                     # Print progress
    )

asyncio.run(main())

Complete Configuration Reference

Core Parameters

from chunk_it_pro.config import Config

# Chunking Parameters
Config.INITIAL_CHUNK_SIZE = 256      # Initial chunk size in tokens
Config.MIN_CHUNK_SIZE = 10           # Minimum tokens for valid chunk
Config.MAX_CHUNK_LENGTH = 1024       # Maximum tokens for semantic chunks
Config.DEFAULT_SIMILARITY_THRESHOLD = 0.5  # Fallback similarity threshold

# Threshold Computation
Config.DEFAULT_PERCENTILE = 95       # Percentile for threshold method
Config.THRESHOLD_METHODS = ["percentile", "gradient", "local_maxima"]

# Omniparse URL
OMNIPARSE_API_URL = "http://localhost:8000/parse_document"

# Embedding Models
Config.OPENAI_MODEL = "text-embedding-3-large"
Config.VOYAGE_MODEL = "voyage-law-2"
Config.AXON_MODEL = "axondendriteplus/Legal-Embed-intfloat-multilingual-e5-large-instruct"

# Processing
Config.EMBEDDING_BATCH_SIZE = 100    # Batch size for embedding generation
Config.TOKENIZER_NAME = "cl100k_base"  # Tokenizer for token counting

# Supported file formats
Config.SUPPORTED_FORMATS = {'.pdf', '.txt', '.docx', '.md'}

# Output file names
Config.DEFAULT_OUTPUT_FILES = {
    "initial_chunks": "initial_chunks.txt",
    "semantic_chunks": "semantic_chunks.txt", 
    "embeddings": "embeddings.npy",
    "cosine_plot": "cosine_distances.png"
}

Advanced Usage Examples

1. Custom Chunking with Fine-tuned Parameters

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def advanced_chunking():
    pipeline = SemanticChunkingPipeline()
    
    # Fine-tuned parameters for legal documents
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="legal_document.pdf",
        threshold_method="gradient",      # Better for structured documents
        max_chunk_len=1500,              # Larger chunks for complex content
        embedding_provider="voyage",      # Optimized for legal content
        save_files=True,
        verbose=True
    )
    
    # Get detailed statistics
    stats = pipeline.get_chunk_statistics()
    print(f"Average semantic chunk size: {stats['semantic_chunks']['avg_tokens']:.1f} tokens")
    print(f"Similarity threshold used: {stats['similarity_threshold']:.4f}")

asyncio.run(advanced_chunking())

2. Batch Processing Multiple Documents

import asyncio
import os
from chunk_it_pro.pipeline import chunk_document

async def process_directory(directory_path: str):
    results = {}
    
    for filename in os.listdir(directory_path):
        if filename.endswith(('.pdf', '.docx', '.txt', '.md')):
            file_path = os.path.join(directory_path, filename)
            print(f"Processing {filename}...")
            
            try:
                initial, semantic, threshold = await chunk_document(
                    file_path=file_path,
                    embedding_provider="voyage",
                    threshold_method="percentile",
                    percentile=95,
                    max_chunk_len=1024,
                    save_files=False,  # Don't save files for batch processing
                    verbose=False      # Reduce output for batch
                )
                
                results[filename] = {
                    'initial_chunks': len(initial),
                    'semantic_chunks': len(semantic),
                    'threshold': threshold
                }
                
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                results[filename] = {'error': str(e)}
    
    return results

# Usage
results = asyncio.run(process_directory("./documents"))
for file, data in results.items():
    if 'error' not in data:
        print(f"{file}: {data['semantic_chunks']} chunks (threshold: {data['threshold']:.3f})")

3. Custom Configuration Class

import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.chunkers import InitialChunker, SemanticChunker

async def custom_pipeline():
    # Create custom chunkers with specific parameters
    initial_chunker = InitialChunker(
        tokenizer_name="cl100k_base",
        chunk_size=512  # Custom initial chunk size
    )
    
    pipeline = SemanticChunkingPipeline()
    pipeline.initial_chunker = initial_chunker  # Override default
    
    # Process with custom semantic chunker parameters
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="local_maxima",
        max_chunk_len=2048,
        embedding_provider="openai"
    )
    
    return semantic_chunks

chunks = asyncio.run(custom_pipeline())

4. Working with Embeddings Directly

import asyncio
import numpy as np
from chunk_it_pro.embeddings import EmbeddingAnalyzer

async def analyze_embeddings():
    analyzer = EmbeddingAnalyzer()
    
    # Custom text chunks
    texts = [
        "This is about machine learning and AI.",
        "Machine learning algorithms are powerful tools.",
        "The weather today is sunny and warm.",
        "Climate change affects weather patterns globally."
    ]
    
    # Generate embeddings
    embeddings = await analyzer.generate_voyage_embeddings(texts)
    analyzer.normalize_embeddings()
    
    # Compute similarities
    distances = analyzer.compute_cosine_distances()
    threshold = analyzer.compute_similarity_threshold(
        distances, 
        method="percentile", 
        percentile=90
    )
    
    print(f"Similarity threshold: {threshold:.4f}")
    print(f"Cosine distances: {distances}")
    
    # Plot similarity pattern
    analyzer.plot_cosine_distances(distances)

asyncio.run(analyze_embeddings())

Embedding Providers

1. VoyageAI (Recommended for Legal Documents)

  • Model: voyage-law-2
  • Strengths: Optimized for legal and structured documents
  • API Key: Set VOYAGEAI_API_KEY environment variable

2. OpenAI

  • Model: text-embedding-3-large
  • Strengths: High quality general-purpose embeddings
  • API Key: Set OPENAI_API_KEY environment variable

3. Axon (Local/Self-hosted)

# Use specific provider
chunks = await chunk_document(
    file_path="document.pdf",
    embedding_provider="voyage"  # or "openai" or "axon"
)

Threshold Methods

1. Percentile (Default)

Uses the Nth percentile of cosine distances as threshold.

chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="percentile",
    percentile=95  # Use 95th percentile
)

2. Gradient

Finds points with highest gradient change in similarity.

chunks = await pipeline.process_document(
    file_path="document.pdf", 
    threshold_method="gradient"
)

3. Local Maxima

Uses local maxima in distance patterns.

chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="local_maxima"
)

Output Files

When save_files=True, the pipeline creates:

  • initial_chunks.txt: Fixed-size initial chunks (256 tokens each)
  • semantic_chunks.txt: Final semantic chunks
  • embeddings.npy: Numpy array of embeddings
  • cosine_distances.png: Visualization of similarity patterns
# Control output files
chunks = await pipeline.process_document(
    file_path="document.pdf",
    save_files=True  # Set to False to skip file creation
)

Error Handling

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def robust_chunking():
    pipeline = SemanticChunkingPipeline()
    
    try:
        chunks = await pipeline.process_document("document.pdf")
        return chunks
    except FileNotFoundError:
        print("Document file not found")
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return None, None, None

result = asyncio.run(robust_chunking())

Performance Tips

  1. Batch Processing: Set save_files=False and verbose=False for faster batch processing
  2. Embedding Provider: VoyageAI is generally faster than OpenAI for document chunking
  3. Chunk Size: Larger initial chunks (512+ tokens) works better for long documents
  4. Local Embeddings Generation: Use Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct to generate embeddings locally

API Reference

SemanticChunkingPipeline

class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: str = None, voyage_api_key: str = None)
    
    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "voyage", 
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[List[str], List[str], float]
    
    def get_chunk_statistics(self) -> Dict[str, Any]
    def print_statistics(self) -> None

chunk_document Function

async def chunk_document(
    file_path: str,
    embedding_provider: str = "voyage",
    threshold_method: str = "percentile", 
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[List[str], List[str], float]

Supported File Formats

  • PDF: .pdf
  • Word Documents: .docx
  • Text Files: .txt
  • Markdown: .md

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Support

If you encounter any issues or have questions:

  1. Check the GitHub Issues
  2. Create a new issue with detailed description
  3. Include sample code and error messages

Changelog

See GitHub Releases for version history and changes.

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

chunk_it_pro-0.1.3.tar.gz (30.3 kB view details)

Uploaded Source

Built Distribution

If you're not sure about the file name format, learn more about wheel file names.

chunk_it_pro-0.1.3-py3-none-any.whl (21.3 kB view details)

Uploaded Python 3

File details

Details for the file chunk_it_pro-0.1.3.tar.gz.

File metadata

  • Download URL: chunk_it_pro-0.1.3.tar.gz
  • Upload date:
  • Size: 30.3 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chunk_it_pro-0.1.3.tar.gz
Algorithm Hash digest
SHA256 d0490e34a66e09df02650a01c1496993d7e8310cd83790f7bc8faaad4a3aeaea
MD5 75409946d502363de4aa482ce9362e73
BLAKE2b-256 d137782affa68d3e6ba5302f72e9770a73f73c5c5245a446a4067ad55560c8a8

See more details on using hashes here.

File details

Details for the file chunk_it_pro-0.1.3-py3-none-any.whl.

File metadata

  • Download URL: chunk_it_pro-0.1.3-py3-none-any.whl
  • Upload date:
  • Size: 21.3 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? No
  • Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chunk_it_pro-0.1.3-py3-none-any.whl
Algorithm Hash digest
SHA256 50fee250dec40c20eb4fd6a073f872a211fa21887c65ff4c4f42fc1c6c6f9880
MD5 7209c2c75093846b17d800fc56ad1450
BLAKE2b-256 e6c4dc9ae4fbb86266038b53fa629dc422db3cb233a847219c577520d299fe00

See more details on using hashes here.

Supported by

AWS Cloud computing and Security Sponsor Datadog Monitoring Depot Continuous Integration Fastly CDN Google Download Analytics Pingdom Monitoring Sentry Error logging StatusPage Status page