A semantic document chunking library

These details have not been verified by PyPI

Project links

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language

Project description

ChunkIt Pro - Semantic Document Chunking Library

ChunkIt Pro is a Python library for document chunking using semantic analysis. It breaks documents into meaningful segments based on content similarity rather than arbitrary size limits.

Features

Multiple Document Formats: Supports PDF, DOCX, TXT, and Markdown
Semantic Analysis: Uses embedding models to understand content similarity
Multiple Embedding Providers: OpenAI and Sentence Transformers support
Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
Configurable Thresholds: Three methods for similarity threshold computation
Visual Analysis: Generates plots showing similarity patterns
Easy Integration: Simple API for quick integration into existing projects
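
To make the idea of similarity-based boundaries concrete, here is a minimal sketch of how neighbouring initial chunks might be merged until the distance between them exceeds a threshold. It is not ChunkIt Pro's implementation: the helper name `merge_by_distance`, the whitespace joining, and the single-pass strategy are all assumptions made for illustration.

```python
from typing import List


def merge_by_distance(chunks: List[str], distances: List[float], threshold: float) -> List[str]:
    """Merge neighbouring chunks; open a new segment where distance > threshold.

    distances[i] is the distance between chunks[i] and chunks[i + 1].
    Hypothetical helper, not part of chunk_it_pro.
    """
    if not chunks:
        return []
    if len(distances) != len(chunks) - 1:
        raise ValueError("expected one distance per adjacent pair of chunks")

    merged = [chunks[0]]
    for chunk, distance in zip(chunks[1:], distances):
        if distance > threshold:
            merged.append(chunk)        # topic shift: start a new segment
        else:
            merged[-1] += " " + chunk   # similar content: extend the current segment
    return merged


segments = merge_by_distance(
    ["ML intro.", "More on ML.", "Weather report."],
    distances=[0.1, 0.8],
    threshold=0.5,
)
print(segments)
```

The first two chunks are close together and end up in one segment, while the large distance before the third chunk opens a new one.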

Installation

Install from PyPI using pip:

pip install chunk-it-pro

Development Installation

For development:

git clone https://github.com/adw777/chunk-it-pro
cd chunk-it-pro
pip install -e .

Quick Start

Simple Usage (After pip install)

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    
    # Process document with default settings
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf"
    )
    
    print(f"Created {len(semantic_chunks)} semantic chunks")
    for i, chunk in enumerate(semantic_chunks):
        print(f"\nChunk {i+1}: {chunk[:200]}...")

# Run the example
asyncio.run(main())

Convenience Function

import asyncio
from chunk_it_pro.pipeline import chunk_document

async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="openai"  # or "axon"
    )
    
    # Use the chunks in your application
    for chunk in semantic_chunks:
        print(f"Chunk length: {len(chunk.split())} words")

asyncio.run(main())

Configuration

Environment Variables

Set up environment variables (create a .env file or set them in your environment):

# Embedding Provider API Keys (at least one required)
OPENAI_API_KEY=your_openai_api_key_here

# Optional: Document parsing service
OMNIPARSE_API_URL=https://your-omniparse-url.com/parse_document

Variable	Description	Required	Default
`OPENAI_API_KEY`	OpenAI API key for embeddings	Optional*	None
`OMNIPARSE_API_URL`	Omniparse API URL	Optional	Default URL

*At least one embedding provider API key is required.
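
Before starting a long run it can help to check that the required variables are present. The sketch below uses only the standard library. The helper name `missing_variables` is hypothetical, and the mapping assumes that only the OpenAI provider needs an API key while the local "axon" provider needs none, which the text above does not state explicitly.

```python
import os
from typing import List, Mapping, Optional

# Assumption: only the OpenAI provider needs an API key; "axon" runs locally.
REQUIRED_VARS = {"openai": ["OPENAI_API_KEY"], "axon": []}


def missing_variables(provider: str, env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    if provider not in REQUIRED_VARS:
        raise ValueError(f"unknown provider: {provider!r}")
    return [name for name in REQUIRED_VARS[provider] if not env.get(name)]


# Reads the real environment when no mapping is passed in.
print(missing_variables("openai"))
```

Passing a plain dictionary instead of the real environment keeps the check easy to test.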

Custom Configuration

You can customize all aspects of the chunking process:

import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.config import Config

async def main():
    # Method 1: Override config globally
    Config.INITIAL_CHUNK_SIZE = 512  # Default: 256 tokens
    Config.MAX_CHUNK_LENGTH = 2048   # Default: 1024 tokens
    Config.MIN_CHUNK_SIZE = 20       # Default: 10 tokens
    Config.DEFAULT_PERCENTILE = 90   # Default: 95
    
    pipeline = SemanticChunkingPipeline()
    
    # Method 2: Pass parameters to process_document
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="gradient",     # Options: "percentile", "gradient", "local_maxima"
        percentile=90,                   # Only used with "percentile" method
        max_chunk_len=2048,              # Override max chunk length
        embedding_provider="openai",     # Options: "openai", "axon"
        save_files=True,                 # Save intermediate files
        verbose=True                     # Print progress
    )

asyncio.run(main())

Complete Configuration Reference

Core Parameters

from chunk_it_pro.config import Config

# Chunking Parameters
Config.INITIAL_CHUNK_SIZE = 256      # Initial chunk size in tokens
Config.MIN_CHUNK_SIZE = 10           # Minimum tokens for valid chunk
Config.MAX_CHUNK_LENGTH = 1024       # Maximum tokens for semantic chunks
Config.DEFAULT_SIMILARITY_THRESHOLD = 0.5  # Fallback similarity threshold

# Threshold Computation
Config.DEFAULT_PERCENTILE = 95       # Percentile for threshold method
Config.THRESHOLD_METHODS = ["percentile", "gradient", "local_maxima"]

# Omniparse URL
OMNIPARSE_API_URL = "http://localhost:8000/parse_document"

# Embedding Models
Config.OPENAI_MODEL = "text-embedding-3-large"
Config.AXON_MODEL = "axondendriteplus/Legal-Embed-intfloat-multilingual-e5-large-instruct"

# Processing
Config.EMBEDDING_BATCH_SIZE = 100    # Batch size for embedding generation
Config.TOKENIZER_NAME = "cl100k_base"  # Tokenizer for token counting

# Supported file formats
Config.SUPPORTED_FORMATS = {'.pdf', '.txt', '.docx', '.md'}

# Output file names
Config.DEFAULT_OUTPUT_FILES = {
    "initial_chunks": "initial_chunks.txt",
    "semantic_chunks": "semantic_chunks.txt", 
    "embeddings": "embeddings.npy",
    "cosine_plot": "cosine_distances.png"
}
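
To illustrate how INITIAL_CHUNK_SIZE and MIN_CHUNK_SIZE could interact, here is a rough sketch of fixed-size splitting. It uses whitespace-separated words as a stand-in for tokens (the library is configured with the cl100k_base tokenizer), and it assumes that chunks below the minimum size are simply discarded. Both points are assumptions, and `fixed_size_chunks` is a hypothetical helper rather than a library function.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 256, min_size: int = 10) -> List[str]:
    """Split text into windows of chunk_size tokens, skipping undersized windows."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()  # stand-in for real tokenization
    chunks = []
    for start in range(0, len(tokens), chunk_size):
        window = tokens[start:start + chunk_size]
        if len(window) >= min_size:  # assumption: undersized chunks are discarded
            chunks.append(" ".join(window))
    return chunks


sample = " ".join(f"w{i}" for i in range(25))
print(len(fixed_size_chunks(sample, chunk_size=10, min_size=3)))
```

With 25 tokens and a window of 10, the last window holds only 5 tokens, so it survives a minimum of 3 but not a minimum of 6.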

Advanced Usage Examples

1. Custom Chunking with Fine-tuned Parameters

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def advanced_chunking():
    pipeline = SemanticChunkingPipeline()
    
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="gradient",      # Better for structured documents
        max_chunk_len=1500,              # Larger chunks for complex content
        embedding_provider="openai",   
        save_files=True,
        verbose=True
    )
    
    # Get detailed statistics
    stats = pipeline.get_chunk_statistics()
    print(f"Average semantic chunk size: {stats['semantic_chunks']['avg_tokens']:.1f} tokens")
    print(f"Similarity threshold used: {stats['similarity_threshold']:.4f}")

asyncio.run(advanced_chunking())

2. Batch Processing Multiple Documents

import asyncio
import os
from chunk_it_pro.pipeline import chunk_document

async def process_directory(directory_path: str):
    results = {}
    
    for filename in os.listdir(directory_path):
        if filename.endswith(('.pdf', '.docx', '.txt', '.md')):
            file_path = os.path.join(directory_path, filename)
            print(f"Processing {filename}...")
            
            try:
                initial, semantic, threshold = await chunk_document(
                    file_path=file_path,
                    embedding_provider="openai",
                    threshold_method="percentile",
                    percentile=95,
                    max_chunk_len=1024,
                    save_files=False,  # Don't save files for batch processing
                    verbose=False      # Reduce output for batch
                )
                
                results[filename] = {
                    'initial_chunks': len(initial),
                    'semantic_chunks': len(semantic),
                    'threshold': threshold
                }
                
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                results[filename] = {'error': str(e)}
    
    return results

# Usage
results = asyncio.run(process_directory("./documents"))
for file, data in results.items():
    if 'error' not in data:
        print(f"{file}: {data['semantic_chunks']} chunks (threshold: {data['threshold']:.3f})")
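
The loop above handles one file at a time. If documents are independent, a bounded-concurrency variant is one option. The sketch below shows the pattern with a stand-in worker so that it runs without API keys. `run_bounded` and `fake_worker` are hypothetical names, and whether `chunk_document` is safe to call concurrently is an untested assumption.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Iterable


async def run_bounded(
    paths: Iterable[str],
    worker: Callable[[str], Awaitable[object]],
    limit: int = 4,
) -> Dict[str, object]:
    """Run worker(path) for every path with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path: str):
        async with semaphore:
            try:
                return path, await worker(path)
            except Exception as exc:  # keep going; record the failure
                return path, exc

    pairs = await asyncio.gather(*(guarded(p) for p in paths))
    return dict(pairs)


async def fake_worker(path: str) -> int:
    """Stand-in for a real chunking call; fails on '.bad' files."""
    await asyncio.sleep(0)
    if path.endswith(".bad"):
        raise ValueError("cannot parse")
    return len(path)


results = asyncio.run(run_bounded(["a.pdf", "b.bad"], fake_worker, limit=2))
print(results)
```

In real use the worker could be something like `lambda p: chunk_document(file_path=p, save_files=False, verbose=False)`. Keeping `save_files=False` seems prudent here, since the default output file names are fixed and concurrent runs would presumably overwrite each other.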

3. Custom Configuration Class

import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.chunkers import InitialChunker, SemanticChunker

async def custom_pipeline():
    # Create custom chunkers with specific parameters
    initial_chunker = InitialChunker(
        tokenizer_name="cl100k_base",
        chunk_size=512  # Custom initial chunk size
    )
    
    pipeline = SemanticChunkingPipeline()
    pipeline.initial_chunker = initial_chunker  # Override default
    
    # Process with custom semantic chunker parameters
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="local_maxima",
        max_chunk_len=2048,
        embedding_provider="openai"
    )
    
    return semantic_chunks

chunks = asyncio.run(custom_pipeline())

4. Working with Embeddings Directly

import asyncio
import numpy as np
from chunk_it_pro.embeddings import EmbeddingAnalyzer

async def analyze_embeddings():
    analyzer = EmbeddingAnalyzer()
    
    # Custom text chunks
    texts = [
        "This is about machine learning and AI.",
        "Machine learning algorithms are powerful tools.",
        "The weather today is sunny and warm.",
        "Climate change affects weather patterns globally."
    ]
    
    # Generate embeddings
    embeddings = await analyzer.generate_openai_embeddings(texts)
    analyzer.normalize_embeddings()
    
    # Compute similarities
    distances = analyzer.compute_cosine_distances()
    threshold = analyzer.compute_similarity_threshold(
        distances, 
        method="percentile", 
        percentile=90
    )
    
    print(f"Similarity threshold: {threshold:.4f}")
    print(f"Cosine distances: {distances}")
    
    # Plot similarity pattern
    analyzer.plot_cosine_distances(distances)

asyncio.run(analyze_embeddings())
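
For readers who want to see what the normalization and distance steps amount to, here is a dependency-free sketch. It assumes that distances are taken between consecutive chunks, which matches the usual semantic-chunking approach but is not confirmed by the text above. The function names are illustrative and are not part of `EmbeddingAnalyzer`.

```python
import math
from typing import List, Sequence


def normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def adjacent_cosine_distances(embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Cosine distance (1 - cosine similarity) between each pair of neighbours."""
    unit = [normalize(v) for v in embeddings]
    return [
        1.0 - sum(a * b for a, b in zip(left, right))
        for left, right in zip(unit, unit[1:])
    ]


toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
print(adjacent_cosine_distances(toy))
```

The first two toy vectors point in the same direction, so their distance is 0. The third is orthogonal to the second, so that distance is 1.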

Embedding Providers

1. OpenAI

Model: text-embedding-3-large
Strengths: High quality general-purpose embeddings
API Key: Set OPENAI_API_KEY environment variable

2. Axon (Local/Self-hosted)

Model: Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct
Setup: Requires sentence-transformers installation

# Use specific provider
chunks = await chunk_document(
    file_path="document.pdf",
    embedding_provider="openai"  # or "axon"
)

Threshold Methods

1. Percentile (Default)

Uses the Nth percentile of the cosine distances as the threshold.

chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="percentile",
    percentile=95  # Use 95th percentile
)
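
As a worked illustration of the percentile idea, the sketch below computes a linearly interpolated percentile in pure Python, using the same convention as NumPy's default. Whether the library computes it this way internally is an assumption, and `percentile` here is a local helper, not a library function.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Return the q-th percentile (0-100) using linear interpolation."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100.0   # fractional index into the sorted list
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


distances = [0.1, 0.2, 0.3, 0.4, 0.5]
print(percentile(distances, 50))
```

A higher percentile gives a higher threshold, so fewer distances exceed it and the resulting chunks tend to be larger.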

2. Gradient

Finds the points with the highest gradient change in similarity.

chunks = await pipeline.process_document(
    file_path="document.pdf", 
    threshold_method="gradient"
)
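
The description above leaves the exact computation open. One plausible reading, sketched here purely as an assumption, is to locate the steepest jump between consecutive distances and place the threshold halfway across it. `gradient_threshold` is a hypothetical helper and may differ from what the library does.

```python
from typing import Sequence


def gradient_threshold(distances: Sequence[float]) -> float:
    """Place the threshold at the midpoint of the steepest jump between neighbours."""
    if len(distances) < 2:
        raise ValueError("need at least two distances")
    # Absolute first differences approximate the gradient.
    jumps = [abs(b - a) for a, b in zip(distances, distances[1:])]
    steepest = max(range(len(jumps)), key=jumps.__getitem__)
    return (distances[steepest] + distances[steepest + 1]) / 2


print(gradient_threshold([0.10, 0.12, 0.11, 0.60, 0.15]))
```

In this toy series the largest jump is from 0.11 to 0.60, so the threshold lands between those two values and only the 0.60 spike exceeds it.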

3. Local Maxima

Uses local maxima in distance patterns.

chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="local_maxima"
)
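
As with the gradient method, the text does not spell out how the maxima become a threshold. The sketch below takes the mean of all interior local maxima and falls back to a default when there are none. Both choices are assumptions. The fallback value mirrors DEFAULT_SIMILARITY_THRESHOLD from the configuration reference, and `local_maxima_threshold` is a hypothetical name.

```python
from typing import Sequence


def local_maxima_threshold(distances: Sequence[float], fallback: float = 0.5) -> float:
    """Average the interior local maxima; use the fallback if none exist."""
    peaks = [
        distances[i]
        for i in range(1, len(distances) - 1)
        if distances[i - 1] < distances[i] > distances[i + 1]
    ]
    return sum(peaks) / len(peaks) if peaks else fallback


print(local_maxima_threshold([0.1, 0.6, 0.2, 0.4, 0.1]))
```

Here the peaks are 0.6 and 0.4, so the threshold is their mean. A strictly increasing series has no interior peak and returns the fallback.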

Output Files

When save_files=True, the pipeline creates:

initial_chunks.txt: Fixed-size initial chunks (256 tokens each by default)
semantic_chunks.txt: Final semantic chunks
embeddings.npy: Numpy array of embeddings
cosine_distances.png: Visualization of similarity patterns

# Control output files
chunks = await pipeline.process_document(
    file_path="document.pdf",
    save_files=True  # Set to False to skip file creation
)

Error Handling

import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def robust_chunking():
    pipeline = SemanticChunkingPipeline()
    
    try:
        chunks = await pipeline.process_document("document.pdf")
        return chunks
    except FileNotFoundError:
        print("Document file not found")
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    
    return None, None, None

result = asyncio.run(robust_chunking())

Performance Tips

Batch Processing: Set save_files=False and verbose=False for faster batch processing
Chunk Size: Larger initial chunks (512+ tokens) work better for long documents
Local Embeddings Generation: Use Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct to generate embeddings locally

API Reference

SemanticChunkingPipeline

class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: str = None)
    
    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "openai",
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[List[str], List[str], float]
    
    def get_chunk_statistics(self) -> Dict[str, Any]
    def print_statistics(self) -> None

chunk_document Function

async def chunk_document(
    file_path: str,
    embedding_provider: str = "openai",
    threshold_method: str = "percentile", 
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[List[str], List[str], float]

Supported File Formats

PDF: .pdf
Word Documents: .docx
Text Files: .txt
Markdown: .md
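
When filtering files before handing them to the pipeline, a small extension check avoids avoidable errors. This sketch copies the extension set from the configuration reference. The case-insensitive comparison is an assumption (the library itself may be case-sensitive), and `is_supported` is a hypothetical helper.

```python
from pathlib import Path

# Mirrors Config.SUPPORTED_FORMATS from the configuration reference.
SUPPORTED_FORMATS = {".pdf", ".txt", ".docx", ".md"}


def is_supported(path: str) -> bool:
    """Return True if the file extension is one of the supported formats."""
    return Path(path).suffix.lower() in SUPPORTED_FORMATS


for name in ["Report.PDF", "notes.md", "image.png", "README"]:
    print(name, is_supported(name))
```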

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License.

Support

If you encounter any issues or have questions:

Check the GitHub Issues
Create a new issue with detailed description
Include sample code and error messages

Changelog

See GitHub Releases for version history and changes.

Release history

0.1.4 (this version): Jun 30, 2025
0.1.3: Jun 27, 2025
0.1.2: Jun 27, 2025
0.1.0: Jun 27, 2025

Download files

Source Distribution

chunk_it_pro-0.1.4.tar.gz (30.0 kB view details)

Uploaded Jun 30, 2025 Source

Built Distribution

chunk_it_pro-0.1.4-py3-none-any.whl (21.1 kB view details)

Uploaded Jun 30, 2025 Python 3

File details

Details for the file chunk_it_pro-0.1.4.tar.gz.

File metadata

Download URL: chunk_it_pro-0.1.4.tar.gz
Upload date: Jun 30, 2025
Size: 30.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chunk_it_pro-0.1.4.tar.gz
Algorithm	Hash digest
SHA256	`d5e74d2ac97d297e338bd709b9ecedcfc109d25d4c57934214eb1bb6c3a66a52`
MD5	`4b344031edf9c443dfa2a3ad59e5e283`
BLAKE2b-256	`5387a89353de8e24c4a782a3f2b8b2fd8d6173faf048928d8d90fd61a780ea4c`

File details

Details for the file chunk_it_pro-0.1.4-py3-none-any.whl.

File metadata

Download URL: chunk_it_pro-0.1.4-py3-none-any.whl
Upload date: Jun 30, 2025
Size: 21.1 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.11.13

File hashes

Hashes for chunk_it_pro-0.1.4-py3-none-any.whl
Algorithm	Hash digest
SHA256	`93a6fe2dfa0795f11d0bb4b930cda0a5deee0cfd8277969047b21ea0b41aa39b`
MD5	`602b7ab2c43abef2519f1a0aa6c3fec2`
BLAKE2b-256	`4b36a3ba9599988b18736b8de4659fcfe2d39bbf59148e8479d7707525167ff1`
