A semantic document chunking library
ChunkIt Pro - Semantic Document Chunking Library
ChunkIt Pro is a Python library for semantic document chunking. It splits documents into meaningful segments based on content similarity rather than arbitrary size limits.
Features
- Multiple Document Formats: Supports PDF, DOCX, TXT, and Markdown
- Semantic Analysis: Uses embedding models to understand content similarity
- Multiple Embedding Providers: OpenAI, VoyageAI, and Sentence Transformers support
- Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
- Configurable Thresholds: Three methods for similarity threshold computation
- Visual Analysis: Generates plots showing similarity patterns
- Easy Integration: Simple API for quick integration into existing projects
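To make the idea of similarity-driven boundaries concrete, here is a minimal sketch. It is not the library's actual two-pass algorithm, which this page does not spell out. Given the cosine distances between consecutive initial chunks and a threshold, it starts a new group wherever the distance exceeds the threshold or the merged group would exceed a word budget. The helper name `group_by_distance`, the word-based length check, and the sample numbers are assumptions made for illustration.

```python
from typing import List


def group_by_distance(
    chunks: List[str],
    distances: List[float],
    threshold: float,
    max_words: int = 1024,
) -> List[str]:
    """Merge consecutive chunks until a distance exceeds the threshold.

    distances[i] is the cosine distance between chunks[i] and chunks[i + 1].
    Hypothetical helper: it illustrates the idea, not ChunkIt Pro internals.
    """
    if len(distances) != max(len(chunks) - 1, 0):
        raise ValueError("need exactly one distance per adjacent chunk pair")
    if not chunks:
        return []

    groups = [chunks[0]]
    for chunk, distance in zip(chunks[1:], distances):
        merged = groups[-1] + " " + chunk
        # Break when the topic shifts (large distance) or the group gets too long.
        if distance > threshold or len(merged.split()) > max_words:
            groups.append(chunk)
        else:
            groups[-1] = merged
    return groups


sample_chunks = ["ML intro.", "More on ML.", "Weather today.", "Climate and weather."]
sample_distances = [0.10, 0.80, 0.25]
grouped = group_by_distance(sample_chunks, sample_distances, threshold=0.5)
print(grouped)
```

With these made-up distances the only break falls between the second and third chunk, where the topic changes.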
Installation
Install from PyPI using pip:
pip install chunk-it-pro
Development Installation
For development:
git clone https://github.com/adw777/chunk-it-pro
cd chunk-it-pro
pip install -e .
Quick Start
Simple Usage (After pip install)
import asyncio
from chunk_it_pro import SemanticChunkingPipeline
async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    # Process document with default settings
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf"
    )
    print(f"Created {len(semantic_chunks)} semantic chunks")
    for i, chunk in enumerate(semantic_chunks):
        print(f"\nChunk {i+1}: {chunk[:200]}...")
# Run the example
asyncio.run(main())
Convenience Function
import asyncio
from chunk_it_pro.pipeline import chunk_document
async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="voyage"  # or "openai", "axon"
    )
    # Use the chunks in your application
    for chunk in semantic_chunks:
        print(f"Chunk length: {len(chunk.split())} words")
asyncio.run(main())
Configuration
Environment Variables
Set up environment variables (create a .env file or set them in your environment):
# Embedding Provider API Keys (at least one required)
OPENAI_API_KEY=your_openai_api_key_here
VOYAGEAI_API_KEY=your_voyage_api_key_here
# Optional: Document parsing service
OMNIPARSE_API_URL=https://your-omniparse-url.com/parse_document
| Variable | Description | Required | Default |
|---|---|---|---|
| OPENAI_API_KEY | OpenAI API key for embeddings | Optional* | None |
| VOYAGEAI_API_KEY | VoyageAI API key for embeddings | Optional* | None |
| OMNIPARSE_API_URL | Omniparse API URL | Optional | Default URL |
*At least one embedding provider API key is required.
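Before running the pipeline it can help to confirm that a key is actually visible to the process. The sketch below only inspects the two variable names from the table above; the helper `configured_providers` and the `PROVIDER_ENV_VARS` mapping are hypothetical and not part of ChunkIt Pro. It also does not load a .env file by itself, so the variables must already be in the environment.

```python
import os
from typing import List, Mapping, Optional

# Maps the provider names used in this README to the variables listed above.
PROVIDER_ENV_VARS = {
    "openai": "OPENAI_API_KEY",
    "voyage": "VOYAGEAI_API_KEY",
}


def configured_providers(env: Optional[Mapping[str, str]] = None) -> List[str]:
    """Return the hosted providers whose API key is set and non-empty."""
    env = os.environ if env is None else env
    return [name for name, var in PROVIDER_ENV_VARS.items() if env.get(var, "").strip()]


providers = configured_providers()
if providers:
    print("Configured providers:", ", ".join(providers))
else:
    print("No hosted embedding provider key found; set one of:",
          ", ".join(PROVIDER_ENV_VARS.values()))
```

Passing a plain dictionary instead of relying on `os.environ` makes the check easy to exercise in tests.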
Custom Configuration
You can customize all aspects of the chunking process:
import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.config import Config
async def main():
    # Method 1: Override config globally
    Config.INITIAL_CHUNK_SIZE = 512   # Default: 256 tokens
    Config.MAX_CHUNK_LENGTH = 2048    # Default: 1024 tokens
    Config.MIN_CHUNK_SIZE = 20        # Default: 10 tokens
    Config.DEFAULT_PERCENTILE = 90    # Default: 95
    pipeline = SemanticChunkingPipeline()
    # Method 2: Pass parameters to process_document
    chunks = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="gradient",   # Options: "percentile", "gradient", "local_maxima"
        percentile=90,                 # Only used with "percentile" method
        max_chunk_len=2048,            # Override max chunk length
        embedding_provider="openai",   # Options: "openai", "voyage", "axon"
        save_files=True,               # Save intermediate files
        verbose=True                   # Print progress
    )
asyncio.run(main())
Complete Configuration Reference
Core Parameters
from chunk_it_pro.config import Config
# Chunking Parameters
Config.INITIAL_CHUNK_SIZE = 256 # Initial chunk size in tokens
Config.MIN_CHUNK_SIZE = 10 # Minimum tokens for valid chunk
Config.MAX_CHUNK_LENGTH = 1024 # Maximum tokens for semantic chunks
Config.DEFAULT_SIMILARITY_THRESHOLD = 0.5 # Fallback similarity threshold
# Threshold Computation
Config.DEFAULT_PERCENTILE = 95 # Percentile for threshold method
Config.THRESHOLD_METHODS = ["percentile", "gradient", "local_maxima"]
# Omniparse URL
OMNIPARSE_API_URL = "http://localhost:8000/parse_document"
# Embedding Models
Config.OPENAI_MODEL = "text-embedding-3-large"
Config.VOYAGE_MODEL = "voyage-law-2"
Config.AXON_MODEL = "axondendriteplus/Legal-Embed-intfloat-multilingual-e5-large-instruct"
# Processing
Config.EMBEDDING_BATCH_SIZE = 100 # Batch size for embedding generation
Config.TOKENIZER_NAME = "cl100k_base" # Tokenizer for token counting
# Supported file formats
Config.SUPPORTED_FORMATS = {'.pdf', '.txt', '.docx', '.md'}
# Output file names
Config.DEFAULT_OUTPUT_FILES = {
    "initial_chunks": "initial_chunks.txt",
    "semantic_chunks": "semantic_chunks.txt",
    "embeddings": "embeddings.npy",
    "cosine_plot": "cosine_distances.png"
}
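To show how the initial chunk size and the minimum chunk size might interact, here is a rough sketch of fixed-size chunking. It counts whitespace-separated words as a stand-in for real tokenizer tokens, and it folds a too-short trailing window into the previous one. Both choices are assumptions; the library may tokenize with `cl100k_base` and may handle undersized chunks differently (for example by dropping them). The function `fixed_size_chunks` is hypothetical.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 256, min_size: int = 10) -> List[str]:
    """Split text into windows of chunk_size whitespace tokens.

    A trailing window shorter than min_size is folded into the previous one.
    Whitespace tokens are a stand-in for real tokenizer tokens here.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    tokens = text.split()
    windows = [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]
    if len(windows) > 1 and len(windows[-1]) < min_size:
        # Append the short tail to the window before it.
        windows[-2].extend(windows.pop())
    return [" ".join(window) for window in windows]


demo = fixed_size_chunks(" ".join(f"w{i}" for i in range(23)), chunk_size=10, min_size=5)
print([len(chunk.split()) for chunk in demo])
```

Here 23 words with a window of 10 would leave a 3-word tail, which is below the minimum of 5 and is therefore merged into the second window.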
Advanced Usage Examples
1. Custom Chunking with Fine-tuned Parameters
import asyncio
from chunk_it_pro import SemanticChunkingPipeline
async def advanced_chunking():
    pipeline = SemanticChunkingPipeline()
    # Fine-tuned parameters for legal documents
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="legal_document.pdf",
        threshold_method="gradient",    # Better for structured documents
        max_chunk_len=1500,             # Larger chunks for complex content
        embedding_provider="voyage",    # Optimized for legal content
        save_files=True,
        verbose=True
    )
    # Get detailed statistics
    stats = pipeline.get_chunk_statistics()
    print(f"Average semantic chunk size: {stats['semantic_chunks']['avg_tokens']:.1f} tokens")
    print(f"Similarity threshold used: {stats['similarity_threshold']:.4f}")
asyncio.run(advanced_chunking())
2. Batch Processing Multiple Documents
import asyncio
import os
from chunk_it_pro.pipeline import chunk_document
async def process_directory(directory_path: str):
    results = {}
    for filename in os.listdir(directory_path):
        if filename.endswith(('.pdf', '.docx', '.txt', '.md')):
            file_path = os.path.join(directory_path, filename)
            print(f"Processing {filename}...")
            try:
                initial, semantic, threshold = await chunk_document(
                    file_path=file_path,
                    embedding_provider="voyage",
                    threshold_method="percentile",
                    percentile=95,
                    max_chunk_len=1024,
                    save_files=False,  # Don't save files for batch processing
                    verbose=False      # Reduce output for batch
                )
                results[filename] = {
                    'initial_chunks': len(initial),
                    'semantic_chunks': len(semantic),
                    'threshold': threshold
                }
            except Exception as e:
                print(f"Error processing {filename}: {e}")
                results[filename] = {'error': str(e)}
    return results
# Usage
results = asyncio.run(process_directory("./documents"))
for file, data in results.items():
    if 'error' not in data:
        print(f"{file}: {data['semantic_chunks']} chunks (threshold: {data['threshold']:.3f})")
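The loop above handles one document at a time. If the embedding provider's rate limits allow it, several documents could be processed concurrently with a cap on in-flight calls. The sketch below shows that pattern with `asyncio.Semaphore` and `asyncio.gather`. It uses a stand-in coroutine, `fake_worker`, instead of a real chunking call, because this page does not say whether the library is safe to call concurrently. `run_bounded` and `fake_worker` are hypothetical names.

```python
import asyncio
from typing import Awaitable, Callable, Dict, List


async def run_bounded(
    paths: List[str],
    worker: Callable[[str], Awaitable[int]],
    limit: int = 3,
) -> Dict[str, object]:
    """Run worker(path) for every path with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(path: str):
        async with semaphore:
            try:
                return path, await worker(path)
            except Exception as exc:  # keep going if one document fails
                return path, exc

    pairs = await asyncio.gather(*(guarded(path) for path in paths))
    return dict(pairs)


async def fake_worker(path: str) -> int:
    # Stand-in for a real chunking call; returns a made-up chunk count.
    await asyncio.sleep(0.01)
    if path.endswith(".bad"):
        raise ValueError(f"cannot parse {path}")
    return len(path)


outcome = asyncio.run(run_bounded(["a.pdf", "bb.txt", "c.bad"], fake_worker, limit=2))
print(outcome)
```

Failures are stored as exception objects next to the successful results, so one unreadable file does not abort the whole batch.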
3. Custom Configuration Class
import asyncio
from chunk_it_pro import SemanticChunkingPipeline
from chunk_it_pro.chunkers import InitialChunker, SemanticChunker
async def custom_pipeline():
    # Create custom chunkers with specific parameters
    initial_chunker = InitialChunker(
        tokenizer_name="cl100k_base",
        chunk_size=512  # Custom initial chunk size
    )
    pipeline = SemanticChunkingPipeline()
    pipeline.initial_chunker = initial_chunker  # Override default
    # Process with custom semantic chunker parameters
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="document.pdf",
        threshold_method="local_maxima",
        max_chunk_len=2048,
        embedding_provider="openai"
    )
    return semantic_chunks
chunks = asyncio.run(custom_pipeline())
4. Working with Embeddings Directly
import asyncio
import numpy as np
from chunk_it_pro.embeddings import EmbeddingAnalyzer
async def analyze_embeddings():
    analyzer = EmbeddingAnalyzer()
    # Custom text chunks
    texts = [
        "This is about machine learning and AI.",
        "Machine learning algorithms are powerful tools.",
        "The weather today is sunny and warm.",
        "Climate change affects weather patterns globally."
    ]
    # Generate embeddings
    embeddings = await analyzer.generate_voyage_embeddings(texts)
    analyzer.normalize_embeddings()
    # Compute similarities
    distances = analyzer.compute_cosine_distances()
    threshold = analyzer.compute_similarity_threshold(
        distances,
        method="percentile",
        percentile=90
    )
    print(f"Similarity threshold: {threshold:.4f}")
    print(f"Cosine distances: {distances}")
    # Plot similarity pattern
    analyzer.plot_cosine_distances(distances)
asyncio.run(analyze_embeddings())
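For readers who want to see what normalizing embeddings and computing cosine distances amount to, here is a dependency-free sketch. It assumes distances are taken between consecutive vectors, which this page does not confirm for `compute_cosine_distances`. The functions `l2_normalize` and `consecutive_cosine_distances` and the toy two-dimensional vectors are illustrative only.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vector]


def consecutive_cosine_distances(vectors: Sequence[Sequence[float]]) -> List[float]:
    """Return 1 - cosine similarity for each adjacent pair of vectors."""
    unit = [l2_normalize(v) for v in vectors]
    return [
        # For unit vectors the dot product equals the cosine similarity.
        1.0 - sum(a * b for a, b in zip(left, right))
        for left, right in zip(unit, unit[1:])
    ]


toy = [[1.0, 0.0], [2.0, 0.0], [0.0, 3.0]]
toy_distances = consecutive_cosine_distances(toy)
print(toy_distances)
```

The first two toy vectors point in the same direction, so their distance is 0; the third is orthogonal to the second, so that distance is 1.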
Embedding Providers
1. VoyageAI (Recommended for Legal Documents)
- Model: voyage-law-2
- Strengths: Optimized for legal and structured documents
- API Key: Set the VOYAGEAI_API_KEY environment variable
2. OpenAI
- Model: text-embedding-3-large
- Strengths: High quality general-purpose embeddings
- API Key: Set the OPENAI_API_KEY environment variable
3. Axon (Local/Self-hosted)
- Model: Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct
- Setup: Requires sentence-transformers installation
# Use specific provider
chunks = await chunk_document(
    file_path="document.pdf",
    embedding_provider="voyage"  # or "openai" or "axon"
)
Threshold Methods
1. Percentile (Default)
Uses the Nth percentile of cosine distances as threshold.
chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="percentile",
    percentile=95  # Use 95th percentile
)
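As a worked illustration of a percentile threshold, the sketch below computes the Nth percentile of a list of distances using linear interpolation between ranks, the same rule `numpy.percentile` applies by default, and then lists the positions whose distance exceeds it. Whether the library computes its percentile this way is an assumption; the `percentile` helper and the sample distances are made up for the example.

```python
import math
from typing import Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """Linear-interpolation percentile for 0 <= q <= 100."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= q <= 100:
        raise ValueError("q must be between 0 and 100")
    ordered = sorted(values)
    rank = (len(ordered) - 1) * q / 100
    lower = math.floor(rank)
    upper = math.ceil(rank)
    fraction = rank - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction


distances = [0.10, 0.15, 0.12, 0.70, 0.11, 0.65, 0.13]
threshold = percentile(distances, 75)
# Positions whose distance is above the threshold are candidate boundaries.
breaks = [i for i, d in enumerate(distances) if d > threshold]
print(round(threshold, 2), breaks)
```

A higher percentile yields a higher threshold and therefore fewer, larger chunks; a lower percentile splits more aggressively.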
2. Gradient
Finds points with highest gradient change in similarity.
chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="gradient"
)
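The description of the gradient method is brief, so the following is only one possible reading, not the library's implementation: take first differences of the distance series and locate the steepest rise. How the library turns such a point into a single threshold value is not stated here. The helper `steepest_rise` and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def steepest_rise(distances: Sequence[float]) -> Tuple[int, float]:
    """Return (index, rise) of the largest increase between neighbours.

    The index refers to the later element of the pair.
    """
    if len(distances) < 2:
        raise ValueError("need at least two distances")
    # First differences: how much each distance rises over the previous one.
    rises = [later - earlier for earlier, later in zip(distances, distances[1:])]
    best = max(range(len(rises)), key=lambda i: rises[i])
    return best + 1, rises[best]


distances = [0.10, 0.12, 0.11, 0.60, 0.14, 0.13]
index, rise = steepest_rise(distances)
print(index, round(rise, 2))
```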
3. Local Maxima
Uses local maxima in distance patterns.
chunks = await pipeline.process_document(
    file_path="document.pdf",
    threshold_method="local_maxima"
)
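To illustrate what a local maximum in the distance pattern looks like, this sketch finds positions whose distance is strictly greater than both neighbours. It is an assumption that the library uses this definition, and the step from peaks to a final threshold is not described on this page, so the example stops at listing the peaks. `local_maxima` is a hypothetical helper.

```python
from typing import List, Sequence


def local_maxima(distances: Sequence[float]) -> List[int]:
    """Return indices whose value is strictly greater than both neighbours."""
    return [
        i
        for i in range(1, len(distances) - 1)
        if distances[i - 1] < distances[i] > distances[i + 1]
    ]


distances = [0.10, 0.40, 0.12, 0.11, 0.70, 0.15, 0.30, 0.20]
peaks = local_maxima(distances)
print(peaks, [distances[i] for i in peaks])
```

Endpoints are never reported because they have only one neighbour.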
Output Files
When save_files=True, the pipeline creates:
- initial_chunks.txt: Fixed-size initial chunks (256 tokens each by default)
- semantic_chunks.txt: Final semantic chunks
- embeddings.npy: NumPy array of embeddings
- cosine_distances.png: Visualization of similarity patterns
# Control output files
chunks = await pipeline.process_document(
    file_path="document.pdf",
    save_files=True  # Set to False to skip file creation
)
Error Handling
import asyncio
from chunk_it_pro import SemanticChunkingPipeline
async def robust_chunking():
    pipeline = SemanticChunkingPipeline()
    try:
        chunks = await pipeline.process_document("document.pdf")
        return chunks
    except FileNotFoundError:
        print("Document file not found")
    except ValueError as e:
        print(f"Configuration error: {e}")
    except Exception as e:
        print(f"Unexpected error: {e}")
    return None, None, None
result = asyncio.run(robust_chunking())
Performance Tips
- Batch Processing: Set save_files=False and verbose=False for faster batch processing
- Embedding Provider: VoyageAI is generally faster than OpenAI for document chunking
- Chunk Size: Larger initial chunks (512+ tokens) work better for long documents
- Local Embeddings Generation: Use Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct to generate embeddings locally
API Reference
SemanticChunkingPipeline
class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: str = None, voyage_api_key: str = None)
    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "voyage",
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[List[str], List[str], float]
    def get_chunk_statistics(self) -> Dict[str, Any]
    def print_statistics(self) -> None
chunk_document Function
async def chunk_document(
    file_path: str,
    embedding_provider: str = "voyage",
    threshold_method: str = "percentile",
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[List[str], List[str], float]
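Both entry points return the same three-part tuple: the initial chunks, the semantic chunks, and the threshold. The sketch below shows one way to consume that shape without touching the library. The `summarize` helper is hypothetical, it counts words rather than tokens, and the sample tuple is fabricated purely to demonstrate the structure.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Shape documented above: (initial_chunks, semantic_chunks, threshold).
ChunkResult = Tuple[List[str], List[str], float]


def summarize(result: ChunkResult) -> Dict[str, float]:
    """Summarize a chunking result using word counts (not tokens)."""
    initial, semantic, threshold = result
    avg_words = mean(len(chunk.split()) for chunk in semantic) if semantic else 0.0
    return {
        "initial_count": len(initial),
        "semantic_count": len(semantic),
        "avg_semantic_words": avg_words,
        "threshold": threshold,
    }


fake_result = (["a b", "c d", "e f"], ["a b c d", "e f"], 0.42)
summary = summarize(fake_result)
print(summary)
```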
Supported File Formats
- PDF: .pdf
- Word Documents: .docx
- Text Files: .txt
- Markdown: .md
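When collecting input files it can be convenient to filter by these extensions up front. The sketch below does so case-insensitively with `pathlib`. The `is_supported` helper and the `SUPPORTED_SUFFIXES` constant are hypothetical, and whether the library itself accepts upper-case extensions such as `.PDF` is not stated on this page.

```python
from pathlib import Path

# Extensions taken from the list above.
SUPPORTED_SUFFIXES = {".pdf", ".docx", ".txt", ".md"}


def is_supported(path: str) -> bool:
    """Return True if the file's final extension is in the supported set."""
    return Path(path).suffix.lower() in SUPPORTED_SUFFIXES


candidates = ["Report.PDF", "notes.md", "archive.tar.gz", "README"]
accepted = [name for name in candidates if is_supported(name)]
print(accepted)
```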
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License.
Support
If you encounter any issues or have questions:
- Check the GitHub Issues
- Create a new issue with detailed description
- Include sample code and error messages
Changelog
See GitHub Releases for version history and changes.