A semantic document chunking library
ChunkIt Pro - Semantic Document Chunking Library
ChunkIt Pro is a Python library for document chunking using semantic analysis. It breaks documents into meaningful segments based on content similarity rather than arbitrary size limits.
Features
- Multiple Document Formats: Supports PDF, DOCX, TXT, and Markdown
- Semantic Analysis: Uses embedding models to understand content similarity
- Multiple Embedding Providers: OpenAI, VoyageAI, and Sentence Transformers support
- Intelligent Chunking: Two-pass algorithm for optimal chunk boundaries
- Configurable Thresholds: Three methods for similarity threshold computation
- Visual Analysis: Generates plots showing similarity patterns
- Easy Integration: Simple API for quick integration into existing projects
Installation
- Clone the repository:
git clone https://github.com/adw777/chunk_it_pro
cd chunk_it_pro
- Install dependencies:
pip install -r requirements.txt
- Set up environment variables (create a .env file):
OPENAI_API_KEY=your_openai_api_key_here
VOYAGEAI_API_KEY=your_voyage_api_key_here
OMNIPARSE_API_URL=your_omniparse_api_url_here
Quick Start
Basic Usage
import asyncio
from chunk_it_pro import SemanticChunkingPipeline

async def main():
    # Initialize pipeline
    pipeline = SemanticChunkingPipeline()
    # Process document
    initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
        file_path="your_document.pdf",
        embedding_provider="voyage",  # or "openai", "axon"
        threshold_method="percentile",
        percentile=95,
        max_chunk_len=1024
    )
    print(f"Created {len(semantic_chunks)} semantic chunks")
    print(f"Similarity threshold: {threshold:.4f}")

# Run the example
asyncio.run(main())
Convenience Function (with default values)
import asyncio
from chunk_it_pro.pipeline import chunk_document

async def main():
    # Quick chunking with default settings
    initial_chunks, semantic_chunks, threshold = await chunk_document(
        file_path="document.pdf",
        embedding_provider="voyage"
    )
    # Use the chunks in your application
    for i, chunk in enumerate(semantic_chunks):
        print(f"Chunk {i+1}: {chunk[:100]}...")

asyncio.run(main())
Configuration
Environment Variables
| Variable | Description | Required |
|---|---|---|
| OPENAI_API_KEY | OpenAI API key for embeddings | Optional* |
| VOYAGEAI_API_KEY | VoyageAI API key for embeddings | Optional* |
| OMNIPARSE_API_URL | Omniparse API URL for document parsing | Optional |
*At least one embedding provider API key is required.
Embedding Providers
- VoyageAI (Recommended for legal documents):
  - Model: voyage-law-2
- OpenAI:
  - Model: text-embedding-3-large - High quality general-purpose embeddings
- Axon (Local):
  - Model: Fine-tuned legal embedding model Wasserstoff-AI/Legal-Embed-intfloat-multilingual-e5-large-instruct
  - Runs locally, no API costs
  - Requires a Sentence Transformers setup
Threshold Methods
- Percentile (Default): Uses the Nth percentile of cosine distances
- Gradient: Finds points with highest gradient change
- Local Maxima: Uses local maxima in distance patterns
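To make the default method concrete, here is a minimal sketch of percentile thresholding over a list of consecutive-chunk cosine distances. It is not taken from ChunkIt Pro's source: the helper names `percentile` and `breakpoints`, the sample distances, and the choice to split where a distance strictly exceeds the threshold are all assumptions made for illustration.

```python
def percentile(values, q):
    """Linear-interpolation percentile (the same convention as NumPy's default)."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    rank = (len(ordered) - 1) * q / 100.0
    lo = int(rank)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def breakpoints(distances, threshold):
    """Indices i where the distance between chunk i and chunk i+1 exceeds the threshold."""
    return [i for i, d in enumerate(distances) if d > threshold]


# Hypothetical distances between consecutive initial chunks
distances = [0.10, 0.12, 0.11, 0.45, 0.13, 0.09, 0.38, 0.10]
threshold = percentile(distances, 75)
split_after = breakpoints(distances, threshold)
print(f"threshold={threshold:.4f}, split after chunks {split_after}")
```

A higher percentile yields fewer breakpoints and therefore larger semantic chunks; a lower one splits more aggressively.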
API Reference
SemanticChunkingPipeline
Main class for semantic chunking operations.
class SemanticChunkingPipeline:
    def __init__(self, openai_api_key: str = None, voyage_api_key: str = None)

    async def process_document(
        self,
        file_path: str,
        threshold_method: str = "percentile",
        percentile: float = 95,
        max_chunk_len: int = 1024,
        embedding_provider: str = "voyage",
        save_files: bool = True,
        verbose: bool = True
    ) -> Tuple[list, list, float]

    def get_chunk_statistics(self) -> Dict[str, Any]

    def print_statistics(self)
chunk_document Function
Convenience function for quick document processing with default settings.
async def chunk_document(
    file_path: str,
    embedding_provider: str = "voyage",
    threshold_method: str = "percentile",
    percentile: float = 95,
    max_chunk_len: int = 1024,
    save_files: bool = True,
    verbose: bool = True
) -> Tuple[list, list, float]
Output Files
When save_files=True, the pipeline creates:
- initial_chunks.txt: Fixed-size initial chunks (256 tokens each)
- semantic_chunks.txt: Final semantic chunks
- embeddings.npy: NumPy array of embeddings (for initial_chunks)
- cosine_distances.png: Visualization of similarity patterns between initial_chunks
Project Structure
chunk_it_pro/
├── chunk_it_pro/ # Main package
│ ├── __init__.py # Package initialization
│ ├── config.py # Configuration management
│ ├── pipeline.py # Main pipeline implementation
│ ├── parsers/ # Document parsing modules
│ │ ├── __init__.py
│ │ ├── document_parser.py
│ │ └── omniparse.py
│ ├── chunkers/ # Chunking algorithms
│ │ ├── __init__.py
│ │ ├── initial_chunker.py
│ │ └── semantic_chunker.py
│ ├── embeddings/ # Embedding generation
│ │ ├── __init__.py
│ │ ├── embedding_analyzer.py
│ │ └── voyage_client.py
│ └── utils/ # Utility modules
│ ├── __init__.py
│ └── singleton.py
├── example.py # Usage examples
├── requirements.txt # Dependencies
└── README.md # Documentation
Algorithm Overview
ChunkIt Pro uses a two-pass algorithm:
Initial Chunking
- Parse document to markdown format
- Identify structural breakpoints (headers, page breaks)
- Create fixed-size chunks (256 tokens) respecting breakpoints
- Generate embeddings for each chunk
- Compute cosine distances between consecutive chunks
- Determine similarity threshold using statistical methods
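The core of this first pass can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the library's implementation: tokens are approximated by whitespace-separated words, structural breakpoints are ignored, the embeddings are hand-written toy vectors rather than model output, and the function names are hypothetical.

```python
import math


def fixed_size_chunks(tokens, size=256):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def cosine_distance(a, b):
    """Return 1 minus the cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def consecutive_distances(embeddings):
    """Cosine distance between each chunk embedding and the one that follows it."""
    return [cosine_distance(a, b) for a, b in zip(embeddings, embeddings[1:])]


# Whitespace "tokens" stand in for a real tokenizer (assumption)
tokens = "one two three four five six seven".split()
chunks = fixed_size_chunks(tokens, size=3)

# Toy 2-D embeddings, one per chunk; a real run would call an embedding model
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
distances = consecutive_distances(embeddings)
print(len(chunks), distances)
```

N chunks always produce N-1 distances, one for each boundary between neighbouring chunks, and it is these values that the threshold method is applied to.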
Semantic Refinement
- Split text into sentences
- Generate sentence-level embeddings
- Group sentences based on similarity threshold
- Merge similar adjacent chunks (respecting max length)
- Output final semantic chunks
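The grouping and merging steps can be pictured with a small greedy sketch. It assumes the distances between adjacent sentences are already computed, measures length in characters (the unit the library uses for `max_chunk_len` is not stated here), and uses a hypothetical function name; the library's actual merge logic may differ.

```python
def group_by_threshold(sentences, distances, threshold, max_len=1024):
    """Greedily merge adjacent sentences into chunks.

    A new chunk starts when the distance to the previous sentence exceeds
    `threshold`, or when appending would push the chunk past `max_len` characters.
    """
    if not sentences:
        return []
    if len(distances) != len(sentences) - 1:
        raise ValueError("need exactly one distance per adjacent sentence pair")
    chunks = [sentences[0]]
    for sentence, dist in zip(sentences[1:], distances):
        candidate = chunks[-1] + " " + sentence
        if dist > threshold or len(candidate) > max_len:
            chunks.append(sentence)   # semantic break or length limit reached
        else:
            chunks[-1] = candidate    # similar enough: extend the current chunk
    return chunks


# Hypothetical sentences and adjacent-pair distances
sentences = [
    "Cats purr.",
    "Cats nap a lot.",
    "Contracts need consideration.",
    "Offers can lapse.",
]
distances = [0.1, 0.8, 0.2]
semantic_chunks = group_by_threshold(sentences, distances, threshold=0.5)
print(semantic_chunks)
```

The length check takes priority over similarity, so even highly similar sentences are split once a chunk would exceed the limit.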
Advanced Usage
Custom Configuration
from chunk_it_pro.config import Config
# Override default settings
Config.INITIAL_CHUNK_SIZE = 512
Config.MAX_CHUNK_LENGTH = 2048
Config.DEFAULT_PERCENTILE = 90
# Check configuration
status = Config.validate_config()
print(status)
Processing Multiple Documents
import asyncio
from pathlib import Path
from chunk_it_pro import SemanticChunkingPipeline

async def process_multiple_documents():
    pipeline = SemanticChunkingPipeline()
    documents = Path("documents/").glob("*.pdf")
    for doc_path in documents:
        print(f"Processing {doc_path.name}...")
        initial_chunks, semantic_chunks, threshold = await pipeline.process_document(
            file_path=str(doc_path),
            save_files=True,
            verbose=False
        )
        # Save results with document name
        output_dir = Path("output") / doc_path.stem
        output_dir.mkdir(parents=True, exist_ok=True)
        # Custom processing for each document...

asyncio.run(process_multiple_documents())
Troubleshooting
Common Issues
- Missing API Keys: Ensure environment variables are set correctly
- Document Parsing Errors: Check if Omniparse API is accessible
- Memory Issues: Reduce batch size or chunk size for large documents
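For the missing-key case, a quick pre-flight check before running the pipeline can save a failed run. The sketch below uses only the standard library; the provider names and environment variable names come from this README, while the helper `missing_key` and the assumption that the local Axon provider needs no key are illustrative.

```python
import os

# Provider name -> environment variable that must hold its API key
PROVIDER_KEYS = {
    "openai": "OPENAI_API_KEY",
    "voyage": "VOYAGEAI_API_KEY",
}


def missing_key(provider, env=None):
    """Return the name of the unset variable for `provider`, or None if nothing is missing."""
    env = os.environ if env is None else env
    var = PROVIDER_KEYS.get(provider)
    if var is None:
        return None  # assumed: providers without an entry (e.g. "axon") need no key
    return None if env.get(var) else var


problem = missing_key("voyage")
if problem:
    print(f"Set {problem} in your environment or .env file before running the pipeline")
```

Note that `os.environ` only sees values from a .env file if something has loaded that file into the process environment first.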
Performance Tips
- Use VoyageAI for best performance/cost balance
- Adjust max_chunk_len based on your use case
- Set save_files=False for better performance when processing many documents
- Use local embedding models (Axon) for privacy-sensitive applications and to avoid API costs
Contributing
Contributions are welcome! Please feel free to submit pull requests or open issues for bugs and feature requests.