Semantic chunking of documents using Sentence Transformers and LangChain.
📄 Semantic Document Chunking
This repository provides a method to convert documents into semantically meaningful chunks without relying on fixed chunk sizes or overlapping windows. Instead, it uses semantic chunking, dividing text based on meaning, topics, and the natural structure of the content to preserve contextual relevance.
🚀 Overview
Traditional chunking techniques split documents based solely on size or fixed length, often producing fragmented and contextually inconsistent segments.
Our approach instead groups content by semantic similarity first, then feeds those groups into a dynamic chunking strategy. The result is more meaningful, context-aware chunks and a significant reduction in computational cost.
🔍 How It Works
📝 Sentence-wise Splitting
The document is first split into individual sentences or paragraphs, depending on the selected mode ('sent' or 'para').
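As a rough illustration of the two modes, paragraph mode can be as simple as splitting on blank lines, while sentence mode would typically hand off to a proper tokenizer such as NLTK's punkt (installed below). The helper below is a sketch; `split_units` and the naive sentence regex are illustrative, not the package's actual API:

```python
import re

def split_units(text: str, mode: str = "para") -> list[str]:
    if mode == "para":
        # Paragraph mode: split on one or more blank lines.
        units = re.split(r"\n\s*\n", text)
    else:
        # Sentence mode: a naive stand-in; the real package can use
        # NLTK's punkt tokenizer (nltk.sent_tokenize) instead.
        units = re.split(r"(?<=[.!?])\s+", text)
    return [u.strip() for u in units if u.strip()]

doc = "First paragraph. Two sentences here.\n\nSecond paragraph."
print(split_units(doc, "para"))  # 2 paragraphs
print(split_units(doc, "sent"))  # 3 sentences
```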
🔗 Semantic Segregation
- Calculate cosine similarity between sentences using a Sentence Transformer.
- Group sentences where similarity scores > 0.4 into clusters.
- Recursively repeat for ungrouped sentences until all are grouped semantically.
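The grouping step above can be sketched as a greedy, recursive clustering over embedding similarities. The sketch below uses toy 2-D vectors in place of Sentence Transformer embeddings; the function names and the greedy seeding strategy are illustrative assumptions, not the package's internals:

```python
import math

def cosine(a, b):
    # Standard cosine similarity between two vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def group_by_similarity(embeddings, threshold=0.4):
    """Seed a cluster with the first ungrouped item, pull in every
    remaining item whose similarity exceeds the threshold, then repeat
    on what is left -- mirroring the recursive step described above."""
    remaining = list(range(len(embeddings)))
    clusters = []
    while remaining:
        seed, rest = remaining[0], remaining[1:]
        cluster, leftover = [seed], []
        for idx in rest:
            if cosine(embeddings[seed], embeddings[idx]) > threshold:
                cluster.append(idx)
            else:
                leftover.append(idx)
        clusters.append(cluster)
        remaining = leftover
    return clusters

# Two obvious topic groups: vectors 0/1 point one way, 2/3 the other.
vecs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.95)]
print(group_by_similarity(vecs, threshold=0.4))  # [[0, 1], [2, 3]]
```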
⚙️ Dynamic Chunking with Retrieval Optimization
After semantic grouping:
- A recursive character splitter is applied with dynamic chunk sizing.
- The chunk size is computed as:

chunk_size = length_of_document / N

where N is a configurable parameter that determines granularity.

By default, calling .as_retriever() uses semantic similarity to retrieve the top 4 most relevant chunks. Typically, one chunk, approximately one-fourth of the document, is enough to provide a meaningful response, depending on the question.
🔢 Chunk Size Calculation Example

For a document of length 1200 and N = 16:

chunk_size = 1200 / 16 = 75

This yields chunks of ~75 characters, with some overlap.
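The sizing above can be sketched in a few lines. Note that the package does not spell out exactly how the overlap is derived, so computing it as chunk_size × overlap_ratio is an assumption for illustration:

```python
def dynamic_chunk_params(doc_length: int, n: int, overlap_ratio: float) -> tuple[int, int]:
    # chunk_size = length_of_document / N (integer division for character counts)
    chunk_size = doc_length // n
    # Assumed: overlap as a fraction of the chunk size.
    chunk_overlap = int(chunk_size * overlap_ratio)
    return chunk_size, chunk_overlap

# The worked example: a 1200-character document with N = 16.
print(dynamic_chunk_params(1200, 16, 0.15))  # (75, 11)
```

Under that assumption, the two numbers map naturally onto the chunk_size and chunk_overlap arguments of LangChain's RecursiveCharacterTextSplitter.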
💡 Chunk Usage Guide
Depending on the desired response length, vary how many chunks are used:
| Response Type | Approx. Chunks Used | Chunking Config (N, overlap_ratio) |
|---|---|---|
| Short Answer | ~1/4 of total chunks (e.g., top 4 of 16) | light → N=16, overlap_ratio=0.15 |
| Moderately Detailed Answer | ~1/2 of total chunks (e.g., top 6 of 12) | standard → N=12, overlap_ratio=0.25 |
| Detailed Answer | ~3/4 of total chunks (e.g., top 6 of 8) | deep → N=8, overlap_ratio=0.35 |
| Very Detailed Answer | All chunks (~8 of 8) | max_detail → N=8, overlap_ratio=0.45 |
This balances context coverage and retrieval efficiency.
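The depth presets in the table can be expressed as a simple lookup the splitter might consult; the name `DEPTH_PRESETS` is illustrative, not the package's actual data structure:

```python
# (N, overlap_ratio) presets from the chunk usage guide above.
DEPTH_PRESETS = {
    "light":      {"N": 16, "overlap_ratio": 0.15},
    "standard":   {"N": 12, "overlap_ratio": 0.25},
    "deep":       {"N": 8,  "overlap_ratio": 0.35},
    "max_detail": {"N": 8,  "overlap_ratio": 0.45},
}

def params_for(depth: str) -> tuple[int, float]:
    preset = DEPTH_PRESETS[depth]
    return preset["N"], preset["overlap_ratio"]

print(params_for("standard"))  # (12, 0.25)
```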
🎯 Benefits
✅ Produces contextually relevant and semantically consistent chunks
✅ Saves computational and cost resources by minimizing redundant input
✅ Automatically adjusts chunk size and overlap based on document length and depth
📦 Installation
Make sure the required dependencies are installed:
pip install nltk sentence-transformers langchain
If needed, download NLTK tokenizers:
import nltk
nltk.download("punkt")
🧪 Usage Example
from splitter import SemanticSplitter

splitter = SemanticSplitter(
    threshold=0.4,            # Semantic similarity threshold for grouping
    depth='standard',         # Options: 'light', 'standard', 'deep', 'max_detail'
    tokenization_mode='para', # Options: 'para' (paragraph), 'sent' (sentence)
    model="BAAI/bge-base-en"  # Sentence embedding model (default: "BAAI/bge-base-en")
)

with open("path/to/your/document.txt", "r", encoding="utf-8") as f:
    document = f.read()

chunks = splitter.auto_split(document)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")
Download files

Source Distribution: semantic_splitter-0.1.1.tar.gz
Built Distribution: semantic_splitter-0.1.1-py3-none-any.whl
File details: semantic_splitter-0.1.1.tar.gz

- Download URL: semantic_splitter-0.1.1.tar.gz
- Upload date:
- Size: 7.6 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5e39b3f9809bff7caadd426c58ae71bab63c2a269f851b493bd51a6924a3a36f |
| MD5 | 3187c03c64381f8471833277dc684527 |
| BLAKE2b-256 | a97132452bbf310da5cf52a48023e3e1d588b3a5400a1a6516c5196fc8c5d39f |
File details: semantic_splitter-0.1.1-py3-none-any.whl

- Download URL: semantic_splitter-0.1.1-py3-none-any.whl
- Upload date:
- Size: 7.7 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.3

File hashes:

| Algorithm | Hash digest |
|---|---|
| SHA256 | 5cde8d10b43eca26b75451fda15d27e73dcfd1ac87d0fffc0f21d0cbf7d9da89 |
| MD5 | 65b98332b689605b9d7841487ea6fa8d |
| BLAKE2b-256 | 38c687be17c1aab5fe25be436756962331158777db766990d3bbe78c11237c61 |