
Semantic chunking of documents using Sentence Transformers and LangChain.


📄 Semantic Document Chunking

This repository provides a method to convert documents into semantically meaningful chunks without relying on fixed chunk sizes or overlapping windows. Instead, it uses semantic chunking, dividing text based on meaning, topics, and the natural structure of the content to preserve contextual relevance.


🚀 Overview

Traditional chunking techniques split documents based solely on size or fixed length, often leading to fragmented and contextually inconsistent segments.
Our approach instead groups content by semantic similarity and then applies a dynamic chunking strategy to each group. This yields more meaningful, context-aware chunks and significantly reduces computational cost.


🔍 How It Works

📝 Sentence-wise Splitting

The document is first split into individual sentences or paragraphs, depending on the selected tokenization mode ('sent' for sentences, 'para' for paragraphs).
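The two modes can be sketched as follows. This is an illustrative stand-in, not the package's internal API: the function name `split_text` is ours, and the regex sentence splitter is a simplified substitute for NLTK's punkt tokenizer, which the package actually relies on.

```python
import re

def split_text(text: str, mode: str = "para") -> list[str]:
    """Split a document into units for semantic grouping.

    mode="para": split on blank lines (one unit per paragraph).
    mode="sent": naive regex sentence split; a simplified stand-in
    for nltk.tokenize.sent_tokenize.
    """
    if mode == "para":
        units = re.split(r"\n\s*\n", text)
    elif mode == "sent":
        units = re.split(r"(?<=[.!?])\s+", text)
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    return [u.strip() for u in units if u.strip()]

doc = "Cats purr. Dogs bark.\n\nPlanets orbit the sun."
print(split_text(doc, "para"))  # 2 paragraph units
print(split_text(doc, "sent"))  # 3 sentence units
```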

🔗 Semantic Segregation

  1. Calculate cosine similarity between sentences using a Sentence Transformer.
  2. Group sentences where similarity scores > 0.4 into clusters.
  3. Recursively repeat for ungrouped sentences until all are grouped semantically.
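The three steps above can be sketched as a greedy threshold clustering, shown here over toy 2-D vectors. This is our reading of the description, not the package's implementation; in practice the embeddings would come from something like `SentenceTransformer("BAAI/bge-base-en").encode(sentences)` rather than being hand-written.

```python
import numpy as np

def group_by_similarity(embeddings: np.ndarray, threshold: float = 0.4) -> list[list[int]]:
    """Greedy semantic grouping: seed a cluster with the first
    ungrouped sentence, pull in every ungrouped sentence whose cosine
    similarity to the seed exceeds the threshold, then repeat on the
    remainder until all sentences are grouped."""
    # Normalize rows so a dot product equals cosine similarity.
    unit = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    ungrouped = list(range(len(embeddings)))
    clusters = []
    while ungrouped:
        seed = ungrouped.pop(0)
        cluster = [seed]
        sims = unit @ unit[seed]  # cosine similarity of every row to the seed
        for i in ungrouped[:]:
            if sims[i] > threshold:
                cluster.append(i)
                ungrouped.remove(i)
        clusters.append(cluster)
    return clusters

# Toy "embeddings": two similar sentences and one outlier.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
print(group_by_similarity(emb))  # [[0, 1], [2]]
```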

⚙️ Dynamic Chunking with Retrieval Optimization

After semantic grouping:

  • A recursive character splitter is applied with dynamic chunk sizing.
  • The chunk size is computed as:

chunk_size = length_of_document / N

where N is a configurable parameter that determines granularity.

By default, calling .as_retriever() uses semantic similarity to retrieve the top 4 most relevant chunks. Typically, one chunk (approximately one-fourth of the document) is enough to provide a meaningful response, depending on the question.

🔢 Chunk Size Calculation Example

For a document of length 1200 and N = 16:

chunk_size = 1200 / 16 = 75

This would yield chunks of ~75 characters with some overlap.
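The arithmetic can be wrapped in a small helper. The overlap handling is an assumption on our part (overlap taken as `overlap_ratio * chunk_size`, matching the ratios in the depth table); the function name `chunk_params` is illustrative, not part of the package.

```python
def chunk_params(doc_length: int, n: int, overlap_ratio: float = 0.15) -> tuple[int, int]:
    """Compute (chunk_size, chunk_overlap) for a recursive character
    splitter: chunk_size = doc_length / N, and overlap is assumed to
    be overlap_ratio * chunk_size, rounded down."""
    chunk_size = doc_length // n
    chunk_overlap = int(chunk_size * overlap_ratio)
    # These values would be passed to e.g. LangChain's
    # RecursiveCharacterTextSplitter(chunk_size=..., chunk_overlap=...).
    return chunk_size, chunk_overlap

print(chunk_params(1200, 16))        # (75, 11) -- the example above
print(chunk_params(1200, 12, 0.25))  # (100, 25)
```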

💡 Chunk Usage Guide

Depending on the desired response length, vary how many chunks are used:

Response Type        | Approx. Chunks Used                       | Depth      | Chunking Config
-------------------- | ----------------------------------------- | ---------- | -------------------------
Short Answer         | ~1/4 of total chunks (e.g., top 4 of 16)  | light      | N=16, overlap_ratio=0.15
Moderate (Detailed)  | ~1/2 of total chunks (e.g., top 6 of 12)  | standard   | N=12, overlap_ratio=0.25
Detailed Answer      | ~3/4 of total chunks (e.g., top 6 of 8)   | deep       | N=8, overlap_ratio=0.35
Very Detailed Answer | All chunks (e.g., 8 of 8)                 | max_detail | N=8, overlap_ratio=0.45

This balances context coverage and retrieval efficiency.
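The depth presets from the table can be expressed as a lookup. The dictionary below is an illustrative restatement of the table, not the package's internal representation; `chunks_to_retrieve` is a hypothetical helper name.

```python
# Depth presets restated from the chunk usage table above.
DEPTH_PRESETS = {
    "light":      {"n": 16, "overlap_ratio": 0.15, "chunk_fraction": 0.25},
    "standard":   {"n": 12, "overlap_ratio": 0.25, "chunk_fraction": 0.50},
    "deep":       {"n": 8,  "overlap_ratio": 0.35, "chunk_fraction": 0.75},
    "max_detail": {"n": 8,  "overlap_ratio": 0.45, "chunk_fraction": 1.00},
}

def chunks_to_retrieve(depth: str) -> int:
    """How many of the N chunks to retrieve for a given depth setting."""
    preset = DEPTH_PRESETS[depth]
    return max(1, round(preset["n"] * preset["chunk_fraction"]))

print(chunks_to_retrieve("light"))       # 4 of 16
print(chunks_to_retrieve("standard"))    # 6 of 12
print(chunks_to_retrieve("max_detail"))  # 8 of 8
```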


🎯 Benefits

✅ Produces contextually relevant and semantically consistent chunks

✅ Saves compute and cost by minimizing redundant input

✅ Automatically adjusts chunk size and overlap based on document length and depth


📦 Installation

Make sure the required dependencies are installed:

pip install nltk sentence-transformers langchain

If needed, download NLTK tokenizers:

import nltk
nltk.download("punkt")

🧪 Usage Example

from splitter import SemanticSplitter

splitter = SemanticSplitter(
    threshold=0.4,             # Semantic similarity threshold for grouping
    depth='standard',          # Options: 'light', 'standard', 'deep', 'max_detail'
    tokenization_mode='para',  # Options: 'para' (paragraph), 'sent' (sentence)
    model="BAAI/bge-base-en"   # Sentence embedding model (default: "BAAI/bge-base-en")
)
with open("path/to/your/document.txt", "r", encoding="utf-8") as f:
    document = f.read()

chunks = splitter.auto_split(document)

for i, chunk in enumerate(chunks):
    print(f"Chunk {i+1}:\n{chunk.page_content}\n")


