Token-aware, LangChain-compatible semantic chunker with PDF and layout support
Semantic Chunker for LangChain
Hitting token limits when passing large contexts to your LLM? This chunker solves that problem. It is a token-aware, LangChain-compatible chunker that splits text (from PDF, Markdown, or plain text) into semantically coherent chunks while respecting model token limits.
🚀 Features
- 🔍 Model-Aware Token Limits: Automatically adjusts chunking size for GPT-3.5, GPT-4, Claude, and others.
- 📄 Multi-format Input Support:
  - PDF via pdfplumber
  - Plain .txt
  - Markdown
  - (Extendable to .docx and .html)
- 🔁 Overlapping Chunks: Smart overlap between paragraphs to preserve context.
- 🧠 Smart Merging: Merges chunks smaller than 300 tokens.
- 🧩 Retriever-Ready: Direct integration with LangChain retrievers via FAISS.
- 🔧 CLI Support: Run from terminal with one command.
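To make the overlap and merging features more concrete, here is a minimal sketch of how paragraph packing with overlap and small-chunk merging could work. It is not the package's actual implementation: the function names `chunk_paragraphs` and `merge_small_chunks`, their parameters, and the whitespace-based `count_tokens` stand-in are all assumptions made for illustration. A real tokenizer would give different counts.

```python
from typing import Callable, List


def count_tokens(text: str) -> int:
    """Stand-in tokenizer (assumption): counts whitespace-separated words."""
    return len(text.split())


def chunk_paragraphs(
    paragraphs: List[str],
    max_tokens: int,
    overlap: int = 1,
    count: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Greedily pack paragraphs into chunks of at most max_tokens tokens.

    Each new chunk starts with the last `overlap` paragraphs of the previous
    chunk, unless carrying them over would exceed the budget. A single
    paragraph longer than max_tokens is kept whole rather than split.
    """
    chunks: List[List[str]] = []
    current: List[str] = []
    for para in paragraphs:
        candidate = current + [para]
        if current and count("\n\n".join(candidate)) > max_tokens:
            chunks.append(current)
            carried = current[-overlap:] if overlap > 0 else []
            if count("\n\n".join(carried + [para])) > max_tokens:
                carried = []  # drop the overlap if it would break the budget
            current = carried + [para]
        else:
            current = candidate
    if current:
        chunks.append(current)
    return ["\n\n".join(c) for c in chunks]


def merge_small_chunks(
    chunks: List[str],
    min_tokens: int,
    max_tokens: int,
    count: Callable[[str], int] = count_tokens,
) -> List[str]:
    """Merge a chunk into its predecessor when either one is below min_tokens
    and the combined text still fits within max_tokens."""
    merged: List[str] = []
    for chunk in chunks:
        if merged:
            combined = merged[-1] + "\n\n" + chunk
            is_small = count(merged[-1]) < min_tokens or count(chunk) < min_tokens
            if is_small and count(combined) <= max_tokens:
                merged[-1] = combined
                continue
        merged.append(chunk)
    return merged
```

With the 300-token threshold from the feature list, a call might look like `merge_small_chunks(chunk_paragraphs(paragraphs, max_tokens=1000), min_tokens=300, max_tokens=1000)`, where the 1000-token budget is an arbitrary example value.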
📆 Installation
pip install semantic-chunker-langchain
Requires Python 3.9 - 3.12
🛠️ Usage
🔸 Chunk a PDF and Save to JSON/TXT
semantic-chunker sample.pdf --txt chunks.txt --json chunks.json
🔸 From Code
from semantic_chunker_langchain.chunker import SemanticChunker, SimpleSemanticChunker
from semantic_chunker_langchain.extractors.pdf import extract_pdf
from semantic_chunker_langchain.outputs.formatter import write_to_txt
# Extract
docs = extract_pdf("sample.pdf")
# Using SemanticChunker
chunker = SemanticChunker(model_name="gpt-3.5-turbo")
chunks = chunker.split_documents(docs)
# Save to file
write_to_txt(chunks, "output.txt")
# Using SimpleSemanticChunker
simple_chunker = SimpleSemanticChunker(model_name="gpt-3.5-turbo")
simple_chunks = simple_chunker.split_documents(docs)
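After splitting, it can be useful to confirm that every chunk fits the budget you expect. The sketch below assumes that `split_documents` returns LangChain-style documents exposing a `page_content` attribute, which the source does not state explicitly. The helper `oversized_chunks` and the `FakeDocument` stand-in are hypothetical names introduced here, and the default counter is a whitespace approximation rather than a real tokenizer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class FakeDocument:
    """Stand-in for a LangChain Document; only page_content is used here."""
    page_content: str


def whitespace_count(text: str) -> int:
    """Approximate token count (assumption): whitespace-separated words."""
    return len(text.split())


def oversized_chunks(
    chunks: Iterable,
    max_tokens: int,
    count: Callable[[str], int] = whitespace_count,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every chunk whose text exceeds max_tokens."""
    report: List[Tuple[int, int]] = []
    for index, chunk in enumerate(chunks):
        tokens = count(chunk.page_content)
        if tokens > max_tokens:
            report.append((index, tokens))
    return report


# Example with stand-in documents; real chunks would come from split_documents.
sample = [FakeDocument("one two three"), FakeDocument("one two three four five")]
print(oversized_chunks(sample, max_tokens=4))
```

If `tiktoken` is installed, a more faithful counter can be passed in, for example `count=lambda t: len(tiktoken.encoding_for_model("gpt-3.5-turbo").encode(t))`.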
🔸 Convert to Retriever
from langchain_community.embeddings import OpenAIEmbeddings
retriever = chunker.to_retriever(chunks, embedding=OpenAIEmbeddings())
📊 Testing
poetry run pytest tests/
👨‍💻 Authors
- Prajwal Shivaji Mandale
- Sudhnwa Ghorpade
📜 License
This project is licensed under the MIT License.
File details
Details for the file semantic_chunker_langchain-0.1.4.tar.gz.
File metadata
- Download URL: semantic_chunker_langchain-0.1.4.tar.gz
- Upload date:
- Size: 4.5 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.12.5 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 2002a1ea802245a82e913cb9b30865464c5bae663d822ec137e8f2778c2930cd |
| MD5 | 32f8feb735d28adc6e656e520204a502 |
| BLAKE2b-256 | 7c40ea58fd1f7a40d9da4e361ec2bb164dc42b367060cbadcb0e5f5b65db0779 |
File details
Details for the file semantic_chunker_langchain-0.1.4-py3-none-any.whl.
File metadata
- Download URL: semantic_chunker_langchain-0.1.4-py3-none-any.whl
- Upload date:
- Size: 6.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: poetry/2.1.3 CPython/3.12.5 Windows/11
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 10dae2ed14a2a0729e15111acbf8d447f7ee555f8be9f8d96a299d332913d319 |
| MD5 | 8b0adb7e3f541eda6b4d5bdccccde6b1 |
| BLAKE2b-256 | 028eff90e2327207320d54d18de3e1d318f36c3026c34364ecd4fb34f5a80acc |