
Token-aware, LangChain-compatible semantic chunker with PDF and layout support


Semantic Chunker for LangChain

Hitting context-length limits when passing large documents to a token-limited LLM? This chunker solves that problem. It is a token-aware, LangChain-compatible chunker that splits text (from PDF, Markdown, or plain text) into semantically coherent chunks while respecting model token limits.


🚀 Features

  • 🔍 Model-Aware Token Limits: Automatically adjusts chunking size for GPT-3.5, GPT-4, Claude, and others.

  • 📄 Multi-format Input Support:

    • PDF via pdfplumber
    • Plain .txt
    • Markdown
    • (Extendable to .docx and .html)
  • 🔁 Overlapping Chunks: Smart overlap between paragraphs to preserve context.

  • 🧠 Smart Merging: Merges chunks smaller than 300 tokens.

  • 🧩 Retriever-Ready: Direct integration with LangChain retrievers via FAISS.

  • 🔧 CLI Support: Run from terminal with one command.
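To make the overlap and smart-merging features above concrete, here is a minimal, self-contained sketch of the idea. This is illustrative only, not the library's actual implementation: it uses whitespace word count as a stand-in for a real model tokenizer, and the function names and default sizes are made up for the example.

```python
def count_tokens(text: str) -> int:
    # Real chunkers use a model tokenizer (e.g. tiktoken); word count
    # is used here only as a rough stand-in.
    return len(text.split())

def chunk_paragraphs(paragraphs, max_tokens=100, min_tokens=30, overlap=1):
    """Greedily pack paragraphs into chunks under max_tokens, carrying the
    last `overlap` paragraphs of each chunk into the next one to preserve
    context, then merge chunks smaller than min_tokens into their
    predecessor (the "smart merging" step)."""
    chunks, current = [], []
    for para in paragraphs:
        if current and count_tokens(" ".join(current + [para])) > max_tokens:
            chunks.append(" ".join(current))
            # Overlap: start the next chunk with the tail of this one.
            current = current[-overlap:] if overlap > 0 else []
        current.append(para)
    if current:
        chunks.append(" ".join(current))
    # Smart merging: fold undersized chunks into the previous chunk.
    merged = []
    for chunk in chunks:
        if merged and count_tokens(chunk) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            merged.append(chunk)
    return merged
```

A single paragraph larger than `max_tokens` stays as one oversized chunk in this sketch; a production chunker (like this package) would split it further.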


📆 Installation

pip install semantic-chunker-langchain

Requires Python 3.9 - 3.12


🛠️ Usage

🔸 Chunk a PDF and Save to JSON/TXT

semantic-chunker sample.pdf --txt chunks.txt --json chunks.json

🔸 From Code

from semantic_chunker_langchain.chunker import SemanticChunker, SimpleSemanticChunker
from semantic_chunker_langchain.extractors.pdf import extract_pdf
from semantic_chunker_langchain.outputs.formatter import write_to_txt

# Extract
docs = extract_pdf("sample.pdf")

# Using SemanticChunker
chunker = SemanticChunker(model_name="gpt-3.5-turbo")
chunks = chunker.split_documents(docs)

# Save to file
write_to_txt(chunks, "output.txt")

# Using SimpleSemanticChunker
simple_chunker = SimpleSemanticChunker(model_name="gpt-3.5-turbo")
simple_chunks = simple_chunker.split_documents(docs)

🔸 Convert to Retriever

from langchain_community.embeddings import OpenAIEmbeddings
retriever = chunker.to_retriever(chunks, embedding=OpenAIEmbeddings())
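Conceptually, the retriever ranks chunks by similarity to a query and returns the top matches. The toy sketch below shows that idea with bag-of-words cosine similarity in place of the FAISS index and OpenAI embeddings that `to_retriever` actually uses; all names here are illustrative, not part of the library's API.

```python
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(chunks, query, k=2):
    # Score every chunk against the query and return the k best.
    qv = Counter(query.lower().split())
    scored = [(cosine(Counter(c.lower().split()), qv), c) for c in chunks]
    return [c for _, c in sorted(scored, key=lambda s: s[0], reverse=True)[:k]]

chunks = [
    "Chunking splits documents into token-limited pieces",
    "FAISS indexes embeddings for fast similarity search",
    "The CLI writes chunks to JSON or TXT files",
]
best = retrieve(chunks, "similarity search with embeddings", k=1)
```

In the real pipeline, the embedding model replaces the word-count vectors, so semantically related chunks match even without shared words.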

📊 Testing

poetry run pytest tests/

👨‍💻 Authors

  • Prajwal Shivaji Mandale
  • Sudhnwa Ghorpade

📜 License

This project is licensed under the MIT License.
