Semantic-aware chunking and clustering for LLM and RAG pipelines.

Semantic Chunker 🧠✂️

Semantic Chunker is a lightweight Python library for semantically aware chunking and clustering of text. It supports retrieval-augmented generation (RAG), LLM pipelines, and knowledge-processing workflows by grouping related chunks according to embedding similarity and merging them within a token budget.


🔥 Features

  • Embedding-based chunk similarity (via Sentence Transformers)
  • Token-aware merging with real model tokenizers
  • Clustered chunk merging for optimized RAG inputs
  • Preserves chunk metadata through merging
  • Visual tools: attention heatmaps, semantic graphs, cluster previews
  • Export options: JSON, Markdown, CSV
  • CLI for scripting and automation
  • 🧪 Debug mode exposing embeddings, the similarity matrix, and semantic pairs
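To make "token-aware merging" and "preserves chunk metadata" concrete, here is a minimal sketch of the idea in plain Python. It is not the library's implementation: `count_tokens` is a hypothetical whitespace counter standing in for a real model tokenizer, and `merge_within_budget` and the `metadata` key are names assumed for illustration.

```python
from typing import Callable, Dict, List


def count_tokens(text: str) -> int:
    """Hypothetical stand-in tokenizer: counts whitespace-separated words.

    Assumption: a real pipeline would call a model tokenizer here instead.
    """
    return len(text.split())


def merge_within_budget(
    chunks: List[Dict],
    max_tokens: int,
    token_counter: Callable[[str], int] = count_tokens,
) -> List[Dict]:
    """Greedily concatenate consecutive chunks without exceeding max_tokens.

    Each merged chunk keeps the metadata of every chunk that went into it.
    A single chunk larger than the budget is kept on its own, not split.
    """
    merged: List[Dict] = []
    texts: List[str] = []
    metas: List[Dict] = []
    used = 0
    for chunk in chunks:
        n = token_counter(chunk["text"])
        # Close the current merged chunk if this one would overflow the budget.
        if texts and used + n > max_tokens:
            merged.append({"text": " ".join(texts), "metadata": metas})
            texts, metas, used = [], [], 0
        texts.append(chunk["text"])
        metas.append(chunk.get("metadata", {}))
        used += n
    if texts:
        merged.append({"text": " ".join(texts), "metadata": metas})
    return merged


example_chunks = [
    {"text": "Artificial intelligence is a growing field.", "metadata": {"source": "a.txt"}},
    {"text": "Machine learning is a subset of AI.", "metadata": {"source": "a.txt"}},
    {"text": "Deep learning uses neural networks.", "metadata": {"source": "b.txt"}},
]
result = merge_within_budget(example_chunks, max_tokens=13)
for item in result:
    print(count_tokens(item["text"]), item["metadata"])
```

With a budget of 13 word-tokens, the first two chunks (6 and 7 words) fit together and the third starts a new merged chunk. Swapping `token_counter` for a real tokenizer's length function would change the counts but not the logic.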

🚀 Installation

pip install advanced-chunker

📦 Quick Start

from semantic_chunker.refactor import SemanticChunker

chunks = [
    {"text": "Artificial intelligence is a growing field."},
    {"text": "Machine learning is a subset of AI."},
    {"text": "Photosynthesis occurs in plants."},
    {"text": "Deep learning uses neural networks."},
    {"text": "Plants convert sunlight into energy."},
]

chunker = SemanticChunker(max_tokens=512)
merged_chunks = chunker.chunk(chunks)

for i, merged in enumerate(merged_chunks):
    print(f"Chunk {i}:")
    print(merged["text"])
    print()

🧠 Debugging & Visualization

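from semantic_chunker.refactor import SemanticChunker
from semantic_chunker.visualization import plot_attention_matrix, plot_semantic_graph, preview_clusters

# `chunks` is the list of {"text": ...} dicts defined in Quick Start.
chunker = SemanticChunker(max_tokens=512)
debug = chunker.get_debug_info(chunks)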

preview_clusters(debug["original_chunks"], debug["clusters"])
plot_attention_matrix(debug["similarity_matrix"], debug["clusters"])
plot_semantic_graph(debug["original_chunks"], debug["semantic_pairs"], debug["clusters"])
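The debug keys above include a similarity matrix and semantic pairs. As a rough illustration of what such values can look like, the sketch below builds a cosine-similarity matrix and lists the pairs above a threshold. The toy 2-D vectors stand in for real sentence embeddings, and the function names and the `(i, j, score)` tuple layout are assumptions, not the library's actual output format.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings: List[Sequence[float]]) -> List[List[float]]:
    """Pairwise cosine similarities as a square nested list."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


def semantic_pairs(matrix: List[List[float]], threshold: float) -> List[Tuple[int, int, float]]:
    """Index pairs (i < j) whose similarity is at least `threshold`, strongest first."""
    pairs = []
    for i in range(len(matrix)):
        for j in range(i + 1, len(matrix)):
            if matrix[i][j] >= threshold:
                pairs.append((i, j, matrix[i][j]))
    return sorted(pairs, key=lambda p: p[2], reverse=True)


# Toy embeddings (assumed values): two AI-like vectors and one plant-like vector.
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
matrix = similarity_matrix(toy_embeddings)
pairs = semantic_pairs(matrix, threshold=0.5)
print(pairs)
```

Only the first two vectors point in nearly the same direction, so they form the single pair that clears the 0.5 threshold.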

🛠 CLI Usage

Merge chunks semantically:

chunker chunk \
  --chunks path/to/chunks.json \
  --threshold 0.5 \
  --similarity-threshold 0.4 \
  --max-tokens 512 \
  --preview \
  --visualize \
  --export \
  --export-path output/merged \
  --export-format json
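The command reads its input from a JSON file. The file format is not documented here, so the sketch below assumes it mirrors the Quick Start input: a JSON list of objects, each with a `text` key. `write_chunks_file` and `read_chunks_file` are hypothetical helpers for preparing and sanity-checking such a file before passing it to `--chunks`.

```python
import json
import tempfile
from pathlib import Path
from typing import Dict, List, Union


def write_chunks_file(chunks: List[Dict], path: Union[str, Path]) -> Path:
    """Write chunks as a JSON list, creating parent directories if needed."""
    target = Path(path)
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(json.dumps(chunks, indent=2, ensure_ascii=False), encoding="utf-8")
    return target


def read_chunks_file(path: Union[str, Path]) -> List[Dict]:
    """Load a chunks file and check the assumed shape: a list of dicts with 'text'."""
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not isinstance(data, list) or not all(
        isinstance(item, dict) and "text" in item for item in data
    ):
        raise ValueError("expected a JSON list of objects with a 'text' key")
    return data


if __name__ == "__main__":
    sample = [
        {"text": "Artificial intelligence is a growing field."},
        {"text": "Photosynthesis occurs in plants."},
    ]
    with tempfile.TemporaryDirectory() as tmp:
        written = write_chunks_file(sample, Path(tmp) / "chunks.json")
        print(read_chunks_file(written))
```

If the CLI expects a different schema, only the validation in `read_chunks_file` would need to change.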

📊 Exports

Export clustered or merged chunks to:

  • .json: for ML/data pipelines
  • .md: for human-readable inspection
  • .csv: for spreadsheets or BI tools
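For a sense of what the three formats hold, here is a small standard-library sketch that serializes a list of merged chunks to JSON, Markdown, and CSV strings. The function names, the `## Chunk N` heading layout, and the `index,text` CSV columns are illustrative assumptions; the library's own exporters may lay out the files differently.

```python
import csv
import io
import json
from typing import Dict, List


def to_json(chunks: List[Dict]) -> str:
    """Serialize chunks as pretty-printed JSON, keeping non-ASCII text readable."""
    return json.dumps(chunks, indent=2, ensure_ascii=False)


def to_markdown(chunks: List[Dict]) -> str:
    """Render each chunk under its own heading for human inspection."""
    sections = [f"## Chunk {i}\n\n{chunk['text']}" for i, chunk in enumerate(chunks)]
    return "\n\n".join(sections) + "\n"


def to_csv(chunks: List[Dict]) -> str:
    """Write one row per chunk; the csv module quotes commas and quotes as needed."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "text"])
    for i, chunk in enumerate(chunks):
        writer.writerow([i, chunk["text"]])
    return buffer.getvalue()


merged_example = [
    {"text": "Machine learning is a subset of AI. Deep learning uses neural networks."},
    {"text": "Plants convert sunlight into energy."},
]
print(to_markdown(merged_example))
print(to_csv(merged_example))
```

Returning strings instead of writing files keeps the sketch easy to test; writing to disk is one `Path.write_text` call per format.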

📐 Architecture

Chunks → Embeddings → Cosine Similarity → Clustering → Merging
                                   ↓
                             Semantic Pairs (Optional)
                                   ↓
                             Visualization & Export
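The flow above can be sketched end to end in a few lines. This is an illustration only: hand-picked 2-D vectors replace real embeddings, and single-link threshold clustering (implemented with union-find) is swapped in because the diagram does not say which clustering algorithm the library uses. All function names are hypothetical.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))


def cluster_by_threshold(embeddings: List[Sequence[float]], threshold: float) -> List[List[int]]:
    """Group indices whose similarity reaches `threshold`, directly or transitively."""
    parent = list(range(len(embeddings)))

    def find(i: int) -> int:
        # Follow parent links to the cluster root, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(embeddings)):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


def merge_clusters(chunks: List[Dict], clusters: List[List[int]]) -> List[Dict]:
    """Concatenate the texts of each cluster in original order."""
    return [{"text": " ".join(chunks[i]["text"] for i in cluster)} for cluster in clusters]


chunks = [
    {"text": "Artificial intelligence is a growing field."},
    {"text": "Machine learning is a subset of AI."},
    {"text": "Photosynthesis occurs in plants."},
    {"text": "Deep learning uses neural networks."},
    {"text": "Plants convert sunlight into energy."},
]
# Assumed toy embeddings: AI-related chunks near the x-axis, plant-related near the y-axis.
toy_embeddings = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.8, 0.1], [0.2, 0.9]]

clusters = cluster_by_threshold(toy_embeddings, threshold=0.8)
merged = merge_clusters(chunks, clusters)
for item in merged:
    print(item["text"])
```

With these vectors the three AI-related chunks land in one cluster and the two plant-related chunks in another, so the merge step yields two texts. A real pipeline would also apply the token budget during merging.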

🧪 Testing

pytest tests/

🤝 Contributing

Pull requests are welcome! Please open an issue first if you'd like to add a feature or fix a bug.


📄 License

MIT License. See LICENSE for details.


🙌 Acknowledgements
