Semantic-aware chunking and clustering for LLM and RAG pipelines.

Semantic Chunker 🧠✂️

Semantic Chunker is a lightweight Python library for semantically aware chunking and clustering of text. It is designed to support retrieval-augmented generation (RAG), LLM pipelines, and knowledge-processing workflows by grouping related chunks based on embedding similarity, subject to a token budget.


🔥 Features

  • Embedding-based chunk similarity (via Sentence Transformers)
  • Token-aware merging with real model tokenizers
  • Clustered chunk merging for optimized RAG inputs
  • Preserves chunk metadata through merging
  • Visual tools: attention heatmaps, semantic graphs, cluster previews
  • Export options: JSON, Markdown, CSV
  • Command-line interface (CLI) for scripting and automation
  • 🧪 Debug mode with embeddings, similarity matrix, semantic pairs

🚀 Installation

pip install semantic-chunker

📦 Quick Start

from semantic_chunker.refactor import SemanticChunker

chunks = [
    {"text": "Artificial intelligence is a growing field."},
    {"text": "Machine learning is a subset of AI."},
    {"text": "Photosynthesis occurs in plants."},
    {"text": "Deep learning uses neural networks."},
    {"text": "Plants convert sunlight into energy."},
]

chunker = SemanticChunker(max_tokens=512)
merged_chunks = chunker.chunk(chunks)

for i, merged in enumerate(merged_chunks):
    print(f"Chunk {i}:")
    print(merged["text"])
    print()
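To illustrate what a `max_tokens` budget implies for merging, here is a minimal sketch of greedy, budget-limited concatenation. It is not the library's implementation: `count_tokens` is a hypothetical whitespace-based stand-in for a real model tokenizer, `merge_within_budget` is an illustrative name, and collecting per-chunk metadata into a list under a `"metadata"` key is an assumption about how metadata might be carried through.

```python
def count_tokens(text):
    """Hypothetical stand-in for a model tokenizer: counts whitespace-separated words."""
    return len(text.split())


def merge_within_budget(chunks, max_tokens, token_counter=count_tokens):
    """Greedily concatenate chunk texts in order.

    A new merged chunk is started whenever adding the next chunk would push
    the running token count past max_tokens. A single chunk that is already
    over budget is kept as-is rather than split.
    """
    merged = []
    texts, metas, total = [], [], 0
    for chunk in chunks:
        n = token_counter(chunk["text"])
        if texts and total + n > max_tokens:
            merged.append({"text": " ".join(texts), "metadata": metas})
            texts, metas, total = [], [], 0
        texts.append(chunk["text"])
        metas.append(chunk.get("metadata", {}))  # keep each source chunk's metadata
        total += n
    if texts:
        merged.append({"text": " ".join(texts), "metadata": metas})
    return merged


example_chunks = [
    {"text": "Artificial intelligence is a growing field.", "metadata": {"source": "ai.txt"}},
    {"text": "Machine learning is a subset of AI.", "metadata": {"source": "ml.txt"}},
    {"text": "Deep learning uses neural networks.", "metadata": {"source": "dl.txt"}},
]

# With a 14-"token" budget, the first two chunks (6 + 7 words) fit together;
# the third would exceed the budget and starts a new merged chunk.
merged_example = merge_within_budget(example_chunks, max_tokens=14)
for item in merged_example:
    print(item["text"])
```

In practice the token counter would presumably wrap the tokenizer of the target model, since word counts and model token counts can differ substantially.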

🧠 Debugging & Visualization

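from semantic_chunker.refactor import SemanticChunker
from semantic_chunker.visualization import (
    plot_attention_matrix,
    plot_semantic_graph,
    preview_clusters,
)

# `chunks` is the list of {"text": ...} dicts from the Quick Start example.
chunker = SemanticChunker(max_tokens=512)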
debug = chunker.get_debug_info(chunks)

preview_clusters(debug["original_chunks"], debug["clusters"])
plot_attention_matrix(debug["similarity_matrix"], debug["clusters"])
plot_semantic_graph(debug["original_chunks"], debug["semantic_pairs"], debug["clusters"])
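For intuition about the `semantic_pairs` entry in the debug output, the sketch below shows one plausible way such pairs could be derived from a similarity matrix: keep every pair of distinct chunks whose score meets a threshold. The function name `semantic_pairs_from_matrix`, the `(i, j, score)` tuple shape, and the toy matrix are all assumptions made for illustration; the library's actual format may differ.

```python
def semantic_pairs_from_matrix(similarity, threshold):
    """Return (i, j, score) for every pair i < j with score >= threshold,
    sorted from most to least similar."""
    pairs = []
    n = len(similarity)
    for i in range(n):
        for j in range(i + 1, n):  # upper triangle only: skip self-pairs and duplicates
            score = similarity[i][j]
            if score >= threshold:
                pairs.append((i, j, score))
    pairs.sort(key=lambda p: p[2], reverse=True)
    return pairs


# Toy symmetric similarity matrix for three chunks (made-up values).
toy_similarity = [
    [1.0, 0.8, 0.1],
    [0.8, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]

toy_pairs = semantic_pairs_from_matrix(toy_similarity, threshold=0.4)
print(toy_pairs)  # [(0, 1, 0.8)]
```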

🛠 CLI Usage

Merge chunks semantically:

chunker chunk \
  --chunks path/to/chunks.json \
  --threshold 0.5 \
  --similarity-threshold 0.4 \
  --max-tokens 512 \
  --preview \
  --visualize \
  --export \
  --export-path output/merged \
  --export-format json

📊 Exports

Export clustered or merged chunks to:

  • .json: for ML/data pipelines
  • .md: for human-readable inspection
  • .csv: for spreadsheets or BI tools
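As a rough picture of what the three formats might contain, the sketch below serializes a list of merged chunks with the Python standard library only. The helper names (`to_json`, `to_markdown`, `to_csv`), the Markdown heading layout, and the CSV column names are hypothetical; they are not taken from the library's exporter.

```python
import csv
import io
import json


def to_json(chunks):
    """Serialize chunks as pretty-printed JSON."""
    return json.dumps(chunks, indent=2, ensure_ascii=False)


def to_markdown(chunks):
    """Render each chunk under its own heading for human inspection."""
    sections = [f"## Chunk {i}\n\n{c['text']}\n" for i, c in enumerate(chunks)]
    return "\n".join(sections)


def to_csv(chunks):
    """Write one row per chunk; the csv module handles quoting of commas and quotes."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["index", "text"])
    for i, c in enumerate(chunks):
        writer.writerow([i, c["text"]])
    return buffer.getvalue()


export_example = [
    {"text": "Machine learning is a subset of AI."},
    {"text": "Plants convert sunlight into energy."},
]

print(to_json(export_example))
print(to_markdown(export_example))
print(to_csv(export_example))
```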

📐 Architecture

Chunks → Embeddings → Cosine Similarity → Clustering → Merging
                                   ↓
                             Semantic Pairs (Optional)
                                   ↓
                             Visualization & Export
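The similarity and clustering stages of the diagram can be sketched in pure Python as follows. This is an illustration under stated assumptions, not the library's algorithm: the 2-D vectors are made-up stand-ins for Sentence Transformers embeddings, and `cluster_by_threshold` uses simple single-link clustering (connected components over pairs whose similarity meets the threshold), which may not match the clustering method the library actually uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def similarity_matrix(embeddings):
    """Pairwise cosine similarity between all embeddings."""
    return [[cosine_similarity(a, b) for b in embeddings] for a in embeddings]


def cluster_by_threshold(sim, threshold):
    """Group indices into connected components, linking i and j
    whenever sim[i][j] >= threshold (single-link clustering)."""
    n = len(sim)
    assigned = [False] * n
    clusters = []
    for start in range(n):
        if assigned[start]:
            continue
        assigned[start] = True
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(i)
            for j in range(n):
                if not assigned[j] and sim[i][j] >= threshold:
                    assigned[j] = True
                    stack.append(j)
        clusters.append(sorted(members))
    return clusters


# Made-up 2-D "embeddings" for the five Quick Start sentences:
# index 0 = AI, 1 = machine learning, 2 = photosynthesis, 3 = deep learning, 4 = plants.
toy_embeddings = [
    [1.0, 0.1],
    [0.9, 0.2],
    [0.1, 1.0],
    [0.8, 0.1],
    [0.2, 0.9],
]

toy_sim = similarity_matrix(toy_embeddings)
toy_clusters = cluster_by_threshold(toy_sim, threshold=0.5)
print(toy_clusters)  # [[0, 1, 3], [2, 4]]
```

Each resulting cluster would then be passed to the merging stage, where its chunks are concatenated subject to the token budget.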

🧪 Testing

pytest tests/

🤝 Contributing

Pull requests are welcome! Please open an issue first if you'd like to add a feature or fix a bug.


📄 License

MIT License. See LICENSE for details.


🙌 Acknowledgements
