Chunking components for the Sayou Data Platform
sayou-chunking
The Intelligent Text Splitter for Sayou Fabric.
sayou-chunking splits large texts into smaller, semantically meaningful units called Chunks. This is a critical step for RAG (Retrieval-Augmented Generation) systems, as it directly impacts retrieval accuracy.
It goes beyond simple character splitting by offering structure-aware, semantic, and hierarchical chunking strategies.
💡 Core Philosophy
"Context is King."
Blindly cutting text at 500 characters breaks sentences and loses meaning. sayou-chunking aims to preserve context by:
- Structure Awareness: Respects document headers, tables, and code blocks (especially in Markdown).
- Semantic Coherence: Groups sentences that belong to the same topic using similarity metrics.
- Hierarchy: Maintains Parent-Child relationships so that retrieval matches small, precise chunks while the LLM receives the larger surrounding context.
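The Parent-Child idea can be sketched in plain Python. This is a minimal illustration, not sayou-chunking's implementation: the `Chunk` dataclass, `build_parent_child`, and `retrieve_context` are hypothetical names, paragraphs stand in for parents, sentences for children, and word overlap stands in for a real retriever.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Chunk:
    """Hypothetical container; not sayou-chunking's own chunk type."""
    chunk_id: str
    text: str
    parent_id: Optional[str] = None


def split_sentences(text):
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def build_parent_child(text):
    """Paragraphs become parents; each sentence becomes a child linked to its parent."""
    parents = {}
    children = []
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    for p_idx, paragraph in enumerate(paragraphs):
        parent = Chunk(chunk_id=f"p{p_idx}", text=paragraph)
        parents[parent.chunk_id] = parent
        for s_idx, sentence in enumerate(split_sentences(paragraph)):
            children.append(Chunk(f"p{p_idx}-c{s_idx}", sentence, parent.chunk_id))
    return parents, children


def retrieve_context(query, parents, children):
    """Match the query against small children, then return the larger parent text."""
    query_words = set(query.lower().split())
    best = max(
        children,
        key=lambda c: len(query_words & set(re.findall(r"\w+", c.text.lower()))),
    )
    return parents[best.parent_id].text


doc = (
    "Chunking splits text into units. Small units are easy to match.\n\n"
    "Tables need special care. A table cut in half loses its header row."
)
parents, children = build_parent_child(doc)
context = retrieve_context("header row", parents, children)
print(context)
```

The query matches only the short sentence about the header row, but the caller gets the whole second paragraph back, which is the trade-off the Hierarchy bullet describes.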
📦 Installation
pip install sayou-chunking
⚡ Quick Start
The ChunkingPipeline provides a unified interface for various splitting strategies.
from sayou.chunking.pipeline import ChunkingPipeline


def run_demo():
    # 1. Initialize Pipeline
    pipeline = ChunkingPipeline()
    pipeline.initialize()

    # 2. Prepare Input (e.g., from Refinery)
    # The Markdown lines stay at column 0 so headers are not indented.
    text_content = """
# Section 1: Introduction
Chunking is the process of breaking text down.
## Benefits
- Better Retrieval
- Context Preservation
"""
    request = {
        "content": text_content,
        "metadata": {"source": "doc.md"},
        "config": {"chunk_size": 50},
    }

    # 3. Run with Strategy ('markdown', 'recursive', 'semantic', etc.)
    chunks = pipeline.run(request, strategy="markdown")

    # 4. Result
    for i, chunk in enumerate(chunks):
        print(f"[{i}] Type: {chunk.metadata.get('semantic_type')}")
        print(f" Content: {chunk.chunk_content}")


if __name__ == "__main__":
    run_demo()
🔑 Key Components
Splitter
- RecursiveSplitter: The standard strategy. Splits by paragraph -> line -> sentence -> word to keep related text together.
- MarkdownSplitter: Aware of Markdown syntax. Splits by headers (#) first, protecting tables and code blocks.
- FixedLengthSplitter: Hard split by character count. Useful when strict token limits are required.
- StructureSplitter: Splits based on user-defined regex patterns (e.g., "Article \d+").
- SemanticSplitter: Uses cosine similarity between sentences to find topic breakpoints.
- ParentDocumentSplitter: Creates large "Parent" chunks for context and small "Child" chunks for retrieval, linking them together.
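The paragraph -> line -> sentence -> word order used by the recursive strategy can be sketched as follows. This is a standalone illustration of the general technique, not RecursiveSplitter's actual code: `recursive_split` and `SEPARATORS` are hypothetical names, sizes are counted in characters, and the separator is dropped wherever a chunk boundary falls.

```python
# Coarsest to finest: paragraph, line, sentence, word (assumed separators).
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text, chunk_size, separators=SEPARATORS):
    """Split on the coarsest separator first; use finer ones only for oversized pieces."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No separators left: fall back to a hard character cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks = []
    buffer = ""
    for piece in text.split(sep):
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate) <= chunk_size:
            buffer = candidate  # still fits: keep related text together
            continue
        if buffer:
            chunks.append(buffer)
            buffer = ""
        if len(piece) <= chunk_size:
            buffer = piece
        else:
            # The piece alone is too large: retry with the finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if buffer:
        chunks.append(buffer)
    return [c for c in chunks if c.strip()]


sample = (
    "Intro paragraph one.\n\n"
    "Second paragraph is a bit longer. It has two sentences.\n\n"
    "End."
)
result = recursive_split(sample, chunk_size=40)
for chunk in result:
    print(repr(chunk))
```

Only the middle paragraph exceeds 40 characters, so it alone is broken down further, at the sentence boundary; the other paragraphs survive intact.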
🤝 Contributing
We welcome contributions for New Strategies (e.g., CodeSplitter for Python/JS) or Integrations with other embedding models for Semantic Splitting.
📜 License
Apache 2.0 License © 2025 Sayouzone
File details
Details for the file sayou_chunking-0.1.6.tar.gz.
File metadata
- Download URL: sayou_chunking-0.1.6.tar.gz
- Upload date:
- Size: 19.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 4ea277146a755424e76ff41e5bc5329a9e956cf8420bffcb0480dd810d2bb223 |
| MD5 | 110fa0d84bf772067e661694342fb3bf |
| BLAKE2b-256 | f63b0c807679e3a6b4fa52b54df3d9789a8399b23c5cd2bd9953005edf80176a |
File details
Details for the file sayou_chunking-0.1.6-py3-none-any.whl.
File metadata
- Download URL: sayou_chunking-0.1.6-py3-none-any.whl
- Upload date:
- Size: 20.5 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.1.0 CPython/3.13.7
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | cb1df6357f54c70f6cbc4c039d8a3f659a8d7edd1f0a0e6808903951e62bcefd |
| MD5 | 8c1645317d628282a265cd8e32256411 |
| BLAKE2b-256 | 28b12a8559eb832dc29ebdde0859fb1c9b4e6c03dadf4ad77adfde1fccef5198 |