Chunking components for the Sayou Data Platform
sayou-chunking
sayou-chunking is a context-aware text splitting library for Python. It transforms raw text documents into Knowledge Graph-ready nodes, focusing on preserving semantic structure, hierarchy, and context.
This library is the "Assembler Preparation" component of the Sayou Data Platform. It sits between data cleaning (sayou-refinery) and knowledge graph construction (sayou-assembler), ensuring that RAG pipelines operate on logical, structured units of information rather than fragmented text.
Philosophy
sayou-chunking believes that "How you split determines how you retrieve."
Naive chunking destroys context (e.g., splitting a table in half). We prioritize Structure-First Splitting:
- Atomic Protection: Never split atomic blocks like Tables or Code Snippets.
- Hierarchical Binding: Headers are parents; contents are children. We maintain parent_id linkages for KG construction.
- Composite Strategies: Combine multiple splitting strategies (e.g., Structure for context, Recursive for retrieval units).
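As a rough illustration of Atomic Protection, the sketch below segments Markdown into blocks so that fenced code and pipe tables are never cut in the middle. It is not sayou-chunking's own implementation: `segment_atomic`, the `FENCE` constant, and the block labels are hypothetical names used only for this example, and table detection is simplified to "line starts with a pipe".

```python
from typing import List, Tuple

FENCE = "`" * 3  # Markdown code-fence marker


def segment_atomic(text: str) -> List[Tuple[str, str]]:
    """Split Markdown into (kind, block) pairs; code and table blocks stay whole."""
    blocks: List[Tuple[str, str]] = []
    buffer: List[str] = []
    kind = "text"
    in_fence = False

    def flush() -> None:
        # Emit the buffered lines as one block, skipping empty buffers.
        nonlocal buffer
        if any(line.strip() for line in buffer):
            blocks.append((kind, "\n".join(buffer).strip("\n")))
        buffer = []

    for line in text.splitlines():
        stripped = line.strip()
        if in_fence:
            # Inside a fence, everything (including blank lines) is kept.
            buffer.append(line)
            if stripped.startswith(FENCE):
                in_fence = False
                flush()
                kind = "text"
            continue
        if stripped.startswith(FENCE):
            flush()
            kind, in_fence = "code", True
            buffer.append(line)
        elif stripped.startswith("|"):
            # Consecutive pipe rows belong to one table block.
            if kind != "table":
                flush()
                kind = "table"
            buffer.append(line)
        elif not stripped:
            # A blank line ends the current paragraph or table.
            flush()
            kind = "text"
        else:
            if kind != "text":
                flush()
                kind = "text"
            buffer.append(line)
    flush()
    return blocks
```

A downstream splitter could then apply size limits only to the `"text"` blocks and pass `"table"` and `"code"` blocks through untouched.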
🚀 Key Features
- 4-Tier Architecture: Highly extensible design (Engine -> Interface -> Template -> Plugin -> Composite).
- Atomic Protection: Built-in TextSegmenter engine prevents breaking Markdown tables and code blocks.
- KG-Ready Output: Automatically generates parent_id, doc_level, and semantic_type metadata.
- Smart Plugins:
  - MarkdownPlugin: Anchors chunks to Headers (#) and classifies content (Table, List, H1...).
  - ParentDocument: Implements "Small-to-Big" retrieval strategy using composite splitters.
- Semantic Awareness: Detects topic shifts to create logically grouped chunks (Tier 2 Template).
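To make the header-anchoring idea concrete, here is a minimal sketch that binds each block to the nearest preceding Markdown header and fills in parent_id and semantic_type. It is an assumption-laden illustration, not the MarkdownPlugin itself: `bind_to_headers`, the ID scheme, and the two-way table/text classification are hypothetical, and nesting between header levels is not modelled.

```python
import re
from typing import Dict, List, Optional

HEADER_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def bind_to_headers(blocks: List[str], doc_id: str) -> List[Dict]:
    """Attach each non-header block to the most recent header via parent_id."""
    chunks: List[Dict] = []
    parent_id: Optional[str] = None
    section_title: Optional[str] = None
    for index, block in enumerate(blocks):
        match = HEADER_RE.match(block)
        if match:
            # A header becomes the parent of every block that follows it.
            level = len(match.group(1))
            parent_id = f"{doc_id}_h_{index}"  # hypothetical ID scheme
            section_title = match.group(2).strip()
            chunks.append({
                "chunk_content": block,
                "metadata": {
                    "chunk_id": parent_id,
                    "semantic_type": f"h{level}",
                    "is_header": True,
                    "part_index": index,
                },
            })
            continue
        metadata = {
            "chunk_id": f"{doc_id}_part_{index}",
            # Simplified classification: pipe-prefixed blocks count as tables.
            "semantic_type": "table" if block.lstrip().startswith("|") else "text",
            "part_index": index,
        }
        if parent_id is not None:
            metadata["parent_id"] = parent_id
            metadata["section_title"] = section_title
        chunks.append({"chunk_content": block, "metadata": metadata})
    return chunks
```

Blocks that appear before any header simply carry no parent_id, which keeps orphaned content distinguishable when the graph is built.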
📦 Installation
pip install sayou-chunking
⚡ Quickstart
The ChunkingPipeline orchestrates the splitting process. You can register any combination of Tier 2 Templates or Tier 3 Plugins.
Here is a complete example demonstrating how to process a Markdown file generated by sayou-refinery.
import os
import json
from typing import List

# 1. Import Core Pipeline & Interface
from sayou.chunking.pipeline import ChunkingPipeline
from sayou.chunking.interfaces.base_splitter import BaseSplitter

# 2. Import Splitters (Templates & Plugins)
from sayou.chunking.splitter.fixed_length import FixedLengthSplitter
from sayou.chunking.splitter.recursive import RecursiveSplitter
from sayou.chunking.splitter.structure import StructureSplitter
from sayou.chunking.splitter.semantic import SemanticSplitter
from sayou.chunking.splitter.parent_document import ParentDocumentSplitter
from sayou.chunking.plugins.markdown_plugin import MarkdownPlugin


def run_chunking_demo():
    # Setup File Paths: read the Markdown file produced by sayou-refinery
    refinery_output_md = os.path.join(".", "test.md")
    with open(refinery_output_md, "r", encoding="utf-8") as f:
        markdown_content = f.read()

    source_metadata = {
        "source_file": refinery_output_md,
        "id": "doc_refinery_output"
    }

    # 3. Register Splitters
    default_splitters: List[BaseSplitter] = [
        FixedLengthSplitter(),
        ParentDocumentSplitter(),
        RecursiveSplitter(),
        SemanticSplitter(),
        StructureSplitter(),
        MarkdownPlugin(),
    ]

    # 4. Initialize Pipeline
    pipeline = ChunkingPipeline(splitters=default_splitters)
    pipeline.initialize()

    # 5. Create Split Request
    # We use 'markdown' type to leverage structure-aware splitting
    split_request = {
        "type": "markdown",
        "content": markdown_content,
        "metadata": source_metadata,
        "chunk_size": 1000,
        "chunk_overlap": 50
    }

    print("--- [Sayou Chunking Demo] ---")
    print(f"Splitting using '{split_request['type']}'...")

    try:
        # 6. Run Splitting
        chunks = pipeline.split(split_request)
        print(f"✅ Successfully split content into {len(chunks)} chunks.\n")

        # 7. Save Output next to this script, under ./output
        output_dir = os.path.join(os.path.dirname(__file__), "output")
        os.makedirs(output_dir, exist_ok=True)
        output_path = os.path.join(output_dir, "chunks_output.json")
        with open(output_path, "w", encoding="utf-8") as f:
            json.dump(chunks, f, indent=2, ensure_ascii=False)
        print(f"Full output saved to {output_path}")
    except Exception as e:
        print(f"❌ Chunking failed: {e}")


if __name__ == "__main__":
    run_chunking_demo()
Example JSON Output (KG-Ready)
Notice how semantic_type is identified and parent_id links the content to its header.
[
  {
    "chunk_content": "# 1. Introduction",
    "metadata": {
      "chunk_id": "doc_123_h_0",
      "semantic_type": "h1",
      "is_header": true,
      "part_index": 0
    }
  },
  {
    "chunk_content": "| Feature | Status |\n|---|---|\n| Protect | Done |...",
    "metadata": {
      "chunk_id": "doc_123_part_1",
      "semantic_type": "table",
      "parent_id": "doc_123_h_0",
      "section_title": "1. Introduction",
      "part_index": 1
    }
  }
]
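One way to consume this output downstream is to turn the parent_id links into graph edges. The sketch below assumes only the field names shown in the example above; the helper names `parent_child_edges` and `children_by_parent` are hypothetical and are not part of sayou-chunking or sayou-assembler.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def parent_child_edges(chunks: List[dict]) -> List[Tuple[str, str]]:
    """Return (parent_id, chunk_id) pairs for every chunk that names a parent."""
    edges: List[Tuple[str, str]] = []
    for chunk in chunks:
        meta = chunk.get("metadata", {})
        if "parent_id" in meta:
            edges.append((meta["parent_id"], meta["chunk_id"]))
    return edges


def children_by_parent(chunks: List[dict]) -> Dict[str, List[str]]:
    """Group child chunk IDs under the header chunk they belong to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for parent_id, chunk_id in parent_child_edges(chunks):
        grouped[parent_id].append(chunk_id)
    return dict(grouped)


# Trimmed copy of the example output, keeping only the fields used here.
sample = [
    {"chunk_content": "# 1. Introduction",
     "metadata": {"chunk_id": "doc_123_h_0", "semantic_type": "h1"}},
    {"chunk_content": "| Feature | Status |",
     "metadata": {"chunk_id": "doc_123_part_1", "semantic_type": "table",
                  "parent_id": "doc_123_h_0"}},
]
print(parent_child_edges(sample))  # [('doc_123_h_0', 'doc_123_part_1')]
```

Header chunks have no parent_id in the example, so they appear only as edge sources here.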
🗺️ Roadmap (v0.1.0+)
sayou-chunking v0.0.1 establishes the structural foundation.
- HTML Plugin: Applying the "Parent-Child" strategy to HTML DOM trees.
- Real Semantic Engine: Integrating OpenAI/HuggingFace embeddings into SemanticSplitter.
- Tokenizer Support: Switching chunk_size calculation from characters to tokens (e.g., tiktoken).
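The difference between character-based and token-based sizing can be sketched with a pluggable length function. This is only an illustration of the idea, not the planned design: `split_by_budget` and `count_words` are hypothetical, and word counting stands in for a real tokenizer so the example runs without extra dependencies.

```python
from typing import Callable, List


def split_by_budget(text: str, chunk_size: int,
                    length_fn: Callable[[str], int] = len) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within chunk_size."""
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if current and length_fn(candidate) > chunk_size:
            # Adding this word would exceed the budget, so start a new chunk.
            chunks.append(current)
            current = word
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def count_words(text: str) -> int:
    # Stand-in for a tokenizer-backed counter (assumption, not a real tokenizer).
    return len(text.split())


text = "one two three four five six"
print(split_by_budget(text, 7))               # budget measured in characters
print(split_by_budget(text, 2, count_words))  # budget measured in "tokens"
```

Because the budget check goes through `length_fn`, the same splitting loop works unchanged whichever unit chunk_size is measured in.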
🤝 Contributing
We welcome contributions, whether it's a new Tier 3 Plugin for a specific format or an optimization of the Tier 1 Engine. Please check our contributing guidelines.
📜 License
Apache 2.0 License © 2025 Sayouzone