Chunking components for the Sayou Data Platform
sayou-chunking
Overview
The Structure-Aware Splitter for Sayou Fabric.
sayou-chunking splits large texts into smaller, semantically meaningful units called Chunks. Unlike traditional splitters that cut text blindly by character count, this library follows the syntactic structure of the data.
It preserves the integrity of code blocks, tables, and JSON objects, so that retrieval-augmented generation (RAG) systems fetch complete, executable contexts.
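To make the contrast concrete, here is a minimal sketch in plain Python. It does not use sayou-chunking; `naive_split` and `fence_aware_split` are hypothetical helpers, and the sketch assumes blank lines are the only split points. It shows a fixed-size cut landing inside a fenced code block, while a fence-aware pass keeps the block whole.

```python
FENCE = "`" * 3  # a Markdown code fence, built this way to avoid nesting fences here


def naive_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fence_aware_split(text: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.strip().startswith(FENCE):
            in_fence = not in_fence  # toggle on opening and closing fences
        if line.strip() == "" and not in_fence:
            if current:
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)  # blank lines inside a fence are kept
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "def f():",
    "",
    "    return 1",
    FENCE,
    "",
    "Closing paragraph.",
])

naive = naive_split(doc, 30)     # the first piece ends inside the code block
aware = fence_aware_split(doc)   # the code block stays in one piece
```

The naive pieces can only be used after stitching them back together, whereas each fence-aware chunk is readable on its own.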
1. Architecture & Role
The Chunking engine takes raw text (from Refinery) and applies a Syntax-Aware Strategy to produce atomic chunks.
graph LR
    Text[Refined Text] --> Pipeline[Chunking Pipeline]
    subgraph Strategies
        MD[Markdown Header]
        Code[Code AST]
        JSON[JSON Object]
    end
    Pipeline -->|Config Routing| Strategies
    Strategies --> Chunks[Atomic Chunks]
1.1. Core Features
- Syntax Awareness: Never splits in the middle of a code block or a markdown table.
- Hierarchy Preservation: Attaches metadata about the parent section (e.g., Header Path, Class Name) to every chunk.
- Atomic Integrity: Ensures that a JSON object or a Python function remains a single unit.
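Hierarchy preservation can be pictured with a short standalone sketch. This is not the library's implementation: the function name `split_with_header_path` and the `header_path` metadata key are assumptions made for illustration, and the sketch ignores `#` characters that appear inside code blocks.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")  # ATX-style Markdown headers


def split_with_header_path(text: str) -> list[dict]:
    """Split on headers and tag each chunk with the path of its parent headers."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title)
    body: list[str] = []

    def flush() -> None:
        content = "\n".join(body).strip()
        if content:
            chunks.append({
                "content": content,
                # hypothetical metadata key, not confirmed by the library
                "metadata": {"header_path": " > ".join(t for _, t in path)},
            })
        body.clear()

    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # emit the previous section under the old path
            level = len(match.group(1))
            while path and path[-1][0] >= level:
                path.pop()  # leave sibling and deeper sections
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


sample = "# Guide\nIntro.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nNone yet."
sections = split_with_header_path(sample)
paths = [s["metadata"]["header_path"] for s in sections]
```

Carrying the path in metadata lets a retriever show where a chunk came from even when the chunk text itself no longer contains its headers.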
2. Available Strategies
sayou-chunking prioritizes deterministic structural splitting over probabilistic methods.
| Strategy Key | Target Format | Description |
|---|---|---|
| `markdown` | Markdown, Text | Splits by Headers (#, ##). Preserves Tables and Code Blocks as atomic units. |
| `code` | Python, JS, Java | Uses AST (Abstract Syntax Tree) to split by Class and Function definitions. |
| `json` | JSON, JSONL | Splits large JSON arrays into individual records or sub-trees. |
3. Installation
pip install sayou-chunking
4. Usage
The ChunkingPipeline is the entry point. It accepts a ChunkingRequest containing content and metadata.
Case A: Markdown Splitting (RAG Standard)
Ideal for documentation. It splits by headers while keeping sections together.
from sayou.chunking import ChunkingPipeline
text_content = """
# Section 1
Introduction text...
## Subsection 1.1
- Item A
- Item B
"""
chunks = ChunkingPipeline.process(
    data={"content": text_content, "metadata": {"source": "doc.md"}},
    strategy="markdown"
)
# Result
for chunk in chunks:
    print(f"[{chunk.metadata['type']}] {chunk.content[:20]}...")
# Output: [heading] # Section 1...
# Output: [text] Introduction text...
Case B: Code Splitting (Python AST)
Ideal for code analysis. It splits by logical units (Functions/Classes).
from sayou.chunking import ChunkingPipeline
code_content = """
class MyClass:
    def method_a(self):
        print("hello")

def global_func():
    pass
"""
chunks = ChunkingPipeline.process(
    data={"content": code_content, "metadata": {"language": "python"}},
    strategy="code"
)
# Result: 2 Chunks (1 Class block, 1 Function block)
print(f"Generated {len(chunks)} logic blocks.")
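For intuition about why this input yields two blocks, the following sketch uses only Python's standard `ast` module. It is an illustration of AST-based splitting in general, not the library's code; `split_top_level_defs` and the dictionary keys are hypothetical names.

```python
import ast


def split_top_level_defs(source: str) -> list[dict]:
    """Return one block per top-level class or function definition."""
    tree = ast.parse(source)
    blocks = []
    for node in tree.body:  # only top-level statements; methods stay inside their class
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            blocks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # exact source text of the definition (Python 3.8+)
                "content": ast.get_source_segment(source, node),
            })
    return blocks


source = '''
class MyClass:
    def method_a(self):
        print("hello")

def global_func():
    pass
'''

blocks = split_top_level_defs(source)
names = [b["name"] for b in blocks]
```

Because the walk stops at the module's top level, `method_a` travels with `MyClass` instead of becoming a separate block.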
Case C: JSON Splitting
Ideal for processing large data logs or API responses.
from sayou.chunking import ChunkingPipeline
json_content = '[{"id": 1, "val": "A"}, {"id": 2, "val": "B"}]'
chunks = ChunkingPipeline.process(
    data={"content": json_content, "metadata": {}},
    strategy="json"
)
# Result: 2 Chunks (Each object is a separate chunk)
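The record-level split can be approximated with the standard `json` module. The sketch below is an assumption about the general technique, not the library's code: `split_json_records` and the `index` metadata key are hypothetical.

```python
import json


def split_json_records(raw: str) -> list[dict]:
    """Parse a JSON document and emit one chunk per top-level array element."""
    data = json.loads(raw)
    # A non-array document is treated as a single record.
    records = data if isinstance(data, list) else [data]
    return [
        {
            # Each chunk is re-serialized, so it is valid JSON on its own.
            "content": json.dumps(record, ensure_ascii=False),
            "metadata": {"index": i},  # hypothetical key: position in the array
        }
        for i, record in enumerate(records)
    ]


json_content = '[{"id": 1, "val": "A"}, {"id": 2, "val": "B"}]'
records = split_json_records(json_content)
```

Parsing before splitting is what keeps each object intact; cutting the raw string by length could end a chunk in the middle of a key or value.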
5. Configuration Keys
Customize the behavior of each splitter via the config dictionary.
- markdown: header_depth (1-6), strip_headers (bool).
- code: language (python), chunk_lines (min/max lines).
- json: jq_query (filter pattern), max_size.
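As a rough illustration of what a config dictionary for the markdown splitter might look like, the sketch below merges user values over defaults and checks the documented ranges. The default values, the function name `resolve_markdown_config`, and the error behavior are all assumptions; only the key names and the 1-6 range come from the list above.

```python
from typing import Optional

# Hypothetical defaults; the real defaults are not stated in this document.
MARKDOWN_DEFAULTS = {"header_depth": 2, "strip_headers": False}


def resolve_markdown_config(user_config: Optional[dict] = None) -> dict:
    """Merge user settings over defaults and validate the documented keys."""
    config = {**MARKDOWN_DEFAULTS, **(user_config or {})}

    unknown = set(config) - set(MARKDOWN_DEFAULTS)
    if unknown:
        raise KeyError(f"unknown markdown config keys: {sorted(unknown)}")

    depth = config["header_depth"]
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(depth, bool) or not isinstance(depth, int) or not 1 <= depth <= 6:
        raise ValueError("header_depth must be an integer from 1 to 6")

    if not isinstance(config["strip_headers"], bool):
        raise TypeError("strip_headers must be a bool")

    return config


resolved = resolve_markdown_config({"header_depth": 3})
```

Validating early keeps a typo such as `header_dept` from being silently ignored.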
6. License
Apache 2.0 License © 2026 Sayouzone
7. Plugin List
| Plugin | Example | Description |
|---|---|---|
| Json Chunking | ▶ | |
| Markdown Chunking | ▶ | |
| Python Chunking | ▶ | |
| Java Chunking | ▶ | |
| Javascript Chunking | ▶ | |