
Chunking components for the Sayou Data Platform


sayou-chunking


Overview

The Structure-Aware Splitter for Sayou Fabric.

sayou-chunking splits large texts into smaller, semantically meaningful units called Chunks. Unlike traditional splitters that cut text blindly by character count, this library takes the syntactic structure of the data into account.

It focuses on preserving the integrity of code blocks, tables, and JSON objects, so that retrieval-augmented generation (RAG) systems fetch complete, executable contexts.
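
To make the contrast concrete, here is a minimal sketch in plain Python. It is not sayou-chunking's implementation; `naive_split` and `fence_aware_split` are hypothetical helpers written only for this illustration. The first cuts by character count, the second splits on blank lines but keeps a fenced code block in one piece.

```python
FENCE = "`" * 3  # Markdown code-fence marker, built this way to keep the example readable


def naive_split(text: str, size: int) -> list[str]:
    """Cut every `size` characters, ignoring structure."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def fence_aware_split(text: str) -> list[str]:
    """Split on blank lines, but never inside a fenced code block."""
    chunks, current, in_fence = [], [], False
    for line in text.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # toggles on the opening and the closing fence
        if line.strip() == "" and not in_fence:
            if current:
                chunks.append("\n".join(current))
                current = []
        else:
            current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks


doc = "\n".join([
    "Intro paragraph.",
    "",
    FENCE + "python",
    "def f():",
    "",
    "    return 1",
    FENCE,
    "",
    "Closing paragraph.",
])

naive = naive_split(doc, 30)    # at least one piece ends inside the code block
aware = fence_aware_split(doc)  # the code block survives as a single chunk
```

With this input the character-based split leaves a piece that contains an opening fence without its closing fence, which is exactly the kind of broken context the library is designed to avoid.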


1. Architecture & Role

The Chunking engine takes refined text (the output of Refinery) and applies a Syntax-Aware Strategy to produce atomic chunks.

graph LR
    Text[Refined Text] --> Pipeline[Chunking Pipeline]
    
    subgraph Strategies
        MD[Markdown Header]
        Code[Code AST]
        JSON[JSON Object]
    end
    
    Pipeline -->|Config Routing| Strategies
    Strategies --> Chunks[Atomic Chunks]
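
The "Config Routing" edge in the diagram can be pictured as a lookup from a strategy key to a splitter function. The sketch below is an assumption about the general pattern, not the library's internals: `STRATEGIES`, `route`, and the two placeholder splitters are hypothetical names, and only the keys mirror the strategy names listed in section 2.

```python
import json
from typing import Callable

Splitter = Callable[[str], list[str]]


def split_paragraphs(text: str) -> list[str]:
    # Placeholder for a Markdown-aware splitter: one chunk per blank-line-separated block.
    return [block.strip() for block in text.split("\n\n") if block.strip()]


def split_json_array(text: str) -> list[str]:
    # Placeholder for a JSON splitter: one chunk per top-level array element.
    return [json.dumps(item) for item in json.loads(text)]


# Hypothetical registry mapping a strategy key to its splitter.
STRATEGIES: dict[str, Splitter] = {
    "markdown": split_paragraphs,
    "json": split_json_array,
}


def route(content: str, strategy: str) -> list[str]:
    """Dispatch content to the splitter registered under `strategy`."""
    try:
        splitter = STRATEGIES[strategy]
    except KeyError:
        raise ValueError(f"unknown strategy: {strategy!r}") from None
    return splitter(content)
```

A registry like this keeps the pipeline itself free of format-specific logic; adding a format means registering one more function.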

1.1. Core Features

  • Syntax Awareness: Never splits in the middle of a code block or a markdown table.
  • Hierarchy Preservation: Attaches metadata about the parent section (e.g., Header Path, Class Name) to every chunk.
  • Atomic Integrity: Ensures that a JSON object or a Python function remains a single unit.
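
Hierarchy preservation can be illustrated with a short, standalone sketch. It does not use sayou-chunking; `split_with_header_path` and the `header_path` key are hypothetical names chosen for this example. Each body block is tagged with the chain of Markdown headers above it.

```python
import re

HEADER = re.compile(r"^(#{1,6})\s+(.*)$")


def split_with_header_path(markdown: str) -> list[dict]:
    """Return one chunk per section body, tagged with its chain of parent headers."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # stack of (level, title) for the open headers
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            chunks.append({"content": text, "header_path": [title for _, title in path]})
        body.clear()

    for line in markdown.splitlines():
        match = HEADER.match(line)
        if match:
            flush()  # close the previous section under the old path
            level = len(match.group(1))
            # Drop headers at the same or a deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\nIntro.\n## Install\nRun pip.\n## Usage\nCall it.\n# FAQ\nNone yet."
sections = split_with_header_path(doc)
```

This sketch treats every line starting with `#` as a header, so a comment line inside a code block would be misread; a real splitter would also need to track fenced blocks.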

2. Available Strategies

sayou-chunking prioritizes deterministic structural splitting over probabilistic methods.

| Strategy Key | Target Format | Description |
| --- | --- | --- |
| markdown | Markdown, Text | Splits by headers (#, ##). Preserves tables and code blocks as atomic units. |
| code | Python, JS, Java | Uses the AST (Abstract Syntax Tree) to split by class and function definitions. |
| json | JSON, JSONL | Splits large JSON arrays into individual records or sub-trees. |

3. Installation

pip install sayou-chunking

4. Usage

The ChunkingPipeline is the entry point. It accepts a ChunkingRequest containing content and metadata; the examples below pass the request as a plain dict with content and metadata keys.

Case A: Markdown Splitting (RAG Standard)

Ideal for documentation. It splits by headers while keeping sections together.

from sayou.chunking import ChunkingPipeline

text_content = """
# Section 1
Introduction text...

## Subsection 1.1
- Item A
- Item B
"""

chunks = ChunkingPipeline.process(
    data={"content": text_content, "metadata": {"source": "doc.md"}},
    strategy="markdown"
)

# Result
for chunk in chunks:
    print(f"[{chunk.metadata['type']}] {chunk.content[:20]}...")
    # Example output:
    # [heading] # Section 1...
    # [text] Introduction text...

Case B: Code Splitting (Python AST)

Ideal for code analysis. It splits by logical units (Functions/Classes).

from sayou.chunking import ChunkingPipeline

code_content = """
class MyClass:
    def method_a(self):
        print("hello")

def global_func():
    pass
"""

chunks = ChunkingPipeline.process(
    data={"content": code_content, "metadata": {"language": "python"}},
    strategy="code"
)

# Result: 2 Chunks (1 Class block, 1 Function block)
print(f"Generated {len(chunks)} logic blocks.")
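
For intuition about what AST-based splitting involves, the sketch below uses only Python's standard `ast` module. It is not the library's code; `split_python_source` and the dictionary keys are hypothetical. It returns one block per top-level class or function, with the exact source text of each.

```python
import ast


def split_python_source(source: str) -> list[dict]:
    """Return one block per top-level class or function definition."""
    tree = ast.parse(source)
    blocks = []
    for node in tree.body:
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            blocks.append({
                "name": node.name,
                "kind": type(node).__name__,
                # Exact source text of the definition (decorator lines are not included).
                "content": ast.get_source_segment(source, node),
            })
    return blocks


source = '''
class MyClass:
    def method_a(self):
        print("hello")

def global_func():
    pass
'''

blocks = split_python_source(source)
```

Because the boundaries come from the parser rather than from line or character counts, a method never ends up separated from its class, and a function body is never cut in half.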

Case C: JSON Splitting

Ideal for processing large data logs or API responses.

from sayou.chunking import ChunkingPipeline

json_content = '[{"id": 1, "val": "A"}, {"id": 2, "val": "B"}]'

chunks = ChunkingPipeline.process(
    data={"content": json_content, "metadata": {}},
    strategy="json"
)

# Result: 2 Chunks (Each object is a separate chunk)
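
The same idea can be sketched with the standard `json` module alone. The helpers `split_json_records` and `split_jsonl` are hypothetical and only illustrate record-level splitting for JSON arrays and JSONL input; they say nothing about how the library handles sub-trees or size limits.

```python
import json


def split_json_records(text: str) -> list[str]:
    """One chunk per element of a top-level array; a non-array value stays whole."""
    data = json.loads(text)
    records = data if isinstance(data, list) else [data]
    return [json.dumps(record, ensure_ascii=False) for record in records]


def split_jsonl(text: str) -> list[str]:
    """One chunk per non-empty line; each line is parsed to confirm it is valid JSON."""
    return [
        json.dumps(json.loads(line), ensure_ascii=False)
        for line in text.splitlines()
        if line.strip()
    ]


records = split_json_records('[{"id": 1, "val": "A"}, {"id": 2, "val": "B"}]')
lines = split_jsonl('{"a": 1}\n\n{"a": 2}\n')
```

Parsing before splitting guarantees that every chunk is itself valid JSON, which a character-based cut cannot promise.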

5. Configuration Keys

Customize the behavior of each splitter via the config dictionary.

  • markdown: header_depth (1-6), strip_headers (bool).
  • code: language (python), chunk_lines (min/max lines).
  • json: jq_query (filter pattern), max_size.
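
As a rough illustration of the markdown keys, the sketch below builds a config dictionary and checks it against the ranges stated above. The values are placeholders, `validate_markdown_config` is a hypothetical helper, and how the dictionary is actually handed to the pipeline is not shown in this README, so treat this as an assumption about shape only.

```python
# Placeholder values; only the key names and the 1-6 range come from the list above.
markdown_config = {"header_depth": 2, "strip_headers": False}


def validate_markdown_config(config: dict) -> dict:
    """Check the two markdown keys if present; other keys pass through untouched."""
    if "header_depth" in config:
        depth = config["header_depth"]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(depth, bool) or not isinstance(depth, int) or not 1 <= depth <= 6:
            raise ValueError("header_depth must be an integer from 1 to 6")
    if "strip_headers" in config and not isinstance(config["strip_headers"], bool):
        raise ValueError("strip_headers must be a bool")
    return config


checked = validate_markdown_config(markdown_config)
```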

6. License

Apache 2.0 License © 2026 Sayouzone

7. Plugin List

Plugin examples:

  • Json Chunking
  • Markdown Chunking
  • Python Chunking
  • Java Chunking
  • Javascript Chunking
