chunkle

Chunk long text with policies.

Smart text chunking that respects both line and token limits while preserving semantic boundaries.

GitHub: https://github.com/allen2c/chunkle
PyPI: https://pypi.org/project/chunkle/

Install

pip install chunkle

Quick Start

from chunkle import chunk

# Basic usage
for piece in chunk(text, lines_per_chunk=20, tokens_per_chunk=500):
    print(piece)

# Custom limits
chunks = list(chunk(text, lines_per_chunk=5, tokens_per_chunk=100))

How It Works

flowchart TD
    A["📝 Start processing text"] --> B["📊 Accumulate chars<br/>Count lines & tokens"]
    B --> C{"✅ Both limits met?<br/>(lines ≥ min AND tokens ≥ min)"}
    C -->|No| D{"🚨 Exceeded 2x limits?"}
    C -->|Yes| E{"🎯 Good break point?<br/>(newline > whitespace)"}

    D -->|No| B
    D -->|Yes| F["💥 Force flush<br/>(semantic boundary ignored)"]

    E -->|No| D
    E -->|Yes| G["✂️ Flush chunk<br/>(clean semantic boundary)"]

    F --> H["🧽 Absorb whitespace/punctuation<br/>into previous chunk"]
    G --> H
    H --> I{"📄 More text?"}
    I -->|Yes| B
    I -->|No| J["🏁 Done"]
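The flow above can be approximated in a few lines of plain Python. This is an illustrative sketch, not chunkle's actual implementation: it uses a crude character-count proxy (`len(buf) // 4`) instead of tiktoken, treats any whitespace as a good break point, and omits the whitespace-absorb step.

```python
def chunk_sketch(text, lines_per_chunk=20, tokens_per_chunk=500):
    """Illustrative chunker: accumulate characters, flush at a good
    break point once BOTH minimums are met, force-flush at 2x limits."""
    buf = []
    lines = 0
    for ch in text:
        buf.append(ch)
        if ch == "\n":
            lines += 1
        tokens = len(buf) // 4  # crude proxy for a real tokenizer count
        met = lines >= lines_per_chunk and tokens >= tokens_per_chunk
        forced = lines >= 2 * lines_per_chunk or tokens >= 2 * tokens_per_chunk
        good_break = ch.isspace()  # newline or other whitespace
        if (met and good_break) or forced:
            yield "".join(buf)
            buf, lines = [], 0
    if buf:  # emit whatever remains
        yield "".join(buf)
```

Note how the force-flush path fires even when no whitespace ever appears, which is what handles whitespace-free text.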

Rules

  1. Dual Requirements: Chunks must meet BOTH line AND token minimums
  2. Smart Boundaries: Prefers newlines (best) > whitespace (good) > force split
  3. Force Split: Splits at 2x limits even if it breaks semantics
  4. Clean Starts: New chunks begin with meaningful characters
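Rule 2's preference order can be expressed as a tiny scoring helper. This is a hypothetical illustration (`break_quality` is not part of chunkle's API):

```python
def break_quality(ch: str) -> int:
    """Rank a character as a chunk boundary: higher is better."""
    if ch == "\n":
        return 2  # best: newline boundary
    if ch.isspace():
        return 1  # good: other whitespace
    return 0      # worst: splitting here breaks mid-word
```

A chunker following Rule 2 flushes at the highest-quality boundary available before resorting to a force split.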

Examples

English Text:

text = "Hello world!\nThis is a test.\nAnother line here."
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=8))
# Result: ['Hello world!\n', 'This is a test.\n', 'Another line here.']

Chinese Text (force split):

text = "這是一個很長的句子,沒有空格,會觸發強制切分機制。"
# ("This is a very long sentence with no spaces; it triggers the force-split mechanism.")
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=10))
# May split mid-sentence when no whitespace available
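The force split fires here because the sample offers no preferred break points at all. A quick check (illustrative only, not part of chunkle):

```python
text = "這是一個很長的句子,沒有空格,會觸發強制切分機制。"
# Neither of the preferred boundaries (newline, whitespace)
# ever appears, so only the 2x force-split rule can flush.
has_break = any(ch.isspace() for ch in text)
```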

API

def chunk(
    content: str,
    *,
    lines_per_chunk: int = 20,
    tokens_per_chunk: int = 500,
    encoding: tiktoken.Encoding | None = None,
) -> Generator[str, None, None]:

Parameters:

  • content: Text to split
  • lines_per_chunk: Minimum lines per chunk (default: 20)
  • tokens_per_chunk: Minimum tokens per chunk (default: 500)
  • encoding: Custom tiktoken encoding (default: the encoding used by gpt-4o-mini)

License

MIT © 2025 Allen Chou
