# chunkle

Chunk long text with policies: smart text chunking that respects both line and token limits while preserving semantic boundaries.

GitHub: https://github.com/allen2c/chunkle · PyPI: https://pypi.org/project/chunkle/
## Install

```shell
pip install chunkle
```
## Quick Start

```python
from chunkle import chunk

text = "..."  # any long string

# Basic usage
for piece in chunk(text, lines_per_chunk=20, tokens_per_chunk=500):
    print(piece)

# Custom limits
chunks = list(chunk(text, lines_per_chunk=5, tokens_per_chunk=100))
```
## How It Works

```mermaid
flowchart TD
    A["📝 Start processing text"] --> B["📊 Accumulate chars<br/>Count lines & tokens"]
    B --> C{"✅ Both limits met?<br/>(lines ≥ min AND tokens ≥ min)"}
    C -->|No| D{"🚨 Exceeded 2x limits?"}
    C -->|Yes| E{"🎯 Good break point?<br/>(newline > whitespace)"}
    D -->|No| B
    D -->|Yes| F["💥 Force flush<br/>(semantic boundary ignored)"]
    E -->|No| D
    E -->|Yes| G["✂️ Flush chunk<br/>(clean semantic boundary)"]
    F --> H["🧽 Absorb whitespace/punctuation<br/>into previous chunk"]
    G --> H
    H --> I{"📄 More text?"}
    I -->|Yes| B
    I -->|No| J["🏁 Done"]
```
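The flow above can be sketched in plain Python. This is a simplified illustration, not chunkle's actual implementation: tokens are approximated by whitespace-split words rather than a tiktoken encoding, the whitespace-absorption step is omitted, and the 2x check is assumed to be an OR over the two limits.

```python
from typing import Generator

def chunk_sketch(
    content: str, lines_per_chunk: int = 2, tokens_per_chunk: int = 8
) -> Generator[str, None, None]:
    """Accumulate characters; flush at a good boundary once both minimums are met."""
    buf = ""
    for ch in content:
        buf += ch
        lines = buf.count("\n")
        tokens = len(buf.split())  # crude proxy for tiktoken tokens
        minimums_met = lines >= lines_per_chunk and tokens >= tokens_per_chunk
        # "Force flush" once either counter reaches 2x its limit (assumed OR)
        hard_limit = lines >= 2 * lines_per_chunk or tokens >= 2 * tokens_per_chunk
        good_break = ch.isspace()  # newline or other whitespace
        if (minimums_met and good_break) or hard_limit:
            yield buf
            buf = ""
    if buf:  # flush any trailing remainder
        yield buf
```

Because every character is appended before any flush, the yielded pieces always concatenate back to the original input.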
## Rules
- Dual Requirements: Chunks must meet BOTH line AND token minimums
- Smart Boundaries: Prefers newlines (best) > whitespace (good) > force split
- Force Split: Splits at 2x limits even if it breaks semantics
- Clean Starts: New chunks begin with meaningful characters
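The first and third rules can be written as two small predicates. The names, and the OR in the force-split condition, are illustrative assumptions rather than chunkle's internals:

```python
def may_flush(lines: int, tokens: int, lines_per_chunk: int, tokens_per_chunk: int) -> bool:
    # Dual Requirements: BOTH minimums must be met before a clean flush
    return lines >= lines_per_chunk and tokens >= tokens_per_chunk

def must_flush(lines: int, tokens: int, lines_per_chunk: int, tokens_per_chunk: int) -> bool:
    # Force Split: at 2x either limit, flush even at a bad boundary (assumed OR)
    return lines >= 2 * lines_per_chunk or tokens >= 2 * tokens_per_chunk
```

Note the asymmetry: a clean flush needs both counters, but a forced flush triggers on either.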
## Examples

English text:

```python
text = "Hello world!\nThis is a test.\nAnother line here."
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=8))
# Result: ['Hello world!\n', 'This is a test.\n', 'Another line here.']
```
Chinese text (force split):

```python
# Translation: "This is a very long sentence with no spaces; it triggers
# the force-split mechanism."
text = "這是一個很長的句子,沒有空格,會觸發強制切分機制。"
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=10))
# May split mid-sentence when no whitespace is available
```
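The contrast between the two examples comes down to whether the input offers any whitespace to break at. A tiny check (a helper of my own, not part of chunkle's API) makes the difference visible:

```python
def has_soft_break(text: str) -> bool:
    # True if the text offers at least one "good" break point (whitespace);
    # without one, only the 2x force-split path can end a chunk.
    return any(ch.isspace() for ch in text)
```

CJK punctuation such as "," and "。" does not count as whitespace, which is why the Chinese example can only flush via the force-split path.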
## API

```python
def chunk(
    content: str,
    *,
    lines_per_chunk: int = 20,
    tokens_per_chunk: int = 500,
    encoding: tiktoken.Encoding | None = None,
) -> Generator[str, None, None]: ...
```

Parameters:

- `content`: Text to split
- `lines_per_chunk`: Minimum lines per chunk (default: 20)
- `tokens_per_chunk`: Minimum tokens per chunk (default: 500)
- `encoding`: Custom tiktoken encoding (default: the encoding for gpt-4o-mini)
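Since the return type is `Generator[str, None, None]`, chunks can be consumed lazily without materializing the whole list. The stand-in below mimics only that streaming shape (one string per yield, lazily produced); it is not chunkle's splitting logic:

```python
from typing import Generator

def one_line_per_chunk(content: str) -> Generator[str, None, None]:
    # Stand-in with chunk()'s return shape: yields pieces lazily,
    # here simply one line (with its trailing newline) at a time.
    for line in content.splitlines(keepends=True):
        yield line
```

Streaming matters for large inputs: a consumer can process or discard each piece before the next one is produced.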
## License
MIT © 2025 Allen Chou