Chunk long text with policies.
chunkle
Smart token-based chunking that respects both line and token limits while preserving clean starts.
GitHub: https://github.com/allen2c/chunkle
PyPI: https://pypi.org/project/chunkle/
Install
pip install chunkle
Quick Start
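from chunkle import chunk

# `text` is any str you want to split.

# Basic usage: chunk() is a generator, so pieces are produced lazily
for piece in chunk(text, lines_per_chunk=20, tokens_per_chunk=500):
    print(piece)

# Custom limits: materialize all pieces into a list
chunks = list(chunk(text, lines_per_chunk=5, tokens_per_chunk=100))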
How It Works (Token-based)
flowchart TD
A["📝 Start encoding to tokens"] --> B["📊 Accumulate token ids<br/>Track newline-token count"]
B --> C{"✅ Both limits met?<br/>(lines ≥ min AND tokens ≥ min)"}
C -->|No| B
C -->|Yes| D{"🔀 Current token is breaking?"}
D -->|Yes| E["🟡 Arm emit (should_emit=True)"] --> F{"🔤 Next token non-breaking?"}
D -->|No| G{"🚀 Force beyond multiplier?"}
G -->|Yes| H{"🔡 Token boundary is whitespace?"}
H -->|Yes| E
H -->|No| B
G -->|No| B
F -->|Yes| I["✂️ Emit buffer; new chunk starts meaningful"]
F -->|No| B
I --> J{"📄 More tokens?"}
J -->|Yes| B
J -->|No| K["🏁 Emit remaining buffer (merge trailing breaks)"]
Rules
- Dual Requirements: Emit only when both line and token minimums are met.
- Clean Starts: New chunks begin at the first non-breaking token.
- Trailing Breaks Merge: Line breaks at the boundary are absorbed into the previous chunk.
- Force Emit (2x multiplier by default): When the buffer exceeds threshold × force_chunk_over_threshold_times, force an emit, but only if the current token boundary is whitespace.
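The rules above can be sketched in plain Python. This is a minimal illustration, not chunkle's implementation: it swaps tiktoken for a hypothetical regex tokenizer (`toy_tokenize`), treats only "\n" as a breaking token, and assumes the forced split triggers on the token count alone, which is one reading of the force-split example. The names `toy_tokenize` and `sketch_chunk` are made up for this sketch.

```python
import re
from typing import Iterator, List


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer (assumption): newlines, other whitespace runs, word runs."""
    return re.findall(r"\n|[^\S\n]+|\S+", text)


def sketch_chunk(
    text: str,
    lines_per_chunk: int = 20,
    tokens_per_chunk: int = 500,
    force_times: int = 2,
) -> Iterator[str]:
    buf: List[str] = []      # tokens of the chunk being built
    newline_count = 0        # newline tokens seen in the current chunk
    should_emit = False      # "armed": emit before the next non-breaking token

    for tok in toy_tokenize(text):
        is_breaking = tok == "\n"

        # Clean starts: once armed, emit just before the first non-breaking
        # token, so trailing newlines stay with the previous chunk.
        if should_emit and not is_breaking:
            yield "".join(buf)
            buf, newline_count, should_emit = [], 0, False

        buf.append(tok)
        if is_breaking:
            newline_count += 1

        # Dual requirements: both minimums must be met before arming.
        limits_met = (
            newline_count >= lines_per_chunk and len(buf) >= tokens_per_chunk
        )
        if limits_met and is_breaking:
            should_emit = True
        elif (
            not should_emit
            and len(buf) >= tokens_per_chunk * force_times
            and tok.isspace()
        ):
            # Force emit (assumed to depend on token count only), and only
            # when the current token is whitespace.
            should_emit = True

    # Emit whatever is left, including any trailing line breaks.
    if buf:
        yield "".join(buf)
```

Because every token is appended to exactly one buffer, joining the yielded pieces reproduces the input. Token counts will differ from chunkle's, since a real tiktoken encoding splits text differently from this toy tokenizer.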
Examples
English Text:
text = "Hello world!\nThis is a test.\nAnother line here."
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=8))
# Result: ['Hello world!\nThis is a test.\n', 'Another line here.']
English Text (force split):
text = " ".join(["This is a long sentence without newlines."] * 4)
chunks = list(chunk(text, lines_per_chunk=1, tokens_per_chunk=8, force_chunk_over_threshold_times=2))
# Result: ['This is a long sentence without newlines. This is a long sentence without new', 'lines. This is a long sentence without newlines. This is a long sentence', ' without newlines.']
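Two properties can be checked directly against the results quoted in these examples, without calling chunkle: the pieces concatenate back to the input, and no piece begins with a line break. The helpers `is_lossless` and `starts_clean` below are hypothetical names for this check, and "clean start" is interpreted here as "does not start with a newline character", which is an assumption about the rule's intent.

```python
from typing import List


def is_lossless(text: str, pieces: List[str]) -> bool:
    """True if the pieces, joined in order, reproduce the original text."""
    return "".join(pieces) == text


def starts_clean(pieces: List[str]) -> bool:
    """True if every piece is non-empty and does not begin with a line break."""
    return all(p and not p.startswith(("\n", "\r")) for p in pieces)


# First example: inputs and result copied from the README.
text_a = "Hello world!\nThis is a test.\nAnother line here."
pieces_a = ["Hello world!\nThis is a test.\n", "Another line here."]

# Force-split example: inputs and result copied from the README.
text_b = " ".join(["This is a long sentence without newlines."] * 4)
pieces_b = [
    "This is a long sentence without newlines. This is a long sentence without new",
    "lines. This is a long sentence without newlines. This is a long sentence",
    " without newlines.",
]

print(is_lossless(text_a, pieces_a), starts_clean(pieces_a))  # True True
print(is_lossless(text_b, pieces_b), starts_clean(pieces_b))  # True True
```

Note that the last piece of the force-split result starts with a space: a forced split can land on a whitespace boundary inside a line, so a clean start rules out leading line breaks rather than all leading whitespace.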
API
from typing import Generator

import tiktoken


def chunk(
    content: str,
    *,
    lines_per_chunk: int = 20,
    tokens_per_chunk: int = 500,
    force_chunk_over_threshold_times: int = 2,
    encoding: tiktoken.Encoding | None = None,
) -> Generator[str, None, None]:
    ...
Parameters:
- content: Text to split
- lines_per_chunk: Minimum lines per chunk (default: 20)
- tokens_per_chunk: Minimum tokens per chunk (default: 500)
- force_chunk_over_threshold_times: Force emit multiplier (default: 2)
- encoding: Custom tiktoken encoding (default: gpt-4o-mini)
License
MIT © 2025 Allen Chou