Text segmentation and tokenization utilities for LLMs
sentences
Text segmentation and tokenization utilities for LLM tokenizers.
Features
- Sentence Splitting: Split text into sentences with exact position tracking
- Paragraph Splitting: Split text into paragraphs while preserving structure
- Token Range Extraction: Get exact token ranges for each sentence using iterative tokenization
- Perfect Reconstruction: Guaranteed text reconstruction from segments
- LLM-Ready: Designed for use with transformer tokenizers and chat templates
Installation
pip install sentences
For transformer tokenizer support:
pip install "sentences[transformers]"
Quick Start
Sentence Splitting
from sentences import split_text_to_sentences
text = "Dr. Smith went to the store. They bought some milk. It cost $3.50."
sentences, positions = split_text_to_sentences(text)
for i, (sent, pos) in enumerate(zip(sentences, positions)):
    print(f"{i}: {repr(sent)}")
    # Verify reconstruction: each sentence ends where the next one starts
    end = positions[i + 1] if i + 1 < len(positions) else len(text)
    assert text[pos:end] == sent
Token Range Extraction
Get exact token ranges for sentences with any tokenizer:
from sentences import get_token_ranges
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
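# Example with a Qwen-style ChatML prompt. The trailing <think> tag follows the Qwen3 convention;
# the tokenizer loaded above is Qwen2.5, so swap in the tokenizer of the model you actually run.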
pre_string = """<|im_start|>system
This system message is just for demonstration purposes.<|im_end|>
<|im_start|>user
Solve this math problem step by step.<|im_end|>
<|im_start|>assistant
<think>
"""
sentences = ["Let me think about this problem. ", "First, I'll break it down. "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)
for sent, (start, end) in zip(sentences, ranges):
    print(f"Tokens [{start}:{end}] = '{sent.strip()}'")
Example with GPT-OSS-20B
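# GPT-OSS uses a different format without <think> tags.
# Load the matching tokenizer first (model ID assumed to be "openai/gpt-oss-20b" on the Hugging Face Hub).
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")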
pre_string = """<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22<|end|><|start|>user<|message|>Solve this math problem step by step.<|end|><|start|>assistant<|channel|>analysis<|message|>"""
sentences = ["Let me analyze this step by step. ", "The key insight is that... "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)
Key Concepts
Exact Position Tracking
The sentence splitter guarantees that:
text == ''.join(sentences)  # Perfect reconstruction
text[positions[i]:positions[i+1]] == sentences[i]  # Exact position match (the last sentence ends at len(text))
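As a minimal sketch of what these two guarantees mean in practice, the helper below checks them for any sentence list and list of start offsets. The name check_segmentation is hypothetical and not part of the package, and the sample data is hand-written rather than produced by the library. It assumes positions holds one start offset per sentence, as in the Quick Start loop.

```python
from typing import Sequence


def check_segmentation(text: str, sentences: Sequence[str], positions: Sequence[int]) -> bool:
    """Hypothetical helper: verify the reconstruction and position guarantees."""
    # Guarantee 1: joining the segments gives back the original text.
    if "".join(sentences) != text:
        return False
    # Guarantee 2: each sentence occupies text[start:next_start];
    # the last sentence runs to the end of the text.
    bounds = list(positions) + [len(text)]
    return all(
        text[bounds[i]:bounds[i + 1]] == sent
        for i, sent in enumerate(sentences)
    )


# Hand-written example data (assumed, not library output).
sample_text = "Hello there. How are you?"
sample_sentences = ["Hello there. ", "How are you?"]
sample_positions = [0, 13]

ok = check_segmentation(sample_text, sample_sentences, sample_positions)
bad = check_segmentation(sample_text, ["Hello there.", "How are you?"], sample_positions)
```

Here ok is True, while bad is False because dropping the trailing space from the first segment breaks reconstruction. This is why the segments keep their whitespace.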
Iterative Tokenization
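Token ranges are calculated iteratively because a tokenizer can merge characters across a sentence boundary, so tokenizing each sentence in isolation may give different counts than tokenizing the text as a whole: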
- Tokenize pre_string → get initial count
- Tokenize pre_string + sentence1 → get new count
- Tokenize pre_string + sentence1 + sentence2 → get new count
- Continue for all sentences
This ensures token boundaries align correctly with how the model will process the text.
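The steps above can be sketched as follows. This is an illustration of the idea under stated assumptions, not the package's actual implementation. The function name token_ranges_sketch and the encode parameter are hypothetical, and the toy encoder exists only to make the example runnable without downloading a model.

```python
from typing import Callable, List, Sequence, Tuple


def token_ranges_sketch(
    sentences: Sequence[str],
    encode: Callable[[str], List[int]],
    pre_string: str = "",
) -> List[Tuple[int, int]]:
    """Return one (start, end) token range per sentence, end exclusive,
    by re-encoding a growing prefix and recording the token count each time."""
    ranges: List[Tuple[int, int]] = []
    running = pre_string
    start = len(encode(running))  # token count of the prefix alone
    for sentence in sentences:
        running += sentence
        end = len(encode(running))  # token count of prefix + sentences so far
        ranges.append((start, end))
        start = end  # the next sentence begins where this one ended
    return ranges


def toy_encode(s: str) -> List[int]:
    """Toy stand-in for a real tokenizer: one 'token' per whitespace-separated word."""
    return [len(word) for word in s.split()]


demo_ranges = token_ranges_sketch(
    ["Let me think. ", "First, break it down. "],
    toy_encode,
    pre_string="system prompt here ",
)
print(demo_ranges)
```

With a Hugging Face tokenizer, encode could be something like lambda s: tokenizer.encode(s, add_special_tokens=False). Whether the package itself adds special tokens is not stated here, so treat that flag as an assumption.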
License
MIT License - see LICENSE file for details.