Text segmentation and tokenization utilities for LLMs

sentences

Text segmentation and tokenization utilities for LLM tokenizers.

Features

Sentence Splitting: Split text into sentences with exact position tracking
Paragraph Splitting: Split text into paragraphs while preserving structure
Token Range Extraction: Get exact token ranges for each sentence using iterative tokenization
Perfect Reconstruction: Joining the segments reproduces the original text exactly
LLM-Ready: Designed for use with transformer tokenizers and chat templates

Installation

pip install sentences

For transformer tokenizer support:

pip install "sentences[transformers]"

Quick Start

Sentence Splitting

from sentences import split_text_to_sentences

text = "Dr. Smith went to the store. They bought some milk. It cost $3.50."
sentences, positions = split_text_to_sentences(text)

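for i, (sent, pos) in enumerate(zip(sentences, positions)):
    print(f"{i}: {sent!r}")
    # Verify reconstruction: each sentence spans from its own start position
    # to the next sentence's start (or to the end of the text for the last one)
    end = positions[i + 1] if i + 1 < len(positions) else len(text)
    assert text[pos:end] == sent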

Token Range Extraction

Get exact token ranges for sentences with any tokenizer:

from sentences import get_token_ranges
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Example prompt prefix in Qwen's ChatML-style chat format, ending just inside the assistant's <think> block
pre_string = """<|im_start|>system
This system message is just for demonstration purposes.<|im_end|>
<|im_start|>user
Solve this math problem step by step.<|im_end|>
<|im_start|>assistant
<think>
"""

sentences = ["Let me think about this problem. ", "First, I'll break it down. "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)

for sent, (start, end) in zip(sentences, ranges):
    print(f"Tokens [{start}:{end}] = '{sent.strip()}'")
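The printed `[start:end]` notation suggests that each range is a half-open slice into the token sequence of `pre_string` followed by all sentences. Under that assumption (it is not confirmed by the text above), per-sentence tokens can be pulled out with ordinary slicing. The token IDs below are made up for illustration, and `slice_by_ranges` is a hypothetical helper, not part of the package.

```python
def slice_by_ranges(token_ids, ranges):
    """Return one list of token IDs per (start, end) range.

    Assumes each range is half-open: start is included, end is excluded.
    """
    return [token_ids[start:end] for start, end in ranges]


# Made-up token IDs: 4 prefix tokens followed by two sentences.
token_ids = [101, 102, 103, 104, 7, 8, 9, 20, 21]
ranges = [(4, 7), (7, 9)]
per_sentence = slice_by_ranges(token_ids, ranges)
```

The same slices could be applied along the sequence axis of any per-token array, such as logits or hidden states.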

Example with GPT-OSS-20B

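# GPT-OSS uses a different format without <think> tags.
# Load the matching tokenizer first (the model ID is assumed to be the Hugging Face repo for GPT-OSS-20B).
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")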
pre_string = """<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22<|end|><|start|>user<|message|>Solve this math problem step by step.<|end|><|start|>assistant<|channel|>analysis<|message|>"""

sentences = ["Let me analyze this step by step. ", "The key insight is that... "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)

Key Concepts

Exact Position Tracking

The sentence splitter guarantees that:

text == ''.join(sentences)  # Perfect reconstruction
text[positions[i]:positions[i+1]] == sentences[i]  # Exact position match (the last sentence runs to len(text))
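Both guarantees can be checked mechanically. The helper below is a minimal sketch and not part of the package: `check_segmentation` is a hypothetical name, and it assumes `positions` holds one start offset per sentence (as in the Quick Start loop), so the last sentence is taken to run to the end of the text.

```python
def check_segmentation(text, sentences, positions):
    """Return True if the segments tile `text` exactly (hypothetical helper)."""
    if len(sentences) != len(positions):
        return False
    # Invariant 1: concatenating the segments reproduces the text.
    if "".join(sentences) != text:
        return False
    # Invariant 2: each segment sits at its recorded start offset.
    for i, (sent, start) in enumerate(zip(sentences, positions)):
        end = positions[i + 1] if i + 1 < len(positions) else len(text)
        if text[start:end] != sent:
            return False
    return True


# Hand-written segmentation used only to exercise the check.
text = "One. Two. Three."
sentences = ["One. ", "Two. ", "Three."]
positions = [0, 5, 10]
ok = check_segmentation(text, sentences, positions)
```

A check like this may be useful as a test fixture when comparing different splitters on the same input.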

Iterative Tokenization

Token ranges are calculated iteratively to avoid boundary issues:

Tokenize pre_string → get initial count
Tokenize pre_string + sentence1 → get new count
Tokenize pre_string + sentence1 + sentence2 → get new count
Continue for all sentences

Because every count comes from tokenizing the full prefix, the ranges match the tokenization the model sees in context rather than the tokenization of each sentence in isolation.
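The counting scheme above can be sketched in a few lines. This is an illustration under stated assumptions, not the package's implementation: `iterative_token_ranges` and `toy_tokenize` are hypothetical names, the toy tokenizer is a regex stand-in for a real one, and each range is taken to be the half-open interval between two consecutive counts.

```python
import re


def toy_tokenize(text):
    """Stand-in tokenizer: runs of word characters, or single other characters."""
    return re.findall(r"\w+|\W", text)


def iterative_token_ranges(sentences, tokenize, pre_string=""):
    """Re-tokenize the growing prefix at each step and record the token counts."""
    ranges = []
    prefix = pre_string
    start = len(tokenize(prefix))  # tokens contributed by pre_string
    for sentence in sentences:
        prefix += sentence
        end = len(tokenize(prefix))  # count after appending this sentence
        ranges.append((start, end))
        start = end  # the next range begins where this one ended
    return ranges


ranges = iterative_token_ranges(["Hello world. ", "Bye."], toy_tokenize, "Hi: ")
# A boundary case: "ab" + "cd" merge into one toy token, so the second range is empty.
merged = iterative_token_ranges(["ab", "cd"], toy_tokenize)
```

With a Hugging Face tokenizer, `tokenize` could be something like `lambda s: tokenizer(s, add_special_tokens=False)["input_ids"]`; whether the package itself disables special tokens is not stated here.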

License

MIT License - see LICENSE file for details.

Project details

Release history

0.1.1: Nov 22, 2025
0.1.0: Nov 22, 2025 (this version)

Download files

Source Distribution

sentences-0.1.0.tar.gz (10.5 kB view details)

Uploaded Nov 22, 2025 Source

Built Distribution

sentences-0.1.0-py3-none-any.whl (9.7 kB view details)

Uploaded Nov 22, 2025 Python 3

File details

Details for the file sentences-0.1.0.tar.gz.

File metadata

Download URL: sentences-0.1.0.tar.gz
Upload date: Nov 22, 2025
Size: 10.5 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for sentences-0.1.0.tar.gz
Algorithm	Hash digest
SHA256	`60a135a60bea9337077434e5920bb4c799ecca981f7a9c11331a116d3eb60869`
MD5	`c489ccd0e59d0d94b5596a0ea14813d2`
BLAKE2b-256	`1199136c70652edfec929865660388734197b83765d1a69566f073d6f7d1896f`

File details

Details for the file sentences-0.1.0-py3-none-any.whl.

File metadata

Download URL: sentences-0.1.0-py3-none-any.whl
Upload date: Nov 22, 2025
Size: 9.7 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: twine/6.1.0 CPython/3.12.4

File hashes

Hashes for sentences-0.1.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`87dad9bbf4b1c8734a0c6a279ff4871f91f4b73a87f1573ad413796fba3d9e3a`
MD5	`bf21bd08d6d3452c4c4639c4af8d9601`
BLAKE2b-256	`63fa3e65e1ca0c6ba84d1dbc28fc3e78afd63d6e9908204698045d5972104f54`
