Text segmentation and tokenization utilities for LLMs

sentences

Text segmentation and token-range utilities for use with LLM tokenizers.

Features

  • Sentence Splitting: Split text into sentences with exact position tracking
  • Paragraph Splitting: Split text into paragraphs while preserving structure
  • Token Range Extraction: Get exact token ranges for each sentence using iterative tokenization
  • Perfect Reconstruction: Joining the segments reproduces the original text exactly
  • LLM-Ready: Designed for use with transformer tokenizers and chat templates

Installation

pip install sentences

For transformer tokenizer support:

pip install "sentences[transformers]"

Quick Start

Sentence Splitting

from sentences import split_text_to_sentences

text = "Dr. Smith went to the store. They bought some milk. It cost $3.50."
sentences, positions = split_text_to_sentences(text)

for i, (sent, pos) in enumerate(zip(sentences, positions)):
    print(f"{i}: {sent!r}")
    # Verify reconstruction: each sentence ends where the next one starts
    end = positions[i + 1] if i + 1 < len(positions) else len(text)
    assert text[pos:end] == sent

Token Range Extraction

Get exact token ranges for sentences with any tokenizer:

from sentences import get_token_ranges
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Example with a Qwen-style (ChatML) chat template prefix ending in an opening <think> tag
pre_string = """<|im_start|>system
This system message is just for demonstration purposes.<|im_end|>
<|im_start|>user
Solve this math problem step by step.<|im_end|>
<|im_start|>assistant
<think>
"""

sentences = ["Let me think about this problem. ", "First, I'll break it down. "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)

for sent, (start, end) in zip(sentences, ranges):
    print(f"Tokens [{start}:{end}] = '{sent.strip()}'")

Example with GPT-OSS-20B

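# GPT-OSS uses a different format without <think> tags.
# Load the matching tokenizer first; "openai/gpt-oss-20b" is assumed to be the Hugging Face model ID.
tokenizer = AutoTokenizer.from_pretrained("openai/gpt-oss-20b")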
pre_string = """<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22<|end|><|start|>user<|message|>Solve this math problem step by step.<|end|><|start|>assistant<|channel|>analysis<|message|>"""

sentences = ["Let me analyze this step by step. ", "The key insight is that... "]
ranges = get_token_ranges(sentences, tokenizer, pre_string)

Key Concepts

Exact Position Tracking

The sentence splitter guarantees that:

text == ''.join(sentences)  # Perfect reconstruction
text[positions[i]:positions[i+1]] == sentences[i]  # Exact position match (the last sentence ends at len(text))
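
The two guarantees can be checked without the library. The following is a minimal sketch, not part of the sentences API: segment_spans and check_segmentation are hypothetical helper names, and the segmentation is written by hand rather than produced by split_text_to_sentences. It also shows how the last span is closed with len(text).

```python
from typing import List, Tuple


def segment_spans(text: str, positions: List[int]) -> List[Tuple[int, int]]:
    """Turn start offsets into half-open (start, end) spans.

    The last span ends at len(text), since it has no following start offset.
    """
    ends = positions[1:] + [len(text)]
    return list(zip(positions, ends))


def check_segmentation(text: str, sentences: List[str], positions: List[int]) -> bool:
    """Return True when both invariants hold for a given segmentation."""
    if len(sentences) != len(positions):
        return False
    # Invariant 1: concatenating the segments reproduces the text.
    if "".join(sentences) != text:
        return False
    # Invariant 2: each segment sits exactly at its recorded offset.
    spans = segment_spans(text, positions)
    return all(text[s:e] == sent for (s, e), sent in zip(spans, sentences))


# Hand-written segmentation for illustration (not library output).
text = "Hello there. How are you? Fine."
sentences = ["Hello there. ", "How are you? ", "Fine."]
positions = [0, 13, 26]

print(segment_spans(text, positions))  # [(0, 13), (13, 26), (26, 31)]
print(check_segmentation(text, sentences, positions))  # True
```

Note that trailing whitespace stays attached to each segment; that is what makes the plain join reproduce the text.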

Iterative Tokenization

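Token ranges are computed by re-tokenizing a growing prefix, so that tokens near sentence boundaries are counted the way the tokenizer actually produces them in context: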

  1. Tokenize pre_string → get initial count
  2. Tokenize pre_string + sentence1 → get new count
  3. Tokenize pre_string + sentence1 + sentence2 → get new count
  4. Continue for all sentences

This ensures token boundaries align correctly with how the model will process the text.
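
The steps above can be sketched in a few lines. This is an illustration under stated assumptions, not the library's implementation: iterative_token_ranges and toy_encode are hypothetical names, and the toy tokenizer (words and punctuation marks, whitespace dropped) stands in for a real one so the example runs without downloading a model.

```python
import re
from typing import Callable, List, Sequence, Tuple


def iterative_token_ranges(
    sentences: Sequence[str],
    encode: Callable[[str], list],
    pre_string: str = "",
) -> List[Tuple[int, int]]:
    """Re-tokenize a growing prefix and record the token count after each sentence."""
    ranges = []
    prefix = pre_string
    start = len(encode(prefix))  # step 1: tokens in pre_string alone
    for sentence in sentences:
        prefix += sentence
        end = len(encode(prefix))  # steps 2..n: tokens in everything so far
        ranges.append((start, end))  # half-open range [start, end)
        start = end
    return ranges


def toy_encode(text: str) -> List[str]:
    """Toy stand-in tokenizer: one token per word or punctuation mark."""
    return re.findall(r"\w+|[^\w\s]", text)


pre_string = "System: be brief.\n"
sentences = ["Let me think. ", "First, split it. "]

print(iterative_token_ranges(sentences, toy_encode, pre_string))  # [(5, 9), (9, 14)]
```

With a Hugging Face tokenizer, the encode argument could be something like `lambda s: tokenizer.encode(s, add_special_tokens=False)`; whether the library handles special tokens this way is an assumption. The sketch re-tokenizes the whole prefix on every step, so its cost grows quadratically with the total text length.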

License

MIT License - see LICENSE file for details.
