Text segmentation and tokenization utilities for LLMs

Sentences

Utilities for sentence-level text segmentation and tokenization tailored to LLM tokenizers.

This package is designed to support sentence-level (“Thought Anchor”) analyses like those in:

Bogdan, P. C.*, Macar, U.*, Nanda, N.°, & Conmy, A.° Thought anchors: Which LLM reasoning steps matter? 2025. https://arxiv.org/abs/2506.19143.

Features

  • Splits a given text into sentences
  • Avoids common pitfalls (e.g., "Dr. Fu" is not split at the abbreviation)
  • Respects standard LLM tokenization patterns (e.g., leading-space tokens)
  • Given a tokenizer, returns token ranges for each sentence in the tokenized input text

Installation

pip install sentences

Sentence Splitting

Sentences are split in a way that matches typical LLM tokenization, where a token carries its leading space. For example, the text "I love my cat. It is big." is split so that the space belongs to the start of the second sentence rather than the end of the first: ["I love my cat.", " It is big."]

from sentences import split_text_to_sentences

text = "Dr. Smith went to the store. They bought some milk. It cost $3.50."
sentences, positions = split_text_to_sentences(text)

for i, (sent, pos) in enumerate(zip(sentences, positions)):
    print(f"{i}: {repr(sent)}")
    # 0: 'Dr. Smith went to the store.'
    # 1: ' They bought some milk.'
    # 2: ' It cost $3.50.'
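
To make the leading-space convention concrete, here is a minimal sketch using a naive regex splitter. `naive_leading_space_split` is a hypothetical helper written for this illustration and is not the package's implementation; it also shows why abbreviations such as "Dr." need special handling.

```python
import re

def naive_leading_space_split(text: str) -> list[str]:
    """Hypothetical helper: split after ., ! or ? and keep the whitespace
    that follows with the NEXT sentence (leading-space convention)."""
    # Zero-width split point: preceded by end punctuation, followed by whitespace.
    return [piece for piece in re.split(r"(?<=[.!?])(?=\s)", text) if piece]

text = "I love my cat. It is big."
parts = naive_leading_space_split(text)
print(parts)                    # ['I love my cat.', ' It is big.']
print("".join(parts) == text)   # True: the split is lossless

# The naive rule breaks on abbreviations, which is the pitfall the package avoids.
print(naive_leading_space_split("Dr. Fu is here."))  # ['Dr.', ' Fu is here.']
```

Because whitespace stays with the following sentence, joining the pieces reproduces the original text exactly.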

Token Range Extraction

Use get_token_ranges to get the exact token range of each sentence, for example to split a model's chain-of-thought into sentences. You can also pass pre_string, a string that precedes the sentences (e.g., a chat template); the returned ranges then index into the tokenization of pre_string followed by the sentences.

Token ranges are computed incrementally: each sentence is appended to the running string (starting from pre_string), the whole string is re-tokenized, and the new token count marks the end of that sentence's range. Tokenizing each sentence independently can give different counts, because how text is tokenized near a sentence boundary can depend on the neighboring text.
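
The counting procedure described above can be sketched as follows. This is a minimal illustration, not the package's code: `cumulative_token_ranges` and `toy_encode` are hypothetical names, and the regex-based toy tokenizer is an assumption standing in for a real one so that the example runs without downloading a model.

```python
import re
from typing import Callable, Sequence

# Toy stand-in for a real tokenizer (assumption for illustration only):
# a token is an optional leading space plus letters, a run of whitespace,
# or a run of any other characters.
_TOY_PATTERN = re.compile(r" ?[A-Za-z]+|\s+|[^\sA-Za-z]+")

def toy_encode(text: str) -> list[str]:
    return _TOY_PATTERN.findall(text)

def cumulative_token_ranges(
    sentences: Sequence[str],
    encode: Callable[[str], Sequence],
    pre_string: str = "",
) -> list[tuple[int, int]]:
    """Append one sentence at a time, re-tokenize the whole string,
    and use the running token count as the end of that sentence's range."""
    ranges = []
    running = pre_string
    start = len(encode(running))
    for sentence in sentences:
        running += sentence
        end = len(encode(running))
        ranges.append((start, end))
        start = end
    return ranges

sentences = ["One.\n", "\nTwo."]
ranges = cumulative_token_ranges(sentences, toy_encode)
joint_total = len(toy_encode("".join(sentences)))
independent_total = sum(len(toy_encode(s)) for s in sentences)

print(ranges)             # [(0, 3), (3, 5)]
print(joint_total)        # 5: the two newlines merge into one token
print(independent_total)  # 6: per-sentence counts would drift out of alignment
```

In this toy case the merged "\n\n" token ends up in the first range, and the ranges still tile the tokenization of the full string, which per-sentence counts would not.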

from sentences import get_token_ranges
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")

# Example with Qwen-2.5 chat template
pre_string = """<|im_start|>system
This system message is just for demonstration purposes.<|im_end|>
<|im_start|>user
Solve this math problem step by step.<|im_end|>
<|im_start|>assistant
<think>

"""

sentences = ["Let me think about this problem.", " First, I'll break it down."]
ranges = get_token_ranges(sentences, tokenizer, pre_string)
tokens_all = tokenizer.batch_decode(tokenizer.encode(pre_string + ''.join(sentences)))

for sent, (start, end) in zip(sentences, ranges):
    print(f"Tokens [{start}:{end}] = '{sent}'\n\t{tokens_all[start:end]}")
    # Tokens [39:46] = 'Let me think about this problem.'
    #   ['Let', ' me', ' think', ' about', ' this', ' problem', '.']
    # Tokens [46:54] = ' First, I'll break it down.'
    #   [' First', ',', ' I', "'ll", ' break', ' it', ' down', '.']
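
A quick sanity check for any set of ranges is to confirm that the decoded tokens in each range concatenate back to the sentence. The sketch below uses hand-written token strings; `mismatched_ranges` is a hypothetical helper, not part of this package. It assumes each token decodes to clean text on its own, which may not hold for byte-level tokenizers when a multi-byte character is split across tokens.

```python
def mismatched_ranges(decoded_tokens, sentences, ranges):
    """Hypothetical helper: return (index, expected, rebuilt) for every
    sentence whose token span does not decode back to the sentence."""
    problems = []
    for i, (sentence, (start, end)) in enumerate(zip(sentences, ranges)):
        rebuilt = "".join(decoded_tokens[start:end])
        if rebuilt != sentence:
            problems.append((i, sentence, rebuilt))
    return problems

# Hand-written decoded tokens; "<prefix>" stands in for a pre_string token.
decoded_tokens = ["<prefix>", " First", ",", " I", "'ll", " break", " it", " down", "."]
sentences = [" First, I'll break it down."]

print(mismatched_ranges(decoded_tokens, sentences, [(1, 9)]))  # []
print(len(mismatched_ranges(decoded_tokens, sentences, [(0, 9)])))  # 1 (range swallows the prefix)
```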

Note on CoT pre-filling

gpt-oss models do not use <think> tags. To pre-fill their chain-of-thought, end pre_string with the assistant's analysis-channel header instead:

pre_string = """<|start|>system<|message|>You are ChatGPT, a large language model trained by OpenAI.
Knowledge cutoff: 2024-06
Current date: 2025-11-22<|end|><|start|>user<|message|>Solve this math problem step by step.<|end|><|start|>assistant<|channel|>analysis<|message|>"""
