High-precision prose tokenization for natural language processing.
prose-tokenizer
A high-precision, rule-based prose tokenizer and sentence segmentation library for English and Markdown. Designed for accurate splitting of paragraphs, sentences, and words in AI pipelines, LLM context window management, and editorial automation.
prose-tokenizer is built for writing tools, LLM preprocessing, readability analysis, and lightweight text statistics, where consistency and speed matter more than complex, probabilistic NLP models.
Features
- Deterministic Rule-Based Engine: Consistent, predictable output without the overhead or unpredictability of machine learning models.
- Markdown-Native Support: Properly handles structural elements including headings (# and Setext), list items (*, -, +, 1.), and blockquotes (>).
- Intelligent Sentence Segmentation: Respects English prose heuristics such as prefix titles (Dr., Mr.), acronyms (U.S.A.), initials (J.R.R. Tolkien), and interior decimals.
- Hierarchical Analysis: Access text at the block, paragraph, sentence, or word level with a single call.
- Character Metrics: Accurate counts of total characters, non-whitespace characters, and alphanumeric characters.
- Zero Dependencies: Pure Python implementation with no runtime dependencies.
- Fully Typed: Built with PEP 484 type hints for excellent IDE support.
Installation
pip install prose-tokenizer
Quick Start
from prose_tokenizer import tokenize
content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1.
* Growth was driven by tech.
* Inflation remains stable at 2.1%.
"""
doc = tokenize(content)
print(doc.counts.word_count) # 20
print(doc.blocks[0].kind) # "heading"
print(doc.sentences[1]) # "The U.S.A. economy grew by 2.5% in Q1."
API Reference
tokenize(text: str) -> TokenizedDocument
The primary entry point for full document analysis. Returns a dataclass containing:
- blocks: List of ParagraphBlock objects (each includes text, kind, line_start, and line_end).
- paragraphs: List of raw paragraph strings.
- sentences: List of sentence strings.
- words: List of lowercase word tokens.
- counts: StructureCounts object with aggregated metrics.
tokenize_prose is provided as an alias for this function.
split_sentences(text: str) -> List[str]
Splits prose into a list of sentence strings using deterministic rules that protect abbreviations and decimal numbers.
split_paragraphs(text: str) -> List[str]
Splits text into a list of raw paragraph strings based on double newlines.
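A sketch of this behavior in plain Python (illustrative only, not the library's code; the function name is an assumption):

```python
import re

def naive_split_paragraphs(text: str) -> list[str]:
    # Paragraphs are separated by one or more blank lines.
    return [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
```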
split_words(text: str) -> List[str]
Splits text into lowercase alphanumeric word tokens, preserving contractions (e.g., "can't") and interior hyphens or decimals.
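The token shape described above can be captured in a single pattern: alphanumeric runs that may contain one or more interior apostrophes, hyphens, or decimal points. A hedged sketch (the regex and name are assumptions, not the library's internals):

```python
import re

# Alphanumeric runs joined by interior ', -, or . (can't, well-known, 2.5).
# Trailing punctuation is excluded because the joiner must be followed by
# another alphanumeric character.
_WORD_RE = re.compile(r"[a-z0-9]+(?:['\-.][a-z0-9]+)*")

def naive_split_words(text: str) -> list[str]:
    return _WORD_RE.findall(text.lower())

print(naive_split_words("Can't stop 2.5% growth in well-known markets."))
# ["can't", 'stop', '2.5', 'growth', 'in', 'well-known', 'markets']
```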
get_character_metrics(text: str) -> CharacterMetrics
Calculates character-level statistics:
- character_count: Total character length.
- character_count_no_spaces: Count excluding whitespace.
- letter_count: Count of alphanumeric characters (a-z, A-Z, 0-9).
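These three metrics are simple to compute; a sketch of the equivalent logic in plain Python (the function name and dict shape are assumptions for illustration):

```python
def character_metrics(text: str) -> dict[str, int]:
    # Mirrors the three documented metrics: total length, non-whitespace
    # length, and ASCII-alphanumeric length.
    return {
        "character_count": len(text),
        "character_count_no_spaces": sum(1 for c in text if not c.isspace()),
        "letter_count": sum(1 for c in text if c.isascii() and c.isalnum()),
    }
```

Note the explicit `isascii()` guard, which keeps letter_count restricted to a-z, A-Z, 0-9 as documented, rather than all Unicode alphanumerics.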
get_structure_counts(text: str) -> StructureCounts
A convenience function that returns structural metrics without full tokenization arrays. Includes word_count, sentence_count, paragraph_count, heading_count, list_item_count, and blockquote_count.
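The Markdown structural counts can be approximated with per-line prefix checks. A minimal sketch under that assumption (not prose-tokenizer's implementation; name and patterns are illustrative):

```python
import re

def naive_structure_counts(text: str) -> dict[str, int]:
    lines = text.splitlines()
    return {
        # ATX headings start with one or more '#' characters.
        "heading_count": sum(1 for l in lines if l.lstrip().startswith("#")),
        # List items start with *, -, + or an ordered marker like "1."
        "list_item_count": sum(
            1 for l in lines if re.match(r"\s*(?:[*+-]|\d+\.)\s+", l)
        ),
        # Blockquote lines start with '>'.
        "blockquote_count": sum(1 for l in lines if l.lstrip().startswith(">")),
    }
```

A real implementation also needs Setext headings and continuation lines, which is why the library models blocks rather than scanning line by line.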
is_stopword(word: str) -> bool
Checks if a word is a common English stopword.
Practical Use Cases
- LLM Preprocessing: Chunking text into logical paragraphs or sentences for RAG or context window management while preserving Markdown structure.
- Writing Tools: Real-time statistics for word count, sentence length, and readability metrics (e.g., Flesch-Kincaid).
- Clean Text Extraction: Removing or identifying Markdown noise while preserving structural context.
- Search Indexing: Generating clean, lowercase word tokens for search engines.
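For the LLM-preprocessing case, paragraph-level splitting composes naturally with greedy chunk packing. A sketch of that strategy (the chunker below is not part of prose-tokenizer's API; it takes any list of paragraph strings):

```python
def chunk_paragraphs(paragraphs: list[str], max_chars: int = 500) -> list[str]:
    # Greedily pack whole paragraphs into chunks of at most max_chars,
    # so no chunk ever splits a paragraph mid-sentence.
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for para in paragraphs:
        extra = len(para) + (2 if current else 0)  # "\n\n" joiner cost
        if current and size + extra > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
            extra = len(para)
        current.append(para)
        size += extra
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Packing whole paragraphs (or sentences, for finer granularity) is what keeps Markdown structure intact inside each context-window chunk.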
Limitations
- Language Support: Optimized specifically for English prose.
- NLP Scope: Does not perform POS tagging, NER, or dependency parsing.
- Rule-Based: While highly accurate, it uses deterministic heuristics rather than probabilistic context analysis.
Development
prose-tokenizer uses Hatch for development and builds.
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Linting and Type Checking
ruff check .
mypy .
Ownership & Authority
This package is maintained by Veldica Research as a core part of our writing analysis platform. Built for production environments that demand high reliability, precision, and performance.
- Full Documentation: veldica.com/python-prose-tokenizer
- Veldica Platform: veldica.com
- Report Bugs: GitHub Issues
License
MIT © Veldica Research
Download files
Source Distribution
Built Distribution
File details
Details for the file prose_tokenizer-1.0.0.tar.gz.
File metadata
- Download URL: prose_tokenizer-1.0.0.tar.gz
- Upload date:
- Size: 10.8 kB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 204900eb5276d02df865ca4e12d082685db9eb38e5832443c4982acf9376f592 |
| MD5 | 9f5c3c3222a7dceffb2e586c5cd69ec4 |
| BLAKE2b-256 | e131f32ac6d7c6ef5bd54e6a867c0fe67b0d2bad31efdc877d98045a5ba054a6 |
File details
Details for the file prose_tokenizer-1.0.0-py3-none-any.whl.
File metadata
- Download URL: prose_tokenizer-1.0.0-py3-none-any.whl
- Upload date:
- Size: 13.0 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: twine/6.2.0 CPython/3.11.14
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 779765b71bb66f7db5c2f1cacb6e9bf66829c9819c81793815d74d047c3cdc06 |
| MD5 | 2d4f2b4585df3066f7bc025bd1206f76 |
| BLAKE2b-256 | b6c31d925956c4a70f557e759f2e9232ceaefc5e8d7ca6080a3f3868635e40fd |