High-precision prose tokenization for natural language processing.

prose-tokenizer

Lightweight, deterministic, dependency-free Python library for tokenizing English prose and Markdown into blocks, paragraphs, sentences, words, and structural metrics.

License: MIT

prose-tokenizer is built for writing tools, LLM preprocessing, readability tools, and lightweight text analysis where consistency and speed are more important than complex NLP models.

Features

  • Deterministic: Rule-based logic ensures the same output every time.
  • Markdown-Aware: Correctly segments headings, list items, and blockquotes.
  • Smart Sentence Splitting: Handles prefix abbreviations (Mr., Dr.), acronyms (U.S.A.), and decimals (10.5) without breaking sentences.
  • Structure Analysis: Access text at the block, paragraph, sentence, or word level.
  • Character Metrics: Total characters, non-whitespace characters, and alphanumeric counts.
  • Zero Dependencies: Pure Python with no runtime requirements.
  • Fully Typed: Built with PEP 484 type hints.

Installation

pip install prose-tokenizer

Quick Start

from prose_tokenizer import tokenize

content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1. 

*   Growth was driven by tech.
*   Inflation remains stable at 2.1%.
"""

doc = tokenize(content)

print(doc.counts.word_count)     # 20
print(doc.blocks[0].kind)        # "heading"
print(doc.sentences[1])          # "The U.S.A. economy grew by 2.5% in Q1."

What it handles

  • Markdown structural elements: Headings (# and Setext), list items (*, -, +, 1.), blockquotes (>).
  • English prose heuristics: Initials (J.R.R. Tolkien), common abbreviations (Jan., etc., vs.), and prefix titles (Dr., Rev.).
  • Complex word tokens: Contractions (can't), hyphenated words (high-tech), and numbers with commas (1,000) or decimals (2.5).
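
As an illustration, the abbreviation-protection idea behind these heuristics can be sketched in a few lines of pure Python. This is a simplified sketch of the general technique, not the library's actual implementation; the abbreviation list and the function name `naive_split_sentences` are invented for this example:

```python
import re

# Illustrative subset of protected abbreviations (not the library's list).
ABBREVIATIONS = {"Mr.", "Dr.", "Rev.", "Jan.", "etc.", "vs."}

def naive_split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation, but skip candidate boundaries
    after known abbreviations, acronym fragments, and decimal points."""
    sentences: list[str] = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        prev_word = text[:end].rsplit(None, 1)[-1]   # token ending at the dot
        next_char = text[end:end + 2].lstrip()[:1]   # first char after it
        if prev_word in ABBREVIATIONS or re.fullmatch(r"(?:[A-Z]\.)+", prev_word):
            continue  # "Dr." or "U.S." is not a sentence end
        if next_char and not next_char.isupper():
            continue  # e.g. the "5" in "2.5" follows the dot
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A real implementation needs more rules (quotes, parentheses, sentence-initial lowercase), but the shape is the same: enumerate candidate boundaries, then veto them with deterministic checks.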

What it is not

  • Not a full NLP suite: Does not perform POS tagging, NER, or dependency parsing. Use spaCy or NLTK for those.
  • Not multi-lingual: Optimized specifically for English prose.
  • Not AI-powered: Uses deterministic rules and regular expressions, not machine learning models.

Use Cases

  • LLM Preprocessing: Chunking text into logical paragraphs or sentences for RAG or context windows.
  • Writing Tools: Real-time statistics for word count, sentence length, and readability metrics.
  • Clean Text Extraction: Removing Markdown noise while preserving structural context.
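
For the chunking use case, sentence boundaries from the tokenizer can feed a greedy packer that never splits mid-sentence. A minimal sketch (`chunk_sentences` is a hypothetical helper written for this example, not part of the library's API):

```python
def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most
    max_chars characters, never breaking inside a sentence."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the library installed, the input would typically come from `split_sentences(text)`; a single sentence longer than the budget still becomes its own chunk.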

API Overview

tokenize(text: str) -> TokenizedDocument

Full analysis of the input text. Returns a dataclass containing blocks, paragraphs, sentences, words, and counts.

split_sentences(text: str) -> List[str]

Returns a list of sentences, without splitting inside abbreviations, acronyms, or decimal numbers.

split_words(text: str) -> List[str]

Returns a list of lowercase words, preserving contractions and hyphenation.
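
A single regular expression can capture the same token classes described above. This is an illustrative pattern, not the library's own; `naive_split_words` is a made-up name for this sketch:

```python
import re

# Words may contain internal apostrophes and hyphens ("can't", "high-tech");
# numbers may contain grouping commas and decimal points ("1,000", "2.5").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,]\d+)*")

def naive_split_words(text: str) -> list[str]:
    """Return lowercase word tokens, keeping contractions, hyphenated
    compounds, and formatted numbers intact."""
    return [token.lower() for token in WORD_RE.findall(text)]
```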

get_character_metrics(text: str) -> CharacterMetrics

Calculates the total character count, the non-whitespace character count, and the alphanumeric character count.
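
These three counts fall out directly from built-in `str` predicates. The sketch below mirrors the described behavior; the `CharMetrics` dataclass and function name are invented for illustration and do not reflect the library's actual types:

```python
from dataclasses import dataclass

@dataclass
class CharMetrics:
    total: int           # every character, including whitespace
    non_whitespace: int  # everything except spaces, tabs, newlines
    alphanumeric: int    # letters and digits only

def naive_character_metrics(text: str) -> CharMetrics:
    return CharMetrics(
        total=len(text),
        non_whitespace=sum(1 for c in text if not c.isspace()),
        alphanumeric=sum(1 for c in text if c.isalnum()),
    )
```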

Development

prose-tokenizer uses Hatch for development and builds.

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Linting and Type Checking
ruff check .
mypy .

Release Checklist

  1. Update version in pyproject.toml and prose_tokenizer/__init__.py.
  2. Update CHANGELOG.md.
  3. Run the full checks: pytest && ruff check . && mypy .
  4. Build the package: python -m build
  5. Check the distribution: twine check dist/*
  6. Upload to PyPI: twine upload dist/*

License

MIT © Veldica
