prose-tokenizer
High-precision prose tokenization for natural language processing.
A lightweight, deterministic, dependency-free Python library that tokenizes English prose and Markdown into blocks, paragraphs, sentences, and words, and computes structural metrics.
prose-tokenizer is built for writing tools, LLM preprocessing, readability analysis, and lightweight text processing where consistency and speed matter more than full NLP pipelines.
Features
- Deterministic: Rule-based logic ensures the same output every time.
- Markdown-Aware: Correctly segments headings, list items, and blockquotes.
- Smart Sentence Splitting: Handles prefix abbreviations (Mr., Dr.), acronyms (U.S.A.), and decimals (10.5) without breaking sentences.
- Structure Analysis: Access text at the block, paragraph, sentence, or word level.
- Character Metrics: Total characters, non-whitespace characters, and alphanumeric counts.
- Zero Dependencies: Pure Python with no runtime requirements.
- Fully Typed: Built with PEP 484 type hints.
Installation
pip install prose-tokenizer
Quick Start
from prose_tokenizer import tokenize
content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1.
* Growth was driven by tech.
* Inflation remains stable at 2.1%.
"""
doc = tokenize(content)
print(doc.counts.word_count) # 20
print(doc.blocks[0].kind) # "heading"
print(doc.sentences[1]) # "The U.S.A. economy grew by 2.5% in Q1."
What it handles
- Markdown structural elements: Headings (# and Setext), list items (*, -, +, 1.), blockquotes (>).
- English prose heuristics: Initials (J.R.R. Tolkien), common abbreviations (Jan., etc., vs.), and prefix titles (Dr., Rev.), as shown in the example below.
- Complex word tokens: Contractions (can't), hyphenated words (high-tech), and numbers with commas (1,000) or decimals (2.5).
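For example, the sentence splitter should keep these constructs intact (illustrative output, based on the documented heuristics):

from prose_tokenizer import split_sentences

text = "Dr. Smith cited J.R.R. Tolkien, etc., in his U.S.A. lecture. Growth hit 2.5% in Q1."
for sentence in split_sentences(text):
    print(sentence)
# Expected: two sentences; the periods in Dr., J.R.R., etc., and U.S.A.
# are protected and do not trigger a split.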
What it is not
- Not a full NLP suite: Does not perform POS tagging, NER, or dependency parsing. Use spaCy or NLTK for those.
- Not multi-lingual: Optimized specifically for English prose.
- Not AI-powered: Uses deterministic rules and regular expressions, not machine learning models.
Use Cases
- LLM Preprocessing: Chunking text into logical paragraphs or sentences for RAG or context windows (see the sketch after this list).
- Writing Tools: Real-time statistics for word count, sentence length, and readability metrics.
- Clean Text Extraction: Removing Markdown noise while preserving structural context.
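A minimal sketch of the LLM-preprocessing case, greedily packing whole paragraphs into word-budgeted chunks. It assumes doc.paragraphs yields plain strings (the element type is not documented here); chunk_paragraphs and max_words are names invented for this sketch, not part of the library API:

from prose_tokenizer import split_words, tokenize

def chunk_paragraphs(text: str, max_words: int = 200) -> list[str]:
    # Pack consecutive paragraphs into chunks of at most max_words words,
    # never splitting a paragraph across chunk boundaries.
    doc = tokenize(text)
    chunks, current, count = [], [], 0
    for para in doc.paragraphs:
        n = len(split_words(para))
        if current and count + n > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks

Note that a single paragraph longer than max_words still becomes its own oversized chunk; splitting it further with split_sentences is a natural extension.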
API Overview
tokenize(text: str) -> TokenizedDocument
Full analysis of the input text. Returns a dataclass containing blocks, paragraphs, sentences, words, and counts.
split_sentences(text: str) -> List[str]
Returns a list of sentences, protecting abbreviations and acronyms.
split_words(text: str) -> List[str]
Returns a list of lowercase words, preserving contractions and hyphenation.
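For example (illustrative output; the exact handling of trailing punctuation may differ):

from prose_tokenizer import split_words

print(split_words("Can't stop the high-tech train at 1,000 mph."))
# Expected, per the documented behavior (lowercased, contractions and
# hyphens preserved, comma-grouped numbers kept whole):
# ["can't", 'stop', 'the', 'high-tech', 'train', 'at', '1,000', 'mph']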
get_character_metrics(text: str) -> CharacterMetrics
Calculates total character count, character count excluding whitespace, and alphanumeric character count.
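A quick illustration; the attribute names on CharacterMetrics are not listed here, so this sketch simply prints the dataclass:

from prose_tokenizer import get_character_metrics

metrics = get_character_metrics("Hello, world! 42")
print(metrics)
# For this input: 16 total characters, 14 excluding whitespace, 12 alphanumeric.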
Development
prose-tokenizer uses Hatch for development and builds.
# Install development dependencies
pip install -e ".[dev]"
# Run tests
pytest
# Linting and Type Checking
ruff check .
mypy .
Release Checklist
- Update version in pyproject.toml and prose_tokenizer/__init__.py.
- Update CHANGELOG.md.
- Run full test suite: pytest && ruff check . && mypy .
- Build package: python -m build
- Check distribution: twine check dist/*
- Upload to PyPI: twine upload dist/*
License
MIT © Veldica