High-precision prose tokenization for natural language processing.

prose-tokenizer

Lightweight, deterministic, dependency-free Python library for tokenizing English prose and Markdown into blocks, paragraphs, sentences, words, and structural metrics.

License: MIT

prose-tokenizer is built for writing tools, LLM preprocessing, readability tools, and lightweight text analysis where consistency and speed are more important than complex NLP models.

Features

  • Deterministic: Rule-based logic ensures the same output every time.
  • Markdown-Aware: Correctly segments headings, list items, and blockquotes.
  • Smart Sentence Splitting: Handles prefix abbreviations (Mr., Dr.), acronyms (U.S.A.), and decimals (10.5) without breaking sentences.
  • Structure Analysis: Access text at the block, paragraph, sentence, or word level.
  • Character Metrics: Total characters, non-whitespace characters, and alphanumeric counts.
  • Zero Dependencies: Pure Python with no runtime requirements.
  • Fully Typed: Built with PEP 484 type hints.

Installation

pip install prose-tokenizer

Quick Start

from prose_tokenizer import tokenize

content = """
### Q1 Review
The U.S.A. economy grew by 2.5% in Q1. 

*   Growth was driven by tech.
*   Inflation remains stable at 2.1%.
"""

doc = tokenize(content)

print(doc.counts.word_count)     # 20
print(doc.blocks[0].kind)        # "heading"
print(doc.sentences[1])          # "The U.S.A. economy grew by 2.5% in Q1."

What it handles

  • Markdown structural elements: Headings (# and Setext), list items (*, -, +, 1.), blockquotes (>).
  • English prose heuristics: Initials (J.R.R. Tolkien), common abbreviations (Jan., etc., vs.), and prefix titles (Dr., Rev.).
  • Complex word tokens: Contractions (can't), hyphenated words (high-tech), and numbers with commas (1,000) or decimals (2.5).
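
As an illustration, the abbreviation-protection idea behind these heuristics can be sketched in a few lines of pure Python. This is a simplified sketch of the general technique, not the library's actual implementation; the abbreviation list and the function name `naive_split_sentences` are invented for this example:

```python
import re

# Illustrative subset of protected abbreviations (not the library's list).
ABBREVIATIONS = {"Mr.", "Dr.", "Rev.", "Jan.", "etc.", "vs."}

def naive_split_sentences(text: str) -> list[str]:
    """Split on sentence-final punctuation, but skip candidate boundaries
    after known abbreviations, acronym fragments, and decimal points."""
    sentences: list[str] = []
    start = 0
    for match in re.finditer(r"[.!?]", text):
        end = match.end()
        prev_word = text[:end].rsplit(None, 1)[-1]   # token ending at the dot
        next_char = text[end:end + 2].lstrip()[:1]   # first char after it
        if prev_word in ABBREVIATIONS or re.fullmatch(r"(?:[A-Z]\.)+", prev_word):
            continue  # "Dr." or "U.S." is not a sentence end
        if next_char and not next_char.isupper():
            continue  # e.g. the "5" in "2.5" follows the dot
        sentences.append(text[start:end].strip())
        start = end
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences
```

A real implementation needs more rules (quotes, parentheses, sentence-initial lowercase), but the shape is the same: enumerate candidate boundaries, then veto them with deterministic checks.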

What it is not

  • Not a full NLP suite: Does not perform POS tagging, NER, or dependency parsing. Use spaCy or NLTK for those.
  • Not multi-lingual: Optimized specifically for English prose.
  • Not AI-powered: Uses deterministic rules and regular expressions, not machine learning models.

Use Cases

  • LLM Preprocessing: Chunking text into logical paragraphs or sentences for RAG or context windows.
  • Writing Tools: Real-time statistics for word count, sentence length, and readability metrics.
  • Clean Text Extraction: Removing Markdown noise while preserving structural context.
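
For the chunking use case, sentence boundaries from the tokenizer can feed a greedy packer that never splits mid-sentence. A minimal sketch (`chunk_sentences` is a hypothetical helper written for this example, not part of the library's API):

```python
def chunk_sentences(sentences: list[str], max_chars: int = 200) -> list[str]:
    """Greedily pack consecutive sentences into chunks of at most
    max_chars characters, never breaking inside a sentence."""
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            chunks.append(current)   # current chunk is full; start a new one
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

With the library installed, the input would typically come from `split_sentences(text)`; a single sentence longer than the budget still becomes its own chunk.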

API Overview

tokenize(text: str) -> TokenizedDocument

Full analysis of the input text. Returns a dataclass containing blocks, paragraphs, sentences, words, and counts.

split_sentences(text: str) -> List[str]

Returns a list of sentences, without splitting inside abbreviations, acronyms, or decimal numbers.

split_words(text: str) -> List[str]

Returns a list of lowercase words, preserving contractions and hyphenation.
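
A single regular expression can capture the same token classes described above. This is an illustrative pattern, not the library's own; `naive_split_words` is a made-up name for this sketch:

```python
import re

# Words may contain internal apostrophes and hyphens ("can't", "high-tech");
# numbers may contain grouping commas and decimal points ("1,000", "2.5").
WORD_RE = re.compile(r"[A-Za-z]+(?:['-][A-Za-z]+)*|\d+(?:[.,]\d+)*")

def naive_split_words(text: str) -> list[str]:
    """Return lowercase word tokens, keeping contractions, hyphenated
    compounds, and formatted numbers intact."""
    return [token.lower() for token in WORD_RE.findall(text)]
```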

get_character_metrics(text: str) -> CharacterMetrics

Calculates the total character count, the non-whitespace character count, and the alphanumeric character count.
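
These three counts fall out directly from built-in `str` predicates. The sketch below mirrors the described behavior; the `CharMetrics` dataclass and function name are invented for illustration and do not reflect the library's actual types:

```python
from dataclasses import dataclass

@dataclass
class CharMetrics:
    total: int           # every character, including whitespace
    non_whitespace: int  # everything except spaces, tabs, newlines
    alphanumeric: int    # letters and digits only

def naive_character_metrics(text: str) -> CharMetrics:
    return CharMetrics(
        total=len(text),
        non_whitespace=sum(1 for c in text if not c.isspace()),
        alphanumeric=sum(1 for c in text if c.isalnum()),
    )
```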

Development

prose-tokenizer uses Hatch for development and builds.

# Install development dependencies
pip install -e ".[dev]"

# Run tests
pytest

# Linting and Type Checking
ruff check .
mypy .

Release Checklist

  1. Update version in pyproject.toml and prose_tokenizer/__init__.py.
  2. Update CHANGELOG.md.
  3. Run the full checks: pytest && ruff check . && mypy .
  4. Build the package: python -m build
  5. Check the distribution: twine check dist/*
  6. Upload to PyPI: twine upload dist/*

License

MIT © Veldica
