High-performance Rust implementation of nupunkt sentence/paragraph tokenization

nupunkt-rs

High-performance Rust implementation of nupunkt, a modern reimplementation of the Punkt sentence tokenizer optimized for high-precision legal and financial text processing. This project provides the same sentence segmentation as the original Python nupunkt library while running about 3x faster.

Based on the research paper: Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary (Bommarito et al., 2025)

Features

  • 🚀 High Performance: 30M+ characters/second (3x faster than Python nupunkt)
  • 🎯 High Precision: 91.1% precision on legal text benchmarks
  • ⚡ Runtime Adjustable: Tune precision/recall balance at inference time without retraining
  • 📚 Legal-Optimized: Pre-trained model handles complex legal abbreviations and citations
  • 🐍 Python API: Drop-in replacement for Python nupunkt with PyO3 bindings
  • 🧵 Thread-Safe: Safe for parallel processing

Installation

From PyPI

# pip
pip install nupunkt-rs

# uv
uv pip install nupunkt-rs

From Source

  1. Prerequisites:

    • Python 3.11+
    • Rust toolchain (install from rustup.rs)
    • maturin (pip install maturin)
  2. Clone and Install:

git clone https://github.com/alea-institute/nupunkt-rs.git
cd nupunkt-rs

# pip
pip install maturin
maturin develop --release

# uv
uvx maturin develop --release --uv

Quick Start

Why nupunkt-rs for Legal & Financial Documents?

Many general-purpose tokenizers struggle with legal and financial text, breaking incorrectly at abbreviations like "v.", "U.S.", "Inc.", "Id.", and "Fed." This library is specifically optimized for high-precision tokenization of complex professional documents.

import nupunkt_rs

# Real Supreme Court text with complex citations and abbreviations
legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150."""

# Most tokenizers would incorrectly break at "v.", "Inc.", "U.S.", "Co.", and "Id."
# nupunkt-rs handles all of these correctly:
sentences = nupunkt_rs.sent_tokenize(legal_text)
print(f"Correctly identified {len(sentences)} sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"\n{i}. {sent}")

# Output:
# Correctly identified 3 sentences:
#
# 1. As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability.
#
# 2. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
#
# 3. There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150.
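
To see why the abbreviations above matter, the sketch below applies a naive rule (split after any ".", "!" or "?" that is followed by whitespace) to a short citation. The `naive_split` helper and the sample string are illustrative only and are not part of nupunkt-rs.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", text.strip())


sample = "See Smith v. Jones, 123 U.S. 456 (2020). The court agreed."
pieces = naive_split(sample)
for piece in pieces:
    print(repr(piece))
# The naive rule yields four fragments, breaking after "v." and "U.S."
```

The sample contains only two sentences, so two of the four breaks are false positives caused by abbreviation periods.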

Fine-Tuning Precision with the precision_recall Parameter

The precision_recall parameter (0.0-1.0) controls the precision/recall trade-off: values near 0.0 favor recall (more sentence breaks), and values near 1.0 favor precision (fewer breaks, so abbreviations are more likely to be preserved). The recommended range is 0.3-0.5 for legal text and 0.4-0.6 for financial reports.

# Longer legal text to show the impact
long_legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150. This Court further noted that Rule 702 was amended in response to Daubert and this Court's subsequent cases. See Fed. Rule Evid. 702, Advisory Committee Notes to 2000 Amendments. The amendment affirms the trial court's role as gatekeeper but provides that "all types of expert testimony present questions of admissibility for the trial court." Ibid. Consequently, whether the specific expert testimony on the question at issue focuses on specialized observations, the specialized translation of those observations into theory, a specialized theory itself, or the application of such a theory in a particular case, the expert's testimony often will rest "upon an experience confessedly foreign in kind to [the jury's] own." Hand, Historical and Practical Considerations Regarding Expert Testimony, 15 Harv. L. Rev. 40, 54 (1901). For this reason, the trial judge, in all cases of proffered expert testimony, must find that it is properly grounded, well-reasoned, and not speculative before it can be admitted. The trial judge must determine whether the testimony has "a reliable basis in the knowledge and experience of [the relevant] discipline." Daubert, 509 U. S., at 592."""

# Compare different precision levels
print(f"High recall (PR=0.1): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.1))} sentences")
print(f"Balanced (PR=0.5):    {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5))} sentences")  
print(f"High precision (PR=0.9): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.9))} sentences")

# Output:
# High recall (PR=0.1): 8 sentences
# Balanced (PR=0.5):    7 sentences  
# High precision (PR=0.9): 5 sentences

# Show the actual sentences at balanced setting (recommended for legal text)
sentences = nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5)
print("\nBalanced output (PR=0.5) - Recommended for legal documents:")
for i, sent in enumerate(sentences, 1):
    # Show that abbreviations are correctly preserved
    if "v." in sent or "U.S." in sent or "Id." in sent or "Fed." in sent:
        print(f"\n{i}. ✓ Correctly preserves legal abbreviations:")
        print(f"   {sent[:100]}...")

Recommended precision_recall settings:

  • Legal documents: 0.3-0.5 (preserves "v.", "Id.", "Fed.", "U.S.", "Inc.")
  • Financial reports: 0.4-0.6 (preserves "Inc.", "Ltd.", "Q1", monetary abbreviations)
  • Scientific papers: 0.4-0.6 (preserves "et al.", "e.g.", "i.e.", technical terms)
  • General text: 0.5 (default, balanced)
  • Social media: 0.1-0.3 (more aggressive breaking for informal text)
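
The recommendations above can be kept in a small lookup table so that callers pick a setting by document type. This is a minimal sketch: the ranges are copied from the list, but the `pick_precision_recall` helper, the key names, and the choice of the range midpoint are assumptions made for illustration.

```python
# Ranges copied from the recommendations above; key names are illustrative.
RECOMMENDED_RANGES = {
    "legal": (0.3, 0.5),
    "financial": (0.4, 0.6),
    "scientific": (0.4, 0.6),
    "general": (0.5, 0.5),
    "social": (0.1, 0.3),
}


def pick_precision_recall(doc_type: str) -> float:
    """Return the midpoint of the recommended range (an assumed rule),
    falling back to the 'general' default for unknown document types."""
    low, high = RECOMMENDED_RANGES.get(doc_type, RECOMMENDED_RANGES["general"])
    return round((low + high) / 2, 2)


print(pick_precision_recall("legal"))
print(pick_precision_recall("social"))
```

The returned value can then be passed as the precision_recall argument, for example `nupunkt_rs.sent_tokenize(text, precision_recall=pick_precision_recall("legal"))`.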

Paragraph Tokenization

For documents with multiple paragraphs, you can tokenize at both paragraph and sentence levels:

import nupunkt_rs

text = """First paragraph with legal citations.
See Smith v. Jones, 123 U.S. 456 (2020).

Second paragraph with more detail.
The court in Id. at 457 stated clearly."""

# Get paragraphs as lists of sentences
paragraphs = nupunkt_rs.para_tokenize(text)
print(f"Found {len(paragraphs)} paragraphs")
# Each paragraph is a list of properly segmented sentences

# Or get paragraphs as joined strings
paragraphs_joined = nupunkt_rs.para_tokenize_joined(text)
# Each paragraph is a single string with sentences joined
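
If downstream code needs to know where each sentence came from, the nested result can be flattened into indexed records. The sketch below works on a hand-written nested list in the shape described above (a list of paragraphs, each a list of sentences); the sample data is illustrative and is not actual library output, and `index_sentences` is a hypothetical helper.

```python
from typing import Iterator


def index_sentences(paragraphs: list[list[str]]) -> Iterator[tuple[int, int, str]]:
    """Yield (paragraph_index, sentence_index, sentence) for each sentence."""
    for p_idx, sentences in enumerate(paragraphs):
        for s_idx, sentence in enumerate(sentences):
            yield p_idx, s_idx, sentence


# Hand-written stand-in for the result of para_tokenize(text)
paragraphs = [
    ["First paragraph.", "It has two sentences."],
    ["Second paragraph."],
]
records = list(index_sentences(paragraphs))
for record in records:
    print(record)
```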

Advanced Approach (Using Tokenizer Class)

import nupunkt_rs

# Create a tokenizer with the default model
tokenizer = nupunkt_rs.create_default_tokenizer()

# Default (0.5) - balanced mode
text = "The meeting is at 5 p.m. tomorrow. We'll discuss Q4."
print(tokenizer.tokenize(text))
# Output: ['The meeting is at 5 p.m. tomorrow.', "We'll discuss Q4."]

# High recall (0.1) - more breaks, may split at abbreviations
tokenizer.set_precision_recall_balance(0.1)
print(tokenizer.tokenize(text))
# May split after "p.m."

# High precision (0.9) - fewer breaks, preserves abbreviations
tokenizer.set_precision_recall_balance(0.9) 
print(tokenizer.tokenize(text))
# Won't split after "p.m."

Common Use Cases

Processing Multiple Documents

import nupunkt_rs

# Process multiple documents efficiently
documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences."
]

# Use list comprehension for batch processing
all_sentences = [nupunkt_rs.sent_tokenize(doc) for doc in documents]
print(all_sentences)
# Output: [['First doc.', 'Two sentences.'], ['Second document here.'], ['Third doc.', 'Also two sentences.']]
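
Because the tokenizer is described as thread-safe, batches can also be dispatched through a thread pool. The following is a minimal sketch under two assumptions: the `tokenize_many` helper is hypothetical, and whether threads give a real speedup depends on whether the bindings release Python's GIL, which this README does not state. The stand-in `split_on_period` function only keeps the example runnable without the package installed.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable


def tokenize_many(
    documents: list[str],
    tokenize: Callable[[str], list[str]],
    max_workers: int = 4,
) -> list[list[str]]:
    """Tokenize each document in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, documents))


def split_on_period(text: str) -> list[str]:
    """Crude placeholder tokenizer; swap in nupunkt_rs.sent_tokenize in real use."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


docs = ["First doc. Two sentences.", "Second document here."]
results = tokenize_many(docs, split_on_period)
print(results)
```

In real use, pass `nupunkt_rs.sent_tokenize` as the `tokenize` argument and measure on your own workload before relying on the pool for throughput.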

Getting Character Positions

import nupunkt_rs

# Get sentence boundaries as character positions
tokenizer = nupunkt_rs.create_default_tokenizer()
text = "First sentence. Second sentence."
spans = tokenizer.tokenize_spans(text)
print(spans)
# Output: [(0, 15), (16, 32)]

# Extract sentences using spans
for start, end in spans:
    print(f"'{text[start:end]}'")
# Output:
# 'First sentence.'
# 'Second sentence.'

Command-Line Interface

# Quick tokenization with default model
echo "Dr. Smith arrived. He was late." | nupunkt tokenize

# Adjust precision/recall from command line
nupunkt tokenize --pr-balance 0.8 "Your text here."

# Process a file
nupunkt tokenize --input document.txt --output sentences.txt

Advanced Usage

Understanding Tokenization Decisions

Get detailed insights into why breaks occur or don't occur:

import nupunkt_rs

tokenizer = nupunkt_rs.create_default_tokenizer()
text = "Dr. Smith arrived. He was late."

# Get detailed analysis of each token
analysis = tokenizer.analyze_tokens(text)

for token in analysis.tokens:
    if token.has_period:
        print(f"Token: {token.text}")
        print(f"  Break decision: {token.decision}")
        print(f"  Confidence: {token.confidence:.2f}")

# Explain the decision at a specific position: the period after "Dr."
position = text.index("Dr.") + 2
explanation = tokenizer.explain_decision(text, position)
print(explanation)

Getting Sentence Boundaries as Spans

# Get character positions instead of text
spans = tokenizer.tokenize_spans(text)
# Returns: [(start1, end1), (start2, end2), ...]

for start, end in spans:
    print(f"Sentence: {text[start:end]}")

Training Custom Models

For domain-specific text, you can train your own model:

import nupunkt_rs

trainer = nupunkt_rs.Trainer()

# Optional: Load domain-specific abbreviations
trainer.load_abbreviations_from_json("legal_abbreviations.json")

# Train on your corpus (your_text_corpus: text from your domain that you supply)
params = trainer.train(your_text_corpus, verbose=True)

# Save model for reuse
params.save("my_model.npkt.gz")

# Load and use later
params = nupunkt_rs.Parameters.load("my_model.npkt.gz")
tokenizer = nupunkt_rs.SentenceTokenizer(params)

Performance

Benchmarks on commodity hardware (Linux, Intel x86_64):

| Text Size | Processing Time | Speed    |
|-----------|-----------------|----------|
| 1 KB      | < 0.1 ms        | ~10 MB/s |
| 100 KB    | ~3 ms           | ~30 MB/s |
| 1 MB      | ~33 ms          | ~30 MB/s |
| 10 MB     | ~330 ms         | ~30 MB/s |

For inputs of 100 KB and larger, throughput holds steady at approximately 30 million characters per second. The 1 KB figure is only a lower bound, since its processing time is reported as under 0.1 ms.
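
The Speed column follows from size divided by time, and the same calculation can be applied to your own measurements. This is a rough sketch, assuming decimal units (1 MB = 1,000,000 bytes); `mb_per_second` and `measure_chars_per_second` are hypothetical helpers, and measured results will vary by machine.

```python
import time
from typing import Callable


def mb_per_second(size_bytes: float, seconds: float) -> float:
    """Throughput in decimal megabytes per second."""
    return size_bytes / seconds / 1_000_000


# Rows from the table above: 1 MB in ~33 ms, 10 MB in ~330 ms
print(round(mb_per_second(1_000_000, 0.033), 1))   # 30.3
print(round(mb_per_second(10_000_000, 0.330), 1))  # 30.3


def measure_chars_per_second(
    tokenize: Callable[[str], list[str]], text: str, repeats: int = 5
) -> float:
    """Time repeated calls and return characters per second for the best run."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - start)
    # Guard against a zero reading from the clock on very fast calls
    return len(text) / max(best, 1e-9)
```

To time the library itself, call `measure_chars_per_second(nupunkt_rs.sent_tokenize, your_text)` with a text large enough that per-call overhead does not dominate.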

Memory usage is minimal - the default model uses about 12 MB of RAM, compared to 85+ MB for NLTK's Punkt implementation.

API Reference

Main Functions

  • sent_tokenize(text, model_params=None, precision_recall=None) → List of sentences

    • text: The text to tokenize
    • model_params: Optional custom model parameters
    • precision_recall: Optional PR balance (0.0=recall, 1.0=precision, default=0.5)
  • para_tokenize(text, model_params=None, precision_recall=None) → List of paragraphs (each as list of sentences)

    • Same parameters as sent_tokenize
  • para_tokenize_joined(text, model_params=None, precision_recall=None) → List of paragraphs (each as single string)

    • Same parameters as sent_tokenize
  • create_default_tokenizer() → Returns a SentenceTokenizer with default model

  • load_default_model() → Returns default Parameters

  • train_model(text, verbose=False) → Train new model on text

Main Classes

  • SentenceTokenizer: The main class for tokenizing text

    • tokenize(text) → List of sentences
    • tokenize_spans(text) → List of (start, end) positions
    • tokenize_paragraphs(text) → List of paragraphs (each as list of sentences)
    • tokenize_paragraphs_flat(text) → List of paragraphs (each as single string)
    • set_precision_recall_balance(0.0-1.0) → Adjust behavior
    • analyze_tokens(text) → Detailed token analysis
    • explain_decision(text, position) → Explain break decision at position
  • Parameters: Model parameters

    • save(path) → Save model to disk (compressed)
    • load(path) → Load model from disk
  • Trainer: For training custom models (advanced users only)

    • train(text, verbose=False) → Train on text corpus
    • load_abbreviations_from_json(path) → Load custom abbreviations

Development

Running Tests

# Rust tests
cargo test

# Python tests
pytest python/tests/

# With coverage
cargo tarpaulin
pytest --cov=nupunkt_rs

Code Quality

# Format code
cargo fmt
black python/

# Lint
cargo clippy -- -D warnings
ruff check python/

# Type checking
mypy python/

Building Documentation

# Rust docs
cargo doc --open

# Python docs
cd docs && make html

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

  • Additional language support
  • Performance optimizations
  • More abbreviation lists
  • Documentation improvements
  • Test coverage expansion

License

MIT License - see LICENSE for details.

Citation

If you use nupunkt-rs in your research, please cite the original nupunkt paper:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

For the Rust implementation specifically:

@software{nupunkt-rs,
  title = {nupunkt-rs: High-performance Rust implementation of nupunkt},
  author = {ALEA Institute},
  year = {2025},
  url = {https://github.com/alea-institute/nupunkt-rs}
}

Acknowledgments

  • Original Punkt algorithm by Kiss & Strunk (2006)
