High-performance Rust implementation of nupunkt sentence/paragraph tokenization

nupunkt-rs

High-performance Rust implementation of nupunkt, a modern reimplementation of the Punkt sentence tokenizer optimized for high-precision legal and financial text processing. This project provides the same accurate sentence segmentation as the original Python nupunkt library, but with 3x faster performance thanks to Rust's efficiency.

Based on the research paper: Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary (Bommarito et al., 2025)

Features

🚀 High Performance: 30M+ characters/second (3x faster than Python nupunkt)
🎯 High Precision: 91.1% precision on legal text benchmarks
⚡ Runtime Adjustable: Tune precision/recall balance at inference time without retraining
📚 Legal-Optimized: Pre-trained model handles complex legal abbreviations and citations
🐍 Python API: Drop-in replacement for Python nupunkt with PyO3 bindings
🧵 Thread-Safe: Safe for parallel processing

Installation

From PyPI

# pip
pip install nupunkt-rs

# uv
uv pip install nupunkt-rs

From Source

Prerequisites:
- Python 3.11+
- Rust toolchain (install from rustup.rs)
- maturin (pip install maturin)

Clone and Install:

git clone https://github.com/alea-institute/nupunkt-rs.git
cd nupunkt-rs

# pip
pip install maturin
maturin develop --release

# uv
uvx maturin develop --release --uv

Quick Start

Why nupunkt-rs for Legal & Financial Documents?

Most tokenizers fail on legal and financial text, breaking incorrectly at abbreviations like "v.", "U.S.", "Inc.", "Id.", and "Fed." This library is specifically optimized for high-precision tokenization of complex professional documents.

import nupunkt_rs

# Real Supreme Court text with complex citations and abbreviations
legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150."""

# Most tokenizers would incorrectly break at "v.", "Inc.", "U.S.", "Co.", and "Id."
# nupunkt-rs handles all of these correctly:
sentences = nupunkt_rs.sent_tokenize(legal_text)
print(f"Correctly identified {len(sentences)} sentences:")
for i, sent in enumerate(sentences, 1):
    print(f"\n{i}. {sent}")

# Output:
# Correctly identified 3 sentences:
#
# 1. As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability.
#
# 2. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999).
#
# 3. There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150.
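
One way to spot-check a segmentation on your own documents is to flag sentences whose last token is a known abbreviation, since those are the likeliest false breaks. The sketch below is a heuristic and not part of nupunkt-rs: `find_suspect_breaks` and `LEGAL_ABBREVIATIONS` are hypothetical names chosen for illustration, and the function works on any list of sentence strings, such as the output of `sent_tokenize`.

```python
# Illustrative helper, not part of the nupunkt-rs API.
# Abbreviations are taken from the examples above; extend them for your corpus.
LEGAL_ABBREVIATIONS = ("v.", "U.S.", "Inc.", "Co.", "Id.", "Fed.")


def find_suspect_breaks(sentences, abbreviations=LEGAL_ABBREVIATIONS):
    """Return (index, sentence) pairs whose last token is a known abbreviation."""
    suspects = []
    for index, sentence in enumerate(sentences):
        tokens = sentence.split()
        if tokens and tokens[-1] in abbreviations:
            suspects.append((index, sentence))
    return suspects


# Fragments of the kind a naive period-based splitter would produce.
naive_output = ["See Kumho Tire Co.", "v.", "Carmichael, 526 U.S.", "137 (1999)."]
for index, sentence in find_suspect_breaks(naive_output):
    print(f"suspect break after sentence {index}: {sentence!r}")
```

A sentence can legitimately end in an abbreviation such as "Inc.", so flagged items are candidates for review rather than confirmed errors.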

Fine-Tuning Precision with the `precision_recall` Parameter

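The precision_recall parameter (0.0-1.0) controls the precision/recall trade-off at inference time: values near 0.0 favor recall (more sentence breaks) and values near 1.0 favor precision (fewer breaks). For legal and financial documents, the recommended range is 0.3-0.5, up to the balanced default, which keeps abbreviations such as "v." and "U.S." intact.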

# Longer legal text to show the impact
long_legal_text = """As we explained in Daubert v. Merrell Dow Pharmaceuticals, Inc., 509 U.S. 579, 597 (1993), Rule 702's requirement that an expert's testimony pertain to "scientific knowledge" establishes a standard of evidentiary reliability. This Court addressed the application of this standard to technical, as opposed to scientific, expert testimony in Kumho Tire Co. v. Carmichael, 526 U.S. 137 (1999). There, we explained that the gatekeeping inquiry must be tied to the facts of a particular case. Id. at 150. This Court further noted that Rule 702 was amended in response to Daubert and this Court's subsequent cases. See Fed. Rule Evid. 702, Advisory Committee Notes to 2000 Amendments. The amendment affirms the trial court's role as gatekeeper but provides that "all types of expert testimony present questions of admissibility for the trial court." Ibid. Consequently, whether the specific expert testimony on the question at issue focuses on specialized observations, the specialized translation of those observations into theory, a specialized theory itself, or the application of such a theory in a particular case, the expert's testimony often will rest "upon an experience confessedly foreign in kind to [the jury's] own." Hand, Historical and Practical Considerations Regarding Expert Testimony, 15 Harv. L. Rev. 40, 54 (1901). For this reason, the trial judge, in all cases of proffered expert testimony, must find that it is properly grounded, well-reasoned, and not speculative before it can be admitted. The trial judge must determine whether the testimony has "a reliable basis in the knowledge and experience of [the relevant] discipline." Daubert, 509 U. S., at 592."""

# Compare different precision levels
print(f"High recall (PR=0.1): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.1))} sentences")
print(f"Balanced (PR=0.5):    {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5))} sentences")  
print(f"High precision (PR=0.9): {len(nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.9))} sentences")

# Output:
# High recall (PR=0.1): 8 sentences
# Balanced (PR=0.5):    7 sentences  
# High precision (PR=0.9): 5 sentences

# Show the actual sentences at balanced setting (recommended for legal text)
sentences = nupunkt_rs.sent_tokenize(long_legal_text, precision_recall=0.5)
print("\nBalanced output (PR=0.5) - Recommended for legal documents:")
for i, sent in enumerate(sentences, 1):
    # Show that abbreviations are correctly preserved
    if "v." in sent or "U.S." in sent or "Id." in sent or "Fed." in sent:
        print(f"\n{i}. ✓ Correctly preserves legal abbreviations:")
        print(f"   {sent[:100]}...")

Recommended precision_recall settings:

Legal documents: 0.3-0.5 (preserves "v.", "Id.", "Fed.", "U.S.", "Inc.")
Financial reports: 0.4-0.6 (preserves "Inc.", "Ltd.", "Q1", monetary abbreviations)
Scientific papers: 0.4-0.6 (preserves "et al.", "e.g.", "i.e.", technical terms)
General text: 0.5 (default, balanced)
Social media: 0.1-0.3 (more aggressive breaking for informal text)
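
The recommendations above can be kept in one place as a small lookup table. This is a minimal sketch: the dictionary keys, the `recommended_precision_recall` name, and the choice of the range midpoint as a starting value are assumptions made for illustration, not part of nupunkt-rs.

```python
# Ranges copied from the recommendations above; all names are illustrative.
RECOMMENDED_RANGES = {
    "legal": (0.3, 0.5),
    "financial": (0.4, 0.6),
    "scientific": (0.4, 0.6),
    "general": (0.5, 0.5),
    "social": (0.1, 0.3),
}


def recommended_precision_recall(doc_type):
    """Return the midpoint of the recommended range (an assumed starting point)."""
    low, high = RECOMMENDED_RANGES[doc_type]
    return round((low + high) / 2, 2)


for doc_type in RECOMMENDED_RANGES:
    value = recommended_precision_recall(doc_type)
    print(f"{doc_type}: precision_recall={value}")
    # With nupunkt-rs installed, the value would be passed as:
    # nupunkt_rs.sent_tokenize(text, precision_recall=value)
```

Treat the midpoint as a first guess and adjust it after inspecting the output on a sample of your own documents.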

Paragraph Tokenization

For documents with multiple paragraphs, you can tokenize at both paragraph and sentence levels:

import nupunkt_rs

text = """First paragraph with legal citations.
See Smith v. Jones, 123 U.S. 456 (2020).

Second paragraph with more detail.
The court in Id. at 457 stated clearly."""

# Get paragraphs as lists of sentences
paragraphs = nupunkt_rs.para_tokenize(text)
print(f"Found {len(paragraphs)} paragraphs")
# Each paragraph is a list of properly segmented sentences

# Or get paragraphs as joined strings
paragraphs_joined = nupunkt_rs.para_tokenize_joined(text)
# Each paragraph is a single string with sentences joined
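
Downstream code often needs each sentence together with its paragraph index, for example when building retrieval chunks. The sketch below assumes only the documented return shape of `para_tokenize` (a list of paragraphs, each a list of sentences). `flatten_paragraphs` is a hypothetical helper, and the sample data is a hand-written stand-in rather than actual library output.

```python
def flatten_paragraphs(paragraphs):
    """Yield (paragraph_index, sentence_index, sentence) for nested output."""
    for p_index, sentences in enumerate(paragraphs):
        for s_index, sentence in enumerate(sentences):
            yield p_index, s_index, sentence


# Hand-written stand-in shaped like para_tokenize output (a list of sentence lists).
paragraphs = [
    ["First paragraph with legal citations.", "See Smith v. Jones, 123 U.S. 456 (2020)."],
    ["Second paragraph with more detail.", "The court in Id. at 457 stated clearly."],
]

for p_index, s_index, sentence in flatten_paragraphs(paragraphs):
    print(f"[{p_index}.{s_index}] {sentence}")
```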

Advanced Approach (Using Tokenizer Class)

import nupunkt_rs

# Create a tokenizer with the default model
tokenizer = nupunkt_rs.create_default_tokenizer()

# Default (0.5) - balanced mode
text = "The meeting is at 5 p.m. tomorrow. We'll discuss Q4."
print(tokenizer.tokenize(text))
# Output: ['The meeting is at 5 p.m. tomorrow.', "We'll discuss Q4."]

# High recall (0.1) - more breaks, may split at abbreviations
tokenizer.set_precision_recall_balance(0.1)
print(tokenizer.tokenize(text))
# May split after "p.m."

# High precision (0.9) - fewer breaks, preserves abbreviations
tokenizer.set_precision_recall_balance(0.9) 
print(tokenizer.tokenize(text))
# Won't split after "p.m."
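
To compare several balance settings on the same text, the two calls above can be wrapped in a loop. This is a sketch under stated assumptions: `sweep_balance` is a hypothetical helper that relies only on the `set_precision_recall_balance` and `tokenize` methods shown above, and `StandInTokenizer` is a toy class included so the example runs without nupunkt-rs installed. Its splitting rule is not the real algorithm.

```python
import re


def sweep_balance(tokenizer, text, values=(0.1, 0.5, 0.9)):
    """Return {balance: sentence_count} for each precision/recall setting."""
    counts = {}
    for value in values:
        tokenizer.set_precision_recall_balance(value)
        counts[value] = len(tokenizer.tokenize(text))
    return counts


class StandInTokenizer:
    """Toy object exposing the same two methods; NOT the nupunkt algorithm."""

    def __init__(self):
        self.balance = 0.5

    def set_precision_recall_balance(self, value):
        self.balance = value

    def tokenize(self, text):
        if self.balance < 0.5:
            # Recall-leaning toy rule: split after every period.
            pattern = r"(?<=\.)\s+"
        else:
            # Precision-leaning toy rule: split only before an uppercase letter.
            pattern = r"(?<=\.)\s+(?=[A-Z])"
        return [part for part in re.split(pattern, text) if part]


text = "The meeting is at 5 p.m. tomorrow. We'll discuss Q4."
# With the library installed, pass nupunkt_rs.create_default_tokenizer() instead.
print(sweep_balance(StandInTokenizer(), text))
```

The sweep leaves the tokenizer at the last value tried, so set the balance you want explicitly afterwards.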

Common Use Cases

Processing Multiple Documents

import nupunkt_rs

# Process multiple documents efficiently
documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences."
]

# Use list comprehension for batch processing
all_sentences = [nupunkt_rs.sent_tokenize(doc) for doc in documents]
print(all_sentences)
# Output: [['First doc.', 'Two sentences.'], ['Second document here.'], ['Third doc.', 'Also two sentences.']]
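
Because the tokenizer is described as thread-safe, the same batch can also be dispatched through a thread pool. The sketch below is an illustration only: `tokenize_documents` and `stand_in_tokenize` are hypothetical names, and the stand-in splitter exists so the example runs without nupunkt-rs installed. Whether threads give a real speedup depends on whether the bindings release the GIL, which is not stated here, so measure before relying on it.

```python
from concurrent.futures import ThreadPoolExecutor


def tokenize_documents(documents, tokenize, max_workers=4):
    """Apply `tokenize` to each document in a thread pool, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(tokenize, documents))


def stand_in_tokenize(text):
    """Naive period splitter used only as a placeholder for the real tokenizer."""
    return [part.strip() + "." for part in text.split(".") if part.strip()]


documents = [
    "First doc. Two sentences.",
    "Second document here.",
    "Third doc. Also two sentences.",
]

# With the library installed, pass nupunkt_rs.sent_tokenize instead of the stand-in.
print(tokenize_documents(documents, stand_in_tokenize))
```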

Getting Character Positions

import nupunkt_rs

# Get sentence boundaries as character positions
tokenizer = nupunkt_rs.create_default_tokenizer()
text = "First sentence. Second sentence."
spans = tokenizer.tokenize_spans(text)
print(spans)
# Output: [(0, 15), (16, 32)]

# Extract sentences using spans
for start, end in spans:
    print(f"'{text[start:end]}'")
# Output: 'First sentence.' 'Second sentence.'
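
Spans are useful when sentences must be mapped back to the source document, for instance for highlighting or annotation. The following sketch works on the span list shown above; `spans_to_records` and `uncovered_gaps` are hypothetical helpers, not part of nupunkt-rs.

```python
def spans_to_records(text, spans):
    """Turn (start, end) spans into dicts with offsets, text, and length."""
    return [
        {"start": start, "end": end, "text": text[start:end], "length": end - start}
        for start, end in spans
    ]


def uncovered_gaps(text, spans):
    """Return the substrings that fall between or after the given spans."""
    gaps = []
    previous_end = 0
    for start, end in spans:
        if start > previous_end:
            gaps.append(text[previous_end:start])
        previous_end = end
    if previous_end < len(text):
        gaps.append(text[previous_end:])
    return gaps


text = "First sentence. Second sentence."
spans = [(0, 15), (16, 32)]  # the span output shown above

for record in spans_to_records(text, spans):
    print(record)
print(uncovered_gaps(text, spans))
```

Checking that the gaps contain only whitespace is a quick way to confirm that no text was dropped between sentences.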

Command-Line Interface

# Quick tokenization with default model
echo "Dr. Smith arrived. He was late." | nupunkt tokenize

# Adjust precision/recall from command line
nupunkt tokenize --pr-balance 0.8 "Your text here."

# Process a file
nupunkt tokenize --input document.txt --output sentences.txt

Advanced Usage

Understanding Tokenization Decisions

Get detailed insights into why breaks occur or don't occur:

import nupunkt_rs

tokenizer = nupunkt_rs.create_default_tokenizer()
text = "Dr. Smith arrived. He was late."

# Get detailed analysis of each token
analysis = tokenizer.analyze_tokens(text)

for token in analysis.tokens:
    if token.has_period:
        print(f"Token: {token.text}")
        print(f"  Break decision: {token.decision}")
        print(f"  Confidence: {token.confidence:.2f}")

# Explain a specific position (here, the period after "Dr.")
position = text.index("Dr.") + 2
explanation = tokenizer.explain_decision(text, position)
print(explanation)

Getting Sentence Boundaries as Spans

# Get character positions instead of text
spans = tokenizer.tokenize_spans(text)
# Returns: [(start1, end1), (start2, end2), ...]

for start, end in spans:
    print(f"Sentence: {text[start:end]}")

Training Custom Models

For domain-specific text, you can train your own model:

import nupunkt_rs

trainer = nupunkt_rs.Trainer()

# Optional: Load domain-specific abbreviations
trainer.load_abbreviations_from_json("legal_abbreviations.json")

# Train on your corpus (your_text_corpus is a placeholder for your training text)
params = trainer.train(your_text_corpus, verbose=True)

# Save model for reuse
params.save("my_model.npkt.gz")

# Load and use later
params = nupunkt_rs.Parameters.load("my_model.npkt.gz")
tokenizer = nupunkt_rs.SentenceTokenizer(params)

Performance

Benchmarks on commodity hardware (Linux, Intel x86_64):

Text Size	Processing Time	Speed
1 KB	< 0.1ms	~10 MB/s
100 KB	~3ms	~30 MB/s
1 MB	~33ms	~30 MB/s
10 MB	~330ms	~30 MB/s

The tokenizer sustains approximately 30 million characters per second for inputs of 100 KB and larger. Very small inputs (around 1 KB) show lower throughput in the table above, roughly 10 MB/s.
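
The speed column follows from dividing input size by processing time. The sketch below reproduces that arithmetic for two table rows and adds a small timing helper for measuring on your own hardware. `throughput_mb_per_s` and `measure` are hypothetical names, 1 MB is taken as 1,000,000 bytes, and `str.split` stands in for the tokenizer so the example runs without nupunkt-rs installed.

```python
import time


def throughput_mb_per_s(size_bytes, seconds):
    """Throughput in MB/s, taking 1 MB = 1,000,000 bytes."""
    return size_bytes / seconds / 1_000_000


# Rows from the table above: 1 MB in ~33 ms and 10 MB in ~330 ms.
print(round(throughput_mb_per_s(1_000_000, 0.033), 1))
print(round(throughput_mb_per_s(10_000_000, 0.330), 1))


def measure(tokenize, text, repeats=5):
    """Best-of-N throughput in characters per second for any callable."""
    best = float("inf")
    for _ in range(repeats):
        started = time.perf_counter()
        tokenize(text)
        best = min(best, time.perf_counter() - started)
    # Guard against a zero reading on very coarse timers.
    return len(text) / max(best, 1e-9)


sample = "First sentence. Second sentence. " * 10_000
# With the library installed, pass nupunkt_rs.sent_tokenize instead of str.split.
print(f"{measure(str.split, sample):,.0f} characters/second")
```

Taking the best of several runs reduces noise from other processes; results will vary with hardware and input.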

Memory usage is minimal - the default model uses about 12 MB of RAM, compared to 85+ MB for NLTK's Punkt implementation.

API Reference

Main Functions

sent_tokenize(text, model_params=None, precision_recall=None) → List of sentences
- text: The text to tokenize
- model_params: Optional custom model parameters
- precision_recall: Optional PR balance (0.0=recall, 1.0=precision, default=0.5)

para_tokenize(text, model_params=None, precision_recall=None) → List of paragraphs (each as list of sentences)
- Same parameters as sent_tokenize

para_tokenize_joined(text, model_params=None, precision_recall=None) → List of paragraphs (each as single string)
- Same parameters as sent_tokenize

create_default_tokenizer() → Returns a SentenceTokenizer with default model
load_default_model() → Returns default Parameters
train_model(text, verbose=False) → Train new model on text

Main Classes

SentenceTokenizer: The main class for tokenizing text
- tokenize(text) → List of sentences
- tokenize_spans(text) → List of (start, end) positions
- tokenize_paragraphs(text) → List of paragraphs (each as list of sentences)
- tokenize_paragraphs_flat(text) → List of paragraphs (each as single string)
- set_precision_recall_balance(0.0-1.0) → Adjust behavior
- analyze_tokens(text) → Detailed token analysis
- explain_decision(text, position) → Explain break decision at position

Parameters: Model parameters
- save(path) → Save model to disk (compressed)
- load(path) → Load model from disk

Trainer: For training custom models (advanced users only)
- train(text, verbose=False) → Train on text corpus
- load_abbreviations_from_json(path) → Load custom abbreviations

Development

Running Tests

# Rust tests
cargo test

# Python tests
pytest python/tests/

# With coverage
cargo tarpaulin
pytest --cov=nupunkt_rs

Code Quality

# Format code
cargo fmt
black python/

# Lint
cargo clippy -- -D warnings
ruff check python/

# Type checking
mypy python/

Building Documentation

# Rust docs
cargo doc --open

# Python docs
cd docs && make html

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Areas for Contribution

Additional language support
Performance optimizations
More abbreviation lists
Documentation improvements
Test coverage expansion

License

MIT License - see LICENSE for details.

Citation

If you use nupunkt-rs in your research, please cite the original nupunkt paper:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

For the Rust implementation specifically:

@software{nupunkt-rs,
  title = {nupunkt-rs: High-performance Rust implementation of nupunkt},
  author = {ALEA Institute},
  year = {2025},
  url = {https://github.com/alea-institute/nupunkt-rs}
}

Acknowledgments

Original Punkt algorithm by Kiss & Strunk (2006)

Support

Issues: GitHub Issues
Discussions: GitHub Discussions
Email: hello@aleainstitute.ai


Release history

0.1.1 (this version): Aug 16, 2025
0.1.0: Aug 11, 2025
