Next-generation Punkt sentence and paragraph boundary detection with zero dependencies
Project description
nupunkt
A high-precision, high-throughput sentence boundary detection library optimized for legal text processing, with zero runtime dependencies.
Note on Performance: Version 0.6.0+ includes adaptive tokenization features that add slight overhead compared to v0.5.1 and earlier. While 0.6.0+ is marginally slower, it remains faster than comparable methods and provides user-configurable precision/recall control through the threshold parameter (e.g.,
sent_tokenize_adaptive(text, threshold=0.1)for more conservative sentence splitting).
Overview
nupunkt is a next-generation implementation of the Punkt algorithm specifically optimized for legal text processing. It accurately detects sentence boundaries in complex legal documents where periods are used for abbreviations, citations, and other non-sentence-ending contexts.
Key features:
- Zero dependencies: Pure Python 3.11+ (tqdm optional for progress bars)
- Adaptive mode: Starting with
0.6.0, supports an adaptive, confidence-based variant - High precision: 91.1% precision on legal text benchmarks
- High performance: Processes 10+ million characters per second on standard CPU hardware
- Pre-trained model: Ready to use with legal-optimized abbreviations
- Trainable: Can be trained on domain-specific text
- Paragraph detection: Split text into both sentences and paragraphs
- CLI tools: Complete command-line interface for training and evaluation
Paper
For the research behind this implementation, see:
Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary
Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
arXiv:2504.04131 [cs.CL]
https://arxiv.org/abs/2504.04131
Interactive demo available at: https://sentences.aleainstitute.ai/
Installation
pip install nupunkt
Quick Start
from nupunkt import sent_tokenize
text = """
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.). This release does not include Employee's right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement.
"""
# Tokenize into sentences
sentences = sent_tokenize(text)
for i, sentence in enumerate(sentences, 1):
print(f"Sentence {i}: {sentence}\n")
Adaptive Tokenization (New in v0.6.0)
Adaptive mode dynamically discovers abbreviation patterns and improves sentence boundary detection:
from nupunkt import sent_tokenize_adaptive
text = """Dr. Smith graduated from M.I.T. in 2020. She works at N.A.S.A. now.
Her colleague Mr. Johnson has a Ph.D. from U.C.L.A. and collaborates with researchers
at C.E.R.N. on quantum physics."""
# Use adaptive mode with abbreviation pattern detection
sentences = sent_tokenize_adaptive(text)
# Adjust confidence threshold (default: 0.7)
sentences = sent_tokenize_adaptive(text, threshold=0.8)
# Get confidence scores for each decision
sentences_with_scores = sent_tokenize_adaptive(text, return_confidence=True)
for sentence, confidence in sentences_with_scores:
print(f"[{confidence:.2f}] {sentence}")
The adaptive tokenizer:
- Automatically detects abbreviation patterns (M.I.T., Ph.D., etc.)
- Uses context clues to make better decisions
- Provides confidence scores for each boundary decision
- Falls back to the robust base algorithm when uncertain
Sentence and Paragraph Spans
Get character-level spans for sentences and paragraphs:
from nupunkt import sent_spans, sent_spans_with_text, para_spans, para_spans_with_text
# Get sentence spans (start, end positions)
sentence_spans = sent_spans(text)
# Get sentences with their spans
sentences_with_spans = sent_spans_with_text(text)
for sentence, (start, end) in sentences_with_spans:
print(f"[{start}:{end}] {sentence}")
# Same for paragraphs
paragraph_spans = para_spans(text)
paragraphs_with_spans = para_spans_with_text(text)
Adaptive Spans
Get spans using the adaptive algorithm for better abbreviation handling:
from nupunkt import sent_spans_adaptive, sent_spans_with_text_adaptive
# Get adaptive sentence spans
text = "Dr. Smith studied at M.I.T. in Cambridge."
spans = sent_spans_adaptive(text)
# Returns: [(0, 41)] - single sentence preserved
# Get sentences with spans
results = sent_spans_with_text_adaptive(text)
for sentence, (start, end) in results:
print(f"[{start}:{end}] {sentence}")
# With confidence scores
results = sent_spans_with_text_adaptive(text, return_confidence=True)
for sentence, (start, end), confidence in results:
print(f"[{confidence:.2f}] [{start}:{end}] {sentence}")
All span functions guarantee:
- Contiguous spans with no gaps
- Full coverage of the input text
- Preservation of all whitespace
Paragraph Detection
from nupunkt import para_tokenize
# Get paragraph text
paragraphs = para_tokenize(text)
Command-line Interface
Basic usage
# Using Python directly
echo "Hello world. How are you?" | python -c "import sys; from nupunkt import sent_tokenize; print('\n'.join(sent_tokenize(sys.stdin.read())))"
# Or create a simple script
python -c "from nupunkt import sent_tokenize; import sys; [print(s) for s in sent_tokenize(sys.stdin.read())]"
Training models
# Train from text files
nupunkt train corpus.txt --output model.bin
# Train from HuggingFace datasets
nupunkt train hf:alea-institute/kl3m-data-usc -o legal_model.bin
# Memory-efficient training for large datasets
nupunkt train huge_corpus.txt --batch-size 1000000 --min-type-freq 5
Evaluating models
# Evaluate a model
nupunkt evaluate test_data.jsonl -m my_model.bin
# Compare multiple models
nupunkt evaluate test_data.jsonl --compare --models baseline.bin custom.bin
Model management
# Convert between formats
nupunkt convert model.json model.bin
# Get model information
nupunkt info model.bin
# Optimize hyperparameters
nupunkt optimize-params train.jsonl test.jsonl -o best_model.bin
Performance
nupunkt is designed for high-precision, high-throughput processing:
- Token caching for common tokens
- Fast path processing for texts without sentence boundaries
- Pre-computed properties to avoid repeated calculations
- Efficient character processing in hot spots
Example benchmark on legal text:
Documents processed: 1
Total characters: 16,567,769
Total sentences found: 16,095
Processing time: 0.49 seconds
Processing speed: 33,927,693 characters/second
Documentation
- Getting Started Guide - Detailed usage examples
- Training Guide - Train custom models
- Algorithm Overview - How nupunkt works
- API Reference - Complete API documentation
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Citation
If you use nupunkt in your research, please cite:
@article{bommarito2025precise,
title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
journal={arXiv preprint arXiv:2504.04131},
year={2025}
}
Acknowledgments
nupunkt is based on the Punkt algorithm originally developed by Tibor Kiss and Jan Strunk.
Project details
Download files
Download the file for your platform. If you're not sure which to choose, learn more about installing packages.
Source Distribution
Built Distribution
Filter files by name, interpreter, ABI, and platform.
If you're not sure about the file name format, learn more about wheel file names.
Copy a direct link to the current filters
File details
Details for the file nupunkt-0.6.0.tar.gz.
File metadata
- Download URL: nupunkt-0.6.0.tar.gz
- Upload date:
- Size: 9.1 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
1caee9c1b31326b542c95bf33f2562e5d2b991dd35c89c480baea65cc57f9638
|
|
| MD5 |
9f42d161ff33d9a444babdecbc00a395
|
|
| BLAKE2b-256 |
b9d593f053d3ffda443eeffada12194eb924b5ed4f4e1d94ae69974dba2f38d6
|
File details
Details for the file nupunkt-0.6.0-py3-none-any.whl.
File metadata
- Download URL: nupunkt-0.6.0-py3-none-any.whl
- Upload date:
- Size: 9.1 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.7.12
File hashes
| Algorithm | Hash digest | |
|---|---|---|
| SHA256 |
35e52d4f9c48fac6678b5c4d942e746860e7b1c337cb9a5d8a5bd97dd2b6603a
|
|
| MD5 |
84a91446ec0e9657ab794727a848f112
|
|
| BLAKE2b-256 |
10b9618f5af01158d9feb57b27d7ad5f5f6efb1bff724ddcfc5d7cde179d3b80
|