Next-generation Punkt sentence and paragraph boundary detection with zero dependencies

nupunkt

A high-precision, high-throughput sentence boundary detection library optimized for legal text processing, with zero runtime dependencies.

Note on performance: Versions 0.6.0 and later include adaptive tokenization features that add slight overhead compared with v0.5.1 and earlier. Although 0.6.0+ is marginally slower, it remains faster than comparable methods and lets you trade precision against recall through the threshold parameter (e.g., sent_tokenize_adaptive(text, threshold=0.1) for more conservative sentence splitting).

Overview

nupunkt is a next-generation implementation of the Punkt algorithm specifically optimized for legal text processing. It accurately detects sentence boundaries in complex legal documents where periods are used for abbreviations, citations, and other non-sentence-ending contexts.

Key features:

  • Zero dependencies: Pure Python 3.11+ (tqdm optional for progress bars)
  • Adaptive mode: Starting with 0.6.0, supports an adaptive, confidence-based variant
  • High precision: 91.1% precision on legal text benchmarks
  • High performance: Processes 10+ million characters per second on standard CPU hardware
  • Pre-trained model: Ready to use with legal-optimized abbreviations
  • Trainable: Can be trained on domain-specific text
  • Paragraph detection: Split text into both sentences and paragraphs
  • CLI tools: Complete command-line interface for training and evaluation

Paper

For the research behind this implementation, see:

Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary
Michael J Bommarito, Daniel Martin Katz, Jillian Bommarito
arXiv:2504.04131 [cs.CL]
https://arxiv.org/abs/2504.04131

Interactive demo available at: https://sentences.aleainstitute.ai/

Installation

pip install nupunkt

Quick Start

from nupunkt import sent_tokenize

text = """
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and 
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful 
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in 
Employment Act (29 U.S.C. § 621 et. seq.). This release does not include Employee's right to indemnification, 
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement.
"""

# Tokenize into sentences
sentences = sent_tokenize(text)

for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}\n")

Adaptive Tokenization (New in v0.6.0)

Adaptive mode dynamically discovers abbreviation patterns and improves sentence boundary detection:

from nupunkt import sent_tokenize_adaptive

text = """Dr. Smith graduated from M.I.T. in 2020. She works at N.A.S.A. now.
Her colleague Mr. Johnson has a Ph.D. from U.C.L.A. and collaborates with researchers
at C.E.R.N. on quantum physics."""

# Use adaptive mode with abbreviation pattern detection
sentences = sent_tokenize_adaptive(text)

# Adjust confidence threshold (default: 0.7)
sentences = sent_tokenize_adaptive(text, threshold=0.8)

# Get confidence scores for each decision
sentences_with_scores = sent_tokenize_adaptive(text, return_confidence=True)
for sentence, confidence in sentences_with_scores:
    print(f"[{confidence:.2f}] {sentence}")

The adaptive tokenizer:

  • Automatically detects abbreviation patterns (M.I.T., Ph.D., etc.)
  • Uses context clues to make better decisions
  • Provides confidence scores for each boundary decision
  • Falls back to the robust base algorithm when uncertain
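
One way to use the confidence scores is to route low-confidence sentences to manual review. The sketch below assumes only what the loop above shows, namely that return_confidence=True yields (sentence, confidence) pairs. The helper name flag_low_confidence and the min_confidence cut-off are hypothetical and separate from the tokenizer's own threshold parameter, and the sample pairs are hand-written stand-ins, not real nupunkt output.

```python
from typing import Iterable


def flag_low_confidence(
    scored: Iterable[tuple[str, float]], min_confidence: float = 0.7
) -> list[tuple[str, float]]:
    """Return the (sentence, confidence) pairs that score below min_confidence."""
    return [(sentence, conf) for sentence, conf in scored if conf < min_confidence]


# Hand-written stand-in data with made-up scores, NOT real nupunkt output.
sample = [
    ("Dr. Smith graduated from M.I.T. in 2020.", 0.95),
    ("She works at N.A.S.A. now.", 0.55),
]

flagged = flag_low_confidence(sample)
for sentence, confidence in flagged:
    print(f"review [{confidence:.2f}] {sentence}")
```

In practice, the list passed in would be the result of sent_tokenize_adaptive(text, return_confidence=True).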

Sentence and Paragraph Spans

Get character-level spans for sentences and paragraphs:

from nupunkt import sent_spans, sent_spans_with_text, para_spans, para_spans_with_text

# Get sentence spans (start, end positions)
sentence_spans = sent_spans(text)

# Get sentences with their spans
sentences_with_spans = sent_spans_with_text(text)
for sentence, (start, end) in sentences_with_spans:
    print(f"[{start}:{end}] {sentence}")

# Same for paragraphs
paragraph_spans = para_spans(text)
paragraphs_with_spans = para_spans_with_text(text)
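
Because both span types are character offsets into the same text, sentences can be grouped under the paragraph that contains them. The sketch below assumes that sent_spans and para_spans each return a sorted list of (start, end) tuples and that paragraph boundaries coincide with sentence boundaries. The helper group_by_paragraph is hypothetical, and the spans are written by hand rather than produced by nupunkt.

```python
Span = tuple[int, int]


def group_by_paragraph(
    paragraph_spans: list[Span], sentence_spans: list[Span]
) -> list[list[Span]]:
    """Assign each sentence span to the paragraph span containing its start offset."""
    if not paragraph_spans:
        return []
    groups: list[list[Span]] = [[] for _ in paragraph_spans]
    p = 0
    for start, end in sentence_spans:
        # Advance to the paragraph whose range includes this sentence's start.
        while p < len(paragraph_spans) - 1 and start >= paragraph_spans[p][1]:
            p += 1
        groups[p].append((start, end))
    return groups


# Hand-written spans for illustration only.
example_text = "First sentence. Second sentence.\n\nThird one."
example_paragraphs = [(0, 34), (34, 44)]
example_sentences = [(0, 16), (16, 34), (34, 44)]

grouped = group_by_paragraph(example_paragraphs, example_sentences)
for index, sentences in enumerate(grouped, 1):
    print(index, [example_text[s:e] for s, e in sentences])
```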

Adaptive Spans

Get spans using the adaptive algorithm for better abbreviation handling:

from nupunkt import sent_spans_adaptive, sent_spans_with_text_adaptive

# Get adaptive sentence spans
text = "Dr. Smith studied at M.I.T. in Cambridge."
spans = sent_spans_adaptive(text)
# Returns: [(0, 41)] - single sentence preserved

# Get sentences with spans
results = sent_spans_with_text_adaptive(text)
for sentence, (start, end) in results:
    print(f"[{start}:{end}] {sentence}")

# With confidence scores
results = sent_spans_with_text_adaptive(text, return_confidence=True)
for sentence, (start, end), confidence in results:
    print(f"[{confidence:.2f}] [{start}:{end}] {sentence}")

All span functions guarantee:

  • Contiguous spans with no gaps
  • Full coverage of the input text
  • Preservation of all whitespace
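
Taken together, these guarantees mean that concatenating the slices for each span reproduces the input exactly. The sketch below checks that property for any list of (start, end) tuples. The helper check_spans is hypothetical and not part of nupunkt, and the example spans are written by hand.

```python
def check_spans(text: str, spans: list[tuple[int, int]]) -> bool:
    """Return True if spans are contiguous, cover all of text, and lose no characters."""
    if not spans:
        return text == ""
    # Full coverage: the first span starts at 0 and the last ends at len(text).
    if spans[0][0] != 0 or spans[-1][1] != len(text):
        return False
    # Contiguity: each span starts exactly where the previous one ended.
    for (_, prev_end), (start, _) in zip(spans, spans[1:]):
        if start != prev_end:
            return False
    # Whitespace preservation: joining the slices rebuilds the original text.
    return "".join(text[s:e] for s, e in spans) == text


sample_text = "First sentence. Second sentence.\n\nThird one."
good_spans = [(0, 16), (16, 34), (34, 44)]
gapped_spans = [(0, 15), (16, 34), (34, 44)]  # drops the space at offset 15

print(check_spans(sample_text, good_spans))
print(check_spans(sample_text, gapped_spans))
```

The same check can be applied to the output of sent_spans(text) or para_spans(text).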

Paragraph Detection

from nupunkt import para_tokenize

# Get paragraph text
paragraphs = para_tokenize(text)

Command-line Interface

Basic usage

# Using Python directly
echo "Hello world. How are you?" | python -c "import sys; from nupunkt import sent_tokenize; print('\n'.join(sent_tokenize(sys.stdin.read())))"

# Or print one sentence per line from a file (input.txt is a placeholder name)
python -c "import sys; from nupunkt import sent_tokenize; print(*sent_tokenize(sys.stdin.read()), sep='\n')" < input.txt

Training models

# Train from text files
nupunkt train corpus.txt --output model.bin

# Train from HuggingFace datasets
nupunkt train hf:alea-institute/kl3m-data-usc -o legal_model.bin

# Memory-efficient training for large datasets
nupunkt train huge_corpus.txt --batch-size 1000000 --min-type-freq 5

Evaluating models

# Evaluate a model
nupunkt evaluate test_data.jsonl -m my_model.bin

# Compare multiple models
nupunkt evaluate test_data.jsonl --compare --models baseline.bin custom.bin
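
For orientation, boundary detection is usually scored with precision, recall, and F1 over boundary positions. The sketch below uses those standard definitions on sets of character offsets. It is not the CLI's implementation: this README does not describe the matching rules or the test-data format that nupunkt evaluate uses, so the function name and the offsets here are illustrative assumptions.

```python
def boundary_scores(predicted: set[int], gold: set[int]) -> tuple[float, float, float]:
    """Compute (precision, recall, F1) for predicted vs. gold boundary offsets."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical offsets: every predicted boundary is correct, but one is missed.
gold_offsets = {16, 34, 44}
predicted_offsets = {16, 44}

precision, recall, f1 = boundary_scores(predicted_offsets, gold_offsets)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

A splitter tuned for precision, as in this example, rarely breaks a sentence in the wrong place, at the cost of occasionally missing a true boundary.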

Model management

# Convert between formats
nupunkt convert model.json model.bin

# Get model information
nupunkt info model.bin

# Optimize hyperparameters
nupunkt optimize-params train.jsonl test.jsonl -o best_model.bin

Performance

nupunkt is designed for high-precision, high-throughput processing:

  • Token caching for common tokens
  • Fast path processing for texts without sentence boundaries
  • Pre-computed properties to avoid repeated calculations
  • Efficient character processing in hot spots

Example benchmark on legal text:

Documents processed:      1
Total characters:         16,567,769
Total sentences found:    16,095
Processing time:          0.49 seconds
Processing speed:         33,927,693 characters/second
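
A characters-per-second figure like the one above can be measured with a small timing harness. The sketch below is one way to do it, not the script that produced this benchmark. measure_throughput and naive_split are hypothetical names, and naive_split is only a stand-in so that the example runs without nupunkt installed.

```python
import time
from typing import Callable


def measure_throughput(
    tokenize: Callable[[str], list[str]], text: str
) -> dict[str, float]:
    """Time one tokenization pass and report characters processed per second."""
    start = time.perf_counter()
    sentences = tokenize(text)
    elapsed = time.perf_counter() - start
    return {
        "characters": len(text),
        "sentences": len(sentences),
        "seconds": elapsed,
        "chars_per_second": len(text) / elapsed if elapsed > 0 else float("inf"),
    }


def naive_split(text: str) -> list[str]:
    """Stand-in splitter for the demo; far less accurate than a real tokenizer."""
    return [piece for piece in text.split(". ") if piece]


sample_corpus = "This is a sentence. " * 1000
stats = measure_throughput(naive_split, sample_corpus)
print(f"{stats['chars_per_second']:,.0f} characters/second")
```

To measure nupunkt itself, pass sent_tokenize in place of naive_split and use a corpus large enough that the elapsed time is well above timer resolution.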

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Citation

If you use nupunkt in your research, please cite:

@article{bommarito2025precise,
  title={Precise Legal Sentence Boundary Detection for Retrieval at Scale: NUPunkt and CharBoundary},
  author={Bommarito, Michael J and Katz, Daniel Martin and Bommarito, Jillian},
  journal={arXiv preprint arXiv:2504.04131},
  year={2025}
}

Acknowledgments

nupunkt is based on the Punkt algorithm originally developed by Tibor Kiss and Jan Strunk.
