Next-generation Punkt sentence and paragraph boundary detection with zero dependencies

nupunkt

nupunkt is a next-generation implementation of the Punkt algorithm for sentence boundary detection with zero runtime dependencies.

Overview

nupunkt accurately detects sentence boundaries in text, even in challenging cases where periods are used for abbreviations, ellipses, and other non-sentence-ending contexts. It's built on the statistical principles of the Punkt algorithm, with modern enhancements for improved handling of edge cases.
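To make the abbreviation problem concrete, here is a minimal sketch of the naive approach that Punkt-style tokenizers are meant to improve on: splitting after every period, exclamation mark, or question mark that is followed by whitespace. The helper name `naive_split` and the sample sentence are illustrative assumptions, not part of nupunkt.

```python
import re


def naive_split(text: str) -> list[str]:
    """Split after every '.', '!' or '?' that is followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


# Two real sentences, but three periods belong to abbreviations or dates.
sample = (
    "Payment is due to Acme Inc. by Sep. 15, 2013. "
    "Late fees apply after that date."
)

fragments = naive_split(sample)
for fragment in fragments:
    print(repr(fragment))

# The naive rule yields more fragments than there are sentences.
print(f"{len(fragments)} fragments for 2 sentences")
```

The naive rule breaks after "Inc." and "Sep." because it cannot tell an abbreviation period from a sentence-final one, which is the distinction the Punkt approach learns from data.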

Key features:

Minimal dependencies: Requires only Python 3.11+ and tqdm (for progress bars)
Pre-trained model: Ready to use out of the box
Fast and accurate: Optimized implementation of the Punkt algorithm
Trainable: Can be trained on domain-specific text
Full support for ellipsis: Handles various ellipsis patterns
Type annotations: Complete type hints for better IDE integration

Installation

pip install nupunkt

Quick Start

from nupunkt import sent_tokenize

text = """
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and 
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful 
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in 
Employment Act (29 U.S.C. § 621 et. seq.). This release does not include Employee’s right to indemnification, 
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between 
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
"""

# Tokenize into sentences
sentences = sent_tokenize(text)

# Print the results
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}\n")

Output:

Sentence 1:
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.).

Sentence 2:  This release does not include Employee’s right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
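As the output above suggests, sentences taken from hard-wrapped text keep their internal line breaks and leading whitespace. If single-line sentences are needed downstream, a small post-processing step can collapse the whitespace. The sketch below is an assumption about how one might do that; `normalize_whitespace` is a hypothetical helper, and the input strings are shortened stand-ins for the tokenizer output.

```python
def normalize_whitespace(sentences: list[str]) -> list[str]:
    """Collapse runs of whitespace (including newlines) and drop empty items."""
    cleaned = (" ".join(sentence.split()) for sentence in sentences)
    return [sentence for sentence in cleaned if sentence]


# Shortened stand-ins for sentences returned from hard-wrapped text.
wrapped = [
    "\nEmployee also specifically and forever releases\nthe Acme Inc. (Company).",
    "  This release does not include\nEmployee's right to indemnification.",
]

for line in normalize_whitespace(wrapped):
    print(line)
```

Keeping this step separate from tokenization leaves the original text and its character offsets untouched until normalization is actually wanted.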

Documentation

For more detailed documentation, see the docs directory.

Command-line Tools

nupunkt comes with several utility scripts for working with models:

check_abbreviation.py: Check if a token is in the model's abbreviation list

python -m scripts.utils.check_abbreviation "U.S." 
python -m scripts.utils.check_abbreviation --list   # List all abbreviations
python -m scripts.utils.check_abbreviation --count  # Count abbreviations

test_tokenizer.py: Test the tokenizer on sample text
model_info.py: Display information about a model file

See scripts/utils/README.md for more details on the available tools.
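Conceptually, checking a token against an abbreviation list is a normalized set lookup. The sketch below assumes the classic Punkt convention of storing abbreviation types lowercased and without the final period; nupunkt's actual storage format and the internals of check_abbreviation.py may differ. The names `normalize_abbreviation`, `is_abbreviation`, and the small `ABBREVIATIONS` set are hypothetical.

```python
def normalize_abbreviation(token: str) -> str:
    """Lowercase the token and drop one trailing period (assumed convention)."""
    token = token.strip().lower()
    return token[:-1] if token.endswith(".") else token


# Hypothetical abbreviation set; a real model's list is far larger.
ABBREVIATIONS = {"u.s", "inc", "sec", "ex", "sep"}


def is_abbreviation(token: str, abbreviations: set[str] = ABBREVIATIONS) -> bool:
    """Return True if the normalized token appears in the abbreviation set."""
    return normalize_abbreviation(token) in abbreviations


for candidate in ("U.S.", "Inc.", "Company."):
    print(candidate, is_abbreviation(candidate))
```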

Advanced Example

from nupunkt import PunktTrainer, PunktSentenceTokenizer

# Train a new model on domain-specific text
with open("legal_corpus.txt", "r", encoding="utf-8") as f:
    legal_text = f.read()

trainer = PunktTrainer(legal_text, verbose=True)
params = trainer.get_params()

# Save the trained model
trainer.save("legal_model.json")

# Create a tokenizer with the trained parameters
tokenizer = PunktSentenceTokenizer(params)

# Tokenize legal text
legal_sample = "The court ruled in favor of the plaintiff. 28 U.S.C. § 1332 provides jurisdiction."
sentences = tokenizer.tokenize(legal_sample)

for s in sentences:
    print(s)
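For intuition about what training on domain text can pick up, the toy sketch below flags word types that almost always appear with a trailing period. This is a deliberately simplified frequency heuristic, not the scoring used by Punkt (which relies on a log-likelihood ratio together with further factors such as token length and internal periods) and not nupunkt's implementation. The function name, thresholds, and corpus are assumptions made for illustration.

```python
from collections import Counter


def abbreviation_candidates(
    text: str, min_count: int = 2, min_ratio: float = 0.9
) -> list[str]:
    """Return word types that occur with a trailing period almost every time."""
    with_period: Counter[str] = Counter()
    total: Counter[str] = Counter()
    for raw in text.split():
        word = raw.lower().strip("(),;:\"'")
        base = word.rstrip(".")
        # Skip numbers and anything that is not alphabetic apart from periods.
        if not base or not base.replace(".", "").isalpha():
            continue
        total[base] += 1
        if word.endswith("."):
            with_period[base] += 1
    return sorted(
        base
        for base, count in total.items()
        if count >= min_count and with_period[base] / count >= min_ratio
    )


corpus = (
    "See Sec. 2 and Sec. 7 of the plan. The plan ends in Sep. 2013. "
    "Notices under Sec. 9 go to the plan administrator. "
    "Filing in Sep. 2014 is late."
)

print(abbreviation_candidates(corpus))
```

Here "plan" is rejected because it mostly appears without a period, while "sec" and "sep" always carry one. A real trainer needs far more text and a more careful statistic to separate abbreviations from rare sentence-final words.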

Performance

nupunkt is designed to be both accurate and efficient. In the legal-domain benchmark reported below, it processes roughly 34 million characters per second, which makes it suitable for production NLP pipelines.

Highly Optimized

The tokenizer has been extensively optimized for performance:

Token caching for common tokens
Fast path processing for texts without sentence boundaries (up to 1.4B characters/second)
Pre-computed properties to avoid repeated calculations
Efficient character processing and string handling in hot spots
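Two of the ideas listed above, token caching and a fast path for text with no possible boundary, can be sketched in a few lines. This is an illustration of the general techniques under stated assumptions, not nupunkt's code: the function names, the set of sentence-ending characters, and the deliberately naive slow path (which would mis-split abbreviations) are all hypothetical.

```python
from functools import lru_cache

SENTENCE_END_CHARS = (".", "!", "?")  # assumed set of boundary characters


def has_possible_boundary(text: str) -> bool:
    """Cheap scan: can this text contain a sentence boundary at all?"""
    return any(char in text for char in SENTENCE_END_CHARS)


@lru_cache(maxsize=4096)
def token_properties(token: str) -> tuple[bool, bool]:
    """(ends with a boundary character, starts uppercase), cached per token."""
    return token.endswith(SENTENCE_END_CHARS), token[:1].isupper()


def split_sentences(text: str) -> list[str]:
    # Fast path: no boundary characters, so skip token-level work entirely.
    if not has_possible_boundary(text):
        stripped = text.strip()
        return [stripped] if stripped else []

    # Naive slow path, standing in for the real boundary decision logic.
    tokens = text.split()
    sentences: list[str] = []
    current: list[str] = []
    for index, token in enumerate(tokens):
        current.append(token)
        ends_sentence, _ = token_properties(token)
        is_last = index + 1 == len(tokens)
        next_is_upper = not is_last and token_properties(tokens[index + 1])[1]
        if ends_sentence and (is_last or next_is_upper):
            sentences.append(" ".join(current))
            current = []
    if current:
        sentences.append(" ".join(current))
    return sentences


print(split_sentences("no terminal punctuation here"))
print(split_sentences("It rained. We stayed in."))
print(token_properties.cache_info())
```

The fast path costs a few substring scans, which is why inputs without any boundary characters can be handled at a much higher rate than ordinary text.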

Example Legal Domain Benchmark

Performance Results:
  Documents processed:      1
  Total characters:         16,567,769
  Total sentences found:    16,095
  Processing time:          0.49 seconds
  Processing speed:         33,927,693 characters/second
  Average sentence length:  1029.4 characters
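The derived figures in this report follow directly from the raw counts. The short check below recomputes them; the only assumption is that the processing time shown is rounded to two decimals, which would explain why dividing by 0.49 gives a slightly lower throughput than the reported value.

```python
total_characters = 16_567_769
total_sentences = 16_095
processing_seconds = 0.49  # as reported, presumably rounded
reported_chars_per_second = 33_927_693

# Throughput and average sentence length from the raw counts.
characters_per_second = total_characters / processing_seconds
average_sentence_length = total_characters / total_sentences

# Processing time implied by the reported throughput.
implied_seconds = total_characters / reported_chars_per_second

print(f"Throughput from rounded time: {characters_per_second:,.0f} characters/second")
print(f"Average sentence length:      {average_sentence_length:.1f} characters")
print(f"Implied processing time:      {implied_seconds:.3f} seconds")
```

The recomputed throughput is about 33.8 million characters per second, within half a percent of the reported figure, and the average sentence length matches the report.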

Specialized Use Cases

Normal text processing: ~31M characters/second
Text without sentence boundaries: ~1.4B characters/second
Short text fragments: Extremely fast with early exit paths

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

nupunkt is based on the Punkt algorithm originally developed by Tibor Kiss and Jan Strunk.

Release history

0.6.0: Aug 5, 2025
0.5.1: Apr 6, 2025 (this version)
0.5.0: Apr 2, 2025
0.4.1: Mar 31, 2025
0.4.0: Mar 31, 2025
0.3.0: Mar 31, 2025

Download files

Source Distribution

nupunkt-0.5.1.tar.gz (5.6 MB)

Uploaded Apr 6, 2025 Source

Built Distribution

nupunkt-0.5.1-py3-none-any.whl (5.6 MB)

Uploaded Apr 6, 2025 Python 3

File details

Details for the file nupunkt-0.5.1.tar.gz.

File metadata

Download URL: nupunkt-0.5.1.tar.gz
Upload date: Apr 6, 2025
Size: 5.6 MB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.23

File hashes

Hashes for nupunkt-0.5.1.tar.gz
Algorithm	Hash digest
SHA256	`b2953fca78edb5cf16050642d4c75144b5993b5dba94cda97e5b0b25c121d332`
MD5	`78719df7c05a7fe5d1778844c67a1d28`
BLAKE2b-256	`fc72a4eee974baf391c411f9e2388fefb50915e8ea9eb4b2ba87357044b9aea1`

File details

Details for the file nupunkt-0.5.1-py3-none-any.whl.

File metadata

Download URL: nupunkt-0.5.1-py3-none-any.whl
Upload date: Apr 6, 2025
Size: 5.6 MB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: uv/0.5.23

File hashes

Hashes for nupunkt-0.5.1-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b39cd5b115bcd132bebaf42c2d88b47a034d5e8ae397c23a194901dd00f72025`
MD5	`9dec1ead6e52b7cd6c183aa0c50aacef`
BLAKE2b-256	`a397092b7cb864fbbfe3949c0bf41c9ffa50e58bbea3af8667933997ce513188`
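A downloaded file can be checked against the published SHA256 digest with the standard library. The sketch below is one straightforward way to do it; the helper names are hypothetical, and the file path passed in is whatever location the file was saved to.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_hash(path: Path, expected_hex: str) -> bool:
    """Compare the computed digest with a published hex digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

For example, `matches_published_hash(Path("nupunkt-0.5.1-py3-none-any.whl"), "b39cd5b115bcd132bebaf42c2d88b47a034d5e8ae397c23a194901dd00f72025")` should return True for an intact copy of the wheel listed above.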
