Next-generation Punkt sentence boundary detection with zero dependencies
nupunkt
nupunkt is a next-generation implementation of the Punkt algorithm for sentence boundary detection with zero runtime dependencies.
Overview
nupunkt accurately detects sentence boundaries in text, even in challenging cases where periods are used for abbreviations, ellipses, and other non-sentence-ending contexts. It's built on the statistical principles of the Punkt algorithm, with modern enhancements for improved handling of edge cases.
Key features:
- Minimal dependencies: Only requires Python 3.11+ and tqdm for progress bars
- Pre-trained model: Ready to use out of the box
- Fast and accurate: Optimized implementation of the Punkt algorithm
- Trainable: Can be trained on domain-specific text
- Full support for ellipsis: Handles various ellipsis patterns
- Type annotations: Complete type hints for better IDE integration
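To illustrate why the cases mentioned in the overview are hard, here is a minimal sketch of a naive baseline that splits after every `.`, `!`, or `?` followed by whitespace. It is not part of nupunkt; `naive_split` and the sample sentence are hypothetical and exist only to show the failure mode that a Punkt-style model is meant to avoid.

```python
import re

# Naive baseline: treat any '.', '!' or '?' followed by whitespace as a boundary.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Split on every sentence-final punctuation mark followed by whitespace."""
    return [part for part in NAIVE_BOUNDARY.split(text.strip()) if part]


# Two real sentences, but several abbreviation periods along the way.
sample = (
    "The filing cites 29 U.S.C. § 621 et seq. and Sec. 7.1.4. "
    "It was signed on Sep. 15, 2013."
)

pieces = naive_split(sample)
for piece in pieces:
    print(repr(piece))
print(f"{len(pieces)} pieces found, 2 sentences expected")
```

The baseline breaks after "U.S.C.", "seq.", "Sec.", and "Sep.", so it returns six fragments for a two-sentence input.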
Installation
pip install nupunkt
Quick Start
from nupunkt import sent_tokenize
text = """
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.). This release does not include Employee’s right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
"""
# Tokenize into sentences
sentences = sent_tokenize(text)
# Print the results
for i, sentence in enumerate(sentences, 1):
    print(f"Sentence {i}: {sentence}\n")
Output:
Sentence 1:
Employee also specifically and forever releases the Acme Inc. (Company) and the Company Parties (except where and
to the extent that such a release is expressly prohibited or made void by law) from any claims based on unlawful
employment discrimination or harassment, including, but not limited to, the Federal Age Discrimination in
Employment Act (29 U.S.C. § 621 et. seq.).
Sentence 2: This release does not include Employee’s right to indemnification,
and related insurance coverage, under Sec. 7.1.4 or Ex. 1-1 of the Employment Agreement, his right to equity awards,
or continued exercise, pursuant to the terms of any specific equity award (or similar) agreement between
Employee and the Company nor to Employee’s right to benefits under any Company plan or program in which
Employee participated and is due a benefit in accordance with the terms of the plan or program as of the Effective
Date and ending at 11:59 p.m. Eastern Time on Sep. 15, 2013.
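The output above suggests that the returned sentences keep the whitespace of the input, including the leading newline and the hard line breaks. If single-line sentences are needed, a small post-processing step along these lines may help. This is a sketch only: `flatten_sentence` is a hypothetical helper, and `raw_sentences` is a hand-written stand-in for the list returned by `sent_tokenize`.

```python
def flatten_sentence(sentence: str) -> str:
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return " ".join(sentence.split())


# Stand-in for the list returned by sent_tokenize(text) in the Quick Start.
raw_sentences = [
    "\nEmployee also releases the Acme Inc. (Company)\nfrom any claims.",
    " This release does not include\nEmployee's right to indemnification.",
]

flat = [flatten_sentence(s) for s in raw_sentences]
for i, sentence in enumerate(flat, 1):
    print(f"Sentence {i}: {sentence}")
```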
Documentation
For more detailed documentation, see the docs directory.
Command-line Tools
nupunkt comes with several utility scripts for working with models:
- check_abbreviation.py: Check if a token is in the model's abbreviation list
  python -m scripts.utils.check_abbreviation "U.S."
  python -m scripts.utils.check_abbreviation --list   # List all abbreviations
  python -m scripts.utils.check_abbreviation --count  # Count abbreviations
- test_tokenizer.py: Test the tokenizer on sample text
- model_info.py: Display information about a model file
See the scripts/utils/README.md for more details on available tools.
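Conceptually, an abbreviation check is a normalized set lookup. The sketch below assumes the classic Punkt convention of storing abbreviation types lowercased and without the trailing period (for example `u.s` for "U.S."). Whether nupunkt's model file uses exactly this form is an assumption, and both the helper names and the sample set are hypothetical.

```python
def normalize_abbreviation(token: str) -> str:
    """Lowercase the token and drop one trailing period, if present."""
    token = token.strip().lower()
    return token[:-1] if token.endswith(".") else token


def is_known_abbreviation(token: str, abbreviations: set[str]) -> bool:
    """Return True if the normalized token is in the abbreviation set."""
    return normalize_abbreviation(token) in abbreviations


# Hypothetical abbreviation set, not taken from the shipped model.
abbreviations = {"u.s", "inc", "sec", "sep"}

for token in ("U.S.", "Inc.", "Court."):
    print(token, is_known_abbreviation(token, abbreviations))
```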
Advanced Example
from nupunkt import PunktTrainer, PunktSentenceTokenizer
# Train a new model on domain-specific text
with open("legal_corpus.txt", "r", encoding="utf-8") as f:
    legal_text = f.read()
trainer = PunktTrainer(legal_text, verbose=True)
params = trainer.get_params()
# Save the trained model
trainer.save("legal_model.json")
# Create a tokenizer with the trained parameters
tokenizer = PunktSentenceTokenizer(params)
# Tokenize legal text
legal_sample = "The court ruled in favor of the plaintiff. 28 U.S.C. § 1332 provides jurisdiction."
sentences = tokenizer.tokenize(legal_sample)
for s in sentences:
    print(s)
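For intuition about what training on domain-specific text can learn, the following sketch flags short tokens that always appear with a trailing period in a corpus. This is a deliberately crude heuristic, not the log-likelihood statistics used by the Punkt algorithm and not nupunkt's implementation. The function name, thresholds, and toy corpus are all hypothetical.

```python
from collections import Counter


def candidate_abbreviations(
    text: str, min_count: int = 2, max_length: int = 5
) -> set[str]:
    """Flag short tokens that are only ever seen with a trailing period."""
    with_period: Counter[str] = Counter()
    without_period: Counter[str] = Counter()
    for token in text.split():
        token = token.strip("()[],;:\"'")  # drop surrounding punctuation, keep periods
        if not token:
            continue
        if token.endswith("."):
            with_period[token[:-1].lower()] += 1
        else:
            without_period[token.lower()] += 1
    return {
        word
        for word, count in with_period.items()
        if count >= min_count
        and len(word) <= max_length
        and without_period[word] == 0  # never seen without the period
        and any(ch.isalpha() for ch in word)
    }


corpus = (
    "See Sec. 4 of the plan. The plan was amended. "
    "Under Sec. 7 the Company may amend the plan. "
    "Acme Inc. filed a notice. Acme Inc. later withdrew the notice."
)
print(sorted(candidate_abbreviations(corpus)))  # ['inc', 'sec']
```

Here "plan" is rejected because it also occurs without a period, and "notice" is rejected by the length cutoff. A real trainer weighs such evidence statistically rather than with hard thresholds.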
Performance
nupunkt is designed to be both accurate and efficient. It can process large volumes of text quickly, making it suitable for production NLP pipelines.
Example Legal Domain Benchmark
Performance Results:
Documents processed: 1
Total characters: 16,567,769
Total sentences found: 16,070
Processing time: 2.81 seconds
Processing speed: 5,896,222 characters/second
Average sentence length: 1031.0 characters
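The derived figures above follow from the raw counts: speed is characters divided by seconds, and average sentence length is characters divided by sentences. The sketch below shows one way such numbers could be collected for any tokenizer callable. The `benchmark` helper is hypothetical and is not the script that produced the results above. The reported speed was presumably computed from the unrounded time, so dividing by 2.81 gives a slightly different value.

```python
import time
from typing import Callable


def benchmark(tokenize: Callable[[str], list[str]], text: str) -> dict[str, float]:
    """Time one tokenizer call and derive throughput statistics."""
    start = time.perf_counter()
    sentences = tokenize(text)
    elapsed = time.perf_counter() - start
    return {
        "characters": len(text),
        "sentences": len(sentences),
        "seconds": elapsed,
        "chars_per_second": len(text) / elapsed if elapsed > 0 else float("inf"),
        "avg_sentence_length": len(text) / len(sentences) if sentences else 0.0,
    }


# Check the reported averages against the reported raw counts.
total_chars = 16_567_769
total_sentences = 16_070
avg_length = round(total_chars / total_sentences, 1)
print(avg_length)  # 1031.0

# Usage with a stand-in tokenizer; pass nupunkt's sent_tokenize in practice.
stats = benchmark(lambda t: t.split(". "), "A b. C d. E f.")
print(stats["sentences"], stats["characters"])
```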
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
License
This project is licensed under the MIT License - see the LICENSE file for details.
Acknowledgments
nupunkt is based on the Punkt algorithm originally developed by Tibor Kiss and Jan Strunk.
File details
Details for the file nupunkt-0.3.0.tar.gz.
File metadata
- Download URL: nupunkt-0.3.0.tar.gz
- Upload date:
- Size: 1.8 MB
- Tags: Source
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | a93e99f8179f2752bc19d7dcd1e649fe7a47574b953a768636b200773886d4f5 |
| MD5 | ab62bac9ec2e93dd13c0ca0333d7b487 |
| BLAKE2b-256 | 6e6b87d1d2338818318383c92cffb42afaaad3731a46ac421e1f724d492ead0c |
File details
Details for the file nupunkt-0.3.0-py3-none-any.whl.
File metadata
- Download URL: nupunkt-0.3.0-py3-none-any.whl
- Upload date:
- Size: 1.8 MB
- Tags: Python 3
- Uploaded using Trusted Publishing? No
- Uploaded via: uv/0.5.23
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 7473c715b790f22ae4b2a245874debda410dced5ce13599f60ff88cd857b8564 |
| MD5 | 85a3603c7314e618ed0fdc537599b894 |
| BLAKE2b-256 | b6bf6838a619219cbd484e6ebe5248c5c03f56f22c5bde89006d85b1da25cf1e |