A library for splitting text into sentences
Chunkator
Welcome to chunkator, a Python library for efficient and precise sentence segmentation. It is an alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex. With customizable handling of complex text structures, chunkator targets the cases where those libraries can fail.
Features
- High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
- Regex-Driven: Precompiled regex patterns for faster processing.
- Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
- Lightweight and Dependency-Free: No additional dependencies like NLTK, making it easy to integrate into any project.
Why chunkator?
While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:
1. Handling Abbreviations
chunkator processes abbreviations like "Dr.", "Mr.", and "Ph.D." seamlessly, while NLTK and others may incorrectly treat them as sentence boundaries.
Example:
Input:
Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
- NLTK Output:
['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
- chunkator Output:
['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
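To make the idea concrete, here is a minimal sketch of one way abbreviation-aware splitting can be done with a precompiled regex. It is not chunkator's actual implementation: the `ABBREVIATIONS` set and the `split_with_abbreviations` function are hypothetical names chosen for illustration.

```python
import re

# Hypothetical abbreviation list for illustration; not chunkator's actual list.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "ph.d."}

# Candidate boundary: whitespace that follows '.', '!' or '?'.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_with_abbreviations(text):
    """Split on candidate boundaries, then re-join pieces ending in an abbreviation."""
    sentences = []
    buffer = ""
    for piece in _BOUNDARY.split(text.strip()):
        if not piece:
            continue
        buffer = f"{buffer} {piece}" if buffer else piece
        last_token = buffer.rsplit(None, 1)[-1].lower()
        if last_token in ABBREVIATIONS:
            # The period belongs to an abbreviation, so keep accumulating.
            continue
        sentences.append(buffer)
        buffer = ""
    if buffer:
        sentences.append(buffer)
    return sentences


print(split_with_abbreviations(
    "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
))
```

On the input above, this sketch yields the same two sentences shown in the chunkator output. One known limitation of this approach: a sentence that genuinely ends with an abbreviation is merged with the sentence that follows it.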
2. Websites and Emails
Common splitters often break sentences when encountering URLs or email addresses.
Example:
Input:
Visit our website at www.example.com. Contact us at support@example.com.
- LangChain Output:
['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
- chunkator Output:
['Visit our website at www.example.com.', 'Contact us at support@example.com.']
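A common way to avoid this failure is to mask URLs and email addresses before splitting and restore them afterwards. The sketch below illustrates that technique under stated assumptions: the `_PROTECTED` pattern is deliberately simplified, `split_protecting_links` is a hypothetical name, and none of this is taken from chunkator's source.

```python
import re

# Simplified patterns for illustration only; real URL and email grammars are broader.
_PROTECTED = re.compile(
    r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"                                # email address
    r"|(?:https?://|www\.)[\w-]+(?:\.[\w-]+)+(?:/(?:\S*[\w/])?)?"  # URL
)

# Deliberately naive splitter: every run of '.', '!' or '?' ends a sentence.
_NAIVE_SENTENCE = re.compile(r"[^.!?]+[.!?]+|[^.!?]+$")

# Placeholders are a slot index wrapped in NUL characters, which contain no dots.
_PLACEHOLDER = re.compile(r"\x00(\d+)\x00")


def split_protecting_links(text):
    """Mask links, split naively, then put the links back."""
    protected = []

    def mask(match):
        protected.append(match.group(0))
        return f"\x00{len(protected) - 1}\x00"

    def restore(match):
        return protected[int(match.group(1))]

    masked = _PROTECTED.sub(mask, text)
    pieces = _NAIVE_SENTENCE.findall(masked)
    return [_PLACEHOLDER.sub(restore, p).strip() for p in pieces if p.strip()]


print(split_protecting_links(
    "Visit our website at www.example.com. Contact us at support@example.com."
))
```

Because the placeholders contain no periods, even a splitter that breaks on every dot leaves the links intact.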
3. Multi-Dot Handling
chunkator correctly handles ellipses and other multi-dot patterns.
Example:
Input:
She hesitated... but eventually agreed. It was unexpected...
- LlamaIndex Output:
['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
- chunkator Output:
['She hesitated... but eventually agreed.', 'It was unexpected...']
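One simple heuristic that produces this behavior is to split only where terminal punctuation is followed by whitespace and then an uppercase letter, digit, or opening quote, so an ellipsis followed by a lowercase word stays inside the sentence. The sketch below is an assumption about how such a rule could be written, not chunkator's own pattern; `split_keeping_ellipses` is a hypothetical name.

```python
import re

# Split on whitespace that follows terminal punctuation and precedes an
# uppercase letter, a digit, or an opening quote or parenthesis.
# A run of dots ends in '.', so "..." is covered by the same lookbehind.
_BOUNDARY = re.compile(r"(?<=[.!?…])\s+(?=[A-Z0-9\"'(])")


def split_keeping_ellipses(text):
    """Split sentences without breaking on an ellipsis that continues a sentence."""
    return [s for s in _BOUNDARY.split(text.strip()) if s]


print(split_keeping_ellipses(
    "She hesitated... but eventually agreed. It was unexpected..."
))
```

This rule only addresses ellipses. On its own it would still split after an abbreviation such as "Dr." when a capitalized name follows, so a real splitter needs to combine it with abbreviation handling.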
4. Efficiency
chunkator is optimized for performance, especially on large documents. Its precompiled regex patterns make it faster than NLTK, which relies on tokenizers that can be slower on very large inputs.
Installation
Install chunkator via pip:
pip install chunkator
Usage
Here's how to use the chunkator library in your projects:
from sentence_split import sentence_split
# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."
# Split into sentences
sentences = sentence_split(text)
# Print the result
print(sentences)
# Expected output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']
Advanced Use Cases
Custom Text Processing
chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
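The library's internal pattern names are not documented here, so the following is only a standalone sketch of what rule customization can look like: a factory that builds a precompiled boundary pattern from a default abbreviation list plus user-supplied additions. `DEFAULT_ABBREVIATIONS` and `make_splitter` are hypothetical names and do not refer to anything in chunkator.

```python
import re

# Hypothetical defaults for illustration; not chunkator's actual list.
DEFAULT_ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ph.D.")


def make_splitter(extra_abbreviations=()):
    """Build a splitter whose boundary regex skips the given abbreviations."""
    abbreviations = DEFAULT_ABBREVIATIONS + tuple(extra_abbreviations)
    # One fixed-width negative lookbehind per abbreviation (case-sensitive).
    guards = "".join(rf"(?<!\b{re.escape(a)})" for a in abbreviations)
    boundary = re.compile(guards + r"(?<=[.!?])\s+")

    def split(text):
        return [s for s in boundary.split(text.strip()) if s]

    return split


# Add domain-specific abbreviations on top of the defaults.
split = make_splitter(["Fig.", "approx."])
print(split("See Fig. 3 for details. The value is approx. 5 units. Dr. Lee agreed."))
```

Compiling the pattern once inside the factory keeps the per-call cost low, in the same spirit as the precompiled patterns described in the features list.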
Benchmarking
| Library | Handles Abbreviations | Handles Websites | Handles Ellipses | Time (ms per 1,000 sentences) |
|---|---|---|---|---|
| NLTK | No | No | Partial | 120 |
| LangChain | Partial | No | No | 150 |
| LlamaIndex | No | Partial | No | 130 |
| chunkator | Yes | Yes | Yes | 90 |
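The table does not state the corpus or hardware behind the timings, so results will vary by machine. A small harness like the one below is one way to measure any splitter on your own data. It is a sketch: `time_splitter` and `baseline_split` are hypothetical helpers, and the baseline is a stand-in regex splitter, not any of the libraries in the table.

```python
import re
import time


def time_splitter(splitter, sentences, repeats=5):
    """Return the best wall-clock time in milliseconds over `repeats` runs."""
    text = " ".join(sentences)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        splitter(text)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


# Stand-in splitter; swap in the function you want to measure.
_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def baseline_split(text):
    return _BOUNDARY.split(text)


# Synthetic corpus of 1,000 short sentences.
corpus = [f"This is sentence number {i}." for i in range(1000)]
elapsed_ms = time_splitter(baseline_split, corpus)
print(f"baseline: {elapsed_ms:.2f} ms for {len(corpus)} sentences")
```

Taking the minimum over several runs reduces noise from other processes. To compare libraries fairly, pass each one's split function to `time_splitter` with the same corpus.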
Contributing
We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.
File details
Details for the file chunkator-0.0.7.tar.gz.
File metadata
- Download URL: chunkator-0.0.7.tar.gz
- Upload date:
- Size: 4.4 kB
- Tags: Source
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0435a9920bd21209ff9b0caa03c4e07b1119c7eab99849a57bb53248b65295fe |
| MD5 | 0282fbb88e28cf7c61f9f02d9bb83f0f |
| BLAKE2b-256 | 5b55d784f54ab7a6ad66933c417eda9e9827c0235875822bce15c745c07e8058 |
Provenance
The following attestation bundles were made for chunkator-0.0.7.tar.gz:
Publisher: workflow-release.yml on sahillihas/chunkator
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chunkator-0.0.7.tar.gz
- Subject digest: 0435a9920bd21209ff9b0caa03c4e07b1119c7eab99849a57bb53248b65295fe
- Sigstore transparency entry: 155544522
- Sigstore integration time:
- Permalink: sahillihas/chunkator@a1dfc3ff5a6db19dbbf9ae11437c40ccbf7fdca5
- Branch / Tag: refs/tags/v0.0.7-release
- Owner: https://github.com/sahillihas
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow-release.yml@a1dfc3ff5a6db19dbbf9ae11437c40ccbf7fdca5
- Trigger Event: push
File details
Details for the file chunkator-0.0.7-py3-none-any.whl.
File metadata
- Download URL: chunkator-0.0.7-py3-none-any.whl
- Upload date:
- Size: 4.6 kB
- Tags: Python 3
- Uploaded using Trusted Publishing? Yes
- Uploaded via: twine/6.0.1 CPython/3.12.8
File hashes
| Algorithm | Hash digest |
|---|---|
| SHA256 | 0f3817c50a79b12bd57054cb60cda573bb0dd4a9e12a150a884da077d4a04857 |
| MD5 | 53c1dae7b979c21073110938992f0285 |
| BLAKE2b-256 | 2509aa38ce2c69712317ebb9b8e51a997e43561a17d8ee6fa51d66c9e6ebbfaa |
Provenance
The following attestation bundles were made for chunkator-0.0.7-py3-none-any.whl:
Publisher: workflow-release.yml on sahillihas/chunkator
- Statement type: https://in-toto.io/Statement/v1
- Predicate type: https://docs.pypi.org/attestations/publish/v1
- Subject name: chunkator-0.0.7-py3-none-any.whl
- Subject digest: 0f3817c50a79b12bd57054cb60cda573bb0dd4a9e12a150a884da077d4a04857
- Sigstore transparency entry: 155544523
- Sigstore integration time:
- Permalink: sahillihas/chunkator@a1dfc3ff5a6db19dbbf9ae11437c40ccbf7fdca5
- Branch / Tag: refs/tags/v0.0.7-release
- Owner: https://github.com/sahillihas
- Access: public
- Token Issuer: https://token.actions.githubusercontent.com
- Runner Environment: github-hosted
- Publication workflow: workflow-release.yml@a1dfc3ff5a6db19dbbf9ae11437c40ccbf7fdca5
- Trigger Event: push