A library for splitting text into sentences

Chunkator

chunkator is a Python library for fast, accurate sentence segmentation. It is an alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex, and it targets the inputs where those tools tend to split incorrectly: abbreviations, URLs, email addresses, and ellipses.

Features

  • High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
  • Regex-Driven: Precompiled regex patterns for faster processing.
  • Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
  • Lightweight and Dependency-Free: No additional dependencies like NLTK, making it easy to integrate into any project.

Why chunkator?

While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:

1. Handling Abbreviations

chunkator processes abbreviations like "Dr.", "Mr.", and "Ph.D." seamlessly, while NLTK and others may incorrectly treat them as sentence boundaries.

Example:

Input:

Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
  • NLTK Output:
    • ['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
  • chunkator Output:
    • ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
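
The idea behind abbreviation handling can be sketched in a few lines. This is an illustration only, not chunkator's actual implementation: the `ABBREVIATIONS` set and the function name `split_with_abbreviations` are assumptions made for this example.

```python
import re

# Hypothetical abbreviation list for illustration; not taken from chunkator.
ABBREVIATIONS = {"dr.", "mr.", "mrs.", "ms.", "prof.", "ph.d."}

# Candidate boundary: whitespace that directly follows '.', '!' or '?'.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_with_abbreviations(text):
    """Split on terminal punctuation unless the preceding token is a known abbreviation."""
    sentences = []
    start = 0
    for match in BOUNDARY.finditer(text):
        chunk = text[start:match.start()]
        last_token = chunk.split()[-1].lower()
        if last_token in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(chunk)
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences


result = split_with_abbreviations(
    "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
)
print(result)
```

A lookup like this has a known limitation: an abbreviation that really does end a sentence ("He earned his Ph.D. Then he moved abroad.") is not treated as a boundary.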

2. Websites and Emails

Common splitters often break sentences when encountering URLs or email addresses.

Example:

Input:

Visit our website at www.example.com. Contact us at support@example.com.
  • LangChain Output:
    • ['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
  • chunkator Output:
    • ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
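
One common way to keep URLs and email addresses intact is to treat a period as a boundary only when whitespace follows it. The sketch below shows that rule in isolation; it is an assumption about how such handling can work, not a copy of chunkator's patterns, and `split_on_terminal_punctuation` is a name invented for this example.

```python
import re

# A boundary requires whitespace after '.', '!' or '?', so the dots inside
# "www.example.com" and "support@example.com" never qualify.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def split_on_terminal_punctuation(text):
    """Split only where terminal punctuation is followed by whitespace."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


result = split_on_terminal_punctuation(
    "Visit our website at www.example.com. Contact us at support@example.com."
)
print(result)
```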

3. Multi-Dot Handling

chunkator correctly handles ellipses and other multi-dot patterns.

Example:

Input:

She hesitated... but eventually agreed. It was unexpected...
  • LlamaIndex Output:
    • ['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
  • chunkator Output:
    • ['She hesitated... but eventually agreed.', 'It was unexpected...']
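
A simple heuristic for ellipses is to require an uppercase letter after the boundary, so that "... but" is not split while ". It" is. The pattern below is a sketch under that assumption and is not taken from chunkator's source; `split_before_capitals` is a hypothetical name.

```python
import re

# Split on whitespace that follows terminal punctuation and precedes a capital letter.
BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_before_capitals(text):
    """Split sentences only where the next word starts with an uppercase letter."""
    return [part for part in BOUNDARY.split(text.strip()) if part]


result = split_before_capitals(
    "She hesitated... but eventually agreed. It was unexpected..."
)
print(result)
```

The trade-off is that an ellipsis followed by a capitalized word is always treated as a sentence end, and sentences that begin with a digit, quote, or lowercase letter are not split.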

4. Efficiency

chunkator is optimized for performance, especially on large documents. Its regex patterns are precompiled, which makes it faster than NLTK, whose tokenizers can be slower on very large inputs.


Installation

Install chunkator via pip:

pip install chunkator

Usage

Here's how to use the chunkator library in your projects:

from sentence_split import sentence_split

# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."

# Split into sentences
sentences = sentence_split(text)

# Print the resulting list of sentences
print(sentences)
# Output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']

Advanced Use Cases

Custom Text Processing

chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
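
As a rough illustration of what a custom rule can look like, the sketch below masks the dots inside user-supplied patterns before splitting and restores them afterwards. Everything here is hypothetical: `split_with_protected`, `PLACEHOLDER`, and the example pattern are not part of chunkator, and the library's internal patterns may be organized differently.

```python
import re

# Candidate boundary: whitespace that directly follows '.', '!' or '?'.
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

# Private-use character standing in for protected dots (assumed absent from the input).
PLACEHOLDER = "\ue000"


def split_with_protected(text, protected_patterns):
    """Split sentences while keeping matches of the given patterns intact."""
    masked = text.strip()
    for pattern in protected_patterns:
        # Hide every dot inside a protected match so it cannot form a boundary.
        masked = pattern.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), masked)
    parts = BOUNDARY.split(masked)
    return [part.replace(PLACEHOLDER, ".") for part in parts if part]


# Example custom rule: "Fig." and "No." followed by a number are not sentence ends.
custom_rules = [re.compile(r"\b(?:Fig|No)\.")]

text = "See Fig. 3 for details. The limit is set in No. 7 of the rules."
result = split_with_protected(text, custom_rules)
print(result)
```

Masking keeps each rule independent of the boundary pattern, so new rules can be added without rewriting the splitter. The cost is that a protected token at the true end of a sentence suppresses that split.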


Benchmarking

Library    | Handles Abbreviations | Handles Websites | Handles Ellipses | Speed (ms per 1000 sentences)
NLTK       | No                    | No               | Partial          | 120
LangChain  | Partial               | No               | No               | 150
LlamaIndex | No                    | Partial          | No               | 130
chunkator  | Yes                   | Yes              | Yes              | 90
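
The table does not say how the timings were measured. One way to take a comparable measurement on your own machine is a small `timeit` harness like the sketch below. The `baseline_split` function is a stand-in written for this example; replace it with whichever splitter you want to time. The sample text and repeat counts are arbitrary assumptions.

```python
import re
import timeit

BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def baseline_split(text):
    """Stand-in splitter; swap in the function you actually want to measure."""
    return BOUNDARY.split(text.strip())


def time_splitter(split_fn, text, repeats=5, number=10):
    """Return the best observed time per call, in milliseconds."""
    timings = timeit.repeat(lambda: split_fn(text), repeat=repeats, number=number)
    return min(timings) / number * 1000.0


# 1000 copies of a short sentence, joined by spaces.
sample = " ".join(["Dr. Smith visited www.example.com today."] * 1000)

elapsed_ms = time_splitter(baseline_split, sample)
print(f"{elapsed_ms:.3f} ms per call")
```

Taking the minimum over several repeats reduces noise from other processes. Results depend on hardware, Python version, and the input text, so they will not necessarily match the figures in the table.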

Contributing

We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.


Download files

Source Distribution

chunkator-0.0.6.tar.gz (4.4 kB)

Uploaded Source

Built Distribution

chunkator-0.0.6-py3-none-any.whl (4.6 kB)

Uploaded Python 3

Provenance

The following attestation bundles were made for chunkator-0.0.6.tar.gz:

Publisher: workflow-release.yml on sahillihas/chunkator
