A library for splitting text into sentences

Project description

Chunkator

chunkator is a Python library for efficient and precise sentence segmentation. It is an alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex, with dedicated handling for text structures such as abbreviations, URLs, and ellipses, where those splitters can break sentences incorrectly.

Features

  • High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
  • Regex-Driven: Precompiled regex patterns for faster processing.
  • Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
  • Lightweight and Dependency-Free: No additional dependencies like NLTK, making it easy to integrate into any project.

Why chunkator?

While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:

1. Handling Abbreviations

chunkator processes abbreviations like "Dr.", "Mr.", and "Ph.D." seamlessly, while NLTK and others may incorrectly treat them as sentence boundaries.

Example:

Input:

Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
  • NLTK Output:
    • ['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
  • chunkator Output:
    • ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
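The idea behind abbreviation handling can be sketched in a few lines. This is a minimal illustration, not chunkator's actual implementation: the ABBREVIATIONS set, the BOUNDARY pattern, and the function name split_with_abbreviations are all assumptions made for the example.

```python
import re

# Hypothetical abbreviation list for illustration; not chunkator's own list.
ABBREVIATIONS = {"dr", "mr", "mrs", "ms", "prof", "ph.d"}

# Candidate boundary: whitespace that directly follows ., ! or ?
BOUNDARY = re.compile(r"(?<=[.!?])\s+")

def split_with_abbreviations(text):
    """Split at candidate boundaries unless the preceding token is a known abbreviation."""
    sentences, start = [], 0
    for match in BOUNDARY.finditer(text):
        # Last whitespace-delimited token before the candidate boundary.
        token = text[start:match.start()].split()[-1]
        if token.rstrip(".").lower() in ABBREVIATIONS:
            continue  # the period belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.start()])
        start = match.end()
    if start < len(text):
        sentences.append(text[start:])
    return sentences

text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
print(split_with_abbreviations(text))
# ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
```

One limitation of this sketch: a sentence that ends with a listed abbreviation (for example "She has a Ph.D. He does not.") is not split, so a fuller approach would need extra context checks.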

2. Websites and Emails

Common splitters often break sentences when encountering URLs or email addresses.

Example:

Input:

Visit our website at www.example.com. Contact us at support@example.com.
  • LangChain Output:
    • ['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
  • chunkator Output:
    • ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
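One way to keep URLs and email addresses intact is to treat a period as a boundary only when it is followed by whitespace and an uppercase letter. The sketch below illustrates that rule; the pattern and the name split_keeping_urls are assumptions for this example and are not taken from chunkator's source.

```python
import re

# Boundary = sentence-final punctuation, then whitespace, then an uppercase letter.
# Periods inside "www.example.com" are followed by a letter, not whitespace,
# so they never match.
SENTENCE_BOUNDARY = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def split_keeping_urls(text):
    """Split on boundaries that are followed by whitespace and a capital letter."""
    return SENTENCE_BOUNDARY.split(text.strip())

text = "Visit our website at www.example.com. Contact us at support@example.com."
print(split_keeping_urls(text))
# ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
```

The uppercase lookahead is a deliberate simplification: sentences that start with a digit, a quote, or a lowercase word would not be detected by this rule.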

3. Multi-Dot Handling

chunkator correctly handles ellipses and other multi-dot patterns.

Example:

Input:

She hesitated... but eventually agreed. It was unexpected...
  • LlamaIndex Output:
    • ['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
  • chunkator Output:
    • ['She hesitated... but eventually agreed.', 'It was unexpected...']
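A simple way to avoid splitting inside or right after an ellipsis is to reject any candidate boundary that is preceded by two consecutive dots. The following is a sketch under that assumption; the pattern and the name split_keeping_ellipses are illustrative and not chunkator's own.

```python
import re

# Split on whitespace after ., ! or ?, but not when the two preceding
# characters are "..", which indicates a run of dots such as an ellipsis.
BOUNDARY_NO_ELLIPSIS = re.compile(r"(?<=[.!?])(?<!\.\.)\s+")

def split_keeping_ellipses(text):
    """Split sentences while treating runs of dots as non-boundaries."""
    return BOUNDARY_NO_ELLIPSIS.split(text.strip())

text = "She hesitated... but eventually agreed. It was unexpected..."
print(split_keeping_ellipses(text))
# ['She hesitated... but eventually agreed.', 'It was unexpected...']
```

The trade-off is that an ellipsis which really does end a sentence in the middle of a text is never treated as a boundary, so the following sentence stays attached to it.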

4. Efficiency

chunkator is optimized for performance, especially on large documents. Its precompiled regex patterns make it faster than NLTK, whose tokenizers can be slow on very large inputs.
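For readers unfamiliar with the term, "precompiled" means the pattern is compiled once, at import time, and reused on every call. The sketch below shows the two styles side by side; the pattern and function names are assumptions for illustration, not chunkator's code.

```python
import re

PATTERN = r"(?<=[.!?])\s+(?=[A-Z])"

# Compiled once when the module is imported, then reused.
COMPILED = re.compile(PATTERN)

def split_precompiled(text):
    """Use the module-level compiled pattern."""
    return COMPILED.split(text)

def split_inline(text):
    """Pass the pattern string on every call; re looks it up in its internal cache."""
    return re.split(PATTERN, text)

text = "First sentence. Second sentence. Third sentence."
assert split_precompiled(text) == split_inline(text)
print(split_precompiled(text))
# ['First sentence.', 'Second sentence.', 'Third sentence.']
```

Both functions return the same result. Because Python's re module caches recently compiled patterns, the saving from precompiling is mostly the per-call cache lookup, so any speed difference should be measured rather than assumed.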


Installation

Install chunkator via pip:

pip install chunkator

Usage

Here's how to use the chunkator library in your projects:

from sentence_split import sentence_split

# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."

# Split into sentences
sentences = sentence_split(text)

# Print the resulting list of sentences
print(sentences)
# Output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']

Advanced Use Cases

Custom Text Processing

chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
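If editing the library's patterns directly is not convenient, a custom rule can also be layered on top of any splitter by masking the periods that should be protected, splitting, and then restoring them. The sketch below is hypothetical: split_with_custom_rules, PLACEHOLDER, FIGURE_REF, and the stand-in simple_splitter are names invented for this example, and passing chunkator's own splitting function as the splitter argument is an assumption that has not been verified here.

```python
import re

# ONE DOT LEADER, used as a temporary stand-in for "."; assumed absent from the input.
PLACEHOLDER = "\u2024"

def split_with_custom_rules(text, protected_patterns, splitter):
    """Mask periods inside matches of protected_patterns, split, then restore them."""
    for pattern in protected_patterns:
        text = pattern.sub(lambda m: m.group(0).replace(".", PLACEHOLDER), text)
    return [sentence.replace(PLACEHOLDER, ".") for sentence in splitter(text)]

# Stand-in splitter so the sketch runs on its own; any callable that
# takes a string and returns a list of strings can be used instead.
def simple_splitter(text):
    return re.split(r"(?<=[.!?])\s+", text)

# Hypothetical custom rule: keep "Fig. 3"-style references together.
FIGURE_REF = re.compile(r"\bFig\.\s+\d+")

text = "See Fig. 3 for details. The trend is clear."
print(split_with_custom_rules(text, [FIGURE_REF], simple_splitter))
# ['See Fig. 3 for details.', 'The trend is clear.']
```

Keeping custom rules outside the library this way means they survive package upgrades, at the cost of one extra pass over the text per pattern.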


Benchmarking

Library     Handles Abbreviations  Handles Websites  Handles Ellipses  Speed (ms for 1000 sentences)
NLTK        No                     No                Partial           120
LangChain   Partial                No                No                150
LlamaIndex  No                     Partial           No                130
chunkator   Yes                    Yes               Yes               90
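To reproduce a figure like "ms for 1000 sentences" on your own hardware, a small timing harness is enough. The sketch below is not the script used to produce the table above, which is not published here; the function name ms_per_1000_sentences, the stand-in demo_splitter, and the synthetic corpus are all assumptions for illustration.

```python
import re
import timeit

def ms_per_1000_sentences(splitter, text, repeats=5):
    """Best-of-N wall time for splitter(text), scaled to milliseconds per 1000 sentences."""
    n_sentences = len(splitter(text))
    if n_sentences == 0:
        raise ValueError("splitter returned no sentences")
    # Take the minimum over several runs to reduce noise from other processes.
    best_seconds = min(timeit.repeat(lambda: splitter(text), number=1, repeat=repeats))
    return best_seconds * 1000.0 * (1000.0 / n_sentences)

# Stand-in splitter so the sketch runs on its own; swap in the splitter under test.
def demo_splitter(text):
    return re.split(r"(?<=[.!?])\s+", text)

# Synthetic corpus of 1000 short sentences.
corpus = " ".join(["This is a test sentence."] * 1000)
print(f"{ms_per_1000_sentences(demo_splitter, corpus):.2f} ms per 1000 sentences")
```

Results depend on the machine, the Python version, and the corpus, so compare splitters on the same text in the same session.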

Contributing

We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.


Download files

Source Distribution

chunkator-0.0.10.tar.gz (4.4 kB view details)

Uploaded Source

Built Distribution

chunkator-0.0.10-py3-none-any.whl (4.6 kB view details)

Uploaded Python 3

File details

Details for the file chunkator-0.0.10.tar.gz.

File metadata

  • Download URL: chunkator-0.0.10.tar.gz
  • Upload date:
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for chunkator-0.0.10.tar.gz
Algorithm Hash digest
SHA256 0b699a95566ae1311f7b5506d042deae71a180ddfe025be7faaf783a99b0a4e7
MD5 adf2273af9d59a16c511b95a284badcf
BLAKE2b-256 836c9e806155737f5f88ced32ce7a9ba68bb04de5cefaf45cd491bc6f474f238

Provenance

The following attestation bundles were made for chunkator-0.0.10.tar.gz:

Publisher: workflow-release.yml on sahillihas/chunkator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chunkator-0.0.10-py3-none-any.whl.

File metadata

  • Download URL: chunkator-0.0.10-py3-none-any.whl
  • Upload date:
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for chunkator-0.0.10-py3-none-any.whl
Algorithm Hash digest
SHA256 8279093f338b5c1fd22c15266dbb83f9ea64e57f86bb8710791b219d58adf01a
MD5 c84fb3c9456170fe5c1f7d3658ce8626
BLAKE2b-256 969a2d8239b6273d37cef37e6524c5b86a50a0c9f094faa63d3eeb61e15c92fb

Provenance

The following attestation bundles were made for chunkator-0.0.10-py3-none-any.whl:

Publisher: workflow-release.yml on sahillihas/chunkator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
