A library for splitting text into sentences

Project description

Chunkator

Welcome to chunkator, a Python library for efficient and precise sentence segmentation. It offers an alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex, with customizable handling of complex text structures such as abbreviations, URLs, and ellipses, where those splitters can fail.

Features

  • High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
  • Regex-Driven: Precompiled regex patterns for faster processing.
  • Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
  • Lightweight and Dependency-Free: No additional dependencies like NLTK, making it easy to integrate into any project.

Why chunkator?

While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:

1. Handling Abbreviations

chunkator keeps abbreviations like "Dr.", "Mr.", and "Ph.D." inside their sentence, while NLTK and others may incorrectly treat them as sentence boundaries.

Example:

Input:

Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
  • NLTK Output:
    • ['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
  • chunkator Output:
    • ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
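
One common way to get the behavior shown above is to mask the periods of known abbreviations before splitting and restore them afterwards. The following is a minimal sketch of that idea, not chunkator's actual implementation: the abbreviation list, the placeholder character, and the function name `split_protecting_abbreviations` are all assumptions made for illustration.

```python
import re

# Hypothetical abbreviation list; a real splitter would need a longer one.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Ph.D.")

# Placeholder for a protected period (assumed not to occur in the input).
_MASK = "\u2024"  # ONE DOT LEADER

# Precompiled once at import time: any known abbreviation that does not
# start in the middle of a word.
_ABBREV_RE = re.compile(
    r"(?<!\w)(?:"
    + "|".join(re.escape(a) for a in sorted(ABBREVIATIONS, key=len, reverse=True))
    + r")"
)
# Whitespace that directly follows sentence-ending punctuation.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def split_protecting_abbreviations(text: str) -> list[str]:
    """Split text into sentences without breaking on listed abbreviations."""
    # Hide the periods inside abbreviations so they cannot act as boundaries.
    masked = _ABBREV_RE.sub(lambda m: m.group(0).replace(".", _MASK), text.strip())
    # Split on the remaining boundaries, then restore the hidden periods.
    return [part.replace(_MASK, ".") for part in _BOUNDARY_RE.split(masked) if part]


print(split_protecting_abbreviations(
    "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
))
```

A known limitation of this sketch is that a sentence ending in a listed abbreviation is not split from the sentence that follows it.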

2. Websites and Emails

Common splitters often break sentences when encountering URLs or email addresses.

Example:

Input:

Visit our website at www.example.com. Contact us at support@example.com.
  • LangChain Output:
    • ['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
  • chunkator Output:
    • ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
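
Breakage of this kind typically comes from treating every period as a boundary. The sketch below contrasts that naive rule with one that only splits when the punctuation is followed by whitespace; dots inside `www.example.com` or `support@example.com` are never followed by whitespace, so they survive. Both functions and their names are illustrative assumptions, not the code used by LangChain or chunkator.

```python
import re

# Naive rule (for contrast): every period ends a sentence.
_NAIVE_RE = re.compile(r"[^.]+\.")
# Whitespace-aware rule: split only on whitespace that follows ., ! or ?
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def naive_split(text: str) -> list[str]:
    """Cut after every period, regardless of what follows it."""
    return [piece.strip() for piece in _NAIVE_RE.findall(text)]


def split_on_terminal_punctuation(text: str) -> list[str]:
    """Cut only where terminal punctuation is followed by whitespace."""
    return [s for s in _BOUNDARY_RE.split(text.strip()) if s]


text = "Visit our website at www.example.com. Contact us at support@example.com."
print(naive_split(text))                     # URL and email are broken apart
print(split_on_terminal_punctuation(text))   # URL and email stay intact
```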

3. Multi-Dot Handling

chunkator correctly handles ellipses and other multi-dot patterns.

Example:

Input:

She hesitated... but eventually agreed. It was unexpected...
  • LlamaIndex Output:
    • ['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
  • chunkator Output:
    • ['She hesitated... but eventually agreed.', 'It was unexpected...']
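
A simple heuristic that produces the split shown above is to require an uppercase letter after the boundary, so an ellipsis followed by a lowercase continuation stays inside its sentence. This is a minimal sketch under that assumption; the pattern and the name `split_keeping_ellipses` are hypothetical and do not come from chunkator's source.

```python
import re

# Split on whitespace that follows ., ! or ? and precedes an ASCII uppercase
# letter. A run of dots is kept whole because there is no whitespace inside it.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_keeping_ellipses(text: str) -> list[str]:
    """Split into sentences while leaving '...' attached to its sentence."""
    return [s for s in _BOUNDARY_RE.split(text.strip()) if s]


print(split_keeping_ellipses(
    "She hesitated... but eventually agreed. It was unexpected..."
))
```

The uppercase check is ASCII-only and would still split after an abbreviation such as "Dr." when it precedes a capitalized name, so in practice it needs to be combined with abbreviation handling.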

4. Efficiency

chunkator is optimized for performance, especially on large documents. Its regex patterns are compiled once and reused, which makes it faster than NLTK, whose tokenizers can be slower on very large inputs.


Installation

Install chunkator via pip:

pip install chunkator

Usage

Here's how to use the chunkator library in your projects:

from sentence_split import sentence_split

# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."

# Split into sentences
sentences = sentence_split(text)

# Print the result
print(sentences)
# Output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']

Advanced Use Cases

Custom Text Processing

chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
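
The library's internal pattern names are not documented here, so the following is only a sketch of what rule customization can look like in a regex-based splitter. It assumes a design in which user-supplied "protected" patterns have their periods masked before splitting; `make_splitter` and the example patterns are hypothetical and are not part of chunkator's API.

```python
import re
from typing import Callable, Iterable

_MASK = "\u2024"  # placeholder for a protected period (assumed absent from input)
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def make_splitter(protected_patterns: Iterable[str]) -> Callable[[str], list[str]]:
    """Build a splitter that never breaks inside matches of the given regexes."""
    protected = [re.compile(p) for p in protected_patterns]  # compile once

    def split(text: str) -> list[str]:
        text = text.strip()
        for pattern in protected:
            # Hide periods inside each protected match.
            text = pattern.sub(lambda m: m.group(0).replace(".", _MASK), text)
        # Split on the remaining boundaries and restore the hidden periods.
        return [s.replace(_MASK, ".") for s in _BOUNDARY_RE.split(text) if s]

    return split


# Example custom rules: Latin abbreviations and figure references.
split = make_splitter([r"\b(?:e\.g|i\.e)\.", r"\bFig\."])
print(split("See Fig. 3 for details, e.g. the second panel. Results follow."))
```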


Benchmarking

| Library | Handles Abbreviations | Handles Websites | Handles Ellipses | Speed (ms for 1000 sentences) |
|---|---|---|---|---|
| NLTK | No | No | Partial | 120 |
| LangChain | Partial | No | No | 150 |
| LlamaIndex | No | Partial | No | 130 |
| chunkator | Yes | Yes | Yes | 90 |
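
The methodology behind these timings is not described, so the sketch below shows one way to measure "ms for 1000 sentences" for any splitter on your own machine. It uses a stand-in regex splitter so that it runs without extra packages; `regex_split`, `ms_per_run`, and the synthetic corpus are assumptions for illustration, and the numbers it prints will differ from the table.

```python
import re
import timeit
from typing import Callable

_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def regex_split(text: str) -> list[str]:
    """Stand-in splitter; replace with the function you want to measure."""
    return _BOUNDARY_RE.split(text)


def ms_per_run(splitter: Callable[[str], list[str]], text: str, repeats: int = 20) -> float:
    """Return the best wall-clock time, in milliseconds, of one call."""
    timings = timeit.repeat(lambda: splitter(text), number=1, repeat=repeats)
    return min(timings) * 1000.0


# Synthetic corpus of exactly 1000 sentences.
corpus = " ".join(["This is a short test sentence."] * 1000)
elapsed = ms_per_run(regex_split, corpus)
print(f"regex_split: {elapsed:.2f} ms for 1000 sentences")
```

Taking the minimum over several repeats reduces noise from other processes; to compare libraries, pass each library's splitting function to `ms_per_run` with the same corpus.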

Contributing

We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.


Download files

Source Distribution

chunkator-0.0.5.tar.gz (4.4 kB view details)

Uploaded Source

Built Distribution

chunkator-0.0.5-py3-none-any.whl (4.6 kB view details)

Uploaded Python 3

File details

Details for the file chunkator-0.0.5.tar.gz.

File metadata

  • Download URL: chunkator-0.0.5.tar.gz
  • Size: 4.4 kB
  • Tags: Source
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for chunkator-0.0.5.tar.gz
| Algorithm | Hash digest |
|---|---|
| SHA256 | bf0be8917cd6a4cce33c5922beef2a2e76549b6b5c8676f834449aff840e8a25 |
| MD5 | 09ff9dc3fd9b32ad2b74009219b8a5c4 |
| BLAKE2b-256 | 5b295571c02af02b7f26c88d3c3dcbc70c820aefc18f0c39262a2479acd65942 |

Provenance

The following attestation bundles were made for chunkator-0.0.5.tar.gz:

Publisher: workflow-release.yml on sahillihas/chunkator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.

File details

Details for the file chunkator-0.0.5-py3-none-any.whl.

File metadata

  • Download URL: chunkator-0.0.5-py3-none-any.whl
  • Size: 4.6 kB
  • Tags: Python 3
  • Uploaded using Trusted Publishing? Yes
  • Uploaded via: twine/6.0.1 CPython/3.12.8

File hashes

Hashes for chunkator-0.0.5-py3-none-any.whl
| Algorithm | Hash digest |
|---|---|
| SHA256 | 5e547a792d5746cb2aaedde1e56b38a170a774031d890179e9735e6836efeda7 |
| MD5 | 273723b9a88e7f8c4098f9a3fb3409dc |
| BLAKE2b-256 | ba5d6e17f5af752828cbd50a213b86ece91e31950b54306bbd7368e029845518 |

Provenance

The following attestation bundles were made for chunkator-0.0.5-py3-none-any.whl:

Publisher: workflow-release.yml on sahillihas/chunkator

Attestations: Values shown here reflect the state when the release was signed and may no longer be current.
