A library for splitting text into sentences

Chunkator

Welcome to chunkator, a Python library for efficient and precise sentence segmentation. It is intended as a robust alternative to the sentence splitters in NLTK, LangChain, and LlamaIndex. With customizable handling of complex text structures, chunkator targets inputs where general-purpose splitters often fail, such as abbreviations, URLs, email addresses, and ellipses.

Features

  • High Accuracy: Handles abbreviations, acronyms, websites, and edge cases like "Ph.D." without breaking sentences incorrectly.
  • Regex-Driven: Precompiled regex patterns for faster processing.
  • Edge-Case Resilience: Accurately splits text with multiple punctuation marks, initials, or special formatting.
  • Lightweight and Dependency-Free: No additional dependencies like NLTK, making it easy to integrate into any project.

Why chunkator?

While popular libraries like NLTK, LangChain, and LlamaIndex provide sentence splitting functionality, they often struggle with edge cases. Here's why chunkator stands out:

1. Handling Abbreviations

chunkator processes abbreviations like "Dr.", "Mr.", and "Ph.D." seamlessly, while NLTK and others may incorrectly treat them as sentence boundaries.

Example:

Input:

Dr. Smith is a leading scientist. He earned his Ph.D. in Physics.
  • NLTK Output:
    • ['Dr.', 'Smith is a leading scientist.', 'He earned his Ph.D.', 'in Physics.']
  • chunkator Output:
    • ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
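One common way to get this behaviour is to mask the periods inside known abbreviations before looking for sentence boundaries, then restore them afterwards. The sketch below illustrates that idea only; the abbreviation list, the placeholder character, and the function name `split_protecting_abbreviations` are assumptions made for illustration and are not taken from chunkator's source.

```python
import re

# Hypothetical abbreviation list for illustration; not chunkator's actual list.
ABBREVIATIONS = ("Dr.", "Mr.", "Mrs.", "Ms.", "Prof.", "Ph.D.")

# Placeholder for a protected period; assumed never to occur in the input.
DOT = "\x00"

# Compiled once at import time: matches any listed abbreviation.
_ABBREV_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(a) for a in ABBREVIATIONS) + r")"
)
# A boundary is a run of whitespace preceded by '.', '!' or '?'.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def split_protecting_abbreviations(text: str) -> list[str]:
    # 1. Hide the periods that belong to abbreviations.
    masked = _ABBREV_RE.sub(lambda m: m.group(0).replace(".", DOT), text)
    # 2. Split on the remaining sentence-ending punctuation.
    parts = _BOUNDARY_RE.split(masked)
    # 3. Put the hidden periods back.
    return [p.replace(DOT, ".") for p in parts if p]


print(split_protecting_abbreviations(
    "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics."
))
# ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.']
```

A known limitation of this simple approach: a sentence that ends with an abbreviation (for example "He earned his Ph.D. Then he left.") will not be split, because the final period is masked too.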

2. Websites and Emails

Common splitters often break sentences when encountering URLs or email addresses.

Example:

Input:

Visit our website at www.example.com. Contact us at support@example.com.
  • LangChain Output:
    • ['Visit our website at www.example.', 'com.', 'Contact us at support@example.', 'com.']
  • chunkator Output:
    • ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
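The dots inside a URL or email address are never followed by whitespace, so a splitter that treats a period as a boundary only when whitespace follows it leaves such tokens intact. The following is a minimal sketch of that rule; the function name `split_on_whitespace_boundaries` is hypothetical, and this is not necessarily how chunkator implements it.

```python
import re

# Split only where sentence-ending punctuation is followed by whitespace.
# Dots inside "www.example.com" or "support@example.com" are followed by
# letters, so they never match.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def split_on_whitespace_boundaries(text: str) -> list[str]:
    return [s for s in _BOUNDARY_RE.split(text.strip()) if s]


print(split_on_whitespace_boundaries(
    "Visit our website at www.example.com. Contact us at support@example.com."
))
# ['Visit our website at www.example.com.', 'Contact us at support@example.com.']
```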

3. Multi-Dot Handling

chunkator correctly handles ellipses and other multi-dot patterns.

Example:

Input:

She hesitated... but eventually agreed. It was unexpected...
  • LlamaIndex Output:
    • ['She hesitated.', '.', '.', 'but eventually agreed.', 'It was unexpected.', '.', '.', '.']
  • chunkator Output:
    • ['She hesitated... but eventually agreed.', 'It was unexpected...']
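One heuristic that produces this result is to split only when the punctuation is followed by whitespace and the next word starts with an uppercase letter, so an ellipsis that continues in lowercase stays inside its sentence. The sketch below shows that heuristic under those assumptions; the function name `split_keeping_ellipses` is hypothetical and the rule is not claimed to be chunkator's own.

```python
import re

# Boundary: punctuation, then whitespace, then an uppercase ASCII letter.
# "hesitated... but" does not match because "b" is lowercase.
_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")


def split_keeping_ellipses(text: str) -> list[str]:
    return [s for s in _BOUNDARY_RE.split(text.strip()) if s]


print(split_keeping_ellipses(
    "She hesitated... but eventually agreed. It was unexpected..."
))
# ['She hesitated... but eventually agreed.', 'It was unexpected...']
```

The trade-off is that sentences beginning with a digit, a quotation mark, or a lowercase word are not detected as new sentences by this rule.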

4. Efficiency

chunkator is optimized for performance, especially on large documents. Its regex patterns are compiled once and reused, which makes it faster than NLTK, whose tokenizers can be slower on very large inputs.


Installation

Install chunkator via pip:

pip install chunkator

Usage

Here's how to use the chunkator library in your projects:

from sentence_split import sentence_split

# Input text
text = "Dr. Smith is a leading scientist. He earned his Ph.D. in Physics. Visit www.example.com for more info."

# Split into sentences
sentences = sentence_split(text)

# Print the result
print(sentences)
# Output: ['Dr. Smith is a leading scientist.', 'He earned his Ph.D. in Physics.', 'Visit www.example.com for more info.']

Advanced Use Cases

Custom Text Processing

chunkator can be extended to handle custom patterns or rules. Modify the regex patterns in the library to suit your specific needs.
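The library's internal pattern names are not documented here, so the following is only a sketch of what a customizable rule set could look like: a factory that builds a splitter from a default abbreviation list plus user-supplied extras. `make_splitter`, `DEFAULT_ABBREVIATIONS`, and the example abbreviations are all hypothetical and do not refer to chunkator's actual API.

```python
import re
from typing import Callable, Iterable

# Hypothetical defaults for illustration; not chunkator's actual list.
DEFAULT_ABBREVIATIONS = ("Dr.", "Mr.", "Ph.D.")


def make_splitter(extra_abbreviations: Iterable[str] = ()) -> Callable[[str], list[str]]:
    # Longest first, so a longer abbreviation wins over a shorter prefix.
    abbreviations = sorted(
        set(DEFAULT_ABBREVIATIONS) | set(extra_abbreviations),
        key=len,
        reverse=True,
    )
    # Both patterns are compiled once per splitter and reused on every call.
    protect = re.compile(
        r"\b(?:" + "|".join(re.escape(a) for a in abbreviations) + r")"
    )
    boundary = re.compile(r"(?<=[.!?])\s+")

    def split(text: str) -> list[str]:
        # Mask abbreviation periods, split, then restore them.
        masked = protect.sub(lambda m: m.group(0).replace(".", "\x00"), text)
        return [s.replace("\x00", ".") for s in boundary.split(masked) if s]

    return split


# Usage: a splitter extended with domain-specific abbreviations.
split_legal = make_splitter(["Art.", "No."])
print(split_legal("See Art. 5 of the treaty. It refers to No. 12."))
# ['See Art. 5 of the treaty.', 'It refers to No. 12.']
```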


Benchmarking

Library    | Handles Abbreviations | Handles Websites | Handles Ellipses | Speed (ms for 1000 sentences)
NLTK       | No                    | No               | Partial          | 120
LangChain  | Partial               | No               | No               | 150
LlamaIndex | No                    | Partial          | No               | 130
chunkator  | Yes                   | Yes              | Yes              | 90
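The benchmark setup behind these figures is not described, so the numbers cannot be reproduced from this page alone. As a rough guide, a per-1000-sentences timing could be measured along the following lines. The sketch uses a stand-in `naive_split` function and a synthetic input, both assumptions; swap in whichever splitter you want to measure.

```python
import re
import time

_BOUNDARY_RE = re.compile(r"(?<=[.!?])\s+")


def naive_split(text):
    # Stand-in splitter used only to exercise the timing harness.
    return [s for s in _BOUNDARY_RE.split(text) if s]


def time_splitter(splitter, text, repeats=5):
    """Return the best wall-clock time in milliseconds over `repeats` runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        splitter(text)
        best = min(best, time.perf_counter() - start)
    return best * 1000.0


# Synthetic input: 1000 short sentences joined by single spaces.
sample = " ".join(f"This is sentence number {i}." for i in range(1000))

elapsed_ms = time_splitter(naive_split, sample)
print(f"naive_split: {elapsed_ms:.3f} ms for {len(naive_split(sample))} sentences")
```

Taking the minimum over several runs reduces noise from other processes; results will still vary with hardware, Python version, and the text being split.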

Contributing

We welcome contributions! Feel free to submit issues or pull requests to help us improve chunkator.

